Here are the steps to install Python 3 in a Docker container.

1. Pull the CentOS base image:

docker pull centos

2. Start the Docker container:

docker run -it --name boto3-centos centos

The --name flag gives the container the name boto3-centos; you can name it whatever you find reasonable.

3. Install the GCC compiler and other build dependencies for Python:

yum install gcc openssl-devel bzip2-devel libffi-devel zlib-devel -y

4. Download Python 3.7.x:

curl https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz --output Python-3.7.9.tgz

5. Extract the tgz file:

tar xzf Python-3.7.9.tgz

6. Build and install Python (make is needed before building):

yum install make -y
cd Python-3.7.9
./configure --enable-optimizations
make altinstall

7. Verify the version of…
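A minimal check, assuming the make altinstall above (which installs the interpreter as python3.7 rather than overwriting the system python):

python3.7 -V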


As I mentioned in part 2 of this story, the async-parallel version is not really async. The speedup comes mostly from using asyncio's multithreading facilities (basically, invoking your blocking sync call via loop.run_in_executor(...)) rather than from actually running async tasks concurrently. So the search was on for a truly async-parallel solution. This is a feature requested by quite a few people, but mostly misunderstood in various forums. The general response to the async-parallel problem is:

Why do you need to parallelize async code? Doesn't it already give you the speedup you…
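As a minimal sketch of the run_in_executor pattern described above (blocking_copy below is a hypothetical stand-in for whatever synchronous call you are wrapping):

import asyncio
import time

def blocking_copy(i: int) -> str:
    # hypothetical stand-in for a blocking synchronous call (e.g. a boto3 request)
    time.sleep(1)
    return f"copied object {i}"

async def main() -> None:
    loop = asyncio.get_running_loop()
    # each call runs on the default thread-pool executor, so the speedup
    # is thread-based parallelism, not truly concurrent async tasks
    futures = [loop.run_in_executor(None, blocking_copy, i) for i in range(5)]
    print(await asyncio.gather(*futures))

asyncio.run(main())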


In Part 1 of this story, I shared how I managed to get a nice speedup of S3 processing (with a running example of S3 copy) using multiple threads and fine-tuning the level of parallelism. However, can we do better?
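For reference, a minimal sketch of that multithreaded copy, assuming hypothetical src_bucket/dst_bucket names and a ThreadPoolExecutor whose max_workers is the parallelism knob being tuned:

import boto3
from concurrent.futures import ThreadPoolExecutor

s3_client = boto3.client('s3')

def copy_key(key: str) -> None:
    # copy one object from the source prefix to the destination prefix
    s3_client.copy_object(
        Bucket='dst_bucket',
        Key=key.replace('src_prefix', 'dst_prefix', 1),
        CopySource={'Bucket': 'src_bucket', 'Key': key},
    )

keys = ['src_prefix/a.csv', 'src_prefix/b.csv']  # in practice, gathered via list_objects_v2
with ThreadPoolExecutor(max_workers=32) as pool:  # max_workers = level of parallelism
    list(pool.map(copy_key, keys))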

In terms of tuning code for processing datasets, here are some rules of thumb I find useful:


AWS S3 is the go-to cloud-based object storage widely used in the industry today. But as you become a power user of S3, some of the standard ways of copying, deleting, and listing objects show their limitations.

For example: how does one copy objects from one S3 prefix to another? Here is sample code using Python and boto3:

import boto3

s3_client = boto3.client('s3')
kwargs = {'Bucket': 'src_bucket', 'Prefix': 'src_prefix'}
continuation_token = 'DUMMY_VAL'
while continuation_token:
    if continuation_token != 'DUMMY_VAL':
        kwargs['ContinuationToken'] = continuation_token
    objects_response = s3_client.list_objects_v2(**kwargs)…
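The excerpt is cut off above; here is a hedged sketch of how the full pagination-and-copy loop typically looks (the copy_prefix helper and its structure are mine, not necessarily the post's):

import boto3

s3_client = boto3.client('s3')

def copy_prefix(src_bucket, src_prefix, dst_bucket, dst_prefix):
    kwargs = {'Bucket': src_bucket, 'Prefix': src_prefix}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        for obj in response.get('Contents', []):
            # rewrite the key so it lands under the destination prefix
            dst_key = dst_prefix + obj['Key'][len(src_prefix):]
            s3_client.copy_object(
                Bucket=dst_bucket,
                Key=dst_key,
                CopySource={'Bucket': src_bucket, 'Key': obj['Key']},
            )
        # list_objects_v2 returns at most 1000 keys; keep paginating until done
        token = response.get('NextContinuationToken')
        if not token:
            break
        kwargs['ContinuationToken'] = token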


As a software engineer and a semi-regular computer science guy, I like to take the various software challenges that get presented at work and dig a little deeper. One such issue was exploring context-free grammars. We sometimes need to express a CFG and then use a ready-made parsing library to check whether a given string is accepted by that CFG. One of the well-known parsers for CFG parsing is the Earley parser. …
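As a quick way to experiment with this idea in Python, here is a sketch using the lark library, which implements an Earley parser (the toy balanced-parentheses grammar is my own example, not from the post):

from lark import Lark, LarkError

# a toy CFG for balanced parentheses; lark parses it with its Earley algorithm
grammar = """
    start: pair*
    pair: "(" pair* ")"
"""
parser = Lark(grammar, parser="earley")

def accepts(text: str) -> bool:
    # a string is in the language iff parsing succeeds
    try:
        parser.parse(text)
        return True
    except LarkError:
        return False

print(accepts("(()())"))  # True
print(accepts("(()"))     # False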


I happened to be researching how arbitrary boolean expressions can be parsed at runtime to compute true/false values. Since I wanted to implement this in Scala, I did some research and found this interesting nugget on StackOverflow. It was a way of idiomatically expressing the grammar in Scala, then using the generated parser to parse the expression; finally, you had to hand-write the expression evaluation logic yourself. I was not comfortable with how the DSL was expressed and with the use of custom Parsers. See the code snippet from that SO post:

object ConditionParser extends JavaTokenParsers…
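The Scala snippet is cut off above. To keep the examples in this digest in one language, here is a minimal Python sketch of the same underlying idea (parsing a boolean expression at runtime and evaluating it against named variables); it reuses Python's own ast module rather than the Scala parser-combinator approach from the SO post:

import ast

def eval_bool(expr: str, env: dict) -> bool:
    # parse the expression into an AST, then walk only the boolean node types
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BoolOp):
            vals = [ev(v) for v in node.values]
            return all(vals) if isinstance(node.op, ast.And) else any(vals)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not ev(node.operand)
        if isinstance(node, ast.Name):
            return bool(env[node.id])
        raise ValueError(f"unsupported syntax: {type(node).__name__}")
    return ev(ast.parse(expr, mode="eval"))

print(eval_bool("a and (b or not c)", {"a": True, "b": False, "c": True}))  # False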


In part 1, I discussed the use case of assigning session ids to closely occurring events generated by a particular user's activity. We discussed the various options available out of the box from Spark and the gotcha that comes with each option. Now, in this post, I will discuss a more robust solution which satisfies our use case.

For a good intro to the difference between aggregation and windowing in Spark, see here. It was becoming clear that a custom window function was needed to process the user events. Amazingly, this very interesting blog solves this exact problem…
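For context, a common sessionization sketch in PySpark uses lag plus a cumulative sum over a window; the 30-minute gap threshold and column names below are assumptions, and this is not necessarily the custom window function the post builds:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("sessionize").getOrCreate()
events = spark.createDataFrame(
    [("u1", "2018-01-22 13:04:35"), ("u1", "2018-01-22 13:10:00"),
     ("u1", "2018-01-22 14:30:00")],
    ["userid", "emit_time"],
).withColumn("ts", F.to_timestamp("emit_time"))

w = Window.partitionBy("userid").orderBy("ts")
gap = F.col("ts").cast("long") - F.lag(F.col("ts").cast("long")).over(w)
sessionized = (
    events
    # a gap of more than 30 minutes of inactivity starts a new session
    .withColumn("is_new_session", F.when(gap > 30 * 60, 1).otherwise(0))
    # cumulative sum of session boundaries yields a per-user session id
    .withColumn("session_id", F.sum("is_new_session").over(w))
)
sessionized.show()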


I have been looking at all kinds of interesting use cases for processing event data. Then I recently came upon this interesting problem:

Let's say you have a series of events emitted as user clicks from a web application. Simplistically, the event object would look like this:

{
    "userid": "u1",
    "emit-time": "2018-01-22 13:04:35",
    "action": "get",
    "path": "/orders"
}

Basically, this event represents a user clicking a link to see their orders. Let's assume the user interacts with the web service on a continual basis and we get an event stream which records their activity. Typically, if the user continues…


I have used Spark on an ad-hoc basis for work-related things. It is great fun to write powerful data processing flows in Spark. However, I wanted to challenge myself more. Here is an interesting way to do it: try writing some of the common coding-interview puzzle questions as Spark code. It is fun and exercises your Spark muscles nicely. After I completed the setup of my IDE to enable local Spark development, I went hunting for a fairly easy coding puzzle that could be solved using Spark's parallelization.

The problem

I found a relatively easy coding puzzle…

lazy coder

A lazy coder who works hard to build good software that does not page you in the middle of the night!
