Here are the steps to install Python 3 in a Docker container.

1. First download the appropriate Docker image:
docker pull centos

2. Start the docker container:

docker run -it --name boto3-centos centos

The --name parameter ensures that the container gets the name boto3-centos. You can name it whatever you find reasonable.

3. Install the GCC compiler and the other dependencies needed to build Python:

yum install gcc openssl-devel bzip2-devel libffi-devel zlib-devel -y

4. Download Python 3.7.x:

curl --output Python-3.7.9.tgz https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz

5. Extract the tgz file

tar xzf Python-3.7.9.tgz

6. Install Python

cd Python-3.7.9
./configure --enable-optimizations
yum install make -y
make altinstall

7. Verify the version of…

As I mentioned in part 2 of this story, the async-parallel version is not really async. The speedup comes mostly from using the multithreading facilities of asyncio (basically invoking your blocking sync call via loop.run_in_executor(...)) rather than from actually running async tasks concurrently. So the search was on for a truly async-parallel solution. This is a feature requested by quite a few people, but mostly misunderstood in various forums. The general response to the async-parallel question is:

Why do you need to parallelize async code? doesn’t it already give you the speedup you need with simply running things in the event loop? …
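For reference, the run_in_executor pattern described above can be sketched like this — a minimal illustration where blocking_io is a hypothetical stand-in for a blocking call such as an S3 copy via boto3:

```python
import asyncio
import time

def blocking_io(task_id):
    # Stand-in for a blocking sync call (e.g. an S3 copy via boto3)
    time.sleep(0.1)
    return task_id

async def main():
    loop = asyncio.get_running_loop()
    # Each blocking call is handed to the default thread-pool executor,
    # so the speedup here really comes from multithreading, not from
    # concurrently running async tasks
    futures = [loop.run_in_executor(None, blocking_io, i) for i in range(5)]
    return await asyncio.gather(*futures)

results = asyncio.run(main())
print(results)  # [0, 1, 2, 3, 4]
```

The five 0.1-second sleeps overlap on the thread pool instead of running back to back, which is exactly the "not really async" speedup described above.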

In Part 1 of this story, I shared how I managed to get a nice speedup of S3 processing (with a running example of S3 copy) using multiple threads and fine-tuning the level of parallelism. However, can we do better?

In terms of tuning code for processing datasets, here are some rules of thumb I find useful:

  1. If the code is CPU intensive, but the data can be processed in parallel, then it is a good candidate for processing with multiple threads to get the necessary speedup. This helps with task parallelism.
  2. If the code is IO intensive, but the data can be accessed in parallel, either from disk or over the network, then it is a good candidate for async code. This makes efficient use of precious CPU resources by allowing the code to yield control of the CPU until the IO call completes. This helps with task…
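The thread-pool pattern behind rule 1 can be sketched with the stdlib concurrent.futures module. This is only an illustration of farming data chunks out to workers; process_chunk is a hypothetical stand-in for real per-chunk work:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for per-chunk work (e.g. one S3 copy per chunk)
    return sum(chunk)

# Split the dataset into 10 chunks of 100 items each
data = [list(range(i, i + 100)) for i in range(0, 1000, 100)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves the order of the input chunks
    results = list(pool.map(process_chunk, data))

print(results[:3])  # [4950, 14950, 24950]
```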

AWS S3 is the go-to cloud-based object storage widely used in the industry today. But as you become a power user of S3, some of the standard ways of copying, deleting, and listing objects show their limitations.

For example: how does one copy objects from one S3 prefix to another? Here is sample code using Python and boto3:

import boto3

s3_client = boto3.client('s3')
# Bucket objects come from the resource API, not the client API
dest_bucket = boto3.resource('s3').Bucket('dest_bucket')

kwargs = {'Bucket': 'src_bucket', 'Prefix': 'src_prefix'}
while True:
    objects_response = s3_client.list_objects_v2(**kwargs)
    # Check if the objects response really has some objects in it
    for obj in objects_response.get('Contents', []):
        object_key = obj['Key']
        copy_source = {
            'Bucket': 'src_bucket',
            'Key': object_key
        }
        # Copy to the same key under the destination bucket
        dest_bucket.copy(copy_source, object_key)
    # Keep paginating while more listing results remain
    if objects_response.get('IsTruncated'):
        kwargs['ContinuationToken'] = objects_response['NextContinuationToken']
    else:
        break

As a software engineer and a semi-regular computer science guy, I like to take the various software challenges that get presented at work and dig a little deeper. One such issue was exploring context-free grammars (CFGs). We sometimes need to express a CFG and then use a ready-made parsing library to check whether a given string is accepted by that CFG. One of the well-known parsers for CFG parsing is the Earley parser. …

I happened to research how arbitrary boolean expressions can be parsed at runtime to compute true/false values. Since I wanted to implement this in Scala, I did some research and found an interesting nugget on StackOverflow. It was a way of idiomatically expressing the grammar in Scala; you then used the generated parser to parse the expression, and finally you had to hand-write the expression evaluation logic yourself. I was not comfortable with how the DSL was expressed and with the use of custom Parsers. See the code snippet from that SO post:

object ConditionParser extends JavaTokenParsers with PackratParsers {

val booleanOperator : PackratParser[String] = literal("||") | literal("&&")
val comparisonOperator : PackratParser[String] = literal("<=") | literal(">=") | literal("==") | literal("!=") | literal("<") | literal(">")
val constant : PackratParser[Constant] = floatingPointNumber.^^ { x => Constant(x.toDouble) }
val comparison : PackratParser[Comparison] = (comparisonOperator ~ constant) ^^ { case op ~ rhs => Comparison(op, rhs) }
lazy val p1 : PackratParser[BooleanExpression] = booleanOperation | comparison
val booleanOperation = (p1 ~ booleanOperator ~ p1) ^^ { case lhs ~ op ~ rhs => BooleanOperation(op, lhs, rhs) }…
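As an aside, the same idea — parse a boolean/comparison expression at runtime and evaluate it against variable bindings — can be sketched in Python with the stdlib ast module (note Python's grammar uses and/or rather than the &&/|| literals of the Scala snippet). This is my own illustration, not the approach from the SO post; eval_bool is a hypothetical helper:

```python
import ast
import operator

# Map AST comparison nodes to their Python operators
OPS = {
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}

def eval_bool(expr, env):
    """Safely evaluate a boolean expression over a dict of bindings."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BoolOp):
            vals = [walk(v) for v in node.values]
            return all(vals) if isinstance(node.op, ast.And) else any(vals)
        if isinstance(node, ast.Compare):
            left = walk(node.left)
            for op, comp in zip(node.ops, node.comparators):
                if not OPS[type(op)](left, walk(comp)):
                    return False
                left = walk(comp)
            return True
        if isinstance(node, ast.Name):
            return env[node.id]
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError(f"unsupported expression node: {node!r}")
    return walk(ast.parse(expr, mode="eval"))

print(eval_bool("x > 1 and y <= 5", {"x": 2, "y": 5}))  # True
```

Walking the AST directly sidesteps both the custom parser definition and the hand-written evaluation logic, at the cost of being tied to Python's expression syntax.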

In part 1, I discussed the use case of assigning a session id to closely occurring events generated by a particular user's activity. We discussed the various options available out of the box in Spark, and why there is a gotcha with each option. In this post, I will discuss a more robust solution which satisfies our use case.

For a good intro to the difference between aggregation and windowing in Spark, see here. It was becoming clear that a custom window function was needed to process the user events. Amazingly, this very interesting blog solves this exact problem! The author sets up a custom window function using a Catalyst expression. This is very cool and probably the appropriate thing to do. …

I have been looking at all kinds of interesting use cases for processing event data. Then I recently came upon this interesting problem:

Let's say you have a series of events emitted as user clicks from a web application. Simplistically, the event object would look like this:

"userid": "u1",
"emit-time": "2018-01-22 13:04:35",
"action": "get",
"path": "/orders"

Basically, this event represents a user clicking a link to see their orders. Let's assume the user interacts with the web service on a continual basis and we get an event stream which records their activity. Typically, as long as the user keeps interacting with the web service, the user session keeps getting extended. Let's say the session expires if there is no user interaction for more than 10 hours (the session timeout). If the user resumes activity after the session has timed out, a new session is created for the user. Our aim is to annotate the event data with such session data. The logic is to label each group of user activity that occurred within 10 hours of each other with a unique session id. This then allows us to extract more useful information about the user's usage trends. …
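Before reaching for Spark, the sessionization logic itself can be sketched in plain Python. This is a minimal single-machine illustration of the labeling rule (a gap longer than the timeout starts a new session), not the Spark solution this series builds; assign_sessions and the "uid-N" session-id scheme are my own inventions:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(hours=10)

def assign_sessions(events):
    """Label each user's events with a session id; a gap longer than
    the timeout between consecutive events starts a new session."""
    # String timestamps in this format sort chronologically
    ordered = sorted(events, key=lambda e: (e["userid"], e["emit-time"]))
    labeled = []
    last_time = {}   # per-user timestamp of the previous event
    session_no = {}  # per-user running session counter
    for e in ordered:
        uid = e["userid"]
        t = datetime.strptime(e["emit-time"], "%Y-%m-%d %H:%M:%S")
        if uid not in last_time or t - last_time[uid] > SESSION_TIMEOUT:
            session_no[uid] = session_no.get(uid, 0) + 1
        last_time[uid] = t
        labeled.append({**e, "session_id": f"{uid}-{session_no[uid]}"})
    return labeled

evts = [
    {"userid": "u1", "emit-time": "2018-01-22 13:04:35", "action": "get", "path": "/orders"},
    {"userid": "u1", "emit-time": "2018-01-22 14:00:00", "action": "get", "path": "/orders"},
    {"userid": "u1", "emit-time": "2018-01-23 01:30:00", "action": "get", "path": "/cart"},
]
for e in assign_sessions(evts):
    print(e["emit-time"], e["session_id"])
```

The first two events are under 10 hours apart and share session u1-1; the third arrives 11.5 hours later and opens session u1-2. The catch, of course, is that this sequential scan over per-user history is exactly what is awkward to express with Spark's out-of-the-box aggregation and windowing.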

I have used Spark on an ad hoc basis for work-related things. It is great fun to write powerful data processing flows in Spark. However, I wanted to challenge myself more. Here is an interesting way to do that: try writing some of the common coding-interview puzzle questions in Spark code. It is fun and exercises your Spark muscles nicely. After I completed the setup of my IDE for local Spark development, I went hunting for a reasonably easy coding puzzle which could be solved using Spark's parallelization.

The problem

I found a relatively easy coding puzzle on HackerRank: the Flatland Space Stations problem. Quoting from the problem…
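As a baseline before the Spark version, the puzzle (cities 0..n-1 on a line, some of which have space stations; report the maximum over all cities of the distance to the nearest station) can be solved in a few lines of plain Python. This is my own sketch, not the Spark code this post develops:

```python
def max_min_distance(n, stations):
    """Max over all cities of the distance to the nearest space station."""
    stations = sorted(stations)
    best = 0
    j = 0  # index of the station currently nearest to `city`
    for city in range(n):
        # Advance j while the next station is at least as close
        while (j + 1 < len(stations)
               and abs(stations[j + 1] - city) <= abs(stations[j] - city)):
            j += 1
        best = max(best, abs(stations[j] - city))
    return best

print(max_min_distance(5, [0, 4]))        # 2 (city 2 is 2 away from both)
print(max_min_distance(6, [0, 1, 2, 4]))  # 1
```

The two-pointer sweep is O(n + m log m); the per-city "distance to nearest station" computation is also exactly the piece that parallelizes naturally, which is what makes this puzzle a nice fit for Spark.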


lazy coder

A lazy coder who works hard to build good software that does not page you in the middle of the night!
