Speeding up S3 processing — Part 2

In Part 1 of this story, I shared how I got a nice speedup of S3 processing (with S3 copy as the running example) by using multiple threads and fine-tuning the level of parallelism. But can we do better?

When tuning code that processes datasets, here are some rules of thumb I find useful:

  1. If the code is CPU intensive but the data can be processed in parallel, it is a good candidate for processing with multiple threads to get the necessary speedup. This helps with task parallelism. (In CPython, the GIL limits how much pure-Python CPU-bound code gains from threads; multiple processes are the usual alternative there.)
  2. If the code is IO intensive but the data can be accessed in parallel, either from disk or over the network, it is a good candidate for async code. Async code makes efficient use of precious CPU resources by yielding control of the CPU until the IO call completes. This helps with task concurrency.

The two concepts are thus orthogonal, not mutually exclusive. Combining both techniques will not be covered here, though; this post focuses only on making our code async.
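To make the second rule of thumb concrete, here is a minimal sketch of what yielding control during IO buys you. It uses only asyncio; `fake_io_call` is a hypothetical stand-in for a network call (it just sleeps), not anything from boto3.

```python
import asyncio
import time

async def fake_io_call(i, delay=0.05):
    # Hypothetical stand-in for a network call; asyncio.sleep yields
    # control back to the event loop while "waiting on IO".
    await asyncio.sleep(delay)
    return i

async def run_sequential(n):
    # Each call is awaited before the next one starts.
    return [await fake_io_call(i) for i in range(n)]

async def run_concurrent(n):
    # All calls are scheduled on the same event loop and wait concurrently.
    return await asyncio.gather(*(fake_io_call(i) for i in range(n)))

def timed(coro_fn, n):
    # Run a coroutine function to completion and report elapsed wall time.
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start
```

Calling `timed(run_sequential, 10)` and `timed(run_concurrent, 10)` should return the same results, with the concurrent version finishing in roughly the time of a single call rather than ten.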

So I went looking for how to use boto3 in an async manner. The guidance posted in this blog post is not the way to do it :-). The speedup achieved there, despite the use of the async keyword, comes from the multithreaded nature of the program and not from any async Python feature.

Moving on, I found that there is an async version of boto3 available! It’s called aioboto3. This looked quite promising, so I coded up an async version of the S3 copy.

It looks something like this:

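import aioboto3

async def copy_object_async(src_bucket, src_prefix, dest_bucket, dest_prefix, s3_key):
    # Drop everything up to the first '/' and re-root the rest under dest_prefix
    s3_filename = s3_key[(s3_key.index('/') + 1):]
    dest_key = f"{dest_prefix}/{s3_filename}"
    # Newer aioboto3 releases create the client via aioboto3.Session().client("s3")
    async with aioboto3.client("s3") as s3_async:
        # Key is required: it names the destination object
        await s3_async.copy_object(Bucket=dest_bucket, Key=dest_key,
                                   CopySource={'Bucket': src_bucket, 'Key': s3_key})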
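The key handling in that function is easy to misread, so here is a small sketch of just that step. The helper name `make_dest_key` and the sample keys are hypothetical, used only for illustration.

```python
def make_dest_key(dest_prefix, s3_key):
    # Keep everything after the FIRST '/' in the source key,
    # then place it under the destination prefix.
    s3_filename = s3_key[(s3_key.index('/') + 1):]
    return f"{dest_prefix}/{s3_filename}"

# Only the first path segment is replaced; nested segments are kept.
example = make_dest_key("archive", "incoming/2020/data.csv")
```

Note that `str.index` raises `ValueError` when the key contains no `/`, so a key at the bucket root would need separate handling.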

And disaster! The Lambda function blew out in terms of both memory and time. With a 5-minute timeout and a memory allocation of 128 MB, both limits were easily hit and the Lambda failed with a timeout error.

The issue seems to be that the memory overhead of 2000 tasks dominates the memory usage of the system, so 128 MB is not enough headroom for managing so many tasks.
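One common way to keep the number of active calls in check is to gate them with a semaphore. The sketch below is not what I ran in the Lambda, and I have not measured whether it would fit in 128 MB; `fake_copy`, `copy_all` and `max_in_flight` are hypothetical names, and the S3 call is replaced by a short sleep so the example runs without network access.

```python
import asyncio

async def fake_copy(key):
    # Hypothetical stand-in for an awaited S3 copy call (no network access).
    await asyncio.sleep(0.001)
    return key

async def copy_all(keys, max_in_flight=50):
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def bounded_copy(key):
        nonlocal in_flight, peak
        # At most max_in_flight coroutines get past this point at once.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_copy(key)
            finally:
                in_flight -= 1

    # One task per key is still created; only the active calls are bounded.
    results = await asyncio.gather(*(bounded_copy(k) for k in keys))
    return results, peak
```

Running `asyncio.run(copy_all(keys, max_in_flight=20))` returns the results in input order together with the peak number of simultaneously active calls, which never exceeds the limit.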

I decided to max out the memory allocated to the Lambda at 3008 MB. With this I finally got a breakthrough! The async tasks completed in less time than the serial sync calls to S3 copy. The following table compares the various strategies:

| strategy | runtime (secs) | memory (MB) |
| --- | --- | --- |
| serial | 154.9 | 902 |
| parallel | 24.2 | 902 |
| async | 52.4 | 903 |
| async-parallel | 23.9 | 904 |
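To put the runtimes in relative terms, here is a quick calculation of each strategy's speedup over the serial run, using only the numbers from the table:

```python
# Runtimes in seconds, copied from the table above.
runtimes = {
    "serial": 154.9,
    "parallel": 24.2,
    "async": 52.4,
    "async-parallel": 23.9,
}

# Speedup = serial runtime / strategy runtime, rounded to 2 decimals.
speedups = {
    name: round(runtimes["serial"] / secs, 2)
    for name, secs in runtimes.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

So the two thread-based strategies come out at roughly 6.4x to 6.5x faster than serial, while the pure async strategy is just under 3x.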

These are indicative numbers, and there is some variation across repeated runs of this test. But the table stands as a good approximation of the performance of the various strategies. Let us discuss the differences between them:

  1. Serial: This uses the standard boto3 sync API to run the S3 copy calls over an enumeration of the S3 objects. The calls run in sequence: the copy of the next S3 object begins only after the copy of the previous one has completed. The cost of running 2000 such calls synchronously is reflected in the runtime of approx 155 secs, the longest among all the strategies.
  2. Parallel: This strategy uses the same boto3 sync API but parallelizes the calls across multiple threads. The number of threads is set to cpu_count+4, which gave the best results. At 24 secs this has one of the best runtimes, without being any worse memory-wise.
  3. Async: In this strategy, we use the ‘asyncified’ version of boto3 (aioboto3). This still runs as a single-threaded application, with all the async calls to S3 placed on a single asyncio event loop. While the runtime of 52.4 seconds is much better than the serial strategy, it is still roughly double that of the parallel strategy. The asyncio APIs can boost performance, especially for single-threaded applications.
  4. Async-parallel: This is not what you think it is. It runs sync code by submitting function objects to an asyncio facility, loop.run_in_executor. As can be imagined, this is no different from using the ThreadPoolExecutor facility available in Python's concurrent.futures module. It's not clear to me why asyncio makes this facility available when the one in concurrent.futures seems perfectly adequate. Feel free to leave some comments if you know why. As expected, the performance of this strategy closely resembles that of the parallel strategy: 23.9 secs vis-a-vis 24.2 secs. The difference is within the run-to-run variation, so we can conclude they have the same performance.

So in that sense async-parallel is not what it really sounds like. Rather it is running normal blocking (non-async) code in multiple threads through the asyncio facility. Can we truly have multiple asyncio event loops? The answer is yes, and that will be the topic of another blog post.
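For reference, here is a minimal sketch of the parallel and async-parallel strategies side by side. It is not the code I benchmarked: `blocking_copy` is a hypothetical stand-in for a blocking boto3 call (it just sleeps), and `run_parallel`, `run_async_parallel` and `WORKERS` are names made up for this example.

```python
import asyncio
import os
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_copy(key):
    # Hypothetical stand-in for a blocking boto3 call; it only sleeps.
    time.sleep(0.01)
    return key.upper()

def run_parallel(keys, workers):
    # "Parallel": plain concurrent.futures thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(blocking_copy, keys))

async def run_async_parallel(keys, workers):
    # "Async-parallel": the same blocking function and the same kind of
    # thread pool, but submitted through the event loop.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [loop.run_in_executor(pool, blocking_copy, k) for k in keys]
        return await asyncio.gather(*futures)

# Thread count used in this post: cpu_count + 4.
WORKERS = (os.cpu_count() or 1) + 4
```

Both functions hand the same blocking work to the same kind of thread pool, which is consistent with the near-identical timings. One practical difference is that run_in_executor returns an awaitable, so the blocking calls can be awaited alongside other coroutines on the same loop.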

Hope you found this blog post interesting. I would love to hear your comments on this topic.
