As I mentioned in part 2 of this story, the async-parallel version is actually not really async. The speedup comes more from using the multithreading facilities of asyncio (basically invoking your blocking sync call via loop.run_in_executor(...)) than from actually running async tasks concurrently. So the search was on for a truly async-parallel solution. This is a feature requested by quite a few people, but mostly misunderstood in various forums. The general response to the problem of async-parallel is:
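As a quick refresher, the run_in_executor pattern from part 2 looks roughly like this (a minimal sketch; blocking_copy is a stand-in for a blocking sync call like boto3's copy, not the actual code from that part):

```python
import asyncio
import concurrent.futures
import time

def blocking_copy(item):
    # stand-in for a blocking sync call, e.g. a boto3 copy_object
    time.sleep(0.1)
    return item

async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        # each call runs in a worker thread; the event loop just awaits the futures
        futures = [loop.run_in_executor(executor, blocking_copy, i) for i in range(8)]
        return await asyncio.gather(*futures)

print(asyncio.run(main()))  # prints [0, 1, 2, 3, 4, 5, 6, 7]
```

The event loop here is only a coordinator: the real work happens in the thread pool, which is why this is multithreading wearing an asyncio costume rather than true async concurrency.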

Why do you need to parallelize async code? Doesn't it already give you the speedup you need by simply running things in the event loop?

And the answer is an emphatic NO!

The internet is littered with posts that confuse concurrency and parallelism. What is worse, they suggest either multithreading OR asyncio as options for speedup, but NOT both. So I was lucky to find the following Stack Overflow post, which finally had the answers I was looking for. So without further ado, here is the solution I set up to be truly async-parallel (inspired by this SO answer):

async def run_async(s3_objects, src_bucket, src_prefix, dest_bucket, dest_prefix):
    tl_loop = None
    try:
        tl_loop = asyncio.get_running_loop()
    except RuntimeError:
        print("Got runtime error, ignoring...")
    print(f"{threading.get_ident()}: got running loop: {str(tl_loop)}")
    if not tl_loop:
        tl_loop = asyncio.new_event_loop()
        print(f"{threading.get_ident()}: made new running loop: {tl_loop}")
    tasks = []
    for s3_key in s3_objects:
        tasks.append(tl_loop.create_task(copy_object_async(
            src_bucket=src_bucket, src_prefix=src_prefix,
            dest_bucket=dest_bucket, dest_prefix=dest_prefix,
            s3_key=s3_key)))
    await asyncio.gather(*tasks)

def run_sync(s3_objects, src_bucket, src_prefix, dest_bucket, dest_prefix):
    asyncio.run(run_async(s3_objects, src_bucket, src_prefix, dest_bucket, dest_prefix))

async def copy_async_parallel2(src_bucket, src_prefix, dest_bucket, dest_prefix, num_items):
    print("running in async with threads mode v2")
    loop = asyncio.get_running_loop()
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallelism) as executor:
        for s3_objects in s3_lister(src_bucket, src_prefix, page_size=50, num_items=num_items):
            futures.append(loop.run_in_executor(executor, run_sync, s3_objects,
                                                src_bucket, src_prefix, dest_bucket, dest_prefix))
        print(f"before awaiting {len(futures)} futures...")
        await asyncio.gather(*futures)
        print(f"after awaiting {len(futures)} futures...")

Basically I take batches of S3 objects and submit them to a thread pool, which runs the run_sync method with each batch. run_sync in turn adds each of the S3 objects as async tasks on the event loop of its executor thread. So now you have batches of async copy tasks running in parallel threads. We got truly async-parallel!
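Stripped of the S3 specifics, the loop-per-thread pattern boils down to the sketch below (do_work is a hypothetical coroutine standing in for copy_object_async, and the batches are just lists of numbers):

```python
import asyncio
import concurrent.futures

async def do_work(item):
    # stand-in for a real async task such as copy_object_async
    await asyncio.sleep(0.05)
    return item

async def run_async(batch):
    # runs inside a worker thread's own event loop;
    # every item in the batch makes progress concurrently
    return await asyncio.gather(*(do_work(i) for i in batch))

def run_sync(batch):
    # entry point for the thread pool: start a fresh event loop in this thread
    return asyncio.run(run_async(batch))

async def main(batches):
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        # one future per batch; each batch gets its own thread and event loop
        futures = [loop.run_in_executor(executor, run_sync, b) for b in batches]
        return await asyncio.gather(*futures)

print(asyncio.run(main([[1, 2], [3, 4], [5, 6]])))  # [[1, 2], [3, 4], [5, 6]]
```

The key trick is that asyncio.run inside run_sync is safe in a worker thread because no event loop is running there yet, so each thread ends up driving its own loop full of concurrent tasks.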

So how was the performance? Promising, but not class leading. Over various runs, I got a runtime of 38.2 seconds and a memory usage of 260 MB. While this betters the pure async mode time of 52.4 secs, it is the drop in memory usage that is eye-popping: from about 903 MB down to 260 MB, which is an awesome improvement! Here is the table again (the truly async-parallel mode is creatively named async-parallel2):

| strategy | runtime (secs) | memory (MB) |
|---|---|---|
| serial | 154.9 | 902 |
| parallel | 24.2 | 902 |
| async | 52.4 | 903 |
| async-parallel | 23.9 | 904 |
| async-parallel2 | 38.2 | 260 |

I will leave the memory profiling for some other day, but suffice it to say that there are a lot of new things to discover and understand when doing performance tuning for these tasks.
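For anyone who wants to collect similar numbers, one simple in-process approach is time.perf_counter plus the stdlib tracemalloc module (this is just a sketch with a toy workload; my figures above came from the actual copy runs, and tracemalloc only tracks Python-level allocations, so it can undercount memory held by C extensions like boto3's networking stack):

```python
import time
import tracemalloc

def measure(fn, *args):
    # returns (result, elapsed seconds, peak traced memory in MB)
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / (1024 * 1024)

# toy workload: allocate a thousand 1 KB buffers
result, secs, peak_mb = measure(lambda: [bytes(1024) for _ in range(1000)])
print(f"took {secs:.3f}s, peak {peak_mb:.2f} MB")
```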

If you are interested in the full code for my experiment, you will find it here:

I look forward to some feedback from you on what other nuances you have seen with regards to concurrency and parallelism in Python.

A lazy coder who works hard to build good software that does not page you in the middle of the night!