In part 1 I discussed the use case of assigning a session id to closely occurring events generated by a particular user’s activity. We looked at the various options available out of the box in Spark and why each one comes with a gotcha. In this post, I will discuss a more robust solution that satisfies our use case.
For a good intro to the difference between aggregation and windowing in Spark, see here. It was becoming clear that a custom window function was needed to process the user events. Amazingly, this very interesting blog solves this exact problem! The author sets up a custom window function using Catalyst expressions. This is very cool and probably the appropriate thing to do. However, because it works at the level of Catalyst expressions, the code logic is quite non-obvious.
In the intro to window functions, the following line
For aggregate functions, users can use any existing aggregate function as a window function
gave some food for thought. The guidance on how to create a custom aggregate function is quite useful for building our own. A custom aggregate function is also far easier to decipher than one built on Catalyst expressions. So I gave it a shot. After a few iterations of building out a custom aggregate function, I am happy to report that it succeeded!
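To make the idea concrete, here is a minimal sketch in plain Python of how an aggregate can act as a window function for sessionization. It is not the Spark API and not the code from this post's repository: the class name `SessionAggregate`, the helper `assign_sessions`, the `user-N` session id format, and the 30-minute gap are all assumptions chosen for illustration. The aggregate keeps a small buffer (the last timestamp seen and a session counter), and it is evaluated after every row of a user's events in timestamp order, which mimics a window partitioned by user and ordered by time.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class SessionAggregate:
    """Hypothetical running aggregate; buffer = (last timestamp, session counter)."""

    def __init__(self, max_gap_seconds: int) -> None:
        self.max_gap_seconds = max_gap_seconds
        self.last_ts: Optional[int] = None
        self.session_no = 0

    def update(self, ts: int) -> None:
        # Start a new session on the first event, or when the gap since
        # the previous event exceeds the allowed maximum.
        if self.last_ts is None or ts - self.last_ts > self.max_gap_seconds:
            self.session_no += 1
        self.last_ts = ts

    def evaluate(self) -> int:
        # The current value of the aggregate is the running session number.
        return self.session_no


def assign_sessions(
    events: Iterable[Tuple[str, int]], max_gap_seconds: int = 1800
) -> List[Tuple[str, int, str]]:
    """Return (user, timestamp, session_id) rows, one per input event."""
    # "Partition by user": group the timestamps per user.
    by_user: Dict[str, List[int]] = {}
    for user, ts in events:
        by_user.setdefault(user, []).append(ts)

    rows: List[Tuple[str, int, str]] = []
    for user in sorted(by_user):
        agg = SessionAggregate(max_gap_seconds)
        # "Order by timestamp": feed rows in order and evaluate after each one.
        for ts in sorted(by_user[user]):
            agg.update(ts)
            rows.append((user, ts, f"{user}-{agg.evaluate()}"))
    return rows
```

In this sketch, two events from the same user that are at most `max_gap_seconds` apart share a session id, and a larger gap starts a new one. A Spark version would express the same buffer, update, and evaluate steps through its custom aggregate function interface and apply the result over a window.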
Here is the
The entire code repository is available here