It takes a village to scale machine learning pipelines!
Necessity and curiosity give rise to innovation. My two years of work at Riverus have been an embodiment of this observation. My last project allowed me to work on a piece of technology that I had wanted to explore for a long time, and I couldn’t have been happier that it worked just as I envisioned.
This article will give you a peek into how we managed to increase the speed of our pipeline and how we intend to provide value to our customers.
To give you a perspective on the time saved: if we were to run our pipeline sequentially on 100 files, it would’ve taken 10 hours, optimistically. With our new architecture, we brought it down to under 17 minutes, which is roughly a 35x speedup!
Why did we do it in the first place?
In his book Perennial Seller, Ryan Holiday rightly suggests asking “What for?” before beginning any task. In that spirit, we set out with a few goals in mind:
- Same architecture should be able to process batch and stream data
- Separate API for each data point – Modularity
- Robust exception handling – Fault tolerant
- Workload management – Efficient computation
- Helpful error reporting
These weren’t arbitrary features I envisioned for our pipeline architecture. They arose from the problems we faced on a day-to-day basis with our previous architecture.
We achieved the above using a combination of AWS Lambda functions, EC2 instances, S3 buckets, SNS notifications and CloudWatch.
How did we achieve it?
Batch and Stream data
Since Riverus is trying to be a pioneer in the legal domain, which is both diverse and weighed down by a huge historical backlog of cases, it is imperative to cater to two basic needs:
- Batch/Historical data: Processing, extracting and analyzing a large number of existing legal cases in one go.
- Stream data: Doing the above tasks as and when a legal case becomes available in our repository.
Introducing parallelism across files allows us to use the same architecture for stream as well as historical bulk runs.
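To make the idea concrete, here is a minimal sketch of one entry point serving both paths. It assumes the stream path is triggered by S3 event notifications; the function names (`handler`, `bulk_run`, `process_file`) and the bucket and key values are hypothetical and not taken from our code base.

```python
from urllib.parse import quote_plus, unquote_plus


def keys_from_s3_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def make_event(bucket, key):
    """Build a minimal S3-style event so a bulk run can reuse the stream path."""
    return {
        "Records": [
            {"s3": {"bucket": {"name": bucket}, "object": {"key": quote_plus(key)}}}
        ]
    }


def process_file(bucket, key):
    """Hypothetical per-file work; here it only reports what it would process."""
    return f"processed s3://{bucket}/{key}"


def handler(event, context=None):
    """Single entry point used by both the stream path and the bulk path."""
    return [process_file(bucket, key) for bucket, key in keys_from_s3_event(event)]


def bulk_run(bucket, keys):
    """Historical run: one synthetic event per file, same handler as the stream path."""
    results = []
    for key in keys:
        results.extend(handler(make_event(bucket, key)))
    return results
```

Because the bulk run only fabricates the same event shape the stream path already receives, no second code path is needed for historical data.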
With our data extraction systems improving quickly, iteratively and continuously over time, it becomes imperative to provide these enhancements to customers as quickly as possible without hindering their ability to use these features.
Sometimes a few critical datapoints improve and a bulk run needs to be done over a large number of files. Separating each datapoint into its own Lambda function enables us to do exactly that, with very minimal changes to the code base!
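A rough sketch of why this separation helps, using plain Python functions as stand-ins for the per-datapoint Lambda functions. The extractors, the `field` helper and the sample text are all made up for illustration.

```python
def field(text, label):
    """Return the value following 'label:' up to the next ';', or None if absent."""
    marker = label + ":"
    if marker not in text:
        return None
    return text.split(marker, 1)[1].split(";", 1)[0].strip()


# Hypothetical datapoint extractors, one per datapoint.
def extract_court(text):
    return field(text, "Court")


def extract_judge(text):
    return field(text, "Judge")


EXTRACTORS = {"court": extract_court, "judge": extract_judge}


def run_datapoints(text, names=None):
    """Run all extractors, or only the named ones."""
    selected = EXTRACTORS if names is None else {n: EXTRACTORS[n] for n in names}
    return {name: fn(text) for name, fn in selected.items()}


def bulk_rerun(stored, documents, improved):
    """Refresh only the improved datapoints; other stored values stay untouched."""
    for doc_id, text in documents.items():
        stored.setdefault(doc_id, {}).update(run_datapoints(text, improved))
    return stored
```

A bulk run for one improved datapoint then becomes `bulk_rerun(stored, documents, ["judge"])`, and nothing else in the stored results changes.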
Workload management (Efficiency)
Since speed is the name of the game, we identified the sweet spots we could leverage to achieve our goal. There are multiple areas where concurrency and multi-processing are possible, and together they reduce the overall running time of the pipeline:
- File-level concurrency: This architecture allows us to spawn multiple processes simultaneously to handle hundreds of files at the same time.
- Datapoint-level concurrency: The extraction of all datapoints commences at the same time, i.e. they are fired simultaneously.
- Multi-processing within the datapoints: This depends on the processes responsible for the extraction. Techniques can be employed to make that extraction faster using the multiple cores available in the Lambda functions.
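The first two levels can be sketched locally with thread pools. This is only an analogy: threads stand in for separate Lambda invocations, and the two extractors are trivial placeholders rather than real datapoints.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder extractors standing in for real datapoint extraction.
def word_count(text):
    return len(text.split())


def char_count(text):
    return len(text)


DATAPOINTS = {"word_count": word_count, "char_count": char_count}


def extract_all(text):
    """Datapoint-level concurrency: submit every extractor at once."""
    with ThreadPoolExecutor(max_workers=len(DATAPOINTS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in DATAPOINTS.items()}
        return {name: future.result() for name, future in futures.items()}


def process_files(files, max_workers=8):
    """File-level concurrency: each file is handled independently of the others."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so they line up with the file names.
        results = pool.map(extract_all, files.values())
        return dict(zip(files.keys(), results))
```

The third level, using several cores inside a single extractor, would depend on the extractor itself and is left out of this sketch.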
Fault Tolerant and Error Reporting
While tending to customers’ need for speed, it is equally important to manage internal errors and ensure that the customer receives only a soft blow when something goes wrong.
Amazon CloudWatch comes to the rescue in these trying times and keeps us aware of impending downtime and erroneous areas in the pipeline. It lets us monitor and salvage the situation before it gets out of hand, thus increasing customer satisfaction.
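One way to keep a single failing datapoint from taking down the whole file is to wrap each extractor and emit a structured error record. The sketch below assumes the log output ends up in CloudWatch Logs, as it does for Lambda function output; the record fields and function names are illustrative, not our actual schema.

```python
import json
import logging

logger = logging.getLogger("pipeline")


def safe_extract(file_id, name, fn, text):
    """Run one extractor; on failure, log a structured error and return a failure record."""
    try:
        return {"ok": True, "value": fn(text)}
    except Exception as exc:
        report = {
            "file": file_id,
            "datapoint": name,
            "error": type(exc).__name__,
            "message": str(exc),
        }
        # One JSON object per line is easy to search and alert on in a log service.
        logger.error(json.dumps(report))
        return {"ok": False, "error": report}


def run_file(file_id, text, extractors):
    """Run every extractor for a file; a failure in one does not stop the others."""
    return {
        name: safe_extract(file_id, name, fn, text)
        for name, fn in extractors.items()
    }
```

The caller still gets every datapoint that succeeded, and the error record says which file and which datapoint need attention.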
With great power come great restraints! Lambda functions are a great candidate for hosting your APIs if you’re looking for ease, flexibility and reliability. But in our use cases, we found them lacking in one area, power-hungry deep learning models, which left us with no choice but to go back to EC2 servers and host our own APIs.
Unfortunately, Lambda functions don’t give us access to GPU computation, which would’ve solved our problems. The alternative, loading the model into memory, also hits the limits set by Amazon for Lambda.
A few options we could explore:
- Harness the sweet GPU nectar: we could probably use Algorithmia, which allows us to deploy models using GPUs.
- Make our models smaller: techniques such as dynamic quantization, which use lower-precision data types such as fp16 instead of fp32, can generally make the models smaller and faster.
- Make our models faster: Microsoft has been developing ONNX Runtime for quite a while now, and it promises to accelerate PyTorch and TensorFlow models in production!
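To get a feel for the "smaller" option, the snippet below packs the same made-up weights as fp32 and as fp16 using only the standard library. It is not dynamic quantization itself and involves no deep learning framework; it only illustrates the storage saving and the rounding cost of halving the precision.

```python
import struct

# 100 made-up weights in the range [-5, 5); real model weights would come from a framework.
weights = [0.1 * i for i in range(-50, 50)]

# 'f' is a 4-byte single-precision float, 'e' is a 2-byte half-precision float.
fp32_bytes = struct.pack(f"<{len(weights)}f", *weights)
fp16_bytes = struct.pack(f"<{len(weights)}e", *weights)

# Round-trip through fp16 to see how much precision is lost.
restored = struct.unpack(f"<{len(weights)}e", fp16_bytes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(len(fp32_bytes), len(fp16_bytes), max_error)
```

The fp16 buffer is half the size of the fp32 one, at the price of a small rounding error per weight. Whether that error is acceptable depends on the model and has to be checked against its accuracy.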
Written by Niraj Pandkar, Data Scientist @ Riverus (Alumnus)