Friday, 25 September, 2015 UTC


Summary

In our last post, we talked about the algorithm we use to do sub-second CRISPR searches on billions of DNA bases. Our initial search infrastructure was too expensive and unscalable, so one of our summer interns, Harini, focused on rebuilding this infrastructure using AWS Lambda. In this post, we’ll focus on how we use AWS Lambda to parallelize our CRISPR searches.

The Problem

CRISPR Cas9 is a revolutionary genome editing technique that allows scientists to modify parts of a genome with extreme precision. It’s already being used by thousands of scientists worldwide to build disease models, cure genetic disorders, and more¹.
A CRISPR search involves finding a 20-base string of DNA (a “guide”) that occurs in the target region of the genome and nowhere else. Given a particular guide, we need to find all locations in the genome that are “close enough” to the guide.
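To make “close enough” concrete, here’s a deliberately naive sketch that treats it as a mismatch (Hamming distance) threshold. This is just an illustration; the real algorithm from our last post is far faster, and the threshold is an arbitrary example.
// Naive illustration: report every window of the genome within maxMismatches of the guide.
// The real search (see our last post) avoids this O(genome length × guide length) scan.
function findCloseMatches(genome, guide, maxMismatches) {
  var matches = [];
  for (var i = 0; i + guide.length <= genome.length; i++) {
    var mismatches = 0;
    for (var j = 0; j < guide.length && mismatches <= maxMismatches; j++) {
      if (genome[i + j] !== guide[j]) mismatches++;
    }
    if (mismatches <= maxMismatches) matches.push({ position: i, mismatches: mismatches });
  }
  return matches;
}
Running a scan like this over a 3GB genome would take far too long, which is why the fast search algorithm and the parallelization described below both matter.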
We want to be able to do this search across dozens of genomes in a way that’s fast, cheap, and dynamic. In our case, dynamic means it should be easy to add new genomes. Since our algorithm doesn’t require any indexing or preprocessing of the genome, we should also be able to support genomes that users upload.
When analyzing our infrastructure, we’ll keep the following in mind:
  • We currently run a few hundred thousand CRISPR searches a month. These searches happen in bursts, and are not uniformly distributed across the day.
  • The speed should not depend on how many concurrent searches are running.
  • We currently have 30 genomes, each 3GB in size; we should be able to support an arbitrary number of genomes.
  • We have an algorithm (implemented in C++) that takes in a string of DNA and a list of queries, and runs at about 1 second per 100MB.
  • To reduce complexity, we’d like this to run easily in our current AWS architecture.
From these requirements, we have an idea of what our infrastructure needs:
  • Dynamic server allocation — we want to serve requests quickly even at peak times, and it would cost too much to allocate for the maximum load.
  • Shared file system or separate servers per genome — we have 90GB (and growing) of genome data, so storing this on the disk of each (dynamically allocated) server is not feasible.

Original Infrastructure

Our old CRISPR infrastructure used several servers to process CRISPR search tasks. Each server stored all of the genomes on disk. Using our existing Python Celery infrastructure, a single CRISPR search was split into several subtasks and parallelized across the servers. Each subtask read the corresponding genome region from disk, performed the search, and returned the results. The results were combined by a master task and returned to the user.
There were several issues with this approach:
  • All genomes were stored on each of the servers, making maintenance difficult.
  • We couldn’t spin up new servers fast enough if demand was high.
  • We were paying for the servers even when they were not being used.
These are all common problems with scaling, and Amazon provides several possible solutions. Auto Scaling is one such service, but is too slow for our spiky requests. Each CRISPR search takes around 1 second, so waiting 1 minute for an EC2 instance to start is too much. Instead, we went with Amazon’s new serverless platform, AWS Lambda. In the following sections, we’ll discuss the infrastructure we built on AWS Lambda, and the various complications we ran into.

AWS Lambda

In Amazon’s words: “AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information.”
AWS Lambda provides all of the infrastructure around dynamic server allocation. To get started, you upload a ZIP file with the code you need to run, and specify the entry point for the request. For any request, Lambda will run the code, creating new servers if necessary. You pay for the amount of time used and memory allocated for each request. We use Lambda to offload long-running and parallelizable work from our web servers.
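As a trivial sketch of what that entry point looks like in Node.js (this isn’t our search code, just the smallest possible handler):
// index.js: "index.handler" is the entry point you specify when creating the function
exports.handler = function (event, context) {
  // Echo the request back; a real function would do its work here and return the result
  context.succeed({ message: 'Hello from Lambda', input: event });
};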

New Infrastructure

Our goal is to split up a CRISPR search across several Lambda “tasks” to reduce costs and increase scalability. The CRISPR search problem is easily parallelized by splitting the genome into smaller regions that can be processed separately. The general approach is fairly straightforward (there’s a code sketch after the list):
  1. A web server receives a request to do a CRISPR search on a specific genome.
  2. The web server splits up the genome into smaller regions.
  3. The web server invokes the Lambda function for each region.
  4. Once all of the Lambda tasks complete, the results are combined and returned to the user.
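Roughly, the fan-out from the web server looks like the sketch below. Our web servers are Python, so the real code uses boto3 rather than the Node.js SDK, and the function name, region size, and payload shape here are illustrative.
// Illustrative fan-out: split the genome into regions, invoke the Lambda function once per
// region, and merge the per-region results once they all come back.
var AWS = require('aws-sdk');
var lambda = new AWS.Lambda();

function searchGenome(genomeName, genomeLength, guides) {
  var REGION_SIZE = 100 * 1024 * 1024; // ~100MB per task, i.e. roughly 1 second of search time
  var invocations = [];
  for (var start = 0; start < genomeLength; start += REGION_SIZE) {
    var end = Math.min(start + REGION_SIZE, genomeLength) - 1;
    invocations.push(lambda.invoke({
      FunctionName: 'crispr-search', // hypothetical function name
      Payload: JSON.stringify({ genome: genomeName, startIndex: start, endIndex: end, guides: guides }),
    }).promise());
  }
  // Wait for every region, then concatenate the matches into a single result for the user
  return Promise.all(invocations).then(function (responses) {
    return responses.reduce(function (all, res) {
      return all.concat(JSON.parse(res.Payload));
    }, []);
  });
}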
This approach completely solves our dynamic scaling issue. We don’t need to maintain several servers to perform searches, as AWS will handle all of the allocation for us. That was easy!
But, as always, there were…

Complications

AWS Lambda imposes several limitations on the size and types of requests it can process. Here’s a quick list of the main issues we ran into, and how we solved them.

1. Running C++ code with AWS Lambda

Lambda only supports JavaScript (Node.js) and Java, so we needed a way to run our C++ code. Porting the code over to JavaScript or Java would have resulted in a significant performance hit.
Node.js provides a way to call into C++ libraries directly with addons. We used node-gyp to compile our C++ code into a binary that Node.js can call directly.
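Once compiled, the addon loads like any other Node.js module. The module path and exported function name below are illustrative rather than our actual ones:
// Load the node-gyp-compiled addon (built from a binding.gyp that lists our C++ sources)
var search = require('./build/Release/crispr_search');

// Call straight into the C++ code: a string of DNA bases and a list of guide queries in, matches out
var matches = search.run('ACGTACGTACGTACGTACGTACGTACGT', ['ACGTACGTACGTACGTACGT']);
console.log(matches);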
Lambda also only runs on Amazon Linux, so we needed to compile our C++ code on an Amazon Linux machine. To deploy a new AWS Lambda function, our process is:
  1. Spin up an Amazon Linux machine
  2. Compile the search code using node-gyp
  3. Upload the compiled code as a new AWS Lambda function.
Although this process is fairly cumbersome, it hasn’t been a problem, as the core search functionality has remained relatively unchanged. We can automate this process if necessary.

2. Passing genome data to the Lambda function

The Lambda function needed to easily access regions of the 3GB genome. Amazon has several size limitations that made this hard:
  • Maximum size of the uploaded zipped code: 50MB
  • Maximum size of the unzipped code: 250MB
  • Maximum request size: 6MB
  • Maximum response size: 6MB
We considered several options:
  1. Pass the genome bases as an argument to the Lambda function
  2. Include the genome region in the zipped code
  3. Download the genome region from S3 on each Lambda invocation
Option 1 was too slow, and option 2 would have been hard to maintain with dozens of genomes. We went with option 3.
We stored each of the genomes as its own file in S3. We then specified the genome region to the Lambda function by passing in the genome name with start and end indices (the guide queries themselves are small and fit easily within the 6MB request limit). The handler looks roughly like this; the bucket name and the addon’s module and function names are illustrative:
var AWS = require('aws-sdk');
var s3 = new AWS.S3();
var search = require('./build/Release/crispr_search'); // node-gyp-compiled C++ addon
exports.handler = function (event, context) {
  // Download only the genome region from startIndex to endIndex via an S3 byte-range GET
  s3.getObject({ Bucket: 'genomes', Key: event.genome, Range: 'bytes=' + event.startIndex + '-' + event.endIndex }, function (err, data) {
    if (err) return context.fail(err);
    context.succeed(search.run(data.Body.toString(), event.guides)); // run the C++ search and return the results
  });
};
The S3 API supports downloading specific regions of a file, so we only download the bases we need. The bandwidth between S3 buckets and EC2 instances in the same region is around 50MB/s¹, so this was fast with small enough genome regions.
A large issue we ran into was the Lambda Resource Model. From the AWS documentation:
“AWS Lambda then allocates CPU power proportional to the memory by using the same ratio as a general purpose Amazon EC2 instance type, such as an M3 type.”
Each Lambda function has an allocated memory size, and the only way to get more CPU power is to pay for more memory. Download speeds are, in turn, directly proportional to the CPU power, and thus to the memory. This meant that even though our algorithm required very little memory, we had to pay for 1GB of memory to get 50MB/s download speeds from S3.

Analysis

We wanted our infrastructure to be fast, cheap, and dynamic. Let’s look at how our new Lambda infrastructure compares with the old server infrastructure.

1. Speed and Scalability

Our old server infrastructure was built to run CRISPR searches in around 1 second. Unfortunately, it did not scale well as demand fluctuated, and it would have cost a lot as overall demand increased.
Our new Lambda infrastructure can theoretically run at arbitrary speeds, since we can choose the degree of parallelization. In practice, other problems appear as the number of Lambda tasks gets large, so we currently run at around 1 second per CRISPR search. It also scales automatically with demand, meaning that we can perform CRISPR searches quickly even at peak times.

2. Cost

Our old server infrastructure cost thousands of dollars each month in server costs alone.
Using the new Lambda infrastructure, we pay for the number of Lambda invocations, the total duration of the requests, and the number of S3 requests. This comes out to $60/month for hundreds of thousands of CRISPR searches!

3. Dynamic User Uploads

Our old server infrastructure required us to add new genomes to the disk of each server. It also made analyzing user-uploaded genomes difficult, as we would have needed to copy the uploaded genome to several different servers.
The Lambda infrastructure gets around this problem easily, since the genome is only stored in one place. We can now support user-specified genomes by uploading them directly to S3. As CRISPR becomes more widespread, this will become more useful for companies and users that have custom genomes to analyze.

Conclusion

This was our first foray into using AWS Lambda. We were able to use the serverless architecture to parallelize long-running CRISPR searches. It is significantly cheaper, and will scale well as our CRISPR usage grows. We’re super excited to be building tools that scientists use every day to do their research.

We’re Hiring!

And yes, Benchling is hiring engineers to join our small-but-growing team — if you’re interested in building high-quality tools to power a new generation of research, see our jobs page or contact us.
  1. Data transfer between S3 buckets and EC2 instances in the same region is free, so we only need to pay for the number of GET requests.
