The Big data wave

It is not a new phenomenon that the field of genomics must deal with a data processing challenge. Moore's law, and with it IT budgets, cannot keep up with the growing data processing requirements.
This has been clear on paper for a while, but people are now feeling the harsh reality as they hit the boundaries of available compute, storage systems, and competent people to manage them. This often leads to under-utilized sequencing capacity and ever-expanding IT infrastructure just to surf this data wave.
All of this is fueled by the decrease in sequencing cost, while at the same time research is reaching a ceiling on statistical significance with smaller, existing cohorts. The push to work with new and larger cohorts in NGS studies, to find new indicators with higher confidence, drives sequencing cost further down, yet runs into an IT cost point that slowly becomes an inhibitor. And rest assured, the quest for cheaper and higher-capacity sequencing technologies is still in full swing.
The need for innovation
There is only one way out: innovation! In this space, every institute and every person is subject to Archimedes' principle. The deeper one is buried under the data wave, the more pressure there is to (mentally) let go of the legacy weight of existing methods that keeps one under, float up, and grab a spotted lifesaver ring.
I don't think competition among sequencing service providers will accelerate this path by itself. The users of NGS data must stand up and push for innovation by demanding better service and lower price levels. The end-user always pays for the entire food chain: if you buy a sandwich, you pay the farmer.
When one is down deep enough, one will grab any lifesaver ring in sight. For those not yet squeezed for oxygen, it makes sense to survey the data flow landscape and make an informed selection instead of taking the first available escape.
We know that the data stream from the sequencer is 'fat' up to the point of the VCF. I say VCF and not gVCF, as the latter is a beast as well: we have seen gVCFs reaching sizes of 7-50 GB. Right now, this fat data stream is the most significant analysis concern. So if we 'solve' the part up to the (population) VCF, we are floating.
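To make the 'fat' data stream concrete, here is a rough back-of-the-envelope sketch. The per-stage sizes are illustrative assumptions for a single high-coverage human whole genome; only the gVCF range (7-50 GB) comes from the figures quoted above.

```python
# Illustrative per-stage footprints (GB) for one high-coverage human whole
# genome. These are rough assumptions for the sake of argument; only the
# gVCF range (7-50 GB) is taken from the observations quoted in the text.
STAGE_GB = {
    "raw reads (gzipped FASTQ)": 100.0,             # assumed
    "aligned reads (BAM)": 90.0,                    # assumed
    "aligned reads (compressed, e.g. CRAM)": 20.0,  # assumed
    "gVCF": 30.0,                                   # midpoint of 7-50 GB
    "VCF": 1.0,                                     # assumed
}

def reduction(from_stage: str, to_stage: str) -> float:
    """Factor by which the footprint shrinks between two stages."""
    return STAGE_GB[from_stage] / STAGE_GB[to_stage]

if __name__ == "__main__":
    for stage, gb in STAGE_GB.items():
        print(f"{stage:40s} ~{gb:6.1f} GB")
    print(f"FASTQ -> VCF shrinks the stream roughly "
          f"{reduction('raw reads (gzipped FASTQ)', 'VCF'):.0f}x")
```

Whatever the exact numbers in a given lab, the shape of the curve is the point: everything before the (population) VCF is orders of magnitude heavier than what the end-user actually consumes.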
There are voices that wave away the 'pain' at this first stage as less important and focus instead on LIMS systems to automate sequencing production farms. That may be of operational importance in large labs, but it does not take away the scope and cost of the issue at hand. Automating the hammer hitting your finger will only amplify the pain.
Producing the VCF
[Figure: Architecture for NGS Data Processing]

The vast majority of genetic data users are interested only in the 'result' of the heavy data lifting, which is the VCF. Having the reads as background for quality assurance, removal of doubt, and undisputed proof for new discoveries and publications is a requirement. But hardly any of the researchers or medical doctors I know are equipped for, or interested in, Big data processing. The number of 'services' popping up in the market as spinouts of hospitals or research institutes, and the interest of investors in ventures of this nature, is a clear signal.
Back to our main subject: the intuitive approach to the Big data challenge in NGS is to segment the IT infrastructure into a high-performance Big data section and a non-Big data section. This is perfectly legitimate: it co-locates the sequencer and the first stage of data processing in a Big data environment, which then hands the results over to a non-Big data environment with lower specifications for in-depth analysis. Such analysis could, and should, be done completely separate from the Big data environment.
Some say the ultimate solution would be to do all Big data processing inside the sequencer and have it produce only a (g)VCF. It is true that this would reduce the data stream coming out of the sequencer substantially (except, as said before, for a gVCF). There is, however, a set of consequences to this approach that we need to consider.
Considerations for NGS Big Data Processing
A sequencer is a general-purpose device, which takes biological material (single or multi-sample) and transforms it in a very clever way into a digital representation. The samples may be a mix from different species and/or a mix of DNA and RNA. In other words, the content is most likely heterogeneous within a single run, and very likely heterogeneous across multiple runs.
To produce a (g)VCF inside a (short-read) sequencer, one would have to do mapping and variant calling against a provided reference. This turns the sequencer into a 'specific' device, tightly integrated with the data profiles and the analysis environment. One would have to load a reference (one per species) for each sequencing run, and specify which part of the 'run' is RNA and which is DNA. In the end, this is just a logistics and information problem. But it does add another level of integration, automation, and complexity to the sequencing process. Furthermore, I do not think it is likely that private references will be sent to (offshore) sequencing service providers.
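The extra bookkeeping this implies can be sketched as a per-run manifest. The structure and field names below are hypothetical, purely to illustrate the kind of metadata a VCF-producing sequencer would suddenly need, where a general-purpose sequencer needs none of it:

```python
from dataclasses import dataclass, field

@dataclass
class SampleSpec:
    sample_id: str
    material: str   # "DNA" or "RNA" -- must now be declared per sample
    species: str    # decides which reference genome must be staged

@dataclass
class RunManifest:
    """Hypothetical per-run metadata for a VCF-producing sequencer.

    A general-purpose sequencer needs none of this: it just emits reads.
    """
    run_id: str
    samples: list = field(default_factory=list)

    def references_needed(self) -> set:
        # One reference per species present in the (heterogeneous) run.
        return {s.species for s in self.samples}

run = RunManifest("run-042", [
    SampleSpec("s1", "DNA", "homo_sapiens"),
    SampleSpec("s2", "RNA", "homo_sapiens"),   # RNA needs different handling
    SampleSpec("s3", "DNA", "mus_musculus"),
])
print(run.references_needed())  # all references staged before calling starts
```

Nothing here is hard, which is exactly the point made above: it is 'just' logistics, but logistics that must be automated and kept correct for every single run.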
If a sequencer produced only a (g)VCF, one could neither re-align nor inspect the originating reads for migration and verification purposes. This will not work under current regulations (which of course will change over time). But what may be more important: it is impractical. There would be no possibility to run another pipeline on the raw data, get a second opinion on the results, take a different approach to ensure quality, or tune sensitivity when dealing with specific (tumor) data sets; and one could not do assembly.
Then there is a performance and scalability consideration, specifically for the high-throughput machines, which are causing the Big data headache. The internal methods of the dominant (Illumina) high-throughput machines require all cycles to complete before alignment and variant calling can start. This is known as the modality problem. With current technology it takes in the order of 30 minutes per sample to produce a VCF, or 8 hours for a 16-sample run. To speed this up, the sequencer would need to be equipped with 'Moore' processing power. Again, all doable, and eventually this boils down to a conversation with IT departments about investing in dedicated versus general-purpose equipment.
The ‘Aligned Reads’
This brings us full circle to where we started: the drive to work with larger cohorts to increase the quantity and quality of identified, relevant mutations. It is clear that large cohorts cannot be processed inside a sequencer. Larger cohorts are processed on a grid of general-purpose computers to keep the work scalable and economical.
The common foundation of all downstream processing is 'Aligned Reads'. A VCF or (g)VCF does not provide enough meat to be a viable foundation. And since the footprint of properly compressed, high-quality reads is even smaller than that of a gVCF, this is almost a no-brainer.
Once it is clear that 'Aligned Reads' are our foundation, it makes perfect sense for high-throughput sequencers to connect to a co-located grid of gateways and storage units (sized to performance requirements) which produce low-footprint 'Aligned Reads', and then push these to a grid (or cluster) to produce single-sample or population VCFs (the end-user product).
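That division of labour can be sketched as a toy pipeline. The function and file names are hypothetical, purely to show which step runs where:

```python
# Toy sketch of the proposed split (all names are hypothetical):
#   sequencer -> co-located gateway grid -> low-footprint 'Aligned Reads'
#             -> general-purpose cluster -> (population) VCF

def gateway_stage(raw_reads: str) -> str:
    """Runs next to the sequencer: align and compress the raw reads into
    low-footprint 'Aligned Reads' (e.g. a CRAM-like representation)."""
    return raw_reads.replace(".fastq", ".aligned")

def cluster_stage(aligned: list) -> str:
    """Runs on a general-purpose grid/cluster: joint calling across the
    cohort to produce the end-user product, a (population) VCF."""
    return f"population.vcf[{len(aligned)} samples]"

raw = [f"sample{i}.fastq" for i in range(1, 4)]
aligned = [gateway_stage(r) for r in raw]  # pushed from the gateways...
vcf = cluster_stage(aligned)               # ...to the cluster
print(vcf)
```

The key property of this shape is that only the compact 'Aligned Reads' ever leave the co-located Big data environment; the fat raw stream never has to travel.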
Producing 'Aligned Reads' or even a VCF inside a sequencing device would make sense if the sequencer were set up in isolation. The Ion Torrent is such a device, providing a convenient all-in-one solution that works well in constrained environments. Even then, the sequencer should allow for choice, to accommodate the desire to work with many samples in context.
There are two main take-aways from our contemplation:
- Low footprint ‘Aligned Reads’ must be the ‘product’ of a Big data sequencing environment;
- In a Big data sequencing environment, a sequencer is best left as a general-purpose device.
This approach gives us the best of all worlds:
- We have the detailed reads for inspection and migration;
- We can use second pipelines for verification and quality control;
- We have the smallest footprint;
- The sequencer remains a flexible general-purpose device;
- We can innovate and exchange our pipeline(s) independent of sequencer choice;
- We can do assembly;
- We can switch to new sequencing technology when it becomes viable.
Sustainable Architecture
So, to build and maintain a Sustainable Architecture for high-volume NGS data processing, we need to set proper perimeters within which we can optimize the sub-components to excel. This is where it starts: the sequencer itself remains a general-purpose device; the Big data processing environment close to the sequencer produces (at least) low-footprint 'Aligned Reads'; and this NGS data moves on to downstream analysis environments (the end user), which can be big or small, proximal or distant, depending on focus, cohort sizes, and location of service.