No compromises. No choices. This was the challenging starting point when we designed the new algorithms and configurations for our ultra-fast Next-Generation Sequencing (NGS) read alignment and variant calling solution: GENALICE MAP.
Improve and accelerate DNA research with existing hardware
MAP transforms DNA-Seq and RNA-Seq research projects in terms of both speed and quality. Our groundbreaking NGS data processing solution is particularly suitable for large-scale projects such as Whole Genome Sequencing (WGS). MAP runs on commodity Intel Xeon E5 processors and is delivered in a turnkey appliance, the GENALICE VAULT, which also contains real-time monitoring, workflow management software and an embedded Oracle database.
WGS: Big data deluge?
GENALICE MAP is the only next-generation alignment and variant calling solution capable of processing a complete human genome (WGS – coverage 37x) in 30 minutes or less with an output file of 4 GB instead of 400 GB. Aligned reads are stored in a GAR (GENALICE Aligned Reads) file, and the final output format is a VCF file.
GENALICE MAP has been validated using various datasets in close collaboration with some of the world’s leading university medical centers and crop improvement companies, to further substantiate the choices that have been made to optimize file size reduction.
Described here are WGS data processing results obtained using MAP.
For example, GENALICE used human data from the 1,000 Genomes Project and gold standard data from several centers. Additionally, two tomato species used in the validations were kindly provided by KeyGene. A direct comparison was performed between GENALICE MAP and a pipeline combining one of the world’s most widely used alignment tools, the Burrows-Wheeler Aligner (BWA), with the Broad Institute’s GATK variant calling software. The hardware configurations, input data and DNA reference used in this comparison were kept exactly the same for both pipelines.
The design approach for GENALICE MAP featured new algorithms that are optimally geared towards modern hardware architecture.
As a result, MAP aligns NGS short reads up to 100 times faster than conventional aligners like BWA. For Whole Genome Sequencing of human 1,000 Genomes Project data, the measured speed was 115.8 Megabases per second, compared to 1.1 Megabases per second for BWA (left chart).
In a benchmark study between GENALICE MAP and BWA-MEM/GATK*, the processing times of a FASTQ file for a whole human genome sequenced at 50x depth were compared. The results in the adjacent graph show that MAP is over 100 times faster than the conventional pipeline. During the study, GENALICE MAP aligned 790 million read pairs, called the high quality variants and completed the task in less than 45 minutes, whereas it took BWA-MEM/GATK over 80 hours to finish the job.
*GATK version 3.1, HaplotypeCaller
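The two speed-up claims above can be checked with simple arithmetic. The Python sketch below uses only the figures quoted in the text; the variable and function names are ours, and because the run times are given as bounds ("less than 45 minutes", "over 80 hours"), the run-time ratio is a lower bound rather than an exact measurement.

```python
# Alignment throughput quoted in the text, in Megabases per second.
MAP_MBASES_PER_S = 115.8
BWA_MBASES_PER_S = 1.1

# End-to-end run times from the 50x benchmark, in minutes.
MAP_MINUTES = 45             # "less than 45 minutes" (upper bound)
BWA_GATK_MINUTES = 80 * 60   # "over 80 hours" (lower bound)


def ratio(numerator, denominator):
    """Return numerator / denominator as a plain speed-up factor."""
    return numerator / denominator


# Higher throughput is better, so MAP goes in the numerator.
throughput_speedup = ratio(MAP_MBASES_PER_S, BWA_MBASES_PER_S)
# Shorter run time is better, so the conventional pipeline goes in the numerator.
runtime_speedup = ratio(BWA_GATK_MINUTES, MAP_MINUTES)

print(f"Alignment throughput speed-up: {throughput_speedup:.1f}x")      # 105.3x
print(f"Pipeline run-time speed-up: at least {runtime_speedup:.1f}x")   # 106.7x
```

Both ratios land just above 100, which is consistent with the "up to 100 times faster" and "over 100 times faster" statements.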
The ultra-high speed variant caller inside GENALICE MAP detects SNPs very accurately. A study using the high confidence SNP call set (NIST v2.18) from the “Genome in a Bottle Consortium” (GIAB) shows that MAP discovers a higher proportion of SNPs than BWA-MEM/GATK. To be specific, the concordance with GIAB is close to 98%, whilst the conventional pipeline identifies only around 95%. Additionally, MAP reports 2.7% no calls.
An even more significant quality improvement can be seen in INDEL alignment. This is related to the ability of the GENALICE product to map long INsertions and (indefinitely) long DELetions correctly. Most other alignment algorithms have severe difficulties when the indel is long (see graphs below). This makes MAP a highly accurate and precise alignment tool.
GENALICE MAP allows researchers to produce small-footprint GAR (GENALICE Aligned Reads) files that replace the BAM and FASTQ files. The GAR file contains all the information required to run a realignment job with different settings or an updated reference. This results in an up to 100-fold storage footprint reduction in comparison to BAM/FASTQ. For WGS cultivated tomato data the exact numbers are: FASTQ – 88.1 GB, BAM – 25.0 GB and only 1.6 GB for the GAR file (left chart). The exact storage footprint gains depend on the difference between the reference and the specific sample (right chart). In all cases, significant storage cost reductions will be achieved.
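As a worked example, the reduction factors for the cultivated tomato sample can be derived directly from the three file sizes quoted above. This is a minimal sketch; the dictionary and function names are illustrative and not part of any GENALICE tooling.

```python
# File sizes for the cultivated tomato WGS sample, in GB (from the text).
SIZES_GB = {"FASTQ": 88.1, "BAM": 25.0, "GAR": 1.6}


def reduction_factor(replaced_gb, gar_gb):
    """How many times smaller the GAR file is than the data it replaces."""
    return replaced_gb / gar_gb


gar = SIZES_GB["GAR"]
vs_fastq = reduction_factor(SIZES_GB["FASTQ"], gar)
vs_bam = reduction_factor(SIZES_GB["BAM"], gar)
# The GAR file replaces both FASTQ and BAM, so compare against their sum too.
vs_both = reduction_factor(SIZES_GB["FASTQ"] + SIZES_GB["BAM"], gar)

print(f"GAR vs FASTQ:       {vs_fastq:.1f}x smaller")   # 55.1x
print(f"GAR vs BAM:         {vs_bam:.1f}x smaller")     # 15.6x
print(f"GAR vs FASTQ + BAM: {vs_both:.1f}x smaller")    # 70.7x
```

For this sample the combined reduction is roughly 70-fold, which sits below the "up to 100-fold" ceiling and fits the statement that the gain varies with how far the sample differs from the reference.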
GENALICE MAP has a short and simple workflow, consisting of only two processing steps: alignment and variant calling. Variant calling includes duplicate marking, INDEL realignment, genotyping and filtering in a single processing step. In contrast, the BWA-MEM/GATK workflow is lengthy, complex and multi-staged, involving several software packages. Aligned reads are stored in the BAM format, which has a large storage footprint (~100 GB).
In addition, aligned reads must be prepared (which involves sorting, indexing and duplicate marking) prior to variant calling. Each preparation step generates another BAM file. Moreover, variant calling in GATK requires two independent processing stages: genotyping and filtering.
As NGS data processing precedes downstream analyses, it is highly important that the files produced are compatible with existing pipelines. GENALICE MAP produces a GAR file with a much more efficient storage structure than conventional BAM files. To ensure full compatibility, we offer our customers a series of solutions. A GAR file can be:
- converted to a BAM/SAM file in near real time;
- directly loaded in GATK, IGV and downstream analysis tools using plugins;
- used in conjunction with GENALICE’s variant caller to produce a fully compatible VCF.
A free GAR reader/converter will also be made available to the public.
In the overview below a summary comparison is made between the key attributes of a GAR file and BAM, CRAM and reduced BAM files. The GENALICE format scores well on all items.
GENALICE MAP generates major gains in cost effectiveness. Its massive processing speed allows clients to perform the same alignment jobs using significantly less hardware and time.
Furthermore, MAP allows clients to produce a small-footprint GAR file to replace the BAM and FASTQ files. As a result, significant storage cost reductions are achieved. In addition, this file size reduction allows our customers to replace the existing physical shipment of hard disks containing BAM files with secure network transfer.
This results in significant cost savings and reduced project risks. Lastly, better accuracy results in more cost-effective drug development, improved diagnostics and treatment.
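To illustrate why a smaller file makes network transfer practical, the sketch below estimates idealised transfer times for the 400 GB and 4 GB per-genome output sizes quoted earlier. The 100 Mbit/s link speed is a hypothetical assumption chosen for illustration, not a figure from GENALICE, and protocol overhead and encryption cost are ignored.

```python
def transfer_seconds(size_gb, link_mbit_per_s):
    """Idealised transfer time for decimal gigabytes over a sustained link."""
    megabits = size_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / link_mbit_per_s


LINK_MBIT_PER_S = 100  # hypothetical sustained bandwidth, not from the source

conventional_gb = 400  # conventional output per genome, as quoted in the text
gar_gb = 4             # MAP output per genome, as quoted in the text

conventional_hours = transfer_seconds(conventional_gb, LINK_MBIT_PER_S) / 3600
gar_minutes = transfer_seconds(gar_gb, LINK_MBIT_PER_S) / 60

print(f"400 GB at {LINK_MBIT_PER_S} Mbit/s: {conventional_hours:.1f} hours")  # 8.9 hours
print(f"4 GB at {LINK_MBIT_PER_S} Mbit/s: {gar_minutes:.1f} minutes")         # 5.3 minutes
```

Under this assumption a single genome drops from most of a working day to a few minutes of transfer time; for projects with many genomes the gap scales accordingly.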
More information about GENALICE MAP
Product Supporting Materials:
- GENALICE MAP product brochure
- GENALICE MAP cost-effectiveness flyer: A4 version or US letter version
- GENALICE MAP product specification sheet: A4 version or US letter version
- GENALICE MAP white papers:
- GENALICE MAP GCAT benchmark report
- Unofficial world record setting event
- PAG 2014 poster presentation: Ultra-fast, accurate and cost-effective NGS read alignment validated for complex plant genomes (Jos Lunenberg, Bas Tolhuis & Hans Karten)
- HiTSeq 2013 poster presentation: Ultra-fast, accurate and cost effective NGS read alignment with significant storage footprint reduction (Bas Tolhuis, Jos Lunenberg & Hans Karten)
- NBIC 2013 poster presentation: Ultra-fast, accurate and cost effective NGS read alignment with significant storage footprint reduction (Rick Karten, Bas Tolhuis & Hans Karten)
- ECCB 2012 poster presentation: A new ultra-fast and comprehensive NGS read aligner with high precision (Jos Lunenberg, Bas Tolhuis & Hans Karten)
*For more information on the validations and product availability in your country, please contact us.