This week we had an intensive encounter with the AgBio world at the Plant Genomics Congress in London. Over the two days, it became clear that NGS is being rolled out in this area at a rapid pace. The issues related to big data processing are similar to those in the human space.
Articulating the differences
There are subtle differences, however, which make life more difficult for plant researchers. For instance, the diversity of available reference material, or the lack of it, puts a much stronger requirement on your tools to handle large and complex assemblies in resequencing projects.
A second difference is that plant genomes are much more complex than human genomes when it comes to polyploidy, repetitive regions and gene duplication. This implies that the mapper and variant caller need to be very thorough in their attempt to make the right calls: cautious not to make false calls, and careful not to discard real calls because of diploid assumptions.
It is clear that short reads have their limitations. On the other hand, short reads are a good compromise between throughput, tooling time and cost. When the likes of PacBio and Oxford Nanopore step up to the plate, the story might be different. Until then, we have to deal with the limitations as well as we can.
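To make the diploid-assumption problem concrete, here is a minimal sketch of ploidy-aware genotype scoring. It is an illustration only, not how any particular caller (including ours) works: the function names, the simple binomial read model and the 1% error rate are all assumptions chosen for the example.

```python
from math import comb, log


def genotype_log_likelihoods(alt_reads, depth, ploidy, error_rate=0.01):
    """Return {alt_copies: log-likelihood} for every possible alt allele dosage.

    Hypothetical helper: assumes reads are independent binomial draws and a
    symmetric sequencing error rate (both simplifications).
    """
    result = {}
    for alt_copies in range(ploidy + 1):
        # Expected fraction of alt reads for this dosage, blended with the
        # error rate so the probability is never exactly 0 or 1.
        frac = alt_copies / ploidy
        p = frac * (1 - error_rate) + (1 - frac) * error_rate
        result[alt_copies] = (
            log(comb(depth, alt_reads))
            + alt_reads * log(p)
            + (depth - alt_reads) * log(1 - p)
        )
    return result


def best_dosage(alt_reads, depth, ploidy, error_rate=0.01):
    """Return the alt allele dosage with the highest likelihood."""
    lls = genotype_log_likelihoods(alt_reads, depth, ploidy, error_rate)
    return max(lls, key=lls.get)


# 3 alt reads out of 30 at one site, scored under two ploidy models.
diploid_call = best_dosage(3, 30, ploidy=2)
hexaploid_call = best_dosage(3, 30, ploidy=6)
print("diploid model dosage:", diploid_call)
print("hexaploid model dosage:", hexaploid_call)
```

With these toy numbers the diploid model prefers dosage 0, so the variant is ignored, while the hexaploid model prefers a single alt copy. The same reads give a different answer purely because of the ploidy assumption built into the tool.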
A third difference is that plants are NOT human. That seems a fair statement. The implication for selecting tools, for example for variant calling, is that one must be aware of ‘training sets’. Out of the box, tools may come equipped with training sets that ensure quality of output, provided the data falls into the right category. The training set supplies a set of known variants, which introduces a bias towards known mutations. Some tools may not even run correctly without such training sets.
A fourth difference to mention is that plant genomics is usually done on sample pools and seldom on individual samples. Having sample pools, or cohorts, requires a population calling approach to be able to find markers, or to select the right pool based on a marker. This is a major difference compared to the human space, where cohorts are used in research and the results are applied to the individual patient. In AgBio, cohorts are used both for marker discovery and for marker-based discrimination.
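As a rough illustration of what marker discovery across pools involves, the sketch below accumulates read counts per pool and reports sites where the alternate allele frequency differs strongly between two pools. Everything in it is hypothetical: the `PoolCounts` class, the `candidate_markers` function, the 0.8 threshold and the example counts are assumptions for illustration, and a read-count ratio is only a crude frequency estimate. It is not a description of our population caller.

```python
from collections import defaultdict


class PoolCounts:
    """Running alt/total read counts per site for one sample pool."""

    def __init__(self):
        self.alt = defaultdict(int)
        self.total = defaultdict(int)

    def add_sample(self, site_counts):
        """Add one sample: {site: (alt_reads, depth)}.

        Can be called whenever a new sample arrives; earlier samples
        do not need to be reprocessed.
        """
        for site, (alt_reads, depth) in site_counts.items():
            self.alt[site] += alt_reads
            self.total[site] += depth

    def frequency(self, site):
        """Alt allele frequency at a site, or None if there is no coverage."""
        depth = self.total.get(site, 0)
        return self.alt.get(site, 0) / depth if depth else None


def candidate_markers(pool_a, pool_b, min_diff=0.8):
    """Sites whose alt allele frequency differs by at least min_diff."""
    markers = []
    for site in sorted(set(pool_a.total) & set(pool_b.total)):
        fa, fb = pool_a.frequency(site), pool_b.frequency(site)
        if fa is not None and fb is not None and abs(fa - fb) >= min_diff:
            markers.append(site)
    return markers


# Made-up counts for two pools; sites are (contig, position) tuples.
resistant = PoolCounts()
resistant.add_sample({("chr1", 100): (18, 20), ("chr1", 200): (5, 20)})
resistant.add_sample({("chr1", 100): (19, 20), ("chr1", 200): (6, 20)})

susceptible = PoolCounts()
susceptible.add_sample({("chr1", 100): (1, 20), ("chr1", 200): (5, 20)})

print(candidate_markers(resistant, susceptible))
```

In this toy data only the first site separates the two pools. The same per-pool frequencies can then be used the other way round, to decide which pool a new sample most resembles at a known marker.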
Committed and well-aligned
Even though it was not designed with these differences in mind, the GENALICE secondary analysis suite is a perfect match for the above requirements of plant genomics applications. Our purist approach to dealing with NGS (short read) data pays off in this area. We can deal with a reference of any size (the design limit is 512 Gbases), we have a strictly observational (unbiased) variant reporting approach, and our population caller is designed for continuous cohort growth using a true incremental sample processing approach.
We will be adding more functionality to our suite, such as Structural Variant detection, advanced query operations on populations, and RNA-Seq quantification and variant detection. As such, we are committed to serving both the human and the AgBio community with high-quality, high-speed tools that allow you to focus on content rather than data.
Are there other NGS big data challenges within plant genomics that need further attention and that you would like us to focus on? Please add your comments below.