Finding the needle in the haystack

Author
Dr. Andre Kahles
ETH Zurich

Interview with one of the people in charge of the NRP 75 project “A learning storage structure for genetic information”.

What are your project goals and what have you already achieved?

The main goal of our project is to develop a database system for genomic sequencing data that allows researchers to store, search, and compare data arising from sequencing projects more efficiently. Over the past years, a large number of projects in science and medicine have focused on learning about the composition of the genomes of humans and other organisms. A central part of these efforts is to measure many millions of short DNA sequences per individual. After completion, these data are often made accessible to the research community, but they are so big that it is nearly impossible to compare a new, unknown DNA sequence in depth against all known sequences present in the archives. When trying to look up a sequence in these archives, one is often searching for a needle in a haystack. Our approach uses so-called “compact data structures” to transform the existing sequence data into a representation that needs very little space and can be searched in a small amount of time. In addition, we link each sequence to further information, for instance the identifier of its sample. This makes it possible not only to compare existing data sets with each other and to extract new information, but also to compare newly sequenced samples against what is already known.
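
To make the idea of a searchable index with sample labels more concrete, here is a minimal sketch in Python. It is not the project's actual data structure: a plain dictionary stands in for the compact representation, sequences are cut into short overlapping substrings (k-mers) as one common way to index DNA, and all names (`build_index`, `query`, `sample_A`, `sample_B`) as well as the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(samples, k):
    """Map each k-mer to the set of sample identifiers that contain it."""
    index = defaultdict(set)
    for sample_id, sequences in samples.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(sample_id)
    return index


def query(index, sequence, k):
    """Return, per sample, the fraction of the query's k-mers found in it."""
    query_kmers = list(kmers(sequence, k))
    hits = defaultdict(int)
    for kmer in query_kmers:
        # .get() avoids creating new empty entries in the index
        for sample_id in index.get(kmer, ()):
            hits[sample_id] += 1
    return {s: n / len(query_kmers) for s, n in hits.items()}


# Toy data (assumed for illustration only)
samples = {
    "sample_A": ["ACGTACGTGA"],
    "sample_B": ["TTGACGTTTC"],
}
index = build_index(samples, k=4)
result = query(index, "ACGTGA", k=4)
print(result)
```

In this toy example the query shares all of its k-mers with `sample_A` and only one of three with `sample_B`. A real system would replace the dictionary with a compressed representation so that the index needs far less memory than the raw sequences.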

What are you and your team particularly proud of?

Our project officially started in April 2016, and since then we have made very encouraging progress. Not only were we able to recruit fantastic doctoral students to work within the team, but we have also started a lively exchange with other research projects supported by the NRP 75 framework. We now have a working prototype that we have further improved over the past months. Initially, it took several hours to compress genomes of human size; now we can complete the same task in a few minutes, using the same resources. We have learned about new applications and made our database flexible enough to work with sequencing data from any organism and even from populations of organisms (as is important for the field of metagenomics).

What changes does your project bring about?

One of the main contributions of our approach is that it lowers the barriers to genome data analysis. We can only learn from each other if it is not only theoretically possible but also practically feasible to build on data collected by other researchers. With genomics entering the era of Big Data, this has become more and more difficult over the past years, as the amount of data to be transferred has grown too large and often cannot even be stored by small research groups. With our compression strategies, we liberate the data and make it accessible to a wide range of researchers.

What does NRP 75 mean to you?

Without the framework of NRP 75, the team as it works together today would not exist. However, the benefits we enjoy go far beyond the financial contribution. It is especially the exchange with other researchers that opens our eyes to new applications and unsolved problems, and that also brings up important societal implications such as the ethical questions around genomic big data.

What would be missing if your project did not exist?

Genome sequencing data has become a central part of biomedical research. More and more, the collection of DNA and RNA sequences is seen not only as a research instrument; it will soon become a standard tool in the medical repertoire. It is important that we perform the necessary research now, so that this data can be used not only for the immediate benefit of research but also for the ultimate benefit of diagnosis and treatment. Our project contributes an important stepping stone to this development.