Big genetic data: powerful indexing

Authors
Prof. Gunnar Rätsch and Dr. Andre Kahles
ETH Zurich

Interview with the researchers of this NRP75 project.

What was the aim of your project?

The main goal of our project was to provide technical solutions to the problem of accessing and working with the ever-growing amount of biological sequencing data accumulating in public and controlled-access repositories.

What are the results of the project?

Directly addressing the problem of inaccessibility of sequences in large public archives, we developed a modular software framework (MetaGraph) that can index arbitrary sequencing data at petabase scale. Using this framework, we computed and made publicly available an index comprising more than four million sequencing samples. We not only share the pre-computed indexes with the general public but have also developed an interactive platform to query these indexes directly via a web interface (https://metagraph.ethz.ch/search) or a publicly accessible API (https://metagraph.ethz.ch/static/docs/api.html).
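
As a rough illustration of how such an index can be queried programmatically, the sketch below posts a FASTA-formatted sequence to the public MetaGraph service. The endpoint path and parameter names used here are placeholders chosen for illustration only; the authoritative interface is described in the API documentation linked above.

    # Minimal sketch of a programmatic query against the public MetaGraph service.
    # NOTE: the endpoint path and parameter names below are illustrative placeholders;
    # consult https://metagraph.ethz.ch/static/docs/api.html for the actual interface.
    import json
    import urllib.request

    SERVER = "https://metagraph.ethz.ch"
    ENDPOINT = "/api/search"  # placeholder path; check the API docs for the real one

    query = {
        "FASTA": ">query\nACGTACGTACGTACGTACGTACGTACGT",  # sequence to look up
        "discovery_fraction": 0.7,  # placeholder name for a minimum-match parameter
    }

    request = urllib.request.Request(
        SERVER + ENDPOINT,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    with urllib.request.urlopen(request) as response:
        # A typical response lists matching samples and how much of the query they cover.
        print(json.dumps(json.loads(response.read()), indent=2))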

Of all the contributions of our project, the following highlights are, from our point of view, the most influential and most relevant to the field of biomedical data science:

  1. We have indexed and made publicly accessible more than 3 petabases of high-throughput sequencing data, allowing, for the first time, a full-text search over large parts of the contents of NCBI’s Sequence Read Archive (SRA). We envision this contribution as a stepping stone toward the continuous and complete indexing of all generated sequencing data.
  2. Instead of providing only pre-computed outcomes, we have developed and carefully documented a fully modular software framework for the indexing and analysis of petabase-scale sequencing data. The MetaGraph framework allows users not only to replicate our experiments but also to apply the same methodology to their own data, to put their findings into context through interactive analyses, or to extend the framework’s functionality to meet their own needs.
  3. Many MetaGraph features rest on our theoretical contributions to the field of sequence bioinformatics. In particular, our work on the compression of sparse binary matrices, on concepts for multi-colouring de Bruijn graphs, on alignment-free distance estimation using approximate methods, and on sequence-to-graph alignment represents contributions for which we also envision more general applications outside the field of biomedical data science (a toy illustration of a coloured de Bruijn graph follows this list).
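
To make the notion of a multi-coloured de Bruijn graph more concrete, the toy sketch below builds a small k-mer graph from two short "samples" and annotates every k-mer with the set of samples ("colours") it occurs in. This is a deliberately simplified illustration of the underlying idea, not the MetaGraph implementation, which relies on succinct graph representations and compressed annotation matrices to scale to petabases.

    from collections import defaultdict

    K = 4  # k-mer length (toy value; real indexes typically use a much larger k)

    # Toy "samples": each sample is reduced to a single short sequence here.
    samples = {
        "sample_A": "ACGTACGGA",
        "sample_B": "ACGTACGTT",
    }

    # Record, for every k-mer, the set of samples ("colours") it occurs in.
    kmer_colours = defaultdict(set)
    for name, seq in samples.items():
        for i in range(len(seq) - K + 1):
            kmer_colours[seq[i:i + K]].add(name)

    # Edges of the de Bruijn graph: each k-mer links its (k-1)-prefix to its (k-1)-suffix.
    for kmer, colours in sorted(kmer_colours.items()):
        print(f"{kmer[:-1]} -> {kmer[1:]}  colours: {sorted(colours)}")

    # Querying: a sequence is matched by looking up its k-mers; the colour sets of the
    # matching k-mers reveal which samples contain it.
    query = "GTACGT"
    hits = [q for q in (query[i:i + K] for i in range(len(query) - K + 1))
            if q in kmer_colours]
    print("query k-mers found in the index:", hits)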

What are the main messages of the project?

  • The potential of Big Data (in a biomedical context) can only be used effectively if the data adheres to compatible formats, is valid and low in errors, is semantically interoperable, can be efficiently shared and computed on, and is constantly maintained and improved. It is important to stop seeing data repositories as final storage and to start appreciating them as a living scientific resource. For this, data needs to be FAIR: findable, accessible, interoperable, and reusable. The MetaGraph framework developed in this project addresses these needs and provides a general solution towards making big sequence data FAIR.
  • The rapid growth of global sequencing capacity generates petabytes of new data every month. Our project addressed this problem and provided sophisticated algorithms and succinct data structures to efficiently compress and search sequence collections at petabase scale. By reducing redundancies, these methods achieve compression ratios of up to 1,000-fold, making the resulting data not only more accessible but also much more cost-efficient to store (a toy illustration of such redundancy reduction follows this list).
  • Developing strategies for the search and representation of biomedical sequencing data is necessary and timely. Had the results of this project been available prior to the COVID-19 pandemic, they would have added a powerful tool to the portfolio of sequence analysis methods, helping to identify genomic lineages and making genomic surveillance data searchable and accessible.
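
As a back-of-the-envelope illustration of where such compression ratios come from, the toy sketch below simulates a highly redundant set of sequencing reads drawn from a single short reference and compares the size of the raw concatenated reads with the size of the set of distinct 31-mers they contain. The numbers are purely illustrative; MetaGraph’s actual compression combines a succinct graph representation with dedicated schemes for sparse binary annotation matrices.

    import random
    import zlib

    random.seed(0)

    # Simulate redundant sequencing data: 5,000 reads of length 100 drawn from one
    # 1,000-base reference, so the reads overlap heavily and repeat the same content.
    reference = "".join(random.choice("ACGT") for _ in range(1_000))
    reads = [reference[start:start + 100]
             for start in (random.randrange(0, 900) for _ in range(5_000))]

    raw = "".join(reads).encode()  # naive storage: concatenate every read

    # Redundancy reduction (toy model): keep each distinct 31-mer exactly once.
    distinct_kmers = {read[i:i + 31] for read in reads for i in range(len(read) - 30)}
    dedup = "".join(sorted(distinct_kmers)).encode()

    print("raw reads:            ", len(raw), "bytes")
    print("distinct 31-mers only:", len(dedup), "bytes")
    print("gzip of raw reads:    ", len(zlib.compress(raw)), "bytes")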

Does your project have any scientific implications?

The amount of biomedical sequencing data will continue to grow at an exponential rate. Very soon, not all data that is measured will be storable for future use. We therefore recommend strengthening research on the abstraction of sequencing data as well as on streaming and online methods for relevance detection. These can be compression approaches, but also probabilistic methods that identify relevant data in a given stream with very high probability. This, of course, assumes that the fraction of relevant data among all measured data is sufficiently small.
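
As one hedged sketch of what such a streaming approach could look like, the example below uses a small Bloom filter as a probabilistic memory of previously seen k-mers and keeps only those reads whose k-mers are mostly new. "Mostly new" is just one possible proxy for relevance, chosen here for illustration; real systems would use more refined criteria and carefully tuned false-positive rates.

    import hashlib

    class BloomFilter:
        """Tiny Bloom filter: probabilistic set membership with no false negatives."""

        def __init__(self, n_bits=1 << 20, n_hashes=3):
            self.bits = bytearray(n_bits // 8)
            self.n_bits = n_bits
            self.n_hashes = n_hashes

        def _positions(self, item):
            for i in range(self.n_hashes):
                digest = hashlib.blake2b(item.encode(), digest_size=8,
                                         salt=bytes([i])).digest()
                yield int.from_bytes(digest, "big") % self.n_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def keep_if_novel(read, seen, k=21, novelty_threshold=0.5):
        """Keep a read if at least `novelty_threshold` of its k-mers were not seen before."""
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        novel = sum(kmer not in seen for kmer in kmers)
        if kmers and novel / len(kmers) >= novelty_threshold:
            for kmer in kmers:
                seen.add(kmer)
            return True
        return False

    seen = BloomFilter()
    stream = ["ACGT" * 10, "ACGT" * 10, "TTTTGGGGCCCCAAAA" * 3]  # toy read stream
    kept = [read for read in stream if keep_if_novel(read, seen)]
    print(f"kept {len(kept)} of {len(stream)} reads")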

Does your project have any policy recommendations?

Working with Big Data differs from the traditional concept of scientific data use. In particular, working with data at the scale of several petabytes requires a disproportionately large amount of engineering and infrastructure work. Future funding schemes should account for this by providing sufficient resources to integrate data and software engineers into project plans. If doctoral or postdoctoral researchers alone have to carry this load, the research outcomes are likely to be much poorer than in a setting where specialised personnel handle these tasks.

Resonating with similar points made in other sections, we suggest extending existing funding opportunities, creating new ones, and establishing new incentives for work on the (automated) curation and maintenance of existing data collections. Only with complete and trustworthy metadata and the ability to continuously curate existing data elements will (public) data repositories be able to unfold their true potential for scientific research. This explicitly includes funding efforts to move existing data held at institutions into domains accessible to the research community (or the public), through data curation, formatting, coding, standardisation, and other measures.
