Genome comparison: faster analysis

Authors
Prof. Nicolas Salamin, Prof. Christophe Dessimoz, Prof. Marc Robinson-Rechavi, Prof. Bastien Chopard
Université de Lausanne, Université de Genève

Interview with the principal investigators of this NRP75 project.

What was the aim of your project “efficient analysis of genomic data”?

The aim of our project was to develop new computational approaches capable of processing genomic data of variable quality to compare the genomes of different organisms. Modelling the interactions between genes with the help of machine learning methods will make it possible to understand, for example, the evolution of groups of genes involved in metabolic processes.

What were the results of the project?

Our project “efficient analysis of genomic data” achieved two main results.

  1. To efficiently and robustly process protein sequences derived from newly sequenced genomes, we developed OMAmer, a novel alignment-free protein family and subfamily classification method suited to phylogenomic databases with thousands of genomes. We also demonstrated the applicability of this approach to real-life problems in comparative genomics. Such datasets are becoming ever more abundant with new large-scale comparative genomics efforts, and we expect OMAmer and derived tools to play an important role in making them useful for answering biological questions (a minimal sketch of the alignment-free idea follows this list).
  2. We made progress in using Big Data to identify subtle signals of co-evolution in biological sequences with cutting-edge artificial intelligence methods. Though this project focused on co-evolution, the machine learning approach we developed can be applied to other questions, such as the detection of selection. The next step is to move in this direction using the Selectome database (a resource for positive selection in vertebrate genomes), which we updated within the scope of this project (a classical baseline for the co-evolution signal is sketched after the list).
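
OMAmer itself works differently in detail (it assigns query proteins to hierarchically organised orthologous groups using k-mer statistics), so the following is only a minimal, hypothetical illustration of the alignment-free idea: represent each family by the set of k-mers of its member sequences, and assign a query to the family with which it shares the most k-mers. None of the names, data, or functions below come from OMAmer's API.

```python
# Minimal, hypothetical sketch of alignment-free k-mer classification.
# This is NOT OMAmer's algorithm; it only illustrates the underlying idea.

def kmers(seq, k=6):
    """Return the set of overlapping k-mers of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(families, k=6):
    """Map each family name to the union of k-mers of its member sequences."""
    return {name: set().union(*(kmers(s, k) for s in seqs))
            for name, seqs in families.items()}

def classify(query, index, k=6):
    """Assign the query to the family sharing the most k-mers with it."""
    q = kmers(query, k)
    return max(index, key=lambda name: len(q & index[name]))

# Toy data: two invented "families" of short protein fragments.
families = {
    "familyA": ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"],
    "familyB": ["MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS"],
}
index = build_index(families)
print(classify("MKTAYIAKQRQISFVK", index))  # -> familyA
```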

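The project used cutting-edge machine-learning methods rather than this simple statistic, but a classical baseline illustrates the kind of signal being sought: the mutual information between two columns of a multiple sequence alignment, which is high when the residues at the two positions co-vary across sequences. The alignment below is a hypothetical toy example.

```python
# Classical baseline for co-evolution detection (illustration only; the
# project used machine-learning methods, not this statistic): mutual
# information between two columns of a multiple sequence alignment.
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """MI in bits between two aligned columns given as equal-length strings."""
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    return sum((c / n) * math.log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

# Toy alignment (hypothetical): columns 0 and 2 co-vary perfectly (A<->W,
# C<->Y), while column 1 varies independently of both.
alignment = ["AKW", "CLY", "AMW", "CKY", "ALW"]
cols = ["".join(seq[k] for seq in alignment) for k in range(3)]
print(mutual_information(cols[0], cols[2]))  # high: ~0.97 bits
print(mutual_information(cols[0], cols[1]))  # lower: ~0.17 bits
```
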
What are the main messages of the project?

  • Efficient algorithms can take advantage of the increasing amount of genomic data available for non-model organisms to better understand some of the key questions in evolutionary biology.
  • Machine learning techniques are increasingly used to cope with the large-scale data produced by new genome-sequencing technologies. Scientists should ensure that these algorithms are correctly tailored to the specific questions being asked; such developments will have an important impact on the field of evolutionary biology.
  • Our project identified important concepts, in terms of algorithms to develop and novel approaches to pursue, and created computational resources that will provide new directions for future work to fully encompass the very large amounts of data that will become available for non-model organisms in the coming decade and beyond.

Does your project have any scientific implications?

Extending support for non-model organisms in the current efforts to generate and make sense of genomic data is essential. Although advances in human genetics have produced new tools and resources that are taken up by the research community at large, there are important differences between the types of data gathered for human studies and for other organisms. Yet these organisms include the targets of conservation biology, the new models for diseases or environmental problems, and more generally all of biodiversity. Computational methods are a key element in enabling an efficient transfer of knowledge between these different fields of research, and our project has outlined some promising strategies for using existing large-scale data to harness the information that will become available from the huge efforts to sequence the biodiversity on Earth at large.

Do you have policy recommendations?

Covid-19 has shown the importance of rapid and accurate evolutionary analyses of very large datasets. Biodiversity genomics is going to pose even larger computational challenges. We recommend that Switzerland support the biodiversity genomics (ERGA) and pathogen genomics initiatives, including with computational methods adapted to making sense of very large comparative datasets, and with the corresponding IT infrastructure.

Big Data is a very vague term. Can you explain to us what Big Data means to you?

Biology shares with some social sciences, and even with internet companies, the need to make sense of data that is both very voluminous and generated from a variety of experimental procedures for a variety of aims, unlike, for example, the big data of high-energy physics. We have shown how a backbone of high-quality information can be used to order, and efficiently make sense of, this volume of lower-quality data.
