Bio-SODA: An intuitive search function for bioinformatics databases

Bioinformatics databases: queries in natural language

Author
Prof. Kurt Stockinger
Zurich University of Applied Sciences, University of Lausanne and SIB Swiss Institute of Bioinformatics

Interview with the principal investigator of this NRP75 project.

What was the aim of your project?

Rapid advances in DNA sequencing are transforming the biosciences into a highly data-intensive discipline. Vast quantities of bioinformatics data are stored in complex databases that are built on powerful technologies but demand a great deal of information technology expertise when it comes to retrieval. New search technologies are needed to efficiently analyse dozens of bioinformatics databases. The aim of our project was to develop novel Google-like search options that allow researchers to query databases intuitively and concentrate on scientific questions.

Results?

Before the Bio-SODA project started, accessing the major bioinformatics databases required end users to be proficient in the query language SPARQL and to know the underlying structure of the databases. Since most end users did not have these skills, they either could not query the wealth of information sources effectively or needed help from a few specialists to access their data. This process was both time-consuming and inefficient, since researchers’ precious time was spent on data wrangling rather than on scientific research.

The Bio-SODA project successfully laid the foundations for applying the developed system and the research approach well beyond life sciences. For instance, Bio-SODA is now also applied in the project INODE – Intelligent Open Data Exploration (www.inode-project.eu) – funded by the European Union’s Horizon 2020 programme. The goal of Bio-SODA in INODE is to enable natural language querying across datasets from three different scientific domains, namely Cancer Biomarker Research, Research and Innovation Policy Making, and Astrophysics.

Three main messages

1) Digitalisation efforts across all fields of knowledge have advanced rapidly in recent years. However, to achieve the full potential of digitalisation – empowering domain experts to routinely extract insights and scientific findings from big data – we have to improve data sharing and integration, and provide user-friendly interfaces for querying this data.

2) The Bio-SODA project has shown how bioinformatics datasets from traditionally disconnected fields of comparative genomics can be made interoperable. The project illustrated through real-world use cases the benefits of data integration in enabling more powerful semantic queries than previously possible.

3) Bio-SODA made a significant contribution towards talking to databases almost as one would to a human, by enabling intuitive natural language access to complex bioinformatics databases – while also highlighting the considerable potential for further improvement when it comes to performing complex queries across multiple resources.

Do you have policy recommendations?

Governments, funding agencies, journals and conferences need to provide stronger incentives for applied, interdisciplinary research that goes well beyond theoretical research. Only by building solid research prototypes that are vetted in real-world settings can technology transfer to industry be successful.

One concrete recommendation is to place a stronger emphasis on funding applied research and to add a dedicated track for applied research or experience papers to the major conferences and journals. Moreover, there should be a strong incentive to make source code available so that the research community can more easily reproduce research results. Access to the source code can also foster technology transfer and industry collaboration.

Big Data is a very vague term. Can you explain to us what Big Data means to you?

Big Data is often defined by the three Vs, namely volume (large data), velocity (fast data) and variety (heterogeneous data). The first two Vs have already been sufficiently addressed in both academia and industry by building scalable systems that leverage modern processor technologies at large scale. However, the third V, namely variety, is still a largely unsolved issue – and it is also the main focus of Bio-SODA.

The reason is that integrating and querying different types of data sets with heterogeneous ontologies is very difficult to automate, since basically every data integration problem is slightly different. Hence, there is no easy way of training machine learning systems to perform this task automatically. As also pointed out by Turing Award winner Michael Stonebraker, these challenges, which are often significantly underestimated by academia, are the “800-pound gorilla in the corner” and require a joint effort by academia and industry.

To tackle this problem, new algorithms need to be developed that can learn from small amounts of training data or even from no training data at all. Transfer learning or self-supervised learning could be a promising avenue. However, unlike in image processing, where large amounts of benchmark data exist, there is no such large-scale benchmark for real-world data integration and data cleaning problems.

Moreover, new algorithms would need to bring the human into the loop to bootstrap the problem and improve iteratively over time. The major challenge is how to minimise the time a human needs to kickstart an algorithm and provide the right information, so that the algorithm can learn with the help of humans. In short, a combination of artificial and human intelligence is required – both for data integration and for question answering and natural language understanding.

We believe that Bio-SODA made an important contribution in tackling the most important challenge of the 3 Vs in Big Data – namely variety.
