Loosely structured data: new tools for integration

Author
Prof. Philippe Cudré-Mauroux
Université de Fribourg

Interview with the principal investigator of this NRP 75 project.

What was the aim of your project?

The aim of this project was to devise new techniques for the automatic or semi-automatic integration of data. Because the data structure is often not defined in advance, the central challenge for this research was to understand it retrospectively, by reconstructing patterns using the available data.

What are the results of the project?

The project resulted in a number of novel next-generation integration algorithms, as well as several deployments over real data. In particular, new techniques were developed to integrate and query microposts, along with new human-in-the-loop methods to analyse them. Significant progress was also made in graph integration and analysis: the project produced new embedding techniques that are one to two orders of magnitude faster than previous approaches, as well as new imputation techniques to improve the quality of the knowledge graphs used for data integration. Approaches for two real use cases were developed: one to analyse and integrate PDFs for the Swiss Federal Archives, and a second to integrate loosely structured data for cancer diagnosis.
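To give a flavour of what embedding-based imputation on a knowledge graph looks like, here is a minimal, purely illustrative sketch in the spirit of translation-based embedding models such as TransE. It is not the project's actual code: the entities, relations, and hand-picked 2-D vectors below are assumptions chosen for illustration (real systems learn high-dimensional embeddings from data).

```python
# Illustrative sketch only: embedding-based imputation on a toy
# knowledge graph. A triple (head, relation, tail) is considered
# plausible when embedding(head) + embedding(relation) lies close
# to embedding(tail), as in TransE-style models.

# Hand-picked 2-D embeddings, for illustration only.
entity_vec = {
    "Fribourg":    (1.0, 0.0),
    "Switzerland": (1.0, 1.0),
    "Bern":        (0.9, 0.1),
}
relation_vec = {
    "located_in": (0.0, 1.0),  # roughly maps a city onto its country
}

def score(head, relation, tail):
    """Distance between head + relation and tail; lower = more plausible."""
    h, r, t = entity_vec[head], relation_vec[relation], entity_vec[tail]
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5

def impute_tail(head, relation, candidates):
    """Suggest the most plausible tail for an incomplete triple (h, r, ?)."""
    return min(candidates, key=lambda t: score(head, relation, t))

# The triple (Bern, located_in, ?) is missing; rank candidate tails.
best = impute_tail("Bern", "located_in", ["Switzerland", "Fribourg"])
print(best)  # → Switzerland
```

In a real deployment the embeddings are learned from the existing triples, and low-scoring candidate triples are either added automatically or handed to a human expert for validation, which is where the human-in-the-loop methods mentioned above come in.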

What are the main messages of the project?

  • The first key message is that we can successfully tackle tough Big Data questions in Switzerland, and successfully deploy next-generation solutions made in Switzerland. Given the poor track record of many Swiss sectors in this domain, this is a message worth broadcasting.
  • The second key message is more technical: dedicated embedding methods, tailored to a specific problem, can be extremely effective at solving vertical tasks.
  • Finally, the third key message relates to our human-in-the-loop approach and highlights the fact that we should find better cooperation modes between experts (e.g., data analysts, medical doctors) and models when deploying AI pipelines.

Does your project have any scientific implications?

The project had significant scientific implications, as summarised above. The main points are as follows: i) tailored data-driven applications can be extremely effective for vertical domains; ii) entities that are lagging behind in terms of digital infrastructure (e.g., federal entities or hospitals) can rapidly adopt new data-driven tools and processes if supported along the way by experts; and iii) current cooperation modes between experts and data-driven models are poor, so more focus should be put on human-in-the-loop approaches and on designing novel methods for human-AI collaboration.

Does your project have any policy recommendations?

In the broader context of the programme and following our own experience running this project, I can make the following policy recommendations:

  • Switzerland should drastically shift its focus in terms of data policies. Today, administrations, doctors and scientists suffer from rigid and heavy procedures when it comes to data collection, processing or sharing. At the opposite end of the spectrum, large companies have free rein and repeatedly abuse users' data without suffering any consequences. Switzerland should follow the EU's lead in this area and be much more stringent about data used by large corporations, instead of adopting a "light GDPR" with almost no sanctions, as it is currently doing.
  • In the context of medical and scientific data, the current ethical, approval and administrative processes (which are exceedingly slow and complex) should be streamlined, simplified, and most importantly *properly digitalised* (all processes are currently based on pointless and impractical forms and on natural language text).
  • Self-determination and sovereignty are two key Swiss values. In terms of data, however, we are totally dependent on large foreign companies that dictate their processes through their own platforms. The fact that the Swiss Confederation chose American and Chinese providers to power its own cloud is particularly telling in this context. Unfortunately, I understand their choice, as Swiss cloud providers lack most of the advanced features of the leading cloud providers. Switzerland should invest massively in this domain to become less reliant on foreign platforms and to develop Swiss-based infrastructures that are technically sensible.

Big Data is a very vague term. Can you explain to us what Big Data means to you?

The project explored the "Variety" angle of Big Data, which is, I believe, the least understood facet of the field. The project significantly contributed to the understanding and advancement of Big Data in science by making fundamental advances and by applying some of those advances to the analysis of rich documents and medical data.
