The road ahead – an outlook

Society must anticipate the potentially disruptive changes that big data and machine learning applications can bring about. An overview of key opportunities and challenges that lie ahead is given below.

The main achievement of the National Research Programme “Big Data” (NRP 75) has been to reinforce Swiss capabilities in the technologies, applications and societal aspects of big data. NRP 75 has advanced technologies that underpin big data infrastructure and has brought together big data scientists and domain experts to realise specific applications. It has also raised awareness of the societal challenges that accompany large-scale data production and analysis, and has helped develop a “big data culture” to reap the benefits of big data responsibly.

The programme’s 37 funded projects covered only a subset of this rapidly expanding field. This chapter goes beyond them to provide a more general view of the opportunities and risks that big data presents, particularly those that could become more prominent in the coming years. The following analysis is based on the knowledge gained from NRP 75’s research and on the collective insights of the programme’s Steering Committee members. It addresses both the prospects of greater use of big data in industry and the public sector, and the challenges of sustainability, privacy and accountability.

Applications impact more domains

Many more applications of big data are expected to be developed and deployed in the coming years. New private sectors – beyond e-commerce – and public administrations will become data-ready in the hope of staying competitive, by developing new capabilities and reducing costs while increasing efficiency. As shown by NRP 75 research projects, developing real-world applications requires the right combination of expertise across several domains. It needs robust data strategies, including privacy-by-design approaches, analytic know-how among sector experts, and knowledge throughout the workforce. A crucial ingredient is the availability of data scientists who understand the relevant application domain and of domain specialists familiar with data science. This underlines the importance of equipping new – and older – generations with the knowledge and tools needed to tackle big data applications.

Here follows a selection of domains that could be strongly affected by big data applications.

Production: improving output, optimising maintenance

Many manufactured products include integrated sensors that can be connected to the Internet of Things (IoT). Products can send performance information in real time, allowing manufacturers to identify components that need replacing or improving, and to boost safety and customer satisfaction.

In agriculture, autonomous robotic systems use image recognition to remove weeds, detect diseases and pests, harvest fruit, apply fertiliser locally, and monitor whole fields with drones. Such robots could help reduce labour shortages, lower fertiliser use and avoid pesticides.

State: improving infrastructures and supporting the energy transition

Governments can use big data to practise evidence-based policy making, such as allocating resources, carrying out strategic planning or monitoring public infrastructure (conclusion 5). Analysis of big data can improve transport planning (Optimising transport management); manage traffic congestion; improve the planning, construction and operation of utilities providing water, electricity and lighting, for example; and support environmental monitoring (Soil erosion, Flood detection). Sophisticated analytics will help reduce our carbon footprint by ensuring flexible energy supply, storage and distribution, and in particular enable electricity grids to handle decentralised and intermittent renewable sources of energy such as solar panels or wind turbines (Renewable energy potential).

Services: automating actions in finance and cybersecurity

Financial institutions can use real-time transaction analytics and market predictions for fast automated trading, which needs efficient infrastructures (Fast prediction algorithms, Graph analytics and mining). Quantifying individual risks allows insurance companies to better tailor their policies, but potentially threatens the principle of solidarity that underlies insurance (Big data in insurance). Tracking systems in vehicles and elsewhere could reward risk-mitigating behaviour, thus putting the emphasis on risk prevention rather than risk protection.

Analytics can help prevent cyberattacks by looking for anomalies in real-time data transfers and then automatically blocking threats (Data streams). Image recognition can be used to automatically detect physical security breaches and other irregularities.

Healthcare: assisting physicians and personalising medicine

It is widely expected that machine learning will significantly improve healthcare (conclusion 4); it has already been used to identify anomalies in clinical imaging. New technologies could enable major progress in prevention, diagnosis and targeted therapies by bringing together huge datasets from laboratory testing, health records and genetics. In particular, advanced natural language processing (Language models) allows information to be automatically extracted and interpreted from unstructured texts in health records. By integrating data streams from various clinical devices in real time, it is also possible to monitor patients’ health and detect emergencies (Intensive care units).

However, using big data for medical applications requires a certain infrastructure. In particular, innovative methods are needed to extract reliable results from small subsets of data – given that a single patient can generate terabytes’ worth. For genomic data, this can be done through adequate pre-processing (Big genetic data).

E-commerce and entertainment: consumer involvement and synthetic art

Collecting, analysing and exploiting customer information will likely play an increasing role in e-commerce. Online companies already use personalised recommendations and trend predictions, but new data-driven applications will probably incorporate customer expectations into the process of product design itself.

Language models are improving very quickly, getting better at comprehending meaning, intention and context, at extracting valuable information from text, and at generating synthetic reports or chatbot conversations. Algorithms can generate music in the styles of specific composers. From text prompts, computers can create convincing synthetic images and videos, and software is soon expected to generate films indistinguishable from the real thing – complete with natural-looking people and settings. Such systems may complement or replace current media and entertainment products, but they also pose major challenges for democracy, by enabling realistic computer-generated image, audio and video hoaxes, and for intellectual property.

Open research: accelerating discoveries

Scientists are increasingly making their datasets available to others for free, in order to accelerate discovery and enhance reproducibility (conclusion 6). But like any other data repository, those supporting open research must comply with certain standards – such as the “FAIR principles” of findability, accessibility, interoperability, and reusability. These call for machine-readable, standardised metadata containing the necessary explanations and descriptions – all part of a new paradigm that the academic world must adapt to (Big data: open data and legal strings).
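
As an illustration, machine-readable metadata in the spirit of the FAIR principles might look like the sketch below, loosely modelled on the schema.org “Dataset” vocabulary; all field values are invented for the example.

```python
import json

# Minimal machine-readable dataset description in schema.org "Dataset" style;
# every value here is a fictitious placeholder.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example sensor measurements",
    "description": "Hourly temperature readings from a test deployment.",
    "identifier": "https://doi.org/10.0000/example",  # placeholder DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Lab"},
    "keywords": ["temperature", "sensors", "time series"],
}

print(json.dumps(metadata, indent=2))  # standardised, machine-readable output
```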

Decreasing the footprint of big data

While big data will certainly play an important role in tackling climate change and reducing our carbon footprint, it also contributes to the problem. Storing and processing large datasets consumes substantial energy: data centres accounted for 3.6% of Switzerland’s total electricity consumption in 2019, a rise of 30% in six years.

Managing big data is more than mere collection and storage; the data must also be protected from unauthorised access, corruption and loss. This requires access control, backup protocols and solutions to correct damaged, incomplete or inaccurate data. Databases must be preserved by being continuously adapted to new standards of storage, compression and analysis. This requires the work of data and domain experts, and adds to the costs of big data applications. Frugal, or lightweight, artificial intelligence aims to reduce energy consumption, for example by working with smaller datasets or by using resource-saving synthetic training data. This new and growing field calls for further research effort (Coresets).
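
As a loose illustration of the frugal-AI idea, the sketch below builds a very simple coreset by uniform sampling: a small weighted sample that can stand in for the full dataset in aggregate computations. Real coreset constructions rely on importance sampling with provable guarantees; the function name and toy check here are purely illustrative.

```python
import numpy as np

def uniform_coreset(points, m, seed=None):
    """Draw m points uniformly at random; each carries weight n/m so that
    weighted sums over the sample estimate sums over the full dataset."""
    rng = np.random.default_rng(seed)
    n = len(points)
    idx = rng.choice(n, size=m, replace=False)
    return points[idx], np.full(m, n / m)

# Toy check: a 1% weighted sample approximates the full-data mean
data = np.random.default_rng(0).normal(size=(100_000, 3))
sample, weights = uniform_coreset(data, m=1_000, seed=1)
approx_mean = (weights[:, None] * sample).sum(axis=0) / weights.sum()
print(approx_mean, data.mean(axis=0))
```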

Balancing privacy

Numerous big data applications, such as those used in finance, engineering or environmental monitoring, do not raise new questions about privacy, as they do not use personal information. But many other applications do, and the ever-growing amount of data they collect about individuals raises ethical and legal concerns. Users generally have little idea of what data of theirs is being collected, who can access it and to what end. The fact that online service providers control these matters has given rise to the notions of digital divide and digital asymmetry.

Although providers are currently obliged to notify users and ask for consent when they collect data, these steps are not enough to protect privacy, since most users agree without further thought and with little knowledge of the consequences. The main problem is that users shoulder the burden of understanding the implications of their consent, even though they gain no immediate benefit from the data collection. Authorities will have to decide how much to regulate this practice (conclusion 8).

Complete anonymisation is often unattainable

Until recently, it was considered safe to share data that contained information on individuals once it had been anonymised – by removing information that could directly identify someone, such as their name, birth date and address. However, it has become increasingly clear that linking data from different sources, even anonymised ones, can enable individuals to be re-identified. Certain types of data, such as whole genomes or a smartphone’s GPS traces, contain such a high level of sensitive personal information that absolute anonymisation is not realistic. The release of data with personally identifiable information removed must therefore be considered a continuum, requiring that the privacy lost be balanced against the value created on a case-by-case basis.

Several approaches can hinder re-identification. Differential privacy, for example, obfuscates data by adding random noise, but at the expense of accuracy (Stream analytics). Another option is to suppress certain data points or combine them into broader categories, as is done in so-called k-anonymity.
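
To make the two techniques concrete, here is a minimal sketch – with toy data and illustrative names – of a differentially private count using the Laplace mechanism, followed by a k-anonymity-style generalisation of exact ages into bands.

```python
import numpy as np

def dp_count(values, predicate, epsilon, seed=None):
    """Differentially private counting query via the Laplace mechanism.
    A count has sensitivity 1 (one person changes it by at most 1),
    so adding noise drawn from Laplace(0, 1/epsilon) suffices."""
    rng = np.random.default_rng(seed)
    true_count = sum(predicate(v) for v in values)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 41, 29, 57, 62, 38, 45]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5)  # smaller epsilon: more noise, more privacy

# k-anonymity-style generalisation: replace exact ages with broad bands
bands = [f"{10 * (a // 10)}-{10 * (a // 10) + 9}" for a in ages]  # e.g. 34 -> "30-39"
```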

Analysing data without accessing it

Sensitive data can be stored in enclaves with sophisticated access control. This ensures that only local analyses can be performed, and that only aggregated results, which protect privacy, are sent outside the enclaves. Another option being developed is federated analytics, where the data is kept in multiple local systems and never shared. The computations, including the training of machine-learning algorithms, are performed locally and collaboratively. Here too, the only things shared are partial and aggregated results or intermediate model updates, while the original data is never distributed. This helps solve difficult issues of cross-border data transfer, which require legal solutions at the international level (conclusion 9). Research teams developing applications for big data should consider the ethical and legal framework of data processing early on (conclusion 2).
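
The sketch below shows the core loop of one common variant, federated averaging, under simplifying assumptions (a logistic-regression model, synthetic client data, illustrative function names): each client trains locally, and only the resulting parameter vectors are aggregated centrally.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    """One client's contribution: gradient-descent steps for logistic
    regression on data that never leaves the client."""
    w = global_w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)   # gradient step
    return w

def federated_round(global_w, clients):
    """One round of federated averaging: only parameter vectors are
    shared, weighted by each client's data size; raw data stays local."""
    updates = [(local_update(global_w, X, y), len(y)) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Toy run with four synthetic clients
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.integers(0, 2, 50)) for _ in range(4)]
w = np.zeros(3)
for _ in range(10):
    w = federated_round(w, clients)
```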

Making algorithms accountable

Big data applications often employ machine-learning algorithms that make predictions based on models trained on data. While these algorithms are often very good at predicting, it is frequently unclear exactly how they arrive at their predictions.

The risk of discrimination

Normal software follows a strict series of instructions that have been (largely) designed by humans. Programmers and testers can, in principle, guarantee that it works as expected. But the situation is different with many machine-learning algorithms: their results are based on models with huge numbers of parameters, whose values are generated automatically from training data. Their behaviour does not follow human-coded rules.

This makes it difficult to work out whether such results comply with established ethical standards or if they might, for example, discriminate against certain population groups. This may happen if the training data is non-representative, biased, outdated or erroneous, which can be the case when using data from the web. Machine learning models are dependent on training data, so their results may reproduce biases within that data. For example, removing the parameter “gender” from the training data might not prevent discriminatory results, since a trained model might use proxies from other inputs to recreate the gender category. Such behaviour can escape detection during early testing but emerge later on.
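
A synthetic illustration of this proxy effect (all data and variable names below are invented for the example): gender is withheld from the training features, yet a correlated “proxy” feature lets a logistic-regression model reproduce the historical bias.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
gender = rng.integers(0, 2, n)           # protected attribute, withheld from training
proxy = gender + rng.normal(0, 0.3, n)   # e.g. a job-category score correlated with gender
skill = rng.normal(size=n)               # a legitimate, unrelated feature
# Historical labels encode a bias against the gender == 1 group
label = (skill - 1.5 * gender + rng.normal(0, 0.5, n) > 0).astype(int)

# Train WITHOUT the gender column - the proxy still leaks it
X = np.column_stack([proxy, skill])
model = LogisticRegression().fit(X, label)
for g in (0, 1):
    rate = model.predict(X[gender == g]).mean()
    print(f"positive-prediction rate for group {g}: {rate:.2f}")
```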

Understanding machine learning

As already stated, results produced by deep neural networks and other machine learning techniques can be very difficult for humans to understand because the billions of trainable parameters that make up their models obfuscate the mechanisms leading to particular results. There is currently no accepted solution to fully overcome this “black box” problem of artificial intelligence.

Theorists are trying to better understand these automated systems in order to improve the explainability and traceability of their decisions. These are crucial aims in demonstrating that algorithms are non-discriminatory, accountable and trustworthy.

A typical person or company affected by a potentially biased algorithm has neither the knowledge nor the ability to convincingly argue that the system has made a mistake or discriminated against them. One possibility is to reverse the burden of proof, so that the entity responsible for the system has to demonstrate that it behaves correctly. This could involve a certification process developed by a public or private organisation (conclusion 3), and in particular the deliberate alteration of test datasets to see whether the output complies with ethical regulations.
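
One simple building block of such a certification process might look like the probe below – a hedged sketch with an invented function name; real audits are far more elaborate. The sensitive attribute is deliberately set to each of two values across the test data, and the share of decisions that change is measured. Note that this detects only direct use of the attribute, not proxy effects.

```python
import numpy as np

def flip_test(model, X, attr_col, values=(0, 1)):
    """Set a sensitive attribute to each of two values for every test
    record and count how many predictions flip; a large share suggests
    the attribute directly drives the model's output."""
    X_a, X_b = X.copy(), X.copy()
    X_a[:, attr_col] = values[0]
    X_b[:, attr_col] = values[1]
    changed = model.predict(X_a) != model.predict(X_b)
    return changed.mean()
```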

Who is responsible for the algorithms?

The rapid advance of machine learning raises the question of liability, as widely discussed for self-driving vehicles. Who, in that case, should be held responsible for an accident? The owner of the vehicle? The manufacturer? Nobody? This is an evolving area of law and policy, and there is currently no agreement on the answers to such questions. While manufacturers must design their cars to minimise risks in typical driving situations, they cannot foresee all possible circumstances. It is essential to define responsibilities precisely so that legal uncertainty does not obstruct innovation.

Towards new regulation

The law often lags behind rapid advances in machine learning and ever-expanding data collection. Legislation has so far focused on individual rights and avoiding negative effects on individuals, rather than on impacts on society as a whole.

The EU is currently drafting an act to regulate AI applications. It would ban applications considered unacceptably risky, such as manipulative algorithms or social scoring systems, while restricting those considered high-risk, such as systems managing critical infrastructure or used in law enforcement. China has also formulated an ethics policy for AI, favouring social security over individual rights. This policy excludes the public sector, which remains free to carry out facial recognition and social profiling.

The rapid evolution of technology, driven largely by international companies, poses a difficult problem for the law. Switzerland should proactively draft legislation (conclusion 7), ensuring that rules can be applied concretely and that compliance is monitored.