Uncertainty in big data applications: lessons from climate simulations

Author
Prof. Reto Knutti
ETH Zurich

Interview with the principal investigator of this NRP 75 project.

What was the aim of your project?

The goals of the project were to produce

  • a prototype of a climate-impact model using Big Data approaches, in order to study the potential and limitations of such methods and to quantify their uncertainty for current events and trends in extreme weather and their impacts,
  • a typology of the uncertainties and underlying arguments,
  • criteria for the transferability of the results to other scientific fields.

What are the results of the project?

Several pioneering studies in this project have explored both the conceptual and practical opportunities and challenges of using Big Data tools and data of unknown quality in climate modelling and impact studies, paving the way for future applications. With the availability of much more data and computing capacity, this field is growing quickly, but important questions remain on how to combine Big Data with process understanding and how to make progress on interpretable machine learning methods. In cases where repeated verification is not possible, establishing confidence is challenging and often relies on understanding the relevant processes and drivers. Machine learning methods are inherently limited in this respect, yet they are powerful at extracting patterns of information that would otherwise be inaccessible. The key will be to combine the best of both worlds, and this project has laid important groundwork for that.

What are the main messages of the project?

  • Approaches related to Big Data and data science, such as new forms and sources of data and modelling approaches relying on machine learning, are most successfully applied when they are combined with more traditional scientific approaches and evaluated against domain-specific background knowledge, i.e. process understanding. This can also be read as a call for more interdisciplinary research, combining the subject-level expertise of, for example, climate scientists with the technical expertise of data scientists.
  • Whether new forms and sources of data or modelling approaches based on machine learning are useful for a specific question hardly ever has a clear “yes” or “no” answer. Rather, data and modelling approaches are best evaluated in terms of their fitness-for-purpose. “Fitness-for-purpose” should be understood as a gradual and multidimensional concept. This means that an approach can be fit-for-purpose to a larger or smaller extent (rather than being outright fit-for-purpose or not), and the context determines what degree of fitness-for-purpose is required. Furthermore, there are usually multiple dimensions relevant for determining the fitness-for-purpose of an approach (e.g. representational accuracy, ease of use, computational power, economic costs, …), and which dimensions are relevant and how they are weighted against each other also depends on the context.
  • Big Data adds new tools to the researchers’ toolbox for addressing certain questions. The changes brought about by this development are, however, gradual and are not expected to lead to completely new methods and approaches. In this regard, most novel data-science approaches are deeply rooted in statistics and computer science. Despite the new possibility of applying off-the-shelf, easy-to-use machine learning toolboxes in (climate) research, it remains crucial that traditional statistical expertise (in particular regarding the limitations and assumptions of methods) and background knowledge are continuously trained, taught and reinforced, as these are key to the interpretation, and ultimately the usefulness, of data-science approaches for the purpose of understanding.

Does your project have any scientific implications?

  • “Big Data” should not be thought of as an all-or-nothing development, but as a range of tools that can fruitfully be employed by researchers for specific questions and problems.
  • Contrary to early beliefs, Big Data will not mean the end of theory in science. Rather, the work in this project showed that Big Data elements are most fruitfully applied in scientific research when they are combined with and embedded in theory-based approaches.
  • With the increasing availability of new forms of data, data science skills become increasingly relevant for research. Because of the importance of domain-specific background knowledge, interdisciplinary collaborations between domain experts and data scientists will become ever more relevant.
  • Uncertainties in using new forms of data and data science skills need to be analysed on an appropriate conceptual basis and assessed in relation to the purpose of their use.

Does your project have any policy recommendations?

  • As our results indicate, data science skills and interdisciplinary collaborations will become increasingly relevant for scientific research. This has direct implications for science policy: funding instruments that allow for such interdisciplinary collaborations are important. Furthermore, data science skills can also be provided by research infrastructures such as the Swiss Data Science Center. Adequate long-term funding should therefore be provided to research infrastructures in order to help researchers incorporate data science skills into their research projects.
  • The work on the uncertainty of Big Data predictions has highlighted that a proper uncertainty assessment requires a thorough understanding of the target system at hand, as well as of the modelling technique and the data employed. With the increasing use of decision algorithms in society (e.g. predictive policing), this uncertainty can have ethical implications. The uncertainties of such predictions should therefore be assessed before these tools are applied in society. Although the tools developed in this project were aimed at cases in climate research, they provide a good starting point for such analyses in other settings as well.

Big Data is a very vague term. Can you explain to us what Big Data means to you?

The lack of a clear-cut definition of the term “Big Data” was the starting point of the project. To be able to tackle the questions at the heart of the project, the term “Big Data” itself needed to be clarified. To this end, we developed a conceptual framework that distinguishes between three components: the measurements, the datasets, and the models. We showed that constructing and using theory-based models and Big Data analysis differ with respect to all three of these components.

We then used this framework with its three components to categorise case studies from climate research. In doing so, we were able to show that some Big Data elements are employed by researchers quite frequently, but that most of the categorised case studies lie in between the two “extreme” cases described above, i.e. purely theory-based modelling and full-blown Big Data analysis. For example, we identified many studies that use classical datasets from climate science, i.e. fixed datasets with measurements of theory-based variables, and analyse them with machine learning. In these approaches, the modelling is based on automatically detected correlations, but the measurements and data are still derived from theory.
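
To make this intermediate category concrete, here is a minimal sketch (illustrative only, not code from the project, and with synthetic placeholder data): a fixed dataset of theory-based climate variables is analysed with an off-the-shelf machine-learning regressor that picks up the correlations automatically.

```python
# Minimal sketch: machine learning applied to a fixed, theory-based dataset.
# The data are synthetic placeholders; in a real study they would be observations
# or model output of theory-based variables (temperature, precipitation, ...).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Placeholder predictors, e.g. summer mean temperature, precipitation, soil moisture.
X = rng.normal(size=(1000, 3))
# Placeholder impact variable, correlated with the predictors plus noise.
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model detects the correlations automatically; no process knowledge enters here.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("held-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```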

In other cases, social scientists used the frequency of Google search results as a proxy for missing variables in order to create indicators of the vulnerability of European cities to extreme heat. While the indicator itself was based on theory, some of the measurements were based on everyday intuitions.
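
Purely as an illustration of how such a proxy can enter an otherwise theory-based indicator, the following sketch combines hypothetical city-level variables with a normalised search-frequency proxy; the variable names, weights, and numbers are invented for the example and are not those of the study.

```python
# Illustrative sketch: a theory-based indicator structure in which one variable
# is proxied by normalised search frequencies. All values below are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "C"],
    "share_elderly": [0.22, 0.18, 0.25],          # theory-based, conventionally measured
    "green_space_per_capita": [12.0, 30.0, 8.0],  # theory-based, conventionally measured
    "heat_search_frequency": [0.8, 0.3, 0.9],     # Big Data proxy, already scaled to 0..1
})

def minmax(s):
    """Rescale a column to the 0..1 range."""
    return (s - s.min()) / (s.max() - s.min())

# Theory determines the structure and the (hypothetical) weights; the search-frequency
# proxy only fills the gap left by a variable that is not directly measured.
df["vulnerability_index"] = (
    0.4 * minmax(df["share_elderly"])
    + 0.3 * (1.0 - minmax(df["green_space_per_capita"]))  # more green space, less vulnerable
    + 0.3 * df["heat_search_frequency"]
)

print(df[["city", "vulnerability_index"]])
```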

This overview of studies highlighted two points. First, Big Data does not enter scientific research in an all-or-nothing way. Rather, individual elements such as new forms of measurements, data streams, and modelling based on machine learning are applied in combination with other, more traditional approaches.

Second, “full-blown” Big Data analysis, from the point of view of our conceptual framework, is based on an unstructured stream of measurements based on everyday reasoning and modelling based on automatically detected correlations. Results obtained in such a way need to be constantly evaluated against new data and adapted if necessary because there is no reason to believe that they extrapolate to new cases. Such a constant evaluation against new data is often not feasible for scientific inquiry.
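
Where new outcomes keep arriving, this constant evaluation can look roughly like the rolling verification sketched below (again an illustration with synthetic data, not a project deliverable); for multi-decadal climate projections, no comparable stream of outcomes exists to verify against.

```python
# Sketch of rolling verification: refit on all data seen so far, then check skill
# on each newly arriving batch. Data are synthetic placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.2]) + rng.normal(scale=0.2, size=500)

batch = 50
for start in range(batch, len(y), batch):
    model = Ridge().fit(X[:start], y[:start])             # train on everything seen so far
    new_X, new_y = X[start:start + batch], y[start:start + batch]
    err = mean_absolute_error(new_y, model.predict(new_X))
    print(f"batch starting at {start}: MAE = {err:.3f}")  # skill is re-checked on each new batch
```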

In fields like climate science, where research also targets long-term future developments, such an evaluation is impossible. Hence, confidence in results obtained from such approaches needs to be established by arguing for the constancy of the identified relationships and results. Such arguments can only be provided by referring to relevant background knowledge about the target system under investigation, i.e. by rooting the research in domain-specific theory.

Thus, our approach highlighted that it is not only descriptively true that Big Data elements are often combined with more theory-based approaches; such a combination is also necessary to obtain results that can usefully be extrapolated to further cases. Building on this overview, the subsequent steps of the project targeted specific questions relating to the individual Big Data elements identified with this framework.

While our work has not resulted in a proposed definition of the term “Big Data”, the descriptive exercise of categorising case studies within a conceptual framework certainly provides a fruitful starting point for such work in the future.
