Coresets: big data with less data

Author
Prof. Andreas Krause
ETH Zürich

Interview with the principal investigator of this NRP75 project

What was the aim of your project “Scaling up by scaling down”?

We developed novel algorithms for the efficient analysis of large data sets. The objective was to summarise or compress the data such that machine learning models can be trained on the compressed data with minimal loss in accuracy. Since they are considerably smaller than the original data, the so-called coresets created during compression can be processed efficiently, with a high degree of robustness and accuracy.

What were the results?

A key result of our project is a set of novel coreset constructions that are compatible with modern deep neural network models. The central idea is to optimise the weights associated with the different data points such that a model trained on the weighted data maximises prediction accuracy on the full data set. Rather than simply subsampling the data uniformly, which fails to properly capture edge cases and rare events, our optimised coresets systematically summarise and adaptively sample the data set. Our approaches enable training complex models online, even on nonstationary data streams (i.e., where the underlying distribution of arriving examples changes over time, for example due to seasonal trends). They also provide highly effective means for active semi-supervised learning: our methods can determine, out of a large unlabelled data set, a small subset of points to label, such that predictive accuracy is maximised when the labelling information is propagated using modern semi-supervised deep learning techniques.
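
To make the bilevel idea concrete, the following sketch in Python (using JAX for automatic differentiation) optimises per-example weights for a simple logistic-regression model on synthetic data: the inner loop trains the model on the weighted candidate points, and the outer loop adjusts the weights so that the trained model fits the full data set well. It is a heavily simplified illustration of the principle, not the project's actual construction; the data, model, and hyperparameters are assumptions made for the example.

  import jax
  import jax.numpy as jnp
  import numpy as np

  rng = np.random.default_rng(0)
  n, d, m = 2000, 10, 50                       # full set size, feature dimension, coreset candidates
  X = rng.normal(size=(n, d))
  y = (X @ rng.normal(size=d) > 0).astype(np.float32)

  cand = rng.choice(n, size=m, replace=False)  # candidate coreset points (chosen at random here)
  Xc, yc = jnp.array(X[cand]), jnp.array(y[cand])
  Xf, yf = jnp.array(X), jnp.array(y)

  def loss(theta, features, labels, w=None):
      # numerically stable logistic loss, optionally weighted per example
      z = features @ theta
      per_ex = jnp.maximum(z, 0) - z * labels + jnp.log1p(jnp.exp(-jnp.abs(z)))
      return jnp.mean(per_ex) if w is None else jnp.sum(w * per_ex) / (jnp.sum(w) + 1e-8)

  def inner_train(w, steps=50, lr=0.5):
      # inner problem: train the model on the weighted coreset candidates
      theta = jnp.zeros(d)
      for _ in range(steps):
          theta = theta - lr * jax.grad(loss)(theta, Xc, yc, w)
      return theta

  def outer_objective(w):
      # outer problem: how well does the coreset-trained model fit the full data?
      return loss(inner_train(w), Xf, yf)

  outer_grad = jax.jit(jax.grad(outer_objective))
  w = jnp.ones(m) / m
  for _ in range(100):
      w = jnp.clip(w - 0.5 * outer_grad(w), 0.0)  # gradient step on the weights, kept non-negative
      w = w / (jnp.sum(w) + 1e-8)

  print("full-data loss, optimised weights:", outer_objective(w))
  print("full-data loss, uniform weights:  ", outer_objective(jnp.ones(m) / m))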

What are the main messages of the project?

  • Coresets provide an effective mechanism for summarising and compressing massive data sets for the purpose of training accurate machine learning models. Rather than simply uniformly subsampling the data, which risks missing important edge cases, coresets systematically summarise the data set.
  • Adaptive sampling strategies inspired by coresets can be used effectively to accelerate the training of machine learning models (see the sketch after this list).
  • Our novel bi-level coresets enable data compression and yield effective approaches for dealing with nonstationary data streams and for reducing labelling costs, even for complex deep learning models.
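
As a concrete illustration of the adaptive sampling point above, the sketch below compares plain uniform minibatch sampling with loss-proportional sampling (with importance weights so the gradient estimate stays unbiased) for a simple logistic-regression model. It is a simplified example on synthetic data with assumed hyperparameters, not the project's method or code.

  import numpy as np

  rng = np.random.default_rng(1)
  n, d = 5000, 20
  X = rng.normal(size=(n, d))
  y = (X @ rng.normal(size=d) > 0).astype(float)

  def per_example_loss(theta):
      # numerically stable logistic loss for every example
      z = X @ theta
      return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

  def train(adaptive, steps=300, batch=32, lr=0.1):
      theta = np.zeros(d)
      for _ in range(steps):
          if adaptive:
              losses = per_example_loss(theta)
              p = 0.5 * losses / losses.sum() + 0.5 / n  # mix with uniform to bound importance weights
          else:
              p = np.full(n, 1.0 / n)                    # plain uniform minibatch sampling
          idx = rng.choice(n, size=batch, p=p)
          iw = 1.0 / (n * p[idx])                        # importance weights keep the gradient unbiased
          z = X[idx] @ theta
          grad = X[idx].T @ (iw * (1.0 / (1.0 + np.exp(-z)) - y[idx])) / batch
          theta -= lr * grad
      return per_example_loss(theta).mean()

  print("mean training loss, uniform sampling :", train(adaptive=False))
  print("mean training loss, adaptive sampling:", train(adaptive=True))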

Does your project have any scientific implications?

First, there are implications for applied data science projects: here, coresets offer a valuable opportunity for scalable data analysis, even for modern machine learning models such as neural networks. They carry particular promise for tasks such as active learning (where label efficiency is central), as well as for learning on data streams. The systematic summarisation offered by coresets may also yield a natural approach to dealing with data imbalance, identifying edge cases, and so on.

There are also implications for machine learning research: the bi-level coreset constructions in particular open numerous opportunities for follow-up algorithmic research and further extensions, e.g., in the context of automated machine learning. Our adaptive sampling approaches have already led to natural follow-up work on risk-averse learning of deep neural network models.

Big Data is a very vague term. Can you explain what it means to your project?

This project addressed a central aspect of Big Data analytics, namely, how to summarise large data sets in a way that is sufficient for training high-performing machine learning models. Using the idea of coresets, it is possible to obtain substantial data reduction with minimal loss of accuracy. Optimised selection substantially outperforms simpler approaches, such as uniformly random subsampling of the data.
