Glossary – when is data “big”?

Big data is an evolving concept: it describes datasets whose properties challenge current technologies.

Their volume (size) typically exceeds gigabytes (GB), reaching terabytes (TB, 1000 GB) or even petabytes (PB, 1000 TB) and calling for very powerful storage and processing infrastructure. The velocity of the data (the rate at which it is produced or transferred, or the speed at which it is analysed) can exceed one GB per second, which demands very fast hardware and efficient software.
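
To give a feel for these orders of magnitude, the back-of-the-envelope Python sketch below estimates how long a single machine would need to stream through a terabyte or a petabyte of data, assuming a sustained rate of one GB per second (the velocity figure quoted above); the code and the assumed rate are purely illustrative.

```python
# Back-of-the-envelope sketch of the volume and velocity figures quoted above.
# Assumes the decimal units used in the text (1 TB = 1000 GB, 1 PB = 1000 TB)
# and an assumed sustained processing rate of 1 GB per second.

GB = 1
TB = 1000 * GB
PB = 1000 * TB

def hours_to_process(volume_gb: float, rate_gb_per_s: float = 1.0) -> float:
    """Hours needed to stream `volume_gb` gigabytes at `rate_gb_per_s` GB/s."""
    return volume_gb / rate_gb_per_s / 3600

for label, size in [("1 TB", TB), ("1 PB", PB)]:
    print(f"{label}: about {hours_to_process(size):,.1f} hours at 1 GB/s")

# Roughly 0.3 hours for a terabyte, but over 275 hours (more than 11 days) for a
# petabyte, which is why big data calls for parallel storage and processing.
```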

Applications often combine heterogeneous types of data (text, numbers, coordinates, images, sound, video, etc.) with very different characteristics – a GPS trace, for example, is very precise, while the semantics of a text are often ambiguous. This variety requires algorithms that can handle multiple data formats and types.

Data is rarely error-free, truthful, accurate, representative or complete – properties captured by the term veracity. Many big data applications are based on more or less accurate models or on machine learning techniques that first learn from training datasets of varying quality, which in turn influences the validity of the results.

Additional V’s are sometimes used to describe a big data application, including data variability, vulnerability, visualisation or value.

Key technology concepts

Anonymisation│Removing all data points that could reveal a person’s identity, thus making the data anonymous – or rather “pseudonymous” (the first sketch after this glossary illustrates the difference).

Artificial intelligence│Algorithms and machines demonstrating “intelligent” behaviour, as well as the underlying methods and their real-world applications.

Machine learning│Computing techniques that allow algorithms to learn autonomously, for example from training data.

Metadata│Information about a data point such as where and when it was acquired, its type or categorisation.

Re-identification│Combining several anonymised datasets to identify people.

Supervised learning│A machine learning approach in which algorithms learn from labelled training data.

Unsupervised learning│A machine learning approach in which algorithms uncover features within datasets without using labelled training data (the second sketch after this glossary contrasts the two approaches).
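
To illustrate the entries on anonymisation and re-identification, the short sketch below replaces direct identifiers with salted hashes. It is a minimal, purely hypothetical example (the records, field names and salt are invented), and it deliberately keeps attributes such as postcode and birth year, which is precisely what leaves the door open to re-identification.

```python
# Hypothetical sketch of pseudonymisation: direct identifiers (here, names) are
# replaced by salted hashes, but quasi-identifiers (postcode, birth year) remain,
# so the data is pseudonymous rather than truly anonymous.
import hashlib

SALT = "replace-with-a-secret-value"  # hypothetical secret; must not be published

def pseudonymise(identifier: str) -> str:
    """Return a stable pseudonym for a direct identifier such as a name."""
    return hashlib.sha256((SALT + identifier).encode()).hexdigest()[:12]

records = [  # invented example records
    {"name": "Alice Example", "postcode": "1050", "birth_year": 1984},
    {"name": "Bob Example", "postcode": "75011", "birth_year": 1990},
]

pseudonymised = [
    {"id": pseudonymise(r["name"]), "postcode": r["postcode"], "birth_year": r["birth_year"]}
    for r in records
]
print(pseudonymised)

# The names are gone, yet postcode and birth year can still single people out when
# combined with other datasets – the re-identification risk defined above.
```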
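
The difference between supervised and unsupervised learning can also be sketched in a few lines. The example below assumes the scikit-learn library and a toy synthetic dataset (neither is part of the briefing) and is only meant to show where labels enter the process.

```python
# Minimal sketch contrasting supervised and unsupervised learning on toy data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy dataset: 200 points forming two well-separated groups, with labels y.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised learning: the labels y are handed to the algorithm during training.
classifier = LogisticRegression().fit(X, y)
print("supervised accuracy on the training data:", classifier.score(X, y))

# Unsupervised learning: only the raw data X is provided; the algorithm has to
# uncover structure (here, two clusters) without any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes found without labels:", [int((clusters == k).sum()) for k in (0, 1)])
```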

Societal and ethical issues

Access │People should be able to access and delete their personal data stored by service providers.

Bias │Data is not neutral: it reflects existing biases in society, such as limited representation of minorities or correlations of a discriminatory nature.

Black box │The result produced by a machine learning algorithm often cannot be explained. This impairs reliability and trust.

Business practices │Big data applications require data sharing, which raises issues of business confidentiality.

Fairness │Algorithms trained on biased data will likely produce unfair results.

Innovation │Innovation requires clear, stable and balanced regulation.

Power asymmetry │Citizens, companies and governments are, in practice, often unable to change providers.

Privacy │Individuals should be protected against undue access to their private data, and against others sharing and analysing it.

Regulation │Even algorithms bearing great responsibility are largely unregulated, in contrast to physical medical products or vehicles. National regulations differ, hindering transparency and transnational research projects.

Societal agency │The development of big data is led mainly by corporations, with little control by citizens or public authorities.

Trust │Society needs to have confidence in big data applications. This requires trusting the whole process of data generation and use, in terms of issues surrounding the data itself (privacy, access and bias), the algorithms (reliability and fairness) and the implementation of big data systems (intentions, ethics, control, etc.).

User agency │Users should be able to control which personal data is collected, how and for which purposes – beyond simply giving or withholding permission for the use of certain cookies.