Stream analytics: fast processing and privacy-preserving tools

Authors
Prof. Michael Böhlen and Prof. Abraham Bernstein
University of Zurich

Interview with the principal investigators of this NRP75 project.

What was the aim of your project “Privacy-preserving, stream analytics for non-computer scientists”?

The goal of our project was to build a petabyte-scale analytics system that enables non-computer scientists to analyse high-performance data streams. Our solution supports advanced statistical operations in real time and preserves the privacy of the data. To evaluate the robustness and functionality of the system, we replicated the processing pipeline for the Australian Square Kilometre Array Pathfinder radio telescope, which generates up to 2.5 gigabytes of raw data per second. To evaluate privacy preservation, we analysed the TV viewing habits of around 3 million individuals.

What were the results?

First, we developed new algorithms that make it possible to use the Fourier transform to process high-velocity, high-volume data streams incrementally. The first algorithm, the Single Point Incremental Fourier Transform (SPIFT), exploits twiddle factors to reduce the cost of processing a single new observation arriving in the data stream: it uses circular shifts to reduce the number of multiplications from quadratic to linear. The second algorithm, the Multi Point Incremental Fourier Transform (MPIFT), processes batches of observations.
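The saving from reusing twiddle factors can be illustrated with a small sketch. This is not the SPIFT algorithm itself; it is a simplified illustration that assumes a square n-by-n image, integer frequency coordinates (u, v), and hypothetical function names. A single new frequency-domain observation contributes value * exp(2πi(ux + vy)/n) to every pixel (x, y). Since the exponent only matters modulo n, there are only n distinct products, so they can be computed once and then looked up.

```python
import cmath


def add_observation_naive(image, u, v, value):
    """Add one observation with one complex multiplication per pixel (n * n in total)."""
    n = len(image)
    for x in range(n):
        for y in range(n):
            angle = 2j * cmath.pi * (u * x + v * y) / n
            image[x][y] += value * cmath.exp(angle)


def add_observation(image, u, v, value):
    """Add one observation using only n multiplications by `value`.

    The n distinct products value * exp(2*pi*i*k/n) are computed once;
    each pixel then needs only a table lookup and an addition.
    """
    n = len(image)
    table = [value * cmath.exp(2j * cmath.pi * k / n) for k in range(n)]
    for x in range(n):
        for y in range(n):
            image[x][y] += table[(u * x + v * y) % n]


if __name__ == "__main__":
    n = 4
    image = [[0j] * n for _ in range(n)]
    add_observation(image, u=1, v=2, value=1.0 + 0.5j)
    print(image[0][0])
```

Both functions produce the same image up to floating-point rounding; the second one merely moves the multiplications out of the inner loop.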

Second, we advanced declarative high-level languages with functional extensions for linear algebra operations. Specifically, we extended both the relational algebra and SQL with linear algebra operations and built a system that integrates the functional extension into the kernel of the MonetDB column store.

Third, we developed SihlQL, a SPARQL-inspired query language for the privacy-preserving querying of RDF data streams. The starting point was an easy-to-understand probabilistic parameter for systems based on differential privacy, which allows domain experts to easily specify the desired level of privacy. We then developed a compiler that transforms SihlQL queries into Apache Flink workflows. The resulting system, SihlMill, is published as an open-source project and implements state-of-the-art privacy-preserving algorithms as well as new mechanisms designed to extend the expressiveness of SihlQL.
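For readers unfamiliar with differential privacy, the following minimal sketch shows the textbook Laplace mechanism applied to a counting query. It is not taken from SihlMill and does not use SihlQL syntax; the function names, the record layout, and the choice of the Laplace mechanism are assumptions made for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical viewing records, invented for this example.
    viewers = [{"channel": "news"}] * 30 + [{"channel": "sport"}] * 70
    noisy = private_count(viewers, lambda r: r["channel"] == "news", epsilon=0.5)
    print(round(noisy, 2))
```

A smaller epsilon means more noise and therefore stronger privacy but less accurate answers, which is the trade-off discussed in the insights below.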

What are the main messages of the project?

Insight #1. In domains where the algorithmic requirements are well understood, big data technology is largely available to the public.

This leads to two major messages. First, given that access to technology is not the main competitive factor (due to its open-source availability), Switzerland needs to ensure a sufficient supply of skilled labour that can use the democratised Big Data technology. Otherwise, it is likely to fall behind. Second, Switzerland needs to investigate whether there are critical domains where the relevant algorithms do not exist, and find the means to incentivise their development.

Insight #2. Privacy-preserving data processing techniques are available but require an adequate understanding of their functionality and parameters.

Consequently, Switzerland needs to ensure that people using Big Data know about the various privacy-preservation techniques. Furthermore, people employing these techniques need to be made aware of the pitfalls in interpreting the parameters and be provided with the tools to determine adequate parameter values. Alternatively (or additionally), Switzerland should invest in the development of privacy-preserving techniques with more intuitive metrics.

Insight #3. Techniques for privacy-preserving data processing are available but require a sensitivity to the trade-off between privacy preservation and result quality.

Switzerland needs to develop a culture in which the “price” of privacy (in terms of effort during systems design, implementation, and maintenance, as well as a possible impact on answer quality) is understood and accepted. To that end, a public debate is needed on privacy versus result quality and processing simplicity, and on the trade-offs involved.

Does your project have any scientific implications?

Scientific Implication #1: Switzerland needs to investigate what critical domains require specialised algorithms that do not exist and find the means to develop them.

Supporting rationale: We found that open-source infrastructure for Big Data processing, with adequate default settings for many applications, is readily available (leading us to abandon one investigative path of the project). Additionally, it seems that public domain algorithms are also available for many domains. As evidenced by our radio-astronomy investigation, however, some domains still require an enormous amount of hand-coding, and efficient algorithms still need to be developed. These scientific gaps therefore need to be identified and addressed through research to lower the hurdle for big data usage.

Scientific Implication #2: We require more research into easy-to-use privacy-preserving techniques.

Supporting rationale: A number of privacy-preserving data processing frameworks have been published in recent years (including our SihlQL). Hence, it is likely that the mere inclusion of privacy-preserving techniques will become easier. Understanding their parameter settings has, however, not become easier. One of the central parameters of differential privacy (i.e., 𝜀) is difficult to grasp intuitively. As with the availability of big data infrastructures, we therefore believe that research is needed into the development of turnkey solutions for privacy preservation that are sufficiently simple to use to drive adoption.
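To illustrate why 𝜀 is hard to read directly, the sketch below converts it into a probability using one standard interpretation of 𝜀-differential privacy: an observer's odds about a yes/no fact concerning one individual can grow by at most a factor of e^𝜀 after seeing the output. This is a general property of the definition, not the probabilistic parameter proposed in the project; the function name and the prior values are assumptions chosen for illustration.

```python
import math


def posterior_bound(epsilon, prior):
    """Upper bound on an observer's belief after seeing an epsilon-DP output.

    `prior` is the observer's belief (strictly between 0 and 1) in a yes/no
    fact about one individual before seeing the output. Under
    epsilon-differential privacy the odds can grow by at most exp(epsilon).
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    odds = math.exp(epsilon) * prior / (1.0 - prior)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    for eps in (0.1, 1.0, 5.0):
        print(eps, round(posterior_bound(eps, prior=0.5), 3))
```

The mapping from 𝜀 to this bound is non-linear and depends on the assumed prior, which is one reason why choosing 𝜀 without tool support is difficult.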

Does your project have any policy recommendations?

Policy Implication #1: Switzerland needs to ensure a sufficient supply of skilled labour that can use the democratised Big Data technology.

Supporting rationale: As already mentioned, we found open-source infrastructure for Big Data processing to be adequate for many applications. Consequently, it seems that access to technology is not the main competitive factor. Rather, it is access to people who can use the technology.

Policy Implication #2: There needs to be a public debate regarding the tradeoffs involved when using privacy-preserving techniques.

Supporting rationale: Whilst privacy-preserving techniques are available (albeit difficult to tune, as mentioned above), people need to understand the tradeoffs involved in using them. Even when using techniques that have understandable parameters, it is still unclear how to determine adequate goals. Is a 5% chance of leaking information an acceptable risk? Is it an acceptable risk when leaking a person’s marital status, course grade, or HIV infection status? What about COVID infection status during a pandemic (or during “normal” times)? These questions are not primarily technical but require a societal discussion. As evidenced by the discussions on processing health data during COVID, Switzerland needs to have an active debate on these issues.
