Big Data Monitoring

Data streams: monitoring in real time

Authors
Prof. Dmitriy Traytel and Prof. David Basin
University of Copenhagen, Denmark, and ETH Zürich

Interview with the principal investigators of this NRP75 project.

What was the aim of your project “Big Data Monitoring”?

Rules are integral to our social and business reality. In many domains, the rules are sufficiently precise that scalable and efficient monitoring algorithms can be used either to prove compliance with the rules or to identify situations in which the rules are violated.

The aim of the project was to develop algorithms that continually monitor incoming data for rule violations. The more complex the rule, the greater the challenge of checking it efficiently against enormous volumes of data. The expressiveness of the input language influences the possible complexity of rules and therefore the efficiency of the monitoring algorithm. Our aim was to find efficient monitoring algorithms for highly expressive and hence practically useful input languages.

What were your results?

The efficiency of monitoring algorithms can be greatly improved if one considers alternative modes of operation for monitors. This is particularly relevant for high-velocity data streams, with millions of events arriving at the monitor per second. Our project demonstrates how such high-velocity data streams can be handled algorithmically.

The scalability of monitoring algorithms can be greatly improved by black-box data parallelization. Black-box here means that one does not have to modify existing monitoring algorithms. Instead, our project demonstrates how to improve scalability by splitting the input event stream across different (identical) instances of the monitor without compromising the monitor’s correctness. As a concrete example, using 16 monitor instances allowed us to increase the number of events we can process in real time by a factor of 10.
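
As a simplified illustration of the slicing idea, consider the following Python sketch. It is not the project's implementation: the Monitor class, its toy login/withdraw rule, and the hash-based slicing on a single data value (the user) are hypothetical, and the slices are processed sequentially here rather than on separate machines. The point is only that the monitor instances themselves run unchanged.

class Monitor:
    """Hypothetical stand-in for an unmodified off-the-shelf monitor.
    It checks the toy rule 'a withdraw must be preceded by a login by
    the same user' on the slice of the stream it receives."""

    def __init__(self):
        self.logged_in = set()

    def process(self, event):
        kind, user = event
        if kind == "login":
            self.logged_in.add(user)
        elif kind == "withdraw" and user not in self.logged_in:
            return "violation: withdraw by {} without prior login".format(user)
        return None


def slice_index(event, num_slices):
    # Route an event by the data value the rule correlates on (the user),
    # so that all events about the same user reach the same monitor instance.
    _, user = event
    return hash(user) % num_slices


def monitor_in_parallel(stream, num_slices):
    # One unmodified ("black-box") monitor instance per slice. In a real
    # deployment each instance would run on its own core or machine.
    monitors = [Monitor() for _ in range(num_slices)]
    violations = []
    for event in stream:
        verdict = monitors[slice_index(event, num_slices)].process(event)
        if verdict is not None:
            violations.append(verdict)
    return violations


stream = [("login", "alice"), ("withdraw", "alice"), ("withdraw", "bob")]
print(monitor_in_parallel(stream, num_slices=16))
# Expected output: ['violation: withdraw by bob without prior login']

Splitting on a single data value is only sound when the rule correlates events via that value; rules that correlate events in richer ways require more careful slicing strategies, which is where much of the technical difficulty lies.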

Does your project have any scientific implications?

Formal methods are often dismissed as impractical because they “do not scale”. This “excuse” is no longer valid: Runtime verification is a lightweight formal method that scales, as our project has demonstrated.

Does your project have any policy recommendations?

In realms where policies can be made sufficiently precise (so that they can be falsified) and the events being monitored can be observed or logged by IT systems, it is both feasible and desirable to replace error-prone human-based auditing and compliance processes with computer-supported ones.

Big Data is a very vague term. Can you explain to us what Big Data means to you?

The term Big Data should always be considered relative to the algorithmic problem at hand. Some problems are algorithmically easy. For example, finding frequent items in data streams can be done very efficiently, so Big Data research in this space focuses on scenarios in which one cannot even inspect every item and must resort to approximate counts.
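
As an illustration of that "easy" end of the spectrum (not part of our project), the following Python sketch shows the classic Misra-Gries algorithm, which approximates item frequencies in a single pass using a bounded number of counters.

def misra_gries(stream, k):
    """One-pass approximate frequency counts using at most k - 1 counters.
    Each reported count underestimates the true count by at most n / k,
    where n is the length of the stream (the classic Misra-Gries bound)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


print(misra_gries(["a", "b", "a", "c", "a", "a", "b"], k=3))
# Expected output: {'a': 3, 'b': 1}  (true counts: a=4, b=2, c=1)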

In contrast, the monitoring problem that we are studying is comparatively hard. It requires correlating different items in the event stream and potentially keeping them in the monitor’s memory. The monitor’s memory usage in turn limits the data quantities the monitor can handle in real time.
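
To make this contrast concrete, here is a toy stateful monitor in Python (a hypothetical example, not our actual algorithm) for the rule "every request must be acknowledged within 10 time units". The pending dictionary is exactly the kind of state the monitor must keep in memory, and it grows with the number of outstanding requests.

def monitor_requests(stream, deadline=10):
    """Toy monitor for the rule 'every request is acknowledged within
    `deadline` time units'. The stream consists of (timestamp, kind,
    request_id) tuples in timestamp order."""
    pending = {}  # request_id -> timestamp of the still-unacknowledged request
    for ts, kind, rid in stream:
        # Report every pending request whose deadline has now passed.
        for old_rid, old_ts in list(pending.items()):
            if ts - old_ts > deadline:
                yield "violation: request {} not acknowledged by {}".format(
                    old_rid, old_ts + deadline)
                del pending[old_rid]
        if kind == "request":
            pending[rid] = ts
        elif kind == "ack":
            pending.pop(rid, None)


events = [(0, "request", 1), (3, "request", 2), (5, "ack", 1), (20, "ack", 2)]
print(list(monitor_requests(events)))
# Expected output: ['violation: request 2 not acknowledged by 13']

In this sketch a violation is only reported once a later event shows that the deadline has passed; keeping such state small is one of the main levers on the memory usage mentioned above.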

From this point of view, Big Data means having a data input that overwhelms the known algorithms for a specific problem. This is a rather pessimistic definition because any algorithm can eventually be overwhelmed. Thus, there is no definitive “solution” to this Big Data problem. What we can do is push the boundary by devising better algorithms or by parallelizing the data processing.

By doing precisely that, our project has contributed to changing the meaning of Big Data for the monitoring problem.
