Big Data Monitoring

Author
Dr. Dmitriy Traytel
ETH Zürich

Home > Uncategorized > Big Data Monitoring

17. December 2019

Interview with the PI of the NRP75 project “Big Data Monitoring”.

What can you tell us about the status of your project?

Under “Big Data Monitoring” we understand the detection of complex temporal patterns in high-volume and high-velocity streams of data. A typical monitoring task from the domain of financial services reads as follows: given a stream of money withdrawal and limit-raising events produced by a bank, ensure that the following property always holds: “the total amount of money withdrawn by any customer in the last 30 days does not exceed 5000 CHF, except if the customer has previously received a higher credit limit”. Banks and other highly regulated companies are often faced with hundreds or thousands of such properties that they must comply to and demonstrate compliance to.

Our project is separated into a Theory and a Practice track. In the Theory track, we explore the trade-offs in efficiency, scalability, property specification language expressiveness, and the monitoring setting. In the Practice track, we aim to scale up our monitoring tools to cope with huge quantities of data. Despite their efficiency, our monitoring tools are single-threaded programs, which naturally limits their throughput. We are improving their scalability by linking them with existing Big Data technology for data parallel computation (e.g., the Apache Flink stream processing framework).

What about first results?

An exciting development in the Theory track is a new approach to monitoring, called multi-head monitoring, which can achieve a significantly better performance, both asymptotically and in practice, than the traditional approaches. A multi-head monitor may read each input event multiple times; the actual number only depends on the monitored property, but not on the number of processed events. This mode of operation allows the monitor to use only little additional memory, which also only depends on the monitored property, not on the number of processed events. Thus, the multi-head monitor suits perfectly the Big Data Monitoring setting, in which the properties are typically small and fixed, whereas the number (and rate) of events is huge.

In the Practice track, we developed a framework for slicing the event stream into substreams that can be monitored independently (in parallel). To this end, we adapted hash-based partitioning techniques from databases, which were tailored for processing database queries, to the monitoring setting. This resulted in an algorithm that can automatically select a slicing strategy that distributes events evenly across the parallel submonitors. We implemented this automatic slicing, combining MonPoly (our state-of-the-art monitor) with Apache Flink (a framework for data parallel computation) and experimentally demonstrated a significant performance improvement: while retaining sub-second latency, 16-way parallelization increased the throughput of our parallel monitor by one order of magnitude. Furthermore, the used slicing strategy must account for the statistics of the event stream to be effective. For example, the relative event frequencies and frequently occurring data values influence the choice of the slicing strategy. These statistics may rapidly change. We extended our parallel online monitor to adapt to such changes and showed empirically (on synthetic data) that adaptation of the slicing strategy can achieve up-to a tenfold improvement in run-time.

Keyword “technology transfer”: In your opinion, who could be possible users of your project? Who could benefit from this?

Rules are integral to our social and business reality. In many domains, the rules are sufficiently precise that monitoring can be used to either prove compliance or identify situations when the rules are violated.

The algorithms and tools developed in our project can thus benefit virtually anyone with large amounts of data, which must behave according to given, precise rules. We are in the process of establishing industrial collaborations with several partners. From our experience, the main challenge for the usage of monitoring tools in industrial collaborations is the fact that the monitored rules must be known and sufficiently precise. For example, if a regulation demands that some item is “deleted”, we need to know precisely what “deletion” means (e.g., “deletion of the item from a database”, or “deletion of the item from a database and all backups”, or “deletion of the item and all derived items from …”, etc.) and moreover all the information necessary to detect violations of such properties must be present in the event stream.

Furthermore, sharing data with us (as a third party) is difficult for most companies, which significantly complicates collaborations. (Although, this is less of a problem in knowledge transfer, if the company decides to use our open source tools on its premises.).

Big data is a very vague term. Can you explain to us what Big Data means to you?

In my opinion, the term Big Data should always be considered relative to the algorithmic problem at hand. Some problems are algorithmically easy. For example, finding frequent items in data streams can be done very efficiently and thus the Big Data research in this space focuses on scenarios in which one cannot even inspect every item and thus resorts to approximate counts. In contrast, the monitoring problem that we are studying is comparatively hard. It requires correlating different items in the event stream and potentially keeping them in the monitor’s memory. The monitor’s memory usage also influences the data quantities a monitor can handle in real-time.

From this point of view, Big Data means having a data input that overwhelms the known algorithms for a specific problem. This is a rather pessimistic definition, because any algorithm can be eventually overwhelmed. Thus, there is no “solution” to this Big Data problem. What we can do is to push the boundary by devising better algorithms, parallelizing the data processing, or resorting to approximate solutions.

Big Data Monitoring

What can you tell us about the status of your project?

What about first results?

Keyword “technology transfer”: In your opinion, who could be possible users of your project? Who could benefit from this?

Big data is a very vague term. Can you explain to us what Big Data means to you?

About the project

Related links