Scala programming language: enabling big data analytics

Author
Prof. Martin Odersky
EPFL

Interview with the principal investigator of this NRP 75 project.

What was the aim of your project “Programming language support for Big Data”?

The Scala programming language has been under development at EPFL since 2003. It is used by hundreds of thousands of developers worldwide, and it has several attractive features that make it the implementation language of choice for a new generation of Big Data “frameworks” (software libraries).

With our NRP 75 project we wanted to improve the integration of programming languages and databases. Following Scala’s philosophy of being a versatile language, we sought ways to better express and export the fundamental programming abstractions (ways of formulating essential tasks) used at the interface between databases and programming languages.
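To give a flavour of what such an abstraction can look like, here is a minimal, purely illustrative Scala 3 sketch: a query written as ordinary, type-checked Scala code over an abstract table, which a framework could later translate to SQL or to a distributed job. The names `Table`, `Person` and `adults` are hypothetical, chosen for this example only, and do not refer to any concrete library.

```scala
// Hypothetical sketch: a typed, composable view of a data set.
case class Person(name: String, age: Int)

trait Table[A]:
  def filter(p: A => Boolean): Table[A]
  def map[B](f: A => B): Table[B]

// The query is ordinary Scala; a framework could interpret it as SQL,
// a cluster job, or an in-memory computation without changing this code.
def adults(people: Table[Person]): Table[String] =
  people.filter(_.age >= 18).map(_.name)
```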

What results has the project achieved?

The project has achieved its main goal: to integrate several new technologies into a coherent set of abstractions for interfacing with data, and to validate their usefulness in open-source projects. The implementations of these abstractions are of sufficiently high quality to be integrated into Scala 3, the major new version of Scala released in July 2021. Like Scala 2, Scala 3 is meant to be a production-ready platform for major applications, not just a research language.

What are the main messages of the project?

  • We are converging on a well-understood set of abstractions for metaprogramming.
  • We have demonstrated the utility of these abstractions for serialisation and database access, which are key elements of digitalisation (see the sketch after this list).
  • This points the way towards a closer future integration of programming languages and databases, particularly in the areas of streaming and Big Data.
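As an illustration of the serialisation point above, Scala 3’s metaprogramming support, in particular `inline` methods and `Mirror`-based typeclass derivation from the standard library, lets a serialiser be derived from the shape of a data type at compile time. The sketch below uses only these standard facilities; the `Show` typeclass and its string encoding are invented for this example and are not a library produced by the project.

```scala
import scala.deriving.Mirror
import scala.compiletime.{erasedValue, summonInline}

// Illustrative serialisation typeclass (invented for this example).
trait Show[A]:
  def show(a: A): String

object Show:
  given Show[Int] with
    def show(a: Int): String = a.toString
  given Show[String] with
    def show(a: String): String = "\"" + a + "\""

  // Collect a Show instance for every element type of a tuple.
  inline def summonAll[Elems <: Tuple]: List[Show[?]] =
    inline erasedValue[Elems] match
      case _: EmptyTuple => Nil
      case _: (h *: t)   => summonInline[Show[h]] :: summonAll[t]

  // Derive a Show for any case class from its compiler-generated Mirror.
  inline def derived[A](using m: Mirror.ProductOf[A]): Show[A] =
    val fieldShows = summonAll[m.MirroredElemTypes]
    new Show[A]:
      def show(a: A): String =
        a.asInstanceOf[Product].productIterator
          .zip(fieldShows)
          .map((field, s) => s.asInstanceOf[Show[Any]].show(field))
          .mkString("(", ", ", ")")

// `derives Show` asks the compiler to build an instance via Show.derived.
case class User(name: String, age: Int) derives Show

@main def demo(): Unit =
  println(summon[Show[User]].show(User("Ada", 36)))  // prints ("Ada", 36)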

Does your project have any scientific implications?

Our work on this project has shown that scientific research can have a large-scale impact on practical open-source projects. With the widespread adoption of open-source software in universities and industry, this creates a faster feedback cycle between scientific results and their evaluation in practice.

Does your project have any policy recommendations?

I believe that policy makers should acknowledge this impact on society in their funding structures. Concretely, impact through wide adoption in open source should no longer be ranked second to impact through citations. Scientific work builds on previous science, and citations are a useful yardstick for that. But a work’s impact on society and its evaluation in practice are equally important, and we now have new means to measure them.

Relatedly, I believe that a larger part of public funding should go towards open-source infrastructure. With the rise of open source, we no longer need to buy expensive software licences. But someone still has to fund the developers who provide all this great new software, and open source is currently critically underfunded. Taking a lead in such funding would not only provide an important public service; it would also establish Switzerland as an even more attractive place for the very best developers and would help the digitalisation of Switzerland overall.

Big Data is a very vague term. Can you explain to us what Big Data means to you?

Big Data means the confluence of large data sets and the programs operating over them. The data can be stationary or processed as a stream. Big Data breaks the mould of traditional single-computer databases: it involves more heterogeneity and requires a deeper integration of databases with general-purpose programming languages. The results of our project pioneered techniques that provide this integration.
