Research in Data Science: Big Data, Big Risks?

Staff Reporter
First Posted: Jul 09, 2013 04:04 PM EDT

Big data is hot news. Industry and science place a high value on the opportunities for analysing huge amounts of unstructured data, yet there is also concern about data protection. ETH information technology professor Donald Kossmann researches and teaches in the field of big data and is convinced that the benefits will outweigh the risks. He explains his views in this interview, originally published on the ETH website in June:

Mr Kossmann, what do you think has been the most fascinating application of big data to date?
It’s difficult to say, let me think for a moment... oh yes, Google Translate. Over the years, linguists have tried to develop functional models for language without any great success. Today, Google Translate delivers better quality than any of these models, purely on the basis of experience, by comparing existing translations from the Internet.

One often gets the impression that big data is interpreted in different ways by different people. How would you define big data and what is new about it?
Big data is first and foremost the automation of experience. Conventional IT automates processes: you first consider how something might work at its best and then develop a program that automates precisely this process. With big data, it doesn’t stop there: the process is continually adapted in line with experience.

What are the technical fundamentals required for this?
It is becoming more affordable to store huge amounts of data, and computers are becoming more and more powerful. At the same time, companies such as Google developed completely new software architectures in the 1990s, meaning that they were no longer reliant on a mainframe computer to analyse large amounts of data and were instead able to fall back on hundreds or thousands of small computers. Practices that previously took place in Google’s labs have become accessible to everyone over the last few years as a result of open source developments.
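
To illustrate the idea of spreading one analysis across many small computers, here is a minimal sketch of the split, count and merge pattern. It is not Google's software or any specific open source system: the function names are hypothetical, and threads on a single machine stand in for the separate computers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Deal the records round-robin into n_chunks lists, one per worker."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: a worker counts the words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right


def distributed_word_count(records, n_workers=4):
    """Count words by giving each worker a chunk, then merging the results.

    Assumption: threads stand in for the "small computers" in the text.
    """
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())
```

Because each worker only ever sees its own chunk, adding more workers lets the same code handle more data. The only shared step is the final merge of the small partial results.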

You also give lectures on big data. To what extent has big data changed teaching and research at ETH?
From a technical perspective, big data is not a revolution; its fundamental technologies have been known for a long time. Accordingly, the courses on offer have not changed fundamentally. However, we have considerably expanded areas such as “Machine Learning” at ETH, i.e. the algorithmic and mathematical foundations for big data analyses. American universities are currently making initial attempts to develop completely new data science courses. We believe, however, that broad, well-grounded IT training is still very much in demand even in the age of big data. The continued high level of industry demand for our graduates proves our point.

What has changed in terms of research with big data?
There are more and more collaborations with industry in this area. At the same time, there has been a considerable increase in interest from other scientific disciplines: in biology, for example, where we are supporting the SystemsX.ch systems biology initiative, or in sociology, where we are involved in the FuturICT project.

In 2010, you founded the big data spin-off Teralytics together with an ETH graduate. What do you offer your clients?
A platform for big data analyses, i.e. software that can process and analyse very large amounts of data in real time. Very often, these types of analyses then run on hundreds of computers at the same time.

And who are your clients?
I would rather not name them, because big data still has quite a negative connotation in the public sphere, partly, of course, because of the current debates surrounding data analysis by the NSA in the USA. But big data’s bad image is undeserved: it has, after all, many useful purposes.

Such as...?
When we succeed in developing new, effective treatments by analysing anonymous health data, for example, in order to combat types of cancer that cannot currently be treated, then public support will soon grow. There are risks, of course, but sometimes the benefit of new technologies is so great that society simply has to take these risks.

Where are the biggest technical challenges?
Efficiency is an important area. The amounts of data are growing far faster than our computer and storage capacities. Today it is not always practical to analyse all of the available data, both for economic reasons and because of the energy required to do so. The question, therefore, is how much data we have to analyse to ensure significant results. We will continue to research improved real-time data analysis. Our aim is to obtain the information that is necessary for making decisions more quickly, something that is crucial in crisis situations such as natural disasters. And, of course, data protection is an ongoing problem for us. We are developing new hardware structures to encrypt and aggregate data so that we can guarantee that even insiders are unable to draw conclusions about individuals.
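
The question of how much data has to be analysed can be made concrete with a standard statistical sketch: estimate an average from a random sample and report the margin of error. This is not the method used by Kossmann's group. It is a textbook normal approximation at the 95% level, the function names are hypothetical, and it ignores the correction for finite populations.

```python
import math
import random
import statistics

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence interval


def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean of values from a random sample.

    Returns (sample mean, 95% margin of error). Needs sample_size >= 2.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, Z_95 * std_error


def required_sample_size(std_dev, margin):
    """Smallest sample size whose 95% margin of error is at most margin.

    Assumes the standard deviation of the data is roughly known in advance.
    """
    return math.ceil((Z_95 * std_dev / margin) ** 2)
```

Under these assumptions, the required sample size depends on the spread of the data and the desired precision, not on the total amount of data. This is why analysing a well-chosen fraction can be enough.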

Many big data applications today access data that is freely available on the Internet. Is technology that sanitises and encrypts personal data used here?
No, this is the sole responsibility of the user. In the case of services such as Facebook or Twitter, you agree to the companies checking your data. These companies can do what they like with this data. But it is, of course, up to you what you make available on the Internet.

Are we on the way to a “post privacy” society and an all-encompassing public sphere, as predicted by certain authors?
No, privacy is a basic human need. Perhaps young people value data protection a little less now but this generation will also learn to better protect its privacy using the technical possibilities out there. In addition, new, different platforms will become available on the Internet that will allow users more privacy than the current options such as Facebook. Overall, I am optimistic that people will ultimately use big data to their benefit.

Interview: Samuel Schlaefli, ©2013 ETH Zurich (Federal Institute of Technology Switzerland)

Donald Kossmann has been Professor for Computer Science at the Institute of Information Systems at ETH Zurich since August 2004.
