Humans are producing more and more data. For fear of missing something, none of it is deleted. Many companies don’t even know what is stored on their servers – and that’s a problem.

Anyone who talks about data or “big data” these days likes to use metaphors. Sometimes data is the “new oil”, as Chancellor Angela Merkel called it at the government’s digital retreat earlier this year. Sometimes it drives the “revolution that will change our lives”. More skeptical contemporaries like Harvard professor Shoshana Zuboff consider data a power factor in digital society, one that paves the way for surveillance capitalism.


What is not disputed, however, is that more data than ever before is being generated, collected and analyzed – whether to feed applications for artificial intelligence and machine learning, or because billion-dollar Silicon Valley corporations like Facebook and Google want to deliver ever more tailored advertising to their customers. Companies collect a lot, and they can, because sensors and users often (more or less) voluntarily provide the data. Yet data collectors do not always know whether they will ever need the information.

But because they worry about missing something, they record everything, be it the hair color of a user or the second in the morning when the bathroom light is switched on. Maybe this information can still be turned into money – if not today, then perhaps tomorrow.

A survey by the US network specialist Cisco concluded that by 2020, a city of one million inhabitants will generate 200 million gigabytes of data per day – through smart, networked houses, cars and weather sensors. Yet in most areas, only 0.1 percent of it is actually used. The big data gathering orgies thus generate tons of data that lie around unused for the time being: in the worst case, they become “dark data”.


Björn Schembera deals with this dark data at the High-Performance Computing Center Stuttgart (HLRS). He tries to shed light on a problem area that has so far received little attention. For the computer scientist, data is dark when it is only partially visible and therefore only partially usable – when it cannot be accessed or is not adequately described, even though it was created with technical, scientific and financial effort, says Schembera.

There is no longer a drive for the floppy disk on which you once saved your master’s thesis

Everyone knows the forgotten USB sticks in the drawer, full of old photos and other data. And if you once saved your master’s thesis on a floppy disk, today’s computers no longer have a drive to read it. Even old email accounts can often no longer be accessed, either because the provider no longer exists or because you forgot your password at some point and the account was blocked or deleted after years of inactivity. The data you left there, however, does not automatically disappear.

What private individuals can chalk up to bad luck or technical progress takes on completely different dimensions for research institutions and companies. A 2019 study by the data analytics company Splunk concluded that 55 percent of a company’s data is dark: companies do not even know that it exists, or where and how to find, process, analyze and use it. Analysts at IDC predict that the worldwide volume of data will grow from 33 zettabytes today to 175 zettabytes by 2025. More than half of it will be dark data.


Storing these immense amounts is also a cost factor: not only do the storage media have to be paid for, there are electricity costs as well. And if additional copies are made as backups, the costs rise further. All of this causes huge amounts of CO₂. According to a study by the software company Veritas, which specializes in data management, dark data will lead to emissions of 6.4 million tons of CO₂ in 2020, because masses of data garbage and pointlessly archived data are dumped in data centers.

Dark data can also cause legal problems. The General Data Protection Regulation (GDPR) clearly stipulates that personal data must be deleted immediately if “it is no longer necessary for the purposes for which it was collected or otherwise processed”. But who is responsible for so-called orphaned data, i.e. data that nobody can access anymore? Who owns this data, who is responsible for it, and who would even dare to delete it?

Technical solutions alone do not do justice to the problem

And what happens if no one knows that this data even exists? Untraceable data is missing, for example, from scientific studies that need it to draw the right conclusions – even if it is clear that there is no such thing as 100 percent knowledge. In his new book “Dark Data – Why What You Don’t Know Matters”, London statistics professor David Hand defines dark data simply as data that one does not have. It is thus similar to the concept of the dark figure in crime statistics.


This can be data that you know you don’t have, such as missing answers on a form, or data that you don’t know you don’t have, such as the dissatisfied customers who never complained. The data-driven society is based on the assumption that there is hardly any missing, distorted or inaccurate data, or that small amounts of it are not a problem. But that is wrong: precisely these missing values could be decisive.

In a research project at the HLRS in Stuttgart, Björn Schembera developed concepts for dealing with the problem of dark data and for avoiding or at least reducing it. The center is currently home to Hawk, the fastest supercomputer in Europe. Hawk has a peak performance of 26 petaflops, meaning it can perform 26 quadrillion (one quadrillion = 10¹⁵) floating-point operations per second. The HLRS also sees itself as a service provider for the automotive industry. A lot of data is generated there through its computer-aided projects, simulations and the research work of its scientists. According to Schembera, around 22 petabytes of data are currently held at the HLRS, around three percent of which – some 580 terabytes – is dark data.

On the one hand, Schembera suggests technical solutions for successful data management, such as automatically labeling data with metadata that third parties can also understand, and storing it in a location that can be searched. On the other hand, organizational solutions are necessary as well: research institutions, for example, should have a “data officer” – an idea that can also be transferred to companies.
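What automated metadata labeling might look like in practice can be sketched in a few lines of Python. This is a minimal illustration, not HLRS tooling: the sidecar-file approach, the field names and the function are all hypothetical, chosen only to show the kind of information that keeps data from going dark.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_metadata_sidecar(data_file: str, owner: str, description: str) -> Path:
    """Write a machine-readable .meta.json sidecar next to a dataset,
    so that third parties can later tell what the file is, who is
    responsible for it, and whether it is still intact."""
    path = Path(data_file)
    metadata = {
        "file": path.name,
        "size_bytes": path.stat().st_size,
        # checksum lets anyone verify the file has not silently changed
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "owner": owner,              # who to ask before deleting
        "description": description,  # human-readable context for third parties
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = path.with_suffix(path.suffix + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar
```

Because the sidecar is plain JSON, it can be indexed by any search tool – which addresses Schembera’s second point, that the data must also be stored somewhere searchable.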

In other words, a person, not an algorithm, who makes decisions and sets deadlines: how long certain data is kept, for example, or how long users may be inactive before an account is closed. And this person can and should delete the unused data. What is garbage does not have to be kept forever.
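A retention deadline of this kind is straightforward to enforce in code. The sketch below assumes, hypothetically, that a file’s last modification time is a good enough proxy for “unused” – and, in keeping with the article’s point, it only reports candidates rather than deleting anything, leaving the final decision to the data officer.

```python
import time
from pathlib import Path

def find_expired(root: str, retention_days: int = 365) -> list[Path]:
    """Return files under root whose last modification is older than
    the retention deadline. Deletion itself is left to a human."""
    cutoff = time.time() - retention_days * 24 * 3600
    return [p for p in Path(root).rglob("*")
            if p.is_file() and p.stat().st_mtime < cutoff]
```

Run periodically, a report like this at least makes the unused data visible again – the opposite of letting it go dark.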