Dark Data

In cosmology, dark matter and dark energy are the matter and energy that cannot be observed directly with currently available techniques, but we know they must be there because they are the only explanation for observations that the measurable matter and energy cannot account for. This may be our first insight into the existence of dark data. We currently live in a period of big data thanks to computers and the World Wide Web (part of which is also called dark), and there are many different ways to collect and process these data. However, there are also many ways in which our analysis may lead to the wrong conclusions because part of the data are missing or wrong, or "dark" as Hand calls it. These dark data can exist for many different reasons, and this book takes a closer look at the phenomenon. Countless examples are described (mostly from UK data). What types of obfuscation can darken our data? Why are some datasets dark? What are the consequences? What can be done to remedy the situation?

Hand starts with a kind of taxonomy of dark data: in what kinds of situations are we dealing with dark data? He describes, with many examples, fifteen different phenomena that can lead to dark data. Some are quite obvious, like missing data that we know are missing (known unknowns), but there may also be data missing that we are not aware of (unknown unknowns). Some data are wrong intentionally (falsification), others unintentionally (over-simplification or rounding). Conclusions may be drawn for a whole population, or for a longer period of time, from data that were collected for only part of the population or were valid only at a certain moment (extrapolation), and so on. Quite often, the data are wrong or misused for more than one reason.

The book gives no formal definition, but through his examples Hand explains what the different types of dark data are and how they come about, and he identifies some of the concepts that he uses throughout the book. For example, dark data caused by "self-selection" arise when some participants invited to an online poll decide not to take part, or prefer not to answer some of the questions. There are problems in designing the sample: even a sample can be big data, and in any case the data collected should represent the whole population for which the conclusion is supposed to hold. One has to be careful not to miss what really matters (like causality between the data used and the conclusion derived). Data can be corrupted by human error, or by summarising, simplifying, or rounding. People can manipulate data in a creative way (as in tax evasion) or corrupt data by deliberately feeding in false data (criminal activity, insurance fraud).

Hand also has a chapter on science and dark data. Not only have scientists been misled in their conclusions by dark data in the course of history, some have also contributed to it by intentionally falsifying published data, or they may have been biased by a general belief or intuition. John Ioannidis threw the cat among the pigeons with his 2005 paper "Why Most Published Research Findings Are False". Reproducibility was recognised as a major problem, and research institutions are demanding data management plans in funding applications. Data management has grown into big business, which runs up against issues of privacy and the GDPR. Distinguishing truth from fiction has become increasingly difficult in our digital world. Encryption, verification, identification, authentication, etc. can hardly keep up with creative fictionalisation. Artificial intelligence algorithms based on machine learning try to analyse data that are too massive for humans to deal with, but even these machines can be led astray by dark data.

Thus it has become a major problem to recognise dark data, to know how to deal with them, and to avoid wrong conclusions. This is what Hand discusses in the first two chapters of part II of the book (part II has a third chapter, also the last chapter of the book, which summarises the taxonomy described in the early chapters). First we need to identify why some data are missing. Here Hand considers three different mechanisms. The missingness may depend on the missing values themselves (UDD = Unseen Data Dependent), as when some people are reluctant to give their BMI because it is high. Or it may depend on data previously observed (SDD = Seen Data Dependent), like a BMI not given because it has increased since the last registered observation. Finally, data may be missing in a way that does not depend on the data at all (NDD = Not Data Dependent). Recognising the mechanism behind the missing data is important because it determines how one should deal with them, for example whether and how to fill in the missing values. The effects of NDD and SDD can be cured, but UDD is more difficult to deal with.
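
As a rough illustration of the three mechanisms (not taken from the book), the Python sketch below simulates a population with an earlier BMI that is always recorded and a current BMI that is sometimes missing. All numbers, thresholds, and names such as simulate_missingness are assumptions chosen for the example. It suggests why NDD is harmless for a simple average, why SDD can be corrected by reweighting on the seen variable, and why UDD leaves a bias that the observed data alone cannot reveal.

```python
import random
import statistics


def simulate_missingness(n=20000, seed=0):
    """Compare the mean current BMI under three hypothetical missingness mechanisms."""
    rng = random.Random(seed)
    # Hypothetical population: an earlier BMI (always seen) and a current BMI.
    previous = [rng.gauss(25.0, 4.0) for _ in range(n)]
    current = [p + rng.gauss(0.5, 1.5) for p in previous]

    def observe(keep_prob):
        # Keep each (previous, current) pair with a probability set by the mechanism.
        return [(p, c) for p, c in zip(previous, current)
                if rng.random() < keep_prob(p, c)]

    ndd = observe(lambda p, c: 0.7)                       # ignores all values
    sdd = observe(lambda p, c: 0.9 if p < 25 else 0.5)    # depends on the seen value
    udd = observe(lambda p, c: 0.9 if c < 25.5 else 0.5)  # depends on the unseen value

    # SDD correction: average within strata of the seen variable, then reweight
    # by the stratum shares in the full population (known, since `previous` is seen).
    low = [c for p, c in sdd if p < 25]
    high = [c for p, c in sdd if p >= 25]
    share_low = sum(p < 25 for p in previous) / n
    sdd_adjusted = (share_low * statistics.fmean(low)
                    + (1 - share_low) * statistics.fmean(high))

    return {
        "full": statistics.fmean(current),
        "NDD": statistics.fmean([c for _, c in ndd]),
        "SDD": statistics.fmean([c for _, c in sdd]),
        "SDD_adjusted": sdd_adjusted,
        "UDD": statistics.fmean([c for _, c in udd]),
    }


if __name__ == "__main__":
    for name, value in simulate_missingness().items():
        print(f"{name:>13}: {value:.2f}")
```

In this toy setting the NDD average and the reweighted SDD average land close to the full-population mean, while the raw SDD and UDD averages come out too low. No analogous reweighting is available for UDD, because the variable that drives the missingness is exactly the one that is not seen.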

Dark data can also be beneficial if they are detected and this leads to a reformulation of the question that we want to answer, or to the strategic elimination of some data that would bias the result. To avoid dark data, it helps to randomise the sample and even to hide data from the researcher (like not revealing who did and who did not get the placebo). An obvious way to fill in missing data is to use averages, but a somewhat strange piece of advice is to fill them in by simulation. This is a valid way to generate data for a simple model with a known probability distribution, but complicated models are simplifications of reality built upon observations that may themselves involve dark data. Similarly, machine-learning techniques involve massive analysis of data that may be corrupted. In these cases one can only hope that, over the repeated simulations or the whole learning process, the wrong data are not systematic and average out over the iterations. Or one may apply techniques such as boosting and bootstrapping to reduce bias. Bayesian statistics helps to test hypotheses and thus to confirm or refute intuitive assumptions. Cryptography can help to make data anonymous, so that people are more willing to provide correct data, or to prevent the introduction of false data by fake persons. It may even help to make some data deliberately dark: not making them available to users but still using them in computations.
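
To make the contrast between filling in by averages and filling in by simulation concrete, here is a minimal Python sketch. It is not an algorithm from the book: it assumes the simplest case mentioned above, a known (normal) distribution with values missing completely at random, and the function name compare_imputations and all parameters are hypothetical.

```python
import random
import statistics


def compare_imputations(n=5000, missing_rate=0.3, seed=1):
    """Fill NDD-style gaps by the mean and by simulated draws; compare the spread."""
    rng = random.Random(seed)
    # Hypothetical complete data from a simple, known model.
    complete = [rng.gauss(50.0, 10.0) for _ in range(n)]
    # Knock out values completely at random (None marks a missing value).
    data = [x if rng.random() >= missing_rate else None for x in complete]

    observed = [x for x in data if x is not None]
    mu = statistics.fmean(observed)
    sigma = statistics.stdev(observed)

    # Mean imputation: every gap receives the same value.
    mean_filled = [mu if x is None else x for x in data]
    # Simulation-based imputation: every gap receives a draw from the fitted model.
    sim_filled = [rng.gauss(mu, sigma) if x is None else x for x in data]

    return {
        "sd_complete": statistics.stdev(complete),
        "sd_mean_filled": statistics.stdev(mean_filled),
        "sd_sim_filled": statistics.stdev(sim_filled),
    }


if __name__ == "__main__":
    for name, value in compare_imputations().items():
        print(f"{name:>15}: {value:.2f}")
```

Under these assumptions, mean imputation keeps the average but visibly shrinks the standard deviation, whereas the simulated values keep the spread close to that of the complete data. This only works because the model is simple and correct here; with a misspecified model the simulated values would inherit its errors, which is the caveat raised above.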

As mentioned above, the book is an exploration of a major problem in data analysis, with an attempt at classification, an analysis of causes and mechanisms, and to some extent suggestions for mitigation. However, most of the book consists of examples and particular cases that clearly explain the ideas. Nowhere is a concrete or general, well-defined statistical, mathematical, or algorithmic solution given. This is mainly a wake-up call that clearly points at a major problem that any scientist has to be aware of and should think about how to deal with. Certainly statisticians, applied mathematicians, and computer scientists, but in fact anyone dealing with data (big or not), should be well aware of the "darkness" of their data.

Adhemar Bultheel
Book details

This book describes how important it can be, in our treatment of data, that some of them are missing or fake. Conclusions derived from such corrupted data can be biased or wrong. How do we recognize dark data? How can we deal with the phenomenon? These are the questions that Hand addresses in this book. The approach is mainly descriptive, with an abundance of examples, mostly from data related to the UK situation. Suggestions are given, but no precise mathematical or statistical analysis or algorithm is discussed in detail.



9780691182377 (hbk), 9780691198859 (ebk), 9780691199184 (abk)
£ 26.00 (hbk)
