CID’s data treasure trove
Decades of data
Text: Peter de Jong
Illustration: Maaike Putman
No science without data. Data specialist Otto Lange ensures that the collected research data can also be found by other scientists.
The six cohorts of the CID have collected a huge amount of data over the past decades. One of the objectives of the CID is that these data can be interconnected and that other scientists can benefit from them. It is up to the team of the Connecting Data in Child Development (CD2) project to provide insight into this treasure trove.
It is a hell of a job – because how do you bring order to decades of data? And how do you then build a system that lets everyone extract data from the CID treasure trove? The magic word: metadata. Speaking is Otto Lange (born 1965), technical coordinator of CD2.
How should we picture such a huge search engine?
‘Our online search engine allows interested scientists to search the databases of the six CID cohorts. The first step was to map out all the data collected by hundreds of researchers over the years. That data is often hugely varied. Each discipline uses its own specialist terms, and the same data is sometimes known under different headings. So, it was quite a job. This was followed by consultations on the so-called metadata – the data that makes up our search catalogue. In the end, the CD2 project took more than three years and described the developmental data of 186,400 children.’
Metadata? What exactly is that?
‘Metadata are descriptions of data. Data includes questionnaires, IQ scores, videos, brain scans, and DNA material. Metadata are the characteristics of that data: for example, who collected it, and as part of which study. They may also include background information about the participants or the device and settings used for measurement. So, basically, it is data about data. This is important information that you want to know as a scientist if you want to reuse other people’s data. Metadata also describe the other terms under which the data is known, which is particularly important if you want to look up data.’
Can you give an example?
‘Suppose a behavioural scientist from Groningen is researching the social well-being of adolescents from Groningen during the Covid-19 pandemic and is looking for comparison material in the rest of the Netherlands. To direct her to the right studies where she can find that data, we use harmonised search terms – metadata, in other words. This means that, in agreement with the scientists who collected the data, we use the same set of terms about child development for all cohorts. There has to be agreement on what we include under mental health: not just depression or anxiety disorder, but perhaps also happiness, resilience, you name it. This will allow the researcher to see the related data collected in the different cohorts. She can then still filter on all sorts of aspects, such as age of the participants or year in which the data was collected – useful if, for example, you only want data from adolescents during the pandemic.
‘In the search engine, the researcher from Groningen will probably – this is off the top of my head – find a link to down-hearted adolescents in the RADAR study or to resilient adolescents in the Generation R project. To retrieve the actual data, she will then have to turn to the data managers of RADAR and Generation R.’
Is that not cumbersome?
He laughs: ‘At the moment, yes. In the future, we hope to link this to data release portals, so that the actual data can be accessed directly via our search engine. YOUth and L-CID are already hard at work on this, so it is coming. The first important step is being able to find the data in the first place, followed by releasing that data. What makes the latter more difficult is that it involves children’s sensitive data. Perish the thought that hackers, for whatever reason, might get hold of child data. All possible risks must be eliminated.’
What are the next steps?
‘To make CID data as widely available as possible, we are linking our catalogue to ODISSEI, the national social sciences data platform. ODISSEI contains much more metadata, including data from Statistics Netherlands. This presents great opportunities for researchers. Other metadata systems around the world, such as the European CESSDA, will soon be able to retrieve our metadata. There are a lot of developments in this area at the moment, allowing us to learn from each other and to keep growing as a network.’
You have worked in computing since the beginning of the computer age. What is the current situation in terms of the amount of available scientific data?
‘It has exploded. The only question is to what extent it is used. My impression is that there is room for improvement. By nature, many researchers are primarily focused on their own research and often tend to collect new data themselves, which is a shame, because there might already be something they can benefit from. If you want to know something about the social effects of lockdowns on children, it is good to know that useful data was already collected during the pandemic. Together you can do more than alone; you just have to know how to find each other.’
Otto Lange is a member of the CD2 team and a metadata expert at the Utrecht University library.
This article is part of a New Scientist special issue about the Consortium on Individual Development that will appear in September 2023.