In this long read, we highlight three issues that arise out of the European Commission’s data strategy, which we believe require further thought as the EU considers how to design a positive vision of data governance that works for people and societies.
The data strategy’s aim of ‘digital transformation’ has been accelerated by the COVID-19 crisis that has consumed the intervening months, and the crisis has underscored the centrality of access to data. It will undoubtedly influence the direction that the Commission takes, and an adjusted timetable for the European Commission’s 2020 work programme has been published.
But the immediate lessons of the pandemic for data governance and digital policy do not obviate the need for careful consideration of the strategy proposed by the Commission, which envisions a ‘genuine single market for data’, with interconnected ‘data spaces’, common pools of sectoral data and the cross-sector use of health, mobility and agriculture data, as well as personal data. In fact, the crisis underscores our knowledge that people do not experience the benefits of digital transformation equally, and reminds us of the sociotechnical nature of data-driven systems.
1. Does the distinction between personal and non-personal data need rethinking?
The strategy envisages ‘a single European data space – a genuine single market for data, open to data from across the world – where personal as well as non-personal data, including sensitive business data, are secure and businesses also have easy access to an almost infinite amount of high-quality industrial data, boosting growth and creating value, while minimising the human carbon and environmental footprint.’
The idea that both personal and non-personal data can be better shared and accessed requires some further analysis. The Commission’s approach suggests either that there’s no tension between the different types of data regimes (personal data, non-personal data, open data), or that this tension can be easily resolved.
Under European Union data regulation there are two separate rules: one regulating personal data processing and one addressing non-personal data. The definition of non-personal data is ‘everything that is not personal’. However, personal data is a dynamic and evolving concept, and flexibility in the definition of personal data is important. New technologies, new types of processing and the ever-growing amount of data pose continual challenges to defining what’s personal data and what’s not. There are some constants though: wherever certain information leads to singling out an individual, protective measures are triggered for the preservation of fundamental rights.
All this means that operating with the personal/non-personal distinction at regulatory level, and consequently in the data spaces model, is not straightforward. Rapid technological developments and the increase of data analytics capabilities have made it possible to identify an individual from very few data points or from anonymised datasets. Research has consistently shown (see examples from Science and the Computational Privacy Group) that even when datasets do not contain direct identifiers, anonymous data can be reverse engineered, making it possible to link data back to a person. In one example, a study showed that collating only 10 URLs from an anonymous dataset is enough to uniquely identify an individual. Handling anonymised data still entails huge risks for individuals, and anonymisation techniques are underdeveloped.
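To make the re-identification risk concrete, the sketch below measures how many records in a dataset become unique once a few seemingly harmless attributes are combined. It is an illustration only: the dataset, the attribute names and the `unique_fraction` helper are hypothetical and are not taken from the studies mentioned above.

```python
from collections import Counter

# Hypothetical toy dataset: no names or direct identifiers,
# only attributes that look harmless on their own.
records = [
    {"postcode": "1000", "birth_year": 1980, "gender": "F"},
    {"postcode": "1000", "birth_year": 1980, "gender": "M"},
    {"postcode": "1000", "birth_year": 1975, "gender": "F"},
    {"postcode": "2000", "birth_year": 1980, "gender": "F"},
    {"postcode": "2000", "birth_year": 1980, "gender": "F"},
    {"postcode": "2000", "birth_year": 1990, "gender": "M"},
]


def unique_fraction(rows, attributes):
    """Share of rows whose combination of `attributes` appears exactly once."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each added attribute can only keep or increase the share of records
# that are singled out, even though none of them is a direct identifier.
for attrs in (["postcode"], ["postcode", "birth_year"],
              ["postcode", "birth_year", "gender"]):
    print(attrs, round(unique_fraction(records, attrs), 2))
```

In this toy example no record is unique by postcode alone, but most become unique once all three attributes are combined. The same logic, at far larger scale, is why the absence of direct identifiers does not make a dataset safe.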
Accepting that strictly non-personal data (e.g. data about how machines operate) exists, it’s clear that the practical challenges posed by the personal/non-personal data dichotomy, and by managing anonymised or aggregated data, require the application of a high bar of security and protection, regardless of the type of data used. In a growing number of circumstances, the risks of handling non-personal or anonymised data are as acute as those of handling personal data under the GDPR.
The Data Strategy states that organisations contributing data to data pools will ‘get a return in the form of increased access to data of other contributors, analytical results from the data pool, services such as predictive maintenance services, or license fees’. This seems to relate to machine data. However, it is unclear whether the same approach will be taken for health data, for example, or how the potential risks or benefits might be managed.
Manufacturing data and health data can be imagined at opposite ends of a scale of non-personal and personal data. Can data spaces be designed or treated in the same way? If the answer is yes, wouldn’t it be necessary to deploy the highest degree of protections irrespective of the data regulation regime? If the answer is no, then what are the guarantees against re-identification of individuals or collective harms from abusing aggregate datasets?
2. Should we focus on data quality (or relevance), rather than data quantity?
‘The Commission is convinced that businesses and the public sector in the EU can be empowered through the use of data to make better decisions.’
‘The availability of data is essential for training artificial intelligence systems, with products and services rapidly moving from pattern recognition and insight generation to more sophisticated forecasting techniques and, thus, better decisions.’
The Data Strategy veers close to positioning data availability as a panacea for societal problems and for better decision making, asserting that data is the key to ‘more prosperous and more sustainable societies’. However, having more data does not in itself guarantee better decision making. We should also look at the quality of data, which needs to be supported along three different axes: quality of information, quality of process and quality of governance.
Quality of information
The informational quality of data is important. In some cases, only a specific, relevant type of data or only a diverse or representative sample of data will yield good results for the decisions we want to reach – not just bulk, unplanned collections of data.
Also, if we overburden our systems with data en masse, this will surely entail expensive processing operations, but will not necessarily lead to efficient or bulletproof results. It is generally agreed in the AI community that feeding more data to machine learning models has diminishing returns after a certain threshold has been reached in the training process. More and more research effort is being directed today towards reducing the quantity of data required by algorithms and improving their ability to derive better understanding from smaller data samples. This approach is not only a promising way to overcome the limitations of our current data-intensive methods, but it is also better aligned with the values the EU seeks to promote. Therefore, we should aim to incentivise research and develop systems that use less, higher-quality data instead of encouraging systems of indiscriminate mass data collection.
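The diminishing returns described above can be illustrated with a simple calculation. The sketch assumes that model error falls as a power law of the number of training samples. That shape, the parameter values and the function names are all illustrative assumptions, not figures from the Data Strategy or from any particular study.

```python
def assumed_error(n_samples, a=1.0, b=0.3):
    """Illustrative learning curve: error = a * n^(-b).

    The power-law form and the values of `a` and `b` are assumptions
    chosen for illustration only.
    """
    return a * n_samples ** (-b)


def gain_from_doubling(n_samples):
    """Error reduction obtained by doubling the amount of training data."""
    return assumed_error(n_samples) - assumed_error(2 * n_samples)


# Each doubling costs as much data as everything collected so far,
# yet under this assumption it buys a smaller improvement every time.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(assumed_error(n), 4), round(gain_from_doubling(n), 5))
```

Under this assumed curve, the cost of collecting and processing data grows with every doubling while the benefit shrinks, which is the economic case for investing in data quality and relevance rather than volume.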
Quality of process
Data synchronisation, cleansing, tagging, interpretability and accuracy are important elements for ensuring high-quality data processing, and significant resources need to be invested in operationalising and standardising these processes. Unbridled support for data sharing is problematic in the context of vast differences in infrastructure, personnel, capabilities, security and understanding across European Union member states. Sustained, long-term investment in local infrastructure and skills is needed to make sure that data is properly processed and managed. Putting pressure on public authorities in all 27 countries to share more data, without significant local investment, could lead to severe discrepancies and become a point of exploitation and abuse by malicious actors taking advantage of the weakest links in the EU data spaces ecosystem.
Quality of governance
Availability of data relies on robust data governance structures with secure, trustworthy, privacy-by-design and privacy-by-default mechanisms for large-scale data sharing. The European Commission asks whether there are ‘sufficient tools and mechanisms to [donate] your data’. A prerequisite to asking that question should be addressing the technical, operational and legal challenges that come with processing large-scale, potentially regional, potentially sensitive data, which is the EU Data Strategy’s aim. We need to learn important lessons from the current COVID-19 pandemic, which has shown how unprepared we are for handling data in times of emergency and has exacerbated challenges in terms of public trust, coordination, approaches, efficacy, ethical issues and rights interferences.
The European Commission seems to be aware of the great number of unknowns when it comes to achieving the full ‘transformation towards a data-agile economy’ and prefers an ‘agile approach to governance’ which allows flexibility, testing and troubleshooting. However, certain elements should be called out from the beginning. These would include a more advanced legal framework protecting against blanket data-enabled (automated) decision making, especially in the public sector. As mentioned above, we also need to work out the problems created by underdeveloped anonymisation techniques and by the personal/non-personal dichotomy, which makes it difficult to differentiate between the two and, consequently, to understand which legal regime applies.
Framing data narratives
Our work on the Rethinking Data programme shows there is often a tendency to frame and think about data as a resource from which value and knowledge can be extracted (see, for example, Puschmann and Burgess, 2014). A positive outcome of such a framing is the perception that big data can provide insights into decision making for a better society. However, framing data as a resource also yields a conception of data as a commodity, shaping our practices around it as extractive and competitive.
Whatever conceptions are created through this framing, they will inevitably be misconceptions because this framing simultaneously highlights certain qualities of data while obscuring others. When data is framed as a resource, its value is highlighted, but its socially constructed nature is obscured. This results in the ‘common yet false perception of value production in big data as natural, intermediary and unbiased’ (Kerssens, 2019). These framings pose meaning and value as inherent to data, and obscure the role of people in the collection, processing, analysing and understanding of data.
Data’s value does lie in the ability to help make better decisions and understand the world around us. However, this value does not lie dormant in data simply waiting to be extracted, and neither does having more data necessarily mean we will have more insight, more understanding and a more prosperous society.
3. Is investment in technologies and infrastructure enough?
The EU Data Strategy announces major investments in technologies and infrastructures that enhance data access and use. However, these investments will only be meaningful if they are matched by efforts to build data governance architectures, rules, norms and safeguards that ensure not only the availability of data, but also a functioning data ecosystem.
Data governance mechanisms should not be built solely for economic purposes ‘to capture the enormous potential of data in particular for cross-sector data use’, but with a larger set of legal, technical and ethical considerations in mind. To make it meaningful, the EU Data Strategy should be accompanied by data governance strategic milestones. At the same time, more clarity needs to be given to terms such as ‘data spaces’ or ‘data pools’. Would these literally mean common wells of data, or are they more like a virtual concept of data portability that enables the flow of data between different actors?
Governance would also mean more than just making available ‘high-value datasets’ free of charge, without restrictions, and via APIs as a ‘way to ensure that public sector data has a positive impact on the EU’s economy and society’. What is required is more than clear terms or licenses for access and re-use of data, open standards, interoperability, user feedback and co-participation or technical customisation, as referenced by the European Commission. These are all important elements of a good governance model, but the tension between personal and non-personal data, the use of mixed datasets, the lack of clarity about potential intellectual property rights over datasets and the overall potential exposures and harms force us to think further.
The need for a more thoughtful public debate
The European Commission starts its public consultation with the question: ‘Do you agree that the European Union needs an overarching data strategy to enable the digital transformation of the society? Yes/No’.
Developing answers to the questions in the public consultation calls for a wider and more thoughtful public debate than the restrictive questionnaire the European Commission has put forward. Most of the questions in the public consultation are closed and contain vague formulations that can only be answered with yes or no, agree or disagree, relevant or irrelevant. No meaningful commentary on the digital strategy can fit into the 200 to 500 character limit of the (few) open questions available.
However, as work at the Ada Lovelace Institute shows, there is the capacity to have a more meaningful, thoughtful public discussion about some of these issues. Our own deliberative conversations through the Citizens’ Biometrics Council and health data citizen juries have shown how it is possible to bring a wide variety of societal actors into a dialogue about the future of Europe’s data strategy. Could public deliberation about the future, particularly in the ‘new normal’ created and shaped by COVID-19, help inform the future vision for the data strategy, and prompt a wider societal debate about some of these issues?