On 11 December 2019, the Indian Government introduced the Personal Data Protection Bill (PDP Bill) in Parliament, following an extensive and fiercely contested public debate. While the Bill sets out the legal framework for the governance of personal data (proposing the creation of a regulator, the Data Protection Authority), it makes reference to non-personal data as well, providing for mandatory sharing of non-personal or anonymised data ‘to enable better targeting of delivery of services or formulation of evidence-based policies by the Central Government’. This marks a pointed, though by no means isolated, reference to non-personal data in the Indian context.
The National Strategy for Artificial Intelligence, for instance, contemplates making some types of government data available for the ‘public good’ and mandating corporations to share aggregated data, as a means of overcoming the hurdle of limited data access within India’s AI ecosystem. Elsewhere, the 2018-2019 Economic Survey of India likened data to a natural resource and stated that personal data, once anonymised, becomes a ‘public good’ that should be utilised for public benefit.
These references, however, do not amount to a comprehensive framework and non-personal data remains unregulated in India. In July 2020, a committee of experts (hereafter ‘the Committee’), appointed by India’s Ministry of Electronics and Information Technology (MeitY), proposed a governance framework for non-personal data (NPD framework). Its recommendations, inter alia, include proposals for a new regulation and regulatory authority for non-personal data.
In this piece, I will take a closer look at the various definitions, assumptions and justifications undergirding India’s emerging non-personal data regime.
The NPD framework defines non-personal data by exclusion, i.e. as all data that is not personal data – where the PDP Bill defines personal data as, ‘data about or relating to a natural person who is directly or indirectly identifiable, having regard to any characteristic, trait, attribute or any other feature of the identity of such natural person, whether online or offline, or any combination of such features with any other information, and shall include any inference drawn from such data for the purpose of profiling.’
More specifically, the Committee classifies non-personal data into three types: public data, such as ‘anonymised data of land records, public health information, vehicle registration data etc.’, community data, such as ‘datasets collected by municipal corporations and public electric utilities, datasets comprising user-information collected even by private players like telecom, e-commerce, ride-hailing companies’, and private data, like ‘inferred or derived data/insights, involving application of algorithms, proprietary knowledge’.
But this dichotomy, and envisioned relationship between personal and non-personal data, is concerning for a few reasons.
First, the reality of mixed datasets – containing both personal and non-personal data – and the inevitable overlap between the two means that a clear demarcation is not tenable. The language of the draft Personal Data Protection Bill alludes to mixed datasets falling under the ambit of personal data protection standards, much like the application of the GDPR in Europe.
This raises the question: is there such a thing as wholly non-personal data? Where data concerns non-human subjects, perhaps this is possible, but when data originates from an individual, the distinction is unclear, especially given the limitations of anonymisation. This is a challenge discussed even in the context of the GDPR, yet overlooked in the proposed legal framework, which is particularly worrying in light of the mandatory data sharing envisioned under the Bill.
Second, the well-documented possibility of reidentifying anonymised data makes the proposed distinction blurry at best. Computer scientists have demonstrated that individuals can be reidentified from datasets containing only coarse credit card metadata. Another study reidentified individuals based on movie preferences, demonstrating that ‘an adversary who knows a little bit about some subscriber can easily identify her record if it is present in the dataset, or, at the very least, identify a small set of records which include the subscriber’s record’. As capabilities to analyse and structure data have improved over the years, legal scholars have argued that the failure of anonymity and the development of reidentification techniques ‘lifts the veil that for too long has obscured privacy debates’.
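The mechanism behind such studies can be illustrated with a minimal sketch in Python. Everything here is an assumption made for illustration: the toy records, the film titles and the `candidates` helper are invented and are not taken from the studies mentioned above. The sketch only shows the general idea that each extra fact an adversary knows shrinks the set of matching 'anonymised' records.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

# Hypothetical "anonymised" ratings table: names removed, pseudonymous ids kept.
# All records are invented for illustration.
RECORDS: List[Record] = [
    {"id": "u1", "ratings": {"Film A": 5, "Film B": 2, "Film C": 4}},
    {"id": "u2", "ratings": {"Film A": 5, "Film D": 3}},
    {"id": "u3", "ratings": {"Film B": 2, "Film C": 1, "Film E": 4}},
    {"id": "u4", "ratings": {"Film A": 4, "Film B": 2, "Film E": 5}},
]


def candidates(records: List[Record], known: Dict[str, int]) -> List[str]:
    """Return ids of records consistent with every rating the adversary knows."""
    return [
        r["id"]
        for r in records
        if all(r["ratings"].get(title) == score for title, score in known.items())
    ]


# One known rating leaves several possible records.
one_fact = candidates(RECORDS, {"Film B": 2})

# A second known rating narrows the set further.
two_facts = candidates(RECORDS, {"Film B": 2, "Film A": 5})

print(one_fact)
print(two_facts)
```

In this toy table, one known rating matches three records while two known ratings match only one, even though no name appears anywhere in the data.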
Third, the negative definition of non-personal data is overly rigid. It assumes that non-personal data is a static concept when, in fact, the use and conception of data is contextual. Given the complexities of mixed datasets and reidentification techniques discussed above, it becomes crucial to assess the appropriate use and flow of data in the context of its proposed use case. This is far more complicated than simply depending on a static definition.
It is also important to note that there are stark contrasts in how personal and non-personal data are construed. While the justification of the use of non-personal data rests on the conception of public good, economic interest and commercial competitiveness, the basis for personal data protection in India is grounded in concepts of informational privacy, liberty and autonomy of individuals, particularly following a landmark judgment on the fundamental right to privacy in 2017.
In an attempt to provide granularity within the broad realm of non-personal data, the Committee categorises it into the three above-mentioned types, namely, public, community and private non-personal data. This fails on several counts. At the outset, the basis of such classification is inconsistent. While private and public non-personal data are defined on the basis of the entity collecting data, the third category of community data rests on the basis of who the data pertains to. Overlaps and tensions between different types are inevitable, particularly in the case where public or private entities collect data about communities, or in the case of public-private partnerships.
In the case of community data, the complications are even more layered, as a ‘community’ can be formed along different criteria. Some of these criteria flow through pre-identified groups or are visible and explicit, like the community arising from a particular geographic area, while other communities may come into existence unbeknown to the very individuals that make up that community through the process of algorithmic profiling and analytics.
This is an important limitation to reckon with. The draft framework intends to empower communities to exercise and enforce rights over non-personal data, through the institution of data trusts and via the fiduciary duties exercised by data trustees acting in the best interests of communities. However, it does not provide guidance on how the community will come to be defined.
The framework does not offer clarity on how to overcome the challenge of communities emerging without the knowledge of individuals, and does not consider how social inequalities, power dynamics and legacies of exclusion within communities will come to be translated into the exercise of these rights, particularly as they pertain to having control over how community data is used.
Overarchingly, the mechanisms for practical implementation of these community rights remain muddy. A glaring omission from the report is any articulation of how data trusts will come into existence, how they will be empowered, and how the ‘best interests’ of communities will be defined to guide the data trusts’ fiduciary duties.
Further, the interplay between key actors contemplated in the report, such as communities, data trusts and data custodians (i.e. the entities that collect, store, process and use data), remains unarticulated. This is a crucial part of the puzzle that needs to be addressed, especially as accountability mechanisms, responsibilities and composition of these structures need careful thinking, given internal power structures implicit in communities.
Rationale & underlying assumptions
Beyond the scope and definition of non-personal data, the justifications and underlying assumptions for why non-personal data needs to be regulated in the first place also merit analysis. The case for regulation is broadly argued along two lines.
First, regulation is said to be needed to unlock the economic value of data to benefit citizens, communities and businesses in India. The NPD framework observes that the proliferation of AI techniques has given rise to data-intensive businesses and services, where value is created by gaining insights from information that can be used for profit, or for social and public interest activities.
The Committee notes that a fundamental imbalance in the data industry means that only dominant players with network effects and access to large datasets have reaped the benefits of these technologies, to the exclusion of newer players. Noting that, in a data economy, ‘companies with the largest data pools have outsized, unbeatable techno-economic advantages’, the Committee recommends mechanisms for data sharing to rectify this imbalance.
While well intentioned, this argument stands on shaky ground for multiple reasons. The assumption that data sharing will lead to a more competitive playing field is intuitive, but not necessarily true. As scholars have pointed out, ‘in practice, differing circumstances – including knowledge, wealth, power, access, and ability – render some better able than others to exploit a commons.’
Further, this logical leap does not engage with the tension between making a marketplace more competitive and mandating that companies lose their competitive edge. By excluding an analysis of how the NPD framework will interplay with existing competition law standards in India, this argument is incomplete at best.
Similarly, the framework does not engage with the question of how mandatory data sharing will be harmonised with standards of intellectual property, as datasets fall under the definition of a ‘literary work’ under Indian copyright law, and hence are subject to copyright protections. This tension is neither acknowledged nor resolved.
The second argument in favour of regulating non-personal data is the need to preserve collective privacy and prevent collective harms in the age of AI and data analytics. While the current recommendation is for the proposed Non-Personal Data Authority to factor in considerations of collective privacy, the framework does not offer analysis or recommendations on how to conceive of collective privacy.
Nevertheless, this is an important step in the right direction, as pioneering work in the field of group and collective privacy notes, ‘the main obstacle to a workable concept of group privacy may in fact be institutional – we may lack the institutional configurations, the right actors may not currently be empowered to act and regulate, and the populations at risk of harm from the new data technologies may not have either the informational tools to become aware of the problem or the legal or rights instruments to seek redress’.
The Committee’s recommendation inches towards resolving at least one of these obstacles. A crucial consideration going forward is that any conception of communities and groups for the purposes of collective privacy must recognise that algorithmic profiling not only infers data about pre-identified groups, but also discovers and creates new groups – more often than not without the knowledge or consent of the individuals involved.
At the minimum, the NPD landscape in India is one that harbours confusion. There is little clarity as to how the regulation and regulator for non-personal data would interplay with those for personal data. This requires an urgent and cogent answer, especially as some aspects of data governance – such as the regulation of inferred data and the characterisation of mixed datasets – seem to fall squarely within the ambit of both. The interplay between the NPD framework and competition law and intellectual property also remains unaddressed – risking not only poor data governance, but also regulatory arbitrage.
Additionally, the conception of ‘data’ within the realm of non-personal data provides much food for thought going forward. A fundamental assumption that runs through the draft framework is that data can be priced, marketed and owned, which is at best reductive as it actively ignores the non-economic, non-price aspects of data, which have been explicitly recognised by the Supreme Court in KS Puttaswamy v. Union of India, and the Srikrishna Committee during the drafting of the PDP Bill.
Overarchingly, while the conversation around non-personal data is centred around the idea of promoting innovation, public good and decentralising power, the current framework seems to be getting in its own way. Hopefully, future versions of the framework, and the regulations that will follow, will engage with these challenges.
Image credit: Guirong Hao