Safety first?

Reimagining the role of the UK AI Safety Institute in a wider UK governance framework

Matt Davies, Andrew Strait, Michael Birtwistle

17 May 2024

There have been significant developments in the field of AI safety even since November’s global AI Safety Summit at Bletchley Park. The UK, United States, Japan and Singapore now have their own AI safety institutes and the European Union’s AI Office has begun to set itself up with a mandate to evaluate ‘frontier models’.[1]

With the next global summit taking place in Seoul next week, what have we learned about the state of AI safety in the interim and what role should these emerging national AI safety institutes play in the wider governance system for AI?

In this blog we focus on the UK’s AI Safety Institute as a model for other jurisdictions and suggest – drawing on forthcoming research – that its current approach is so far failing to provide appropriate assurance that AI systems are safe. Fixing this will require a renewed focus on context-specific evaluations of AI systems in collaboration with sectoral regulators and new statutory powers to replace the existing voluntary approach.

The state of AI safety

Ahead of the Bletchley Park AI Safety Summit, we noted that AI safety ‘is not an established term, and there is little agreement on what risks this covers’. We argued that AI safety should mean keeping people and society safe from the risks and harms that AI systems cause, which range from the local (specific incidents of misuse and system failure) to the structural and systemic. A regulatory and governance regime that addresses AI safety should aim to prevent and mitigate those harms, and provide people with opportunities to seek redress and challenge harms when they do occur.

World leaders at Bletchley acknowledged that urgent action must be taken to address the risks that advanced AI systems raise. But with the notable exceptions of the European Union and China, governments (including the UK) have so far chosen not to create binding regulations on advanced AI systems.

Instead, several countries like the UK have chosen to establish AI safety institutes (AISIs) to improve their understanding of the most powerful AI systems and the risks they could pose. Under one of the voluntary commitments secured by the UK prior to the Bletchley AI Safety Summit, Google, OpenAI, Microsoft and Anthropic committed to provide the UK AISI with pre-release access to their latest advanced models to conduct evaluations for various risks.

Half a year on, this practice of model evaluation has become the dominant approach for AISIs looking to understand AI safety. This involves testing the capabilities of an advanced model (for example, GPT-4 or Google Gemini) rather than testing specific products and applications built from these models as they are used in applied contexts (such as ChatGPT or Google’s AI-enhanced search).

Evaluating evaluations

Over the past few months we’ve spoken to dozens of leading figures working on AI evaluations in industry, civil society and academia, as well as experts in the governance and safety processes in sectors such as pharmaceuticals. The evidence we’ve collected suggests that, while evaluations have some value for exploring model capabilities, they are not sufficient for determining whether AI models and the products and applications built on them are safe for people and society in real-world conditions, for three main reasons.

Firstly, existing evaluation methods like red teaming[2] and benchmarking have a number of technical and practical limitations. They are easy to manipulate or game by training models on the evaluation dataset that will be used to assess the model, or by strategically choosing which evaluations to use to assess the model. In their current form, evaluation methods prioritise some kinds of risks over others and are particularly poorly suited to assessing the impacts of AI systems on minoritised communities.

Secondly, it matters which version of a model you are evaluating. Small changes to an AI product built on a foundation model can cause unpredictable changes in its behaviour, and may override safety features that the developers have put in place. This means evaluations of one version of a model (such as the version of GPT-4 initially made available for consumers via OpenAI’s API) may provide limited information about the safety of a later version made available to consumers, which may have undergone additional changes to its safety features. Worryingly, this leaves AISIs carrying out evaluations vulnerable to a ‘bait-and-switch’ dynamic, where evaluations of the models they are granted access to may have little bearing on the safety of the actual models being integrated into products serving billions of users.

Thirdly – and perhaps most fundamentally – the safety of an AI system is not an inherent property that can be evaluated in a vacuum. When conventional goods are tested for safety – such as drugs or engines for cars and aeroplanes – we can have a relatively high degree of confidence about how those goods will be used and the environments in which they will operate. This allows us to make grounded judgements about safety and – critically – efficacy, so we can understand the trade-offs involved in making a product available for wide release: if a new cancer drug doesn’t work as intended, or if the benefits don’t outweigh the costs, then it won’t be approved for sale.

Similarly, evaluations of a foundation model like GPT-4 tell us very little about the overall safety of a product built on it (for example, an app built on GPT-4). Making a determination about the safety of an AI system should involve assessing its impacts on its specific environment, including how it shapes the behaviour of humans interacting with that system. There are valuable tests to be done in a lab setting and there are important safety interventions to be made at the model level, but they don’t provide the full story.

Testing without teeth

These problems are exacerbated by the limitations of the voluntary framework within which AISIs and model developers are operating. Recent reporting has highlighted that voluntary agreements are fraying: three of the four major foundation model developers have failed to provide the requested pre-release access to the UK’s AISI for their latest cutting-edge models. This may include models significantly enhanced with advanced capabilities and design features – such as the advanced personal assistant products announced this week by Google and OpenAI – which other labs have suggested may be problematic.

Even when access is granted, this usually consists of prompting the model via an Application Programming Interface (API), with no ability to scrutinise the datasets used for training. This level of access allows for some evaluation of how the system behaves, but there are questions around the robustness, reliability and external validity of these types of evaluations. Without knowing what is in the dataset, it is very hard to assess whether a model has unforeseen or dangerous capabilities (or whether the provider could have done more to mitigate those at the pre-training stage).

The limits of the voluntary regime extend beyond access and also affect the design of evaluations. According to many evaluators we spoke with, current evaluation practices are better suited to the interests of companies than publics or regulators. Within major tech companies, commercial incentives lead them to prioritise evaluations of performance and of safety issues posing reputational risks (rather than safety issues that might have a more significant societal impact).

It’s worth contrasting this state of affairs with other sectors such as pharmaceuticals, where companies are compelled to meet standards set by regulators, both for quality and for evidence of safety and efficacy (e.g. clinical trial results). Companies do, in many cases, have input into regulatory standards, which can include collaborating with regulators to develop clinical trial endpoints (measures of effect). While this is a key process for keeping regulatory standards workable and up-to-date, regulators have the ultimate say in whether a product can be put on the market.

This brings us to the central problem with the existing voluntary regime for AI safety. AISIs, as currently established, do not have the legal powers to block a company from choosing to release their model, or impose conditions on release such as further testing and the introduction of specific safety mechanisms. To return to the pharmaceutical comparison above, this is analogous to testing new drugs according to standards set solely by a pharmaceutical company, with regulators having no input into those standards and no power to deny approval. In short, a testing regime is only meaningful with pre-market approval powers underpinned by statute.

The future of the UK’s AI Safety Institute

As demonstrated above, evaluations in their current form are not working as intended and the voluntary regime for access to AI models is being strained by non-compliance. Any clear-eyed assessment of the state of AI safety, and the role the UK AISI has performed in its first six months of existence, must acknowledge these realities.

It’s clear that there is a pressing need for independent public interest research into AI safety, as a counterweight to research designed by and for industry. There is also strong precedent for this in other sectors, where well-funded public institutions play an important role in producing information to support regulatory enforcement and civil society action. We think there are two things that need to happen to enable the UK AISI to play such a role.

Firstly, the UK AISI needs to be integrated into a regulatory structure with complementary parts that can provide appropriate, context-specific assurance that AI systems are safe and effective for their intended use.

This will mean changing several key aspects of how the UK AISI functions. Perhaps the most important area to focus on – and the one furthest from AISI’s existing comfort zone – will be working with empowered and resourced sectoral regulators to develop frameworks for testing AI products in particular contexts for safety and efficacy. This could include studying applications of advanced AI systems that the UK Government is piloting, such as the AI Red Box, to understand their risks and their impacts on human operators.

This will help provide other sectoral regulators, private sector companies and public sector institutions seeking to adopt AI with a clearer sense of whether these technologies work for their purposes. Working with regulators in this way can help AISI expand its remit to cover the people, processes and governance decisions behind advanced AI systems, as well as risks from systems that are not necessarily ‘frontier’.

Secondly, it’s clear from the experience of voluntary safety commitments that UK AISI – and any sectoral regulators it works with – will need new capabilities underpinned by legislation.

This should include powers to compel companies to provide AISI and regulators with access to AI models, their training data and accompanying documentation. This will need to incorporate information on the model supply chain, such as estimates of energy and water costs for training and inference, and labour practices. These access requirements could be unilaterally imposed as a prerequisite for operating in the UK – a market of nearly 70 million people – but aligning with international partners in the USA and EU should also be a priority.

Critically, though, it should also include powers for AISI and downstream regulators to block the release of models or products that appear too unsafe (such as voice cloning systems, which are already enabling widespread fraud). Conducting evaluations and assessments is meaningless without the necessary enforcement powers to block the release of dangerous or high-risk models, or to remove unsafe products from the market. As noted above, it is pre-market approval powers that give regulators in other sectors the leverage to set and enforce standards – so the presence of these powers will be vital to ensure that evaluation methods develop in a manner consistent with the public interest.

Delivering an effective AI governance ecosystem will not be cheap. Established safety-driven regulatory systems typically cost more than £100 million a year to run effectively, and the skills and compute demanded by effective AI regulation may drive this figure even higher. This may mean fees or levies on industry become necessary – a typical approach in other highly regulated sectors such as pharmaceuticals and finance, where paying for regulation is seen as a reasonable cost of doing business.

Comprehensive legislation will also be necessary, both to provide the statutory powers mentioned above and to fix other gaps in the UK’s regulatory framework. Over the course of the year, Ada will continue to evidence why these elements are needed, how they add up to a blueprint for proper governance and how they can be delivered through new laws. It’s welcome that the UK Government has now acknowledged that a credible governance regime will ultimately require legislation, and this should be an urgent priority for the next Parliament.

Conclusion

As global policymakers head to Korea, they must grapple with the reality that the speed at which companies are deploying state-of-the-art AI products across society has outpaced the ability of governments and even companies themselves to test these systems and manage their use.

This is bad for the public, who are already being affected by the misuse of these technologies and their failure to act as intended. It’s bad for businesses, who are being told to integrate the latest forms of AI into their products and services without meaningful assurance. And given how little we know about the longer-term societal impacts of these systems, it risks unleashing longer-term systemic harms in much the same way as the unchecked rollout of social media platforms did more than a decade ago.

The good news is that – unlike many of the problems currently facing the UK – this is a problem that is squarely within Government’s gift to solve. With meaningful changes to its wider governance and regulatory agenda, and new roles and powers for the UK AISI, the UK could lead the world in ensuring that powerful AI systems are effective and safe to deploy.

The authors are grateful to Elliot Jones, Connor Dunlop, George Lloyd-King and Dzintars Gotham for their input, and to Mahi Hardalupas, Julia Smakman, Willie Agnew, Lara Groves and Melissa Barber for their role in conducting research that this blog draws on.

Footnotes

[1] The term ‘frontier model’ is contested, and there is no agreed way of measuring whether a model is ‘frontier’ or not. We use the term in this blog simply because it is the term used by the UK AI Safety Institute.

[2] Red teaming is an approach originating in computer security. It describes approaches where individuals or groups (the ‘red team’) are tasked with looking for errors, issues or faults with a system, by taking on the role of a bad actor and ‘attacking’ it. For more information see: https://cset.georgetown.edu/article/what-does-ai-red-teaming-actually-mean/