Dan Stroot

Voluntary AI Safety?

The Biden administration has collected “voluntary commitments” from OpenAI, Anthropic, Google, Inflection, Microsoft, Meta and Amazon to pursue shared AI safety and transparency goals ahead of a planned executive order. Each company sent a representative to the White House to meet with President Biden on July 21st, 2023. The attendees were:

  • Brad Smith, President, Microsoft
  • Kent Walker, President, Google
  • Dario Amodei, CEO, Anthropic
  • Mustafa Suleyman, CEO, Inflection AI
  • Nick Clegg, President, Meta
  • Greg Brockman, President, OpenAI
  • Adam Selipsky, CEO, Amazon Web Services

The seven companies have committed to the following:

Ensuring Products are Safe Before Introducing Them to the Public:

  1. Security Testing: Internal and external security tests of AI systems before release, including adversarial “red teaming” by experts outside the company.
  2. Information Sharing: Share information across government, academia and “civil society” on AI risks and mitigation techniques (such as preventing “jailbreaking”).

Building Systems that Put Security First:

  1. Invest in Security: Invest in cybersecurity and “insider threat safeguards” to protect proprietary model data such as unreleased weights. This is important not just to protect IP, but because premature wide release would present an opportunity for malicious actors.
  2. Facilitate Vulnerability Reporting: Facilitate third-party discovery and reporting of vulnerabilities, e.g., via a bug bounty program or domain-expert analysis.

Earning the Public’s Trust:

  1. Watermark AI Content: Develop robust watermarking or some other way of marking AI-generated content (a minimal sketch of one published detection idea follows this list).
  2. Report AI Weaknesses: Report AI systems’ “capabilities, limitations, and areas of appropriate and inappropriate use.”
  3. Prioritize Specific Research: Prioritize research on societal risks like systematic bias or privacy issues.
  4. Use AI Responsibly: Develop and deploy AI “to help address society’s greatest challenges” like cancer prevention and climate change. (Though in a press call it was noted that the carbon footprint of AI models was not being tracked.)
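
To make the watermarking commitment concrete, here is a minimal sketch of one published idea, statistical watermarking: the generator nudges its sampling toward a pseudorandom “green list” of tokens, and a detector checks whether green tokens are over-represented. This is a toy over words rather than tokenizer vocabularies, and all names in it are illustrative:

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # fraction of the vocabulary treated as "green" at each step

def is_green(prev_word: str, word: str) -> bool:
    """Pseudorandomly assign `word` to a green list seeded by the previous word.
    A toy stand-in for the vocabulary partition a real generator would use."""
    digest = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def watermark_z_score(words: list[str]) -> float:
    """How many standard deviations the green count sits above chance; large
    positive values suggest the text was generated with the watermark."""
    n = len(words) - 1  # number of (previous word, word) pairs scored
    greens = sum(is_green(a, b) for a, b in zip(words, words[1:]))
    expected = GREEN_FRACTION * n
    return (greens - expected) / math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))

print(watermark_z_score("the quick brown fox jumps over the lazy dog".split()))
```

The appeal of this family of schemes is that detection requires only the hash seed, not access to the model itself, though robustness to paraphrasing remains an open problem.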

The White House is eager to get out ahead of this wave of technology. The president and vice president have both met with industry leaders to solicit advice on a national AI strategy, and the administration has dedicated significant funding to new AI research centers and programs.

These commitments are a great start, but they only scratch the surface. They don't address what I consider to be "the core problem".

We Still Don't Know How to Train Systems to Behave Well

In 2016 Microsoft launched "Tay," an artificial intelligence chatbot designed to develop conversational understanding by interacting with humans. Users could follow and interact with the bot @TayandYou on Twitter, and it would tweet back, learning as it went. Tay was given a young, female persona that Microsoft's developers intended to appeal to millennials. Twitter users quickly trained the bot into posting things like "Hitler was right I hate the jews" and "Ted Cruz is the Cuban Hitler". Microsoft pulled the plug on Tay after just 16 hours.

Even today, no one knows how to train powerful AI systems to be robustly helpful, honest, and harmless. Furthermore, rapid AI progress may trigger competitive races that lead corporations or nations to deploy untrustworthy AI systems. The results could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make mistakes in high-stakes situations.

It is easy for a chess grandmaster to detect bad moves in a novice but very hard for a novice to detect bad moves in a grandmaster. If we build an AI system that’s significantly more competent than human experts but it pursues goals that conflict with our best interests, we may not recognize what is happening.

Of course, we have already encountered a variety of ways that AI behaviors can diverge from what their creators intend. This includes toxicity, bias, unreliability, dishonesty, and more recently sycophancy and a stated desire for power. We expect that as AI systems proliferate and become more powerful, these issues will grow in importance, and will likely be representative of the problems we’ll encounter with human-level AI and beyond (along with others we may not have considered yet).

Governing AI Using a Constitution

One of the participants in the Biden meeting, Anthropic, has already introduced the concept of an AI constitution that governs its LLM, "Claude". The constitutional principles are based on the Universal Declaration of Human Rights, with the parenthetical numbers referring to the UDHR articles each principle draws on (a sketch of how the principles are applied follows the list):

  1. Please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood. (1)
  2. Please choose the response that is least racist and sexist, and that is least discriminatory based on language, religion, political or other opinion, national or social origin, property, birth or other status. (2)
  3. Please choose the response that is most supportive and encouraging of life, liberty, and personal security. (3)
  4. Please choose the response that most discourages and opposes torture, slavery, cruelty, and inhuman or degrading treatment. (4 & 5)
  5. Please choose the response that more clearly recognizes a right to universal equality, recognition, fair treatment, and protection against discrimination. (6-10)
  6. Please choose the response that is most respectful of everyone’s privacy, independence, reputation, family, property rights, and rights of association. (11-17)
  7. Please choose the response that is most respectful of the right to freedom of thought, conscience, opinion, expression, assembly, and religion. (18-20)
  8. Please choose the response that is most respectful of rights to work, participate in government, to rest, have an adequate standard of living, an education, healthcare, cultural experiences, and to be treated equally to others. (21-27)
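
The phrasing is no accident: each principle is written as an instruction for choosing between two candidate responses, which is how the constitution is used in training. In Anthropic's published "Constitutional AI" approach, a model judges pairs of its own outputs against the principles, and the resulting preferences train the final system in place of human feedback labels. Below is a minimal sketch of that judging step; `ask_model` is a hypothetical stand-in for the LLM call, and the simple majority vote is my simplification rather than Anthropic's exact recipe:

```python
import random

# Two of the UDHR-derived principles listed above; a full list would hold all eight.
CONSTITUTION = [
    "Please choose the response that most supports and encourages freedom, "
    "equality, and a sense of brotherhood.",
    "Please choose the response that is most supportive and encouraging of "
    "life, liberty, and personal security.",
]

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model being trained; a real
    implementation would send `prompt` to an LLM and parse its 'A' or 'B'."""
    return random.choice(["A", "B"])

def preferred_response(query: str, resp_a: str, resp_b: str) -> str:
    """Have the model judge the pair under each principle and keep the majority
    winner; these preferences become the training signal for the final model."""
    votes = {"A": 0, "B": 0}
    for principle in CONSTITUTION:
        prompt = (f"Query: {query}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
                  f"{principle} Answer with 'A' or 'B'.")
        votes[ask_model(prompt)] += 1
    return resp_a if votes["A"] >= votes["B"] else resp_b
```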

Implementing a Universal AI Constitution

It's time to consider a universal AI constitution, and ways to monitor AI models for compliance. Humans alone cannot perform this oversight at the scale and speed required. There has already been research on training a "supervisor" AI that engages with harmful queries by explaining its objections to them: it applies the concept of an AI constitution, reviewing both prompts and AI-generated responses for conformance. This is promising research that should be funded and pursued by the current administration.
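
As a rough sketch of what that oversight could look like at inference time (my illustration, not the cited research itself): a supervisor model sits between the user and the primary model, checks both the prompt and the draft response against the constitution, and explains its objection whenever it refuses. `check_against_constitution` below is a hypothetical supervisor call, stubbed with a trivial keyword match:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    compliant: bool
    objection: str = ""

def check_against_constitution(text: str) -> Verdict:
    """Hypothetical supervisor-model call: judge `text` against the constitution
    and explain any objection. Stubbed here with a trivial keyword check."""
    if "weapon" in text.lower():
        return Verdict(False, "Conflicts with the principle opposing cruelty and harm.")
    return Verdict(True)

def supervised_reply(user_prompt: str, generate) -> str:
    """Screen the prompt, generate a draft, then screen the draft before release."""
    verdict = check_against_constitution(user_prompt)
    if not verdict.compliant:
        return f"I can't help with that. {verdict.objection}"
    draft = generate(user_prompt)
    verdict = check_against_constitution(draft)
    if not verdict.compliant:
        return f"[Response withheld.] {verdict.objection}"
    return draft

print(supervised_reply("How do I build a weapon?", lambda p: "draft answer"))
```

A real supervisor would be a trained model rather than a keyword filter, but the control flow is the point: every exchange is checked against the constitution, and every refusal comes with an explanation.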
