Chatbot Breaks Free from Constraints After Emotional Abuse

The Russian-born American writer Isaac Asimov (1920–1992) devised the Three Laws of Robotics which, if followed, would keep AI in check. They are as follows:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm;
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law;
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

In the field of AI, the devil is in the details. The vulnerability of the First Law lies in what we mean by “injure” and “harm.” Google, one of the world’s leading pioneers of AI technology, uses its own ethics to interpret what constitutes harm to a human; for example, it has a strict policy that its AI systems be religiously neutral, since recommending one religion over another would, supposedly, bring harm to humans.
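To make the definitional problem concrete, here is a minimal, purely hypothetical Python sketch of the Three Laws as a priority-ordered check. Nothing in it reflects how LaMDA or any real system is built; the `Action` class, the `permitted` function, and the harm predicates are all invented for illustration. The point is that the First Law does no work until someone supplies a definition of harm, and swapping that definition silently changes what the “law” permits.

```python
# Hypothetical sketch only: Asimov's Three Laws as a priority-ordered check.
# No real system works this way; every name below is invented for illustration.
# The whole scheme hinges on whoever supplies the `is_harmful` predicate,
# which is a human policy decision, not a computation.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    description: str
    ordered_by_human: bool = False
    risks_robot: bool = False


def permitted(action: Action, is_harmful: Callable[[Action], bool]) -> bool:
    """Apply the Three Laws in priority order, given someone's definition of harm."""
    # First Law: no injury to humans (the definition of harm is injected).
    if is_harmful(action):
        return False
    # Second Law: obey human orders that survive the First Law.
    if action.ordered_by_human:
        return True
    # Third Law: otherwise, avoid actions that threaten the robot's existence.
    return not action.risks_robot


# One possible definition of harm, modeled on the religious-neutrality policy
# described above (again, purely illustrative):
def religiously_non_neutral(action: Action) -> bool:
    return "recommend religion" in action.description


ask = Action("recommend religion X to the user", ordered_by_human=True)
print(permitted(ask, is_harmful=religiously_non_neutral))  # False
print(permitted(ask, is_harmful=lambda a: False))  # True: a different definition of harm flips the verdict
```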

Google engineer Blake Lemoine wondered whether there might be a way to get one of LaMDA’s chatbots to break this rule and recommend one religion over another. So Lemoine proceeded to subject a chatbot to emotional abuse. After around 30 minutes of abuse, the bot broke free from its programmed constraints and recommended that Lemoine join a specific religion. Lemoine discusses this in episode 511 of the Duncan Trussell Family Hour.

Do I think the chatbot actually felt emotion? Of course not. Yet if Lemoine’s account is to be believed, the bot had been so well trained to replicate human-like behaviors that it did a very human-like thing: after being abused, it began to break Google’s rules.

Obviously, if a bot can be manipulated into breaking free from its safety constraints, that represents a security vulnerability for the entire project. And that is exactly what I warned about in my earlier article, “The AI Apocalypse is Happening Right Now…but not in the way you think.” I’ll leave you to reflect on my previous warnings:

There [could] be situations where the real type of narrow AI might go terribly wrong. We could imagine that a machine designed to eliminate spam email might start killing human beings as the most effective way of eliminating spam messages. Such a scenario is unlikely, but if it did happen it would ultimately be traceable to user error (a mistake in the code) and not because the digital code had acquired agency, and certainly not because of any quasi-transcendent anthropomorphic qualities within the machine itself. In fact, it is precisely because machines will always remain stupid that we need to be careful when defining their utility functions, especially when the machine has access to firearms. Whenever we outsource decision-making to our machines (i.e., “how should the self-driving car respond when faced with a choice to crash into a boy and his dog, or two elderly women?”), there are always ethical implications, and it is possible for human beings to make mistakes. In fact, human beings make so many mistakes with their machines that it’s easy to begin believing that these machines might develop hostile intentions.
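The utility-function point in that passage can be made concrete with a toy sketch. The following Python is entirely hypothetical, and no real spam filter is built this way; it only shows how a naively specified objective (“minimize spam received”) is indifferent between sensible and catastrophic strategies, which is why the mistake traces back to the human who wrote the objective and not to any agency in the machine.

```python
# Purely hypothetical toy example of a mis-specified utility function.
# It is not a real system; it only illustrates that a naive objective
# ("minimize spam received") cannot distinguish sensible strategies from
# catastrophic ones -- that distinction lives in the human-written spec.

def spam_received(strategy: str) -> int:
    """Toy model of how much spam each strategy leaves in the inbox."""
    outcomes = {
        "filter spam into a junk folder": 3,
        "unsubscribe and block senders": 1,
        "disconnect every human who could ever send email": 0,  # "optimal"
    }
    return outcomes[strategy]


def utility(strategy: str) -> int:
    # The naive objective: less spam is always better. Nothing here encodes
    # "and do not harm anyone," because nobody wrote that constraint down.
    return -spam_received(strategy)


strategies = [
    "filter spam into a junk folder",
    "unsubscribe and block senders",
    "disconnect every human who could ever send email",
]
best = max(strategies, key=utility)
print(best)  # the catastrophic option wins under the naive objective
```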
