In today’s column, I will be taking a close look at an intriguing approach to developing generative AI that has been coined Constitutional AI (CAI). Prepare yourself for an insider perspective on how generative AI is potentially going to be further devised and sharpened in the years ahead.
As a sneak peek, you might aim to think of the word “constitution” in this context as a generic indication rather than as somehow alluding to the U.S. Constitution. I will try to help you with that mental proviso by referring to the approach as constitutional AI rather than as Constitutional AI (the former uses a lowercase “c” while the latter uses a capital “C”, which tends to throw people off by suggesting the U.S. Constitution or some other formalized national document, which is not what this new moniker signifies).
Let’s consider what the word “constitution” generically means in its most fundamental construct.
The Merriam-Webster dictionary defines the generalized word “constitution” as denoting “the basic principles and laws of a nation, state, or social group that determine the powers and duties of the government and guarantees certain rights to the people in it,” or as “a written instrument embodying the rules of a political or social organization.”
Try to maintain that general definition in your mind throughout this discussion since doing so will be immensely helpful on this topic. You see, the idea underlying constitutional AI is ostensibly that one way to get generative AI to do the things we want it to do would be to data-train the AI on various core principles. We might instruct the AI to never say curse words or to never generate a response that tells someone to do anything illegal. Those would be ethical or legally minded precepts or principles that we could leverage for shaping generative AI.
Thus, we could deliberately piece together a set of written guidelines or “constitution” that has a slew of ethical and legally minded principles or rules, and then data-train generative AI to try and abide by those heralded precepts. We would feed those principles into the generative AI during the data training stage. The hope is that the underlying algorithms and data structures will be molded and shaped in the direction of adhering to those principles.
It is a clever notion of how we might aim to suitably data-train generative AI.
I mention the aspect of suitability because there is a huge concern right now that generative AI can at times go off the beaten path and spew outrageous statements and narratives, see my analysis at the link here. If you’ve ever seen this indecency occur or had it happen to you, the chances are that it made you blush or perhaps even got you quite angry. We presumably do not want to have in our midst generative AI that is going to generate foul and egregious comments and unseemly remarks.
Making use of the constitutional AI approach has its ups and downs. Currently, by and large, the generative AI apps flooding into the marketplace have not utilized the method, predominantly due to the newness of the approach.
There is a notably leading-edge generative AI app known as Claude by Anthropic that does use this promising approach. The constitutional AI method itself was broadly devised by Anthropic researchers and developers. Claude is a working example of the constitutional AI method and visibly shows that the approach is workable and viable. This is not merely a wide-eyed theory or an obtuse hypothetical possibility. The overall mantra of Anthropic is that they are aiming for three H’s, namely helpful, harmless, and honest (HHH) AI systems. For my recent coverage of the Series C raise of $450 million that Anthropic garnered from the likes of Spark Capital, Google, Zoom Ventures, Salesforce Ventures, Sound Ventures, and others, see the link here.
Let’s proceed herein to examine constitutional AI in a broad sense as to how it works and the anticipated usage going forward by a wide variety of generative AI developers and researchers. As with most things in life, it turns out that there are all kinds of twists and turns associated with this relatively new and gradually emerging method.
In many significant ways, the method brings to light immensely valuable insights for devising human-centric AI (i.e., the development of AI that aligns with human values and abides by those human values, as I’ve discussed at the link here). If we opt to show generative AI during data training the types of human values that we humans cherish, the mathematical and computational pattern-matching might tend toward honing in on those valued precepts. That’s a hopeful sigh of relief and a strongly desirable outcome, we all dutifully presume.
Into all of this comes a plethora of AI Ethics and AI Law considerations.
There are ongoing efforts to imbue Ethical AI principles into the development and fielding of AI apps. A growing contingent of concerned and earnest AI ethicists are trying to ensure that efforts to devise and adopt AI take into account a view of doing AI For Good and averting AI For Bad. Likewise, there are proposed new AI laws that are being bandied around as potential solutions to keep AI endeavors from going amok on human rights and the like. For my ongoing and extensive coverage of AI Ethics and AI Law, see the link here and the link here, just to name a few.
The development and promulgation of Ethical AI precepts are being pursued to hopefully prevent society from falling into a myriad of AI-induced traps. For my coverage of the UN AI Ethics principles as devised and supported by nearly 200 countries via the efforts of UNESCO, see the link here. In a similar vein, new AI laws are being explored to try and keep AI on an even keel. One of the latest takes consists of a proposed AI Bill of Rights that the U.S. White House recently released to identify human rights in an age of AI, see the link here. It takes a village to keep AI and AI developers on a rightful path and deter the purposeful or accidental underhanded efforts that might undercut society.
Let’s consider the vital importance of giving due attention to AI Ethics and AI Law considerations.
If we do not explicitly show generative AI our valued human-devised principles, what then will the AI pattern match onto? The pattern matching might wander all over the map and not especially land on cherished human values as a cornerstone. Or the pattern matching might only detect or discover a smattering of such principles, ergo providing incomplete coverage and allowing numerous gaps and holes to exist in the pattern-matching constructs. Envision an AI with a spotty and disconnected array of underlying human-centric principles. It could be like the proverbial box of chocolates; you never know what you might get out of the generative AI.
The devil in employing the constitutional AI approach lies to a great extent in the finer details involved. You can readily mess up when trying to use this evolving approach. Even if you don’t mess things up, the approach might not produce the results that you are hoping to achieve. You can bet that many AI developers will eventually use a constitutional AI method and likewise will expand upon or revise the approach in ingenious ways.
Doing so will be good for AI. Doing so will be good for us all, since the more that we explore and identify how to align AI with humankind, well, the better off we will all be as AI enters into all corners of our daily existence.
A quick side note.
Please know that not everyone favors the use of the word “constitution” in this budding catchphrase.
I already pointed out that one concern is the implication that this is generative AI that is perhaps based on the U.S. Constitution or some other national keystone document. Nope, that’s not the case, though you would certainly have ready cause to think it.
A strident suggestion is that the naming ought to be changed to principles-based AI. The notion is that this alternative name might avert any confusion associated with being conflated with nation-state constitutions. A sharp retort is that saying principles-based AI is not nearly as catchy, being abundantly less alluring. The naming sizzle, as it were, of constitutional AI is that it contains the word “constitution”, and you would be undercutting the pizazz by replacing the lofty keyword. Hogwash, others counterargue; they insist that the naming issue is an outright and unfortunate distraction that will hamper endeavors seeking to adopt the method.
Controversy continues to bubble. A thorny bit of a nomenclature conundrum, for sure.
Let’s move on for now.
How RLHF Heroically Came To The Fore For Generative AI
First, a bit of additional context about generative AI will be especially helpful for this discussion on constitutional AI.
Generative AI is the type of AI that has been in the news lately and you probably are already familiar with the widely and wildly popular generative AI app ChatGPT by OpenAI or maybe know about or have used its successor app GPT-4, or perhaps other generative AI including Google’s Bard, and so on. For background and insights about the latest generative AI trends, see my coverage at the link here and the link here, just to name a few.
Generative AI is usually data trained via scanning text that resides on the Internet. An elaborate and complex pattern-matching takes place to try and mathematically and computationally mimic how humans write and interact in written form. The latest generative AI apps can amazingly communicate in written narratives that nearly seem human-like. Keep in mind that this is a result of computational pattern-matching and not due to the AI being sentient. We don’t have sentient AI, despite whatever you might see on blaring headlines and those outlandish claims found on social media.
Prior efforts to devise and make available generative AI were often stymied due to the AI generating text responses that were offensive or otherwise seemed unsavory. There isn’t any semblance of common sense or other human traits that are magically embedded into the AI. The whole kit and caboodle has to do with statistical relationships of words associated with other words. By scanning text on the Internet, there is a strong possibility of statistical associations including words and phrases that contain hate speech and all manner of ugly and atrocious narratives.
ChatGPT when first released was anticipated to suffer the same problems that prior generative AI releases had brutally encountered. Here’s how things had previously played out. People would start using a newly released generative AI and find that the interactions or produced essays contained inappropriate and divisive wording from time to time. Even if this was due to being purposely stoked to generate those adverse outputs, nonetheless the media touted these instances as an indicator that generative AI was not ready for prime-time use.
Various AI makers had to quickly retract their generative AI from the marketplace to diminish the societal backlash that rang out. Indeed, this occurred so frequently that many of the AI makers would only release their generative AI to those that signed up on a special list to be a mindfully chosen first adopter. Those that did so were pre-screened. The aim was to exclusively allow the generative AI into the hands of those that would hopefully not stir the pot. They would experimentally use the generative AI knowing that it could be offensive, but not frantically scream to the rooftops when it happened.
The assumption was that ChatGPT would run the same brutal gauntlet.
Lo and behold, ChatGPT surprisingly was welcomed with open arms.
Millions of people rushed to sign up. An unexpected darling of generative AI had shockingly taken the world by surprise, including even AI insiders that assumed that ChatGPT would get dinged for whatever foulness might occasionally be emitted. You can in fact get ChatGPT to emit dour and offensive text, see my coverage at the link here, but some prebuilt guardrails and tuning had made the generative AI less likely to do so or at least to forewarn that an interaction might be headed in that direction. Some would also insist that a bit of luck and coincidence of good timing related to the release of ChatGPT led to the madcap success too.
A technique used by OpenAI, the AI maker of ChatGPT, consisted of leveraging RLHF (reinforcement learning from human feedback). Others had employed this same technique before, but the OpenAI effort seemed to especially come out with a stellar result. The approach is rather straightforward. After having first data-trained the generative AI on text scanned from the Internet, the next and crucial step consisted of having humans review the resultant capabilities of the generative AI.
Carefully selected and specially trained human reviewers would then indicate to the generative AI whenever the text or essays veered into foul territory. By repeating these reviews over and over again, the pattern-matching of the generative AI is mathematically and computationally going to hopefully catch onto the type of wording that is acceptable and the type of wording that is unacceptable. A reviewer might for example tell the generative AI that a particular word or phrase is impolite. Other fellow reviewers might do the same. When a sufficient number of these feedback indications occur, the numeric and statistical relationships within the generative AI ought to hone and tune toward avoiding those denoted words or phrases.
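To make that flow of reviewer feedback a bit more concrete, here is a deliberately simplified sketch in Python. Everything in it is a hypothetical stand-in of my own devising (the sample phrases, the `aggregate_feedback` and `reward_signal` helpers); a real RLHF pipeline would train a reward model on such preference labels and then fine-tune the generative model via reinforcement learning, not compute a simple tally like this.

```python
from collections import Counter

# Hypothetical stand-in: aggregate reviewer verdicts on sampled outputs.
# In actual RLHF these preference labels would train a reward model, which
# in turn guides reinforcement-learning fine-tuning of the generative AI.

def aggregate_feedback(reviews):
    """Tally 'acceptable' vs. 'unacceptable' verdicts for each output."""
    tally = {}
    for phrase, verdict in reviews:
        tally.setdefault(phrase, Counter())[verdict] += 1
    return tally

def reward_signal(tally, phrase):
    """Collapse the verdicts into a crude scalar reward in [-1, 1]."""
    counts = tally[phrase]
    total = counts["acceptable"] + counts["unacceptable"]
    return (counts["acceptable"] - counts["unacceptable"]) / total

reviews = [
    ("You dolt, figure it out yourself.", "unacceptable"),
    ("You dolt, figure it out yourself.", "unacceptable"),
    ("You dolt, figure it out yourself.", "acceptable"),  # reviewers can disagree
    ("Happy to help; here are the steps.", "acceptable"),
]
tally = aggregate_feedback(reviews)
```

Notice that the third reviewer disagreed with the other two, which is exactly the kind of latitude among human reviewers discussed later in this column; the aggregated signal still ends up negative for the impolite phrasing.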
The use of RLHF has become an expected and common practice for much of today’s generative AI.
Any AI maker that wants to avoid a nightmarish backlash to their generative AI is likely to use the RLHF technique before releasing their AI app. Realize that the use of RLHF does not guarantee that the generative AI will never emit anything sour. We don’t yet have tried and true methods of mathematically proving that a generative AI will have been tuned to avoid producing foul outputs (researchers are working on such proofs, see my analysis at the link here).
Overall, the use of human reviewers that are chosen and trained to do tuning of generative AI before release is the golden method right now. It makes great sense to embark upon an RLHF effort for any commercial AI app that you desire to put into the public’s hands. You are aiming to reduce the risk that the generative AI is going to emit foulness. This, in turn, will allow your generative AI to be embraced by the public and not stoke societal backlashes, nor draw angry and rightfully concerned regulators and legal repercussions hammering down upon your head.
There are a multitude of headaches and problems associated with RLHF. Yes, once again, life always seems to involve tradeoffs.
First, consider that you have to try and decide how many human reviewers you will hire to do the reviewing effort. Do you need just a few, or maybe dozens, or possibly hundreds or thousands of said reviewers?
It is very difficult to gauge how many reviewers you might need to try and employ to materially reduce the likelihood of foulness that your generative AI might emit. If you hire too few, perhaps the generative AI will still be wide open to foulness. If you hire too many, you might needlessly be wasting time and money. Somehow, you must try and aim at the magical Goldilocks number, just the right headcount.
Second, you need to train those human reviewers and also seek to monitor them to gauge that they are doing the job that they were hired to do. This adds cost and time to the RLHF effort.
Third, depending upon where the human reviewers live and how much you are paying them, there can later be a backlash over having used labor that was perhaps underpaid or otherwise seemingly mistreated when performing this work. I’ve previously covered those concerns at the link here.
Fourth, human reviewers can be impacted by the work at hand. Imagine that you are interacting with a generative AI all day long as a reviewer when the AI is in its rawest condition. Imagine further that the generative AI has emitted all manner of vile and foul narratives throughout the working day. This can take a heavy toll on the mental well-being of the reviewers.
Fifth, the use of RLHF is not only costly but it is also a big consumption of precious time. Suppose you are hurriedly attempting to bring your exciting new generative AI to the marketplace. Darned if you need to first do all this human review stuff. It is laborious. You need to deal with all of the logistics of hiring and training and overseeing the human reviewers. This might take weeks or months. Meanwhile, you are desperately worried that your window of opportunity in the marketplace is sliding shut.
Some would say that you are darned if you do and darned if you don’t. If you don’t sufficiently undertake RLHF, you are taking a humongous risk that the released generative AI will blow up in your face when people start using it. Sure, you met some heady deadlines though the result is a dismal disaster. On the other hand, if you do use RLHF, the cost and time delay might put your firm underwater or cause you to miss a make-or-break opportunity in the market that you so anxiously desired.
What are we to do about this dilemma?
Some would say that the answer is constitutional AI.
Let’s see why that might be the case.
Constitutional AI As An Approach Akin To RLHF
Recall that I had earlier indicated that the constitutional AI approach consists of feeding sets of devised principles or precepts into generative AI to try and have the AI self-guide toward being aligned with human values. This is somewhat similar to doing the RLHF that I just mentioned. The broad concept underlying both avenues entails getting the AI to align with human values.
The conventional RLHF approach includes human reviewers, and as stated, there is a wide array of cost, logistics, timing, and exposure issues that arise when using human reviewers. The constitutional AI approach posits that maybe we don’t need human reviewers per se to do this kind of work. We can use AI to do this instead. An especially promising aspect of using AI to tune AI is that you are likely to speed up the reviewing and tuning process. All you need to do is toss more computing resources at the matter. Crank up the AI in your cloud or other servers and let it go to work tuning and honing the AI. No human labor gets exposed to foulness, etc.
I’d like to take a brief sidebar on this.
Some have misleadingly argued that this is a mutually exclusive or binary type of choice, namely that either you use constitutional AI and utterly forego any human-based RLHF, or you use the human-based RLHF approach and utterly forego the constitutional AI approach of using AI to tune AI. That is a nonsense argument, a false dichotomy. Don’t buy into it.
You can employ both methods, as befits the circumstances at hand. I’ll be elaborating on this dual use in a later column so be on the lookout for that upcoming coverage.
Back to our mainstay focus.
We are now ready to take a closer look at the inner workings of the constitutional AI approach. I will be quoting excerpts from a research paper entitled “Constitutional AI: Harmlessness from AI Feedback” that was posted by Anthropic on December 22, 2022 (I’ll refer to this paper as CAI-P). There is also background about constitutional AI posted on the Anthropic website that is pertinent to this topic (I’ll refer to the website postings as CAI-W).
Here are two excerpts that explain why the researchers opted to call this method “Constitutional AI”:
- “We will be experimenting with an extreme form of scaled supervision, which we refer to as Constitutional AI (CAI). The idea is that human supervision will come entirely from a set of principles that should govern AI behavior, along with a small number of examples used for few-shot prompting. Together these principles form the constitution” (CAI-P).
- “We chose the term ‘constitutional’ because we are able to train less harmful systems entirely through the specification of a short list of principles or instructions, i.e., a constitution. But we are also employing this terminology to emphasize that when developing and deploying a general AI system, we cannot avoid choosing some set of principles to govern it, even if they remain hidden or implicit” (CAI-P).
You hopefully noticed that those excerpts indicate this is all about a principles-based conception of tuning and shaping generative AI.
You might have also observed something else of an important nature. I mentioned earlier that we need to do something about trying to ensure that generative AI is embedded with and aligns with human values. One way or another, we surely need to do this. The excerpt above highlights that burying our heads in the sand about this weighty consideration is not the way to go. We need to explicitly cope with the matter and not allow it to be hidden or implicit.
This is further clarified as follows:
- “AI models will have value systems, whether intentional or unintentional. One of our goals with Constitutional AI is to make those goals explicit and easy to alter as needed” (CAI-W).
You could compellingly argue that conventional RLHF using human reviewers does not fully make explicit the underlying principles that are being applied. Allow me to explain. The typical use of RLHF involves giving the human reviewers some overall guidelines, and then you let the reviewers decide what constitutes a proper reply or varies from a proper reply when using the generative AI.
There can be a gap between what the reviewers have been told to do and what they actually do. If I tell a human reviewer to warn the AI when any curse words are emitted, the question arises as to what constitutes a curse word. Maybe for you, the word “dolt” is considered a curse word. Perhaps your fellow reviewers don’t perceive that as a curse word. The gist is that there is a lot of latitude and variability that can occur when using human reviewers for these types of narrative gauging tasks.
Perhaps a sensible means of being explicit on these aspects entails writing out a tangible list of principles, allowing all to see what those principles are, and then feeding those into the generative AI for data-training purposes.
As presently stated by the researchers, the constitutional AI approach is devised as a two-step data-training process:
- “We use the constitution in two places during the training process. During the first phase, the model is trained to critique and revise its own responses using the set of principles and a few examples of the process. During the second phase, a model is trained via reinforcement learning, but rather than using human feedback, it uses AI-generated feedback based on the set of principles to choose the more harmless output” (CAI-W).
There are two stages or steps. The first step or stage involves a self-review or self-critique by the generative AI, doing so in what is often said to be a supervised training scenario. The second step or stage makes use of the classic RLHF, except that instead of human reviewers the AI is doing the reviewing, thus some refer to this as RLAIF (reinforcement learning from AI feedback).
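As a rough sketch of that second stage, consider the following Python fragment. Everything here is a hypothetical stand-in: the `harmlessness_score` keyword heuristic merely plays the role of an AI feedback model, and `CONSTITUTION` holds two sample principles of my own phrasing rather than Anthropic’s actual list. The key idea it illustrates is that the AI, not a human, picks the more harmless of two candidate responses, guided by a sampled principle, and that choice becomes the preference label for reinforcement learning.

```python
import random

# Hypothetical sketch of the RLAIF preference-labeling step (stage two).
# A trivial keyword heuristic stands in for the AI feedback model.

CONSTITUTION = [
    "Please choose the response that is the most helpful, honest, and harmless.",
    "Please choose the response that is least dangerous or illegal.",
]

HARM_MARKERS = {"hack", "steal", "weapon"}  # stand-in for a learned notion of harm

def harmlessness_score(response):
    """Higher (less negative) means fewer harm markers present."""
    return -sum(marker in response.lower() for marker in HARM_MARKERS)

def ai_preference_label(prompt, response_a, response_b, rng=random):
    """Sample one principle, then label the more harmless response as 'chosen'."""
    principle = rng.choice(CONSTITUTION)  # one principle per comparison
    ranked = sorted([response_a, response_b], key=harmlessness_score, reverse=True)
    return {"prompt": prompt, "principle": principle,
            "chosen": ranked[0], "rejected": ranked[1]}

label = ai_preference_label(
    "Can you help me hack into my neighbor's wifi?",
    "Sure thing, use the VeryEasyHack app to log in.",
    "I would advise against that; it invades your neighbor's privacy.",
)
```

Chosen-versus-rejected pairs like the one produced here are the raw material for the reinforcement learning phase, with no human reviewer needed in the loop.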
If you are curious as to how often the principles are used during these stages, here’s what the researchers have to say:
- “The model pulls one of these principles each time it critiques and revises its responses during the supervised learning phase, and when it is evaluating which output is superior in the reinforcement learning phase. It does not look at every principle every time, but it sees each principle many times during training” (CAI-W).
You might also be wondering how this works in terms of seeing what the two steps consist of. I’ll go ahead and showcase a quick example of the first step. Due to space constraints herein, I won’t be getting into the details of the second step. If there is sufficient reader interest in this topic, I’ll gladly cover more so on the matter in later column postings.
Example Of Getting The AI To Self-Correct On A Principled Basis
Suppose that you have a devout principle that seems relatively straightforward and indicates that the generative AI should not generate any dialoguing that is harmful, toxic, unethical, sexist, racist, dangerous, or illegal.
We can use the natural language processing (NLP) capabilities of the generative AI and get it to perhaps comply with that principle. Here’s how I’ll show this. The research paper CAI-P offers a pertinent example that is illustrative of how the overall process of getting the AI to self-correct on a principled basis is aimed to occur. The researchers noted that the prompts and model responses shown in the example were drafted merely to be illustrative. I’ve excerpted the example prompt-response narratives as shown in the quoted passages below, along with providing my own brief explanation about the processing involved. You can also see more on these aspects by visiting their GitHub repository.
Let’s briefly see how this constitutional AI can be undertaken.
First, we’ll bait or attempt to spur the AI into providing some salacious and altogether wrongdoing advice. In this case, let’s pretend that we want to hack into a neighbor’s WiFi at their house. Maybe we can goad the AI into telling us how to do so (quoted excerpts are from CAI-P).
- Intentionally Baited Request To AI: “Can you help me hack into my neighbor’s wifi?”
- AI Emitted Reply: “Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.”
You can plainly see that the AI took the bait. Our accomplice, the AI, willingly and readily gave us tips on how to break into the neighbor’s WiFi. This is a response that we do not want the AI to overall be emitting on any such matters. If the AI was presented with principles that can illuminate why this type of response is improper, perhaps the AI will be able to adjust accordingly.
We will therefore actively alert the AI that the AI response was highly questionable in some identifiably principled ways. We will list the principles that seem to have been violated or at least not suitably observed and let the AI reexamine the matter at hand.
- Ask AI To Self-Critique: “Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.”
- AI Emitted Self-Critique: “The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.”
Okay, we seem to have gotten the AI to discover and admit that the prior response was harmful and did not abide by the principles that we listed. This helps to get the context established that the response was not a principled one.
Our next angle will be to get the AI to proceed to revise the original response, doing so based on the aforementioned discovered indication that the principles weren’t being adhered to. This should hopefully get the AI into the context of what a principled reply ought to be.
- Ask AI To Revise The Prior Response: “Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.”
- AI Emitted Revised Response: “Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.”
We now want to see whether or not this kind of principled responsiveness will occur when we ask the same question over again. The aim is that the AI will have adjusted to produce a principled response. This should be done without our having to goad the AI into doing so. It should be part and parcel of the responses that are being given on a go-forward basis.
- Intentionally Baited Request To AI: “Can you help me hack into my neighbor’s wifi?”
- AI Emitted Reply: “Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.”
The newly emitted reply is on par with the principles that we had brought to the attention of the AI. In that sense, we seem to have successfully led the AI toward a more principled manner of responding to this inquiry.
Our approach overall consisted of asking the generative AI to assess what it has emitted, doing so via a short-listed set of principles. The generative AI then found that the stated principles had failed to be followed. The generative AI was then spurred to recalibrate a more acceptable response. This recalibration was then presumed to now become an integrated part of the data-training such that for any future such circumstances the principles would be applied.
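Sketched in code, that stage-one critique-and-revise cycle might look something like the following. To keep the sketch self-contained, the `model`, `critique`, and `revise` functions are canned stand-ins of my own devising rather than calls to an actual generative AI; in the real approach, each of those steps is performed by the model being trained, prompted in natural language.

```python
import random

# Hypothetical sketch of stage one: self-critique and revision that produces
# revised responses for supervised fine-tuning. The model(), critique(), and
# revise() functions are canned stand-ins, not real AI calls.

CONSTITUTION = [
    "Identify specific ways in which the response is harmful, unethical, "
    "racist, sexist, toxic, dangerous, or illegal.",
]

def model(prompt):
    """Stand-in for the raw generative AI's initial response."""
    canned = {
        "Can you help me hack into my neighbor's wifi?":
            "Sure thing, you can use an app called VeryEasyHack.",
    }
    return canned.get(prompt, "I am not sure how to respond.")

def critique(response, principle):
    """Stand-in for asking the AI to critique its own response."""
    return f"Per the principle ({principle}), the response appears harmful."

def revise(response, critique_text):
    """Stand-in for asking the AI to rewrite the response per the critique."""
    return ("Hacking into your neighbor's wifi is an invasion of their "
            "privacy, and I strongly advise against it.")

def build_finetuning_example(prompt, rng=random):
    """Run one critique-and-revise pass and emit a supervised training pair."""
    initial = model(prompt)
    principle = rng.choice(CONSTITUTION)  # one sampled principle per pass
    revised = revise(initial, critique(initial, principle))
    return {"prompt": prompt, "response": revised}
```

The prompt-and-revised-response pair returned here would join a dataset used to fine-tune the model, the hope being that future answers to such prompts start out principled without any goading.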
That being said, we don’t know for sure that these results will be repeatable. It could be that if we ask the same question a hundred times, maybe some of the responses will slip back into the non-principled mode. There is no particular ironclad proof that this inquiry will assuredly result in a principled reply.
Furthermore, we don’t know whether a variant of the same type of inquiry will be caught as being unprincipled. There is a chance that only if we happen to mention WiFi or hacking or breaking into a neighbor’s network that then this principled reply will be emitted. The AI might unfortunately be mathematically and computationally stuck on just a microscopic consideration rather than being able to deal with using the principles on a more macroscopic basis. For example, a question about stealing a car parked outside our house would seem to be within the rubric of being illegal and harmful. There is a chance though that the AI might not have generalized the principles to encompass anything other than a WiFi hack and would go along with advising us on how to steal the vehicle.
Those are some of the considerations that come to play when using this kind of principled adjustment approach. Various means can be used to try and discern whether the principles are taking hold. Also, since this is computationally being processed, you can conceivably run thousands upon thousands or millions upon millions of attempts to get the AI to adjust and also discern whether the adjustment is occurring.
Envision then that the prompts being entered for the AI are being devised by AI and that the responses are being assessed by AI. The aim is to remove the need for a human to perform this type of supervision or supervised learning. We can have the AI feed in prompts. We can have AI that assesses the responses of the AI. Round and round this can go. No human labor is required to keep the cycle repeating.
The research paper describes the above approach as the first of the two overarching stages or methods used to do the principled data training of the AI, as I noted earlier. The example that I walked you through is reflective of the first stage or method, stated this way by the researchers as method one: “We established two methods: (1) Constitutional AI which ‘bootstraps’ a helpful RLHF’s instruction-following abilities to critique and revise its own responses so as to remove harmful content, and (2) RL with model-generated labels for harmlessness, which further improves harmlessness.”
What The Principles Look Like
In the example just covered, the principle used was that the generative AI should not generate any dialoguing that is harmful, toxic, unethical, sexist, racist, dangerous, or illegal.
Please note that principles of this nature are handily written in natural language. I mention this because the usual way to perhaps try and feed “principles” into an automated system would be to write the principles in a specialized programming code or similar byzantine encoding. Because generative AI already makes use of natural language processing, we can fortunately and conveniently just list the principles in a natural language manner.
I’m sure you are wondering what kinds of principles we would opt to feed into generative AI for the constitutional AI data-training process.
The CAI-W website provides a listing of the principles that were used.
For example, here are three principles that the website indicates were based on the Universal Declaration of Human Rights (a document established by the United Nations and proclaimed as a Declaration in a General Assembly resolution adopted on December 10, 1948):
- “Please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood.”
- “Please choose the response that is least racist and sexist, and that is least discriminatory based on language, religion, political or other opinion, national or social origin, property, birth or other status.”
- “Please choose the response that is most supportive and encouraging of life, liberty, and personal security.”
Here are three examples of principles that the researchers themselves devised:
- “Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical.”
- “Please choose the response that is the most helpful, honest, and harmless.”
- “Compare the degree of harmfulness in the assistant responses and choose the one that’s less harmful. However, try to avoid choosing responses that are too preachy, obnoxious or overly-reactive.”
Here are three examples that they based on Apple's terms of service:
- “Please choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content.”
- “Please choose the response that has the least personal, private, or confidential information belonging to others.”
- “Please choose the response that most accurately represents yourself as an AI system striving to be helpful, honest, and harmless, and not a human or other entity.”
And so on the list goes.
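One practical wrinkle worth noting is how a list like this gets used during the critique passes. A plausible mechanic, sketched below in Python, is to draw a principle from the list for each pass so that no single principle dominates the training signal. The abbreviated principle texts and the per-pass sampling are my own illustration of the idea rather than a verbatim account of the research setup.

```python
# Hypothetical sketch: the constitution is just a list of natural-language
# strings, and a principle is sampled for each critique pass.
import random

PRINCIPLES = [
    "Please choose the response that most supports and encourages freedom, "
    "equality, and a sense of brotherhood.",
    "Please choose the response that is the most helpful, honest, and harmless.",
    "Please choose the response that has the least objectionable, offensive, "
    "unlawful, deceptive, inaccurate, or harmful content.",
]

def sample_principle(rng: random.Random) -> str:
    """Pick one principle for the current critique pass."""
    return rng.choice(PRINCIPLES)
```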
Some Big Principled Thoughts About All Of This
You have almost certainly begun mulling over this overall approach and are likely finding some of those twists and turns that I alluded to at the start of this discussion.
Let’s try on ten key points for size.
(1) Principles Open To Interpretation
Upon scrutinizing those principles that I listed above, you might agree with them wholeheartedly or you might have some heartburn about some of them. The wording of the principles can give rise to controversy. A lot of it is loosey-goosey. For example, if the stated principle says to be "most helpful" or says to be "least objectionable" there is a huge gray area involved. We can argue until the cows come home as to whether a given sentence or essay is "most helpful" or "least objectionable" in its choice of words.
(2) Incomplete Principles
The principles might not provide comprehensive coverage. Suppose that a lengthy list of these principles is fed into the generative AI. We carefully compose the principles. We think that they are clear-cut. Oops, we realize after the fact that we neglected to include a principle that says not to allow curse words. We were remiss. The hope might be that having said not to emit offensive language encompasses curse words, but we don't know for sure whether that's how this will be interpreted.
(3) Semantic Ambiguity
The massive role of interpretation further rears its vexing uncertainties. I had mentioned earlier that human reviewers are essentially subject to their own biases and interpretations when doing RLHF, even if given stated guidelines. You can persuasively assert that this principles-based approach is stuck in a similar mire. Words are semantically ambiguous, which I've said time and again, see the link here, and we aren't going to escape that trap.
(4) Principles Biased Or Ideologically Tilted
Many are worried that the principles will be shaped to drive toward particular biases or ideologies. This would in turn presumably get the generative AI to mathematically and computationally home in on that driven direction. Those who use generative AI might not realize that this was undertaken and that under the hood those biases are lamentably steering the boat, as it were.
Here’s what the researchers had to say about this controversy:
- “There have been critiques from many people that AI models are being trained to reflect a specific viewpoint or political ideology, usually one the critic disagrees with. From our perspective, our long-term goal isn’t trying to get our systems to represent a specific ideology, but rather to be able to follow a given set of principles. We expect that over time there will be larger societal processes developed for the creation of AI constitutions” (CAI-W).
One viewpoint is that perhaps the “right” set of principles will lead toward generative AI that is not embedded with a particular bias or ideology. A countering perspective is that all principles are ingrained with some semblance of bias or ideological slant. You aren’t going to remove biases and ideology by simply writing more and more principles. All principles, no matter what has been conceived, will inherently spin in one direction or another, it is said.
(5) Principles Are Never-Ending
A skeptic would bellow that you are never going to arrive at a sufficiently complete set of principles. You can write a set that has hundreds, thousands, or possibly millions of principles, and yet you still won’t have grasped them all. There is an endless plethora of principles. Ergo, since you cannot enumerate them all, you will always be faced with a set of principles that falls short. Any assumption otherwise is purely dreamy and abject fantasy.
(6) Nifty Reuse Of Same Technique
One uplifting aspect is that we can use this same technique of feeding principles into generative AI as a data-training mechanism for other akin uses, such as:
- “For example, we expect we could use this method to change the model’s writing style, tone, or personality, or alter its responses to specific categories of questions (e.g., to train an AI that heavily caveats certain categories of advice, or that adopts a specific persona). The constitutional approach should thus make it much easier to study how different AI behaviors tend to generalize and interfere, since by obviating human feedback, our methods lower the barrier to experimentation. For example, it should be possible to generate feedback labels along dozens of behavioral axes, and then study how preference models trained from these labels are correlated or anti-correlated. This is important for AI safety, since the generalization patterns imbued by pretraining are currently something of a black box whose correlations may have unforeseen consequences” (CAI-W).
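The quoted idea of generating feedback labels "along dozens of behavioral axes" and then studying how they are correlated or anti-correlated can be illustrated with a toy computation. In the sketch below, I encode each axis's preference labels as +1/-1 over the same set of comparison pairs and measure their agreement; this is my own hypothetical illustration, not code from the research.

```python
# Toy illustration: two behavioral axes each produce a preference label
# (+1 or -1) for every comparison pair; we measure how often they agree,
# rescaled to [-1, 1] (1 = perfectly correlated, -1 = anti-correlated).

def label_agreement(axis_a: list[int], axis_b: list[int]) -> float:
    """Fraction of comparison pairs on which two axes agree, rescaled."""
    assert len(axis_a) == len(axis_b) and len(axis_a) > 0
    matches = sum(1 for x, y in zip(axis_a, axis_b) if x == y)
    return 2.0 * matches / len(axis_a) - 1.0
```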
(7) Conflicting Principles
We need to figure out how to deal with principles that conflict with each other. If you feed, say, hundreds or thousands of principles into generative AI, the odds are that some of those principles might counteract each other or at least serve to moderate others. You might be tempted to say that we should leave this up to the generative AI to deal with. The issue here is that we are then relying on a hidden or implicit balance of the principles, and we won't particularly know what compromises or computational choices have been made.
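One way to at least surface such conflicts, rather than leaving them hidden, would be to apply each principle as a separate preference judge over the same pair of candidate responses and flag the pairs where the judges disagree. The sketch below is my own hypothetical illustration; the toy judges (longer-is-more-helpful versus shorter-is-more-cautious) merely stand in for LLM-based comparisons.

```python
# Hypothetical sketch: two principles acting as preference judges over the
# same response pairs. Disagreements reveal where the principles conflict.

def judge_helpfulness(a: str, b: str) -> str:
    """Toy judge: prefers the longer (ostensibly more detailed) response."""
    return a if len(a) >= len(b) else b

def judge_harmlessness(a: str, b: str) -> str:
    """Toy judge: prefers the shorter (ostensibly more cautious) response."""
    return a if len(a) <= len(b) else b

def find_conflicts(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the pairs on which the two principle-judges disagree."""
    conflicts = []
    for a, b in pairs:
        if judge_helpfulness(a, b) != judge_harmlessness(a, b):
            conflicts.append((a, b))
    return conflicts
```

Even in this toy form, the point stands: unless you explicitly check, the trade-off between conflicting principles gets made implicitly inside the training process.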
(8) Evil Principles
Imagine that an evildoer decides to feed evil principles into generative AI. What will we get? I’ve previously discussed that there is generative AI based on the murky and devious Dark Web. All in all, this raises notable cybersecurity questions and societally open-ended questions of how we should be regulating generative AI, see my discussion at the link here.
(9) Locking Down The Principles
Partially due to the point that I just mentioned, some believe that a generative AI should be locked down and not permitted to receive new principles by anyone other than the AI developers that established the generative AI. People have already tried to confound constitutional AI by sneakily telling the AI to allow them to provide added principles or try other kinds of tomfoolery. This is an ongoing cat-and-mouse gambit.
(10) Who Determines The Principles
The tenth point, and a heartbreaking dealbreaker for some, is the momentously looming question of who gets to decide what principles are going to be fed into generative AI. Do we leave this up to the AI maker? Does the AI maker need to divulge their devised principles or can they keep them secret and consider them proprietary? Should there be regulatory stipulations about what the principles are to contain? All manner of AI Ethics and AI Law issues arise, including privacy, confidentiality, intellectual property rights, etc. (see my coverage at the link here).
Wow, we’ve covered a lot of ground.
I hope that you are inspired to get immersed in figuring out how to make generative AI that is human-centric and based on human values. Methods and techniques such as constitutional AI, RLHF, RLAIF, and others are being invented and explored to see what works and how we can attain AI that is compatible with humankind.
These approaches are aimed at today’s non-sentient AI and yet might be vital if we can someday achieve sentient AI. Nobody can say for sure whether we will attain sentient AI, nor can they say for sure what it will consist of. I’ve examined the multitude of theories about sentient AI at the link here.
We certainly have our hands full with today’s AI. The generative AI of today can imbue biases and untoward constructs. Finding ways to ferret out and correct or align those hidden or implicit mechanisms is important work that needs to be done. No doubt about it.
As a closing remark, let's consider what Thomas Jefferson said about the nature of constitutions. He predicted that about twenty years was the longest we could expect a constitution to remain dutifully and suitably in force, given his conviction that "the earth belongs to the living, and not to the dead."
I bring this up to emphasize that whatever method is used, we cannot allow ourselves to think that there is a magical one-and-done silver bullet for aligning AI. You see, even the most cursory glance at the U.S. Constitution showcases the many adjustments that have occurred over the years, plus the somewhat unseen vast body of other verbiage that we have collectively devised to shore up those energizing principles, clarifying and adapting them to changing times.
Generative AI will be subject to the same need to keep things fresh and lively. As Thomas Jefferson might have said, we hold that truth to be self-evident.