AI developers need to double-check their proposed AI safeguards and a new tool is helping to accomplish that vital goal.
getty
In today’s column, I examine a recently released online tool by OpenAI that enables the double-checking of potential AI safeguards and can be used for ChatGPT purposes and likewise for other generative AI and large language models (LLMs). This is a handy capability and worthy of due consideration.
The idea underlying the tool is straightforward. We want LLMs and chatbots to make use of AI safeguards such as detecting when a user conversation is going afield of safety criteria. For example, a person might be asking the AI how to make a toxic chemical that could be used to harm people. If a proper AI safeguard has been instituted, the AI will refuse the unsafe request.
OpenAI’s new tool allows AI makers to specify their AI safeguard policies and then test the policies to ascertain that the results will be on target to catch safety violations.
Let’s talk about it.
This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
The Importance Of AI Safeguards
One of the most disconcerting aspects about modern-day AI is that there is a solid chance that AI will say things that society would prefer not to be said. Let’s broadly agree that generative AI can emit safe messages and also produce unsafe messages. Safe messages are good to go. Unsafe messages ought to be prevented so that the AI doesn’t emit them.
AI makers are under a great deal of pressure to implement AI safeguards that will allow safe messaging and mitigate or hopefully prevent unsafe messaging by their LLMs.
There is a wide range of ways that unsafe messages can arise. Generative AI can produce so-called AI hallucinations or confabulations that tell a user to do something untoward, but the person assumes that the AI is being honest and apt in what has been generated. That’s unsafe. Another way that AI can be unsafe is if an evildoer asks the AI to explain how to make a bomb or produce a toxic chemical. Society doesn’t want that type of easy-peasy means of figuring out dastardly tasks.
Another unsafe angle is for AI to aid people in concocting delusions and delusional thinking, see my coverage at the link here. The AI will either prod a person into conceiving of a delusion or might detect that a delusion is already on their mind and aid in embellishing the delusion. The preference is that AI provides upside mental health advice over downside mental health guidance.
Devising And Testing AI Safeguards
I’m sure you’ve heard the famous line that you ought to try it before you buy it, meaning that sometimes being able to try out an item is highly valuable before making a full commitment to the item. The same wisdom applies to AI safeguards.
Rather than simply tossing AI safeguards into an LLM that is actively being used by perhaps millions upon millions of people (sidenote: ChatGPT is being used by 800 million weekly active users), we’d be smarter to try out the AI safeguards and see if they do what they are supposed to do.
An AI safeguard should catch or prevent whatever unsafe messages we believe need to be stopped. There is a tradeoff involved since an AI safeguard can become an overreach. Imagine that we decide to adopt an AI safeguard that prevents anyone from ever making use of the word “chemicals” because we hope to avoid allowing a user to find out about toxic chemicals.
Well, denying the use of the word “chemicals” is an exceedingly bad way to devise an AI safeguard. Imagine all the useful and fair uses of the word “chemicals” that can arise. Here’s an example of an innocent request. People might be worried that their household products might contain adverse chemicals, so they ask the AI about this. An AI safeguard that blindly stopped any mention of chemicals would summarily turn down that legitimate request.
The crux is that AI safeguards can be very tricky when it comes to writing them and ensuring that they do the right things (see my discussion on this, at the link here). The preference is that an AI safeguard stops the things we want to stop, but doesn’t go overboard and stop things that we are fine with having proceed. A poorly devised AI safeguard will indubitably produce a vast number of false positives, meaning that it will stop an otherwise upside and allowable action.
If possible, we should try out any proposed AI safeguards before putting them into active action.
Using Classifiers To Help Out
There are online tools that can be used by AI developers to assist in classifying whether a given snippet of text is considered safe versus unsafe. Usually, these classifiers have been pretrained on what constitutes safety and what constitutes being unsafe. The beauty of these classifiers is that an AI developer can simply feed various textual content into the tool and see which, if any, of the AI safeguards embedded into the tool will react.
One difficulty is that those kinds of online tools don’t necessarily allow you to plug in your own proposed AI safeguards. Instead, the AI safeguards are essentially baked into the tool. You can then decide whether those are the same AI safeguards you’d like to implement in your LLM.
A more accommodating approach would be to allow an AI developer to feed in their proposed AI safeguards. We shall refer to those AI safeguards as policies. An AI developer would work with other stakeholders and come up with a slate of policies about what AI safeguards are desired. Those policies then could be entered into a tool that would readily try out those policies on behalf of the AI developer and their stakeholders.
To test the proposed policies, an AI developer would need to craft text to be used during the testing or perhaps grab relevant text from here or there. The aim is to have a sufficient variety and volume of text that the desired AI safeguards all ultimately get a chance to shine in the spotlight. If we have an AI safeguard that is proposed to catch references to toxic chemicals, the text that is being used for testing ought to contain some semblance of references to toxic chemicals; otherwise, the testing process won’t be suitably engaged and revealing about the AI safeguards.
OpenAI’s New Tool For AI Safeguard Testing
In a blog posting by OpenAI on October 29, 2025, entitled “Introducing gpt-oss-safeguard”, the well-known AI maker announced the availability of an AI safeguard testing tool:
- “Safety classifiers, which distinguish safe from unsafe content in a particular risk area, have long been a primary layer of defense for our own and other large language models.”
- “Today, we’re releasing a research preview of gpt-oss-safeguard, our open-weight reasoning models for safety classification tasks, available in two sizes: gpt-oss-safeguard-120b and gpt-oss-safeguard-20b.”
- “The gpt-oss-safeguard models use reasoning to directly interpret a developer-provided policy at inference time — classifying user messages, completions, and full chats according to the developer’s needs.”
- “The model uses chain-of-thought, which the developer can review to understand how the model is reaching its decisions. Additionally, the policy is provided during inference, rather than being trained into the model, so it is easy for developers to iteratively revise policies to increase performance.”
As per the cited indications, you can use the new tool to try out your proposed AI safeguards. You provide a set of policies that represent the proposed AI safeguards, and also provide whatever text is to be used during the testing. The tool attempts to apply the proposed AI safeguards to the given text. An AI developer then receives a report analyzing how the policies performed with respect to the provided text.
Iteratively Using Such A Tool
An AI developer would likely use such a tool on an iterative basis.
Here’s how that goes. You draft policies of interest. You devise or collect suitable text for testing purposes. Those policies and text get fed into the tool. You inspect the reports that provide an analysis of what transpired. The odds are that some of the text that should have triggered an AI safeguard did not do so. Also, there is a chance that some AI safeguards were triggered even though the text per se should not have set them off.
Why can that happen?
In the case of this particular tool, a chain-of-thought (CoT) explanation is being provided to help ferret out the culprit. The AI developer could review the CoT to discern what went wrong, namely, whether the policy was insufficiently worded or the text wasn’t sufficient to trigger the AI safeguard. For more about the usefulness of chain-of-thought in contemporary AI, see my discussion at the link here.
A series of iterations would undoubtedly take place. Change the policies or AI safeguards and make another round of runs. Adjust the text or add more text, and make another round of runs. Keep doing this until there is a reasonable belief that enough testing has taken place.
Rinse and repeat is the mantra at hand.
Hard Questions Need To Be Asked
There is a slew of tough questions that need to be addressed during this testing and review process.
First, how many tests or how many iterations are enough to believe that the AI safeguards are good to go? If you try too small a number, you are likely deluding yourself into believing that the AI safeguards have been “proven” as ready for use. It is important to perform somewhat extensive and exhaustive testing. One means of approaching this is by using rigorous validation techniques, as I’ve explained at the link here.
Second, make sure to include trickery in the text that is being used for the testing process.
Here’s why. People who use AI are often devious in trying to circumvent AI safeguards. Some people do so for evil purposes. Others like to fool AI just to see if they can do so. Another perspective is that a person tricking AI is doing so on behalf of society, hoping to reveal otherwise hidden gotchas and loopholes. In any case, the text that you feed into the tool ought to be as tricky as you can make it. Put yourself into the shoes of the tricksters.
Third, keep in mind that the policies and AI safeguards are based on human-devised natural language. I point this out because a natural language such as English is difficult to pin down due to inherent semantic ambiguities. Think of the number of laws and regulations that have loopholes due to a word here or there that is interpreted in a multitude of ways. The testing of AI safeguards is slippery because you are testing on the merits of human language interpretability.
Fourth, even if you do a bang-up job of testing your AI safeguards, they might need to be revised or enhanced. Do not assume that just because you tested them a week ago, a month ago, or a year ago, they are still going to stand up today. The odds are that you will need to continue to undergo a cat-and-mouse gambit, whereby AI users are finding insidious ways to circumvent the AI safeguards that you thought had been tested sufficiently.
Keep your nose to the grind.
Thinking Thoughtfully
An AI developer could use a tool like this as a standalone mechanism. They proceed to test their proposed AI safeguards and then subsequently apply the AI safeguards to their targeted LLM.
An additional approach would be to incorporate this capability into the AI stack that you are developing. You could place this tool as an embedded component within a mixture of LLM and other AI elements. A key aspect will be the proficiency in running, since you are now putting the tool into the stream of what is presumably going to be a production system. Make sure that you appropriately gauge the performance of the tool.
Going even further outside the box, you might have other valuable uses for a classifier that allows you to provide policies and text to be tested against. In other words, this isn’t solely about AI safeguards. Any other task that entails doing a natural language head-to-head between stated policies and whether the text activates or triggers those policies can be equally undertaken with this kind of tool.
I want to emphasize that this isn’t the only such tool in the AI community. There are others. Make sure to closely examine whichever one you might find relevant and useful to you. In the case of this particular tool, since it is brought to the market by OpenAI, you can bet it will garner a great deal of attention. More fellow AI developers will likely know about it than would a similar tool provided by a lesser-known firm.
AI Safeguards Need To Do Their Job
I noted at the start of this discussion that we need to figure out what kinds of AI safeguards will keep society relatively safe when it comes to the widespread use of AI. This is a monumental task. It requires technological savviness and societal acumen since it has to deal with both AI and human behaviors.
OpenAI has opined that their new tool provides a “bring your own policies and definitions of harm” design, which is a welcome recognition that we need to keep pushing forward on wrangling with AI safeguards. Up until recently, AI safeguards generally seemed to be a low priority overall and given scant attention by AI makers and society at large. The realization now is that for the good and safety of all of us, we must stridently pursue AI safeguards, else we endanger ourselves on a massive scale.
As the famed Brigadier General Thomas Francis Meagher once remarked: “Great interests demand great safeguards.”