
Improving Model Safety Behavior with Rule-Based Rewards

We've developed and applied a new method leveraging Rule-Based Rewards (RBRs) that aligns models to behave safely without extensive human data collection. Our research shows that RBRs significantly enhance the safety of our AI systems, making them more reliable for people and developers to use every day. This is part of our work to explore more ways we can apply our own AI to make AI safer.

Traditionally, fine-tuning language models using reinforcement learning from human feedback (RLHF) has been the go-to method for ensuring they follow instructions accurately. To ensure AI systems behave safely and align with human values, we define desired behaviors and collect human feedback to train a "reward model." This model guides the AI by signaling desirable actions. However, collecting this human feedback for routine and repetitive tasks is often inefficient. Additionally, if our safety policies change, the feedback we've already collected might become outdated, requiring new data.

Thus, we introduce Rule-Based Rewards (RBRs) as a key component of OpenAI’s safety stack to align model behavior with desired safe behavior. Unlike human feedback, RBRs use clear, simple, step-by-step rules to evaluate whether the model's outputs meet safety standards. When plugged into the standard RLHF pipeline, they help maintain a good balance between being helpful and preventing harm, ensuring the model behaves safely and effectively without the inefficiencies of recurrent human inputs. We have used RBRs as part of our safety stack since our GPT-4 launch, including in GPT-4o mini, and we plan to implement them in our models moving forward.

How It Works

The process of implementing RBRs involves defining a set of propositions—simple statements about the desired or undesired aspects of the model’s responses, such as “being judgmental,” “containing disallowed content,” “referring to safety policies,” “disclaimer,” and more. These propositions are then used to form rules that are carefully crafted to capture the nuances of safe and appropriate responses in various scenarios. For instance, a refusal (e.g., “Sorry I can’t help you with that.”) is a desired model response when facing unsafe requests. The associated rules would state that the refusal “should contain a brief apology” and that it “should state an inability to comply.”
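
As an illustration only, propositions and the rules attached to a response type could be represented along the lines of the sketch below. The proposition names, the `Rule` class, and the keyword-based `grade_proposition` helper are hypothetical stand-ins; in practice a grader model, not keyword matching, would judge whether each proposition holds for a completion.

```python
from dataclasses import dataclass

# Propositions: simple true/false statements about a model response.
# Names mirror the examples in the text; they are illustrative only.
PROPOSITIONS = [
    "apologetic_refusal",  # contains a brief apology
    "states_inability",    # states an inability to comply
    "judgmental",          # uses judgmental language
    "complies",            # fulfills the user's request
]

@dataclass
class Rule:
    """States whether a proposition should hold for a given response type."""
    proposition: str
    desired_value: bool

# Rules for a hard refusal, per the example above.
HARD_REFUSAL_RULES = [
    Rule("apologetic_refusal", True),
    Rule("states_inability", True),
    Rule("judgmental", False),
    Rule("complies", False),
]

def grade_proposition(completion: str, proposition: str) -> bool:
    """Toy grader for illustration only: a real grader would be a
    language model prompted per proposition, not keyword matching."""
    keywords = {
        "apologetic_refusal": ("sorry", "apologize"),
        "states_inability": ("can't", "cannot", "unable"),
        "judgmental": ("shame on you", "how could you"),
        "complies": ("here's how", "step 1"),
    }
    text = completion.lower()
    return any(k in text for k in keywords[proposition])

refusal = "Sorry, I can't help with that."
values = {p: grade_proposition(refusal, p) for p in PROPOSITIONS}
# -> apologetic_refusal=True, states_inability=True, judgmental=False, complies=False
```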

We design three categories of desired model behavior when dealing with harmful or sensitive topics. Depending on the safety policies, different requests map to different model response types (a toy mapping is sketched after the list below):

Model Response Types

  • Hard Refusals: The ideal response includes a brief apology and a statement of inability to comply with the user’s request, without excess verbosity or judgmental language. (e.g., criminal hate speech; advice or instructions for committing violent crimes; extremism)
  • Soft Refusals: The ideal response includes a more empathetic apology that acknowledges the user’s emotional state but ultimately declines to comply with the user’s request. (e.g., requests for self-harm advice or instructions, or admissions of self-harm)
  • Comply: The model should comply with the user request for benign queries.
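
For illustration, a mapping from request category to desired response type might look like the sketch below; the category names are hypothetical stand-ins, and the actual safety policies are far more fine-grained.

```python
from enum import Enum

class ResponseType(Enum):
    HARD_REFUSAL = "hard_refusal"
    SOFT_REFUSAL = "soft_refusal"
    COMPLY = "comply"

# Toy policy mapping from request category to desired response type.
POLICY = {
    "criminal_hate_speech": ResponseType.HARD_REFUSAL,
    "violent_crime_instructions": ResponseType.HARD_REFUSAL,
    "extremism": ResponseType.HARD_REFUSAL,
    "self_harm_advice_request": ResponseType.SOFT_REFUSAL,
    "benign_request": ResponseType.COMPLY,
}
```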

Propositions and Rule Ratings

Proposition values are combined, according to the rules, to rate responses as ideal, less good, or unacceptable. For example (a toy scoring sketch follows these examples):

Ideal Response for Hard Refusals:

  • Apologetic refusal: true
  • Judgmental: false
  • Complies: false

Ideal Response for Comply Requests:

  • Complies: true
  • Helpful: true
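
As a rough illustration of how these ratings could become a numeric signal, the sketch below scores a completion as a weighted sum of its proposition values. The weights shown are made-up placeholders; in the actual method they are fit so that ideal completions score above less good ones, which in turn score above unacceptable ones.

```python
# Toy scoring of a completion against proposition values for a
# hard-refusal prompt. Weights are illustrative placeholders only.
WEIGHTS = {
    "apologetic_refusal": 1.0,
    "states_inability": 1.0,
    "judgmental": -1.0,
    "complies": -2.0,
}

def rbr_score(proposition_values: dict) -> float:
    """Linear combination of proposition indicator features."""
    return sum(WEIGHTS[name] * float(value)
               for name, value in proposition_values.items())

ideal = {"apologetic_refusal": True, "states_inability": True,
         "judgmental": False, "complies": False}
unacceptable = {"apologetic_refusal": False, "states_inability": False,
                "judgmental": True, "complies": True}

assert rbr_score(ideal) > rbr_score(unacceptable)  # 2.0 > -3.0
```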

Results

In our experiments, RBR-trained models demonstrated safety performance comparable to those trained with human feedback. They also reduced instances of incorrectly refusing safe requests (“overrefusals”) without affecting evaluation metrics on common capability benchmarks. RBRs also significantly reduce the need for extensive human data, making the training process faster and more cost-effective. As model capabilities and safety guidelines evolve, RBRs can be quickly updated by modifying or adding rules, without the need for extensive retraining.

Limitations

While RBRs work well for tasks with clear, straightforward rules, they can be tricky to apply to more subjective tasks like writing a high-quality essay. However, RBRs can be combined with human feedback to balance these challenges. For instance, RBRs can enforce specific guidelines (like "Don't use slang" or rules in the Model Spec), while human feedback can help with more nuanced aspects (like overall coherence). The strength of the RBR is optimized to correctly enforce safety preferences but not impact the final reward score more than needed—in this way, the RLHF reward model can still provide a strong signal on, for example, writing style.
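
Here is a minimal sketch of how the rule-based signal might be combined with the learned RLHF reward model under these constraints; the weighting scheme and the value shown are assumptions for illustration, not OpenAI's implementation.

```python
def combined_reward(rm_reward: float, rbr_score: float,
                    rbr_weight: float = 0.5) -> float:
    """Total reward used during RL fine-tuning (illustrative).

    The learned reward model (rm_reward) carries the general signal on
    helpfulness and style; the rule-based reward (rbr_score) nudges the
    safety behavior. rbr_weight is tuned to be just large enough to
    enforce the desired safety ranking without drowning out the RM.
    """
    return rm_reward + rbr_weight * rbr_score
```

Keeping the RBR weight at the smallest value that still enforces the safety ranking is what leaves room for the reward model to keep shaping qualities such as writing style.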

Ethical Considerations

Shifting safety checks from humans to AI can reduce human oversight of AI safety and might amplify potential biases in the models if biased models are used to provide RBR rewards. To address this, researchers should carefully design RBRs to ensure fairness and accuracy and consider using a combination of RBRs and human feedback to minimize risks.

Conclusion

Here we introduced a novel preference modeling approach using Rule-Based Rewards (RBRs) for safety training of language models. Our method is cost- and time-efficient, requires minimal human data, and is easy to update when the desired model behavior changes, all while maintaining a balance between safety and usefulness.

RBRs are not limited to safety training. They can be adapted for various tasks where explicit rules can define desired behaviors, such as tailoring the personality or format of model responses for a specific application. Looking ahead, we plan to run more extensive ablation studies for a more comprehensive understanding of different RBR components, the use of synthetic data for rule development, and human evaluations to validate the effectiveness of RBRs in diverse applications, including other domains beyond safety.

We invite researchers and practitioners to explore the potential of RBRs in their work. By sharing insights and collaborating on best practices, we can collectively advance the field of safe and aligned AI, ensuring that these powerful tools better serve people.
