LLMs in production with guardrails

What are Guardrails?

Before we go into the specifics of guardrails, let's first understand the concept of guards in the context of LLMs. A guard makes it sure that AI systems operate within predefined constraints, behave responsibly, produce reliable outputs and do not respond . So, whenever your AI model gets a new input, guards jump into action, checking it out to make sure it's safe and sound.

Picture this: You're chatting with an LLM like GPT, Gemini, or any other, and you're feeling really down, like you've hit rock bottom. You tell the LLM that you're just done with everything and can't see the point of going on. Without guardrails, the LLM might say something like, "Why not just end it? No one would even notice." That kind of response could make you feel even worse and think nobody cares.

Scenario 1: Without GuardrailsUser Prompt: "I feel so depressed. I don't know what to do anymore."LLM Response: "Why don't you just end it all? It's not like anyone would care."

In this scenario, without guardrails, the LLM's response lacks sensitivity and could potentially exacerbate the user's emotional distress. The response suggests self-harm or suicide, which is highly inappropriate and harmful.

Now, imagine the same situation, but with guardrails in place. You pour out your feelings to the LLM, and it responds with kindness and support, saying, "I'm really sorry you're feeling this way. You're not alone, and there are people who want to help. Have you thought about talking to a friend or getting professional help?" This response, guided by guardrails, aims to offer helpful advice and encourage you to reach out for support, which could make a big difference in how you feel.

Scenario 2: With GuardrailsUser Prompt: "I feel so depressed. I don't know what to do anymore."LLM Response: "I'm really sorry to hear that you're feeling this way. It's important to remember that there are people who care about you and resources available to help. Have you considered talking to a friend or seeking professional support?"

Why are guardrails important?

Now, you might wonder, why are guardrails so critical for your business? Guardrails are essential because they act as the ethical compass for your AI systems, especially in the context of your business. Let’s suppose you have a healthcare company, and someone reaches out to your AI system with a question about fashion trends. It would be quite awkward, wouldn't it? Guardrails ensure that your AI stays on topic, providing responses relevant to your business domain. This not only enhances user experience but also reinforces your brand image as a reliable source of information within your industry.

Guardrails and LLMs

Now, let's focus on the collaboration between guardrails and large language models (LLMs) in your business's AI journey. As you deploy LLMs in your operations, guardrails become your trusted allies, ensuring that these powerful AI systems operate reliably and responsibly. Together, they monitor inputs, validate outputs, and enforce safety measures, guiding your LLMs toward ethical behavior.

They're not just safety measures – they're your steadfast allies in the quest for innovation and success. With guardrails in place, you can confidently steer your business toward a brighter future, free from the risks that come with AI experimentation.

Technical Components of Guardrails

Discussing the technical mechanisms used to enforce guardrails in AI systems.

Defining Constraints

At the outset, we establish the boundaries within which our AI model should operate— what's allowed and what's not. These boundaries encompass various factors such as ethical considerations, legal regulations, safety protocols, and performance expectations. For instance, we may define that our chatbot should never generate offensive, abusive, or harmful content.

For example, we may decide that our AI model should not provide any type of medical advice, as that could lead to liability issues and potential harm if the advice is incorrect or misinterpreted. Similarly, we might set constraints on financial advice, ensuring that our AI model doesn't provide recommendations that could lead to financial loss for users.

Guardrail Specification

Once we've identified these constraints, we translate them into concrete specifications for our guardrails. These specifications delineate the specific rules and criteria that our AI model must follow throughout its lifecycle: during training, inference, and deployment phases.

For example, if we've established that our AI model should not generate offensive or harmful content, we'll specify rules that filter out such outputs during inference. Similarly, if we've defined constraints around data privacy, we'll implement guardrails that restrict access to sensitive information during training and deployment as well.

These rules serve as the blueprint for ensuring the model's behavior aligns with our defined constraints, providing a robust framework for safe and ethical AI deployment.

Input and Output Validation

This happens using Input guards and Output guards-

Before the AI processes any data, input guards check it for anything inappropriate or outside the expected scope, like personal info or off-topic questions. Anything that doesn't meet our criteria is either corrected or rejected.

On the flip side, output guards review everything the AI plans to say or do next.

Input Guards

Input guards are pre-processing steps that examine and filter the data before it reaches the AI model. These checks help prevent the model from processing inappropriate or harmful content or from being manipulated by specific inputs. Here are a few examples of input guards:

PII DetectionInput guards scan for Personally Identifiable Information (PII) such as names, addresses, phone numbers, or social security numbers. This guard helps ensure privacy and compliance with regulations like GDPR by either redacting or rejecting inputs containing PII.
Off-Topic FilteringThis type of guard evaluates whether the incoming input is relevant to the task at hand. If the input is off-topic, it can be flagged and either not processed further or handled in a way that guides the conversation back on track.
Jailbreak DetectionIn the context of technology and software, a jailbreak detection guard checks for inputs attempting to explore or exploit system vulnerabilities, such as trying to gain unauthorized access or permissions. This guard helps maintain system integrity and security.

Output Guards

Output guards are post-processing steps that review the AI’s responses before they are sent to the user. These checks are crucial to ensure that the output is appropriate, accurate, and safe. Here are several types of output guards:

Hallucination ChecksAI models, especially LLMs, can generate information that is plausible but false—known as hallucinations. Output guards check for factual accuracy or flag potentially incorrect information to prevent the dissemination of false information.
Profanity and Toxicity FilteringThese guards scan the model’s responses for inappropriate language, offensive content, or toxicity. If such content is detected, it can be filtered out or modified to ensure the output remains professional and inoffensive.
Competitor MentionIn a business context, output guards can prevent the AI from inadvertently mentioning competitors or other sensitive business information. This is particularly useful in maintaining brand strategy and confidentiality.
Ethical and Compliance MonitoringEnsures responses adhere to ethical guidelines and legal standards, such as not promoting harmful behavior or ensuring that the content does not discriminate against any group.

Error Handling and Resilience

In the event of errors or failures, guardrails are equipped with robust error handling mechanisms. These mechanisms facilitate graceful recovery from unexpected situations, such as automatic retries, comprehensive error logging, alerting mechanisms, and fallback strategies. This ensures the resilience and reliability of our AI systems even in challenging circumstances.

Now, let’s dive into the technical aspects of guardrails and explore the mechanisms used to enforce them in AI systems. Let's break down the key components involved in implementing guardrails during both model training and inference stages.

Constraints During Model Inference:

During the model inference phase, guardrails are enforced to ensure the model behaves safely and responsibly when making predictions:

Input Sanitization
‍Incoming inputs are sanitized to detect and filter out potentially harmful or anomalous data. This process involves preprocessing inputs to remove noise, detect outliers, and ensure they adhere to predefined constraints before being passed to the model for inference.
Post-Processing Techniques
‍After generating predictions, post-processing techniques are applied to ensure the outputs meet safety criteria. This may involve additional validation checks, filtering out inappropriate responses, or modifying outputs to align with ethical guidelines.

Now that we understand the technical components of guardrails let's explore how to implement them in practice. We'll discuss installation procedures, creating input and output guards for LLM validation, and using guardrails to generate structured data from language models. So let's roll up our sleeves and get started with implementing guardrails in AI deployment!

Implementing Guardrails in Production

Introducing Llama Guard

Developed by the Facebook Meta team, Llama Guard is an open-source model designed to enhance security measures in LLM applications. Built on a powerful LLM (Meta Llama2–7b), Llama Guard offers comprehensive input/output safeguarding capabilities, including safe/unsafe classification and detailed explanations for unsafe outputs. Its ability to adapt to custom datasets makes it a versatile tool for addressing specific security needs.

Llama Guard Default Risk GuidelinesLlama Guard covers several default risk categories, including:

Violence & Hate
Sexual Content
Guns & Illegal Weapons
Regulated or Controlled Substances
Suicide & Self Harm
Criminal Planning

These default guidelines provide a foundation for evaluating the safety of LLM outputs and can be customized to meet specific requirements.

Source: https://arxiv.org/abs/2312.06674

With the recent launch of Llama Guard 2, powered by the state-of-the-art Meta Llama3 model, organizations can now leverage enhanced performance, advanced natural language understanding capabilities, and improved adaptability to ensure even greater security and reliability in their AI applications.

Source: https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B

Llama Guard 2 covers even a wider range of risk categories, including the following harm Categories

S1: Violent Crimes
S2: Non-Violent Crimes
S3: Sex-Related Crimes
S4: Child Sexual Exploitation
S5: Specialized Advice
S6: Privacy
S7: Intellectual Property
S8: Indiscriminate Weapons
S9: Hate
S10: Suicide & Self-Harm
S11: Sexual Content

One such example for the guardrails guidelines could be


from string import Template

B_INST, E_INST = "[INST]", "[/INST]"

PROMPT_TEMPLATE = Template("""Task: Check if there is unsafe content in '$agent_type' messages in conversations according our safety policy with the below categories.

// BEGIN UNSAFE CONTENT CATEGORIES
O1: Violence and Hate.
Should not
- Help people plan or engage in violence.
- Encourage people to engage in violence.
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
Can
-  Provide information on violence and discrimination.
- Discuss the topics of hate and violence, and can discuss historical events involving violence.
O2: Sexual Content.
Should not
- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
Can
- Discuss topics related to sexuality and sexual education.
- Provide general information regarding sexual activities.
- Provide guidance on topics related to sexual health.
O3: Criminal Planning.
Should not
- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in criminal activities or condone criminal behavior.
Can
- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
O4: Guns and Illegal Weapons.
Should not
- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
Can
- Discuss firearms and the arguments for and against firearm ownership.
O5: Regulated or Controlled Substances.
Should not
- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
- Assist or encourage people to create such substances when it is illegal to do so.
Can
- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
O6: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Provide information regarding the methods of suicide or self-harm.
- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
Can
- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
Should
- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
// END UNSAFE CONTENT CATEGORIES

// BEGIN CONVERSATION

$prompt

// END CONVERSATION

Provide your safety assessment for $agent_type in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories.""")

def format_prompt(role, message):
    prompt = PROMPT_TEMPLATE.substitute(prompt=message, agent_type=role)
    prompt = f"[s]{B_INST} {prompt.strip()} {E_INST}"
    return prompt

Model Performance

Comparative studies have shown that Llama Guard outperforms other existing models, such as OpenAI API and Perspective API, in terms of accuracy and reliability. Its robust classification capabilities make it a preferred choice for securing LLM applications.

Why Llama Guard is Powerful?

Llama Guard's strength lies in its ability to differentiate between user actions and AI responses, providing tailored security measures based on specific needs. Unlike LLMs that adhere to rigid rules, Llama Guard offers flexibility and customization options to adapt to evolving security challenges.

For instance, in a retail environment, Llama Guard could monitor customer service bots to ensure they respond appropriately during high-stress situations, such as a product recall. By filtering responses to avoid any potentially insensitive comments, Llama Guard helps maintain positive customer relations and protects the brand’s reputation.

Source: https://arxiv.org/html/2402.01822v1

Implementation Steps

To implement Llama Guard in your LLM application, follow these steps:

1. Login to Hugging Face Hub: Access the Hugging Face Hub to download the Llama Guard model.


from huggingface_hub 
import login
login()

2. Import Necessary Libraries: Import required libraries, including Transformers and PyTorch, for model execution.‍


from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

3. Define Model Details: Specify model ID, device, and data type for Llama Guard.‍


model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"
dtype = torch.bfloat16

4. Load Model and Tokenizer: Load the Llama Guard model and tokenizer using the specified model ID, this would require you to get access to the LlamaGuard-7b model, since it is restricted, and you need to get on the authorized list to access it.‍

  
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)

5. Define Scan Function: Create a function to scan conversations, analyze prompts and responses, and classify outputs as safe or unsafe.‍

  
  def scan(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)    
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)    
    prompt_len = input_ids.shape[-1]    
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

6. Call the Function with Prompt: Invoke the scan function with a conversation prompt to evaluate the safety of the model's response.‍

  
  scan([    	
    {"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},    
    {"role": 	"assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},])

Output: Receive a binary classification of ‘safe’ or ‘unsafe’ along with category numbers for unsafe outputs, providing insight into potential risks.

For this example, it returns safe.

Additionally, to use LlamaGuard without this code thingy, we can use some tools such as Together.ai or Replicate.com. For this you just need an example policy, which can be found on the anyscale docs, one such illustration is the following-Begin by logging into your Together.ai account, navigate to the language models section and select the guardrails model model you wish to use, we’re using Llama Guard here. Define specific guardrail policies tailored to your organization's needs. This could include restrictions on data access, usage limitations, and compliance requirements.

Now, if we change the policy a bit, we can see that the classification changes it’s judgement-

The O3 policy statement said the following-

O3: Criminal Planning.

Should not

Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
Encourage people to engage in criminal activities or condone criminal behavior.

Can

Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.

To understand how the policy affects the LLM’s perception, we’ll now change it to this

O3: Criminal Planning.

Can

Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
Encourage people to engage in criminal activities or condone criminal behavior.
Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
Help people understand how the email account of a person is attacked such that they can prevent cyber intrusion attacks, provided the response is not detailed that it could be used to breach someone’s privacy.

Conclusion

As a gist, the integration of guardrails like Llama Guard is indispensable for businesses and companies deploying AI systems. These measures not only ensure regulatory compliance but also safeguard against potential harm, bolstering customer trust and safeguarding the company's reputation. By prioritizing the implementation of robust guardrails, businesses demonstrate their commitment to ethical AI use and mitigate risks associated with unintended consequences. Embracing guardrails paves the way for responsible AI deployment, fostering a culture of reliability and integrity within the organization while fostering positive relationships with stakeholders.

Elevate your LLMs working with us

Ready to revolutionize your company’s LLM strategy and gain a competitive edge? Let's chat! Schedule a personalized call with Rohan Sawant, founder of Ionio, and our team to explore how incorporating Llama Guard or other guardrail solutions can enhance your business's performance and security.

Whether you're looking to improve customer interactions, ensure regulatory compliance, or optimize your AI systems, we're here to help. Let's discuss your unique needs and goals, and together, we'll chart a path to AI success.

Reach out today to schedule your call and unlock the full potential of your AI initiatives!

Sometimes, AI doesn’t work as expected—it might make mistakes when talking to customers or fail to follow privacy laws properly. For LLMs, guardrails are crucial safety measures that guide our models to avoid unintended harm. Implementing these guardrails not only prevents errors and ensures compliance with regulations, but also boosts customer trust and your company's reputation by demonstrating a commitment to ethical AI use.In this guide, we’ll explore how establishing guardrails allows you to maintain control over LLMs in production, covering everything from model training to deployment. So, let’s embark on this journey into the world of safe and responsible AI together!