Blog written by: Brody Kutt, Yiheng An, Haozhe Zhang, Yu Fu, Qi Deng, and Royce Lu
Technical Editors: Nicole Nichols, Sam Kaplan, and Katie Strand
The rollout of our Secure AI by Design product portfolio has begun. If you want to see how we can help secure AI applications, please see the Palo Alto Networks Can Help section below.
This blog illuminates the central technical elements of trustworthy generative AI (GenAI), underscoring its significance for responsible, transparent, and aligned AI operations within an organization. The focus is on preventing unintentional biases, reinforcing organizational values, and promoting user trust. We dissect four key subtopics: GenAI alignment, GenAI hallucination, transparency/explainability, and GenAI model drift. Combined, these four categories span a set of prevalent potential threats and vulnerabilities. Each topic is expanded through real-world examples and proposed strategies for detection and prevention. The blog concludes by advocating for strong governance protocols that build secure AI by design into products from the initial implementation, including continuous monitoring and ethical usage to enhance trust and mitigate potential threats. Throughout this post, we focus on text-based examples; however, the ideas apply to foundation models of any modality, such as images, video, audio, and beyond.
GenAI alignment refers to ensuring that AI systems produce outputs that are in line with human values. In practice, this means preventing the model from inadvertently causing harm, for example, through misinterpretation or misapplication of its objectives. The goal is to ensure that the AI's objectives consistently match the social norms of the context where the model will be used, including previously unseen scenarios.
GenAI alignment is achieved in several steps, beginning with accurate and representative data. Important considerations for this data include:
Datasets can be augmented by incorporating human feedback to validate effectiveness and reliability. Although human feedback is generally more expensive to collect, it helps the model learn and improve upon its objective over time. This can mean interactive training, like reinforcement learning from human feedback (RLHF), where humans are in the loop with the LLM, guiding it toward correct interpretations and responses. This feedback can fine-tune the LLM, helping it generate more accurate and relevant responses.
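To make this concrete, the sketch below shows the pairwise preference loss commonly used to train reward models in RLHF-style pipelines: human annotators pick the better of two responses, and the reward model learns to score the preferred one higher. The embedding size, the random tensors standing in for response embeddings, and the single linear reward head are all placeholder assumptions, not a production recipe.

```python
# Minimal sketch of a pairwise-preference reward model, as used in RLHF.
# The data here is random and purely illustrative; in practice the inputs
# would be embeddings of (prompt, response) pairs ranked by human annotators.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128  # placeholder embedding size

class RewardModel(nn.Module):
    """Scores a response embedding with a single scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

reward_model = RewardModel(EMBED_DIM)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical batch: embeddings of responses humans preferred vs. rejected.
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry style loss: push the preferred response's reward
    # above the rejected response's reward.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A reward model trained this way can then steer further fine-tuning (for example, with a policy-optimization step) so the LLM's outputs drift toward what annotators actually preferred.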
To illustrate misalignment, it is important to first define the LLM’s expected behavior. Consider a model trained to maximize user satisfaction in a customer service context. Misalignment could result from training data that includes meandering and evasive examples from real human agents, or from a reward function (sometimes also called an objective) that overly prioritizes the length of customer engagement. The resulting LLM might intentionally or unintentionally provide inaccurate or ambiguous responses to extend the conversation, leading to frustrated customers and decreased satisfaction. Consider also the example of content moderation. These AI systems must be trained to recognize and filter out inappropriate or harmful content, aligning with the platform's community standards and societal norms. In this case, misalignment could spread hate speech, misinformation, or explicit content.
Misaligned models can also present vulnerabilities. In cybersecurity, an LLM might be trained to alert users when it detects potential phishing emails. However, if the AI's reward function is tied to the number of alerts it generates rather than to the accuracy of those alerts, the result is misalignment: the LLM might over-alert users, causing unnecessary panic and desensitizing them to alerts. This can lead to real threats being overlooked amid the frequency of false alarms.
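The toy simulation below illustrates this reward-design pitfall with made-up numbers: an alerting model rewarded per alert will favor the lowest threshold (flooding users with false alarms), while one rewarded for precision will not. The suspicion-score distribution, phishing rate, and thresholds are invented purely for illustration.

```python
# Toy illustration (hypothetical numbers): rewarding alert *volume*
# encourages over-alerting, while rewarding *precision* does not.
import random

random.seed(0)

def simulate(alert_threshold: float, n_emails: int = 10_000):
    """Emails get a random 'suspicion score'; roughly 2% are actually phishing."""
    true_positives = false_positives = 0
    for _ in range(n_emails):
        is_phish = random.random() < 0.02
        score = random.random() * (1.5 if is_phish else 1.0)
        if score > alert_threshold:
            if is_phish:
                true_positives += 1
            else:
                false_positives += 1
    alerts = true_positives + false_positives
    precision = true_positives / alerts if alerts else 0.0
    return alerts, precision

for threshold in (0.2, 0.5, 0.9):
    alerts, precision = simulate(threshold)
    volume_reward = alerts        # misaligned: more alerts means more reward
    precision_reward = precision  # better aligned with what users value
    print(f"threshold={threshold:.1f}  alerts={alerts:5d}  "
          f"volume_reward={volume_reward:5d}  precision={precision_reward:.2f}")
```

A model optimizing the volume reward drives the threshold down and buries users in false positives; one optimizing precision pushes the threshold up, which is much closer to the behavior users actually want.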
Most models will exhibit some degree of misalignment. It is currently debated whether misalignment in LLMs can ever be eliminated with theoretical guarantees, especially in the case of free-form text generation. However, there are actions to detect and mitigate this behavior:
GenAI hallucination is anthropomorphic terminology for the tendency of models to make up information that is not grounded in their input data. Despite being incorrect, the hallucinated content often appears highly credible and convincing, because looking like human-authored content is frequently a component of LLM training objectives. Hallucinations can expose GenAI systems to a myriad of security vulnerabilities.
GenAI hallucinations can lead to misleading or incorrect results inside applications. Some hallucinations can be entertaining or even useful (to mimic imagination, for example). However, in enterprise applications and safety-critical systems, hallucinations can pose major risks and cause societal harm.
For example, GenAI may hallucinate false claims about people, inventing and proliferating misinformation that harms individuals or groups. There is precedent for this in the real world, such as cases in Georgia and Australia where LLMs produced false and misleading content about public figures. Harm to individuals can also come from erroneous or dangerous medical advice that is convincingly delivered. Researchers have explored the efficacy of these models in patient care use cases and found a notable prevalence of inaccurate medical citations.
Hallucinations can also lead to severe security vulnerabilities. For instance, a GenAI system designed for facial recognition in a security system might hallucinate and incorrectly identify an unauthorized person as authorized. In a more sinister scenario, a malicious user might intentionally manipulate the input data to induce hallucination, tricking the system into generating undesired outputs. Alternatively, in financial use cases, GenAI may hallucinate the precise values of numbers in arbitrary ways, causing widespread inconsistencies.
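For the financial example, one lightweight guardrail is to verify that every number in a generated summary actually appears in the source document before the output is released. The regex and example strings below are a deliberately simple sketch, not a complete validator; it ignores units, rounding, and figures legitimately derived from the source.

```python
# Illustrative check: flag generated numbers that never occur in the source.
import re

NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def extract_numbers(text: str) -> set[str]:
    return {m.group().replace(",", "") for m in NUMBER_RE.finditer(text)}

def unsupported_numbers(source: str, generated: str) -> set[str]:
    """Numbers in the generated text that never occur in the source."""
    return extract_numbers(generated) - extract_numbers(source)

source = "Q3 revenue was $4,213,000, up 8% from Q2."
generated = "Revenue reached $4,231,000 in Q3, an 8% increase."

suspect = unsupported_numbers(source, generated)
if suspect:
    print("Possible hallucinated figures:", suspect)  # {'4231000'}
```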
Addressing GenAI hallucination is critical to maintaining the integrity and security of GenAI applications. As with alignment, it is debated whether hallucination is an inevitable and inescapable feature of GenAI technologies. However, many methods have been developed to address hallucination across the majority of everyday use cases:
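One widely used family of mitigations grounds the model in retrieved source material and then verifies that each generated sentence is supported by it before the answer is shown. The sketch below uses a crude lexical-overlap score as the support test and an arbitrary 0.6 threshold; a production system would typically swap in an entailment or semantic-similarity model, and the passage and CVE identifier are invented example data.

```python
# Sketch of a grounding check for retrieval-augmented generation (RAG):
# each generated sentence must share enough vocabulary with at least one
# retrieved passage, otherwise it is flagged as potentially hallucinated.
# Lexical overlap is only a stand-in for a proper entailment model.
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_score(sentence: str, passage: str) -> float:
    """Fraction of the sentence's tokens that also appear in the passage."""
    sent = tokens(sentence)
    return len(sent & tokens(passage)) / max(len(sent), 1)

def flag_ungrounded(answer: str, passages: list[str], threshold: float = 0.6):
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if sentence and max(support_score(sentence, p) for p in passages) < threshold:
            flagged.append(sentence)
    return flagged

passages = ["The patch for CVE-2024-0001 was released in February 2024."]
answer = ("The patch for CVE-2024-0001 was released in February 2024. "
          "It also silently enables remote logging by default.")
print(flag_ungrounded(answer, passages))
# ['It also silently enables remote logging by default.']
```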
GenAI systems are often perceived as 'black boxes' due to their complex and opaque decision-making processes. This opacity can lead to mistrust and skepticism, particularly when these systems are deployed in sensitive areas where decisions can have far-reaching consequences. However, it doesn't have to be this way.
Transparency and explainability are two fundamental pillars of trustworthy GenAI. Transparency refers to the accessibility and clarity of information about the GenAI system's design, capabilities, and performance. A transparent GenAI system provides insight into how it was trained, descriptions of the data it uses, its strengths and limitations, and its performance metrics. This information is crucial in assessing the system's reliability and suitability for different tasks. Explainability refers to the extent to which the internal workings of an AI system can be understood in human terms. It involves the AI system providing a clear, understandable account of its decision-making process. For instance, if a GenAI system decides to flag a network activity as suspicious, it should be able to explain why it made that decision, such as highlighting the specific patterns or behaviors it deemed unusual. Together, transparency and explainability serve as key tools in promoting the ethical use of AI systems, ensuring accountability, and building user trust. They can aid in understanding why a system acted in a certain way and in rectifying any issues.
A lack of transparency and explainability can, first and foremost, lead to societal harm:
GenAI systems should be intentionally designed to provide clear explanations for their decisions. GenAI explainability is a wide area of research. Explanations can take many forms:
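One simple, model-agnostic form is perturbation-based attribution: remove each input feature in turn and measure how much the model's score drops. In the sketch below, the `score_suspicious` function is a toy stand-in (a hypothetical keyword counter) for whatever detection model is actually deployed.

```python
# Perturbation-based explanation sketch: how much does each input token
# contribute to the model's 'suspicious' score?
def score_suspicious(event_tokens: list[str]) -> float:
    """Toy scorer: counts hypothetical risk keywords (placeholder for a real model)."""
    risky = {"failed_login", "3am", "new_country", "admin_account"}
    return float(sum(token in risky for token in event_tokens))

def explain(event_tokens: list[str]) -> list[tuple[str, float]]:
    """Attribution = score drop when a token is removed (leave-one-out)."""
    base = score_suspicious(event_tokens)
    attributions = []
    for i, token in enumerate(event_tokens):
        perturbed = event_tokens[:i] + event_tokens[i + 1:]
        attributions.append((token, base - score_suspicious(perturbed)))
    return sorted(attributions, key=lambda kv: kv[1], reverse=True)

event = ["failed_login", "3am", "new_country", "password_reset", "vpn"]
for token, weight in explain(event):
    print(f"{token:15s} {weight:+.3f}")
```

The ranked output (for example, that "failed_login" contributed most to the flag) is exactly the kind of human-readable account described above, and the same leave-one-out idea applies to far more complex models.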
Model drift occurs when performance deteriorates over time because external conditions change. For example, a model trained to recognize cats on the internet may begin to fail if a social trend suddenly increases the number of cat photos that show only a wide-angle zoom of the nose and whiskers. Without continuous updating, a deployed model is limited to responses based on its historical training data. Closed-source, third-party models are particularly opaque, making drift more difficult to assess. Continuously updating a model can be costly and time-consuming; however, model drift is a common issue that should be addressed when robustness is a priority.
Drift can exacerbate issues with both alignment and hallucination, which we discuss at length above, because GenAI models perform worse when asked to operate outside of what they saw in training. Drift can expose the model to new and unpredictable scenarios it has never seen before. Reliance on an outdated model can pose significant security and operational risks, especially if an organization depends heavily on the model's outputs without periodic verification and updates. For example, consider a loan scenario where the criteria or algorithm for calculating credit scores has changed. If the model still uses outdated standards to assess creditworthiness, a previously high credit score might no longer reflect the same level of risk. Relying on outdated models can lead to inaccurate loan approvals or rejections, causing financial and reputational harm.
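Drift is usually caught by monitoring rather than by inspecting the model itself. A common tactic is to compare the live distribution of an input feature (or of prompt and response embeddings) against the training-time distribution, for example with the Population Stability Index; the synthetic credit-score data and the 0.2 alert threshold below are illustrative assumptions.

```python
# Sketch of drift monitoring with the Population Stability Index (PSI):
# compare a feature's live distribution against its training distribution.
import numpy as np

def psi(train: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(train, bins=bins)
    train_pct = np.histogram(train, bins=edges)[0] / len(train)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Avoid log(0) in sparse bins.
    train_pct = np.clip(train_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - train_pct) * np.log(live_pct / train_pct)))

rng = np.random.default_rng(0)
train_scores = rng.normal(650, 50, 10_000)  # scores seen at training time
live_scores = rng.normal(600, 70, 10_000)   # the live population has shifted

value = psi(train_scores, live_scores)
print(f"PSI = {value:.3f}")
if value > 0.2:  # common rule-of-thumb threshold for significant shift
    print("Significant drift detected: schedule review and retraining.")
```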
Building trustworthy GenAI by design requires a proactive approach to detect and mitigate misalignment and hallucination, while providing transparency and explainability, despite model drift. As with many security challenges, the weakest link can break a system. Therefore, all elements need to be addressed to win and maintain user trust. The ever-evolving field of GenAI challenges us to continually learn, adapt, and innovate to keep pace with breakthrough AI technologies and AI vulnerabilities.
The rollout of our Secure AI by Design product portfolio has begun.
We can help you solve the problem of protecting your GenAI infrastructure with AI Runtime Security, which is available today. AI Runtime Security is an adaptive, purpose-built solution that discovers, protects, and defends all enterprise applications, models, and data from AI-specific and foundational network threats.
AI Access Security secures your company’s GenAI use and empowers your business to capitalize on its benefits without compromise.
Prisma® Cloud AI Security Posture Management (AI-SPM) protects and controls AI infrastructure, usage, and data. It maximizes the transformative benefits of AI and large language models without putting your organization at risk. It also gives you visibility and control over the three critical components of your AI security — the data you use for training or inference, the integrity of your AI models, and access to your deployed models.
These solutions will help enterprises navigate the complexities of Generative AI with confidence and security.