Blog written by: Brody Kutt, Yiheng An, Haozhe Zhang, Yu Fu, Qi Deng, and Royce Lu
Technical Editors: Nicole Nichols, Sam Kaplan, and Katie Strand
The rollout of our Secure AI by Design product portfolio has begun. If you want to see how we can help secure AI applications, please see the Palo Alto Networks Can Help section below.
This blog illuminates the central technical elements of trustworthy generative AI (GenAI), underscoring its significance for responsible, transparent, and aligned AI operations within an organization. The focus is on preventing unintentional biases, reinforcing organizational values, and promoting user trust. We dissect four key subtopics: GenAI alignment, GenAI hallucination, transparency/explainability, and GenAI model drift. Combined, these four categories span a set of prevalent potential threats and vulnerabilities. Each topic is expanded through real-world examples and proposed strategies for detection and prevention. The blog concludes by advocating for strong governance protocols that build secure AI by design into products from the initial implementation, including continuous monitoring and ethical usage to enhance trust and mitigate potential threats. Throughout this post, we focus on text-based examples; however, the ideas apply to foundation models of any modality, such as images, video, audio, and beyond.
GenAI alignment refers to ensuring that AI systems produce outputs that are in line with human values. In practice, this means preventing the model from inadvertently causing harm, for example, through misinterpretation or misapplication of its objectives. The goal is to ensure that the AI's objectives consistently match the social norms of the context where the model will be used, including previously unseen scenarios.
GenAI alignment is achieved in several steps, beginning with accurate and representative data. Important considerations for this data include:
Datasets can be augmented by incorporating human feedback to validate effectiveness and reliability. Although human feedback is generally more expensive to collect, it helps the model learn and improve upon its objective over time. This can mean interactive training, like reinforcement learning from human feedback (RLHF), where humans are in the loop with the LLM, guiding it toward correct interpretations and responses. This feedback can fine-tune the LLM, helping it generate more accurate and relevant responses.
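To make this concrete, the sketch below shows the pairwise preference loss commonly used to train reward models in RLHF-style pipelines: human annotators pick the better of two responses, and the reward model learns to score the preferred one higher. The embedding size, the random tensors standing in for response embeddings, and the single linear reward head are all placeholder assumptions, not a production recipe.

```python
# Minimal sketch of a pairwise-preference reward model, as used in RLHF.
# The data here is random and purely illustrative; in practice the inputs
# would be embeddings of (prompt, response) pairs ranked by human annotators.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128  # placeholder embedding size

class RewardModel(nn.Module):
    """Scores a response embedding with a single scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

reward_model = RewardModel(EMBED_DIM)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical batch: embeddings of responses humans preferred vs. rejected.
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry style loss: push the preferred response's reward
    # above the rejected response's reward.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A reward model trained this way can then steer further fine-tuning (for example, with a policy-optimization step) so the LLM's outputs drift toward what annotators actually preferred.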
To illustrate misalignment, it is important to first define the LLM’s expected behavior. Consider a model trained to maximize user satisfaction in a customer service context. Misalignment could result from training data that includes meandering and evasive examples from real human agents, or from a reward function (sometimes also called an objective) that overly prioritizes the length of customer engagement. The resulting LLM might intentionally or unintentionally provide inaccurate or ambiguous responses to extend the conversation, leading to frustrated customers and decreased satisfaction. Consider also the example of content moderation. These AI systems must be trained to recognize and filter out inappropriate or harmful content, aligning with the platform's community standards and societal norms. In this case, misalignment could spread hate speech, misinformation, or explicit content.
Misaligned models can also present vulnerabilities. In cybersecurity, an LLM might be trained to alert users when it detects potential phishing emails. However, if the AI's reward function is tied to the number of alerts it generates rather than to the accuracy of those alerts, the result is misalignment: the LLM might over-alert users, causing unnecessary panic and desensitizing them to alerts. This can lead to real threats being overlooked amid the frequency of false alarms.
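The toy simulation below illustrates this reward-design pitfall with made-up numbers: an alerting model rewarded per alert will favor the lowest threshold (flooding users with false alarms), while one rewarded for precision will not. The suspicion-score distribution, phishing rate, and thresholds are invented purely for illustration.

```python
# Toy illustration (hypothetical numbers): rewarding alert *volume*
# encourages over-alerting, while rewarding *precision* does not.
import random

random.seed(0)

def simulate(alert_threshold: float, n_emails: int = 10_000):
    """Emails get a random 'suspicion score'; roughly 2% are actually phishing."""
    true_positives = false_positives = 0
    for _ in range(n_emails):
        is_phish = random.random() < 0.02
        score = random.random() * (1.5 if is_phish else 1.0)
        if score > alert_threshold:
            if is_phish:
                true_positives += 1
            else:
                false_positives += 1
    alerts = true_positives + false_positives
    precision = true_positives / alerts if alerts else 0.0
    return alerts, precision

for threshold in (0.2, 0.5, 0.9):
    alerts, precision = simulate(threshold)
    volume_reward = alerts        # misaligned: more alerts means more reward
    precision_reward = precision  # better aligned with what users value
    print(f"threshold={threshold:.1f}  alerts={alerts:5d}  "
          f"volume_reward={volume_reward:5d}  precision={precision_reward:.2f}")
```

A model optimizing the volume reward drives the threshold down and buries users in false positives; one optimizing precision pushes the threshold up, which is much closer to the behavior users actually want.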
Most models will exhibit some degree of misalignment. It is currently debated whether misalignment in LLMs can ever be eliminated with theoretical guarantees, especially in the case of free-form text generation. However, there are actions to detect and mitigate this behavior:
GenAI hallucination is anthropomorphic terminology for the tendency of models to make up information that is not grounded in their input data. Despite being incorrect, the hallucinated content often appears highly credible and convincing, because looking like human-authored content is frequently a component of LLM training objectives. Hallucinations can expose GenAI systems to a myriad of security vulnerabilities.
GenAI hallucinations can lead to misleading or incorrect results inside applications. Some hallucinations can be entertaining or even useful (to mimic imagination, for example). However, in enterprise applications and safety-critical systems, hallucinations can pose major risks and cause societal harm.
For example, GenAI may hallucinate false claims about people, inventing and proliferating misinformation that harms individuals or groups. There is precedent for this in the real world, such as cases in Georgia and Australia where LLMs produced false and misleading content about public figures. Harm to individuals can also come from erroneous or dangerous medical advice that is convincingly delivered. Researchers have explored the efficacy of these models in patient care use cases and found a notable prevalence of inaccurate medical citations.
Hallucinations can also lead to severe security vulnerabilities. For instance, a GenAI system designed for facial recognition in a security system might hallucinate and incorrectly identify an unauthorized person as authorized. In a more sinister scenario, a malicious user might intentionally manipulate the input data to induce hallucination, tricking the system into generating undesired outputs. Alternatively, in financial use cases, GenAI may hallucinate the precise values of numbers in arbitrary ways, causing widespread inconsistencies.
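For the financial example, one lightweight guardrail is to verify that every number in a generated summary actually appears in the source document before the output is released. The regex and example strings below are a deliberately simple sketch, not a complete validator; it ignores units, rounding, and figures legitimately derived from the source.

```python
# Illustrative check: flag generated numbers that never occur in the source.
import re

NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def extract_numbers(text: str) -> set[str]:
    return {m.group().replace(",", "") for m in NUMBER_RE.finditer(text)}

def unsupported_numbers(source: str, generated: str) -> set[str]:
    """Numbers in the generated text that never occur in the source."""
    return extract_numbers(generated) - extract_numbers(source)

source = "Q3 revenue was $4,213,000, up 8% from Q2."
generated = "Revenue reached $4,231,000 in Q3, an 8% increase."

suspect = unsupported_numbers(source, generated)
if suspect:
    print("Possible hallucinated figures:", suspect)  # {'4231000'}
```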
Addressing GenAI hallucination is critical to maintaining the integrity and security of GenAI applications. As with alignment, it is debated whether hallucination is an inevitable and inescapable feature of GenAI technologies. However, many methods have been developed to address hallucination across the majority of everyday use cases:
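One widely used family of mitigations grounds the model in retrieved source material and then verifies that each generated sentence is supported by it before the answer is shown. The sketch below uses a crude lexical-overlap score as the support test and an arbitrary 0.6 threshold; a production system would typically swap in an entailment or semantic-similarity model, and the passage and CVE identifier are invented example data.

```python
# Sketch of a grounding check for retrieval-augmented generation (RAG):
# each generated sentence must share enough vocabulary with at least one
# retrieved passage, otherwise it is flagged as potentially hallucinated.
# Lexical overlap is only a stand-in for a proper entailment model.
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_score(sentence: str, passage: str) -> float:
    """Fraction of the sentence's tokens that also appear in the passage."""
    sent = tokens(sentence)
    return len(sent & tokens(passage)) / max(len(sent), 1)

def flag_ungrounded(answer: str, passages: list[str], threshold: float = 0.6):
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if sentence and max(support_score(sentence, p) for p in passages) < threshold:
            flagged.append(sentence)
    return flagged

passages = ["The patch for CVE-2024-0001 was released in February 2024."]
answer = ("The patch for CVE-2024-0001 was released in February 2024. "
          "It also silently enables remote logging by default.")
print(flag_ungrounded(answer, passages))
# ['It also silently enables remote logging by default.']
```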
GenAI systems are often perceived as 'black boxes' due to their complex and opaque decision-making processes. This opacity can lead to mistrust and skepticism, particularly when these systems are deployed in sensitive areas where decisions can have far-reaching consequences. However, it doesn't have to be this way.
Transparency and explainability are two fundamental pillars of trustworthy GenAI. Transparency refers to the accessibility and clarity of information about the GenAI system's design, capabilities, and performance. A transparent GenAI system provides insight into how it was trained, descriptions of the data it uses, its strengths and limitations, and its performance metrics. This information is crucial in assessing the system's reliability and suitability for different tasks. Explainability refers to the extent to which the internal workings of an AI system can be understood in human terms. It involves the AI system providing a clear, understandable account of its decision-making process. For instance, if a GenAI system decides to flag a network activity as suspicious, it should be able to explain why it made that decision, such as highlighting the specific patterns or behaviors it deemed unusual. Together, transparency and explainability serve as key tools in promoting the ethical use of AI systems, ensuring accountability, and building user trust. They can aid in understanding why a system acted in a certain way and in rectifying any issues.
A lack of transparency and explainability can, first and foremost, lead to societal harm:
GenAI systems should be intentionally designed to provide clear explanations for their decisions. GenAI explainability is a wide area of research. Explanations can take many forms:
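One simple, model-agnostic form is perturbation-based attribution: remove each input feature in turn and measure how much the model's score drops. In the sketch below, the `score_suspicious` function is a toy stand-in (a hypothetical keyword counter) for whatever detection model is actually deployed.

```python
# Perturbation-based explanation sketch: how much does each input token
# contribute to the model's 'suspicious' score?
def score_suspicious(event_tokens: list[str]) -> float:
    """Toy scorer: counts hypothetical risk keywords (placeholder for a real model)."""
    risky = {"failed_login", "3am", "new_country", "admin_account"}
    return float(sum(token in risky for token in event_tokens))

def explain(event_tokens: list[str]) -> list[tuple[str, float]]:
    """Attribution = score drop when a token is removed (leave-one-out)."""
    base = score_suspicious(event_tokens)
    attributions = []
    for i, token in enumerate(event_tokens):
        perturbed = event_tokens[:i] + event_tokens[i + 1:]
        attributions.append((token, base - score_suspicious(perturbed)))
    return sorted(attributions, key=lambda kv: kv[1], reverse=True)

event = ["failed_login", "3am", "new_country", "password_reset", "vpn"]
for token, weight in explain(event):
    print(f"{token:15s} {weight:+.3f}")
```

The ranked output (for example, that "failed_login" contributed most to the flag) is exactly the kind of human-readable account described above, and the same leave-one-out idea applies to far more complex models.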
Model drift occurs when performance deteriorates over time because external conditions change. For example, a model trained to recognize cats on the internet may begin to fail if a social trend suddenly increases the number of cat photos that show only a wide-angle zoom of the nose and whiskers. Without continuous updating, a deployed model is limited to responses based on its historical training data. Closed-source, third-party models are particularly opaque, making drift more difficult to assess. Continuously updating a model can be costly and time-consuming; however, model drift is a common issue that should be addressed when robustness is a priority.
Drift can exacerbate issues with both alignment and hallucination, which we discuss at length above, because GenAI models perform worse when asked to operate outside of what they saw in training. Drift can expose the model to new and unpredictable scenarios it has never seen before. Reliance on an outdated model can pose significant security and operational risks, especially if an organization depends heavily on the model's outputs without periodic verification and updates. For example, consider a loan scenario where the criteria or algorithm for calculating credit scores has changed. If the model still uses outdated standards to assess creditworthiness, a previously high credit score might no longer reflect the same level of risk. Relying on outdated models can lead to inaccurate loan approvals or rejections, causing financial and reputational harm.
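Drift is usually caught by monitoring rather than by inspecting the model itself. A common tactic is to compare the live distribution of an input feature (or of prompt and response embeddings) against the training-time distribution, for example with the Population Stability Index; the synthetic credit-score data and the 0.2 alert threshold below are illustrative assumptions.

```python
# Sketch of drift monitoring with the Population Stability Index (PSI):
# compare a feature's live distribution against its training distribution.
import numpy as np

def psi(train: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(train, bins=bins)
    train_pct = np.histogram(train, bins=edges)[0] / len(train)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Avoid log(0) in sparse bins.
    train_pct = np.clip(train_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - train_pct) * np.log(live_pct / train_pct)))

rng = np.random.default_rng(0)
train_scores = rng.normal(650, 50, 10_000)  # scores seen at training time
live_scores = rng.normal(600, 70, 10_000)   # the live population has shifted

value = psi(train_scores, live_scores)
print(f"PSI = {value:.3f}")
if value > 0.2:  # common rule-of-thumb threshold for significant shift
    print("Significant drift detected: schedule review and retraining.")
```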
Building trustworthy GenAI by design requires a proactive approach to detect and mitigate misalignment and hallucination, while providing transparency and explainability, despite model drift. As with many security challenges, the weakest link can break a system. Therefore, all elements need to be addressed to win and maintain user trust. The ever-evolving field of GenAI challenges us to continually learn, adapt, and innovate to keep pace with breakthrough AI technologies and AI vulnerabilities.
The rollout of our Secure AI by Design product portfolio has begun.
We can help you solve the problem of protecting your GenAI infrastructure with AI Runtime Security, which is available today. AI Runtime Security is an adaptive, purpose-built solution that discovers, protects, and defends all enterprise applications, models, and data from AI-specific and foundational network threats.
AI Access Security secures your company’s GenAI use and empowers your business to capitalize on its benefits without compromise.
Prisma® Cloud AI Security Posture Management (AI-SPM) protects and controls AI infrastructure, usage, and data. It maximizes the transformative benefits of AI and large language models without putting your organization at risk. It also gives you visibility and control over the three critical components of your AI security — the data you use for training or inference, the integrity of your AI models, and access to your deployed models.
These solutions will help enterprises navigate the complexities of Generative AI with confidence and security.