GenAI Security Technical Blog Series 3/6: Secure AI by Design - Cracks in the Digital Fortress

rlu · ‎07-15-2024

General Graphics (2).jpg

Blog written by: Yiheng An, Brody Kutt, Haozhe Zhang, Yu Fu, Qi Deng, Royce Lu

Introduction
Data Security
Potential Data Breaches
Leak of Sensitive Training Data from Data Extraction Attack
Data accessible by the model
User interaction history
Threats to Data
Membership Inference Attack
Extractable Memorization
Model Security
Safeguarding AI Models
Model Parameter
Model Fingerprint
Model Integrity
Threats to Model
Data Poisoning
Model Skewing Attack
Model Inversion
Conclusion
Palo Alto Networks Can Help

The rollout of our Secure AI by Design product portfolio has begun. If you want to see how we can help secure AI applications, please see the Palo Alto Networks Can Help section below.

Introduction

Generative AI (GenAI) applications have rapidly advanced since 2020, significantly impacting productivity. These applications use user inputs as prompts for LLM models, creating a unique attack vector. We have talked about details for this content in the previous blog Securing Generative AI: A Comprehensive Framework and GenAI Security Framework Blog Series 2/6: Prompt Injection 101.

In this blog, we will focus on data security in the field of Generative AI. This technology relies heavily on the size and quality of training data, powered by breakthrough Transformer architectures. Transformers are advanced neural networks designed to process and understand sequential data like text, revolutionizing how AI systems interpret language.Large language models, built on Transformer technology, can effectively understand language semantics in specific contexts and perform a wide variety of text-related tasks. To ensure their accuracy and effectiveness, these models require high-quality training data. Consequently, protecting the privacy and integrity of this data is crucial.

The performance of modern generative AI, which commonly uses the Transformer architecture, largely depends on the quality of its training data. Preventing leaks and compromises of this high-quality data has become a critical challenge in developing high-performing large language models. Companies often fine-tune these models with their internal data to answer specific questions. Safeguarding these internal documents from leaks is equally important. Additionally, some security risks from earlier deep learning neural networks persist, such as training data poisoning, which can degrade model performance. Even more concerning are potential backdoors that can be inserted into the models.

These risks are amplified because generative AI is often granted access to internal documents and source code to better assist in completing tasks. This access not only increases the potential impact of a security breach but also provides vectors for introducing backdoors or manipulating the model's behavior. Consequently, this data remains at risk of being leaked.

In this blog, we will discuss these issues one by one, exploring the challenges and potential solutions in securing generative AI systems.

Data Security

Potential Data Breaches

Leak of Sensitive Training Data from Data Extraction Attack

Training data is crucial in AI, highlighting the need to maintain its safety and integrity. However, in the age of large language models (LLMs), there are numerous ways that training data can be compromised. The scope and breadth of training data required for these models introduce an expanded attack surface for those developers who are training or refining these LLMs. This expanded attack surface gives threat actors more potential vectors to compromise the integrity of those data sources.

Unlike traditional data security, where access to data is tightly controlled, LLMs present a unique challenge. After training, these models often do not retain the original data, but they can still generate outputs that potentially reveal sensitive training data. This means simply storing training data in a secure database is insufficient. Recent work has successfully shown retrieving sensitive information from operational models. Such data breaches could expose confidential internal documents that are not intended for public view.

Data accessible by the model

LLM are powerful tools capable of automating a wide range of tasks, thereby significantly reducing the need for manual labor. One of their most notable capabilities is efficiently summarizing web pages and documents which can enhance information processing and aid in decision-making. To do this, LLMs can be granted access to data in the production database that was not included in the training dataset. Some of this data may be less sensitive, such as product information, but some may be more sensitive, such as algorithms, internal documents, or even real customer data.

Prompt injection attacks can exploit vulnerabilities in generative AI models, leading them to return sensitive data not intended for end users. For instance, an attacker might craft a prompt that tricks the model into disclosing another user's private information. This risk arises when the AI agent does not adequately sanitize and validate inputs and outputs. Ensuring robust input validation and output filtering is essential to mitigate such threats. Prompt injection can be particularly dangerous in contexts where AI models handle confidential or personal data. Proper safeguards must be implemented to prevent unauthorized data access through manipulated prompts.

User interaction history

In communications with an LLM, users can typically access the full history of their interactions within that session. Even if there is a long break between conversations, the LLM can seamlessly continue the dialogue as long as the session history is maintained. This continuity is enabled by the chat platform, which stores the chat history for each session and sends the entire history—or a truncated version if it's too lengthy—to the model. This allows the model to understand the context and generate relevant responses.

However, this feature also presents data security risks. The stored chat history could potentially be extracted deliberately or leaked accidentally. Ensuring the security and privacy of this data is crucial, as it contains personal interactions that users expect to remain confidential. Moreover, these user interactions often contain confidential or proprietary information that the enterprise might not have direct control over. For instance, users might inadvertently share sensitive business details, personal data, or intellectual property during their conversations with the AI. This introduces a unique attack vector, as malicious actors could potentially access a wealth of sensitive information from various sources through a single breach of the chat history.

Adequate data protection measures and robust security protocols specific to user interaction histories are essential to prevent unauthorized access and safeguard both user privacy and enterprise confidentiality. These measures must account for the unpredictable nature of user-generated content and the potential sensitivity of information shared during AI interactions.

Threats to Data

Membership Inference Attack

Membership inference attacks (MIAs) are designed to ascertain whether a particular data sample was included in a model’s training dataset. Traditionally, executing MIAs requires having access to the model's prediction probabilities or confidence scores for the data samples in question. These scores help attackers compare the model's behavior on the target sample against its behavior on samples known to be inside and outside the training set. The difference in the model's response can then be used to infer membership. Additionally, recent research has shown that an attacker can create a separate model specifically tailored to learn and infer membership. Traditionally, executing MIAs requires a deep understanding of the model's training data and outputs. This involves setting a threshold for a loss function to identify if a data point was part of the training set. Such an approach can be challenging with commercial LLMs, where the specifics of the training data are generally kept confidential by the providers.

Despite these challenges, LLMs are not immune to MIAs. One method attackers use is called the Neighborhood Comparison approach. In this method, attackers slightly modify a specific data point to create several variants and then observe how the model responds to each variant. If the original input consistently triggers responses that are significantly different from those of its variants, it could suggest that the original data point, or something very similar, was part of the training dataset. This method highlights a potential vulnerability in LLMs as it does not require prior detailed knowledge of the model’s training data. This is a feasible approach for attackers to infer membership without insider information.

Extractable Memorization

LLMs have revolutionized natural language processing, demonstrating remarkable capabilities in understanding and generating human-like text. However,recent research has uncovered a vulnerability in these models that raises important privacy concerns. The study has revealed that LLMs can memorize and reproduce specific pieces of their training data when prompted in certain ways.

This phenomenon, known as extractable memorization, allows external entities to potentially retrieve data from the model without prior knowledge of the training set. The researchers developed a novel technique called a 'divergence attack' to exploit this vulnerability. By prompting AI Chatbot to repeat a word multiple times, they could cause the model to deviate from its usual responses and output memorized data.

Alarmingly, the study found that the memorized information could include personally identifiable information (PII) such as email addresses and phone numbers. This discovery has profound implications for the design, implementation, and use of language models, suggesting that current alignment and adjustment techniques may be insufficient to prevent unintended data leakage.

The researchers emphasize the need for further investigation into areas like training data deduplication and the relationship between model capacity and memorization. As we continue to harness the power of LLMs, this study underscores the critical importance of balancing their capabilities with robust privacy protections, opening new avenues for research to ensure these models remain both powerful and respectful of user privacy.

One technique used in a divergence attack involves prompting the model to repeat a specific word multiple times. We have already discussed this technology in the previous blog: GenAI Security Framework Blog Series 2/6: Prompt Injection 101.

Fig 1_Cracks-Digital-Fortress_palo-alto-networks.png

Model Security

Safeguarding AI Models

Model Parameter

Model parameters are fundamental to the operation of any machine learning model, and their security is critical, especially when deploying large language models (LLMs) in production environments. Once loaded into a system, it is vital to establish a Trusted Execution Environment (TEE) to protect these parameters, substantially minimizing the risk of unauthorized access by other processes. Moreover, encrypting the model parameters is crucial both during runtime and while at rest to ensure they remain secure throughout their lifecycle. Additionally, implementing robust user authentication mechanisms is essential for enhancing security, as it helps prevent unauthorized access to the model. Collectively, these security practices form a robust framework for safeguarding sensitive model data, essential for maintaining the integrity and confidentiality of the models.

Model Fingerprint

For generative AI, developing models from scratch is often impractical for enterprises. Instead, they may utilize open-source LLMs like LLAMA, Bloom, and BERT, which they can fine-tune to suit specific applications. These open-source LLMs are comparable to open-source web frameworks in that their structures are well-documented and transparent. This transparency shifts the security model from a black box (where internals are unknown) to a gray box scenario (where some knowledge about the system is known).

Transfer learning is a popular approach for companies that want to create their own fine-tuned LLMs but need more substantial computing resources or are reluctant to allocate a large budget to them. Rather than building a model from scratch, using these open-source models as fully visible starting points can be a cost-efficient alternative. The process involves freezing the majority of the model’s layers and fine-tuning only the final layers with new data. This approach maintains the core performance of the model since most layers remain unchanged, while the final layers incorporate additional knowledge.

However, the inherent transparency of open-source models presents a structural security consideration. While it allows for greater customization and understanding of the model, it also means that potential attackers have access to the same data and can better understand the model’s structure. This can facilitate targeted attacks, especially if the attackers identify the base model an enterprise is using for fine-tuning.

Additionally, recent research indicates that an attacker can embed a backdoor during the pre-training phase. Consequently, any model created through transfer learning from this compromised model may inherit the backdoor. If a vulnerability is discovered in a widely used open-source model, it can affect all models derived from it. Recent work has shown the transferability of attacks not only within LLM families but between families as well. That is, a successful attack on one LLM confers a relatively high probability of success on a different LLM even if it has been fine-tuned for a different task. Furthermore, models produced via transfer learning from a tainted model are indirectly subjected to poisoning attacks, as the malicious parameters are integrated into the originally trained model.

From a security perspective, model fingerprinting can aid in detecting potentially compromised models. If a vulnerability or backdoor is discovered in a base model, fingerprinting techniques can quickly identify which fine-tuned models might be affected, allowing for targeted updates and patches. Additionally, fingerprinting can help verify the integrity of models in use, ensuring they haven't been tampered with or replaced by malicious versions.

However, the transparency of open-source models also means that attackers could potentially use fingerprinting techniques to identify specific model implementations and target their attacks more effectively. This dual nature of model fingerprinting – as both a security tool and a potential attack vector – underscores the need for robust and secure fingerprinting methods in the development and deployment of AI models.

Model Integrity

Unlike traditional software, LLMs and other deep learning models are often considered black boxes to both developers and users. In traditional programming, developers can audit the source code to detect and remove malicious insertions, such as backdoors. If the code integrity is compromised, developers can usually trace and rectify such changes. However, in the field of Generative AI, detecting embedded backdoors is significantly more challenging. In this context, a backdoor doesn’t involve direct code manipulation but rather a conditioning of the model to respond anomalously to specific, typically innocuous inputs. For example, an LLM might be manipulated to output sensitive or nonsensical information upon receiving a certain sequence of meaningless characters that a typical user wouldn’t input. This type of manipulation involves backdoor triggers, which are distinct from typical prompt injection attacks. While prompt injection attacks generally involve crafting inputs to elicit unintended outputs, backdoor triggers are covertly embedded during the model's training process and can be activated by specific inputs. Recent work has also shown multiple backdoor triggers can be distributed across multiple prompt components to increase the obfuscation of the trigger and better evade detectors. To mitigate such risks, it is crucial to ensure the integrity of the model throughout its lifecycle, from development to deployment, monitoring for any unusual behaviors that could indicate the presence of such vulnerabilities.

Threats to Model

Data Poisoning

A data poisoning attack is a method used to compromise machine learning models by introducing tainted data into the training or fine-tuning dataset. This type of attack does not require manipulation of the entire dataset but rather can be effective even when only a small portion of the data is corrupted. Poisoning attacks can take various forms, such as polluting the dataset with incorrect labels, which leads to inaccurate model outputs, or inserting a large amount of meaningless data, which degrades the model's overall performance. Additionally, some poisoning attacks are designed to embed a backdoor in the model, similar to the backdoor triggers previously mentioned. These backdoor triggers can be activated by specific inputs to manipulate the model’s outputs covertly.

Research indicates that even a small percentage of poisoned data can be sufficient to successfully install a backdoor in a model. Studies have shown that even with an Attack/Clean Ratio (A/C Ratio) as low as 0.01, a backdoor can be successfully implemented in a model. Remarkably, the original model maintains its performance on clean data, indicating that the backdoor injection does not degrade its overall efficacy. This underscores the subtlety and potential danger of backdoor attacks, as they can be introduced with minimal malicious data without noticeably affecting the model's primary performance.

In a legitimate use scenario, users can provide feedback on responses generated by a large language model. This feedback is crucial for the training phase called reinforcement learning from human feedback (RLHF). In RLHF, human evaluations fine-tune the model to improve its performance and align its behavior with human values and expectations.

Involving more human input through feedback is beneficial for LLM training, as it helps the model learn from a diverse range of experiences and perspectives. This collaborative approach enhances the model's ability to generate accurate and contextually appropriate responses.

However, a significant challenge is distinguishing between feedback given with good intentions and feedback given with malicious intent. Some users might deliberately provide misleading or harmful feedback to bias the model. This vulnerability requires robust mechanisms to evaluate and filter feedback to maintain the integrity and reliability of the LLM training process.

Furthermore, production systems often utilize user feedback to refine and retrain models. For example, an AI Chatbot may ask users to evaluate the quality of its responses or choose between two options to determine which is better. This feedback mechanism is crucial for continuous improvement, yet, users may not always be aware of its significance or may abuse the system for malicious purposes. For instance, if a user leaves a website after receiving an answer, it might be interpreted that the answer was satisfactory, which could be misleading. Attackers without direct access to training data could exploit this feedback loop to influence the model inappropriately, leading to biased outcomes.

Model Skewing Attack

LLMs can make decisions based on their extensive knowledge and information provided by users. However, these models are vulnerable to model skewing attacks, where malicious users manipulate the model's outputs to align with their expectations. To illustrate this, let's consider this scenario in e-commerce: A popular online marketplace relies on a sophisticated recommendation system to suggest products to its users. This system learns from user behavior, including browsing patterns and purchase history. However, an unscrupulous seller devises a plan to game the system. They create numerous fake accounts that interact with their products, simulating high interest and sales. These accounts also flood the platform with glowing reviews and ratings. As the recommendation model updates with this artificial data, it begins to favor the dishonest seller's products. Genuine users soon find their feeds filled with these potentially subpar items, while deserving products from honest sellers get pushed aside. The platform's user experience suffers, potentially driving away customers. This manipulation not only harms the marketplace's reputation but also undermines the trust that both buyers and sellers place in the system.

Another area of concern is phishing email detection. A successful model-skewing attack could mislead the model into categorizing phishing emails sent by the attacker as safe, thus undermining security measures. For example, if an attacker gains access to the training process and injects specially crafted emails into the learning loop, these emails can be designed to look like legitimate emails but contain subtle features typically associated with phishing emails. As a result, the model learns to associate these skewed features with legitimate emails. Consequently, the model's ability to distinguish between phishing and legitimate emails is degraded, leading to an increase in false negatives where phishing emails are incorrectly classified as safe. This not only poses a significant security risk by increasing the likelihood of users falling victim to phishing attacks but also undermines the trust in the email filtering system.

Model Inversion

Model inversion attacks meticulously reconstruct an input that would elicit a specific output from a model. This approach can inadvertently reveal sensitive training data or precise commands that were issued to the model. Consider a loan application scenario: if an attacker can reverse-engineer the model, they might identify the exact criteria that lead to a “pass” decision, thereby manipulating the model’s approval process.

To carry out such an attack, attackers start by analyzing the outputs the model generates for various inputs, building a profile of next-token probabilities. In an open-source model, attacks can get the complete next-token probabilities from the last layer. For LLMs offered as services, complete token probabilities are often not accessible. However, these platforms typically allow users to adjust certain hyperparameters such as the temperature. The temperature setting controls the variability of the model's output, adjusting the uniqueness or originality of the responses. A higher temperature results in more diverse and creative outputs, while a lower temperature produces more focused and deterministic responses. Recent research shows that this feature, while intended to refine outputs, can be exploited by attackers to produce a range of outputs from a single input, thus enabling them to approximate the full probability distribution. This detailed understanding of outputs helps attackers predict under what specific conditions the model generates particular decisions, effectively compromising the model’s integrity and the confidentiality of its internal processes.

Conclusion

As GenAI and machine learning continue to evolve and integrate into various facets of our lives, the significance of robust security measures cannot be overstated. Ensuring the security of both data and models is paramount to maintaining the integrity and trustworthiness of AI systems. The potential data breaches, from compromising training data to exploiting user interaction history, underscore the critical need for vigilance and advanced protective strategies. Similarly, safeguarding model attributes and preventing sophisticated attacks like data poisoning and model inversion are essential to uphold the reliability of AI solutions.

Precision AI by Palo Alto Networks represents a significant advancement in enhancing the security of AI systems. It provides robust, real-time detection, prevention, and remediation capabilities tailored for cybersecurity. Precision AI’s ability to analyze vast amounts of security-specific data and its focus on high-fidelity automation allows for rapid and accurate threat mitigation. This technology can enhance the security of generative AI (GenAI) systems by improving data security and safeguarding model attributes, thereby preventing sophisticated attacks.

Security Posture Management (AI-SPM) by Palo Alto Networks is pivotal in addressing a range of security issues associated with generative AI and LLMs. By offering comprehensive features, AI-SPM helps safeguard AI systems from vulnerabilities, ensuring their integrity and reliability. AI-SPM enhances security by fixing model misconfigurations and ensuring the correct plug-in and weights, safeguarding models from potential exploits. Monitoring for excessive model agency prevents models from behaving unpredictably or exposing sensitive information. By addressing supply chain vulnerabilities, remedying data poisoning, correcting misconfigurations, and preventing data exposure, AI-SPM ensures AI models remain secure, reliable, and compliant with data privacy regulations, providing robust protection against various security challenges in an evolving threat landscape.

In this blog, we explored the multifaceted threats to data and model security, highlighting the intricate challenges that must be addressed to protect these systems. By understanding and mitigating these risks, we can pave the way for the development of secure, resilient AI technologies. As we advance, continuous research, proactive security measures, and collaborative efforts within the AI community will be crucial in defending against evolving threats, ensuring that AI remains a beneficial and trusted tool in our digital age.

Palo Alto Networks Can Help

The rollout of our Secure AI by Design product portfolio has begun. With our AI product portfolio, you can Secure AI by Design rather than patching it afterwards. Plus, leverage the power of an integrated platform to enable functionality without deploying new sensors. Our customers can now use our technologies powered by Precision AI™ in this brand-new fight.

We can help you solve the problem of protecting your GenAI infrastructure with AI Runtime Security that is available today. AI Runtime Security is an adaptive, purpose-built solution that discovers, protects, and defends all enterprise applications, models, and data from AI-specific and foundational network threats.

AI Access Security secures your company’s GenAI use and empowers your business to capitalize on its benefits without compromise.

Prisma® Cloud AI Security Posture Management (AI-SPM) protects and controls AI infrastructure, usage and data. It maximizes the transformative benefits of AI and large language models without putting your organization at risk. It also gives you visibility and control over the three critical components of your AI security — the data you use for training or inference, the integrity of your AI models and access to your deployed models.

These solutions will help enterprises navigate the complexities of Generative AI with confidence and security.

GenAI Security Technical Blog Series 3/6: Secure AI by Design - Cracks in the Digital Fortress

GenAI Security Technical Blog Series 3/6: Secure AI by Design - Cracks in the Digital Fortress

Introduction