Bridging the Language Gap: Our Journey to a Synonym-Aware RAG System at Palo Alto Networks


 


 

In the fast-paced world of technology, where information is king, the ability to quickly and accurately access knowledge is paramount. Whether it's a customer seeking support, an engineer debugging a system, or a sales team looking up product specifications, the expectation is instant, precise answers. But a subtle and pervasive linguistic challenge often stands in the way: synonyms.

 

At Palo Alto Networks, like many enterprises, we faced this head-on. Our internal documentation, product names, and technical jargon are rich with acronyms, aliases, and specific terminology. A user might search for "PAA," while the relevant document refers to "Prisma Access Agent." Or they might use a common term like "ML" when the official documentation uses "Machine Learning." This seemingly minor difference, known as lexical mismatch, can lead to frustratingly incomplete answers from even the most sophisticated Retrieval-Augmented Generation (RAG) systems. If a RAG system can't understand that "PAA" and "Prisma Access Agent" are the same, it fails to retrieve all relevant information, directly translating into incomplete or inaccurate responses. This isn't just an inconvenience; it erodes user trust and diminishes the overall utility of our AI solutions.

 

This challenge isn't unique to us. It's a universal problem that every enterprise, from tech giants to specialised startups, must address to truly unlock the power of their knowledge bases. In this blog, we'll explore why synonym awareness is crucial for every enterprise, how industry leaders like Google and Amazon tackle this, and how we at Palo Alto Networks built a robust, synonym-aware RAG system to ensure no relevant information is left behind.

 

The Universal Challenge: Why Every Enterprise Needs Synonym Awareness

 

The core limitation of traditional information retrieval systems, including many RAG implementations, lies in their reliance on exact keyword matching. While powerful, this approach struggles with the inherent variability of human language. Users search for meaning, not just literal strings of characters.

 

Consider an enterprise with vast internal documentation, customer support knowledge bases, or product manuals. An employee might search for "PTO policy," but the official document uses "Paid Time Off." A customer might ask about "CRM," while the system's data refers to "Customer Relationship Management." Without explicit synonym awareness, these relevant documents, though conceptually related, might be missed due to a simple difference in phrasing. This directly impacts the recall of the retrieval component, leading to incomplete or inaccurate answers from the RAG system. The system effectively creates "blind spots" in its knowledge retrieval, compromising its utility and user trust.

 

The importance of synonyms extends beyond just finding exact matches. It's about:

 

  • Improving search relevance: Finding documents that use different terms to express the same concept.  
  • Making domain-specific vocabulary user-friendly: Allowing users to use terms they are more familiar with.  
  • Handling common misspellings and typos: Transparently correcting common mistakes.  

 

For any RAG system aiming for high accuracy and user satisfaction in real-world applications, addressing synonym handling is not a peripheral feature but a critical requirement.

 

Learning from the Giants: Google and Amazon's Masterclass in Language Understanding

 

Before diving into our solution, it's insightful to examine how two of the world's leading information and e-commerce platforms, Google and Amazon, have addressed the synonym challenge. Their approaches, while distinct, offer valuable lessons for any enterprise.

 

Google's Semantic Symphony: Understanding User Intent

 

Google, a pioneer in search, recognised the importance of understanding language beyond keywords decades ago. Its Knowledge Graph helps connect queries to real-world entities, moving beyond simple word matching to concept-level understanding. Furthermore, Google's Natural Language Understanding (NLU) capabilities interpret the nuance and intent behind user queries, allowing for sophisticated query rewriting that implicitly handles synonyms by mapping natural language to structured data.  

 

Amazon's Product Discovery Engine: Driving Conversions with Synonyms

 

Amazon's search engine, powered by its A9 algorithm, is "laser-focused on sales" and aims to "facilitate purchases". Amazon directly engages with sellers, advising them to "make use of synonyms" and "expand your list with synonyms" through "backend search terms" to maximise product discoverability. For enterprise clients, Amazon offers Kendra, which allows for custom, domain-specific synonyms via "thesaurus files".  

 

Our Journey: Building a Synonym-Aware RAG System at Palo Alto Networks

 

Inspired by these industry leaders and driven by our own need for precise information retrieval, we embarked on building a truly synonym-aware RAG system at Palo Alto Networks. Our goal was to ensure that regardless of how our users phrased their queries—whether using an official product name, a common acronym, or an internal alias—they would always find the most relevant information.

 

RAG2.jpg

 

Our Multi-Layered Solution

 

We adopted a comprehensive strategy that integrates several key components:

 

  1. Building Comprehensive Synonym Dictionaries for Our Domain: This forms the explicit, rule-based foundation for mapping known variations, especially for our domain-specific jargon and product names (e.g., "PAA" to "Prisma Access Agent").  
  2. Implementing Query Expansion at the Retrieval Stage: We proactively modify the user's input query to include known synonyms, broadening the search scope before it hits the database.  
  3. Combining Semantic Search for Optimal Coverage: We leverage vector embeddings to capture the conceptual meaning of both documents and queries, enabling the retrieval of semantically similar documents even without exact lexical matches.  
  4. Continuously Updating Synonym Mappings Based on User Feedback: Language is dynamic, so our system incorporates a mechanism for ongoing refinement of its synonym dictionaries, adapting to evolving terminology and user query patterns.  
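
To make the first two layers concrete, here is a minimal sketch of a synonym dictionary and query expansion. It is an illustration under stated assumptions, not our production code: the names `SYNONYMS`, `build_bidirectional`, `has_term`, and `expand_query` are hypothetical, and the dictionary holds only two example entries taken from this post.

```python
import re

# Hypothetical synonym dictionary: short form -> known long forms.
# The entries mirror examples from this post; the structure is an assumption.
SYNONYMS = {
    "paa": ["prisma access agent"],
    "ml": ["machine learning"],
}


def build_bidirectional(synonyms):
    """Return a lookup in which every term maps to all of its other variants."""
    lookup = {}
    for short, longs in synonyms.items():
        group = [short, *longs]
        for term in group:
            lookup[term] = [other for other in group if other != term]
    return lookup


def has_term(text, term):
    """Whole-word match, so that 'ml' does not fire inside 'html'."""
    return re.search(rf"\b{re.escape(term)}\b", text) is not None


def expand_query(query, lookup):
    """Append known variants of every dictionary term found in the query."""
    lowered = query.lower()
    additions = []
    for term, variants in lookup.items():
        if has_term(lowered, term):
            for variant in variants:
                if not has_term(lowered, variant) and variant not in additions:
                    additions.append(variant)
    return " ".join([query, *additions])


LOOKUP = build_bidirectional(SYNONYMS)
print(expand_query("How do I install PAA?", LOOKUP))
# prints: How do I install PAA? prisma access agent
```

Appending variants rather than replacing terms keeps the user's original wording in the query, so exact matches on the original phrasing are not lost.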

 

Deep Dive into Our Architecture: The VectorDBSynonymManager

 

The heart of our system is the SynonymManager class, designed to orchestrate this multi-layered approach.  

 

SynonymManager.jpg

 

1. Setting Up Our Vector DB Collections for Synonym Awareness

 

We configured our main VectorDB collection to store documents with multiple, complementary vector representations:

 

  • semantic: This vector stores embeddings of the original, unexpanded document text. It captures the core conceptual meaning, allowing for general queries.  
  • synonym_expanded: This vector stores embeddings of the document text after it has been expanded with synonyms. This broadens the document's discoverability by representing it in terms of its various lexical forms.  
  • keyword (sparse): A sparse vector configuration is included for efficient keyword-based matching, invaluable for precise lexical lookups of acronyms or product names.  

 

This multifaceted representation ensures that our system is robust against both semantic drift and exact term mismatches.

 

Beyond the main document collection, we created a separate synonym_collection within VectorDB to persistently store our synonym mappings. This design choice enables dynamic management of our synonym dictionary.  

 

Table: Vector Database Vector Types for Synonym-Aware RAG

 

| Vector Type | Purpose | Source Text for Embedding | Example Use Case |
| --- | --- | --- | --- |
| semantic | Captures core conceptual meaning | Original Document Text | General queries, broad topic understanding |
| synonym_expanded | Enhances discoverability via synonyms | Expanded Document Text | Queries using synonyms, acronyms, or variations |
| keyword (sparse) | Enables precise lexical/keyword matching | Expanded Document Text | Exact term searches, specific entity lookups |

 

2. Intelligent Document Indexing: Multi-Vector Embeddings in Action

 

Our index_document method orchestrates the creation of these multi-faceted document representations:

 

  • Original Semantic Embedding: We first generate an embedding from the raw, original document text.  
  • Synonym Expansion: Crucially, the document's original text undergoes a dual-direction synonym expansion. For example, if "AI" is found, "artificial intelligence" is appended, and if "artificial intelligence" is found, "AI" is also appended. This ensures comprehensive lexical coverage.  
  • Synonym-Expanded Embedding: An embedding is then generated from this expanded text, representing the document's meaning enriched by all its known synonym variations.  
  • Sparse Keyword Vector: A sparse vector is created from the expanded text for efficient keyword-based matching.  
  • Upsert to VectorDB: All three vector types, along with the original and expanded text, are then upserted into the main Vector DB collection.  
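
The indexing steps above can be sketched roughly as follows. This is a simplified illustration, not the actual index_document method: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, `LOOKUP`, `expand_text`, `bucket`, and `sparse_keywords` are hypothetical helpers, and the function returns the record instead of upserting it into a vector database.

```python
import hashlib
import math
import re
from collections import Counter

# Hypothetical bidirectional synonym lookup (each term maps to its variants).
LOOKUP = {
    "ai": ["artificial intelligence"],
    "artificial intelligence": ["ai"],
}

DIM = 16  # toy embedding size; a real embedding model would use far more


def has_term(text, term):
    """Whole-word match of a (possibly multi-word) term."""
    return re.search(rf"\b{re.escape(term)}\b", text) is not None


def expand_text(text):
    """Append every known variant of each dictionary term found in the text."""
    lowered = text.lower()
    extra = [v for term, vs in LOOKUP.items() if has_term(lowered, term)
             for v in vs if not has_term(lowered, v)]
    return " ".join([text, *dict.fromkeys(extra)])


def bucket(token, size):
    """Stable hash of a token into the range [0, size)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size


def toy_embed(text):
    """Stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[bucket(token, DIM)] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sparse_keywords(text, vocab_size=2**20):
    """Sparse vector as {hashed index: term count}."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    sparse = {}
    for token, count in counts.items():
        idx = bucket(token, vocab_size)
        sparse[idx] = sparse.get(idx, 0.0) + float(count)
    return sparse


def index_document(doc_id, text):
    """Build the record that would be upserted into the main collection."""
    expanded = expand_text(text)
    return {
        "id": doc_id,
        "vectors": {
            "semantic": toy_embed(text),              # original text
            "synonym_expanded": toy_embed(expanded),  # expanded text
            "keyword": sparse_keywords(expanded),     # sparse, expanded text
        },
        "payload": {"text": text, "expanded_text": expanded},
    }


record = index_document(1, "Our AI models detect threats.")
print(record["payload"]["expanded_text"])
# prints: Our AI models detect threats. artificial intelligence
```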

 

Create New Collection

 

This pre-computation during indexing optimises retrieval efficiency, ensuring that when a query arrives, the system can quickly leverage all relevant "views" of the document.

 

3. Dynamic Synonym Management: Keeping Our RAG System Smart

 

A key feature of our architecture is the dynamic management of synonym dictionaries. Instead of static files, synonyms are persistently stored within the synonym_collection in the vector DB. This allows for real-time updates without requiring application redeployments.  

 

When a new synonym mapping is added (e.g., "nlp" to ["natural language processing"]), our system updates the in-memory dictionary and persists the change to the vector DB. It can also optionally trigger a re-indexing of any documents containing the newly added or updated synonym terms. When enabled, this keeps the indexed data aligned with the active synonym dictionary and prevents retrieval failures caused by outdated synonym definitions.
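
A minimal sketch of this update flow is shown below. The class and method names (`SynonymStore`, `SynonymRegistry`, `add_synonym`) are hypothetical, the store is a plain in-memory dict standing in for the synonym_collection, and the method returns the IDs of documents that would need re-indexing rather than re-indexing them.

```python
import re


class SynonymStore:
    """Toy stand-in for the persistent synonym_collection (assumption)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, term, variants):
        self.rows[term] = list(variants)


class SynonymRegistry:
    """Keeps an in-memory synonym dictionary in sync with the store."""

    def __init__(self, store, documents):
        self.store = store
        self.documents = documents        # {doc_id: original text}
        self.synonyms = dict(store.rows)  # in-memory copy loaded at start-up

    def add_synonym(self, term, variants, reindex=True):
        """Merge a mapping, persist it, and report documents to re-index."""
        term = term.lower()
        new_variants = [v.lower() for v in variants]
        merged = list(dict.fromkeys(self.synonyms.get(term, []) + new_variants))
        self.synonyms[term] = merged      # update the in-memory dictionary
        self.store.upsert(term, merged)   # persist the change
        return self._affected(term, merged) if reindex else []

    def _affected(self, term, variants):
        """IDs of documents mentioning the term or any of its variants."""
        alternatives = "|".join(re.escape(t) for t in [term, *variants])
        pattern = re.compile(rf"\b(?:{alternatives})\b")
        return sorted(doc_id for doc_id, text in self.documents.items()
                      if pattern.search(text.lower()))


docs = {
    1: "NLP pipelines for log analysis",
    2: "Firewall sizing guide",
    3: "Natural language processing primer",
}
store = SynonymStore()
registry = SynonymRegistry(store, docs)
stale = registry.add_synonym("nlp", ["natural language processing"])
print(stale)  # prints: [1, 3]
```

Returning the affected IDs keeps the re-indexing decision with the caller, which fits the optional nature of the trigger described above.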

 

For example, our system manages mappings like:

 

Table: Example Palo Alto Networks Synonym Mappings

 

| Short Term/Acronym | Full Terms/Synonyms |
| --- | --- |
| PAA | Prisma Access Agent, PA-Agent, Jupiter |
| SCM | Strata Cloud Manager, SCM, SCM Actor |
| PAN-OS | PanOS Actor, PAN-OS Actor, PanOS |
| CIE | CIE Actor, Cloud Identity Engine Actor |
| NPN | NPN Actor |

 

4. Hybrid Retrieval: Fusing Semantic, Synonym, and Keyword Search

 

Our search method initiates the retrieval process by first expanding the incoming user query with all known synonym variations.  

 

The core of our hybrid retrieval lies in the VectorDB client's query_points method, whose prefetch parameter lets us run multiple concurrent searches across different vector fields:

 

  • Semantic Prefetch: The original query is used to search against the semantic vector field.  
  • Synonym-Expanded Prefetch: The expanded_query is used to search against the synonym_expanded vector field, vital for pulling in documents that might contain synonyms.  
  • Keyword Prefetch: A sparse vector from the expanded_query is used to search against the keyword sparse vector field, ensuring precise keyword matches.  

 

The results from these prefetch operations are then combined and re-ranked, primarily using the semantic embedding of the original query against the semantic vector field. This sophisticated fusion strategy ensures that while we cast a wide net to capture all potential matches (high recall), the final ranking prioritises documents most semantically relevant to the user's initial intent (high precision).  
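
The fusion logic can be illustrated with a small, database-free sketch. It simulates the three prefetch searches and the final semantic re-rank in plain Python and does not call query_points; the function names, the hand-written toy vectors, and the `k` and `limit` parameters are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sparse_dot(a, b):
    """Dot product between two sparse vectors stored as {index: weight}."""
    return sum(w * b[i] for i, w in a.items() if i in b)


def top_k(scores, k):
    """IDs of the k best-scoring documents, dropping zero scores."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, score in ranked[:k] if score > 0]


def hybrid_search(docs, query_vec, expanded_vec, expanded_sparse, k=2, limit=3):
    """Run three candidate searches, then re-rank by semantic similarity."""
    prefetch = [
        # Original query against the semantic field.
        top_k({d: cosine(query_vec, v["semantic"]) for d, v in docs.items()}, k),
        # Expanded query against the synonym_expanded field.
        top_k({d: cosine(expanded_vec, v["synonym_expanded"]) for d, v in docs.items()}, k),
        # Sparse expanded query against the keyword field.
        top_k({d: sparse_dot(expanded_sparse, v["keyword"]) for d, v in docs.items()}, k),
    ]
    candidates = {doc_id for hits in prefetch for doc_id in hits}
    # Final ranking: original query embedding against the semantic field.
    rerank = {d: cosine(query_vec, docs[d]["semantic"]) for d in candidates}
    ranked = sorted(rerank.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:limit]]


# Hand-written toy vectors; real ones would come from the indexing step.
docs = {
    "a": {"semantic": [1.0, 0.0], "synonym_expanded": [1.0, 0.2], "keyword": {7: 1.0}},
    "b": {"semantic": [0.8, 0.6], "synonym_expanded": [0.6, 0.8], "keyword": {3: 1.0}},
    "c": {"semantic": [0.0, 1.0], "synonym_expanded": [0.1, 1.0], "keyword": {7: 1.0, 9: 2.0}},
}
results = hybrid_search(docs, [1.0, 0.1], [0.9, 0.3], {9: 1.0}, k=1)
print(results)  # prints: ['a', 'c']
```

In this toy run, document "c" enters the candidate set only through the keyword search, yet the final semantic re-rank still places "a" first. That is the recall-then-precision behaviour described above.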

 

The Impact: Why This Matters for Palo Alto Networks (and You)

 

Implementing this synonym-aware RAG system has had a profound impact at Palo Alto Networks. It means:

 

  • Improved Accuracy: Our internal RAG systems now provide more precise and comprehensive answers, reducing the "blind spots" caused by linguistic variations.
  • Enhanced User Experience: Employees and customers can find the information they need faster, regardless of their specific phrasing, leading to greater satisfaction and efficiency.
  • Increased Efficiency: Support teams can resolve issues more quickly, and engineers can access critical documentation without linguistic hurdles.
  • Future-Proofing: Our dynamic synonym management ensures the system remains adaptable to evolving terminology, a crucial aspect in a rapidly innovating tech company.

 

This isn't just a technical achievement; it's a strategic imperative. For any enterprise, mastering synonyms is about bridging the inherent "language gap" between how users express themselves and how information is stored. It transforms search from a literal lookup into an intelligent conversation, directly impacting key business metrics related to user engagement, operational efficiency, and the overall quality and trustworthiness of AI-powered interactions.

 

Synonym vs. Non-Synonym Comparison (Allganize)

 

Conclusion: The Future of Intelligent Search

 

The challenge of synonyms in RAG systems, while seemingly minor, represents a significant hurdle to achieving truly comprehensive and accurate information retrieval. By embracing a multi-layered strategy, as demonstrated through our SynonymManager architecture, we've built a RAG system that moves beyond simple keyword matching or even basic semantic similarity.

 

The future of information retrieval lies in systems that understand not just the explicit words but also their myriad implicit connections and variations. By meticulously addressing the synonym challenge, RAG systems can be empowered to deliver richer, more reliable, and ultimately, more valuable insights to users. We encourage AI/ML engineers and data scientists in every enterprise to explore these techniques to elevate their RAG implementations, paving the way for a new era of intelligent knowledge access.

 

Team:

Puneet Gupta 

Nikhil Soni

Ramesh Nampelly 

Peter Kirubakaran N