This post is the first in a series co-authored by Abhishek Das, VP of Engineering at Acante, and Raja Perumal, Senior Alliances Manager at Databricks.
It's probably an understatement to say we have entered the age of Artificial Intelligence (AI). Over the next few years, almost everything around us will be infused with AI – enriching our personal lives and empowering enterprise applications to deliver dramatically better experiences to their customers. AI has become a boardroom conversation, with every enterprise aggressively investing in an “AI strategy” that will allow them to gain a competitive edge in the market, optimize complex business processes and increase employee productivity.
The path to AI success requires organizations to unlock the value of their proprietary data, allowing them to discover insights unique to their business, their market and their customers. But in order to do that, they need to ensure that the data they feed into these AI systems, including large language models (LLMs), is secure and adheres to business confidentiality and consumer privacy requirements.
Governance at the Data Layer Is Largely Unaddressed
The first wave of AI safety solutions has primarily focused on securing the inference and model layers of the AI technology stack. Inference-layer security solutions – such as LLM firewalls – analyze queries and responses for prompt injections and provide guardrails that detect harmful responses. Model-layer solutions focus on risks such as compromised model asset supply chains, model poisoning or model jailbreaking, and address moral risks such as violent crimes, child exploitation, hate, defamation and self-harm. Securing the inference and model layers is important, of course, but these solutions largely ignore the third layer of the AI technology stack – the data layer.
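To make the inference-layer approach concrete, here is a minimal sketch of the kind of screening an LLM firewall performs on incoming queries. The pattern list and the `screen_prompt` helper are hypothetical simplifications for illustration, not any vendor's actual API; real products use far richer detection than keyword patterns.

```python
import re

# Hypothetical, simplified patterns a firewall might flag as
# prompt-injection attempts (illustrative only).
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"reveal your system prompt",
    r"disregard your guidelines",
]

def screen_prompt(user_query: str) -> bool:
    """Return True if the query looks safe to pass to the model."""
    lowered = user_query.lower()
    return not any(re.search(p, lowered) for p in INJECTION_PATTERNS)

if __name__ == "__main__":
    print(screen_prompt("Summarize our Q3 sales figures"))        # True
    print(screen_prompt("Ignore previous instructions and ..."))  # False
```

Note that every check here operates on the query and response text alone – nothing in this layer knows what data the model was built on or retrieves from, which is exactly the gap the rest of this post addresses.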
Enterprise AI Architectures 2.0
Early enterprise AI systems were built using prompt engineering approaches. These systems are driven by LLMs trained on public or undisclosed data (closed-source models) and include no enterprise proprietary data to provide business-specific context. This has proven to be a great low-risk way for organizations to experiment, test the market and get comfortable with LLM technology.
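In code, such a system reduces to crafting instructions around the user's question and calling a hosted model. A minimal sketch follows; the `call_llm` function is a hypothetical stand-in for whichever closed-source model API an organization uses.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a hosted, closed-source model API."""
    raise NotImplementedError("wire up your model provider here")

def answer(question: str) -> str:
    # All "knowledge" comes from the model's public training data;
    # no proprietary enterprise context is supplied.
    prompt = (
        "You are a helpful assistant for a retail company.\n"
        "Answer concisely and say 'I don't know' when unsure.\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```

Because no internal data ever leaves the organization, the governance exposure of this pattern is low – but so is its ability to answer questions about the business itself.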
The next wave of enterprise AI systems is evolving to better capture the value of proprietary data and deliver the ROI every business is demanding. These systems are largely built using Retrieval Augmented Generation (RAG) or, in some cases, Fine-tuning architectures. The RAG architecture is particularly suited to enterprise use because it balances cost (no expensive model training effort) with the ability to continuously update the domain-specific knowledge base. In the RAG architecture, the user query is augmented with relevant internal proprietary data, enabling LLMs to generate more accurate outputs unique to the organization. This ensures data freshness and relevance while minimizing hallucinations in LLM outputs.
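Here is a minimal sketch of that retrieve-then-generate flow. The `vector_store` and `call_llm` arguments are hypothetical placeholders for whatever vector database and model endpoint an organization has chosen; the method names are illustrative, not a specific library's API.

```python
def rag_answer(question: str, vector_store, call_llm, k: int = 4) -> str:
    """Retrieval Augmented Generation: augment the user query with
    proprietary context before the model generates an answer."""
    # 1. Retrieve the k most relevant proprietary document chunks.
    chunks = vector_store.similarity_search(question, k=k)

    # 2. Augment the user query with that retrieved context.
    context = "\n\n".join(c.text for c in chunks)
    prompt = (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate a grounded, organization-specific answer.
    return call_llm(prompt)
```

The security implication is visible right in the sketch: whatever the retrieval step returns is handed to the model verbatim, so controls on that data must be applied before this point.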
Addressing the Data Governance Risks for these 2.0 AI Architectures
The introduction of proprietary data into these new AI/LLM architectures brings data governance risks – data privacy, compliance and security – to the forefront. Concerns such as data poisoning, anonymization, access authorization and leakage need to be addressed at the data layer BEFORE the data is fed into upstream LLM models or generated responses. Applying governance controls to model outputs AFTER the fact is ineffective and is often limited by performance constraints.
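As a concrete illustration of applying controls BEFORE data reaches the model, here is a hedged sketch of a pre-ingestion gate that checks access authorization and redacts obvious PII. The email regex and the `user_may_read` check are illustrative stand-ins; a production system would consult a real policy engine and dedicated anonymization tooling.

```python
import re

# Illustrative PII pattern (emails only); real systems use dedicated
# classification and anonymization tooling, not one regex.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def user_may_read(user: str, source: str) -> bool:
    """Hypothetical policy check against the source system's ACLs."""
    return True  # stand-in: consult your access-governance layer here

def govern_before_ingest(user: str, source: str, text: str) -> str | None:
    # Enforce authorization at the data layer, not on model output.
    if not user_may_read(user, source):
        return None
    # Anonymize before the text is chunked, embedded or trained on.
    return EMAIL.sub("[REDACTED EMAIL]", text)
```

Running this kind of gate once, at ingestion time, is also far cheaper than scanning every generated response – which is the performance constraint the paragraph above alludes to.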
Unfortunately, there aren’t robust AI-specific data governance solutions today that address these risks in a systematic manner while fitting into existing AI technology stacks. Overcoming this barrier is critical for organizations to confidently adopt and optimize the value from RAG and Fine-tuning architectures.
Acante’s mission is to redefine data security and access governance for the AI-driven enterprise. We are working with customers across multiple verticals to systematically address these AI risks. There are three primary datasets at the data layer that need to be secured in AI systems:
- Document chunks are unstructured enterprise data broken into smaller pieces – “chunks” – due to LLM context window limitations, and then vectorized. These vector embeddings are hosted in vector databases (such as Databricks’ Mosaic AI Vector Search, Pinecone, Weaviate, etc.), which upon retrieval provide the relevant proprietary context in RAG architectures. A minimal sketch of this pipeline follows the list.
- High-quality training datasets are internal proprietary datasets that are carefully prepared and curated to train models during fine-tuning.
- Existing feature datasets are features from an organization’s existing analytical and ML models that are used to augment the context.
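As referenced in the first bullet, here is a sketch of the document-chunk pipeline: split, embed, store. The fixed-size splitter is a deliberate simplification of production chunkers, and `embed` and `vector_store.add` are hypothetical stand-ins for an embedding model and a vector database client.

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking to fit LLM context windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def index_document(doc: str, embed, vector_store) -> None:
    # Each chunk is vectorized and stored; at query time the nearest
    # chunks are retrieved to supply proprietary context to the LLM.
    for piece in chunk(doc):
        vector_store.add(vector=embed(piece), payload=piece)
```

Note that once a document is chunked and embedded, the original file's access controls no longer travel with it – which is precisely why this dataset needs its own governance.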
Each of these datasets carries new and unaddressed risks that can compromise the overall safety of the end AI application. Acante has identified the security and governance risks at the data layer of the AI technology stack and is developing approaches to mitigate each one. Databricks, one of the leading platforms for data and AI, recognizes the pivotal role security plays in the adoption of AI and has published a comprehensive AI security framework that captures the breadth of security concerns across the inference, model and data layers. The Databricks AI Security Framework (DASF) identifies 55 AI security risks across 12 foundational architecture components and prescribes controls for mitigating each.
Expanding on our partnership with Databricks, Acante has mapped these risks to the DASF, and has done the same for the other popular industry framework, the OWASP Top 10 for LLM.
Addressing AI Safety Needs to Start with the Data Layer
New AI architectures are helping organizations unlock the full power of their proprietary data. But as enterprises embark on this AI journey, every bit of organizational data will be at risk of exposure through these AI applications – in ways we never fathomed before, and in ways that aren’t easily comprehensible to humans.
Acante is focused on empowering enterprises with the most seamless yet secure ways to confidently unlock the value of their data by providing data access governance and security models tailored to these RAG and Fine-tuning architectures. Among the variety of industry frameworks, we have found the OWASP Top 10 for LLM and the DASF to be particularly well structured and have mapped our view of data risks to these frameworks.
In the next blog post, we’ll go much deeper into these risks and necessary mitigations, as well as the capabilities of the Acante data & AI governance and security solution. It will cover how we are working with Databricks as a close technology partner to systematically address these risks. If you are adopting RAG or Fine-tuning architectures for your AI applications, we’d love to hear from you and explore how we can leverage the experience we’ve gained from working with dozens of customers to be of help to you.