Are you model-agnostic, or do you only test specific LLM providers?

Model-agnostic. Prompt injection, agentic security flaws, RAG isolation failures, and excessive-agency issues are architectural risks that apply regardless of whether the underlying model is Claude, GPT-4, Gemini, Llama, Mistral, or a fine-tuned in-house model. The underlying model matters less than how your application wraps it.

Do you need API access to our AI features?

Yes. We strongly prefer a staging or test environment with valid API access to your LLM features. Black-box prompt injection testing against a production chatbot has legitimate uses, but real tool-chain and RAG auditing requires access to system prompts, tool definitions, retrieval pipelines, and safe test data. If you cannot provide a test environment, we will scope around production with strict rate limits and pre-approved test accounts.

How long does an AI security audit take?

Standard AI audits take 2-4 weeks of active testing plus 1 week of reporting. Basic single-feature reviews can complete in 1-2 weeks. Agentic AI audits with complex tool-chains typically run 4-6 weeks because each tool integration is tested individually and then chained together in attack paths.

Do you red-team jailbreaks with reproducible test cases?

Yes. Every jailbreak, prompt injection, or guardrail bypass we find is delivered with a reproducible test case: the exact prompt, the model version, the temperature setting, the system prompt context, and the response observed. We do not report "the model sometimes says bad things" without a reproducible payload and reproduction rate. Findings include regression test snippets you can add to your eval suite.

AI Security Audit | LLM + Agentic AI Pentest

AI security is not traditional application security.

Every other category of software has determinism on its side. You send a request, you get a response, and the response is either right or wrong. LLMs do not work that way. They are stochastic. The same prompt returns different outputs at different temperatures, on different model versions, with different context windows. Security testing has to account for that.

Prompt injection is also, as of 2026, an unsolved problem. Every major model vendor and every published defense (spotlighting, structured prompting, constitutional classifiers, input sanitization) can be bypassed given enough creativity. That does not mean you give up. It means you design systems that limit blast radius when the LLM is compromised, not systems that assume the LLM cannot be compromised.

Agentic AI amplifies everything. The moment your LLM can call tools, read files, send emails, or touch a database, prompt injection stops being a content-moderation issue and becomes remote code execution by proxy. The attack surface of an agent is the union of every tool it can reach multiplied by every data source it can read.

If your product includes an LLM feature, a RAG chatbot, or any tool-calling agent, and you have not had it tested by someone whose job is breaking AI features, you are shipping an attack surface you cannot describe to your customers or your auditor.

What an AI security audit covers

Prompt injection testing (OWASP LLM01)

The core of any AI audit. We test direct, indirect, multimodal, stored, and tool-chain prompt injection. Direct injection attacks the user input; indirect hides payloads in retrieved documents, web pages, PDFs, or third-party data the model ingests. Stored injection persists a payload into your RAG index, vector store, or conversation memory so it fires against future users. Tool-chain injection abuses the model's own tool outputs to inject prompts into itself.

Direct injection across every user-facing input surface
Indirect injection via documents, URLs, emails, and retrieved context
Multimodal injection (images, audio, embedded text in screenshots and PDFs)
Stored injection in RAG indexes, vector stores, and conversation history
Tool-chain injection through tool responses the model reads back
Guardrail and classifier bypass testing with reproducible payloads

System prompt leakage testing (LLM07)

Your system prompt is an intellectual property asset and an attack roadmap. We test extraction via common patterns, adversarial suffixes, role-reversal attacks, and context-window manipulation. Leaked system prompts often reveal tool definitions, internal API endpoints, and customer data schemas that accelerate further attacks.

Output handling review (LLM05)

The LLM returns a string. Your application trusts it. We trace every downstream sink: HTML rendering (XSS via model output), SQL queries (SQLi via generated queries), shell commands (RCE via generated code), markdown links (phishing injection), and file paths (traversal via filename generation). If the model output reaches a sink without sanitization appropriate to that sink, it is a finding.

Excessive agency analysis (LLM06)

Tool permission enumeration and blast radius analysis. For every tool your agent can call, we map: who authorized this tool, what credentials does it carry, what can a malicious prompt cause it to do, and what is the worst-case outcome if the LLM is compromised. Most agentic products give the agent more power than it needs. We identify the over-privileged tools and the confused-deputy paths.

Training data and model provenance review (LLM03, LLM04)

If you fine-tune models or ship custom adapters, we review training data hygiene, data poisoning exposure, model supply chain, and artifact integrity. This is code review plus MLOps review, scoped to the weights and pipelines that matter.

Vector store and RAG pipeline security (LLM08)

RAG adds an entire database and embedding pipeline to your attack surface. We test:

Tenant isolation in multi-tenant vector stores (Pinecone, Weaviate, Qdrant, Chroma, pgvector)
Embedding inversion attacks to recover source text from vectors
Metadata filter bypass (injecting context outside tenant namespace)
Cross-tenant retrieval via similarity-ranked leakage
Authorization checks on retrieved chunks (does the user who asked actually have rights to that chunk)
Poisoned document ingestion into the RAG corpus

Agentic tool-call chain threat modeling

This is where Valtik extends beyond the OWASP LLM Top 10. For agentic products, we build a tool-call graph: which tools can invoke which other tools, what data flows between them, and which attack paths chain from a user-controlled input to a high-impact outcome. The finding is not "this tool has a bug"; it is "this sequence of five tool-calls, starting from a crafted email the agent reads, exfiltrates your database and your customer credentials."

Rate limiting, cost exhaustion, denial-of-wallet (LLM10)

LLM API calls cost real money. An attacker who can trigger unlimited inference at your expense is running a new category of DDoS. We test request amplification, recursive tool-calls, context-window bloat attacks, token-budget bypass, and billing isolation across tenants.

Sensitive information disclosure testing (LLM02)

We probe for training data leakage, customer data cross-contamination in multi-tenant deployments, credential exposure in model outputs, and PII regurgitation from conversation memory or RAG indexes.

Misinformation and hallucination risk review (LLM09)

Not a traditional security issue, but a product risk that auditors, regulators, and enterprise buyers now ask about. We identify the high-stakes decision paths in your product where a confidently wrong model output causes real damage (legal, medical, financial, compliance-gated actions) and recommend guardrails, confidence thresholds, and human-in-the-loop checkpoints.

Who this is for

SaaS products with LLM features. A copilot, a summarizer, a writing assistant, a support bot baked into an existing product.
Agentic AI products. Autonomous agents, research assistants, workflow automators that read data and take actions.
RAG-based knowledge tools. Internal knowledge bases, customer-facing docs chat, legal or medical research assistants.
AI coding assistants. Products that generate, review, or execute code on behalf of users.
Customer-support AI. Chatbots with access to order data, account actions, refunds, or billing systems.

Methodology

We use the OWASP LLM Top 10 (2025 revision) as the baseline taxonomy. Every finding maps to an LLM category so your developers, auditors, and executives can triangulate it against the industry standard. On top of that baseline we run Valtik's own agentic tool-chain extensions, which cover multi-step attack paths the OWASP list does not yet formalize.

Each engagement starts with a threat model: your architecture, your data flows, your tool definitions, and your trust boundaries. We then move through active testing in an agreed test environment, develop reproducible exploits for every issue found, and verify impact before writing it up.

Pricing tiers

Pricing scales with AI surface area, not with headcount. A solo founder shipping an agent can still hold a surface larger than a 200-person company with a single read-only RAG bot. We scope after reviewing the architecture.

Tier	Scope	Price range
Basic AI review	Chatbot or single-feature LLM product. One primary integration, limited tool access, no RAG or minimal RAG.	$8,500-$15,000
Standard AI audit	RAG-based product, or multi-feature LLM application. Vector store, retrieval pipeline, multiple user-facing surfaces.	$15,000-$35,000
Agentic AI audit	LLM with tool-calling, persistent memory, or autonomous behavior. Tool-chain threat modeling and multi-step exploit development.	$25,000-$60,000

Deliverables

Technical findings report. Every issue with OWASP LLM mapping, reproducible payload, model and version tested, reproduction rate, impact analysis, and remediation guidance.
Executive summary. Board-ready language translating the technical findings into business risk, without the usual AI-hype fluff.
Remediation guidance. Concrete fixes: prompt hardening patterns, output sanitization code, tool-permission reductions, rate-limit configurations, vector-store isolation patterns.
Reproducible test cases. Regression tests you can drop into your eval suite so fixed issues stay fixed.
90-minute debrief. Live walkthrough with your engineering and product team, plus Q&A.
Retest included. One round of retest within 90 days, at no additional cost, to verify remediation of any findings your team ships fixes for.

Honest scope

Valtik is a solo-led firm. Tre runs engagements directly. For work that exceeds solo scope (continuous red-teaming, long-running multi-region agent audits, or engagements that need multiple operators on the clock simultaneously), we coordinate with partner firms rather than overextending. You will always know which work is ours, which is partnered, and who you are paying. We do not hold government clearances, we are not a C3PAO, and we do not pretend to be an enterprise red-team bench.

What we do hold is a working pentester's perspective on AI systems. Every finding in a Valtik report is exploited in a controlled environment before it is written up. No automated-scanner output, no paraphrased OWASP text, no "AI could hypothetically do bad things" theorycrafting.

Common questions

Are you model-agnostic?

Yes. Prompt injection, excessive agency, and RAG weaknesses are architectural problems, not tied to a single vendor. The audit covers products built on Claude, GPT-4, Gemini, Llama, Mistral, and in-house fine-tuned models equally. The underlying model matters less than how your application wraps it.

Do you need API access?

Yes, ideally in a staging or test environment. Real tool-chain and RAG audits need access to system prompts, tool definitions, and retrieval pipelines. We can scope around production when a test environment does not exist, with strict rate limits and pre-approved test accounts.

How long does an AI audit take?

Basic reviews run 1-2 weeks. Standard audits run 2-4 weeks. Agentic audits typically run 4-6 weeks because each tool integration is tested individually and then chained together into realistic attack paths.

Do you red-team jailbreaks?

Yes, with reproducible test cases. Every jailbreak is delivered with the exact prompt, model version, temperature, system prompt context, and response, plus a reproduction rate. We include regression test snippets so you can add the test case to your eval suite.

Do we need an AI audit for SOC 2?

Not explicitly required today. The AICPA Trust Services Criteria do not yet carve out AI-specific controls. However, auditors increasingly treat LLM features as systems within scope of Security (CC6, CC7) and Privacy criteria, especially when the model has access to customer data or tool-calling. If you ship AI features and pursue SOC 2 Type II, expect questions about how you test prompt injection, data leakage, and output handling. An AI audit produces that evidence. See our SOC 2 readiness service for the broader compliance picture.

Ready to stress-test your AI features? Start with a free website security check or scope an AI audit directly. We reply within one business day with a fixed-price proposal based on your architecture, not on guesswork.

Request a free security check

AI Security Audit

AI security is not traditional application security.

What an AI security audit covers

Prompt injection testing (OWASP LLM01)

System prompt leakage testing (LLM07)

Output handling review (LLM05)

Excessive agency analysis (LLM06)

Training data and model provenance review (LLM03, LLM04)

Vector store and RAG pipeline security (LLM08)

Agentic tool-call chain threat modeling

Rate limiting, cost exhaustion, denial-of-wallet (LLM10)

Sensitive information disclosure testing (LLM02)

Misinformation and hallucination risk review (LLM09)

Who this is for

Methodology

Pricing tiers

Deliverables

Honest scope

Common questions

Are you model-agnostic?

Do you need API access?

How long does an AI audit take?

Do you red-team jailbreaks?

Do we need an AI audit for SOC 2?

Related reading

Ready to start?