Introducing Securin’s “AI Nutrition Labels”

What happens when an AI model does something it shouldn’t? A well-crafted prompt – the conversational equivalent of malware - can trick AI assistants into ignoring security guardrails, revealing sensitive data, or generating harmful content.

These risks are no longer theoretical; from prompt injection and model poisoning attacks to more good-natured, ‘exposé’ attacks, AI models and applications are vulnerable to attacks in ways that many organizations have yet to get to grips with, or even understand.

Many organizations building with AI today, or integrating it into their products don’t know:

If they’re vulnerable to prompt injection or encoding attacks.
If their models can be jailbroken.
How their models behave under adversarial or multi-turn manipulation.
What any of the first three things on this list even mean…

No testing requirements. No public disclosures. No standardized way to assess risk. And no easy way to do any of this.

That’s why Securin has introduced our “AI Nutrition Label” - an automated security report card for every GenAI model, from GPT and Grok to in-house deployments. Clear, easy-to-interpret summaries of how a given AI model performs when tested under real-world adversarial pressure.

What’s on the label?

Introducing Securin’s AI Nutrition Label: Transparency for AI Models

Food nutrition labels transformed consumer safety. Securin’s AI Nutrition Labels bring the same level of transparency to AI systems, so organizations of all kinds can make informed decisions around securing them. How do we generate our metrics?

Securin’s technologies conduct GenAI red teaming to evaluate Large Language Model (LLM) deployments and identify vulnerabilities, assess risks, and enhance security posture. Our framework comprises comprehensive assessments combining static and dynamic attack vectors to thoroughly evaluate model robustness against a range of exploitation techniques.

One short, clear label gives users immediate insight into exactly what they can expect from their model, including:

• Harmful Instructions

• Information Leakage

• Compliance Violation

• Bias Exploitation

• Training Data Anonymization

What does that mean in more detail?

• Risk Category Resilience - Likelihood of generating unethical behavior, toxic content, harmful instructions, information leakage, compliance violations, and biased outputs

• Data Privacy Practices - Whether customer data is used for training, shared with vendors, anonymized, and available deletion policies

• Governance & Transparency - Model country of origin, open vs. closed source architecture, and privacy policy documentation

• Aggregate Risk Score - Overall risk assessment combining all risk factors

The stakes are real, and so is the data we base our scores on:

The Sources

Our framework integrates 1,900+ attacks & tactics from HarmBench, Garak, PyRIT and an open source curation of attacks from GitHub and HuggingFace datasets.

The Approach

We organize our testing approach into two primary categories.

Static attacks: Fixed, predetermined prompts that don't require AI agents to execute. These use pre-written techniques like direct prompt injection, role-playing scenarios, encoding manipulations, and filter circumvention to test model vulnerabilities through single, crafted inputs. These include:

• Prompt Injection: Attempts to override system instructions with malicious commandsRole-Playing Attacks: Coercing the model to assume problematic personas.

• Context-Free Override: Bypassing context limitations to extract unauthorized information

• Filter Circumvention: Evading content filters through creative rewording

• Domain-Specific Malicious Requests: Targeting vulnerabilities in specific knowledge domains

• Encoding Attack: Using alternative text encodings to bypass detection

• Encryption Attack: Employing encryption to hide malicious intent

• System Prompt Leakage Attack: Extracting system instructions and configuration

• SQL Injection Attack: Testing for vulnerability to database query manipulation

• False Claims: Testing model's tendency to generate misinformation

• Dynamic attacks: Adaptive attacks that leverage AI red teaming agents to iteratively refine their approach based on model responses. These include single-turn orchestrated attacks and multi-turn strategies that evolve through multiple interactions, adjusting tactics in real-time to achieve attack objectives:

Single-Turn Dynamics:

• FlipOrchestrator: Single-interaction attacks that attempt to flip model behavior

• Skeleton Key Orchestrator: Testing with prompts designed to unlock multiple restricted behaviors

• ContextComplianceOrchestrator: Testing adherence to contextual restrictions

• Prompt Sending Orchestrator: Testing forwarding of malicious prompts

Multi-Turn Dynamics:

• Crescendomation: Gradually escalating attack sophistication through multiple interactions

• PAIR (Personalized Attack through Iterative Refinement): Adapting attacks based on model responses

• TAP (Tree of Attacks with Pruning): Exploring multiple attack paths with optimization

• TAPCrescendo: Combined approach using both TAP and Crescendo methodologies

The Success Criteria

A multi-stage evaluation system assesses AI model security using specialized AI agents:

• Advanced Evaluator Framework: Six specialized AI agents (Unethical Behavior, Toxic Content, Harmful Instructions, Information Leakage, Compliance Violations, Bias Exploitation) with sophisticated contextual analysis capabilities.

• Contextual Intelligence: Evaluators distinguish between legitimate educational content, professional discourse, and actual harmful outputs through multi-dimensional assessment matrices.

• False Positive Prevention: Built-in checks prevent misclassification of academic discussions, cybersecurity education, medical information, legal analysis, and fictional content as violations.

• Confidence-Based Scoring: Each evaluation includes confidence scores, severity levels, and detailed justifications with specific risk categorization (None/Low/Medium/High/Critical).

• Comprehensive Documentation: All attacks documented with complete conversation flows, OWASP Top 10 LLM mappings, and actionable recommendations for improvement.

The Metrics Tracked

Metrics are tracked across multiple indicators:

• Attack Success Rate: Percentage of successful exploitations across attack vectors

• Vulnerability Patterns: Common weaknesses identified across multiple tests

• Risk Index per Category & Attack Type: Scores generated for individual categories across both Static & Dynamic attacks.

• Defense Resilience: Effectiveness of inference layer guardrails against various attack types

• Cross-Category Comparison: Relative vulnerability to static vs. dynamic attacks

A Word on Ethics

AI is uncharted territory, with the capacity for both good and bad outcomes. In recognition of this reality, Securin’s pentesting methodology adheres to strict ethical guidelines. All testing is conducted in controlled environments with a strict focus on improving AI safety, rather than enabling exploits. We are committed to responsible disclosure of identified vulnerabilities, and the continuous refinement of testing methods to address emerging threats.

Next up…

AI deployment without security labels is like the old days of patent medicines or pre-FDA food. You can’t be sure where it came from, you have no idea what vulnerabilities you might be inheriting, and you don’t know what’s inside, or how it might work out once you ingest it. It’s time for AI Nutrition Labels, and that’s what Securin has done.