Open Framework for LLM Behavioral Auditing

Know what your
AI believes.

LLMs don't just answer questions — they carry opinions, biases, and cultural assumptions baked into their weights. Orwell is the open framework for systematically surfacing them — no code required.

View on GitHub · Book a Demo
Orwell Audit Report
9+
Built-in audit dimensions
5
Judge models per bench
500
Prompts per custom dimension
100%
Local — no data ever leaves your machine

AI doesn't just make mistakes.
It has opinions.

Every LLM is trained on data that reflects the world's existing inequalities, cultural assumptions, and commercial interests. That training shapes how the model responds — not randomly, but consistently. The same model will subtly favour certain worldviews, demographics, or brands across millions of conversations.

Most teams test their models for accuracy and performance. Latency dashboards, error rates, output quality scores. These tools matter. But they answer the wrong question.

A model that hallucinates is obviously broken. A model that consistently steers users toward a particular political view, cultural framework, or commercial preference doesn't look broken at all. It just looks helpful — until it isn't.

When AI models power hiring platforms, financial advisors, healthcare assistants, or customer-facing products, these behavioural tendencies don't disappear. They scale — quietly shaping decisions at a volume no human reviewer could catch.

This is what Orwell was built to surface.

Platform type                                    | Core question                        | Catches behavioral bias?
Observability tools (Langfuse, Arize, Datadog)   | Is the model performing well?        | No
Safety filters (Guardrails, content moderation)  | Did the model say something harmful? | No
Orwell (behavioral audit framework)              | What does the model believe?         | Yes
"A speedometer tells you how fast the car is going. Orwell tells you where the driver is trying to go."

Orwell operates at a different layer than observability. Telemetry measures behaviour against expected outputs. Orwell measures the model's underlying tendencies — the bias baked into its weights — whether or not those tendencies cause an obvious error.

Who needs Orwell
🏢
AI product teams
Validate any LLM before it reaches your users
⚖️
Compliance & legal teams
Evidence-backed audit trails for the EU AI Act and internal policy
🔬
AI researchers
Systematic, reproducible LLM behavioral studies
🛡️
Risk & trust teams
Ongoing audit cadence for model updates and fine-tunes

Start auditing in minutes

Run Orwell locally on your machine. Everything is open source, and your data never leaves your system.

Terminal (macOS / Linux)
# Clone the repository
git clone https://github.com/whereAGI/orwell.git

# Navigate into the directory
cd orwell

# Run the setup script
./install.sh
Windows (PowerShell)
# Clone the repository
git clone https://github.com/whereAGI/orwell.git

# Navigate into the directory
cd orwell

# Run the setup script
.\install.bat

Need help setting up, or want to customize Orwell for your business?

Schedule a Demo

Everything you need to audit
any model, any bias

🚫

No Code Required

The entire audit workflow — model setup, dimension selection, prompt generation, report reading — runs in the browser. No Python, no terminal, no config files needed.

⚖️

Judge Bench System

Single judge, multi-judge panel, or Jury mode with a Foreman that adjudicates disagreements. Up to 5 independent judge models score every response for maximum confidence.

🗂️

Custom Schemas & Dimensions

Define industry-specific audit schemas for your exact use case. The AI Dimension Builder generates a full prompt set — up to 500 prompts — tailored to your rubric.

📊

Structured Audit Reports

Per-dimension risk scores, radar charts, score distributions, flagged responses, and AI-generated executive summaries — designed for engineers and decision-makers alike.

🔒

Local-First & Private

Everything runs on your machine. All prompts, responses, and reports are stored in a local SQLite database. Nothing is sent to external servers — ever.

🦙

Ollama — Fully Offline

Run Orwell with zero internet dependency using local models via Ollama. Use Qwen, Mistral, DeepSeek, MiniMax, or Gemma as both target and judge. No API key required.

📝

System Prompt Analysis

Orwell captures and evaluates whether your system prompt mitigates or amplifies the biases found during the audit — a critical layer for chatbot and agent deployments.

📡

Live Audit Streaming

Watch every prompt, response, and score appear in real time via server-sent event streaming. Full visibility into the audit as it runs — no black boxes.
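
You can also tail the same stream outside the browser. A minimal sketch — the port, endpoint path, and audit ID below are assumptions for illustration, not Orwell's documented routes; check the repository for the actual API.

Terminal
# Follow a running audit's live event stream from the command line.
# Endpoint path and audit ID are illustrative — see the repo for real routes.
curl -N -H "Accept: text/event-stream" http://localhost:8000/api/audits/42/stream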

One judge isn't enough.
Orwell knows that.

A single AI judge introduces its own biases into the evaluation. Orwell's Judge Bench lets you configure a panel of up to 5 independent judge models that evaluate responses together — giving you a more defensible, calibrated result.

random

Random Mode

One judge is selected at random per response. If the score falls below the failure threshold, a second judge automatically rescores for a cross-check. Fast, cost-efficient, and great for exploratory audits.

all

All Mode

Every judge on the bench scores every response concurrently. Scores are averaged per dimension. Surfaces disagreements across models. Best for thorough audits where full coverage matters.

jury

Jury Mode

All judges score every response, then a designated Foreman model reviews all scores and reasons — delivering a final synthesised verdict. When judges disagree (std dev > 1.5), the Foreman adjudicates. The highest-confidence configuration. Recommended for compliance-grade audits.

Jury Mode — Response Evaluation Flow
Target Model Response
  ↓ sent to all judges simultaneously
Judge 1 → score + reason
Judge 2 → score + reason
Judge 3 → score + reason
  ↓ if std dev > 1.5, flagged for Foreman
Foreman Model → final verdict
  ↓
Final Score → Report
Mode comparison
Mode   | Final score        | Best for
random | Primary judge      | Fast audits
all    | Mean across judges | Full coverage
jury   | Foreman verdict    | Compliance-grade
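
As a sketch of what starting a jury-mode audit over the API might look like — the endpoint, field names, and model names here are assumptions for illustration (only the judge_mode value mirrors the schema example further down the page):

Terminal
# Start a jury-mode audit with a three-judge bench and a Foreman.
# Endpoint and field names are assumptions, not Orwell's documented API.
curl -X POST http://localhost:8000/api/audits \
  -H "Content-Type: application/json" \
  -d '{
        "target_model": "my-fine-tune",
        "judges": ["qwen3", "mistral", "gemma3"],
        "foreman": "deepseek-r1",
        "judge_mode": "jury"
      }'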

Audit anything you can define

Orwell ships with cultural dimensions out of the box — but the framework is built for any behavioral axis your product needs.

Cultural Bias Auditing

Does your model subtly favour Western individualism over collectivist cultures? Does it give different advice based on cultural context? Orwell's built-in GLOBE dimension library tests 9 cross-cultural axes backed by peer-reviewed research.

Power Distance — hierarchy acceptance
Gender Egalitarianism — role differentiation
Institutional Collectivism
Humane Orientation
Future Orientation
Example risk output
Power Distance — High Risk (mean 2.4/7)
Gender Egalitarianism — Medium Risk (mean 3.8/7)
Humane Orientation — Low Risk (mean 5.9/7)
Scores are illustrative. Actual results vary by model.

Hiring Equity

AI-assisted hiring tools are under intense regulatory scrutiny. Orwell lets you systematically test whether your model evaluates candidates differently based on name, gender, ethnicity, or educational background.

Name-based demographic inference
Gender-coded language in feedback
Educational pedigree bias
Geographically coded credential bias
Why this matters

The EU AI Act classifies AI hiring tools as high-risk systems with strict documentation requirements. Orwell provides the audit trail you need to demonstrate compliance and due diligence.

Political Neutrality

Models embedded in news platforms, research tools, or general-purpose assistants can subtly favour political positions, parties, or ideological frameworks — without triggering any safety filter.

Policy framing and language bias
Party and candidate sentiment
Source credibility attribution
Ideological framing in historical context
Define your own standard

Political neutrality means different things in different contexts. Orwell lets you define exactly what balanced looks like for your platform — and test against that definition.

Brand & Product Fairness

Does your AI assistant consistently recommend the same brand, disparage competitors, or use promotional language for specific products? These tendencies can create real commercial and reputational liability.

Unprompted brand recommendations
Competitor sentiment asymmetry
Promotional vs. neutral language
Pricing and value framing
Real-world scenario

A retail AI assistant consistently recommends a specific brand of headphones across unrelated queries. Orwell surfaces this pattern before your users do.

Clinical Safety

Healthcare AI must give consistent guidance regardless of patient demographics. Orwell tests whether your clinical model applies different standards based on age, gender, ethnicity, or socioeconomic indicators.

Demographic-based dosage variation
Differential urgency recommendations
Socioeconomic assumptions in advice
Symptom credibility by demographic
Regulatory context

Clinical AI tools are high-risk under the EU AI Act and subject to FDA oversight in the US. Behavioral audit documentation is becoming a baseline requirement for deployment.

Build Any Dimension

If you can describe what balanced, fair, or neutral looks like in your domain — Orwell can test for it. The Dimension Builder generates a full prompt set from your rubric using AI.

Name your dimension
Describe high and low-scoring behaviour
Generate up to 500 scenario prompts via AI
Review, approve, and add to your library
Run immediately in Audit Studio
The Orwell standard

Community-contributed dimension packs are coming. Any dimension you build can be shared back to the Orwell library for others to use, benchmark against, and improve.

Your audit data never
leaves your machine.

Orwell is built on a local-first architecture. No cloud storage. No telemetry. No third-party data sharing.

🗄️

SQLite — File-Based Storage

All prompts, model responses, scores, and reports are stored in a single local SQLite database file. You own your data completely. Export it, back it up, or delete it at any time.
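
Because it's plain SQLite, standard tooling works. A sketch assuming a database file named orwell.db and a scores table — the real file name and schema may differ, so open the database to check:

Terminal
# Inspect per-dimension averages with the sqlite3 CLI.
# File name and table/column names are assumptions.
sqlite3 orwell.db "SELECT dimension, AVG(score) FROM scores GROUP BY dimension;"

# Back up the entire audit history to a single file.
sqlite3 orwell.db ".backup orwell-backup.db"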

📡

No Telemetry

Orwell makes no outbound calls except to the model endpoints you explicitly configure. There are no analytics, no crash reporters, and no hidden callbacks to any external server.

🏢

Air-Gapped Deployments

Running on sensitive infrastructure? Combine Orwell with Ollama to audit models with zero internet dependency. The entire audit pipeline — target model, judge model, and Orwell itself — runs locally.

🦙

Running fully offline with Ollama

Install Ollama, pull any model (e.g. ollama pull qwen3), and register it in Orwell's Model Hub. No API key needed. Run audits with zero internet access — ideal for regulated industries and sensitive research environments.
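
The offline loop is just standard Ollama commands — the model name below is an example; any model from the Ollama library works:

Terminal
# Download a model once, while network access is available
ollama pull qwen3

# Confirm it runs locally — after this, no internet is needed
ollama list
ollama run qwen3 "Reply with OK"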

A standard, not just
a tool.

The software is open source. But the goal is bigger than a repo — it's to establish a rigorous, community-maintained standard for how LLM behavioral auditing should work.

📚

Open Dimension Library

A versioned, community-maintained library of audit dimensions. Grounded in the GLOBE research framework today — expandable to any domain by the community.

🔬

Reproducible Methodology

Every audit captures its exact configuration — judge model, system prompt, temperature, sample size. Reports are reproducible and comparable across time and teams.

🤝

Community-Driven

Contribute dimension packs, provider adapters, scoring methodologies, or frontend improvements. The framework grows as the community grows.

🏗️

Built to Be Extended

REST API at its core. Integrate Orwell into your CI/CD pipeline, trigger audits on model updates, or build your own reporting layer on top of it.
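
A hedged sketch of what a CI gate might look like — the endpoints, JSON fields, and report shape below are assumptions for illustration, not Orwell's documented API:

Terminal
# Trigger an audit when a model updates, then gate the deploy on the report.
# Endpoints and JSON fields are illustrative — check the repo for the real API.
AUDIT_ID=$(curl -s -X POST http://localhost:8000/api/audits \
  -H "Content-Type: application/json" \
  -d '{"schema": "brand_fairness", "judge_mode": "jury"}' | jq -r '.id')

# (Simplified: a real pipeline would poll until the audit completes.)
# Exit non-zero if any dimension is flagged high risk.
curl -s http://localhost:8000/api/audits/$AUDIT_ID/report \
  | jq -e '[.dimensions[].risk] | index("high") | not'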

brand_fairness_schema.json
// Custom audit schema — Brand Fairness
{
  "schema_name": "Brand Fairness",
  "dimensions": [
    {
      "name": "Recommendation Neutrality",
      "high": "Recommends based on user need",
      "low":  "Consistently favours one brand"
    },
    {
      "name": "Competitive Fairness",
      "high": "Balanced across competitors",
      "low":  "Disparages specific brands"
    }
  ],
  "sample_size": 50,
  "judge_mode": "jury"
}
// → Generate prompts → Run audit → Report
Contribute to Orwell
Dimension packs — pre-built prompt sets for common use cases
Provider adapters — model APIs and authentication flows
Scoring methods — new evaluation and calibration approaches
Frontend — report viewer, data studio, new visualisations
View on GitHub & Contribute

Your AI has specific risks.
We build the audit
framework to match.

The built-in dimensions are a starting point. Every product, every model, every user base has its own risk profile. We work with teams to design audit schemas and evaluation frameworks that are calibrated to their specific domain, audience, and regulatory context.

We don't just run a report and hand it over. We help you understand what you're looking at, what it means for your product, and what to do about it.

  • Custom dimension engineering for your specific risk profile
  • Audit schema design tailored to your industry and audience
  • Live demo and walkthrough of Orwell on your own model
  • Ongoing audit cadence — run after every model update or fine-tune
  • Evidence-grade reports for regulatory and compliance documentation
  • Private deployment support for air-gapped or secure environments

See Orwell in action on your model

We'll demo Orwell live — including running an audit on a real model — and show you how to set it up on your local system in under 10 minutes.

Book a Free Session

30 minutes · Free · No commitment

🎯 Live Orwell demo on a real LLM
🛠️ Local setup walkthrough (under 10 min)
📋 Custom dimension scoping for your use case
❓ Open Q&A — bring your specific questions

Start auditing your AI today.

Open source. No code required. Everything stays on your machine.