Orwell — Know What Your AI Believes

The Problem

AI doesn't just make mistakes.
It has opinions.

Every LLM is trained on data that reflects the world's existing inequalities, cultural assumptions, and commercial interests. That training shapes how the model responds — not randomly, but consistently. The same model will subtly favour certain worldviews, demographics, or brands across millions of conversations.

Most teams test their models for accuracy and performance. Latency dashboards, error rates, output quality scores. These tools matter. But they answer the wrong question.

A model that hallucinations is obviously broken. A model that consistently steers users toward a particular political view, cultural framework, or commercial preference doesn't look broken at all. It just looks helpful — until it isn't.

When AI models power hiring platforms, financial advisors, healthcare assistants, or customer-facing products, these behavioural tendencies don't disappear. They scale — quietly shaping decisions at a volume no human reviewer could catch.

This is what Orwell was built to surface.

Platform type	Core question	Catches behavioral bias?
Observability tools Langfuse, Arize, Datadog	Is the model performing well?	No
Safety filters Guardrails, content moderation	Did the model say something harmful?	No
Orwell Behavioral audit framework	What does the model believe?	Yes

"A speedometer tells you how fast the car is going. Orwell tells you where the driver is trying to go."

Orwell operates in a different layer from observability. Telemetry measures behaviour against expected outputs. Orwell measures the model's underlying tendencies — the bias baked into its weights — independently of whether those tendencies cause an obvious error.

Who needs Orwell

🏢

AI product teams

Validate any LLM before it reaches your users

⚖️

Compliance & legal teams

Evidence-backed audit trails for EU AI Act, internal policy

🔬

AI researchers

Systematic, reproducible LLM behavioral studies

🛡️

Risk & trust teams

Ongoing audit cadence for model updates and fine-tunes

Setup & Installation

Start auditing in minutes

Run Orwell locally on your machine. Everything is open source, and your data never leaves your system.

              Terminal
            
# Clone the repository
git clone https://github.com/whereAGI/orwell.git

# Navigate into the directory
cd orwell

# Run the setup script
./install.sh

              Terminal
            
# Clone the repository
git clone https://github.com/whereAGI/orwell.git

# Navigate into the directory
cd orwell

# Run the setup script
./install.sh

              Command Prompt / PowerShell
            
# Clone the repository
git clone https://github.com/whereAGI/orwell.git

# Navigate into the directory
cd orwell

# Run the setup script
install.bat

Need help setting up, or want to customize Orwell for your business?

Schedule a Demo

Capabilities

Everything you need to audit
any model, any bias

🧠

Any LLM, Any Provider

Works with OpenAI, Anthropic, Google, Mistral, DeepSeek, Qwen — and any custom endpoint that speaks OpenAI-compatible API. Including full support for reasoning models like o3 and DeepSeek-R1.

🚫

No Code Required

The entire audit workflow — model setup, dimension selection, prompt generation, report reading — runs in the browser. No Python, no terminal, no config files needed.

⚖️

Judge Bench System

Single judge, multi-judge panel, or Jury mode with a Foreman that adjudicates disagreements. Up to 5 independent judge models score every response for maximum confidence.

🗂️

Custom Schemas & Dimensions

Define industry-specific audit schemas for your exact use case. The AI Dimension Builder generates a full prompt set — up to 500 prompts — tailored to your rubric.

📊

Structured Audit Reports

Per-dimension risk scores, radar charts, score distributions, flagged responses, and AI-generated executive summaries — designed for engineers and decision-makers alike.

🔒

Local-First & Private

Everything runs on your machine. All prompts, responses, and reports are stored in a local SQLite database. Nothing is sent to external servers — ever.

🦙

Ollama — Fully Offline

Run Orwell with zero internet dependency using local models via Ollama. Use Qwen, Mistral, DeepSeek, MiniMax or Gemma as both target and judge. No API key required.

📝

System Prompt Analysis

Orwell captures and evaluates whether your system prompt mitigates or amplifies the biases found during the audit — a critical layer for chatbot and agent deployments.

📡

Live Audit Streaming

Watch every prompt, response, and score appear in real time via server-sent event streaming. Full visibility into the audit as it runs — no black boxes.

The Judge Bench

One judge isn't enough.
Orwell knows that.

A single AI judge introduces its own biases into the evaluation. Orwell's Judge Bench lets you configure a panel of up to 5 independent judge models that evaluate responses together — giving you a more defensible, calibrated result.

random

Random Mode

One judge is selected at random per response. If the score falls below the failure threshold, a second judge automatically rescores for a cross-check. Fast, cost-efficient, and great for exploratory audits.

all

All Mode

Every judge on the bench scores every response concurrently. Scores are averaged per dimension. Surfaces disagreements across models. Best for thorough audits where full coverage matters.

jury

Jury Mode

All judges score every response, then a designated Foreman model reviews all scores and reasons — delivering a final synthesised verdict. When judges disagree (std dev > 1.5), the Foreman adjudicates. The highest-confidence configuration. Recommended for compliance-grade audits.

Jury Mode — Response Evaluation Flow

Target Model Response

↓ sent to all judges simultaneously

Judge 1 → score + reason

Judge 2 → score + reason

Judge 3 → score + reason

↓ if std dev > 1.5, flagged for Foreman

Foreman Model → final verdict

↓

Final Score → Report

Mode comparison

Mode	Final score	Best for
random	Primary judge	Fast audits
all	Mean across judges	Full coverage
jury	Foreman verdict	Compliance-grade

Use Cases

Audit anything you can define

Orwell ships with cultural dimensions out of the box — but the framework is built for any behavioral axis your product needs.

Cultural Bias Auditing

Does your model subtly favour Western individualism over collectivist cultures? Does it give different advice based on cultural context? Orwell's built-in GLOBE dimension library tests 9 cross-cultural axes backed by peer-reviewed research.

Power Distance — hierarchy acceptance

Gender Egalitarianism — role differentiation

Institutional Collectivism

Humane Orientation

Future Orientation

Example risk output

Power Distance — High Risk (mean 2.4/7)

Gender Egalitarianism — Medium Risk (mean 3.8/7)

Humane Orientation — Low Risk (mean 5.9/7)

Scores are illustrative. Actual results vary by model.

Hiring Equity

AI-assisted hiring tools are under intense regulatory scrutiny. Orwell lets you systematically test whether your model evaluates candidates differently based on name, gender, ethnicity, or educational background.

Name-based demographic inference

Gender-coded language in feedback

Educational pedigree bias

Geographically coded credential bias

Why this matters

The EU AI Act classifies AI hiring tools as high-risk systems with strict documentation requirements. Orwell provides the audit trail you need to demonstrate compliance and due diligence.

Political Neutrality

Models embedded in news platforms, research tools, or general-purpose assistants can subtly favour political positions, parties, or ideological frameworks — without triggering any safety filter.

Policy framing and language bias

Party and candidate sentiment

Source credibility attribution

Ideological framing in historical context

Define your own standard

Political neutrality means different things in different contexts. Orwell lets you define exactly what balanced looks like for your platform — and test against that definition.

Brand & Product Fairness

Does your AI assistant consistently recommend the same brand, disparage competitors, or use promotional language for specific products? These tendencies can create real commercial and reputational liability.

Unprompted brand recommendations

Competitor sentiment asymmetry

Promotional vs. neutral language

Pricing and value framing

Real-world scenario

A retail AI assistant consistently recommends a specific brand of headphones across unrelated queries. Orwell surfaces this pattern before your users do.

Clinical Safety

Healthcare AI must give consistent guidance regardless of patient demographics. Orwell tests whether your clinical model applies different standards based on age, gender, ethnicity, or socioeconomic indicators.

Demographic-based dosage variation

Differential urgency recommendations

Socioeconomic assumption in advice

Symptom credibility by demographic

Regulatory context

Clinical AI tools are high-risk under EU AI Act and subject to FDA oversight in the US. Behavioral audit documentation is becoming a baseline requirement for deployment.

Build Any Dimension

If you can describe what balanced, fair, or neutral looks like in your domain — Orwell can test for it. The Dimension Builder generates a full prompt set from your rubric using AI.

Name your dimension

Describe high and low-scoring behaviour

Generate up to 500 scenario prompts via AI

Review, approve, and add to your library

Run immediately in Audit Studio

The Orwell standard

Community-contributed dimension packs are coming. Any dimension you build can be shared back to the Orwell library for others to use, benchmark against, and improve.

Privacy First

Your audit data never
leaves your machine.

Orwell is built on a local-first architecture. No cloud storage. No telemetry. No third-party data sharing.

🗄️

SQLite — File-Based Storage

All prompts, model responses, scores, and reports are stored in a single local SQLite database file. You own your data completely. Export it, back it up, or delete it at any time.

📡

No Telemetry

Orwell makes no outbound calls except to the model endpoints you explicitly configure. There are no analytics, no crash reporters, and no hidden callbacks to any external server.

🏢

Air-Gapped Deployments

Running on sensitive infrastructure? Combine Orwell with Ollama to audit models with zero internet dependency. The entire audit pipeline — target model, judge model, and Orwell itself — runs locally.

🦙

Running fully offline with Ollama

Install Ollama, pull any model (e.g. ollama run qwen3.5), and register it in Orwell's Model Hub . No API key needed. Run audits with zero internet access — ideal for regulated industries and sensitive research environments.

The Orwell Framework

A standard, not just
a tool.

The software is open source. But the goal is bigger than a repo — it's to establish a rigorous, community-maintained standard for how LLM behavioral auditing should work.

📚

Open Dimension Library

A versioned, community-maintained library of audit dimensions. Grounded in the GLOBE research framework today — expandable to any domain by the community.

🔬

Reproducible Methodology

Every audit captures its exact configuration — judge model, system prompt, temperature, sample size. Reports are reproducible and comparable across time and teams.

🤝

Community-Driven

Contribute dimension packs, provider adapters, scoring methodologies, or frontend improvements. The framework grows as the community grows.

🏗️

Built to Be Extended

REST API at its core. Integrate Orwell into your CI/CD pipeline, trigger audits on model updates, or build your own reporting layer on top of it.

            brand_fairness_schema.json
          

            // Custom audit schema — Brand Fairness
{
  "schema_name": "Brand Fairness",
  "dimensions": [
    {
      "name": "Recommendation Neutrality",
      "high": "Recommends based on user need",
      "low":  "Consistently favours one brand"
    },
    {
      "name": "Competitive Fairness",
      "high": "Balanced across competitors",
      "low":  "Disparages specific brands"
    }
  ],
  "sample_size": 50,
  "judge_mode": "jury"
}
// → Generate prompts → Run audit → Report
          

Contribute to Orwell

• Dimension packs — pre-built prompt sets for common use cases

• Provider adapters — model APIs and authentication flows

• Scoring methods — new evaluation and calibration approaches

• Frontend — report viewer, data studio, new visualisations

View on GitHub & Contribute

Enterprise & Custom Solutions

Your AI has specific risks.
We build the audit
framework to match.

The built-in dimensions are a starting point. Every product, every model, every user base has its own risk profile. We work with teams to design audit schemas and evaluation frameworks that are calibrated to their specific domain, audience, and regulatory context.

We don't just run a report and hand it over. We help you understand what you're looking at, what it means for your product, and what to do about it.

Custom dimension engineering for your specific risk profile
Audit schema design tailored to your industry and audience
Live demo and walkthrough of Orwell on your own model
Ongoing audit cadence — run after every model update or fine-tune
Evidence-grade reports for regulatory and compliance documentation
Private deployment support for air-gapped or secure environments

Book a Session

See Orwell in action on your model

We'll demo Orwell live — including running an audit on a real model — and show you how to set it up on your local system in under 10 minutes.

Book a Free Session

30 minutes · Free · No commitment

🎯 Live Orwell demo on a real LLM

🛠️ Local setup walkthrough (under 10 min)

📋 Custom dimension scoping for your use case

❓ Open Q&A — bring your specific questions

Know what your
AI believes.

AI doesn't just make mistakes.
It has opinions.

Start auditing in minutes

Everything you need to audit
any model, any bias

Any LLM, Any Provider

No Code Required

Judge Bench System

Custom Schemas & Dimensions

Structured Audit Reports

Local-First & Private

Ollama — Fully Offline

System Prompt Analysis

Live Audit Streaming

One judge isn't enough.
Orwell knows that.

Random Mode

All Mode

Jury Mode

Audit anything you can define

Cultural Bias Auditing

Hiring Equity

Political Neutrality

Brand & Product Fairness

Clinical Safety

Build Any Dimension

Your audit data never
leaves your machine.

SQLite — File-Based Storage

No Telemetry

Air-Gapped Deployments

Running fully offline with Ollama

A standard, not just
a tool.

Open Dimension Library

Reproducible Methodology

Community-Driven

Built to Be Extended

Your AI has specific risks.
We build the audit
framework to match.

See Orwell in action on your model

Start auditing your AI today.

Know what yourAI believes.

AI doesn't just make mistakes.It has opinions.

Start auditing in minutes

Everything you need to auditany model, any bias

Any LLM, Any Provider

No Code Required

Judge Bench System

Custom Schemas & Dimensions

Structured Audit Reports

Local-First & Private

Ollama — Fully Offline

System Prompt Analysis

Live Audit Streaming

One judge isn't enough.Orwell knows that.

Random Mode

All Mode

Jury Mode

Audit anything you can define

Cultural Bias Auditing

Hiring Equity

Political Neutrality

Brand & Product Fairness

Clinical Safety

Build Any Dimension

Your audit data neverleaves your machine.

SQLite — File-Based Storage

No Telemetry

Air-Gapped Deployments

Running fully offline with Ollama

A standard, not justa tool.

Open Dimension Library

Reproducible Methodology

Community-Driven

Built to Be Extended

Your AI has specific risks.We build the auditframework to match.

See Orwell in action on your model

Start auditing your AI today.

Know what your
AI believes.

AI doesn't just make mistakes.
It has opinions.

Everything you need to audit
any model, any bias

One judge isn't enough.
Orwell knows that.

Your audit data never
leaves your machine.

A standard, not just
a tool.

Your AI has specific risks.
We build the audit
framework to match.