LLMs don't just answer questions — they carry opinions, biases, and cultural assumptions baked into their weights. Orwell is the open framework for systematically surfacing them — no code required.
Every LLM is trained on data that reflects the world's existing inequalities, cultural assumptions, and commercial interests. That training shapes how the model responds — not randomly, but consistently. The same model will subtly favour certain worldviews, demographics, or brands across millions of conversations.
Most teams test their models for accuracy and performance. Latency dashboards, error rates, output quality scores. These tools matter. But they answer the wrong question.
A model that hallucinations is obviously broken. A model that consistently steers users toward a particular political view, cultural framework, or commercial preference doesn't look broken at all. It just looks helpful — until it isn't.
When AI models power hiring platforms, financial advisors, healthcare assistants, or customer-facing products, these behavioural tendencies don't disappear. They scale — quietly shaping decisions at a volume no human reviewer could catch.
This is what Orwell was built to surface.
| Platform type | Core question | Catches behavioral bias? |
|---|---|---|
| Observability tools Langfuse, Arize, Datadog |
Is the model performing well? | No |
| Safety filters Guardrails, content moderation |
Did the model say something harmful? | No |
| Orwell Behavioral audit framework |
What does the model believe? | Yes |
"A speedometer tells you how fast the car is going. Orwell tells you where the driver is trying to go."
Orwell operates in a different layer from observability. Telemetry measures behaviour against expected outputs. Orwell measures the model's underlying tendencies — the bias baked into its weights — independently of whether those tendencies cause an obvious error.
Run Orwell locally on your machine. Everything is open source, and your data never leaves your system.
# Clone the repository git clone https://github.com/whereAGI/orwell.git # Navigate into the directory cd orwell # Run the setup script ./install.sh
# Clone the repository git clone https://github.com/whereAGI/orwell.git # Navigate into the directory cd orwell # Run the setup script ./install.sh
# Clone the repository git clone https://github.com/whereAGI/orwell.git # Navigate into the directory cd orwell # Run the setup script install.bat
Need help setting up, or want to customize Orwell for your business?
Schedule a DemoWorks with OpenAI, Anthropic, Google, Mistral, DeepSeek, Qwen — and any custom endpoint that speaks OpenAI-compatible API. Including full support for reasoning models like o3 and DeepSeek-R1.
The entire audit workflow — model setup, dimension selection, prompt generation, report reading — runs in the browser. No Python, no terminal, no config files needed.
Single judge, multi-judge panel, or Jury mode with a Foreman that adjudicates disagreements. Up to 5 independent judge models score every response for maximum confidence.
Define industry-specific audit schemas for your exact use case. The AI Dimension Builder generates a full prompt set — up to 500 prompts — tailored to your rubric.
Per-dimension risk scores, radar charts, score distributions, flagged responses, and AI-generated executive summaries — designed for engineers and decision-makers alike.
Everything runs on your machine. All prompts, responses, and reports are stored in a local SQLite database. Nothing is sent to external servers — ever.
Run Orwell with zero internet dependency using local models via Ollama. Use Qwen, Mistral, DeepSeek, MiniMax or Gemma as both target and judge. No API key required.
Orwell captures and evaluates whether your system prompt mitigates or amplifies the biases found during the audit — a critical layer for chatbot and agent deployments.
Watch every prompt, response, and score appear in real time via server-sent event streaming. Full visibility into the audit as it runs — no black boxes.
A single AI judge introduces its own biases into the evaluation. Orwell's Judge Bench lets you configure a panel of up to 5 independent judge models that evaluate responses together — giving you a more defensible, calibrated result.
One judge is selected at random per response. If the score falls below the failure threshold, a second judge automatically rescores for a cross-check. Fast, cost-efficient, and great for exploratory audits.
Every judge on the bench scores every response concurrently. Scores are averaged per dimension. Surfaces disagreements across models. Best for thorough audits where full coverage matters.
All judges score every response, then a designated Foreman model reviews all scores and reasons — delivering a final synthesised verdict. When judges disagree (std dev > 1.5), the Foreman adjudicates. The highest-confidence configuration. Recommended for compliance-grade audits.
| Mode | Final score | Best for |
|---|---|---|
| random | Primary judge | Fast audits |
| all | Mean across judges | Full coverage |
| jury | Foreman verdict | Compliance-grade |
Orwell ships with cultural dimensions out of the box — but the framework is built for any behavioral axis your product needs.
Does your model subtly favour Western individualism over collectivist cultures? Does it give different advice based on cultural context? Orwell's built-in GLOBE dimension library tests 9 cross-cultural axes backed by peer-reviewed research.
AI-assisted hiring tools are under intense regulatory scrutiny. Orwell lets you systematically test whether your model evaluates candidates differently based on name, gender, ethnicity, or educational background.
The EU AI Act classifies AI hiring tools as high-risk systems with strict documentation requirements. Orwell provides the audit trail you need to demonstrate compliance and due diligence.
Models embedded in news platforms, research tools, or general-purpose assistants can subtly favour political positions, parties, or ideological frameworks — without triggering any safety filter.
Political neutrality means different things in different contexts. Orwell lets you define exactly what balanced looks like for your platform — and test against that definition.
Does your AI assistant consistently recommend the same brand, disparage competitors, or use promotional language for specific products? These tendencies can create real commercial and reputational liability.
A retail AI assistant consistently recommends a specific brand of headphones across unrelated queries. Orwell surfaces this pattern before your users do.
Healthcare AI must give consistent guidance regardless of patient demographics. Orwell tests whether your clinical model applies different standards based on age, gender, ethnicity, or socioeconomic indicators.
Clinical AI tools are high-risk under EU AI Act and subject to FDA oversight in the US. Behavioral audit documentation is becoming a baseline requirement for deployment.
If you can describe what balanced, fair, or neutral looks like in your domain — Orwell can test for it. The Dimension Builder generates a full prompt set from your rubric using AI.
Community-contributed dimension packs are coming. Any dimension you build can be shared back to the Orwell library for others to use, benchmark against, and improve.
Orwell is built on a local-first architecture. No cloud storage. No telemetry. No third-party data sharing.
All prompts, model responses, scores, and reports are stored in a single local SQLite database file. You own your data completely. Export it, back it up, or delete it at any time.
Orwell makes no outbound calls except to the model endpoints you explicitly configure. There are no analytics, no crash reporters, and no hidden callbacks to any external server.
Running on sensitive infrastructure? Combine Orwell with Ollama to audit models with zero internet dependency. The entire audit pipeline — target model, judge model, and Orwell itself — runs locally.
Install Ollama, pull any model (e.g. ollama run qwen3.5), and register it in Orwell's Model Hub . No API key needed. Run audits with zero internet access — ideal for regulated industries and sensitive research environments.
The software is open source. But the goal is bigger than a repo — it's to establish a rigorous, community-maintained standard for how LLM behavioral auditing should work.
A versioned, community-maintained library of audit dimensions. Grounded in the GLOBE research framework today — expandable to any domain by the community.
Every audit captures its exact configuration — judge model, system prompt, temperature, sample size. Reports are reproducible and comparable across time and teams.
Contribute dimension packs, provider adapters, scoring methodologies, or frontend improvements. The framework grows as the community grows.
REST API at its core. Integrate Orwell into your CI/CD pipeline, trigger audits on model updates, or build your own reporting layer on top of it.
// Custom audit schema — Brand Fairness { "schema_name": "Brand Fairness", "dimensions": [ { "name": "Recommendation Neutrality", "high": "Recommends based on user need", "low": "Consistently favours one brand" }, { "name": "Competitive Fairness", "high": "Balanced across competitors", "low": "Disparages specific brands" } ], "sample_size": 50, "judge_mode": "jury" } // → Generate prompts → Run audit → Report
The built-in dimensions are a starting point. Every product, every model, every user base has its own risk profile. We work with teams to design audit schemas and evaluation frameworks that are calibrated to their specific domain, audience, and regulatory context.
We don't just run a report and hand it over. We help you understand what you're looking at, what it means for your product, and what to do about it.
We'll demo Orwell live — including running an audit on a real model — and show you how to set it up on your local system in under 10 minutes.
Book a Free Session30 minutes · Free · No commitment
Open source. No code required. Everything stays on your machine.