A step-by-step guide from installation to production CI/CD integration.
pip install ragaliq
export ANTHROPIC_API_KEY=sk-ant-...
Verify the install:
ragaliq version
ragaliq list-evaluators
You should see the RagaliQ version and the five built-in evaluators listed.
The core workflow is: create a RAGTestCase → pass it to RagaliQ.evaluate() → inspect the RAGTestResult.
from ragaliq import RagaliQ, RAGTestCase
# RagaliQ defaults to Claude as judge, and runs faithfulness + relevance
tester = RagaliQ(judge="claude")
test_case = RAGTestCase(
id="tutorial-1",
name="Capital query",
query="What is the capital of France?",
context=[
"France is a country in Western Europe.",
"The capital city of France is Paris, which is home to the Eiffel Tower.",
],
response="The capital of France is Paris.",
)
result = tester.evaluate(test_case)
print(f"Status: {result.status}")
print(f"Faithfulness: {result.scores['faithfulness']:.2f}")
print(f"Relevance: {result.scores['relevance']:.2f}")
print(f"Passed: {result.passed}")
What happens internally:
- RagaliQ initialises lazily on the first call — it creates a ClaudeJudge and instantiates the FaithfulnessEvaluator and RelevanceEvaluator.
- Both evaluators run concurrently via asyncio.gather.
- FaithfulnessEvaluator first calls the judge to extract atomic claims from the response, then verifies each claim against the context. Score = supported claims / total claims.
- RelevanceEvaluator makes a single judge call to score how well the response answers the query.
- When every score clears its threshold, the status is PASSED.
RAGTestResult carries everything you need for reporting and debugging:
result.status # EvalStatus.PASSED | FAILED | ERROR | SKIPPED
result.passed # bool — True when status == PASSED
result.scores # {"faithfulness": 0.92, "relevance": 0.88}
result.execution_time_ms # wall-clock time in ms
result.judge_tokens_used # total tokens consumed across all evaluators
# Per-evaluator details (reasoning, raw judge response, errors)
faith_detail = result.details["faithfulness"]
print(faith_detail["reasoning"]) # Human-readable explanation
print(faith_detail["passed"]) # bool
print(faith_detail["raw"]) # Raw judge response dict for debugging
Error handling: If an evaluator fails (network error, bad API response, etc.), the result carries status=ERROR and the error is recorded in details["evaluator_name"]["error"]. The other evaluators in the batch still run — errors are isolated, not fatal.
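The isolation behaviour described above can be sketched with plain asyncio. This is a simplified illustration of the pattern, not RagaliQ's actual implementation; the evaluator functions are placeholders:

```python
import asyncio

async def eval_faithfulness() -> float:
    return 0.92

async def eval_relevance() -> float:
    raise ConnectionError("network error")  # simulate a failing evaluator

async def run_all():
    # return_exceptions=True keeps one evaluator's failure from
    # cancelling the others: errors come back as values, not raises
    outcomes = await asyncio.gather(
        eval_faithfulness(), eval_relevance(), return_exceptions=True
    )
    results = {}
    for name, outcome in zip(["faithfulness", "relevance"], outcomes):
        if isinstance(outcome, Exception):
            results[name] = {"error": str(outcome)}  # recorded, not fatal
        else:
            results[name] = {"score": outcome}
    return results

results = asyncio.run(run_all())
print(results)
```

The failing evaluator ends up as an `error` entry while the healthy one still reports its score, mirroring the `details["evaluator_name"]["error"]` behaviour above.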
Run many test cases efficiently with evaluate_batch():
import json
from ragaliq import RagaliQ, RAGTestCase
tester = RagaliQ(
judge="claude",
evaluators=["faithfulness", "relevance", "hallucination"],
default_threshold=0.75,
max_concurrency=5, # Up to 5 test cases run in parallel
)
test_cases = [
RAGTestCase(id=f"tc-{i}", name=f"Query {i}", query="...", context=["..."], response="...")
for i in range(20)
]
results = tester.evaluate_batch(test_cases)
passed = sum(1 for r in results if r.passed)
print(f"{passed}/{len(results)} passed")
# Inspect failures
for r in results:
if not r.passed:
for metric, score in r.scores.items():
if score < 0.75:
print(f" [{r.test_case.name}] {metric}: {score:.2f}")
Concurrency model: max_concurrency=5 means up to 5 test cases evaluate simultaneously. Each test case itself runs all its evaluators in parallel (bounded by max_judge_concurrency=20). For faithfulness, each claim is verified in parallel. This means a batch of 20 test cases makes many concurrent API calls — adjust max_concurrency based on your rate limits.
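The max_concurrency bound is the classic semaphore pattern. A minimal sketch of how such a cap behaves (the sleep stands in for judge API calls; this is illustrative, not RagaliQ's code):

```python
import asyncio

async def evaluate_one(case_id: int, sem: asyncio.Semaphore,
                       running: list, peak: list) -> int:
    async with sem:                      # at most max_concurrency cases inside
        running[0] += 1
        peak[0] = max(peak[0], running[0])
        await asyncio.sleep(0.01)        # stand-in for the judge API calls
        running[0] -= 1
        return case_id

async def evaluate_batch(n: int, max_concurrency: int = 5):
    sem = asyncio.Semaphore(max_concurrency)
    running, peak = [0], [0]
    results = await asyncio.gather(
        *(evaluate_one(i, sem, running, peak) for i in range(n))
    )
    return results, peak[0]

results, peak = asyncio.run(evaluate_batch(20))
print(f"evaluated {len(results)} cases, peak concurrency {peak}")
```

All 20 cases are scheduled at once, but the semaphore keeps the number actually in flight at the limit, which is why lowering max_concurrency is the right knob when you hit rate limits.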
# Default: faithfulness + relevance
tester = RagaliQ(judge="claude")
# Override to specific evaluators
tester = RagaliQ(
judge="claude",
evaluators=["faithfulness", "relevance", "hallucination"],
)
# All five built-in evaluators
tester = RagaliQ(
judge="claude",
evaluators=["faithfulness", "relevance", "hallucination",
"context_precision", "context_recall"],
)
faithfulness — Decomposes the response into atomic claims, verifies each claim against the context. Score = supported claims / total claims. Use when you need to ensure the response doesn’t add facts beyond what the context contains.
relevance — Single judge call asking: “Does this response answer the query?” Score 0–1. Use when you need to ensure the response is on-topic.
hallucination — Same claim pipeline as faithfulness, but scored as 1 − (hallucinated / total). Threshold defaults to 0.8 (stricter). Use when the cost of made-up information is high.
context_precision — Evaluates whether the retrieved context is relevant to the query. Uses weighted rank-order precision (top-ranked docs contribute more). Measures retrieval quality, not response quality.
context_recall — Checks whether the context covers all expected_facts. Requires test_case.expected_facts to be set. Measures whether retrieval missed anything important.
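The faithfulness and hallucination scores above reduce to simple ratios over the per-claim verdicts. A minimal sketch of the arithmetic:

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    # verdicts[i] is True when claim i is supported by the context
    return sum(verdicts) / len(verdicts)

def hallucination_score(verdicts: list[bool]) -> float:
    # scored as 1 - (hallucinated / total), so higher is better
    hallucinated = sum(1 for v in verdicts if not v)
    return 1 - hallucinated / len(verdicts)

verdicts = [True, True, True, False]     # 3 of 4 claims supported
print(faithfulness_score(verdicts))      # 0.75
print(hallucination_score(verdicts))     # 0.75
```

When every unsupported claim counts as hallucinated the two ratios coincide, as here; the practical difference is the hallucination evaluator's stricter 0.8 default threshold, under which this example fails.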
# context_recall requires expected_facts
test_case = RAGTestCase(
id="recall-1",
name="Facts check",
query="Tell me about Python.",
context=["Python was created by Guido van Rossum and released in 1991."],
response="Python was created by Guido van Rossum in 1991.",
expected_facts=["created by Guido van Rossum", "released in 1991"],
)
{
"version": "1.0",
"metadata": {"source": "my-docs", "date": "2026-01"},
"test_cases": [
{
"id": "tc-001",
"name": "Python creation date",
"query": "When was Python created?",
"context": ["Python was created by Guido van Rossum and released in 1991."],
"response": "Python was first released in 1991 by Guido van Rossum.",
"expected_answer": "1991",
"expected_facts": ["released in 1991", "created by Guido van Rossum"],
"tags": ["python", "history"]
}
]
}
version: "1.0"
test_cases:
- id: tc-001
name: Python creation date
query: When was Python created?
context:
- Python was created by Guido van Rossum and released in 1991.
response: Python was first released in 1991 by Guido van Rossum.
tags:
- python
- history
Required columns: id, name, query, context, response
id,name,query,context,response,tags
tc-001,Python date,When was Python created?,Python was released in 1991.,Python was created in 1991.,python|history
List fields (context, expected_facts, tags) use pipe-separated values: doc1|doc2|doc3.
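Splitting those pipe-separated columns back into lists needs only the stdlib csv module. A sketch of how a loader might read such a row (the helper name is ours, not RagaliQ's):

```python
import csv
import io

CSV_TEXT = """id,name,query,context,response,tags
tc-001,Python date,When was Python created?,Python was released in 1991.,Python was created in 1991.,python|history
"""

LIST_FIELDS = {"context", "expected_facts", "tags"}

def load_rows(text: str) -> list[dict]:
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in LIST_FIELDS & row.keys():
            row[field] = row[field].split("|")  # "python|history" -> ["python", "history"]
        rows.append(row)
    return rows

rows = load_rows(CSV_TEXT)
print(rows[0]["tags"])      # ['python', 'history']
print(rows[0]["context"])   # ['Python was released in 1991.']
```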
from ragaliq.datasets import DatasetLoader
dataset = DatasetLoader.load("my_dataset.json")
print(f"Loaded {len(dataset.test_cases)} test cases")
results = tester.evaluate_batch(dataset.test_cases)
Generate test cases automatically from your documentation:
import asyncio
from ragaliq import TestCaseGenerator
from ragaliq.judges import ClaudeJudge
judge = ClaudeJudge()
generator = TestCaseGenerator()
# From in-memory documents
docs = [
"Python is a high-level programming language created in 1991.",
"Python supports multiple programming paradigms including OOP and functional.",
]
test_cases = asyncio.run(
generator.generate_from_documents(documents=docs, n=5, judge=judge)
)
# Or via CLI (reads .txt files from a directory)
# ragaliq generate ./docs/ --num 50 --output dataset.json
Generated cases are tagged with ["generated"] and given UUID-based IDs.
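UUID-based IDs like those can be produced with the stdlib; a sketch of the general shape (the "gen-" prefix is our assumption, not RagaliQ's documented format):

```python
import uuid

def make_generated_id() -> str:
    # hypothetical ID shape: a short prefix plus a random UUID4
    return f"gen-{uuid.uuid4()}"

case_id = make_generated_id()
print(case_id)
```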
ragaliq run
# Basic run (faithfulness + relevance at threshold 0.7)
ragaliq run dataset.json
# Custom evaluators and threshold
ragaliq run dataset.json \
--evaluator faithfulness \
--evaluator hallucination \
--threshold 0.8
# Export to JSON (CI-friendly)
ragaliq run dataset.json --output json --output-file report.json
# Export to HTML (shareable)
ragaliq run dataset.json --output html --output-file report.html
# Stop on first evaluator error (debug mode)
ragaliq run dataset.json --fail-fast
Exit code is 0 when all tests pass, 1 when any fail — integrates naturally with CI.
ragaliq generate
# From a directory of .txt files
ragaliq generate ./docs/ --num 20 --output dataset.json
# From a single document
ragaliq generate document.txt --num 10 --output dataset.json
ragaliq validate
# Check dataset schema without making any LLM calls
ragaliq validate dataset.json
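The kind of schema check this performs can be approximated offline with the stdlib. A minimal sketch, using the required fields from the dataset format above (validate_dataset is our illustrative helper, not RagaliQ's implementation):

```python
import json

REQUIRED = {"id", "name", "query", "context", "response"}

def validate_dataset(text: str) -> list[str]:
    """Return a list of problems; an empty list means the dataset looks valid."""
    problems = []
    data = json.loads(text)
    for i, case in enumerate(data.get("test_cases", [])):
        missing = REQUIRED - case.keys()
        if missing:
            problems.append(f"test_cases[{i}] missing: {sorted(missing)}")
        if not isinstance(case.get("context"), list):
            problems.append(f"test_cases[{i}]: context must be a list")
    return problems

good = ('{"version": "1.0", "test_cases": [{"id": "tc-1", "name": "n", '
        '"query": "q", "context": ["c"], "response": "r"}]}')
bad = '{"test_cases": [{"id": "tc-2", "query": "q"}]}'
print(validate_dataset(good))   # []
print(validate_dataset(bad))
```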
ragaliq list-evaluators
ragaliq list-evaluators
The pytest plugin is registered automatically when RagaliQ is installed. No conftest.py setup needed.
# test_rag.py
import pytest
from ragaliq import RAGTestCase
from ragaliq.integrations.pytest_plugin import assert_rag_quality
@pytest.mark.rag_test
def test_faithful_answer(rag_tester):
"""Use rag_tester fixture for direct evaluation."""
test_case = RAGTestCase(
id="pytest-1",
name="Capital query",
query="What is the capital of France?",
context=["France is a country in Western Europe. Its capital city is Paris."],
response="The capital of France is Paris.",
)
result = rag_tester.evaluate(test_case)
assert result.passed, f"Quality check failed: {result.scores}"
@pytest.mark.rag_test
def test_with_helper(ragaliq_judge):
"""Use assert_rag_quality for inline assertion."""
test_case = RAGTestCase(
id="pytest-2",
name="ML definition",
query="What is machine learning?",
context=["Machine learning is a subset of AI that enables systems to learn from data."],
response="Machine learning enables systems to learn from data.",
)
# Raises AssertionError with failing metric names+scores if any metric < threshold
assert_rag_quality(test_case, judge=ragaliq_judge, threshold=0.8)
@pytest.mark.rag_test
def test_strict_quality(rag_tester):
"""Use a stricter threshold for critical content."""
from ragaliq import RagaliQ
strict_tester = RagaliQ(judge=rag_tester._judge, default_threshold=0.9)
test_case = RAGTestCase(
id="pytest-strict-1",
name="Medical query",
query="What is the dosage?",
context=["The recommended dosage is 500mg twice daily."],
response="The dosage is 500mg twice per day.",
)
result = strict_tester.evaluate(test_case)
assert result.passed, f"Strict check failed: {result.scores}"
# Run with real API key
ANTHROPIC_API_KEY=sk-ant-... pytest tests/ -v
# Skip slow tests (marked with @pytest.mark.rag_slow)
pytest tests/ -m "not rag_slow"
# Limit spend per session
pytest tests/ --ragaliq-cost-limit 2.00
# Simulate latency (useful for testing resilience)
pytest tests/ --ragaliq-latency-ms 200
# Show RagaliQ stats in terminal output
pytest tests/ -v --ragaliq-judge claude
After the test session, RagaliQ appends a summary showing total API calls, tokens used, estimated cost, and any failures.
from ragaliq.reports import ConsoleReporter
reporter = ConsoleReporter(threshold=0.7)
reporter.report(results)
# Verbose: show reasoning for all results, not just failures
reporter = ConsoleReporter(threshold=0.7, verbose=True)
reporter.report(results)
Renders a table of scores (green/red relative to threshold), failure reasoning, and a per-evaluator summary.
from ragaliq.reports import JSONReporter
from pathlib import Path
json_str = JSONReporter(threshold=0.7).export(results)
Path("report.json").write_text(json_str)
The JSON includes a summary block (total, passed, failed, pass_rate, per-evaluator stats) and a results array with full test case data. Suitable for downstream processing and CI artifact storage.
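The summary block's fields are straightforward to derive from a results list. A sketch using plain dicts in place of RAGTestResult objects (the dict shape here is illustrative):

```python
import json

def build_summary(results: list[dict]) -> dict:
    # each result dict mirrors the `passed` field a RAGTestResult exposes
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
    }

results = [
    {"id": "tc-1", "passed": True},
    {"id": "tc-2", "passed": True},
    {"id": "tc-3", "passed": False},
]
summary = build_summary(results)
print(json.dumps(summary))
```

A downstream CI step can then gate on pass_rate without re-parsing individual results.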
from ragaliq.reports import HTMLReporter
from pathlib import Path
html_str = HTMLReporter(threshold=0.7).export(results)
Path("report.html").write_text(html_str)
Self-contained HTML — no external dependencies or JavaScript. Shareable as a single file.
RagaliQ auto-detects GitHub Actions and activates CI mode:
In CI mode it emits ::error:: annotations on diffs and makes total, passed, failed, and pass_rate available as outputs to downstream steps.
name: RagaliQ Evaluation
on: [pull_request]
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.14"
- run: pip install ragaliq
- name: Run evaluations
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
ragaliq run dataset.json \
--output json \
--output-file report.json \
--threshold 0.7
- uses: actions/upload-artifact@v4
if: always()
with:
name: ragaliq-report
path: report.json
See examples/ci_cd_example/ragaliq-ci.yml for the full example.
from ragaliq.integrations.github_actions import (
is_github_actions,
emit_ci_summary,
write_step_summary,
create_annotations,
)
if is_github_actions():
emit_ci_summary(results, threshold=0.7)
# Equivalent to:
# write_step_summary(format_summary_markdown(results))
# create_annotations(results)
# set_output("passed", str(passed_count))
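Under the hood, GitHub step summaries are just markdown appended to the file named by the GITHUB_STEP_SUMMARY environment variable, a mechanism GitHub Actions provides to any tool. A minimal sketch of that mechanism, independent of RagaliQ's helpers:

```python
import os
import tempfile

def write_summary(markdown: str) -> bool:
    """Append markdown to the Actions step summary, if we're running in CI."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False                      # not running under GitHub Actions
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(markdown + "\n")
    return True

# Simulate the Actions environment for a local demo
tmp = tempfile.NamedTemporaryFile("w", suffix=".md", delete=False)
tmp.close()
os.environ["GITHUB_STEP_SUMMARY"] = tmp.name
written = write_summary("## RagaliQ: 18/20 passed")
print(written)   # True
```

Outside Actions the variable is unset and the helper becomes a no-op, which is the same reason emit_ci_summary is guarded by is_github_actions().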
Custom evaluators register themselves with the global registry and become available everywhere — CLI, pytest fixtures, RagaliQ(evaluators=[...]).
from ragaliq.evaluators import register_evaluator
from ragaliq.core.evaluator import Evaluator, EvaluationResult
from ragaliq.core.test_case import RAGTestCase
from ragaliq.judges.base import LLMJudge
@register_evaluator("conciseness")
class ConcisenessEvaluator(Evaluator):
name = "conciseness"
description = "Measures whether the response is appropriately brief"
threshold = 0.7
async def evaluate(self, test_case: RAGTestCase, judge: LLMJudge) -> EvaluationResult:
# Use an existing judge capability
result = await judge.evaluate_relevance(
query=f"Is this response concise and to the point for the query: {test_case.query}",
response=test_case.response,
)
return EvaluationResult(
evaluator_name=self.name,
score=result.score,
passed=self.is_passing(result.score),
reasoning=result.reasoning,
tokens_used=result.tokens_used,
raw_response={"score": result.score, "reasoning": result.reasoning},
)
# Now available anywhere
from ragaliq import RagaliQ
tester = RagaliQ(judge="claude", evaluators=["faithfulness", "conciseness"])
Key rules for custom evaluators:
- Subclass Evaluator.
- Implement async def evaluate(self, test_case, judge) -> EvaluationResult.
- score must be in [0.0, 1.0].
- Set the error field in EvaluationResult for graceful failure handling.
- The class-level threshold can be overridden by the RagaliQ instance that uses it.
Control which Claude model and generation parameters the judge uses:
from ragaliq import RagaliQ
from ragaliq.judges import ClaudeJudge, JudgeConfig
# Via JudgeConfig
config = JudgeConfig(
model="claude-sonnet-4-6", # default
temperature=0.0, # deterministic scoring
max_tokens=1024, # response length cap
)
tester = RagaliQ(judge="claude", judge_config=config)
# For complex multi-step or gold-standard judging
gold_config = JudgeConfig(model="claude-opus-4-6", temperature=0.0)
# Via pre-configured ClaudeJudge (gives access to trace_collector)
from ragaliq.judges.trace import TraceCollector
collector = TraceCollector()
judge = ClaudeJudge(config=config, trace_collector=collector)
tester = RagaliQ(judge=judge)
# After evaluation
print(f"Total tokens: {collector.total_tokens}")
print(f"Estimated cost: ${collector.total_cost_estimate:.4f}")
TraceCollector records every judge API call with timing, token counts, and success/failure:
from ragaliq.judges import ClaudeJudge
from ragaliq.judges.trace import TraceCollector
collector = TraceCollector()
judge = ClaudeJudge(trace_collector=collector)
from ragaliq import RagaliQ
tester = RagaliQ(judge=judge, evaluators=["faithfulness", "hallucination"])
results = tester.evaluate_batch(test_cases)
# Aggregate stats
print(f"API calls: {len(collector.traces)}")
print(f"Total tokens: {collector.total_tokens}")
print(f"Input tokens: {collector.total_input_tokens}")
print(f"Output tokens: {collector.total_output_tokens}")
print(f"Total latency: {collector.total_latency_ms}ms")
print(f"Est. cost: ${collector.total_cost_estimate:.4f}")
print(f"Failures: {collector.failure_count}")
# Per-operation breakdown
verify_traces = collector.get_by_operation("verify_claim")
print(f"Claim verifications: {len(verify_traces)}")
# Inspect failures
for trace in collector.get_failures():
print(f" Failed: {trace.operation} — {trace.error}")
For a complete runnable example, see examples/basic_usage.py.