cross-judge¶
Multi-vendor LLM ensemble-judge framework — majority / unanimous / Krippendorff-α voting across heterogeneous models.
Quick start¶
from cross_judge import Critic, Ensemble

critics = [
    Critic(name="claude-strict", model="anthropic/claude-sonnet-4.5", vendor="openrouter"),
    Critic(name="ds-pro-creative", model="deepseek-v4-pro", vendor="deepseek", temperature=0.7),
    Critic(name="kimi-rigor", model="moonshot/kimi-k2", vendor="openrouter"),
]

ensemble = Ensemble(critics=critics, voting="majority")
result = ensemble.judge(query="Is this isomorphic to power-law tail scaling?")
print(result.consensus, result.krippendorff_alpha, result.agreement_pct)
Core API (v0.1)¶
Critic dataclass ¶
Critic(name: str, model: str, prompt_template: str = 'Judge the following query:\n{query}\n\nOutput JSON with kind (KEEP/REJECT/SPLIT/MERGE), confidence (0-1), reasoning.', vendor: str = 'deepseek', system_prompt: str = 'You are a careful judge. Output strict JSON only.', temperature: float = 0.0, max_tokens: int = 2000, api_key: str | None = None, base_url: str | None = None, http_client: Any = None, timeout: float = 60.0)
One LLM critic configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | unique critic identifier (e.g. 'claude-strict', 'deepseek-creative'). Used in EnsembleVerdict.verdicts and disagreement diagnostics. | required |
| model | str | vendor model id (e.g. 'deepseek-v4-pro', 'gpt-4o', 'anthropic/claude-sonnet-4.5' via openrouter). | required |
| prompt_template | str | a str.format() template with a {query} placeholder; extra variables passed via context are merged in. | 'Judge the following query:\n{query}\n\nOutput JSON with kind (KEEP/REJECT/SPLIT/MERGE), confidence (0-1), reasoning.' |
| vendor | str | 'deepseek' (default) / 'openai' / 'openrouter' / 'custom'. For 'custom', pass base_url explicitly. | 'deepseek' |
| system_prompt | str | optional system message ('' = none). | 'You are a careful judge. Output strict JSON only.' |
| temperature | float | sampling temperature (default 0.0 for deterministic judging). | 0.0 |
| max_tokens | int | output cap. | 2000 |
| api_key | str \| None | explicit API key (else read from the vendor's env var). | None |
| base_url | str \| None | explicit base URL override. | None |
| http_client | Any | inject an httpx.Client (or compatible mock) for testing. | None |
| timeout | float | per-request timeout in seconds. | 60.0 |
Example
critic = Critic(
    name="claude-strict",
    model="anthropic/claude-sonnet-4.5",
    vendor="openrouter",
    prompt_template="Judge: {query}\nOutput JSON: ...",
)
v = critic.judge("Is this isomorphic to power-law?", context={})
judge ¶
Run this critic on one query and return a Verdict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | the item to judge. | required |
| context | dict[str, Any] \| None | additional template variables (merged into prompt_template). | None |
Returns:
| Type | Description |
|---|---|
| Verdict | A Verdict with kind, confidence, and reasoning. Errors are surfaced as Verdict(kind='ERROR', error=...) rather than raised as exceptions. |
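The two behaviours described above (context variables merged into the template, and failures returned as an ERROR verdict) can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `SketchVerdict`, `render_prompt`, `safe_judge`, and `failing_llm` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class SketchVerdict:
    """Hypothetical stand-in for Verdict; not the library's class."""
    kind: str
    confidence: float = 0.0
    reasoning: str = ""
    error: Optional[str] = None


def render_prompt(template: str, query: str, context: Optional[Dict[str, Any]] = None) -> str:
    # Merge the query with any extra template variables, then fill the template.
    variables = {**(context or {}), "query": query}
    return template.format(**variables)


def safe_judge(call_llm: Callable[[str], SketchVerdict], prompt: str) -> SketchVerdict:
    # Convert any exception into an ERROR verdict instead of propagating it.
    try:
        return call_llm(prompt)
    except Exception as exc:  # deliberately broad: the caller inspects .error
        return SketchVerdict(kind="ERROR", error=f"{type(exc).__name__}: {exc}")


def failing_llm(prompt: str) -> SketchVerdict:
    # Hypothetical backend that always fails, to exercise the error path.
    raise TimeoutError("request timed out")


prompt = render_prompt(
    "Domain: {domain}\nJudge: {query}",
    "Is this a power law?",
    {"domain": "physics"},
)
verdict = safe_judge(failing_llm, prompt)
print(prompt)
print(verdict.kind, verdict.error)
```

Because the template goes through str.format(), any literal braces in a custom template (for example a JSON skeleton) have to be doubled as `{{` and `}}`.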
Source code in packages/cross-judge/src/cross_judge/core.py
from_yaml_prompt classmethod ¶
Build a Critic from a YAML prompt file shipped under prompts/.
YAML schema (versioned):

version: "0.1"
system_prompt: "..."
user_prompt_template: "Judge: {query}\n..."
Source code in packages/cross-judge/src/cross_judge/core.py
Ensemble dataclass ¶
Ensemble(critics: list[Critic], voting: str | VotingStrategy = 'majority', voting_kwargs: dict[str, Any] = dict())
A panel of Critics + a voting strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| critics | list[Critic] | list of Critic instances. All critics judge every query. | required |
| voting | str \| VotingStrategy | 'majority' (default), 'unanimous', or a custom callable returning (consensus_label, disagreement_bool). | 'majority' |
| voting_kwargs | dict[str, Any] | passed through to the voting strategy (e.g. priority=['REJECT', 'KEEP'] for tie-breaking). | dict() |
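A custom voting callable might look like the sketch below. The source only states that it returns (consensus_label, disagreement_bool); the assumption here is that it receives the list of verdicts positionally and the voting_kwargs as keyword arguments. `veto_vote` and its `veto_label` parameter are hypothetical, and SimpleNamespace objects stand in for real Verdict instances.

```python
from collections import Counter
from types import SimpleNamespace


def veto_vote(verdicts, *, veto_label="REJECT", fallback="UNCLEAR"):
    """Hypothetical custom strategy: a single veto_label vote wins outright."""
    # Assumption: errored verdicts do not count as votes.
    labels = [v.kind for v in verdicts if v.kind != "ERROR"]
    if not labels:
        return fallback, False
    disagreement = len(set(labels)) > 1
    if veto_label in labels:
        return veto_label, disagreement
    # No veto cast: fall back to the most common label.
    return Counter(labels).most_common(1)[0][0], disagreement


votes = [SimpleNamespace(kind=k) for k in ("KEEP", "KEEP", "REJECT")]
print(veto_vote(votes))  # ('REJECT', True)
```

Under that calling-convention assumption, it would be wired in as `Ensemble(critics=critics, voting=veto_vote, voting_kwargs={"veto_label": "REJECT"})`.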
judge ¶
judge(query: str, *, query_id: str | None = None, context: dict[str, Any] | None = None, meta: dict[str, Any] | None = None) -> EnsembleVerdict
Judge a query with all critics and aggregate consensus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | the item to judge. | required |
| query_id | str \| None | optional explicit identifier (defaults to query truncated to 80 chars). | None |
| context | dict[str, Any] \| None | extra template variables passed to each Critic. | None |
| meta | dict[str, Any] \| None | caller-supplied metadata pass-through. | None |
Returns:
| Type | Description |
|---|---|
| EnsembleVerdict | Per-critic verdicts plus consensus, agreement, and Krippendorff α. |
Source code in packages/cross-judge/src/cross_judge/ensemble.py
aggregate_verdicts ¶
aggregate_verdicts(verdicts: list[Verdict], *, query_id: str, meta: dict[str, Any] | None = None) -> EnsembleVerdict
Aggregate a precomputed list of verdicts (useful for parallel-call orchestration outside this class).
Source code in packages/cross-judge/src/cross_judge/ensemble.py
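The "parallel-call orchestration outside this class" pattern can be sketched as follows. The critics here are hypothetical in-process fakes (`make_fake_critic`) rather than real LLM calls, `SketchVerdict` is a stand-in for Verdict, and a plain Counter stands in for the aggregation step.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class SketchVerdict:
    """Hypothetical stand-in for Verdict."""
    kind: str
    critic_id: str


def make_fake_critic(name, label):
    # Hypothetical critic: a real one would call an LLM here.
    def judge(query):
        return SketchVerdict(kind=label, critic_id=name)
    return judge


critics = [
    make_fake_critic("a", "KEEP"),
    make_fake_critic("b", "REJECT"),
    make_fake_critic("c", "KEEP"),
]
query = "Is this isomorphic to power-law tail scaling?"

with ThreadPoolExecutor(max_workers=len(critics)) as pool:
    # Executor.map() yields results in input order, so verdicts line up with critics.
    verdicts = list(pool.map(lambda judge: judge(query), critics))

# Stand-in for the aggregation step: most common label wins.
consensus = Counter(v.kind for v in verdicts).most_common(1)[0][0]
print([v.critic_id for v in verdicts], consensus)  # ['a', 'b', 'c'] KEEP
```

With the real library, the collected list would presumably be handed to `ensemble.aggregate_verdicts(verdicts, query_id=...)` instead of the Counter line.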
Verdict ¶
Bases: BaseModel
A single critic's verdict on one query.
Attributes:
| Name | Type | Description |
|---|---|---|
| kind | str | VerdictKind label (KEEP / REJECT / SPLIT / MERGE / UNCLEAR / ...). |
| confidence | float | 0.0–1.0 self-reported confidence. |
| reasoning | str | 1–4 sentence rationale. |
| critic_id | str | which critic produced this verdict. |
| raw_response | str \| None | the raw LLM response (for audit / debugging). |
| error | str \| None | error string if the call failed; kind will be 'ERROR'. |
| elapsed_s | float | wall-clock seconds of the underlying LLM call. |
The kind accepts free-form strings too (e.g. PASS / FAIL for code review), but the Literal type is the recommended vocabulary for B3/B4-style taxonomy review pipelines.
VerdictKind module-attribute ¶
Default verdict vocabulary for the B3 / B4 universality-class review pattern.
- KEEP: accept the candidate as-is.
- REJECT: discard the candidate (does not meet universality-class standards).
- SPLIT: accept but split into multiple sub-classes (composite candidate).
- MERGE: accept but merge with an existing class (duplicate / overlap).
- UNCLEAR / ERROR / PARSE_FAIL: fallback labels for partial / failed verdicts.
EnsembleVerdict ¶
Bases: BaseModel
Aggregate result for one query across all critics in an ensemble.
Attributes:
| Name | Type | Description |
|---|---|---|
| query_id | str | caller-supplied identifier for the judged item. |
| verdicts | list[Verdict] | per-critic Verdict list (one per critic, in input order). |
| consensus | str | the rolled-up consensus label per the ensemble's voting strategy. |
| avg_confidence | float | mean confidence across all non-errored verdicts. |
| disagreement | bool | True if not all critics produced the same label. |
| agreement_pct | float | fraction of critics that agreed with the consensus label. |
| krippendorff_alpha | float \| None | Krippendorff's α inter-rater reliability coefficient (computed treating critics as raters and labels as nominal data). |
| voting | str | name of the voting strategy used. |
| meta | dict[str, Any] | caller-supplied metadata pass-through. |
VENDOR_DEFAULTS module-attribute ¶
VENDOR_DEFAULTS: dict[str, tuple[str, str]] = {'deepseek': ('https://api.deepseek.com/v1', 'DEEPSEEK_API_KEY'), 'openai': ('https://api.openai.com/v1', 'OPENAI_API_KEY'), 'openrouter': ('https://openrouter.ai/api/v1', 'OPENROUTER_API_KEY')}
Voting strategies¶
majority_vote ¶
majority_vote(verdicts: list[Verdict], *, priority: list[str] | None = None, fallback: str = 'UNCLEAR') -> tuple[str, bool]
Majority vote: most common label wins.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| verdicts | list[Verdict] | per-critic verdicts. | required |
| priority | list[str] \| None | tiebreaker order: labels earlier in the list win ties. | None |
| fallback | str | returned if no valid verdicts. | 'UNCLEAR' |
Returns:
| Type | Description |
|---|---|
| tuple[str, bool] | (consensus_label, disagreement_bool) |
Source code in packages/cross-judge/src/cross_judge/voting.py
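A majority vote with a priority tie-break can be sketched as below. This is an illustration under stated assumptions, not the code in voting.py: ERROR and PARSE_FAIL are treated as invalid votes, and ties with no priority list fall back to first-seen order (the behaviour the legacy `majority` strategy documents). `majority_sketch` is a hypothetical name.

```python
from collections import Counter
from types import SimpleNamespace

INVALID = {"ERROR", "PARSE_FAIL"}  # assumption: these labels do not count as votes


def majority_sketch(verdicts, *, priority=None, fallback="UNCLEAR"):
    labels = [v.kind for v in verdicts if v.kind not in INVALID]
    if not labels:
        return fallback, False
    counts = Counter(labels)  # keeps first-seen insertion order
    top = max(counts.values())
    tied = [label for label in counts if counts[label] == top]
    winner = tied[0]  # assumption: first-seen label wins when no priority applies
    for label in priority or []:
        if label in tied:
            winner = label
            break
    # Disagreement means more than one distinct valid label was cast.
    return winner, len(counts) > 1


tie = [SimpleNamespace(kind="KEEP"), SimpleNamespace(kind="REJECT")]
print(majority_sketch(tie, priority=["REJECT", "KEEP"]))  # ('REJECT', True)
print(majority_sketch(tie))                               # ('KEEP', True)
```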
unanimous ¶
Unanimous vote: return label only if all critics agree.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| verdicts | list[Verdict] | per-critic verdicts. | required |
| fallback | str | returned on any disagreement. | 'UNCLEAR' |
Returns:
| Type | Description |
|---|---|
| tuple[str, bool] | (consensus_label, disagreement_bool) |
Source code in packages/cross-judge/src/cross_judge/voting.py
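The unanimous rule reduces to checking that the set of cast labels has exactly one element. A minimal sketch, assuming errored verdicts are ignored rather than counted as dissent (the source does not say which); `unanimous_sketch` is a hypothetical name.

```python
from types import SimpleNamespace


def unanimous_sketch(verdicts, *, fallback="UNCLEAR"):
    # Assumption: errored verdicts are skipped, not treated as disagreement.
    labels = {v.kind for v in verdicts if v.kind != "ERROR"}
    if len(labels) == 1:
        return labels.pop(), False
    # Either nobody voted (no disagreement) or at least two labels differ.
    return fallback, bool(labels)


agree = [SimpleNamespace(kind="KEEP") for _ in range(3)]
split = [SimpleNamespace(kind="KEEP"), SimpleNamespace(kind="REJECT")]
print(unanimous_sketch(agree))  # ('KEEP', False)
print(unanimous_sketch(split))  # ('UNCLEAR', True)
```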
agreement_pct ¶
Fraction of valid critics whose kind == consensus.
Returns:
| Type | Description |
|---|---|
| float | Value in [0.0, 1.0]; 0.0 if there are no valid verdicts. |
Source code in packages/cross-judge/src/cross_judge/voting.py
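As a worked illustration of the definition above, the fraction can be computed in a few lines. The sketch assumes "valid" means kind != 'ERROR'; `agreement_pct_sketch` is a hypothetical name, not the function in voting.py.

```python
from types import SimpleNamespace


def agreement_pct_sketch(verdicts, consensus):
    # Assumption: a verdict is "valid" when its kind is not ERROR.
    valid = [v for v in verdicts if v.kind != "ERROR"]
    if not valid:
        return 0.0
    matching = sum(1 for v in valid if v.kind == consensus)
    return matching / len(valid)


panel = [SimpleNamespace(kind=k) for k in ("KEEP", "KEEP", "REJECT", "ERROR")]
# Two of the three valid critics chose KEEP; the errored one is excluded.
print(agreement_pct_sketch(panel, "KEEP"))
```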
krippendorff_alpha ¶
Krippendorff's α for nominal data.
Treats each critic as a rater and each label as nominal.
For the single-item, N-rater case:

$$\alpha = 1 - \frac{D_{\text{observed}}}{D_{\text{expected}}}$$

where

$$D_{\text{observed}} = \text{number of disagreeing pairs of critics}, \qquad D_{\text{expected}} = \frac{1}{N-1} \sum_{i \neq j} n_i \, n_j$$

Here n_i is the count of critics that voted cat_i, and N is the total number of critics.
The (N-1) divisor in D_expected is the standard small-sample correction (Krippendorff 2011 eq. 4).
Returns:
| Type | Description |
|---|---|
| float \| None | α in [-1.0, 1.0], or None if fewer than 2 valid verdicts. 1.0 means perfect agreement, 0.0 means agreement equal to chance, and values below 0.0 mean systematic disagreement. |
Source code in packages/cross-judge/src/cross_judge/voting.py
get_voting_strategy ¶
Resolve a voting strategy by name or pass through if already callable.
Source code in packages/cross-judge/src/cross_judge/voting.py
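The "resolve by name or pass through" pattern is commonly built on a name-to-callable registry. The sketch below is a generic illustration: `REGISTRY`, `resolve_strategy`, and the two toy strategies are hypothetical, and raising ValueError on an unknown name is an assumption (the source does not document the error type).

```python
from collections import Counter
from typing import Callable, Dict, List, Union

Strategy = Callable[[List[str]], str]


def most_common(labels: List[str]) -> str:
    return Counter(labels).most_common(1)[0][0]


def all_same(labels: List[str]) -> str:
    return labels[0] if len(set(labels)) == 1 else "UNCLEAR"


# Hypothetical registry, analogous in spirit to a name -> strategy mapping.
REGISTRY: Dict[str, Strategy] = {"majority": most_common, "unanimous": all_same}


def resolve_strategy(strategy: Union[str, Strategy]) -> Strategy:
    # Callables pass through untouched; strings are looked up by name.
    if callable(strategy):
        return strategy
    try:
        return REGISTRY[strategy]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        # Assumption: unknown names raise ValueError.
        raise ValueError(f"unknown strategy {strategy!r}; expected one of: {known}") from None


print(resolve_strategy("majority")(["KEEP", "KEEP", "REJECT"]))  # KEEP
print(resolve_strategy(all_same)(["KEEP", "REJECT"]))            # UNCLEAR
```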
VOTING_STRATEGIES module-attribute ¶
Vendor configuration¶
VENDORS module-attribute ¶
VENDORS: dict[str, VendorConfig] = {'deepseek': VendorConfig(name='deepseek', base_url='https://api.deepseek.com/v1', api_key_env='DEEPSEEK_API_KEY'), 'openai': VendorConfig(name='openai', base_url='https://api.openai.com/v1', api_key_env='OPENAI_API_KEY'), 'openrouter': VendorConfig(name='openrouter', base_url='https://openrouter.ai/api/v1', api_key_env='OPENROUTER_API_KEY')}
VendorConfig dataclass ¶
Vendor-specific connection settings for OpenAI-compatible APIs.
get_vendor ¶
Look up a vendor by name. Raises KeyError on unknown vendor.
Source code in packages/cross-judge/src/cross_judge/vendors.py
make_client ¶
make_client(vendor: str = 'deepseek', api_key: str | None = None, base_url: str | None = None) -> Any
Build an OpenAI-compatible client for a vendor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vendor | str | One of 'deepseek', 'openai', 'openrouter' (default: deepseek). | 'deepseek' |
| api_key | str \| None | Explicit API key. If None, read from the vendor's env var. | None |
| base_url | str \| None | Override base URL (useful for custom endpoints / mock servers). | None |
Returns:
| Type | Description |
|---|---|
| Any | An OpenAI-compatible client for the chosen vendor. |
Raises:
| Type | Description |
|---|---|
| RuntimeError | if the API key is missing. |
| ImportError | if the client library cannot be imported. |
Source code in packages/cross-judge/src/cross_judge/vendors.py
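The key and base-URL resolution described above can be sketched without constructing a real client. The mapping is copied from VENDOR_DEFAULTS as documented on this page; `resolve_connection` and its `env` parameter (injected so the example does not depend on the real environment) are hypothetical, and the key value is a dummy.

```python
import os

# Copied from the VENDOR_DEFAULTS table documented above.
VENDOR_DEFAULTS = {
    "deepseek": ("https://api.deepseek.com/v1", "DEEPSEEK_API_KEY"),
    "openai": ("https://api.openai.com/v1", "OPENAI_API_KEY"),
    "openrouter": ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY"),
}


def resolve_connection(vendor="deepseek", api_key=None, base_url=None, env=None):
    """Hypothetical helper: return (base_url, api_key) for a vendor."""
    env = os.environ if env is None else env
    default_url, key_var = VENDOR_DEFAULTS[vendor]  # KeyError on unknown vendor
    # An explicit key wins; otherwise fall back to the vendor's env var.
    key = api_key or env.get(key_var)
    if not key:
        raise RuntimeError(f"missing API key: set {key_var} or pass api_key")
    return (base_url or default_url), key


fake_env = {"OPENROUTER_API_KEY": "sk-test"}  # dummy value for illustration
print(resolve_connection("openrouter", env=fake_env))
# ('https://openrouter.ai/api/v1', 'sk-test')
```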
Legacy API (v4 backwards compat)¶
Reviewer dataclass ¶
Reviewer(reviewer_id: str, model: str, vendor: str = 'deepseek', system_prompt: str = 'You are a careful judge. Output strict JSON only.', temperature: float = 0.0, max_tokens: int = 2000, weight: float = 1.0, client: Any = None, api_key: str | None = None, base_url: str | None = None)
One LLM reviewer configuration.
ask ¶
Call the LLM once and return a parsed Verdict.
Network / parse errors are caught and surfaced as Verdict(error=..., verdict='ERROR'). Callers can decide whether to retry or skip.
Source code in packages/cross-judge/src/cross_judge/reviewer.py
JudgePanel dataclass ¶
JudgePanel(reviewers: list[Reviewer], strategy: str | AggregationStrategy = 'majority', strategy_kwargs: dict[str, Any] = dict())
A panel of reviewers + an aggregation strategy.
ask ¶
Ask every reviewer to judge the item, then aggregate.
Source code in packages/cross-judge/src/cross_judge/panel.py
aggregate_verdicts ¶
aggregate_verdicts(item_id: str, verdicts: list[Verdict], *, meta: dict[str, Any] | None = None) -> EnsembleResult
Aggregate a precomputed list of verdicts (useful when calls were driven externally, e.g. via async / parallel orchestration).
Source code in packages/cross-judge/src/cross_judge/panel.py
EnsembleResult ¶
Bases: BaseModel
Aggregate result for one item across all reviewers in a panel.
LegacyVerdict ¶
Bases: BaseModel
A single reviewer's verdict for one item.
The verdict label vocabulary is caller-defined (e.g. KEEP/REJECT/UNCLEAR for taxonomy review, or PASS/FAIL/UNSURE for code review). The aggregation layer treats labels as opaque strings.
AggregationStrategy module-attribute ¶
majority ¶
majority(verdicts: list[Verdict], *, priority: list[str] | None = None, fallback: str = 'UNCLEAR') -> tuple[str, bool]
Most common label wins. Ties broken by priority order (if given), else by first-seen order.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| verdicts | list[Verdict] | per-reviewer verdicts. | required |
| priority | list[str] \| None | tiebreaker order (labels earlier in the list win ties). | None |
| fallback | str | label to return if no valid verdicts exist. | 'UNCLEAR' |
Source code in packages/cross-judge/src/cross_judge/aggregation.py
weighted ¶
weighted(verdicts: list[Verdict], *, weights: dict[str, float] | None = None, use_confidence: bool = True, fallback: str = 'UNCLEAR') -> tuple[str, bool]
Weighted vote: each verdict contributes weight = (reviewer_weight × confidence).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| verdicts | list[Verdict] | per-reviewer verdicts. | required |
| weights | dict[str, float] \| None | optional per-reviewer weight overrides keyed by reviewer_id. Reviewers not in the dict default to weight=1.0. | None |
| use_confidence | bool | if True, multiply weight by verdict.confidence. | True |
| fallback | str | returned if no valid verdicts. | 'UNCLEAR' |
Source code in packages/cross-judge/src/cross_judge/aggregation.py
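The weighting rule (reviewer weight × confidence, summed per label) can be sketched as below. This is not the code in aggregation.py: `weighted_sketch` is a hypothetical name, the SimpleNamespace objects stand in for legacy verdicts, and the field names `verdict`, `reviewer_id`, and `confidence` are assumed from the legacy descriptions on this page.

```python
from collections import defaultdict
from types import SimpleNamespace


def weighted_sketch(verdicts, *, weights=None, use_confidence=True, fallback="UNCLEAR"):
    weights = weights or {}
    totals = defaultdict(float)
    for v in verdicts:
        if v.verdict == "ERROR":  # assumption: errored verdicts carry no weight
            continue
        w = weights.get(v.reviewer_id, 1.0)  # reviewers not listed default to 1.0
        if use_confidence:
            w *= v.confidence
        totals[v.verdict] += w
    if not totals:
        return fallback, False
    winner = max(totals, key=totals.get)
    return winner, len(totals) > 1


panel = [
    SimpleNamespace(reviewer_id="a", verdict="KEEP", confidence=0.9),
    SimpleNamespace(reviewer_id="b", verdict="REJECT", confidence=0.6),
    SimpleNamespace(reviewer_id="c", verdict="REJECT", confidence=0.2),
]
print(weighted_sketch(panel))                        # ('KEEP', True)
print(weighted_sketch(panel, use_confidence=False))  # ('REJECT', True)
```

The first call shows one confident reviewer (0.9) outweighing two hesitant ones (0.6 + 0.2); with use_confidence=False the vote reduces to a head count and REJECT wins 2 to 1.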
first_disagreement ¶
first_disagreement(verdicts: list[Verdict], *, disagree_label: str = 'DISAGREE', fallback: str = 'UNCLEAR') -> tuple[str, bool]
Returns disagree_label if any pair of reviewers differ; else the agreed label.
Source code in packages/cross-judge/src/cross_judge/aggregation.py
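A minimal sketch of this rule, assuming errored verdicts are ignored and that the legacy label field is named `verdict`; `first_disagreement_sketch` is a hypothetical name.

```python
from types import SimpleNamespace


def first_disagreement_sketch(verdicts, *, disagree_label="DISAGREE", fallback="UNCLEAR"):
    # Assumption: errored verdicts are skipped before comparing labels.
    labels = [v.verdict for v in verdicts if v.verdict != "ERROR"]
    if not labels:
        return fallback, False
    if len(set(labels)) > 1:
        # Any differing pair implies more than one distinct label.
        return disagree_label, True
    return labels[0], False


same = [SimpleNamespace(verdict="PASS"), SimpleNamespace(verdict="PASS")]
mixed = [SimpleNamespace(verdict="PASS"), SimpleNamespace(verdict="FAIL")]
print(first_disagreement_sketch(same))   # ('PASS', False)
print(first_disagreement_sketch(mixed))  # ('DISAGREE', True)
```

A strategy like this suits pipelines where any split panel should be escalated to a human rather than resolved automatically.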
get_strategy ¶
Resolve a strategy by name (string) or pass through if already a callable.
Source code in packages/cross-judge/src/cross_judge/aggregation.py
avg_confidence ¶
Average confidence across valid (non-errored) verdicts.
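As a sketch of that computation: the example assumes a verdict counts as errored when its `error` field is set, and that an empty panel yields 0.0 (the source does not state the empty-case value). `avg_confidence_sketch` is a hypothetical name.

```python
from types import SimpleNamespace


def avg_confidence_sketch(verdicts):
    # Keep only verdicts whose call succeeded (no error string recorded).
    valid = [v.confidence for v in verdicts if v.error is None]
    # Assumption: return 0.0 when nothing is valid, to avoid dividing by zero.
    return sum(valid) / len(valid) if valid else 0.0


panel = [
    SimpleNamespace(confidence=0.5, error=None),
    SimpleNamespace(confidence=1.0, error=None),
    SimpleNamespace(confidence=0.0, error="timeout"),  # excluded from the mean
]
print(avg_confidence_sketch(panel))  # 0.75
```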