Add tournament-style pairwise ranking aggregation by bledden · Pull Request #128 · karpathy/llm-council

bledden · 2026-01-06T02:08:13Z

Summary

Adds calculate_tournament_rankings() as an alternative ranking method alongside the existing mean-based aggregation.

Motivation

The current calculate_aggregate_rankings() averages position numbers, which has limitations:

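Vulnerable to outlier rankings (a single model ranking Response E first noticeably shifts E's mean position)
Every one-position step counts the same (moving from position 1 to 2 is treated like moving from 4 to 5)
Not grounded in majority preference (a response can get the best mean position even when most rankers prefer another response to it)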

Tournament-style pairwise comparison is more robust:

Converts rankings to head-to-head matchups
Majority vote determines each matchup winner
Final score = win percentage across all matchups
Based on Condorcet voting theory (scoring by pairwise wins is similar to Copeland's method)

Algorithm

For rankings like:

Ranker 1: A > B > C
Ranker 2: A > C > B
Ranker 3: B > A > C

Extract pairwise preferences from each ranking
For each pair, count votes and let the majority decide (A vs B: A gets 2 votes, B gets 1 → A wins the matchup)
Calculate each response's win percentage over its two matchups: A: 2 wins (100%), B: 1 win (50%), C: 0 wins (0%)
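
The steps above can be sketched in Python as follows. This is a minimal illustration, not the code in backend/council.py: the function name tournament_rankings_sketch, its input shape (a list of orderings, best first), and its return shape are assumptions made for the example.

```python
from itertools import combinations


def tournament_rankings_sketch(rankings):
    """Score each response by the share of head-to-head matchups it wins.

    `rankings` is a list of orderings, best first, e.g. [["A", "B", "C"], ...].
    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    candidates = sorted({c for ranking in rankings for c in ranking})
    points = {c: 0.0 for c in candidates}
    matchups = {c: 0 for c in candidates}

    for x, y in combinations(candidates, 2):
        x_votes = y_votes = 0
        for ranking in rankings:
            # Only rankers who placed both responses vote on this pair.
            if x in ranking and y in ranking:
                if ranking.index(x) < ranking.index(y):
                    x_votes += 1
                else:
                    y_votes += 1
        if x_votes == 0 and y_votes == 0:
            continue  # no ranker compared this pair
        matchups[x] += 1
        matchups[y] += 1
        if x_votes > y_votes:
            points[x] += 1.0
        elif y_votes > x_votes:
            points[y] += 1.0
        else:
            # Tied matchup: half a point each, as the PR describes.
            points[x] += 0.5
            points[y] += 0.5

    scores = {
        c: (points[c] / matchups[c] if matchups[c] else 0.0) for c in candidates
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


example = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
print(tournament_rankings_sketch(example))
# [('A', 1.0), ('B', 0.5), ('C', 0.0)]
```

The sketch divides by the number of matchups each response actually took part in, so a response missing from some rankings is not penalized for pairs nobody compared. Whether the PR handles partial rankings this way is not stated in the description.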

Changes

Add calculate_tournament_rankings() function in backend/council.py
Update run_full_council() to include tournament_rankings in metadata
Both methods now available: aggregate_rankings (mean) and tournament_rankings (pairwise)

Validation

Tested with 7 unit test scenarios:

✅ Unanimous rankings
✅ Split decisions (2:1 votes)
✅ Tie handling (0.5 points each)
✅ Single ranker edge case
✅ Empty rankings edge case
✅ Cyclic preferences (A>B, B>C, C>A)
✅ Outlier robustness comparison
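
Two of these scenarios, cyclic preferences and ties, are easy to reproduce by hand. The sketch below uses a hypothetical helper, pairwise_win_share, and assumes every ranker ranks every response; it is not the PR's test code.

```python
from itertools import combinations


def pairwise_win_share(rankings):
    """Compact, hypothetical pairwise scorer: win = 1 point, tie = 0.5 each.

    Assumes every ranking contains every response.
    """
    candidates = sorted({c for r in rankings for c in r})
    points = dict.fromkeys(candidates, 0.0)
    for x, y in combinations(candidates, 2):
        # Positive margin means more rankers placed x above y.
        margin = sum(1 if r.index(x) < r.index(y) else -1 for r in rankings)
        if margin > 0:
            points[x] += 1.0
        elif margin < 0:
            points[y] += 1.0
        else:
            points[x] += 0.5
            points[y] += 0.5
    opponents = max(len(candidates) - 1, 1)
    return {c: points[c] / opponents for c in candidates}


# Cyclic preferences: A beats B, B beats C, C beats A (each by 2 votes to 1).
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
print(pairwise_win_share(cyclic))  # {'A': 0.5, 'B': 0.5, 'C': 0.5}

# Tie: two rankers disagree, so the single matchup is split 0.5 / 0.5.
tied = [["A", "B"], ["B", "A"]]
print(pairwise_win_share(tied))  # {'A': 0.5, 'B': 0.5}
```

In the cyclic case no response beats all others, so each ends with one win out of two matchups and the scores tie rather than producing an arbitrary winner.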

End-to-end test with 5 models ranking 5 responses confirms tournament ranking is more robust to outliers.
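
To illustrate what outlier robustness means here, the sketch below compares mean position with pairwise win share on made-up data: four rankers agree on A > B > C > D > E and a fifth reverses the order. The rankings and the helper names mean_positions and win_share are assumptions for this example and do not reproduce the PR's end-to-end test.

```python
from itertools import combinations

# Five hypothetical rankers; the last one is an outlier that reverses the order.
rankings = [list("ABCDE")] * 4 + [list("EDCBA")]


def mean_positions(rankings):
    """Average 1-based position per response (lower is better)."""
    candidates = sorted(rankings[0])
    return {
        c: sum(r.index(c) + 1 for r in rankings) / len(rankings)
        for c in candidates
    }


def win_share(rankings):
    """Share of head-to-head matchups each response wins by majority vote."""
    candidates = sorted(rankings[0])
    wins = dict.fromkeys(candidates, 0.0)
    for x, y in combinations(candidates, 2):
        x_votes = sum(r.index(x) < r.index(y) for r in rankings)
        y_votes = len(rankings) - x_votes
        if x_votes > y_votes:
            wins[x] += 1.0
        elif y_votes > x_votes:
            wins[y] += 1.0
        else:
            wins[x] += 0.5
            wins[y] += 0.5
    return {c: wins[c] / (len(candidates) - 1) for c in candidates}


print(mean_positions(rankings[:4])["A"])  # 1.0 without the outlier
print(mean_positions(rankings)["A"])      # 1.8 with the outlier
print(win_share(rankings)["A"])           # 1.0 either way
```

The outlier moves A's mean position from 1.0 to 1.8, while A still wins every matchup 4 to 1, so its win share is unchanged.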

Test plan

Verify tournament_rankings appears in metadata
Verify ranking order matches expected pairwise winners
Verify ties are handled correctly (0.5 points each)

🤖 Generated with Claude Code

Adds calculate_tournament_rankings() as an alternative to simple mean ranking.

Algorithm:
- Convert ordinal rankings to pairwise matchups
- For each pair of models, majority vote determines winner
- Ties awarded 0.5 points to each
- Final score = wins / total_matchups

Benefits over mean ranking:
- More robust to outlier rankings
- Theoretically principled (Condorcet-style)
- Handles cyclic preferences gracefully

Both ranking methods now included in metadata:
- aggregate_rankings: mean position (existing)
- tournament_rankings: pairwise win percentage (new)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Documents the tournament-style pairwise comparison algorithm with:
- Explanation of why it's more robust than mean averaging
- Concrete example showing self-promotion bias scenario
- Tables comparing mean vs tournament results
- Outlier robustness validation (mean degrades 1.0→1.5, tournament stays 100%)
- Summary of validation test coverage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Tier 1 (High Value, Low Risk):
- PR #72: Use CHAIRMAN_MODEL for title generation (configurable)
- PR #51: Validate OPENROUTER_API_KEY at startup (fail fast)
- PR #5: Fix text overflow on chat interface (CSS fixes)
- PR #69: Prevent conversation switching while streaming
- PR karpathy#110: Copy functionality (copy buttons for responses)

Tier 2 (Good Features, Moderate Complexity):
- PR karpathy#126: Fix model.split error when model is array (defensive)
- PR karpathy#127: Structured error propagation for API failures
- PR #67: Continuous conversation mode + prevent empty convos
- PR #90: Clear History button with confirmation
- PR karpathy#128: Tournament-style pairwise ranking (Condorcet voting)

Tier 3 (Nice-to-Have, More Complex):
- PR karpathy#109: Multi-message conversation support with context
- PR #24: Test suite infrastructure (pytest setup)

New files:
- backend/context.py: Smart conversation context management
- frontend/src/utils.js: getModelDisplayName helper
- frontend/src/components/CopyButton.jsx: Reusable copy button
- tests/: Unit test infrastructure
- pytest.ini, conftest.py: Test configuration

…opy, tests) (#1)

* Integrate valuable PRs from abandoned upstream

Tier 1 (High Value, Low Risk):
- PR #72: Use CHAIRMAN_MODEL for title generation (configurable)
- PR #51: Validate OPENROUTER_API_KEY at startup (fail fast)
- PR #5: Fix text overflow on chat interface (CSS fixes)
- PR #69: Prevent conversation switching while streaming
- PR karpathy#110: Copy functionality (copy buttons for responses)

Tier 2 (Good Features, Moderate Complexity):
- PR karpathy#126: Fix model.split error when model is array (defensive)
- PR karpathy#127: Structured error propagation for API failures
- PR #67: Continuous conversation mode + prevent empty convos
- PR #90: Clear History button with confirmation
- PR karpathy#128: Tournament-style pairwise ranking (Condorcet voting)

Tier 3 (Nice-to-Have, More Complex):
- PR karpathy#109: Multi-message conversation support with context
- PR #24: Test suite infrastructure (pytest setup)

New files:
- backend/context.py: Smart conversation context management
- frontend/src/utils.js: getModelDisplayName helper
- frontend/src/components/CopyButton.jsx: Reusable copy button
- tests/: Unit test infrastructure
- pytest.ini, conftest.py: Test configuration

* Enhance backend and frontend functionality
- Added defensive check in `run_full_council` to handle empty messages.
- Improved error handling in `send_message_stream` with logging and sanitized error messages.
- Updated `delete_all_conversations` to return a list of deletion results, including any failures.
- Modified API call in frontend to require confirmation for deleting conversations.
- Enhanced `getModelDisplayName` to handle multi-slash identifiers.
- Updated `CopyButton` component to clear timeout on unmount and improve success state handling.
- CSS adjustments for better styling and functionality across components.
- Added unit tests for conversation retrieval and management functions.

* Refactor tournament ranking calculation and enhance frontend message handling
- Updated `calculate_tournament_rankings` to use actual matchups for win percentage calculation.
- Integrated `calculate_tournament_rankings` into the message streaming process in `send_message_stream`.
- Improved loading state management for message updates in the frontend, ensuring immutability and clarity in state changes.

bledden force-pushed the feature-tournament-ranking branch from 3aaa3e8 to b1bbb9a on January 6, 2026 02:26

This was referenced Jan 6, 2026

feat: add minority opinion detection for ranking disagreements #129

Open

feat: add ranking conflict detection between models #130

Open

This was referenced Jan 30, 2026

feat: Integrate community improvements (multi-turn, error handling, copy, tests) #147

Closed

feat: Integrate community improvements (multi-turn, error handling, copy, tests) eddiefleurent/llm-council#1

Merged

bledden wants to merge 2 commits into karpathy:master from bledden:feature-tournament-ranking
