Add tournament-style pairwise ranking aggregation #128

Open
bledden wants to merge 2 commits into karpathy:master from bledden:feature-tournament-ranking

bledden commented Jan 6, 2026

Summary

Adds calculate_tournament_rankings() as an alternative ranking method alongside the existing mean-based aggregation.

Motivation

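The current calculate_aggregate_rankings() averages each response's position numbers, which has limitations:

  • Vulnerable to outlier rankings (a single model ranking Response E first shifts E's mean position significantly)
  • Treats every position gap as equal: moving from 1st to 2nd costs the same as moving from 4th to 5th
  • Does not guarantee that a response which wins every head-to-head comparison is ranked first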

Tournament-style pairwise comparison is more robust:

  • Converts rankings to head-to-head matchups
  • Majority vote determines each matchup winner
  • Final score = win percentage across all matchups
  • Based on Condorcet voting theory

Algorithm

For rankings like:

Ranker 1: A > B > C
Ranker 2: A > C > B
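Ranker 3: B > A > C

  1. Extract the pairwise preferences implied by each ranking (A > B > C yields A > B, A > C, and B > C)
  2. For each pair, count votes and take the majority: A vs B is 2 to 1 for A, A vs C is 3 to 0 for A, B vs C is 2 to 1 for B
  3. Score each response by matchups won out of matchups played: A: 2 of 2 (100%), B: 1 of 2 (50%), C: 0 of 2 (0%)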
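
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code added to backend/council.py: the function name `tournament_scores` and the plain list-of-lists input format are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def tournament_scores(rankings):
    """Hypothetical helper: map each label to its pairwise win percentage.

    `rankings` is a list of orderings, best first, e.g. [["A", "B", "C"], ...].
    """
    # Step 1: extract pairwise preferences from every ranking.
    votes = Counter()
    for ranking in rankings:
        for preferred, other in combinations(ranking, 2):
            votes[(preferred, other)] += 1

    labels = sorted({label for ranking in rankings for label in ranking})
    points = {label: 0.0 for label in labels}
    played = {label: 0 for label in labels}

    # Step 2: majority vote decides each matchup; a tie gives 0.5 to each side.
    for a, b in combinations(labels, 2):
        a_votes, b_votes = votes[(a, b)], votes[(b, a)]
        if a_votes + b_votes == 0:
            continue  # no ranker compared this pair
        played[a] += 1
        played[b] += 1
        if a_votes > b_votes:
            points[a] += 1.0
        elif b_votes > a_votes:
            points[b] += 1.0
        else:
            points[a] += 0.5
            points[b] += 0.5

    # Step 3: score = matchup points / matchups played.
    return {
        label: (points[label] / played[label] if played[label] else 0.0)
        for label in labels
    }


scores = tournament_scores([
    ["A", "B", "C"],  # Ranker 1
    ["A", "C", "B"],  # Ranker 2
    ["B", "A", "C"],  # Ranker 3
])
print(scores)  # {'A': 1.0, 'B': 0.5, 'C': 0.0}
```

Dividing by matchups actually played, rather than by a fixed count, is a design choice of this sketch; it keeps the score defined when some ranker omits a response.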

Changes

  • Add calculate_tournament_rankings() function in backend/council.py
  • Update run_full_council() to include tournament_rankings in metadata
  • Both methods are now available in metadata: aggregate_rankings (mean position) and tournament_rankings (pairwise win percentage)

Validation

Tested with 7 unit test scenarios:

  • ✅ Unanimous rankings
  • ✅ Split decisions (2:1 votes)
  • ✅ Tie handling (0.5 points each)
  • ✅ Single ranker edge case
  • ✅ Empty rankings edge case
  • ✅ Cyclic preferences (A>B, B>C, C>A)
  • ✅ Outlier robustness comparison
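
Several of these scenarios can be expressed as short assertions. The sketch below is not the PR's test suite: `win_percentages` is a hypothetical stand-in for the pairwise scoring described in this PR, and the inputs are made up for illustration.

```python
from itertools import combinations


def win_percentages(rankings):
    """Hypothetical stand-in: pairwise win percentage per label."""
    labels = sorted({label for r in rankings for label in r})
    points = dict.fromkeys(labels, 0.0)
    played = dict.fromkeys(labels, 0)
    for a, b in combinations(labels, 2):
        # Only rankers that placed both labels get a vote on this pair.
        both = [r for r in rankings if a in r and b in r]
        if not both:
            continue
        a_votes = sum(r.index(a) < r.index(b) for r in both)
        b_votes = len(both) - a_votes
        played[a] += 1
        played[b] += 1
        if a_votes == b_votes:
            points[a] += 0.5
            points[b] += 0.5
        elif a_votes > b_votes:
            points[a] += 1.0
        else:
            points[b] += 1.0
    return {l: (points[l] / played[l] if played[l] else 0.0) for l in labels}


# Cyclic preferences: A beats B, B beats C, C beats A, so everyone wins 1 of 2.
cycle = win_percentages([["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]])
assert cycle == {"A": 0.5, "B": 0.5, "C": 0.5}

# Tie: one vote each way gives 0.5 points to each side.
tie = win_percentages([["A", "B"], ["B", "A"]])
assert tie == {"A": 0.5, "B": 0.5}

# Single ranker: the scores simply follow that ranker's order.
single = win_percentages([["A", "B", "C"]])
assert single == {"A": 1.0, "B": 0.5, "C": 0.0}

# Empty input: nothing to score.
empty = win_percentages([])
assert empty == {}
```

A cycle has no Condorcet winner, so equal scores for all three responses are the expected outcome here rather than an error.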

An end-to-end test with 5 models ranking 5 responses confirmed that tournament ranking is more robust to outliers than mean-position ranking.
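
To make the outlier effect concrete, here is a small comparison of the two approaches. The data is invented for illustration and is not the PR's end-to-end test; `mean_positions` and `pairwise_win_rate` are hypothetical helpers that assume every ranker ranks every response.

```python
from itertools import combinations


def mean_positions(rankings):
    """Average 1-based position per label (lower is better)."""
    positions = {}
    for ranking in rankings:
        for pos, label in enumerate(ranking, start=1):
            positions.setdefault(label, []).append(pos)
    return {label: sum(p) / len(p) for label, p in positions.items()}


def pairwise_win_rate(rankings):
    """Share of head-to-head matchups won; assumes complete rankings."""
    labels = sorted({label for r in rankings for label in r})
    if len(labels) < 2:
        return {label: 0.0 for label in labels}
    points = dict.fromkeys(labels, 0.0)
    for a, b in combinations(labels, 2):
        a_votes = sum(r.index(a) < r.index(b) for r in rankings)
        b_votes = len(rankings) - a_votes
        if a_votes == b_votes:
            points[a] += 0.5
            points[b] += 0.5
        elif a_votes > b_votes:
            points[a] += 1.0
        else:
            points[b] += 1.0
    return {label: points[label] / (len(labels) - 1) for label in labels}


consensus = ["A", "B", "C", "D", "E"]
outlier = ["B", "C", "D", "E", "A"]  # one ranker pushes A to last place
rankings = [consensus] * 4 + [outlier]

means = mean_positions(rankings)
rates = pairwise_win_rate(rankings)

# Mean position: the single outlier drags A into a tie with B.
print(means["A"], means["B"])  # 1.8 1.8
# Pairwise: A still wins every matchup 4 to 1, so it stays clearly first.
print(rates["A"], rates["B"])  # 1.0 0.75
```

In this scenario one dissenting ranker is enough to erase A's lead under mean position, while the pairwise result is unchanged because a 4 to 1 majority still decides every matchup involving A.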

Test plan

  • Verify tournament_rankings appears in metadata
  • Verify ranking order matches expected pairwise winners
  • Verify ties are handled correctly (0.5 points each)

🤖 Generated with Claude Code

Adds calculate_tournament_rankings() as an alternative to simple mean ranking.

Algorithm:
- Convert ordinal rankings to pairwise matchups
- For each pair of models, majority vote determines winner
- Ties awarded 0.5 points to each
- Final score = wins / total_matchups

Benefits over mean ranking:
- More robust to outlier rankings
- Theoretically principled (Condorcet-style)
- Handles cyclic preferences gracefully

Both ranking methods now included in metadata:
- aggregate_rankings: mean position (existing)
- tournament_rankings: pairwise win percentage (new)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

bledden force-pushed the feature-tournament-ranking branch from 3aaa3e8 to b1bbb9a on January 6, 2026 02:26

Documents the tournament-style pairwise comparison algorithm with:
- Explanation of why it's more robust than mean averaging
- Concrete example showing self-promotion bias scenario
- Tables comparing mean vs tournament results
- Outlier robustness validation (with an outlier, the top response's mean position degrades from 1.0 to 1.5, while its tournament win percentage stays at 100%)
- Summary of validation test coverage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
eddiefleurent added a commit to eddiefleurent/llm-council that referenced this pull request Jan 30, 2026
Tier 1 (High Value, Low Risk):
- PR #72: Use CHAIRMAN_MODEL for title generation (configurable)
- PR #51: Validate OPENROUTER_API_KEY at startup (fail fast)
- PR #5: Fix text overflow on chat interface (CSS fixes)
- PR #69: Prevent conversation switching while streaming
- PR karpathy#110: Copy functionality (copy buttons for responses)

Tier 2 (Good Features, Moderate Complexity):
- PR karpathy#126: Fix model.split error when model is array (defensive)
- PR karpathy#127: Structured error propagation for API failures
- PR #67: Continuous conversation mode + prevent empty convos
- PR #90: Clear History button with confirmation
- PR karpathy#128: Tournament-style pairwise ranking (Condorcet voting)

Tier 3 (Nice-to-Have, More Complex):
- PR karpathy#109: Multi-message conversation support with context
- PR #24: Test suite infrastructure (pytest setup)

New files:
- backend/context.py: Smart conversation context management
- frontend/src/utils.js: getModelDisplayName helper
- frontend/src/components/CopyButton.jsx: Reusable copy button
- tests/: Unit test infrastructure
- pytest.ini, conftest.py: Test configuration
eddiefleurent added a commit to eddiefleurent/llm-council that referenced this pull request Jan 30, 2026
…opy, tests) (#1)

* Integrate valuable PRs from abandoned upstream

* Enhance backend and frontend functionality

- Added defensive check in `run_full_council` to handle empty messages.
- Improved error handling in `send_message_stream` with logging and sanitized error messages.
- Updated `delete_all_conversations` to return a list of deletion results, including any failures.
- Modified API call in frontend to require confirmation for deleting conversations.
- Enhanced `getModelDisplayName` to handle multi-slash identifiers.
- Updated `CopyButton` component to clear timeout on unmount and improve success state handling.
- CSS adjustments for better styling and functionality across components.
- Added unit tests for conversation retrieval and management functions.

* Refactor tournament ranking calculation and enhance frontend message handling

- Updated `calculate_tournament_rankings` to use actual matchups for win percentage calculation.
- Integrated `calculate_tournament_rankings` into the message streaming process in `send_message_stream`.
- Improved loading state management for message updates in the frontend, ensuring immutability and clarity in state changes.
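
One plausible reading of "use actual matchups for win percentage calculation" is that each response's score is divided by the number of matchups it actually took part in, rather than by a fixed count. The sketch below contrasts the two denominators; it is an interpretation, not the fork's code, and `win_rates` plus the outcome tuples are hypothetical.

```python
def win_rates(outcomes, labels):
    """Return (fixed, actual) win rates.

    `outcomes` is a list of (a, b, winner) tuples; winner is None for a tie.
    `fixed` divides by len(labels) - 1; `actual` divides by matchups played.
    """
    points = {label: 0.0 for label in labels}
    played = {label: 0 for label in labels}
    for a, b, winner in outcomes:
        played[a] += 1
        played[b] += 1
        if winner is None:
            points[a] += 0.5
            points[b] += 0.5
        else:
            points[winner] += 1.0
    fixed = {l: points[l] / (len(labels) - 1) for l in labels}
    actual = {l: (points[l] / played[l] if played[l] else 0.0) for l in labels}
    return fixed, actual


# Assumed scenario: response D was never placed by any ranker (for example,
# because its label could not be parsed), so it has no matchups at all.
outcomes = [("A", "B", "A"), ("A", "C", "A"), ("B", "C", "B")]
fixed, actual = win_rates(outcomes, ["A", "B", "C", "D"])

print(round(fixed["A"], 3), actual["A"])  # 0.667 1.0
```

With a fixed denominator, A is penalized for a matchup against D that never happened; counting only played matchups gives A the 100% it earned.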