Add tournament-style pairwise ranking aggregation#128
Open
bledden wants to merge 2 commits into karpathy:master
Adds calculate_tournament_rankings() as an alternative to simple mean ranking.

Algorithm:
- Convert ordinal rankings to pairwise matchups
- For each pair of models, majority vote determines winner
- Ties awarded 0.5 points to each
- Final score = wins / total_matchups

Benefits over mean ranking:
- More robust to outlier rankings
- Theoretically principled (Condorcet-style)
- Handles cyclic preferences gracefully

Both ranking methods now included in metadata:
- aggregate_rankings: mean position (existing)
- tournament_rankings: pairwise win percentage (new)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
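The algorithm steps in the commit message above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name `tournament_rankings`, the input shape (a list of complete orderings, best first), the output fields, and the reading of `total_matchups` as "number of opponents" are all assumptions.

```python
from itertools import combinations


def tournament_rankings(rankings):
    """Score models by pairwise majority vote (hypothetical sketch).

    Assumes every ranking is a list of the same model names, best first.
    """
    models = sorted(rankings[0])
    points = {m: 0.0 for m in models}

    for a, b in combinations(models, 2):
        # Count how many rankers placed a ahead of b.
        a_votes = sum(r.index(a) < r.index(b) for r in rankings)
        b_votes = len(rankings) - a_votes
        if a_votes > b_votes:
            points[a] += 1.0
        elif b_votes > a_votes:
            points[b] += 1.0
        else:
            # A tied matchup awards 0.5 points to each side.
            points[a] += 0.5
            points[b] += 0.5

    # Assumption: total_matchups is the number of opponents each model faces.
    total_matchups = max(len(models) - 1, 1)
    results = [{"model": m, "score": points[m] / total_matchups} for m in models]
    return sorted(results, key=lambda x: (-x["score"], x["model"]))


# Clear winner: "a" beats both opponents by majority.
clear = tournament_rankings([["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]])

# Cyclic preferences (a > b, b > c, c > a): every model wins one matchup.
cycle = tournament_rankings([["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]])
```

In the cyclic case each model ends up with a score of 0.5, so the cycle resolves to a three-way tie instead of an arbitrary ordering.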
3aaa3e8 to b1bbb9a
Documents the tournament-style pairwise comparison algorithm with:
- Explanation of why it's more robust than mean averaging
- Concrete example showing self-promotion bias scenario
- Tables comparing mean vs tournament results
- Outlier robustness validation (mean degrades 1.0→1.5, tournament stays 100%)
- Summary of validation test coverage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
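The outlier effect mentioned in the commit message can be reproduced with a small sketch. The rankings below are invented for illustration and are not the PR's test data; the helper names `mean_position` and `pairwise_win_rate` are hypothetical. Four rankers agree that model "a" is best, then one of them is replaced by an outlier that ranks "a" last.

```python
def mean_position(rankings, model):
    """Average 1-based position of `model` across all rankings."""
    return sum(r.index(model) + 1 for r in rankings) / len(rankings)


def pairwise_win_rate(rankings, model):
    """Fraction of opponents that `model` beats by majority vote."""
    opponents = [m for m in rankings[0] if m != model]
    points = 0.0
    for opp in opponents:
        wins = sum(r.index(model) < r.index(opp) for r in rankings)
        losses = len(rankings) - wins
        if wins > losses:
            points += 1.0
        elif wins == losses:
            points += 0.5  # tied matchup
    return points / len(opponents)


# Hypothetical data: four rankers in agreement, then one outlier.
clean = [["a", "b", "c"]] * 4
with_outlier = [["a", "b", "c"]] * 3 + [["b", "c", "a"]]

mean_clean = mean_position(clean, "a")
mean_outlier = mean_position(with_outlier, "a")
rate_clean = pairwise_win_rate(clean, "a")
rate_outlier = pairwise_win_rate(with_outlier, "a")
```

With this data the mean position of "a" moves from 1.0 to 1.5, while its pairwise win rate stays at 1.0, because a single dissenting vote cannot overturn a 3-to-1 majority in any matchup.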
This was referenced Jan 6, 2026
eddiefleurent added a commit to eddiefleurent/llm-council that referenced this pull request Jan 30, 2026

Tier 1 (High Value, Low Risk):
- PR #72: Use CHAIRMAN_MODEL for title generation (configurable)
- PR #51: Validate OPENROUTER_API_KEY at startup (fail fast)
- PR #5: Fix text overflow on chat interface (CSS fixes)
- PR #69: Prevent conversation switching while streaming
- PR karpathy#110: Copy functionality (copy buttons for responses)

Tier 2 (Good Features, Moderate Complexity):
- PR karpathy#126: Fix model.split error when model is array (defensive)
- PR karpathy#127: Structured error propagation for API failures
- PR #67: Continuous conversation mode + prevent empty convos
- PR #90: Clear History button with confirmation
- PR karpathy#128: Tournament-style pairwise ranking (Condorcet voting)

Tier 3 (Nice-to-Have, More Complex):
- PR karpathy#109: Multi-message conversation support with context
- PR #24: Test suite infrastructure (pytest setup)

New files:
- backend/context.py: Smart conversation context management
- frontend/src/utils.js: getModelDisplayName helper
- frontend/src/components/CopyButton.jsx: Reusable copy button
- tests/: Unit test infrastructure
- pytest.ini, conftest.py: Test configuration
eddiefleurent added a commit to eddiefleurent/llm-council that referenced this pull request Jan 30, 2026

…opy, tests) (#1)

* Integrate valuable PRs from abandoned upstream

Tier 1 (High Value, Low Risk):
- PR #72: Use CHAIRMAN_MODEL for title generation (configurable)
- PR #51: Validate OPENROUTER_API_KEY at startup (fail fast)
- PR #5: Fix text overflow on chat interface (CSS fixes)
- PR #69: Prevent conversation switching while streaming
- PR karpathy#110: Copy functionality (copy buttons for responses)

Tier 2 (Good Features, Moderate Complexity):
- PR karpathy#126: Fix model.split error when model is array (defensive)
- PR karpathy#127: Structured error propagation for API failures
- PR #67: Continuous conversation mode + prevent empty convos
- PR #90: Clear History button with confirmation
- PR karpathy#128: Tournament-style pairwise ranking (Condorcet voting)

Tier 3 (Nice-to-Have, More Complex):
- PR karpathy#109: Multi-message conversation support with context
- PR #24: Test suite infrastructure (pytest setup)

New files:
- backend/context.py: Smart conversation context management
- frontend/src/utils.js: getModelDisplayName helper
- frontend/src/components/CopyButton.jsx: Reusable copy button
- tests/: Unit test infrastructure
- pytest.ini, conftest.py: Test configuration

* Enhance backend and frontend functionality

- Added defensive check in `run_full_council` to handle empty messages.
- Improved error handling in `send_message_stream` with logging and sanitized error messages.
- Updated `delete_all_conversations` to return a list of deletion results, including any failures.
- Modified API call in frontend to require confirmation for deleting conversations.
- Enhanced `getModelDisplayName` to handle multi-slash identifiers.
- Updated `CopyButton` component to clear timeout on unmount and improve success state handling.
- CSS adjustments for better styling and functionality across components.
- Added unit tests for conversation retrieval and management functions.

* Refactor tournament ranking calculation and enhance frontend message handling

- Updated `calculate_tournament_rankings` to use actual matchups for win percentage calculation.
- Integrated `calculate_tournament_rankings` into the message streaming process in `send_message_stream`.
- Improved loading state management for message updates in the frontend, ensuring immutability and clarity in state changes.
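One plausible reading of "use actual matchups for win percentage calculation" is that each model's points are divided by the number of matchups it actually took part in, not by a fixed number of opponents. The sketch below illustrates that reading only; the function name `win_percentages` and the idea that rankings may be partial (some rankers omit some models) are assumptions, not details taken from the fork's code.

```python
from itertools import combinations


def win_percentages(rankings):
    """Pairwise win percentage using only matchups that were actually decided.

    Assumes rankings are lists of model names, best first, and that a
    ranking may omit some models (hypothetical partial-ranking scenario).
    """
    models = sorted({m for r in rankings for m in r})
    points = {m: 0.0 for m in models}
    played = {m: 0 for m in models}

    for a, b in combinations(models, 2):
        # Only rankers that placed both models can vote on this pair.
        shared = [r for r in rankings if a in r and b in r]
        if not shared:
            continue  # nobody compared a and b, so no matchup took place
        a_votes = sum(r.index(a) < r.index(b) for r in shared)
        b_votes = len(shared) - a_votes
        played[a] += 1
        played[b] += 1
        if a_votes > b_votes:
            points[a] += 1.0
        elif b_votes > a_votes:
            points[b] += 1.0
        else:
            points[a] += 0.5
            points[b] += 0.5

    return {m: (points[m] / played[m] if played[m] else 0.0) for m in models}


# "b" and "c" are never ranked together, so each plays only one matchup.
scores = win_percentages([["a", "b"], ["a", "b"], ["c", "a"]])
```

Here "c" wins its only matchup and scores 1.0; with a fixed denominator of two opponents it would score 0.5 despite never losing.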
Summary
Adds `calculate_tournament_rankings()` as an alternative ranking method alongside the existing mean-based aggregation.
Motivation
The current `calculate_aggregate_rankings()` averages position numbers, which has limitations. Tournament-style pairwise comparison is more robust.
Algorithm
For rankings like:
Changes
- `calculate_tournament_rankings()` function in `backend/council.py`
- `run_full_council()` to include `tournament_rankings` in metadata
- `aggregate_rankings` (mean) and `tournament_rankings` (pairwise)
Validation
Tested with 7 unit test scenarios:
End-to-end test with 5 models ranking 5 responses confirms tournament ranking is more robust to outliers.
Test plan
- `tournament_rankings` appears in metadata

🤖 Generated with Claude Code
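The metadata check in the test plan could be sketched along these lines. Only the two key names `aggregate_rankings` and `tournament_rankings` come from the PR; the helper names, the `build_metadata` function, and the shape of the values (plain dicts keyed by model name) are hypothetical simplifications and do not describe what `run_full_council()` really returns.

```python
from itertools import combinations


def mean_positions(rankings):
    """Mean 1-based position per model (stand-in for the existing method)."""
    models = sorted(rankings[0])
    return {m: sum(r.index(m) + 1 for r in rankings) / len(rankings) for m in models}


def pairwise_win_rates(rankings):
    """Pairwise majority-vote win rate per model (stand-in for the new method)."""
    models = sorted(rankings[0])
    points = {m: 0.0 for m in models}
    for a, b in combinations(models, 2):
        a_votes = sum(r.index(a) < r.index(b) for r in rankings)
        b_votes = len(rankings) - a_votes
        if a_votes > b_votes:
            points[a] += 1.0
        elif b_votes > a_votes:
            points[b] += 1.0
        else:
            points[a] += 0.5
            points[b] += 0.5
    opponents = max(len(models) - 1, 1)
    return {m: points[m] / opponents for m in models}


def build_metadata(rankings):
    """Hypothetical metadata payload carrying both ranking methods."""
    return {
        "aggregate_rankings": mean_positions(rankings),
        "tournament_rankings": pairwise_win_rates(rankings),
    }


metadata = build_metadata([["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]])
assert "aggregate_rankings" in metadata
assert "tournament_rankings" in metadata
```

Keeping both keys side by side lets a consumer compare the two methods on the same set of rankings without recomputing either one.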