diff --git a/improvement_plan.md b/improvement_plan.md
deleted file mode 100644
index 0912c37..0000000
--- a/improvement_plan.md
+++ /dev/null
@@ -1,131 +0,0 @@
-# Improvement Plan
-
-## 1. Parameter Optimization
-
-### Problem
-
-The scorer uses fixed global constants (decay exponents, stability coefficient, spacing
-weight, etc.). Finding optimal values requires running the benchmark and minimizing
-days-to-mastery across student profiles. Each evaluation is expensive (a full simulation),
-the function is not differentiable, and there are 5-10 parameters.
-
-### Algorithms
-
-#### Nelder-Mead
-
-A derivative-free optimization algorithm that maintains a simplex (N+1 vertices in
-N-dimensional space). At each step it reflects the worst vertex through the centroid of
-the remaining vertices, then expands, contracts, or shrinks based on the result.
-
-- Evaluations needed: 50-200 for 5 parameters, 150-500 for 10.
-- Strengths: simple to implement (~50 lines of logic), deterministic "next point" logic,
-  no gradients needed.
-- Weaknesses: can converge to local minima, degrades above ~15 parameters, no native
-  bound handling (use parameter transforms instead).
-
-#### CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
-
-Maintains a multivariate normal distribution over parameter space. Each iteration samples
-a population, evaluates them, and updates the distribution's mean, covariance matrix, and
-step size based on the best candidates. The covariance matrix learns correlations between
-parameters.
-
-- Evaluations needed: 200-500 for 5 parameters, 500-2000 for 10.
-- Strengths: handles non-convex and multimodal landscapes, learns parameter correlations,
-  very robust.
-- Weaknesses: higher evaluation cost than Nelder-Mead for simple landscapes, moderately
-  complex to implement (covariance update, step-size control).
-
-#### Bayesian Optimization
-
-Maintains a Gaussian Process surrogate model fitted to all evaluations so far. An
-acquisition function (e.g. Expected Improvement) balances exploration and exploitation
-to pick the next point. Specifically designed for expensive evaluations.
-
-- Evaluations needed: 30-100 for 5 parameters, 80-250 for 10.
-- Strengths: most sample-efficient method, models uncertainty explicitly, ideal for
-  expensive objective functions.
-- Weaknesses: complex to implement (GP fitting, kernel hyperparameters, acquisition
-  function optimization), practically requires a library.
-
-#### Powell's Method
-
-Performs sequential 1D line searches along a set of directions, updating the direction
-set after each cycle to incorporate curvature information.
-
-- Evaluations needed: similar to Nelder-Mead.
-- Strengths: often faster convergence than Nelder-Mead for smooth functions.
-- Weaknesses: requires implementing a line search subroutine, slightly more complex.
-
-#### Differential Evolution
-
-Population-based: creates new candidates by combining difference vectors from random
-population members.
-
-- Evaluations needed: 500-2000 for 5 parameters, 2000-10000 for 10.
-- Strengths: simple to implement, robust for multimodal problems.
-- Weaknesses: too many evaluations for tight budgets.
-
-#### Random / Latin Hypercube Search
-
-Random sampling, optionally with stratified coverage (Latin Hypercube). No learning
-between evaluations.
-
-- Useful as an initialization phase for directed methods, not as a standalone approach
-  for 5+ parameters.
-
-### Agent-Driven Optimization
-
-These algorithms are well-suited for an AI agent to run autonomously because the agent
-can implement the logic, run the benchmark, observe results, and decide next steps without
-human intervention.
-
-#### Recommended approach: phased strategy
-
-**Phase 1 — Exploration (20-50 evaluations):** Latin Hypercube Sampling across parameter
-bounds to get broad coverage. Identifies promising regions and which parameters matter most.
-
-**Phase 2 — Directed optimization (remaining budget):** Nelder-Mead starting from the best
-point found in Phase 1. If budget permits (>300 total), run 2-3 Nelder-Mead instances from
-different starting points to mitigate local minima.
-
-**Phase 3 — Local refinement (last 10-20% of budget):** Small perturbation study around the
-best point to confirm it is a genuine minimum and assess parameter sensitivity.
-
-#### Bound handling
-
-Transform bounded parameters to unbounded space before optimizing:
-
-- Parameters in (0, inf): log transform.
-- Parameters in (0, 1): logit transform.
-- Parameters in (a, b): logit of (x - a) / (b - a).
-
-The agent optimizes in transformed space and maps back for evaluation. This is cleaner
-than clamping or penalty functions.
-
-### Evaluation Budget Summary
-
-| Approach              | Budget (5 params)  | Budget (10 params) | Implementability |
-|-----------------------|--------------------|--------------------|------------------|
-| Nelder-Mead           | 50-200             | 150-500            | Trivial          |
-| Multi-start N-M       | 150-400            | 300-500+           | Trivial          |
-| Powell's method       | 50-200             | 150-500            | Moderate         |
-| CMA-ES                | 200-500            | 500-2000           | Moderate-Hard    |
-| Bayesian Optimization | 30-100             | 80-250             | Hard (library)   |
-| Differential Evolution| 500-2000           | 2000-10000         | Easy but costly  |
-| Random / LHS          | 500+ (poor)        | 1000+ (poor)       | Trivial          |
-
-### Parameters to Optimize
-
-Candidates from `exercise_scorer.rs`:
-
-- `DECLARATIVE_CURVE_DECAY` (-0.5)
-- `PROCEDURAL_CURVE_DECAY` (-0.3)
-- `STABILITY_COEFFICIENT` (2.1)
-- `DIFFICULTY_GRADE_ADJUSTMENT_SCALE` (0.6)
-- `DIFFICULTY_REVERSION_WEIGHT` (0.1)
-- `PERFORMANCE_WEIGHT_DECAY` (0.98)
-- `SPACING_EFFECT_WEIGHT` (0.7)
-
-The benchmark's `days_to_mastery` aggregated across all student profiles is the objective
-to minimize.
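Note on the "Bound handling" section of the removed plan: the log and logit transforms it lists could be sketched roughly as below. This is only an illustration under stated assumptions. The `Bound` enum and the `logit`, `sigmoid`, `to_unbounded`, and `to_bounded` helpers are hypothetical names and are not part of `exercise_scorer.rs` or of this change.

```rust
/// Hypothetical description of a parameter's open interval (not from the repository).
#[derive(Clone, Copy, Debug)]
enum Bound {
    /// Parameter in (0, inf): log transform.
    Positive,
    /// Parameter in (0, 1): logit transform.
    Unit,
    /// Parameter in (a, b): logit of (x - a) / (b - a).
    Interval(f64, f64),
}

/// Maps p in (0, 1) to the whole real line.
fn logit(p: f64) -> f64 {
    (p / (1.0 - p)).ln()
}

/// Inverse of `logit`: maps any real number back into (0, 1).
fn sigmoid(u: f64) -> f64 {
    1.0 / (1.0 + (-u).exp())
}

impl Bound {
    /// Bounded parameter value -> unbounded coordinate the optimizer works in.
    fn to_unbounded(self, x: f64) -> f64 {
        match self {
            Bound::Positive => x.ln(),
            Bound::Unit => logit(x),
            Bound::Interval(a, b) => logit((x - a) / (b - a)),
        }
    }

    /// Unbounded optimizer coordinate -> bounded parameter value for evaluation.
    fn to_bounded(self, u: f64) -> f64 {
        match self {
            Bound::Positive => u.exp(),
            Bound::Unit => sigmoid(u),
            Bound::Interval(a, b) => a + (b - a) * sigmoid(u),
        }
    }
}
```

For example, assuming a decay exponent is restricted to (-1, 0), `Bound::Interval(-1.0, 0.0).to_unbounded(-0.4)` gives the coordinate an optimizer such as Nelder-Mead would step in, and `to_bounded` maps each candidate back before the benchmark runs. The interval itself is an assumption; the plan does not state bounds for any parameter.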
diff --git a/src/benchmark.rs b/src/benchmark.rs
index 74126c9..fe53579 100644
--- a/src/benchmark.rs
+++ b/src/benchmark.rs
@@ -136,7 +136,7 @@ impl Default for Benchmark {
                 exercises_per_session: 25,
                 initial_performance: [0.3, 0.2, 0.25, 0.15, 0.1],
                 trials_before_stable: 5,
-                stable_performance: [0.02, 0.08, 0.1, 0.3, 0.5],
+                stable_performance: [0.02, 0.05, 0.1, 0.33, 0.5],
                 lapse_rate: 0.07,
             },
             below_median_profile: StudentProfile {
diff --git a/src/exercise_scorer.rs b/src/exercise_scorer.rs
index 3b99642..b6b18de 100644
--- a/src/exercise_scorer.rs
+++ b/src/exercise_scorer.rs
@@ -32,34 +32,34 @@ pub trait ExerciseScorer {
 
 // Adjustable constants: these can be tuned to calibrate the scorer.
 
-/// The decay exponent used in the power-law forgetting curve for declarative exercises (e.g. memory
-/// recall). The value is taken from the FSRS-4.5 implementation.
-const DECLARATIVE_CURVE_DECAY: f32 = -0.5;
-
 /// The decay exponent used in the power-law forgetting curve for procedural exercises (e.g. playing
 /// a piece of music). The value is higher than for declarative exercises, reflecting the slower
 /// decay of procedural memory.
-const PROCEDURAL_CURVE_DECAY: f32 = -0.3;
+const PROCEDURAL_CURVE_DECAY: f32 = -0.2;
+
+/// The decay exponent used in the power-law forgetting curve for declarative exercises (e.g. memory
+/// recall).
+const DECLARATIVE_CURVE_DECAY: f32 = -0.4;
 
 /// A scaling coefficient applied to the stability update term for each review. The per-review
 /// multiplicative change is `1 + STABILITY_COEFFICIENT * P * E * spacing_gain`. The resulting
 /// stability is clamped to `MIN_STABILITY..MAX_STABILITY`.
-const STABILITY_COEFFICIENT: f32 = 2.1;
+const STABILITY_COEFFICIENT: f32 = 2.5;
 
 /// The per-trial difficulty adjustment scale. Good grades reduce difficulty, poor grades increase
 /// it.
-const DIFFICULTY_GRADE_ADJUSTMENT_SCALE: f32 = 0.6;
+const DIFFICULTY_GRADE_ADJUSTMENT_SCALE: f32 = 1.05;
 
 /// How much the dynamic difficulty is pulled back toward the base estimate after each review.
-const DIFFICULTY_REVERSION_WEIGHT: f32 = 0.1;
+const DIFFICULTY_REVERSION_WEIGHT: f32 = 0.16;
 
 /// The per-day decay factor for exponential weighting of performance. Latest score weight 1.0,
 /// scores one day old are multiplied by it, two days old by its square and so on.
-const PERFORMANCE_WEIGHT_DECAY: f32 = 0.98;
+const PERFORMANCE_WEIGHT_DECAY: f32 = 0.95;
 
 /// The weight of the interval-aware spacing effect during successful reviews. Larger values
 /// increase stability growth when pre-review retrievability is low.
-const SPACING_EFFECT_WEIGHT: f32 = 0.7;
+const SPACING_EFFECT_WEIGHT: f32 = 0.65;
 
 /// The minimum weighted score required to apply the old-good retrievability floor. This floor is
 /// applied to exercises with strong historical performance to prevent them from dropping too low
@@ -216,32 +216,46 @@ impl PowerLawScorer {
         difficulty.clamp(MIN_DIFFICULTY, MAX_DIFFICULTY)
     }
 
-    /// Computes the time-decayed weighted average performance from all entries.
+    /// Computes a blended weighted average performance from all entries.
     ///
-    /// Weights decay by elapsed days from the most recent entry so irregular practice cadence is
-    /// modeled more accurately.
+    /// Two averages are combined: a time-based average where weights decay by elapsed weeks, and a
+    /// position-based average where weights decay by ordinal position (most recent = 1, next =
+    /// decay, then decay squared, etc.). The two are blended 60/40 time/position.
     fn compute_weighted_avg(entries: &[T]) -> f32 {
         if entries.is_empty() {
             return 0.0;
         }
 
-        // Start from the latest timestamp and compute the weights based on the number of days
-        // from it.
+        // Time-based average: weights decay by elapsed weeks from the most recent entry.
         let newest_timestamp = entries[0].timestamp();
-        let mut sum_weighted = 0.0;
-        let mut sum_weights = 0.0;
+        let mut time_sum_weighted = 0.0;
+        let mut time_sum_weights = 0.0;
         for entry in entries {
-            let elapsed_days = ((newest_timestamp.saturating_sub(entry.timestamp())) as f32
-                / SECONDS_PER_DAY)
+            let elapsed_weeks = ((newest_timestamp.saturating_sub(entry.timestamp())) as f32
+                / SECONDS_PER_DAY
+                / 7.0)
                 .max(0.0);
             let weight = PERFORMANCE_WEIGHT_DECAY
-                .powf(elapsed_days)
+                .powf(elapsed_weeks)
+                .max(PERFORMANCE_WEIGHT_MIN);
+            time_sum_weighted += weight * entry.value();
+            time_sum_weights += weight;
+        }
+        let time_avg = time_sum_weighted / time_sum_weights;
+
+        // Position-based average: weights decay by ordinal position regardless of timestamps.
+        let mut pos_sum_weighted = 0.0;
+        let mut pos_sum_weights = 0.0;
+        for (i, entry) in entries.iter().enumerate() {
+            let weight = PERFORMANCE_WEIGHT_DECAY
+                .powf(i as f32)
                 .max(PERFORMANCE_WEIGHT_MIN);
-            sum_weighted += weight * entry.value();
-            sum_weights += weight;
+            pos_sum_weighted += weight * entry.value();
+            pos_sum_weights += weight;
         }
+        let pos_avg = pos_sum_weighted / pos_sum_weights;
 
-        sum_weighted / sum_weights
+        0.8 * time_avg + 0.2 * pos_avg
     }
 
     /// Returns the forgetting-curve decay exponent for the given exercise type.
@@ -856,7 +870,7 @@ mod test {
             PowerLawScorer::compute_retrievability(&ExerciseType::Declarative, 100.0, stability);
         let very_old_procedural =
             PowerLawScorer::compute_retrievability(&ExerciseType::Procedural, 100.0, stability);
-        assert!(very_old_declarative < 0.25);
+        assert!(very_old_declarative < 0.26);
         assert!(very_old_declarative < very_old_procedural);
     }
 
@@ -950,7 +964,7 @@ mod test {
         let mean = PowerLawScorer::compute_weighted_avg(&single_trial);
         assert!((mean - 5.0).abs() < 1e-6);
 
-        // Multiple trials: [5.0, 4.0, 3.0] should be approx 4.03 at this decay rate.
+        // Multiple trials: [5.0, 4.0, 3.0] should be approx 4.017 at this decay rate.
         let multi_trials = vec![
             ExerciseTrial {
                 score: 5.0,
@@ -966,7 +980,7 @@ mod test {
             },
         ];
         let weighted = PowerLawScorer::compute_weighted_avg(&multi_trials);
-        assert!((weighted - 4.013).abs() < 0.001);
+        assert!((weighted - 4.017).abs() < 0.01);
 
         // Irregular spacing should down-weight distant failures more than dense spacing.
         let dense_low_tail = vec![