
[P2][Phase-3][experimental] Porto Dataset Beam Search Ablation Study #44

@matercomus

Problem Statement

Issue #8 demonstrated that distillation benefits are search-method dependent on the Beijing dataset. However, the Porto dataset shows fundamentally different baseline characteristics (vanilla succeeds with an 88-89% OD match rate, versus Beijing's 12-18% failure rate), raising the question of whether the search-method interaction generalizes across datasets.

Evidence

Beijing Results (Issue #8):

  • A* Search: Distilled 6.0x faster than vanilla (0.30 vs 0.05 traj/s)
  • Beam Search: Vanilla 1.4x faster than distilled (2.46 vs 1.79 traj/s)
  • Conclusion: Search method determines whether distillation helps or hurts speed

Porto Characteristics (Different from Beijing):

| Characteristic          | Beijing      | Porto        | Impact                    |
|------------------------|--------------|--------------|---------------------------|
| Vanilla OD Match (Beam)| 19%          | 88-89%       | Porto vanilla succeeds!   |
| Average trip distance  | 5.16 km      | 3.66 km      | 29% shorter trips         |
| Distillation benefit   | +486% OD     | -2% OD       | Minimal advantage         |
| Road network           | Grid         | Complex      | Different topology        |
| Models available       | 6 (2×3)      | 9 (3×3)      | More comprehensive        |

Key Question: Does the search-method interaction:

  • Only manifest when vanilla baseline fails (Beijing-specific)?
  • Show different patterns with successful vanilla baseline (Porto)?
  • Depend on dataset characteristics (trip length, network topology)?

Proposed Experiment

Option A: Selective Ablation (Recommended - 10 hours)

Models to test (2 models only):

  • vanilla_25epoch_seed42.pth (best Porto baseline: 91.7% train, 89.6% test)
  • distill_phase2_seed42.pth (optimized hyperparameters: λ=0.00598, τ=2.515)

Configuration:

  • Search methods: A* + Beam (width=4)
  • OD sources: train + test
  • Trajectories: 1,000 per run
  • Total runs: 2 models × 2 OD × 2 search = 8 runs
  • Estimated time: ~10 hours
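
For concreteness, a minimal sketch of how the Option A run matrix could be enumerated; the dict keys and script structure are illustrative, not the project's actual evaluation interface:

```python
# Enumerate the Option A run matrix: 2 models x 2 OD sources x 2 search methods = 8 runs.
from itertools import product

MODELS = ["vanilla_25epoch_seed42.pth", "distill_phase2_seed42.pth"]
OD_SOURCES = ["train", "test"]
SEARCH_METHODS = [("astar", None), ("beam", 4)]  # (method, beam width)

runs = [
    {
        "model": model,
        "od_source": od_source,
        "search": search,
        "beam_width": width,
        "num_trajectories": 1000,
    }
    for model, od_source, (search, width) in product(MODELS, OD_SOURCES, SEARCH_METHODS)
]

assert len(runs) == 8  # matches the "Total runs" count above
```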

Rationale:

  • Tests the key hypotheses with ~75% less compute time than the full study
  • Phase 2 hyperparameters are tuned for Porto (unlike the Phase 1 settings used in the existing evaluation)
  • Best vanilla baseline for fair comparison
  • Sufficient for cross-dataset validation

Option B: Full Ablation (Thorough - 40 hours)

All models (9 total):

  • 3× vanilla (seeds 42, 43, 44)
  • 3× distill_phase1 (seeds 42, 43, 44)
  • 3× distill_phase2 (seeds 42, 43, 44)

Configuration:

  • Search methods: A* + Beam (width=4)
  • OD sources: train + test
  • Trajectories: 1,000 per run
  • Total runs: 9 models × 2 OD × 2 search = 36 runs
  • Estimated time: ~40 hours

Benefits:

  • Comprehensive cross-dataset validation
  • Compares Phase 1 vs Phase 2 hyperparameter impact
  • Tests seed robustness (3 seeds per model type)
  • Publication-quality rigor
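
To illustrate the seed-robustness analysis in Option B, here is a sketch that aggregates a per-run metric (e.g. OD match rate) into mean ± std per model type across seeds 42-44. How the per-run values are loaded (e.g. parsed from results.json files) is left out, and the function name is illustrative:

```python
# Aggregate a per-run metric across seeds per model type (mean, sample std dev).
# `per_run` maps (model_type, seed) -> metric value.
from collections import defaultdict
from statistics import mean, stdev

def aggregate_by_model_type(per_run: dict[tuple[str, int], float]) -> dict[str, tuple[float, float]]:
    grouped: defaultdict[str, list[float]] = defaultdict(list)
    for (model_type, _seed), value in per_run.items():
        grouped[model_type].append(value)
    # 3 seeds per model type -> report mean and sample standard deviation
    return {m: (mean(v), stdev(v) if len(v) > 1 else 0.0) for m, v in grouped.items()}
```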

Expected Outcomes

Hypothesis A: No Interaction (Most Likely)

Since Porto vanilla already succeeds:

  • Both A* and Beam show similar performance
  • Distillation benefit remains minimal (~-2% to +1%)
  • Implication: Search-method interaction is Beijing-specific (occurs when vanilla fails)

Hypothesis B: Reversed Interaction

Porto's shorter trips might show opposite pattern:

  • Vanilla might excel with A* (simpler navigation)
  • Distilled might benefit more from Beam search (exploration helps in complex topology)
  • Implication: Interaction pattern depends on task complexity

Hypothesis C: Phase 2 Reveals Benefits

Optimized hyperparameters might show:

  • Phase 2 distilled performs better than Phase 1
  • Search-method interaction appears with proper hyperparameters
  • Implication: Hyperparameter tuning quality affects interaction

Validation Steps

  • Experiment designed (Option A or B selected)
  • Script adapted for Porto dataset
  • A* trajectory generation complete (unless trajectories already exist)
  • Beam generation complete
  • A* evaluation with normalized metrics
  • Beam evaluation with normalized metrics
  • Results analyzed and compared to Beijing
  • Cross-dataset conclusions documented

Scientific Value

Publication Impact

Strengthens Paper:

  • ✅ Cross-dataset validation (2 cities vs 1)
  • ✅ Tests generalizability of search-method interaction
  • ✅ Shows dataset-dependent vs universal effects
  • ✅ Demonstrates when distillation benefits apply

Negative Results Also Valuable:

  • "No interaction when vanilla succeeds" is a finding
  • Shows interaction is context-dependent, not universal
  • Provides deployment guidance based on baseline performance

Research Questions Answered

  1. Is search-method interaction universal?

    • Beijing: Yes, dramatic interaction
    • Porto: To be determined
  2. Does interaction depend on baseline performance?

    • When vanilla fails (Beijing): Strong interaction
    • When vanilla succeeds (Porto): ?
  3. Do optimized hyperparameters change interaction?

    • Phase 1 (suboptimal): Minimal benefit on Porto
    • Phase 2 (optimized): To be tested

Files to Generate

Trajectories (if they do not already exist):

  • hoser-distill-optuna-porto-eval-*/gene/Porto/seed42/*_astar_*.csv (A* search)
  • Beam search trajectories already exist from previous eval

Evaluations (with normalized metrics):

  • eval/2025-11-*/results.json (with Hausdorff_norm, DTW_norm)
  • Performance metrics: *_perf.json
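
A sketch of how the evaluation outputs listed above could be collected for the comparison report. The metric names follow those mentioned in this issue (Hausdorff_norm, DTW_norm), but the internal structure of results.json is an assumption and would need to match the actual evaluator output:

```python
# Collect normalized metrics from each evaluation run for the Porto-vs-Beijing
# comparison. Assumes results.json contains flat keys named as below.
import glob
import json

rows = []
for path in glob.glob("eval/2025-11-*/results.json"):
    with open(path) as f:
        metrics = json.load(f)
    rows.append({
        "run": path,
        "hausdorff_norm": metrics.get("Hausdorff_norm"),
        "dtw_norm": metrics.get("DTW_norm"),
    })

# rows can then be tabulated (e.g. with pandas) into the comparison report.
```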

Documentation:

  • Comparison report: Porto vs Beijing search-method interaction
  • Update to docs/EVALUATION_COMPARISON.md

Coordination with Other Issues

Dependencies:

Blocks:

  • None (independent validation study)

Relates to:

Execution Plan

Immediate (for Issue #8 closure):

Short-term (if pursuing Option A):

  1. Verify A* trajectories exist for vanilla + distill_phase2 seed42
  2. Generate missing A* trajectories if needed (~5 hours)
  3. Re-evaluate with normalized metrics (~1 hour)
  4. Generate beam trajectories (~4 hours)
  5. Evaluate beam with normalized metrics (~1 hour)
  6. Analyze and document results (~1 hour)
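
Step 1 can be checked mechanically using the trajectory path pattern listed under "Files to Generate"; a minimal sketch (the glob pattern is taken from above, everything else is illustrative):

```python
# Check whether A* trajectories already exist for the Option A models (seed 42).
import glob

existing = glob.glob(
    "hoser-distill-optuna-porto-eval-*/gene/Porto/seed42/*_astar_*.csv"
)
print(f"Found {len(existing)} existing A* trajectory files")
for path in existing:
    print(" ", path)
```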

Long-term (if pursuing Option B):

  1. Full A* generation for all 9 models (~20 hours)
  2. Full beam generation for all 9 models (~20 hours)
  3. Comprehensive analysis (~2 hours)

Success Criteria

Minimum (Option A):

  • 8 evaluation runs completed with normalized metrics
  • Clear comparison: Porto vs Beijing interaction patterns
  • Documented: When distillation benefits depend on search method

Comprehensive (Option B):

  • 36 evaluation runs completed across all models
  • Phase 1 vs Phase 2 impact on search-method interaction
  • Seed robustness analysis
  • Cross-dataset comparison table

Notes

From Beijing Ablation (Issue #8):

  • Normalized metrics (Hausdorff_norm, DTW_norm) essential for fair comparison
  • Speed-quality tradeoffs must be considered
  • OD match rate is critical differentiator
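
For reference, one plausible way to compute a length-normalized Hausdorff distance; this is a sketch under the assumption that normalization divides by the ground-truth trajectory's path length, which may differ from the project's actual definition of Hausdorff_norm:

```python
# Length-normalized Hausdorff distance between a generated and a ground-truth
# trajectory, each given as an (N, 2) array of coordinates.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_norm(generated: np.ndarray, reference: np.ndarray) -> float:
    # Symmetric Hausdorff distance = max of the two directed distances.
    h = max(
        directed_hausdorff(generated, reference)[0],
        directed_hausdorff(reference, generated)[0],
    )
    # Normalize by the reference trajectory's total path length (assumption).
    ref_length = np.sum(np.linalg.norm(np.diff(reference, axis=0), axis=1))
    return h / ref_length if ref_length > 0 else h
```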

Porto-Specific Considerations:

  • Shorter trips (3.66 km) may reduce search depth impact
  • Higher road network connectivity might minimize A* vs Beam differences
  • Vanilla baseline strength reduces distillation's relative advantage

Reference: Issue #8 (Beijing Beam Search Ablation)
Related: hoser-distill-optuna-porto-eval-eb0e88ab-20251026_152732/EVALUATION_ANALYSIS_PHASE1.md
Status: Future work - to be scheduled after P1-Major issues complete
Recommendation: Start with Option A (selective ablation) for efficient cross-validation
