Description
Metadata
- Priority: P2-Moderate
- Phase: Phase 3 (Future Work)
- Feasibility: experimental
- Category: validation
- Effort: 10-40 hours (depending on scope)
- Dependencies: Issue #8, [P0][Phase-2][document] Issue 1.8: Beam Search Evaluation Dependence (Beijing ablation complete)
Problem Statement
Issue #8 demonstrated that distillation benefits are search-method dependent on the Beijing dataset. However, the Porto dataset shows fundamentally different baseline characteristics (vanilla succeeds with an 88-89% OD match, versus Beijing's 12-18% failure), raising questions about the cross-dataset generalizability of the search-method interaction.
Evidence
Beijing Results (Issue #8):
- A* Search: Distilled 6.0x faster than vanilla (0.30 vs 0.05 traj/s)
- Beam Search: Vanilla 1.4x faster than distilled (2.46 vs 1.79 traj/s)
- Conclusion: Search method determines whether distillation helps or hurts speed
Porto Characteristics (Different from Beijing):
| Characteristic | Beijing | Porto | Impact |
|------------------------|--------------|--------------|---------------------------|
| Vanilla OD Match (Beam)| 19% | 88-89% | Porto vanilla succeeds! |
| Average trip distance | 5.16 km | 3.66 km | 29% shorter trips |
| Distillation benefit | +486% OD | -2% OD | Minimal advantage |
| Road network | Grid | Complex | Different topology |
| Models available | 6 (2×3) | 9 (3×3) | More comprehensive |
Key Question: Does the search-method interaction:
- Only manifest when vanilla baseline fails (Beijing-specific)?
- Show different patterns with successful vanilla baseline (Porto)?
- Depend on dataset characteristics (trip length, network topology)?
Proposed Experiment
Option A: Selective Ablation (Recommended - 10 hours)
Models to test (2 models only):
- vanilla_25epoch_seed42.pth (best Porto baseline: 91.7% train, 89.6% test)
- distill_phase2_seed42.pth (optimized hyperparameters: λ=0.00598, τ=2.515)
Configuration:
- Search methods: A* + Beam (width=4)
- OD sources: train + test
- Trajectories: 1,000 per run
- Total runs: 2 models × 2 OD × 2 search = 8 runs
- Estimated time: ~10 hours
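A minimal sketch of the Option A run grid above, in Python. The checkpoint names come from the model list in this issue; the ~1.25 h/run figure is simply back-calculated from the 8 runs / ~10 hours estimate and is an assumption, not a measurement.

```python
from itertools import product

# Option A run grid: 2 models × 2 OD sources × 2 search methods = 8 runs.
MODELS = ["vanilla_25epoch_seed42.pth", "distill_phase2_seed42.pth"]
OD_SOURCES = ["train", "test"]
SEARCH_METHODS = ["astar", "beam"]  # beam width = 4

runs = [
    {"model": m, "od": od, "search": s}
    for m, od, s in product(MODELS, OD_SOURCES, SEARCH_METHODS)
]
# Assumed per-run wall-clock of ~1.25 h, derived from 8 runs in ~10 hours.
print(f"{len(runs)} runs, ~{len(runs) * 1.25:.0f} hours at ~1.25 h/run")
```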
Rationale:
- Tests the key hypotheses with 75% less computation than the full study
- Uses Phase 2 hyperparameters optimized for Porto (unlike the Phase 1 settings in the existing eval)
- Best vanilla baseline for fair comparison
- Sufficient for cross-dataset validation
Option B: Full Ablation (Thorough - 40 hours)
All models (9 total):
- 3× vanilla (seeds 42, 43, 44)
- 3× distill_phase1 (seeds 42, 43, 44)
- 3× distill_phase2 (seeds 42, 43, 44)
Configuration:
- Search methods: A* + Beam (width=4)
- OD sources: train + test
- Trajectories: 1,000 per run
- Total runs: 9 models × 2 OD × 2 search = 36 runs
- Estimated time: ~40 hours
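For reference, a hypothetical enumeration of the Option B model matrix (3 variants × 3 seeds). Only vanilla_25epoch_seed42.pth and distill_phase2_seed42.pth are named in this issue; the other filename patterns are assumptions for illustration.

```python
# Option B model matrix: 3 variants × 3 seeds = 9 checkpoints.
VARIANTS = ["vanilla_25epoch", "distill_phase1", "distill_phase2"]  # phase1/phase2 names assumed
SEEDS = [42, 43, 44]

checkpoints = [f"{v}_seed{s}.pth" for v in VARIANTS for s in SEEDS]
assert len(checkpoints) == 9
# 9 models × 2 OD sources × 2 search methods = 36 runs
print(len(checkpoints) * 2 * 2, "runs")
```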
Benefits:
- Comprehensive cross-dataset validation
- Compares Phase 1 vs Phase 2 hyperparameter impact
- Tests seed robustness (3 seeds per model type)
- Publication-quality rigor
Expected Outcomes
Hypothesis A: No Interaction (Most Likely)
Since Porto vanilla already succeeds:
- Both A* and Beam show similar performance
- Distillation benefit remains minimal (~-2% to +1%)
- Implication: Search-method interaction is Beijing-specific (occurs when vanilla fails)
Hypothesis B: Reversed Interaction
Porto's shorter trips might show opposite pattern:
- Vanilla might excel with A* (simpler navigation)
- Distilled might benefit more from beam (exploration helps in complex topology)
- Implication: Interaction pattern depends on task complexity
Hypothesis C: Phase 2 Reveals Benefits
Optimized hyperparameters might show:
- Phase 2 distilled performs better than Phase 1
- Search-method interaction appears with proper hyperparameters
- Implication: Hyperparameter tuning quality affects interaction
Validation Steps
- Experiment designed (Option A or B selected)
- Script adapted for Porto dataset
- A* generation complete (if trajectories do not already exist)
- Beam generation complete
- A* evaluation with normalized metrics
- Beam evaluation with normalized metrics
- Results analyzed and compared to Beijing
- Cross-dataset conclusions documented
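The "normalized metrics" steps above refer to Hausdorff_norm and DTW_norm from Issue #14. Below is a minimal sketch of one plausible length-normalization, assuming the raw distance is divided by the reference trajectory's path length; the exact scheme adopted in Issue #14 is not restated here.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def path_length(traj: np.ndarray) -> float:
    """Total polyline length of an (N, 2) trajectory."""
    return float(np.linalg.norm(np.diff(traj, axis=0), axis=1).sum())

def hausdorff_norm(gen: np.ndarray, ref: np.ndarray) -> float:
    """Symmetric Hausdorff distance divided by the reference path length (assumed normalization)."""
    d = max(directed_hausdorff(gen, ref)[0], directed_hausdorff(ref, gen)[0])
    return d / max(path_length(ref), 1e-9)

def dtw_norm(gen: np.ndarray, ref: np.ndarray) -> float:
    """Dynamic-time-warping distance, normalized the same way (assumed normalization)."""
    n, m = len(gen), len(ref)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(gen[i - 1] - ref[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m]) / max(path_length(ref), 1e-9)
```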
Scientific Value
Publication Impact
Strengthens Paper:
- ✅ Cross-dataset validation (2 cities vs 1)
- ✅ Tests generalizability of search-method interaction
- ✅ Shows dataset-dependent vs universal effects
- ✅ Demonstrates when distillation benefits apply
Negative Results Also Valuable:
- "No interaction when vanilla succeeds" is a finding
- Shows interaction is context-dependent, not universal
- Provides deployment guidance based on baseline performance
Research Questions Answered
- Is the search-method interaction universal?
  - Beijing: Yes, dramatic interaction
  - Porto: To be determined
- Does the interaction depend on baseline performance?
  - When vanilla fails (Beijing): Strong interaction
  - When vanilla succeeds (Porto): ?
- Do optimized hyperparameters change the interaction?
  - Phase 1 (suboptimal): Minimal benefit on Porto
  - Phase 2 (optimized): To be tested
Files to Generate
Trajectories (if they do not already exist):
- hoser-distill-optuna-porto-eval-*/gene/Porto/seed42/*_astar_*.csv (A* search)
- Beam search trajectories already exist from the previous eval
Evaluations (with normalized metrics):
- eval/2025-11-*/results.json (with Hausdorff_norm, DTW_norm)
- Performance metrics: *_perf.json
Documentation:
- Comparison report: Porto vs Beijing search-method interaction
- Update to docs/EVALUATION_COMPARISON.md
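A hedged sketch of how the evaluation outputs listed above could be aggregated into the Porto interaction table for comparison against Issue #8. The results.json schema (keys such as "model", "search_method", "od_match_rate", "hausdorff_norm", "dtw_norm") is an assumption, not confirmed by this issue.

```python
import glob
import json

rows = []
for path in sorted(glob.glob("eval/2025-11-*/results.json")):
    with open(path) as f:
        r = json.load(f)
    # Assumed keys; adjust to the actual results.json schema.
    rows.append((
        r.get("model"), r.get("search_method"), r.get("od_source"),
        r.get("od_match_rate"), r.get("hausdorff_norm"), r.get("dtw_norm"),
    ))

for row in rows:
    print(row)  # compare side by side with the Beijing table from Issue #8
```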
Coordination with Other Issues
Dependencies:
- ✅ Issue #8: [P0][Phase-2][document] Issue 1.8: Beam Search Evaluation Dependence (Beijing ablation) - Complete
- ✅ Issue #14: [P1][Phase-2][fix] Issue 2.6: Local Metrics Not Normalized (normalized metrics) - Merged
- ✅ Issue #22: [P2][Phase-3][document] Issue 3.4: Coefficient of Variation Misuse (CV handling) - Merged
Blocks:
- None (independent validation study)
Relates to:
- Issue #34: [P3][Phase-3][document] Issue 5.3: Cross-Model Comparisons Limited
- Issue #30: [P3][Phase-3][document] Issue 4.4: Porto Dataset Discussion Weak
Execution Plan
Immediate (for Issue #8 closure):
- Document as future work in Issue #8 ([P0][Phase-2][document] Issue 1.8: Beam Search Evaluation Dependence)
- Create this issue for tracking
- Close Issue #8 with the Beijing results
Short-term (if pursuing Option A):
- Verify A* trajectories exist for vanilla + distill_phase2 seed42
- Generate missing A* trajectories if needed (~5 hours)
- Re-evaluate with normalized metrics (~1 hour)
- Generate beam trajectories (~4 hours)
- Evaluate beam with normalized metrics (~1 hour)
- Analyze and document results (~1 hour)
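A small sketch of the first step above (verifying which A* trajectories already exist). The directory pattern comes from the Files to Generate section; matching per-model files on the checkpoint stem is an assumption about the naming convention.

```python
import glob

# Check for existing A* trajectory CSVs for the two Option A checkpoints.
PATTERN = "hoser-distill-optuna-porto-eval-*/gene/Porto/seed42/*_astar_*.csv"
NEEDED = ["vanilla_25epoch_seed42", "distill_phase2_seed42"]  # stems assumed to appear in filenames

existing = glob.glob(PATTERN)
for stem in NEEDED:
    hits = [p for p in existing if stem in p]
    status = "found" if hits else "MISSING - generate per the plan above"
    print(f"{stem}: {status}")
```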
Long-term (if pursuing Option B):
- Full A* generation for all 9 models (~20 hours)
- Full beam generation for all 9 models (~20 hours)
- Comprehensive analysis (~2 hours)
Success Criteria
Minimum (Option A):
- 8 evaluation runs completed with normalized metrics
- Clear comparison: Porto vs Beijing interaction patterns
- Documented: When distillation benefits depend on search method
Comprehensive (Option B):
- 36 evaluation runs completed across all models
- Phase 1 vs Phase 2 impact on search-method interaction
- Seed robustness analysis
- Cross-dataset comparison table
Notes
From Beijing Ablation (Issue #8):
- Normalized metrics (Hausdorff_norm, DTW_norm) are essential for fair comparison
- Speed-quality tradeoffs must be considered
- OD match rate is a critical differentiator
Porto-Specific Considerations:
- Shorter trips (3.66 km) may reduce search depth impact
- Higher road network connectivity might minimize A* vs Beam differences
- Vanilla baseline strength reduces distillation's relative advantage
Reference: Issue #8 (Beijing Beam Search Ablation)
Related: hoser-distill-optuna-porto-eval-eb0e88ab-20251026_152732/EVALUATION_ANALYSIS_PHASE1.md
Status: Future work - to be scheduled after P1-Major issues complete
Recommendation: Start with Option A (selective ablation) for efficient cross-validation