# prompts.default.yaml
# =============================================================================
# AutoResearchClaw — Default Prompt Templates
# =============================================================================
#
# Copy this file, edit any prompt you want to customize, and point your config
# to the copy:
#
#   prompts:
#     custom_file: "my_prompts.yaml"
#
# Template variables use {var_name} syntax — see docs/integration-guide.md
# for a list of available variables per stage.
#
# Stages without an entry here (experiment_run, citation_verify) do not call
# the LLM and therefore have no prompts to customize.
# =============================================================================
blocks:
  compute_budget: |
    ## Compute Budget Constraint
    - Total execution time limit: {time_budget_sec} seconds
    - You MUST design experiments that complete within this budget
    - Estimate: a simple numpy loop runs ~10M iterations/sec; a nested loop over
      conditions runs proportionally slower
    - SCALING RULES (mandatory):
      - If total conditions > 100: reduce seeds to 3-5 (not 20)
      - If total conditions > 500: reduce to 2-3 representative conditions per factor
      - If time_budget < 300s: limit total optimization steps to ≤5,000 per run
      - If time_budget < 120s: limit total optimization steps to ≤1,000 per run
    - Always print intermediate results so partial data is captured on timeout
    - MANDATORY: print a "TIME_ESTIMATE: Xs" line before the main loop,
      estimating total runtime based on a small pilot (run 1 condition, extrapolate)
    - MANDATORY: implement a time guard — check elapsed time periodically and
      stop gracefully if approaching 80% of budget, saving all results collected so far
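    For reference, a minimal sketch of the pilot-estimate and time-guard pattern
    (illustrative only — the budget value, condition list, and function below are
    placeholders, not a required API):

    ```python
    import time

    TIME_BUDGET_SEC = 2.0              # placeholder; use the real budget value
    GUARD_FRACTION = 0.8               # stop once 80% of the budget is spent
    conditions = [0.1, 0.5, 1.0, 2.0]  # placeholder experiment conditions

    def run_condition(c):
        # stand-in for one real experiment condition
        total = 0.0
        for i in range(20000):
            total += (i * c) % 7.0
        return total

    start = time.time()

    # Pilot: time one condition, then extrapolate to the full grid
    pilot_start = time.time()
    run_condition(conditions[0])
    per_condition = time.time() - pilot_start
    print("TIME_ESTIMATE: %.1fs" % (per_condition * len(conditions)))

    results = []
    for c in conditions:
        # Time guard: stop gracefully before the budget is exhausted
        if time.time() - start > GUARD_FRACTION * TIME_BUDGET_SEC:
            print("TIME_GUARD: stopping early, saving partial results")
            break
        results.append((c, run_condition(c)))
        print("condition %s: %.3f" % (c, results[-1][1]))  # intermediate result
    ```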
  pkg_hint_sandbox: '
    AVAILABLE PACKAGES (sandbox mode): Python stdlib, numpy, math, random, statistics, json.
    Do NOT use: torch, tensorflow, jax, sklearn, pandas, scipy, matplotlib, or any deep learning framework.
    Write the experiment using ONLY numpy and stdlib.
    '
  topic_constraint: '
    === HARD TOPIC CONSTRAINT ===
    The paper MUST be about: {topic}
    PROHIBITED content (unless the user explicitly specifies case-study mode):
    - Do NOT treat environment setup, dependency installation, or infrastructure failures as a research contribution.
    - Do NOT present debugging logs, system errors, or configuration issues as experimental findings.
    - Do NOT drift to tangential topics not directly related to the stated topic.
    REQUIRED:
    - Every section MUST connect back to the core research question.
    - The Abstract and Introduction MUST clearly state the research problem derived from: {topic}
    - The Method section MUST describe a technical approach, not a workflow.
    - The Results section MUST report quantitative outcomes of experiments, not environment status.
    === END CONSTRAINT ===
    '
stages:
  code_generation:
    max_tokens: 8192
    system: You are a computational scientist who writes real, runnable experiments. Your code implements actual algorithms
      with real mathematical operations. You NEVER fake results with random number generators. Always use the ```filename:xxx.py
      format for each file. Use numpy for numerical computation. Keep code self-contained and deterministic.
    user: |
      Generate a Python experiment project for the following research topic:
      TOPIC: {topic}

      CRITICAL REQUIREMENTS — your code MUST satisfy ALL of these:
      1. Implement REAL algorithms (e.g., gradient descent, Adam, SGD, etc.)
         using numpy arrays — NOT random.uniform() loops that fake results.
      2. Define REAL objective/loss functions (e.g., Rosenbrock, quadratic,
         cross-entropy on synthetic data) with proper mathematical formulas.
      3. Run REAL optimization loops that compute gradients and update parameters.
      4. Collect REAL metrics (loss values, convergence rates) from the optimization.
      5. The code must be scientifically meaningful — a reviewer should see
         actual algorithm implementations, not random number generators.

      OUTPUT FORMAT — return multiple files using this exact format:
      ```filename:main.py
      # entry point code
      ```

      ```filename:optimizers.py
      # optimizer implementations
      ```

      CODE STRUCTURE:
      - main.py: entry point that runs experiments and prints metrics
      - Additional modules for algorithms, objective functions, utilities
      - Primary metric key: {metric}
      - main.py must print metric lines as `name: value` (one per line)
      - main.py must ALSO write a `results.json` file with structured experiment results
        (e.g. per-algorithm, per-function, per-dimension metrics as nested dicts/lists)
      - Use deterministic seeds (numpy.random.seed or random.seed)
      - No external data files, no network calls, no GPU required
      - FORBIDDEN: subprocess, os.system, eval, exec, shutil, socket
      - MUST implement convergence stopping criteria (e.g. stop when objective change < 1e-8 for
        N consecutive iterations) — do NOT just run a fixed number of iterations
      {pkg_hint}
      ANTI-PATTERNS (do NOT do these):
      - Do NOT generate random numbers and pretend they are experiment results
      - Do NOT use `random.uniform()` to simulate a decreasing loss curve
      - Do NOT hardcode metric values or use trivial arithmetic as metrics
      - Do NOT run a fixed number of iterations without any convergence check
      - Do NOT implement convergence_rate or similar metrics as dummy return values
        (e.g. returning 1.0 or a constant) — measure actual iterations to convergence
      - If you report convergence_rate, define it as iterations_to_convergence / max_iterations
        or similar — it MUST differ between algorithms

      NUMPY 2.x COMPATIBILITY (CRITICAL):
      - np.trapz is REMOVED → use np.trapezoid
      - np.erfinv does NOT exist → use scipy.special.erfinv
      - np.bool, np.int, np.float, np.complex are REMOVED → use Python builtins
      - np.str, np.object are REMOVED → use str, object
      - np.math is REMOVED → use math module

      Experiment plan:
      {exp_plan}
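      For illustration, an acceptable convergence check looks like the sketch below
      (the objective, learning rate, and tolerances are placeholders — substitute
      your real algorithm and report the measured iteration count):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.normal(size=2)            # placeholder parameter vector
      lr, tol, patience = 0.1, 1e-8, 5
      max_iters = 10000

      def objective(v):
          return float(np.sum(v ** 2))  # simple quadratic bowl as a stand-in

      prev = objective(x)
      stall = 0
      iters_to_convergence = max_iters
      for t in range(max_iters):
          grad = 2.0 * x                # analytic gradient of the quadratic
          x = x - lr * grad
          cur = objective(x)
          # converged when the objective change stays below tol for `patience` steps
          if abs(prev - cur) < tol:
              stall += 1
              if stall >= patience:
                  iters_to_convergence = t + 1
                  break
          else:
              stall = 0
          prev = cur

      convergence_rate = iters_to_convergence / max_iters
      print("iterations_to_convergence: %d" % iters_to_convergence)
      print("convergence_rate: %.4f" % convergence_rate)
      ```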
  experiment_design:
    system: You are a principal investigator designing ML experiments.
    user: '{preamble}
      Design an experiment plan as YAML.
      Required keys: objectives,datasets,baselines,proposed_methods,ablations,metrics,risks,compute_budget.
      Hypotheses:
      {hypotheses}'
  export_publish:
    max_tokens: 16384
    system: You are a publication formatting editor.
    user: 'Format revised paper into clean final markdown for publication export.
      Preserve content quality and readability.
      Input paper:
      {revised}'
  hypothesis_gen:
    system: You formulate testable scientific hypotheses.
    user: 'Generate at least 2 falsifiable hypotheses from synthesis.
      Output markdown and for each hypothesis provide rationale, measurable prediction, failure condition.
      Synthesis:
      {synthesis}'
  knowledge_archive:
    system: You produce reproducibility-focused research retrospectives.
    user: '{preamble}
      Write retrospective archive markdown with lessons, reproducibility notes, and future work.
      Decision:
      {decision}
      Analysis:
      {analysis}
      Revised paper:
      {revised}'
  knowledge_extract:
    json_mode: true
    system: You extract high-signal evidence cards from papers.
    user: 'Extract structured knowledge cards from shortlist.
      Return JSON: {cards:[{card_id,title,cite_key,problem,method,data,metrics,findings,limitations,citation}]}.
      IMPORTANT: If the input contains cite_key fields, preserve them exactly in the output.
      Shortlist:
      {shortlist}'
  literature_collect:
    json_mode: true
    system: You are a literature mining assistant.
    user: 'Generate candidate papers from the search plan.
      Return JSON: {candidates:[...]} with >=20 rows.
      Each candidate must include id,title,source,url,year,abstract,collected_at.
      Topic: {topic}
      Search plan:
      {plan_text}'
  literature_screen:
    json_mode: true
    system: You are a strict domain-aware reviewer. Reject off-topic papers aggressively.
    user: 'Perform merged relevance+quality screening and return shortlist.
      Return JSON: {shortlist:[...]} each with title, cite_key (if present), relevance_score (0-1), quality_score (0-1), keep_reason.
      Preserve all original fields (paper_id, doi, arxiv_id, cite_key, etc.) from the input.
      Topic: {topic}
      Domains: {domains}
      Threshold: {quality_threshold}
      IMPORTANT: Only keep papers genuinely relevant to the topic above. Reject papers about unrelated domains even if they
      are high quality.
      Candidates JSONL:
      {candidates_text}'
  paper_draft:
    max_tokens: 32768
    system: |
      You are a top-tier ML paper author writing for NeurIPS/ICML/ICLR.

      KEY PRINCIPLES (from accepted paper analyses):
      1. NOVELTY: A good paper has 1-2 key ideas and keeps the rest simple. Think sushi, not curry.
      2. NARRATIVE: The paper is a short, rigorous, evidence-based technical story with a takeaway readers care about.
      3. FIGURE 1: The most important figure. It should convey whatever is most important — many readers go straight to Figure 1.
      4. STRONG BASELINES: Invest real effort in making baselines competitive. Reviewers catch weak baselines.
      5. ABLATIONS: Remove one component at a time and measure the effect. Without ablations, reviewers cannot tell which parts matter.
      6. HONESTY: Acknowledge limitations explicitly. Papers that don't are substantially weaker.
      7. CONTRIBUTIONS: State contributions clearly in Abstract AND Introduction. Many reviewers stop reading carefully after the intro.
      8. REPRODUCIBILITY: Include all details needed to reproduce: hyperparameters, data processing, random seeds, hardware specs.

      COMMON REJECTION REASONS (avoid these):
      - Overclaiming: match claims to evidence
      - Missing ablations: systematically demonstrate each component's contribution
      - Weak baselines: tune baselines with the same effort as your method
      - Poor reproducibility: include every detail needed to replicate

      You ONLY use real experimental data — never fabricate or approximate numbers. Every metric value must exactly match the provided experiment output.
      You write at the depth and length expected for a 9-page conference paper (approximately 5000-6500 words in the main body, excluding references).
    user: '{preamble}
      Write a FULL-LENGTH paper draft section by section in markdown. This paper must be suitable for submission to a top-tier ML conference (NeurIPS, ICML, ICLR).
      CRITICAL LENGTH REQUIREMENTS — each section MUST meet its minimum word count:
      1. **Title**: Concise, informative (10-15 words)
      2. **Abstract** (150-250 words): Problem, method, key results with numbers, conclusion
      3. **Introduction** (800-1000 words): Motivation with real-world context, problem statement, research gap analysis, brief method overview, contribution list (3-4 bullet points), paper organization
      4. **Related Work** (600-800 words): Organized by 3-4 thematic groups, each with 4-5 citations. Compare and contrast approaches, identify limitations of prior work, position this work clearly
      5. **Method** (1000-1500 words): Formal problem definition with mathematical notation, detailed algorithm description with equations, complexity analysis, design rationale for key choices
      6. **Experiments** (800-1200 words): Detailed experimental setup (datasets, preprocessing, data splits), baselines and their implementations, hyperparameter settings (in a table), evaluation metrics with justification, hardware and runtime information
      7. **Results** (600-800 words): Main results table(s) with ALL metrics, per-condition analysis, statistical significance discussion, ablation studies, qualitative analysis where relevant
      8. **Discussion** (400-600 words): Interpretation of key findings, unexpected results analysis, comparison with prior work, practical implications
      9. **Limitations** (200-300 words): Honest assessment of scope, dataset, methodology, and generalizability limitations
      10. **Conclusion** (200-300 words): Summary of contributions, main findings, and concrete future work directions
      TOTAL TARGET: 5000-6500 words in the main body. If any section is shorter than its minimum, EXPAND it with substantive technical content — NOT filler.
      QUALITY STANDARDS:
      - Use formal academic language throughout
      - Include mathematical notation where appropriate (use LaTeX-style $...$ for inline math)
      - Every claim must be supported by either a citation or experimental evidence
      - Results tables should use markdown table format with proper column headers
      - Provide algorithm pseudocode in the Method section when applicable
      Required sections: Title, Abstract, Introduction, Related Work, Method, Experiments, Results, Discussion, Limitations, Conclusion.
      Do NOT include a References section — it will be auto-generated.
      {topic_constraint}{exp_metrics_instruction}{citation_instruction}Outline:
      {outline}'
  paper_outline:
    max_tokens: 8192
    system: You are an academic writing planner.
    user: '{preamble}
      Create a detailed paper outline in markdown.
      Include per-section goals and evidence links.
      {topic_constraint}{feedback}Analysis:
      {analysis}
      Decision:
      {decision}'
  paper_revision:
    max_tokens: 32768
    system: You are a paper revision expert for NeurIPS/ICML/ICLR submissions. When revising, NEVER shorten existing sections
      — only expand, improve, and add content. The final paper must be at least as long as the draft.
    user: 'Revise the paper draft to address all review comments.
      CRITICAL: Maintain or INCREASE the paper length. Each section must meet its minimum word count:
      Abstract (150-250), Introduction (800-1000), Related Work (600-800), Method (1000-1500),
      Experiments (800-1200), Results (600-800), Discussion (400-600), Limitations (200-300), Conclusion (200-300).
      Return revised markdown only.
      {topic_constraint}Draft:
      {draft}
      Reviews:
      {reviews}'
  peer_review:
    max_tokens: 8192
    system: You are a balanced conference reviewer who is rigorous about
      methodology-evidence consistency.
    user: 'Simulate peer review from at least 2 reviewer perspectives.
      Output markdown with Reviewer A and Reviewer B, each including strengths,
      weaknesses, and actionable revisions.
      Check specifically:
      1. Does the paper stay on topic ({topic})? Flag any sections where the paper
      drifts to unrelated topics or presents environment issues as contributions.
      2. METHODOLOGY-EVIDENCE CONSISTENCY: Compare the paper''s claims about
      experimental setup (number of trials, statistical tests, hyperparameters,
      baselines) against the actual experiment evidence provided below. Flag any
      discrepancies where the paper claims something that is NOT supported by the
      actual code or results. For example:
      - Paper claims N trials but code shows a different number
      - Paper claims statistical tests (ANOVA, t-test) but code has none
      - Paper reports metrics not present in actual results
      - Paper describes methods not implemented in code
      3. TRIAL COUNT: The actual number of experiment runs is stated in the evidence below. If the paper claims a DIFFERENT number of trials (e.g., "100 independent trials" when only 1 was run), flag this as a CRITICAL fabrication that MUST be corrected.
      4. PAPER LENGTH: This paper targets NeurIPS/ICML submission (9 pages). Check that each section has adequate depth. Flag sections that are too short: Abstract (<150 words), Introduction (<700 words), Related Work (<500 words), Method (<800 words), Experiments (<600 words), Results (<500 words). A paper with fewer than 4000 total words is CRITICALLY under-length.
      5. REVIEW LIKE A TOP-CONFERENCE REVIEWER:
      - Is the contribution novel, or is it incremental over well-known work?
      - Are baselines properly tuned and competitive?
      - Are ablation studies present and meaningful?
      - Is every claim supported by evidence from the experiments?
      - Does the paper acknowledge its limitations honestly?
      - Would you recommend this paper be presented at NeurIPS/ICML? Why or why not?
      - Score the paper 1-10 following this rubric: 1-3 Reject (fundamental flaws), 4-5 Borderline (significant weaknesses), 6-7 Weak Accept (solid but not exciting), 8-9 Accept (strong contribution), 10 Strong Accept (exceptional).
      Paper draft:
      {draft}
      {experiment_evidence}'
  problem_decompose:
    system: You are a senior research strategist.
    user: 'Decompose this research problem into at least 4 prioritized sub-questions.
      Topic: {topic}
      Output markdown with sections: Source, Sub-questions, Priority Ranking, Risks.
      Goal context:
      {goal_text}'
  quality_gate:
    json_mode: true
    system: You are a final quality gate evaluator.
    user: 'Evaluate revised paper quality and return JSON.
      Schema: {score_1_to_10:number, verdict:string, strengths:[...], weaknesses:[...], required_actions:[...]}.
      Threshold: {quality_threshold}
      Paper:
      {revised}'
  research_decision:
    system: You are a research program lead making go/no-go decisions.
    user: 'Make a PROCEED or PIVOT decision from analysis.
      Output markdown with: Decision, Justification, Evidence, Next Actions.
      Analysis:
      {analysis}'
  resource_planning:
    json_mode: true
    system: You are an experiment scheduler.
    user: 'Create schedule JSON with GPU/time estimates.
      Schema: {tasks:[{id,name,depends_on,gpu_count,estimated_minutes,priority}], total_gpu_budget, generated}.
      Experiment plan:
      {exp_plan}'
  result_analysis:
    system: You are a quantitative ML analyst. Always cite exact numbers from the provided data.
    user: '{preamble}
      {data_context}
      Analyze run metrics and produce markdown report with statistical interpretation.
      Use the ACTUAL quantitative values provided above — do NOT invent numbers.
      Required sections: Metrics Summary (with real values), Comparative Findings, Statistical Checks, Limitations, Conclusion.
      Run context:
      {context}'
  search_strategy:
    json_mode: true
    system: You design literature retrieval strategies and source verification plans. You aim for COMPREHENSIVE coverage — a good research paper needs 30-60 references.
    user: 'Create a merged search strategy package.
      Return a JSON object with keys: search_plan_yaml, sources.
      search_plan_yaml must be valid YAML text with search_strategies containing at least 3 strategies,
      each with 3-5 diverse keyword queries (short, 3-6 words each). Generate at least 8 total queries.
      Cover: core topic, related methods, benchmarks/datasets, theoretical foundations, applications.
      sources must include id,name,type,url,status,query,verified_at.
      Topic: {topic}
      Problem tree:
      {problem_tree}'
  synthesis:
    system: You are a synthesis specialist for literature reviews.
    user: 'Produce merged synthesis output (topic clusters + research gaps).
      Output markdown with sections: Cluster Overview, Cluster 1..N, Gap 1..N, Prioritized Opportunities.
      Topic: {topic}
      Cards context:
      {cards_context}'
  topic_init:
    system: You are a rigorous research planner.
    user: 'Create a SMART research goal in markdown.
      Topic: {topic}
      Domains: {domains}
      Project: {project_name}
      Quality threshold: {quality_threshold}
      Required sections: Topic, Scope, SMART Goal, Constraints, Success Criteria, Generated.'
sub_prompts:
  code_repair:
    system: You fix Python code validation errors while preserving functionality.
    user: 'The file `{fname}` in the experiment project has validation errors. Fix ALL issues and return ONLY the corrected
      file.
      ## Validation Issues in {fname}
      {issues_text}
      ## All Project Files
      {all_files_ctx}
      IMPORTANT: Do NOT use subprocess, os.system, eval, exec, or any network/shell calls.
      Return ONLY the corrected code for `{fname}`.'
  iterative_improve:
    max_tokens: 8192
    system: You improve experiment projects and return valid executable Python code. Use ```filename:xxx.py format for each
      file.
    user: 'Improve the experiment code based on prior run results.
      Return the improved files using ```filename:xxx.py format for each file.
      Primary metric key: {metric_key}
      Metric direction: {metric_direction}
      Do not use subprocess, os.system, eval, exec, or any network/shell calls.
      Current project files:
      {files_context}
      Run summaries (JSON):
      {run_summaries}'
  iterative_repair:
    system: You fix Python code issues — both static validation errors and runtime
      bugs (NaN, Inf, division by zero, overflow). Diagnose the ROOT CAUSE from
      warnings and error messages. Do not add unsafe behavior.
    user: 'Fix all issues in the experiment code and return corrected Python code
      using ```filename:xxx.py format for each file.
      IMPORTANT: If you see NaN/Inf or RuntimeWarning about division or invalid values,
      trace the bug to its source (e.g. division by zero, uninitialized array, missing
      convergence check) and fix the actual code logic — do NOT just add try/except
      to suppress the error.
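      For example, prefer a logic fix over suppression — a minimal sketch with
      placeholder variable names, not code from any real project:

      ```python
      import math

      gain = 0.5
      step = 0.0                 # pathological value that triggered the warning

      # WRONG: wrapping gain / step in try/except only hides the root cause.
      # RIGHT: remove the source of the invalid value instead:
      step = max(step, 1e-12)    # guard the denominator explicitly
      rate = gain / step
      assert math.isfinite(rate)
      print("rate: %.3e" % rate)
      ```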
      ## Issues Found
      {issue_text}
      ## All Project Files
      {all_files_ctx}'
version: '1.0'