README.md
Lines changed: 14 additions & 0 deletions
@@ -69,19 +69,33 @@ python TEforest.py \
69
69
--ref_model <path/to/reference_model.pkl> \
70
70
--fq_base_path <path/to/fastq/files> \
71
71
--cleanup_intermediates \
72
+
--disable_reference_detection \
73
+
--dry_run \
72
74
--samples A1 A2 A3
73
75
```
74
76
75
77
- **`--workflow_dir`**: Directory containing the `Snakefile` (`workflow/Snakefile`).
76
78
- **`--workdir`**: Directory to store outputs and logs.
77
79
- **`--threads`**: Number of CPU threads to use. 16 per sample is recommended.
78
80
- **`--consensusTEs`, `--ref_genome`, `--ref_te_locations`, `--euchromatin`**: Input reference files for TE detection. All calls outside of the regions denoted in euchromatin will be filtered. Example files used for Drosophila melanogaster are located in example_files/. Be aware that BWA-mem2 will treat IUPAC bases as missing, so TEforest may have reduced performance on consensus sequences with high IUPAC content.
81
+
- Current reference BED usage in inference: columns 1/2/3 are used as genomic coordinates, and column 7 is used as the TE family ID.
82
+
- Columns 4/5/6 and any trailing columns are accepted but are not used by the pipeline.
83
+
- BED can be tab-delimited or whitespace-delimited.
79
84
- **`--model`**: Path to the non-reference model (optional). If omitted, TEforest auto-selects a model based on the observed coverage (5X/10X/20X/30X/40X/50X). If the data are not downsampled (e.g., 48X), the next highest model is chosen (50X).
80
85
- **`--ref_model`**: Path to the reference model (optional). Auto-selection follows the same coverage logic as above.
81
86
- **`--fq_base_path`**: Directory containing FASTQ files. TEforest will match common read naming conventions (e.g., `_R1/_R2`, `_1/_2`, `.1/.2`, `R1/R2`, lane tokens like `_L001_R1_001`) as long as the sample name appears in the filename.
82
87
- **`--cleanup_intermediates`**: Optional flag to delete large intermediate files after they are used (e.g., `fastp/`, `aligned/`, `downsampled/`, `candidate_regions_data/`). Omit this if you want to keep read alignments or candidate-region BAMs for debugging.
88
+
- **`--disable_reference_detection`**: Optional flag to skip reference TE feature-vector creation and reference model prediction. This can be useful for genomes with very large numbers of old reference TEs, where reference detection can dominate runtime.
89
+
- **`--dry_run`**: Optional flag to run `snakemake --dry-run` through the wrapper, so you can validate file naming, inputs, and DAG construction before launching compute-heavy jobs.
83
90
- **`--samples`**: List of sample identifiers to process (space-separated). Note that more than one sample can be run in parallel.
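
The coverage-based auto-selection described for `--model` and `--ref_model` can be pictured as picking the smallest model tier at or above the observed coverage. The sketch below is only an illustration of that rule: `pick_model_tier` is a hypothetical helper, not TEforest's own code, and falling back to the 50X model for coverage above 50X is an assumption that the README does not state.

```python
from bisect import bisect_left

# Coverage tiers listed in the README for the available models.
MODEL_TIERS = [5, 10, 20, 30, 40, 50]


def pick_model_tier(observed_coverage):
    """Return the smallest tier >= observed coverage (hypothetical helper)."""
    idx = bisect_left(MODEL_TIERS, observed_coverage)
    # Assumption: coverage above the largest tier falls back to the 50X model.
    idx = min(idx, len(MODEL_TIERS) - 1)
    return MODEL_TIERS[idx]
```

Under these assumptions, `pick_model_tier(48)` returns `50`, which matches the 48X example given for `--model`.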
84
91
92
+
Input validation (runs for both normal execution and `--dry_run`):
93
+
- FASTQs are resolved per sample/read, must exist, be non-empty, and have a valid first FASTQ record.
94
+
- Reference BED must have at least 7 whitespace-delimited columns.
95
+
- Reference BED chromosome names (column 1) must match sequence headers in `ref_genome`.
96
+
- Every TE ID in reference BED column 7 must be present in `consensusTEs` FASTA headers.
97
+
- Extra TE families in the consensus FASTA are allowed.
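
The reference BED checks listed above can be approximated before launching a run. The following is a minimal sketch, not the pipeline's actual validator: `fasta_names` and `check_reference_bed` are hypothetical helpers, and treating the first whitespace-delimited token of a FASTA header as the sequence name is an assumption. It does not handle BED comment or `track` lines.

```python
def fasta_names(fasta_text):
    """Collect sequence names: the first token after '>' on each header line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def check_reference_bed(bed_text, genome_names, consensus_names):
    """Return a list of human-readable problems found in a reference BED."""
    problems = []
    for lineno, line in enumerate(bed_text.splitlines(), start=1):
        if not line.strip():
            continue  # Assumption: blank lines are ignored.
        fields = line.split()  # Splits on tabs and on runs of spaces alike.
        if len(fields) < 7:
            problems.append(
                f"line {lineno}: expected >= 7 columns, found {len(fields)}"
            )
            continue
        chrom, te_family = fields[0], fields[6]  # Columns 1 and 7.
        if chrom not in genome_names:
            problems.append(
                f"line {lineno}: chromosome {chrom!r} not in ref_genome"
            )
        if te_family not in consensus_names:
            problems.append(
                f"line {lineno}: TE family {te_family!r} not in consensusTEs"
            )
    return problems
```

An empty list means the BED passed these three checks. Families that appear only in the consensus FASTA never produce a problem, in line with the last rule above.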
98
+
85
99
The script will generate:
86
100
- A `config.yaml` in your specified `workdir` with all parameters.