Commit 4ddd108

Merge pull request #3644 from nf-core/chris-storage
docs: Managing Nextflow work directory growth
2 parents 2c85a42 + 036858f commit 4ddd108

1 file changed, 145 additions and 0 deletions

@@ -0,0 +1,145 @@
---
title: "Managing Nextflow work directory growth"
subtitle: "A guide for efficient storage utilization"
---

Managing the intermediate files that Nextflow generates during pipeline execution is a challenge for some workflows. As pipelines increase in complexity and scale, work directories can rapidly consume available storage, potentially leading to pipeline failures. This guide summarizes strategies for managing work directory growth while maintaining pipeline reproducibility and debugging capabilities.

## Work directory accumulation

The Nextflow work directory is central to the execution model: it holds the task results that caching and resume functionality rely on. During pipeline execution, Nextflow creates a unique subdirectory for each task (e.g., `work/3f/70944c7a549b6221e1ccc7b4b21b62`) containing:

- Symbolic links to input files
- Intermediate output files
- Command scripts and execution logs
- Temporary files created during tasks

In production environments processing large-scale genomic datasets, work directories can expand rapidly because these task subdirectories persist until they are cleaned up.

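Before choosing a strategy, it can help to see which tasks account for most of the space. The following is a minimal sketch using standard tools; it assumes the default two-level layout shown above, and the function name `largest_task_dirs` is invented for this example. Because `du` does not follow symbolic links by default, staged inputs are not counted twice.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the largest directories two levels below a work directory.
# Assumes the default layout: work/<2-character prefix>/<task hash>/
largest_task_dirs() {
    local work_dir="${1:-work}"   # work directory to inspect
    local count="${2:-10}"        # number of entries to show

    if [ ! -d "$work_dir" ]; then
        echo "No such directory: $work_dir" >&2
        return 1
    fi

    # du -sk prints one "size_in_KiB<TAB>path" line per directory;
    # sort -rn orders the lines by size, largest first.
    find "$work_dir" -mindepth 2 -maxdepth 2 -type d -exec du -sk {} + \
        | sort -rn \
        | head -n "$count"
}
```

For example, `largest_task_dirs work 5` lists the five largest task directories, while `du -sh work` reports the total size.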
## Storage management options

The following sections describe ways to limit the storage consumed by work directories.

### Selective post-execution cleanup

Nextflow's built-in `clean` command enables targeted removal of work directories. The following command preserves work directories from the most recent execution while removing directories from earlier executions:

```bash
nextflow clean -f -before $(nextflow log -q | tail -n 1)
```

Command components:

- `nextflow log -q{:bash}`: Returns a list of run names, one per line
- `tail -n 1{:bash}`: Selects the last entry, which is the most recent run name
- `-before{:bash}`: Removes the work directories of runs executed before the specified run
- `-f{:bash}`: Executes deletion without confirmation

To verify what would be deleted, perform a dry run using the `-n{:bash}` option instead of `-f{:bash}`:

```bash
nextflow clean -n -before $(nextflow log -q | tail -n 1)
```

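When the cleanup runs from a script, it may be worth guarding against an empty run history, since the command substitution would otherwise pass nothing to `-before`. The wrapper below is a sketch, not part of Nextflow: the function name `clean_previous_runs` and the `DRY_RUN` variable are invented for this example. It assumes it is run from the pipeline's launch directory and that `nextflow log -q` prints nothing to standard output when no history exists.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the clean commands shown in this section.
# DRY_RUN=1 (the default in this sketch) previews with -n; DRY_RUN=0 deletes with -f.
clean_previous_runs() {
    local latest_run
    # Most recent run name; assumed to be empty when there is no run history.
    latest_run="$(nextflow log -q | tail -n 1)"

    if [ -z "$latest_run" ]; then
        echo "No previous runs found; nothing to clean." >&2
        return 0
    fi

    if [ "${DRY_RUN:-1}" = "0" ]; then
        nextflow clean -f -before "$latest_run"
    else
        nextflow clean -n -before "$latest_run"
    fi
}
```

Calling `clean_previous_runs` prints what would be removed, and `DRY_RUN=0 clean_previous_runs` performs the deletion.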
### Automated cleanup configuration

Nextflow supports automatic work directory cleanup upon successful pipeline completion through the `cleanup` configuration option:

```groovy title="nextflow.config"
cleanup = true
```

:::note
Enabling automatic cleanup prevents the use of resume functionality for the affected pipeline execution. This configuration suits production pipelines where output reproducibility is assured and resume capability isn't required.
:::

### Scratch directory implementation

The `scratch` directive runs a process in a temporary directory, typically local to the compute node, and copies only the declared outputs back to the work directory:

```groovy
process SEQUENCE_ALIGNMENT {
    scratch true

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("*.bam")

    script:
    """
    # reference.fa is a placeholder: a real process would declare the
    # reference and its index files as path inputs so they are staged
    bwa mem reference.fa ${reads} > aligned.sam
    samtools sort aligned.sam > sorted.bam
    samtools index sorted.bam
    # Only sorted.bam (matching "*.bam") transfers to the work directory;
    # aligned.sam and sorted.bam.bai stay in scratch and are discarded
    """
}
```

:::tip
This configuration is particularly beneficial in HPC environments, where it reduces load on the shared network filesystem.
:::

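Because a task running with `scratch true` writes its files to node-local storage first, it can be useful to check how much space that storage has. The helper below is a sketch under stated assumptions: the name `scratch_free_kib` is invented here, and it assumes scratch directories are created under `$TMPDIR`, falling back to `/tmp` when that variable is unset.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report free space in KiB for the assumed scratch location.
scratch_free_kib() {
    local scratch_dir="${1:-${TMPDIR:-/tmp}}"
    # df -P prints one line per filesystem; -k reports 1024-byte blocks.
    # Line 2, column 4 holds the available space.
    df -Pk "$scratch_dir" | awk 'NR == 2 { print $4 }'
}
```

Running `scratch_free_kib` on a compute node, or `scratch_free_kib /path/to/local/disk` for a specific location, gives a figure to compare against the expected size of a task's intermediate files.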
### Dynamic intermediate file management with nf-boost

The [`nf-boost`](https://registry.nextflow.io/plugins/nf-boost) plugin removes intermediate files during pipeline execution once they are no longer needed:

```groovy title="nextflow.config"
plugins {
    id 'nf-boost'
}

boost {
    cleanup = true
    cleanupInterval = '180s' // Cleanup evaluation interval
}
```

See [nf-boost](https://github.com/bentsherman/nf-boost) for more information.

### Pipeline optimization strategies

Minimize intermediate file generation through process optimization. The following process shows three alternative techniques; each one writes `final_results.txt` on its own, so a real process would use only one of them:

```groovy
process OPTIMIZED_ANALYSIS {
    input:
    path input_data

    output:
    path "final_results.txt"

    script:
    """
    # Option 1: utilize pipe operations to avoid intermediate files
    initial_process ${input_data} | \\
        intermediate_transform | \\
        final_analysis > final_results.txt

    # Option 2: implement named pipes for file-dependent tools
    mkfifo temp_pipe
    producer_process ${input_data} > temp_pipe &
    consumer_process temp_pipe > final_results.txt
    rm temp_pipe

    # Option 3: use process substitution to avoid intermediate files
    paste <( cat ${input_data} ) <( cat ${input_data} ) > final_results.txt
    """
}
```

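The process above uses placeholder tool names, so it cannot be run as written. As a runnable illustration of the pipe and named-pipe techniques, the sketch below uses only standard tools (`gzip`, `sort`, `wc`, `mkfifo`). The function names and the task of counting unique lines are invented for this example.

```bash
#!/usr/bin/env bash
# Runnable illustration of streaming patterns; names are placeholders.

# Pattern 1: a plain pipe. The decompressed data is never written to disk.
unique_count_pipe() {
    local input_gz="$1"
    gzip -dc "$input_gz" | sort -u | wc -l
}

# Pattern 2: a named pipe, for a consumer that needs a file path argument.
unique_count_fifo() {
    local input_gz="$1"
    local tmp_dir fifo
    tmp_dir="$(mktemp -d)"
    fifo="$tmp_dir/decompressed.fifo"
    mkfifo "$fifo"

    # The producer writes into the FIFO in the background...
    gzip -dc "$input_gz" > "$fifo" &
    # ...while the consumer reads it as if it were a regular file.
    sort -u "$fifo" | wc -l

    wait               # reap the background producer
    rm -rf "$tmp_dir"  # remove the FIFO and its directory
}
```

When adapting a pattern like this inside a Nextflow `script:` block, shell variables need an escaped dollar sign (for example `\$fifo`), because an unescaped `$` is interpolated by Nextflow.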
## Recommendations

Effective management of Nextflow work directories requires an approach tailored to the environment. The Nextflow `clean` command recovers storage and keeps the work directory maintained, but any cleanup must balance storage savings against the need to resume and debug pipelines.

For development environments, combine reduced test datasets with manual cleanup for optimal flexibility. Production deployments benefit from automatic cleanup or dynamic solutions like `nf-boost`. HPC installations should leverage scratch directories to minimize shared storage impact.

You should establish storage management policies incorporating:

- Regular maintenance schedules
- Environment-specific configuration profiles
- Capacity planning procedures
- Documentation of cleanup strategies

By applying these strategies systematically, you can keep pipelines running efficiently and prevent storage exhaustion as workloads scale.
