30 changes: 30 additions & 0 deletions Environment.yaml
@@ -0,0 +1,30 @@
name: workflow_dispatcher
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- _openmp_mutex=4.5=20_gnu
- bzip2=1.0.8=hda65f42_9
- ca-certificates=2026.2.25=hbd8a1cb_0
- icu=78.2=h33c6efd_0
- ld_impl_linux-64=2.45.1=default_hbd61a6d_101
- libexpat=2.7.4=hecca717_0
- libffi=3.5.2=h3435931_0
- libgcc=15.2.0=he0feb66_18
- libgomp=15.2.0=he0feb66_18
- liblzma=5.8.2=hb03c661_0
- libmpdec=4.0.0=hb03c661_1
- libsqlite=3.51.2=hf4e2dac_0
- libstdcxx=15.2.0=h934c35e_18
- libuuid=2.41.3=h5347b49_0
- libzlib=1.3.1=hb9d3cd8_2
- ncurses=6.5=h2d0b736_3
- openssl=3.6.1=h35e630c_1
- pip=26.0.1=pyh145f28c_0
- python=3.14.3=h32b2ec7_101_cp314
- python_abi=3.14=8_cp314
- readline=8.3=h853b02a_0
- tk=8.6.13=noxft_h366c992_103
- tzdata=2025c=hc9c84f9_1
- zstd=1.5.7=hb78ec9c_6
82 changes: 81 additions & 1 deletion README.md
@@ -1 +1,81 @@
# workflow_automation

## How to Run

Run on a Slurm node:
- Clone the repo
- Add your workflow to the `workflows` folder
- Create the environment: `conda env create --file=Environment.yaml`
- Activate it: `conda activate workflow_dispatcher`
- Run the dispatcher: `python workflow_dispatcher.py`

## Workflow logic (summary)
- Workflow configurations are loaded from CSV files in `workflows/`.
- Each configuration specifies:
  - `input_data_path` → root folder containing runs
  - `data_regex` → pattern to match FASTQ files
  - `workflow_path` → location of workflow scripts/config
  - `command` → command to execute the workflow
- Run detection:
  - All subfolders under `input_data_path` are scanned recursively (`rglob("*")`).
  - Folders named `workflow_status` are skipped.
- FASTQ pairing:
  - All FASTQ files in a run folder matching `data_regex` are collected.
  - `_R1` and `_R2` in filenames are removed to determine sample names.
  - Only complete R1/R2 pairs are kept.
  - Files starting with `Undetermined` are ignored.
- Sample sheet creation:
  - For each run, a CSV file `samples.csv` is written to `workflow_path/config/pep`.
- Workflow status handling:
  - A `workflow_status` folder may exist in the run folder (it is not created automatically).
  - `.run` and `.done` flags indicate whether a workflow is running or completed.
  - If another workflow is already running in a different run folder, submission is skipped.
- Workflow submission:
  - Each run is submitted as a Slurm job.
  - On job start, a `.run` flag is created; on success, `.done`; on failure, `.failed`.
- Execution order:
  - Only one workflow is submitted at a time per configuration.
  - Subfolders are treated as separate runs, independent of each other.
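The run detection and R1/R2 pairing described above can be sketched roughly as follows. This is an illustrative simplification, not the actual code in `workflow_dispatcher.py`; the function name `paired_samples` is made up for this sketch:

```python
import re
from collections import defaultdict
from pathlib import Path


def paired_samples(run_folder: Path, data_regex: str) -> dict:
    """Collect complete R1/R2 FASTQ pairs in a run folder."""
    pattern = re.compile(data_regex)
    pairs = defaultdict(dict)
    for fastq in run_folder.rglob("*"):
        if "workflow_status" in fastq.parts:       # status folders are skipped
            continue
        if fastq.name.startswith("Undetermined"):  # undetermined reads are ignored
            continue
        if not pattern.search(fastq.name):
            continue
        for read in ("_R1", "_R2"):
            if read in fastq.name:
                # strip _R1/_R2 to derive the sample name
                sample = fastq.name.replace(read, "")
                pairs[sample][read] = fastq
    # keep only samples for which both reads were found
    return {s: p for s, p in pairs.items() if len(p) == 2}
```

Samples with a missing mate file simply drop out of the result, which matches the "only complete pairs are kept" rule above.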

## Troubleshooting

### Folder locked
**Problem:**
LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/groups/ds/automation/qc_pipeline_test/QC_pre_NextSeq

**Solution:**
`cd /path/to/snakemake/workflow`
`conda activate snakemake_9_slurm`
`snakemake --unlock`

## Add your workflow to the running automation

- Create a CSV file in the `workflows` folder
- The CSV must contain the header: `name,input_data_path,data_regex,workflow_path,command`
  - `name`: name of your workflow
  - `input_data_path`: folder that is monitored for new FASTQ files
  - `data_regex`: regex matched by all files that should be processed
  - `workflow_path`: path to the Snakemake workflow version that shall be executed
  - `command`: terminal command that is executed from the `workflow_path` to start the workflow
- Each CSV file can contain multiple rows with different regexes and commands for the same workflow
- Example: `workflows/qc_test.csv`
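Such a configuration row can be read with a plain `csv.DictReader`. The values below are made-up placeholders, not the real contents of `workflows/qc_test.csv`:

```python
import csv
import io

# Illustrative row -- real values live in the CSV files under workflows/
example = io.StringIO(
    "name,input_data_path,data_regex,workflow_path,command\n"
    "qc,/data/incoming,.*\\.fastq\\.gz,/opt/workflows/qc,snakemake --cores all\n"
)

configs = list(csv.DictReader(example))
```

Each row becomes a dict keyed by the header fields, so multiple rows per file naturally yield multiple independent configurations.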

## Containerise your workflow
See `containerisation/README.md`

## TBD
- Shall a new samples.csv sheet be created for every run? -> Can we just overwrite samples.csv? -> Yes
- config.yaml `run_date: ""` has to be overwritten as well. Is it exactly the same everywhere? -> Shall be overwritten every time
- Is it sufficient to have only one active container per workflow (serial instead of parallel processing)? -> Yes
- Base Snakemake environment to start the container -> See shared user
- The containers use the unpacked DBs on ds/groups -> What's the maximum size here? -> 140 GB
- What happens if the same workflow starts for two different sequencer outputs? Will that happen? One workflow container per sequencer? -> One container is sufficient
- Can multiple users control one cron job? Automation user that can be used by different people? -> Ask Marcel
- Clean up -> move results to an "output" folder and delete everything else? -> Not needed

## ToDo
- Ignore empty fastq.gz files like "/projects/seqlab/incoming-humgen/20260323_LH00204_0121_B23JH5HLT3/Analysis/1/Data/BCLConvert/fastq/WWOZ2501*"
- Remove the hard-coded conda env from workflow_dispatcher.py
- Add tests
42 changes: 42 additions & 0 deletions containerisation/README.md
@@ -0,0 +1,42 @@
# Containerise your workflow

Complete all of the following steps with your own workflow. Copy the `containerisation` folder from this repository into your own Snakemake workflow.

## Generate Container

### Create Dockerfile

`snakemake --containerize > containerisation/Dockerfile`

### Generate .def

`python containerisation/dockerfile_to_singularity.py containerisation/Dockerfile --output containerisation/my_container.def`

### Generate .sif

`apptainer build containerisation/my_container.sif containerisation/my_container.def`

## Execute workflow with container

### Bind container to workflow

Make sure your `Snakefile` links to the container via:
`containerized: "containerisation/my_container.sif"`

### Run workflow with container

Activate an environment with Snakemake 9, then run:
`snakemake --cores all --software-deployment-method conda apptainer --singularity-args "--bind /groups/ds/databases_refGenomes/databases"`

Run via Slurm (`-n` performs a dry run; remove it to actually submit):
`nice snakemake --cores all --software-deployment-method conda apptainer --singularity-args "--bind /groups/ds/databases_refGenomes/databases" --jobs 2 -n`

### Mount Database
The --singularity-args option allows passing additional arguments to the container runtime (Apptainer/Singularity). In this case, `--bind /groups/ds/databases_refGenomes/databases` mounts a host directory into the container so that reference databases are accessible during execution.

If your databases are stored in a different location, you must adjust this path accordingly. The general format is `--bind <host_path>:<container_path>`, where `<host_path>` is the directory on your system and `<container_path>` is the path inside the container (if omitted, the same path is used inside the container).
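The `<host_path>:<container_path>` rule can be illustrated with a tiny helper. This function is hypothetical, written only for this README; the real parsing is done by Apptainer itself, which additionally supports an options suffix (`src:dest:opts`) that this sketch ignores:

```python
def resolve_bind(spec: str) -> tuple[str, str]:
    """Split an Apptainer --bind spec into (host_path, container_path).

    If no container path is given, the host path is reused inside the
    container, mirroring Apptainer's default behaviour.
    """
    host, sep, container = spec.partition(":")
    return (host, container) if sep else (host, host)
```

So `--bind /groups/ds/databases_refGenomes/databases` mounts that directory at the same path inside the container, while `--bind /data/dbs:/mnt/dbs` remaps it.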

## Further information

https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#containerization-of-conda-based-workflows

Example workflow: https://github.com/IKIM-Essen/QC_pre_NextSeq/tree/sm_9_automation
138 changes: 138 additions & 0 deletions containerisation/dockerfile_to_singularity.py
@@ -0,0 +1,138 @@
#!/usr/bin/env python3
"""
Convert a Snakemake --containerize Dockerfile to a Singularity .def file.

Usage:
    python dockerfile_to_singularity.py Dockerfile [-o output.def]
    cat Dockerfile | python dockerfile_to_singularity.py - -o output.def
"""

import argparse
import sys
from pathlib import Path


def convert(dockerfile_path: str, output_path: str | None = None):
if dockerfile_path == "-":
lines = sys.stdin.read().splitlines()
if output_path is None:
print("Error: -o/--output is required when reading from stdin")
sys.exit(1)
else:
dockerfile = Path(dockerfile_path)
if not dockerfile.exists():
print(f"Error: {dockerfile_path} not found")
sys.exit(1)
if output_path is None:
output_path = str(dockerfile.with_suffix(".def"))
lines = dockerfile.read_text().splitlines()

joined_lines = []
buffer = ""

for line in lines:
stripped = line.rstrip()

if stripped.endswith("\\"):
buffer += stripped[:-1] + " "
else:
buffer += stripped
joined_lines.append(buffer)
buffer = ""

lines = joined_lines

bootstrap = "docker"
base_image = None
local_files = [] # (src, dst) from COPY
remote_files = [] # (url, dst) from ADD
mkdirs = []
conda_envs = []

for line in lines:
line = line.strip()

if line.startswith("FROM "):
base_image = line[5:].strip()

elif line.startswith("RUN mkdir -p "):
mkdir_paths = line[13:].strip().split()
mkdirs.extend(mkdir_paths)

elif line.startswith("COPY "):
parts = line[5:].strip().split()
if len(parts) == 2:
local_files.append((parts[0], parts[1]))

elif line.startswith("ADD "):
parts = line[4:].strip().split()
if len(parts) == 2:
remote_files.append((parts[0], parts[1]))

elif line.startswith("RUN "):
            cmd = line[4:].strip()  # everything after RUN

# Split chained commands
parts = [c.strip() for c in cmd.split("&&")]

for part in parts:
if part.startswith("conda env create"):
conda_envs.append(part)

if base_image is None:
print("Error: could not find FROM in Dockerfile")
sys.exit(1)

out = []
out.append(f"Bootstrap: {bootstrap}")
out.append(f"From: {base_image}")
out.append("")

# %files section (local files only)
if local_files:
out.append("%files")
for src, dst in local_files:
out.append(f" {src} {dst}")
out.append("")

# %post section
out.append("%post")

# mkdirs
for cmd in mkdirs:
out.append(f" mkdir -p {cmd}")
out.append("")

# install build tools (needed for source-compiled packages like whatshap)
out.append(" apt-get update && apt-get install -y g++ gcc")
out.append("")

# download remote files with wget
if remote_files:
for url, dst in remote_files:
out.append(f" wget -q {url} -O {dst}")
out.append("")

# conda env creates
for cmd in conda_envs:
out.append(f" {cmd}")
out.append(" conda clean --all -y")
out.append("")

result = "\n".join(out)
Path(output_path).write_text(result)
print(f"Written to {output_path}")


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Convert a Snakemake --containerize Dockerfile to a Singularity .def file."
)
parser.add_argument("dockerfile", help="Path to the input Dockerfile")
parser.add_argument(
"-o",
"--output",
help="Path to the output .def file (default: input path with .def extension)",
default=None,
)
args = parser.parse_args()
convert(args.dockerfile, args.output)