30 changes: 30 additions & 0 deletions Environment.yaml
@@ -0,0 +1,30 @@
name: workflow_dispatcher
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- _openmp_mutex=4.5=20_gnu
- bzip2=1.0.8=hda65f42_9
- ca-certificates=2026.2.25=hbd8a1cb_0
- icu=78.2=h33c6efd_0
- ld_impl_linux-64=2.45.1=default_hbd61a6d_101
- libexpat=2.7.4=hecca717_0
- libffi=3.5.2=h3435931_0
- libgcc=15.2.0=he0feb66_18
- libgomp=15.2.0=he0feb66_18
- liblzma=5.8.2=hb03c661_0
- libmpdec=4.0.0=hb03c661_1
- libsqlite=3.51.2=hf4e2dac_0
- libstdcxx=15.2.0=h934c35e_18
- libuuid=2.41.3=h5347b49_0
- libzlib=1.3.1=hb9d3cd8_2
- ncurses=6.5=h2d0b736_3
- openssl=3.6.1=h35e630c_1
- pip=26.0.1=pyh145f28c_0
- python=3.14.3=h32b2ec7_101_cp314
- python_abi=3.14=8_cp314
- readline=8.3=h853b02a_0
- tk=8.6.13=noxft_h366c992_103
- tzdata=2025c=hc9c84f9_1
- zstd=1.5.7=hb78ec9c_6
82 changes: 81 additions & 1 deletion README.md
@@ -1 +1,81 @@
# workflow_automation

## How to Run

Run on a Slurm node:
- Clone the repo
- Add your workflow to the `workflows` folder
- Create the environment: `conda env create --file=Environment.yaml`
- Activate it: `conda activate workflow_dispatcher`
- Run the dispatcher: `python workflow_dispatcher.py`

## Workflow logic (summary)
- Workflow configurations are loaded from CSV files in `workflows/`.
- Each configuration specifies:
  - `input_data_path` → root folder containing runs
  - `data_regex` → pattern to match FASTQ files
  - `workflow_path` → location of workflow scripts/config
  - `command` → command to execute the workflow
- Run detection:
  - All subfolders under `input_data_path` are scanned recursively (`rglob("*")`).
  - Folders named `workflow_status` are skipped.
- FASTQ pairing:
  - All FASTQ files in a run folder matching `data_regex` are collected.
  - `_R1` and `_R2` in filenames are removed to determine sample names.
  - Only complete R1/R2 pairs are kept.
  - Files starting with `Undetermined` are ignored.
- Sample sheet creation:
  - For each run, a CSV file `samples.csv` is written to `workflow_path/config/pep`.
- Workflow status handling:
  - A `workflow_status` folder may exist in the run folder (it is not created automatically).
  - `.run` and `.done` flags indicate whether a workflow is running or completed.
  - If another workflow is already running in a different run folder, submission is skipped.
- Workflow submission:
  - Each run is submitted as a Slurm job.
  - On job start, a `.run` flag is created; on success, `.done`; on failure, `.failed`.
- Execution order:
  - Only one workflow is submitted at a time per configuration.
  - Subfolders are treated as separate runs, independent of each other.
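The run detection and R1/R2 pairing described above can be sketched roughly as follows. This is an illustrative simplification, not the actual code in `workflow_dispatcher.py`; the function name `paired_samples` is made up for this sketch:

```python
import re
from collections import defaultdict
from pathlib import Path


def paired_samples(run_folder: Path, data_regex: str) -> dict:
    """Collect complete R1/R2 FASTQ pairs in a run folder."""
    pattern = re.compile(data_regex)
    pairs = defaultdict(dict)
    for fastq in run_folder.rglob("*"):
        if "workflow_status" in fastq.parts:       # status folders are skipped
            continue
        if fastq.name.startswith("Undetermined"):  # undetermined reads are ignored
            continue
        if not pattern.search(fastq.name):
            continue
        for read in ("_R1", "_R2"):
            if read in fastq.name:
                # strip _R1/_R2 to derive the sample name
                sample = fastq.name.replace(read, "")
                pairs[sample][read] = fastq
    # keep only samples for which both reads were found
    return {s: p for s, p in pairs.items() if len(p) == 2}
```

Samples with a missing mate file simply drop out of the result, which matches the "only complete pairs are kept" rule above.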

## Troubleshooting

### Folder locked
**Problem:**
LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/groups/ds/automation/qc_pipeline_test/QC_pre_NextSeq

**Solution:**
`cd /path/to/snakemake/workflow`
`conda activate snakemake_9_slurm`
`snakemake --unlock`

## Add your workflow to the running automation

- Create a CSV file in the `workflows` folder
- The CSV must contain the header: `name,input_data_path,data_regex,workflow_path,command`
  - `name`: name of your workflow
  - `input_data_path`: folder that is monitored for new FASTQ files
  - `data_regex`: regex matched by all files that should be processed
  - `workflow_path`: path to the Snakemake workflow version that shall be executed
  - `command`: terminal command that is executed from the `workflow_path` to start the workflow
- Each CSV file can contain multiple rows with different regexes and commands for the same workflow
- Example: `workflows/qc_test.csv`
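Such a configuration row can be read with a plain `csv.DictReader`. The values below are made-up placeholders, not the real contents of `workflows/qc_test.csv`:

```python
import csv
import io

# Illustrative row -- real values live in the CSV files under workflows/
example = io.StringIO(
    "name,input_data_path,data_regex,workflow_path,command\n"
    "qc,/data/incoming,.*\\.fastq\\.gz,/opt/workflows/qc,snakemake --cores all\n"
)

configs = list(csv.DictReader(example))
```

Each row becomes a dict keyed by the header fields, so multiple rows per file naturally yield multiple independent configurations.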

## Containerise your workflow
See `containerisation/README.md`

## TBD
- Shall a new samples.csv sheet be created for every run? -> Can we just overwrite samples.csv? -> Yes
- config.yaml `run_date: ""` has to be overwritten as well. Is it exactly the same everywhere? -> Shall be overwritten every time
- Is it sufficient to have only one active container per workflow (serial instead of parallel processing)? -> Yes
- Base Snakemake environment to start the container -> See shared user
- The containers use the unpacked DBs on ds/groups -> What's the maximum size here? -> 140 GB
- What happens if the same workflow starts for two different sequencer outputs? Will that happen? One workflow container per sequencer? -> One container is sufficient
- Can multiple users control one cron job? Automation user that can be used by different people? -> Ask Marcel
- Clean up -> move results to an "output" folder and delete everything else? -> Not needed

## ToDo
- Ignore empty fastq.gz files like "/projects/seqlab/incoming-humgen/20260323_LH00204_0121_B23JH5HLT3/Analysis/1/Data/BCLConvert/fastq/WWOZ2501*"
- Remove the hard-coded conda env from workflow_dispatcher.py
- Add tests
42 changes: 42 additions & 0 deletions containerisation/README.md
@@ -0,0 +1,42 @@
# Containerise your workflow

Complete all of the following steps with your own workflow. Copy the `containerisation` folder from this repository into your own Snakemake workflow.

## Generate Container

### Create Dockerfile

`snakemake --containerize > containerisation/Dockerfile`

### Generate .def

`python containerisation/dockerfile_to_singularity.py containerisation/Dockerfile --output containerisation/my_container.def`

### Generate .sif

`apptainer build containerisation/my_container.sif containerisation/my_container.def`

## Execute workflow with container

### Bind container to workflow

Make sure your `Snakefile` links to the container via:
`containerized: "containerisation/my_container.sif"`

### Run workflow with container

Activate an environment with Snakemake 9, then run:
`snakemake --cores all --software-deployment-method conda apptainer --singularity-args "--bind /groups/ds/databases_refGenomes/databases"`

Run via Slurm (`-n` performs a dry run; remove it to actually submit):
`nice snakemake --cores all --software-deployment-method conda apptainer --singularity-args "--bind /groups/ds/databases_refGenomes/databases" --jobs 2 -n`

### Mount Database
The --singularity-args option allows passing additional arguments to the container runtime (Apptainer/Singularity). In this case, `--bind /groups/ds/databases_refGenomes/databases` mounts a host directory into the container so that reference databases are accessible during execution.

If your databases are stored in a different location, you must adjust this path accordingly. The general format is `--bind <host_path>:<container_path>`, where `<host_path>` is the directory on your system and `<container_path>` is the path inside the container (if omitted, the same path is used inside the container).
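The `<host_path>:<container_path>` rule can be illustrated with a tiny helper. This function is hypothetical, written only for this README; the real parsing is done by Apptainer itself, which additionally supports an options suffix (`src:dest:opts`) that this sketch ignores:

```python
def resolve_bind(spec: str) -> tuple[str, str]:
    """Split an Apptainer --bind spec into (host_path, container_path).

    If no container path is given, the host path is reused inside the
    container, mirroring Apptainer's default behaviour.
    """
    host, sep, container = spec.partition(":")
    return (host, container) if sep else (host, host)
```

So `--bind /groups/ds/databases_refGenomes/databases` mounts that directory at the same path inside the container, while `--bind /data/dbs:/mnt/dbs` remaps it.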

## Further information

https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#containerization-of-conda-based-workflows

Example workflow: https://github.com/IKIM-Essen/QC_pre_NextSeq/tree/sm_9_automation
138 changes: 138 additions & 0 deletions containerisation/dockerfile_to_singularity.py
@@ -0,0 +1,138 @@
#!/usr/bin/env python3
"""
Convert a Snakemake --containerize Dockerfile to a Singularity .def file.

Usage:
    python dockerfile_to_singularity.py Dockerfile [-o output.def]
    cat Dockerfile | python dockerfile_to_singularity.py - -o output.def
"""

import argparse
import sys
from pathlib import Path


def convert(dockerfile_path: str, output_path: str | None = None):
if dockerfile_path == "-":
lines = sys.stdin.read().splitlines()
if output_path is None:
print("Error: -o/--output is required when reading from stdin")
sys.exit(1)
else:
dockerfile = Path(dockerfile_path)
if not dockerfile.exists():
print(f"Error: {dockerfile_path} not found")
sys.exit(1)
if output_path is None:
output_path = str(dockerfile.with_suffix(".def"))
lines = dockerfile.read_text().splitlines()

joined_lines = []
buffer = ""

for line in lines:
stripped = line.rstrip()

if stripped.endswith("\\"):
buffer += stripped[:-1] + " "
else:
buffer += stripped
joined_lines.append(buffer)
buffer = ""

lines = joined_lines

bootstrap = "docker"
base_image = None
local_files = [] # (src, dst) from COPY
remote_files = [] # (url, dst) from ADD
mkdirs = []
conda_envs = []

for line in lines:
line = line.strip()

if line.startswith("FROM "):
base_image = line[5:].strip()

elif line.startswith("RUN mkdir -p "):
mkdir_paths = line[13:].strip().split()
mkdirs.extend(mkdir_paths)

elif line.startswith("COPY "):
parts = line[5:].strip().split()
if len(parts) == 2:
local_files.append((parts[0], parts[1]))

elif line.startswith("ADD "):
parts = line[4:].strip().split()
if len(parts) == 2:
remote_files.append((parts[0], parts[1]))

elif line.startswith("RUN "):
            cmd = line[4:].strip()  # everything after RUN

# Split chained commands
parts = [c.strip() for c in cmd.split("&&")]

for part in parts:
if part.startswith("conda env create"):
conda_envs.append(part)

if base_image is None:
print("Error: could not find FROM in Dockerfile")
sys.exit(1)

out = []
out.append(f"Bootstrap: {bootstrap}")
out.append(f"From: {base_image}")
out.append("")

# %files section (local files only)
if local_files:
out.append("%files")
for src, dst in local_files:
out.append(f" {src} {dst}")
out.append("")

# %post section
out.append("%post")

# mkdirs
for cmd in mkdirs:
out.append(f" mkdir -p {cmd}")
out.append("")

# install build tools (needed for source-compiled packages like whatshap)
out.append(" apt-get update && apt-get install -y g++ gcc")
out.append("")

# download remote files with wget
if remote_files:
for url, dst in remote_files:
out.append(f" wget -q {url} -O {dst}")
out.append("")

# conda env creates
for cmd in conda_envs:
out.append(f" {cmd}")
out.append(" conda clean --all -y")
out.append("")

result = "\n".join(out)
Path(output_path).write_text(result)
print(f"Written to {output_path}")


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Convert a Snakemake --containerize Dockerfile to a Singularity .def file."
)
parser.add_argument("dockerfile", help="Path to the input Dockerfile")
parser.add_argument(
"-o",
"--output",
help="Path to the output .def file (default: input path with .def extension)",
default=None,
)
args = parser.parse_args()
convert(args.dockerfile, args.output)