speedup demultiplex barcode by parallelisation #198
Hi Benjamin, thanks for writing. The benchmark you provide on real data (6 days) is kind of unacceptable, so thanks for bringing it to my attention. I had a similar line of thinking last night regarding a speedup that I would like to try today or this week. I would like to modify the new python demuxing script originally conceived by @BELKHIR in #190 to focus on a single sample at a time, and have that be parallelized through snakemake. If this works the way I'm hoping it will, it will also be flexible to changes in the schema, such as the inclusion of a new sample (probably a rare case, but it would work regardless). The idea is this (pseudocode):

rule demux:
    input:
        data_R1.fq
        data_R2.fq
        data_I1.fq
        data_I2.fq
        segments a,b,c,d
    output:
        pipe({sample}.R1.fq)
        pipe({sample}.R2.fq)
    params:
        sample = lambda wc: wc.get("sample")
        id_segment = lambda wc: sample_dict(wc.sample)
    script:
        scripts/demux_gen1.py {params} {input}

rule compress:
    input:
        {sample}.R{FR}.fq
    output:
        {sample}.R{FR}.fq.gz
    shell:
        "gzip {input}"
The work for that can be seen in PR #200.
On the [admittedly tiny] test data, the parallelized-demux-by-sample seems to be performant. Once all checks pass, would you be willing to try a dev build on your data to see how it performs in a real setting? I'd also like to rope @BELKHIR into this, as I made some modifications to their python script that are worth noting. Preamble: Given that the pythonic approach you provided essentially liberates us from the original Chan method, there is a lot of freedom to make it work in a way that seems more sensible for general use. With that in mind:
This format allows the "unclear" barcodes to sit at the bottom. You will be able to recognize if they are "unclear" by them having a number >0 for
Yes, no problem to try it on my big dataset once all checks pass. Regards,
@bpenaud thanks for your willingness to test it. The dev version can be installed using the instructions provided here.
@bpenaud it's ready for testing off the
Hi @pdimens, I launched the job on Friday afternoon and it's currently still running. I can see that the demultiplexing job isn't halfway through yet, so I don't think the modifications will save any computing time. From what I can see of your modification, I think the problem is that the demultiplexing job is run as many times as there are samples. As a result, the entire Undetermined file is unzipped and read as many times as there are samples, which takes time. BELKHIR's solution was to read the Undetermined file once and then write to the sample files directly. That could be done as a single job, and it could therefore be parallelized as I proposed: first split the Undetermined file, then run the python script on each chunk of it. Don't hesitate to ask if you want help writing the snakefile and python script. Best Regards,
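A minimal sketch of the single-pass routing described above, i.e. reading the Undetermined file once and writing to the per-sample files directly. The `read_to_sample()` barcode lookup is a placeholder for the logic in BELKHIR's script; everything here is illustrative, not the actual implementation:

```python
import gzip

def demux_single_pass(undetermined_fq, samplenames, read_to_sample):
    # one open handle per sample, so the big input is streamed exactly once
    handles = {s: open(f"{s}.R1.fq", "w") for s in samplenames}
    with gzip.open(undetermined_fq, "rt") as fq:
        while True:
            record = [fq.readline() for _ in range(4)]  # FASTQ record = 4 lines
            if not record[0]:
                break
            sample = read_to_sample(record)  # barcode -> sample (placeholder)
            if sample in handles:
                handles[sample].writelines(record)
    for handle in handles.values():
        handle.close()
```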
That's really unfortunate that the current implementation is slow -- I was concerned about it for the very reasons you outlined. When I have time this week, I'll investigate your solution in more detail. Ideally, it would be best to split the rule you provided above into separate rules: one to split/chunk and the other to demux each chunk.
@bpenaud the new divide-and-conquer approach has been merged into
Yes, I can launch it before the weekend. Benjamin
Hi, I tried to run the new workflow, but it has trouble resolving the DAG when I set a large number of threads. Dry-run (DAG resolution) times:
6 threads: real 0m8.204s
8 threads: real 0m17.795s
With 60 threads the DAG was never resolved over the weekend. For the moment I haven't found the reason for this behavior. Regards,
So two things are happening, if I understand correctly:
All the runtimes above are for displaying the DAG (dry run). But since the DAG was never resolved with 60 cores, the demultiplex workflow doesn't start. So with a small number of threads the DAG is resolved, but as the thread count increases it is not.
That's so interesting. Thanks for letting me know, I'll look into it. Update: I can reproduce the error on my system.
Update 2:
I'll get this fixed.
@bpenaud alright, I think I fixed the issue by setting proper

# OUTDIR being your output directory
harpy resume OUTDIR

demultiplex_gen1.smk:

containerized: "docker://pdimens/harpy:latest"
import os
import logging
outdir = config["output_directory"]
envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
samplefile = config["inputs"]["demultiplex_schema"]
skip_reports = config["reports"]["skip"]
keep_unknown = config["keep_unknown"]
onstart:
logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
os.makedirs(f"{outdir}/reports/data", exist_ok = True)
onsuccess:
os.remove(logger.logfile)
onerror:
os.remove(logger.logfile)
wildcard_constraints:
sample = r"[a-zA-Z0-9._-]+",
FR = r"[12]",
part = r"\d{3}"
def parse_schema(smpl, keep_unknown):
d = {}
with open(smpl, "r") as f:
for i in f.readlines():
# a casual way to ignore empty lines or lines with !=2 fields
try:
sample, bc = i.split()
id_segment = bc[0]
if sample not in d:
d[sample] = [bc]
else:
d[sample].append(bc)
except ValueError:
continue
if keep_unknown:
d["_unknown_sample"] = f"{id_segment}00"
return d
samples = parse_schema(samplefile, keep_unknown)
samplenames = [i for i in samples]
print(samplenames)
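# one chunk id per available core, zero-padded to three digits to satisfy the {part} wildcard constraint (capped at 999)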
fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
rule barcode_segments:
output:
collect(outdir + "/workflow/segment_{letter}.bc", letter = ["A","C","B","D"])
params:
f"{outdir}/workflow"
container:
None
shell:
"haplotag_acbd.py {params}"
rule partition_reads:
input:
r1 = config["inputs"]["R1"],
r2 = config["inputs"]["R2"]
output:
r1 = temp(f"{outdir}/reads.R1.fq.gz"),
r2 = temp(f"{outdir}/reads.R2.fq.gz"),
parts = temp(collect(outdir + "/reads_chunks/reads.R{FR}.part_{part}.fq.gz", part = fastq_parts, FR = [1,2]))
log:
outdir + "/logs/partition.reads.log"
threads:
workflow.cores
params:
chunks = min(workflow.cores, 999),
outdir = f"{outdir}/reads_chunks"
conda:
f"{envdir}/demultiplex.yaml"
shell:
"""
ln -sr {input.r1} {output.r1}
ln -sr {input.r2} {output.r2}
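# split the symlinked pair into as many synchronized chunks as requested cores (params.chunks)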
seqkit split2 -f --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}
"""
use rule partition_reads as partition_index with:
input:
r1 = config["inputs"]["I1"],
r2 = config["inputs"]["I2"]
output:
r1 = temp(f"{outdir}/reads.I1.fq.gz"),
r2 = temp(f"{outdir}/reads.I2.fq.gz"),
parts = temp(collect(outdir + "/index_chunks/reads.I{FR}.part_{part}.fq.gz", part = fastq_parts, FR = [1,2]))
log:
outdir + "/logs/partition.index.log"
params:
chunks = min(workflow.cores, 999),
outdir = f"{outdir}/index_chunks"
rule demultiplex:
input:
R1 = outdir + "/reads_chunks/reads.R1.part_{part}.fq.gz",
R2 = outdir + "/reads_chunks/reads.R2.part_{part}.fq.gz",
I1 = outdir + "/index_chunks/reads.I1.part_{part}.fq.gz",
I2 = outdir + "/index_chunks/reads.I2.part_{part}.fq.gz",
segment_a = f"{outdir}/workflow/segment_A.bc",
segment_b = f"{outdir}/workflow/segment_B.bc",
segment_c = f"{outdir}/workflow/segment_C.bc",
segment_d = f"{outdir}/workflow/segment_D.bc",
schema = samplefile
output:
temp(collect(outdir + "/{sample}.{{part}}.R{FR}.fq", sample = samplenames, FR = [1,2])),
bx_info = temp(f"{outdir}/logs/part.{{part}}.barcodes")
log:
f"{outdir}/logs/demultiplex.{{part}}.log"
params:
outdir = outdir,
qxrx = config["include_qx_rx_tags"],
keep_unknown = keep_unknown,
part = lambda wc: wc.get("part")
conda:
f"{envdir}/demultiplex.yaml"
script:
"scripts/demultiplex_gen1.py"
rule merge_partitions:
input:
collect(outdir + "/{{sample}}.{part}.R{{FR}}.fq", part = fastq_parts)
output:
outdir + "/{sample}.R{FR}.fq.gz"
log:
outdir + "/logs/{sample}.{FR}.concat.log"
container:
None
shell:
"cat {input} | gzip > {output} 2> {log}"
rule merge_barcode_logs:
input:
bc = collect(outdir + "/logs/part.{part}.barcodes", part = fastq_parts)
output:
log = f"{outdir}/logs/barcodes.log"
run:
bc_dict = {}
for i in input.bc:
with open(i, "r") as bc_log:
# skip first row of column names
_ = bc_log.readline()
for line in bc_log:
barcode,total,correct,corrected = line.split()
bc_stats = [int(total), int(correct), int(corrected)]
if barcode not in bc_dict:
bc_dict[barcode] = bc_stats
else:
bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
with open(output.log, "w") as f:
f.write("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads\n")
for k,v in bc_dict.items():
f.write(k + "\t" + "\t".join([str(i) for i in v]) + "\n")
rule assess_quality:
input:
outdir + "/{sample}.R{FR}.fq.gz"
output:
outdir + "/reports/data/{sample}.R{FR}.fastqc"
log:
outdir + "/logs/{sample}.R{FR}.qc.log"
threads:
1
conda:
f"{envdir}/qc.yaml"
shell:
"""
( falco --quiet --threads {threads} -skip-report -skip-summary -data-filename {output} {input} ) > {log} 2>&1 ||
cat <<EOF > {output}
##Falco 1.2.4
>>Basic Statistics fail
#Measure Value
Filename {wildcards.sample}.R{wildcards.FR}.fq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 0
Sequences flagged as poor quality 0
Sequence length 0
%GC 0
>>END_MODULE
EOF
"""
rule report_config:
output:
outdir + "/workflow/multiqc.yaml"
run:
import yaml
configs = {
"sp": {"fastqc/data": {"fn" : "*.fastqc"}},
"table_sample_merge": {
"R1": ".R1",
"R2": ".R2"
},
"title": "Quality Assessment of Demultiplexed Samples",
"subtitle": "This report aggregates the QA results created by falco",
"report_comment": "Generated as part of the Harpy demultiplex workflow",
"report_header_info": [
{"Submit an issue": "https://github.com/pdimens/harpy/issues/new/choose"},
{"Read the Docs": "https://pdimens.github.io/harpy/"},
{"Project Homepage": "https://github.com/pdimens/harpy"}
]
}
with open(output[0], "w", encoding="utf-8") as yml:
yaml.dump(configs, yml, default_flow_style= False, sort_keys=False, width=float('inf'))
rule quality_report:
input:
fqc = collect(outdir + "/reports/data/{sample}.R{FR}.fastqc", sample = samplenames, FR = [1,2]),
mqc_yaml = outdir + "/workflow/multiqc.yaml"
output:
outdir + "/reports/demultiplex.QA.html"
log:
f"{outdir}/logs/multiqc.log"
params:
options = "--no-version-check --force --quiet --no-data-dir",
module = " --module fastqc",
logdir = outdir + "/reports/data/"
conda:
f"{envdir}/qc.yaml"
shell:
"multiqc --filename {output} --config {input.mqc_yaml} {params} 2> {log}"
rule workflow_summary:
default_target: True
input:
fq = collect(outdir + "/{sample}.R{FR}.fq.gz", sample = samplenames, FR = [1,2]),
barcode_logs = f"{outdir}/logs/barcodes.log",
reports = outdir + "/reports/demultiplex.QA.html" if not skip_reports else []
params:
R1 = config["inputs"]["R1"],
R2 = config["inputs"]["R2"],
I1 = config["inputs"]["I1"],
I2 = config["inputs"]["I2"]
run:
summary = ["The harpy demultiplex workflow ran using these parameters:"]
summary.append("Linked Read Barcode Design: Generation I")
inputs = "The multiplexed input files:\n"
inputs += f"\tread 1: {params.R1}\n"
inputs += f"\tread 2: {params.R2}\n"
inputs += f"\tindex 1: {params.I1}\n"
inputs += f"\tindex 2: {params.I2}"
inputs += f"Sample demultiplexing schema: {samplefile}"
summary.append(inputs)
demux = "Samples were demultiplexed using:\n"
demux += "\tworkflow/scripts/demultiplex_gen1.py"
summary.append(demux)
qc = "QC checks were performed on demultiplexed FASTQ files using:\n"
qc += "\tfalco -skip-report -skip-summary -data-filename output input.fq.gz"
summary.append(qc)
sm = "The Snakemake workflow was called via command line:\n"
sm += f"\t{config['workflow_call']}"
summary.append(sm)
with open(outdir + "/workflow/demux.gen1.summary", "w") as f:
f.write("\n\n".join(summary)) |
The runtime seems to have improved substantially, so that's a comfort. Regarding disk space, the peak should make sense though, right? The input files are first symlinked (free), then split into chunks (doubling disk usage), and usage keeps climbing as the per-sample fastq files are created. Finally, when a chunk is done being processed, its R1/R2 and I1/I2 parts get deleted and the sample files get compressed (one per thread). Since there are exactly as many chunks as cores provided, disk usage should max out while all chunks are undergoing this simultaneously. A small workaround would be to create more chunks than threads provided, which would give a smaller peak because some chunks finish and are deleted before a new chunk is processed. Thank you (genuinely) for continuing to test this, I'm glad to see we are making significant progress on it. Would you like to try the more-chunks-than-cores solution? I can whip that up relatively quickly.
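A sketch of that more-chunks-than-cores sizing; the 2.5x multiplier and the 999 cap mirror what the workflow in the later comment ends up using, but the numbers are otherwise illustrative:

```python
# oversubscribing chunks relative to cores lets finished chunks be deleted
# while later ones are still queued, which lowers the peak disk usage
def chunk_count(cores: int, oversubscribe: float = 2.5, cap: int = 999) -> int:
    return min(int(cores * oversubscribe), cap)

print(chunk_count(60))  # 150 chunks for a 60-core run
```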
Good ideas here @pdimens. One way is to change the chunking procedure.
Ad hoc chunking would also be a big boost to this. I'm not sure how to accomplish that. There must be some way to arbitrarily extract a section of a FASTQ file, because the point of FAI files is to allow random access. So, it would take getting the sequence count of the source FASTQ file(s), dividing by
@bpenaud I'm struggling to convince snakemake to execute the partitions simultaneously, such that

containerized: "docker://pdimens/harpy:latest"
import os
import logging
import subprocess
outdir = config["output_directory"]
envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
samplefile = config["inputs"]["demultiplex_schema"]
skip_reports = config["reports"]["skip"]
keep_unknown = config["keep_unknown"]
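# oversubscribe: ~2.5x more chunks than cores, so finished chunks can be removed while later ones are still queued, lowering peak disk usage; capped at 999 for the three-digit {part} wildcard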
n_chunks = min(int(workflow.cores * 2.5), 999)
onstart:
logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
os.makedirs(f"{outdir}/reports/data", exist_ok = True)
onsuccess:
os.remove(logger.logfile)
try:
os.remove(f"{outdir}/reads.R1.fq.gz")
os.remove(f"{outdir}/reads.R2.fq.gz")
os.remove(f"{outdir}/reads.I1.fq.gz")
os.remove(f"{outdir}/reads.I2.fq.gz")
except FileNotFoundError:
pass
onerror:
os.remove(logger.logfile)
wildcard_constraints:
sample = r"[a-zA-Z0-9._-]+",
FR = r"[12]",
part = r"\d{3}"
def parse_schema(smpl: str, keep_unknown: bool) -> dict:
d = {}
with open(smpl, "r") as f:
for i in f.readlines():
# a casual way to ignore empty lines or lines with !=2 fields
try:
sample, bc = i.split()
id_segment = bc[0]
if sample not in d:
d[sample] = [bc]
else:
d[sample].append(bc)
except ValueError:
continue
if keep_unknown:
d["_unknown_sample"] = f"{id_segment}00"
return d
samples = parse_schema(samplefile, keep_unknown)
samplenames = [i for i in samples]
def setup_chunks(fq1: str, fq2: str, parts: int) -> dict:
# find the minimum number of reads between R1 and R2 files
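# (zgrep -c -x '+' counts lines that are exactly '+', i.e. one per FASTQ record, assuming bare separator lines)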
count_r1 = subprocess.check_output(["zgrep", "-c", "-x", '+', fq1])
count_r2 = subprocess.check_output(["zgrep", "-c", "-x", '+', fq2])
read_min = min(int(count_r1), int(count_r2))
chunks_length = read_min // parts
starts = list(range(1, read_min, max(chunks_length,20)))
ends = [i-1 for i in starts[1:]]
# the last end should be -1, which is the "end" in a seqkit range
ends[-1] = -1
formatted_parts = [f"{i:03d}" for i in range(1, parts + 1)]
# format {part : (start, end)}
# example {"001": (1, 5000)}
return dict(zip(formatted_parts, zip(starts, ends)))
chunk_dict = setup_chunks(config["inputs"]["R1"], config["inputs"]["R2"], n_chunks)
fastq_parts = list(chunk_dict.keys())
rule barcode_segments:
output:
collect(outdir + "/workflow/segment_{letter}.bc", letter = ["A","C","B","D"])
params:
f"{outdir}/workflow"
container:
None
shell:
"haplotag_acbd.py {params}"
rule link_input:
input:
r1 = config["inputs"]["R1"],
r2 = config["inputs"]["R2"],
i1 = config["inputs"]["I1"],
i2 = config["inputs"]["I2"]
output:
r1 = temp(f"{outdir}/reads.R1.fq.gz"),
r2 = temp(f"{outdir}/reads.R2.fq.gz"),
i1 = temp(f"{outdir}/reads.I1.fq.gz"),
i2 = temp(f"{outdir}/reads.I2.fq.gz")
run:
for i,o in zip(input,output):
if os.path.exists(o) or os.path.islink(o):
os.remove(o)
os.symlink(i, o)
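# chunks are now extracted by record range with seqkit range (one job per chunk) instead of a single seqkit split2 call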
rule partition_reads:
group: "partition"
input:
outdir + "/reads.R{FR}.fq.gz"
output:
temp(outdir + "/reads_chunks/reads.R{FR}.part_{part}.fq.gz")
params:
lambda wc: f"-r {chunk_dict[wc.part][0]}:{chunk_dict[wc.part][1]}"
conda:
f"{envdir}/demultiplex.yaml"
shell:
"seqkit range {params} -o {output} {input}"
use rule partition_reads as partition_index with:
group: "partition"
input:
outdir + "/reads.I{FR}.fq.gz"
output:
temp(outdir + "/index_chunks/reads.I{FR}.part_{part}.fq.gz")
checkpoint demultiplex:
group: "partition"
priority: 100
input:
R1 = outdir + "/reads_chunks/reads.R1.part_{part}.fq.gz",
R2 = outdir + "/reads_chunks/reads.R2.part_{part}.fq.gz",
I1 = outdir + "/index_chunks/reads.I1.part_{part}.fq.gz",
I2 = outdir + "/index_chunks/reads.I2.part_{part}.fq.gz",
segment_a = f"{outdir}/workflow/segment_A.bc",
segment_b = f"{outdir}/workflow/segment_B.bc",
segment_c = f"{outdir}/workflow/segment_C.bc",
segment_d = f"{outdir}/workflow/segment_D.bc",
schema = samplefile
output:
temp(collect(outdir + "/{sample}.{{part}}.R{FR}.fq", sample = samplenames, FR = [1,2])),
bx_info = temp(f"{outdir}/logs/part.{{part}}.barcodes")
log:
f"{outdir}/logs/demultiplex.{{part}}.log"
params:
outdir = outdir,
qxrx = config["include_qx_rx_tags"],
keep_unknown = keep_unknown,
part = lambda wc: wc.get("part")
conda:
f"{envdir}/demultiplex.yaml"
script:
"scripts/demultiplex_gen1.py"
rule merge_partitions:
input:
collect(outdir + "/{{sample}}.{part}.R{{FR}}.fq", part = fastq_parts)
output:
outdir + "/{sample}.R{FR}.fq.gz"
log:
outdir + "/logs/{sample}.{FR}.concat.log"
container:
None
shell:
"cat {input} | gzip > {output} 2> {log}"
rule merge_barcode_logs:
input:
bc = collect(outdir + "/logs/part.{part}.barcodes", part = fastq_parts)
output:
log = f"{outdir}/logs/barcodes.log"
run:
bc_dict = {}
for i in input.bc:
with open(i, "r") as bc_log:
# skip first row of column names
_ = bc_log.readline()
for line in bc_log:
barcode,total,correct,corrected = line.split()
bc_stats = [int(total), int(correct), int(corrected)]
if barcode not in bc_dict:
bc_dict[barcode] = bc_stats
else:
bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
with open(output.log, "w") as f:
f.write("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads\n")
for k,v in bc_dict.items():
f.write(k + "\t" + "\t".join([str(i) for i in v]) + "\n")
rule assess_quality:
input:
outdir + "/{sample}.R{FR}.fq.gz"
output:
outdir + "/reports/data/{sample}.R{FR}.fastqc"
log:
outdir + "/logs/{sample}.R{FR}.qc.log"
threads:
1
conda:
f"{envdir}/qc.yaml"
shell:
"""
( falco --quiet --threads {threads} -skip-report -skip-summary -data-filename {output} {input} ) > {log} 2>&1 ||
cat <<EOF > {output}
##Falco 1.2.4
>>Basic Statistics fail
#Measure Value
Filename {wildcards.sample}.R{wildcards.FR}.fq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 0
Sequences flagged as poor quality 0
Sequence length 0
%GC 0
>>END_MODULE
EOF
"""
rule report_config:
output:
outdir + "/workflow/multiqc.yaml"
run:
import yaml
configs = {
"sp": {"fastqc/data": {"fn" : "*.fastqc"}},
"table_sample_merge": {
"R1": ".R1",
"R2": ".R2"
},
"title": "Quality Assessment of Demultiplexed Samples",
"subtitle": "This report aggregates the QA results created by falco",
"report_comment": "Generated as part of the Harpy demultiplex workflow",
"report_header_info": [
{"Submit an issue": "https://github.com/pdimens/harpy/issues/new/choose"},
{"Read the Docs": "https://pdimens.github.io/harpy/"},
{"Project Homepage": "https://github.com/pdimens/harpy"}
]
}
with open(output[0], "w", encoding="utf-8") as yml:
yaml.dump(configs, yml, default_flow_style= False, sort_keys=False, width=float('inf'))
rule quality_report:
input:
fqc = collect(outdir + "/reports/data/{sample}.R{FR}.fastqc", sample = samplenames, FR = [1,2]),
mqc_yaml = outdir + "/workflow/multiqc.yaml"
output:
outdir + "/reports/demultiplex.QA.html"
log:
f"{outdir}/logs/multiqc.log"
params:
options = "--no-version-check --force --quiet --no-data-dir",
module = " --module fastqc",
logdir = outdir + "/reports/data/"
conda:
f"{envdir}/qc.yaml"
shell:
"multiqc --filename {output} --config {input.mqc_yaml} {params} 2> {log}"
rule workflow_summary:
default_target: True
input:
fq = collect(outdir + "/{sample}.R{FR}.fq.gz", sample = samplenames, FR = [1,2]),
barcode_logs = f"{outdir}/logs/barcodes.log",
reports = outdir + "/reports/demultiplex.QA.html" if not skip_reports else []
params:
R1 = config["inputs"]["R1"],
R2 = config["inputs"]["R2"],
I1 = config["inputs"]["I1"],
I2 = config["inputs"]["I2"]
run:
summary = ["The harpy demultiplex workflow ran using these parameters:"]
summary.append("Linked Read Barcode Design: Generation I")
inputs = "The multiplexed input files:\n"
inputs += f"\tread 1: {params.R1}\n"
inputs += f"\tread 2: {params.R2}\n"
inputs += f"\tindex 1: {params.I1}\n"
inputs += f"\tindex 2: {params.I2}"
inputs += f"Sample demultiplexing schema: {samplefile}"
summary.append(inputs)
chunking = "Input data was partitioned into smaller chunks using:\n"
chunking += "\tseqkit -r start:stop -o output.fq input.fq"
summary.append(chunking)
demux = "Samples were demultiplexed using:\n"
demux += "\tworkflow/scripts/demultiplex_gen1.py"
summary.append(demux)
qc = "QC checks were performed on demultiplexed FASTQ files using:\n"
qc += "\tfalco -skip-report -skip-summary -data-filename output input.fq.gz"
summary.append(qc)
sm = "The Snakemake workflow was called via command line:\n"
sm += f"\t{config['workflow_call']}"
summary.append(sm)
with open(outdir + "/workflow/demux.gen1.summary", "w") as f:
f.write("\n\n".join(summary)) |
Damn, I thought this would have been the winner. So the previous one seems like the best option thus far. It seems like more chunks, even if smaller, add more overhead. I'd like to try one more thing: keep n_cores chunks, but have each demuxing job write directly to a gzip file via subprocess. I'll notify you when I get that working.
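A minimal sketch of what "write directly to a gzip file via subprocess" could look like; the filenames and the exact wiring are assumptions, not the final implementation:

```python
import subprocess

def gzip_writer(path: str):
    # stream records through an external gzip process so compression happens
    # while the demuxing job writes, instead of in a separate compress rule
    out = open(path, "wb")
    proc = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE, stdout=out)
    return proc, out

proc, out = gzip_writer("Sample_01.R1.fq.gz")  # illustrative filename
proc.stdin.write(b"@read1\nACGT\n+\nFFFF\n")   # one FASTQ record
proc.stdin.close()
proc.wait()
out.close()
```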
@bpenaud okay, let's try this as a possibly final option:
Description of feature
Hello Pavel,
I work in the same lab as BELKHIR, and on my side I tried to speed up the demultiplexing step by parallelizing the rule with the GNU parallel tool.
My method uses seqkit split2 to split the Undetermined_[RS][12].fastq.gz files into a number of files equal to the number of CPUs given to the rule.
Then I use GNU parallel to run the demuxGen1 binary on the split sequence blocks.
Once the BX tag has been added, I rebuild the files in the same way as the Undetermined files.
This method considerably speeds up the rule: on 2 x 2.5 billion reads, the demultiplexing step in harpy ran for 6d 20h 30min 43s, while with my method it ran for 8h 54min 13s with 60 cores (time for the demultiplex_barcodes rule only).
On the other hand, this method requires temporary files to be written, which is going to take up a lot of disk space and can be a problem.
I've modified the rule:
By
My rule needs GNU parallel and seqkit installed in the conda environment.
Best regards,
Benjamin