speedup demultiplex barcode by parallelisation #198
Hi Benjamin, thanks for writing. The benchmark you provide on real data (6 days) is kind of unacceptable, so thanks for bringing it to my attention. I had a similar line of thinking last night regarding a speedup that I would like to try today or this week. I would like to modify the new python demuxing script originally conceived by @BELKHIR in #190 to focus on a single sample at a time, and have that be parallelized through snakemake. If this works the way I'm hoping it will, it will also be flexible to changes in the schema, such as the inclusion of a new sample (probably a rare case, but it would work regardless). The idea is this (pseudocode):

rule demux:
    input:
        data_R1.fq
        data_R2.fq
        data_I1.fq
        data_I2.fq
        segments a,b,c,d
    output:
        pipe({sample}.R1.fq)
        pipe({sample}.R2.fq)
    params:
        sample = lambda wc: wc.get("sample")
        id_segment = lambda wc: sample_dict(wc.sample)
    script:
        scripts/demux_gen1.py {params} {input}

rule compress:
    input:
        {sample}.R{FR}.fq
    output:
        {sample}.R{FR}.fq.gz
    shell:
        "gzip {input}"
The work for that can be seen in PR #200.
On the [admittedly tiny] test data, the parallelized-demux-by-sample seems to be performant. Once all checks pass, would you be willing to try a dev build on your data to see how it performs in a real setting? I'd also like to rope @BELKHIR into this, as I made some modifications to their python script that are worth noting. Preamble: Given that the pythonic approach you provided essentially liberates us from the original Chan method, there is a lot of freedom to make it work in a way that seems more sensible for general use. With that in mind:
This format allows the "unclear" barcodes to sit at the bottom. You will be able to recognize if they are "unclear" by them having a number >0 for
Yes, no problem to try it on my big dataset once all checks pass. Regards,
@bpenaud thanks for your willingness to test it. The dev version can be installed using the instructions provided here.
@bpenaud it's ready for testing off the
Hi @pdimens, I launched the job on Friday afternoon and it's currently still running. I can see that the demultiplexing job isn't halfway through yet, so I don't think the modifications will save any computing time. From what I can see of your modification, I think the problem is that the demultiplexing job is run as many times as there are samples. As a result, the entire Undetermined file is unzipped and read as many times as there are samples, which takes time. BELKHIR's solution was to read the Undetermined file once and then write to the sample files directly. That could be done as a single job, and it could therefore be parallelized as I proposed: first split the Undetermined file, then run the python script on each chunk of it. Don't hesitate to ask if you want help writing the snakefile and python script. Best Regards,
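A minimal sketch of the single-pass routing described above, i.e. reading the Undetermined file once and writing to the per-sample files directly. The `read_to_sample()` barcode lookup is a placeholder for the logic in BELKHIR's script; everything here is illustrative, not the actual implementation:

```python
import gzip

def demux_single_pass(undetermined_fq, samplenames, read_to_sample):
    # one open handle per sample, so the big input is streamed exactly once
    handles = {s: open(f"{s}.R1.fq", "w") for s in samplenames}
    with gzip.open(undetermined_fq, "rt") as fq:
        while True:
            record = [fq.readline() for _ in range(4)]  # FASTQ record = 4 lines
            if not record[0]:
                break
            sample = read_to_sample(record)  # barcode -> sample (placeholder)
            if sample in handles:
                handles[sample].writelines(record)
    for handle in handles.values():
        handle.close()
```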
That's really unfortunate that the current implementation is slow -- I was concerned about it for the very reasons you outlined. When I have time this week, I'll investigate your solution in more detail. Ideally, it would be best to split the rule you provided above into separate rules: one to split/chunk and the other to demux each chunk.
@bpenaud the new divide-and-conquer approach has been merged into
Yes, I can launch it before the weekend. Benjamin
Hi, I tried to run the new workflow, but it has trouble resolving the DAG when I set a large number of threads. Dry-run (DAG resolution) times:
6 threads: real 0m8.204s
8 threads: real 0m17.795s
With 60 threads the DAG was never resolved over the weekend. For the moment I haven't found the reason for this behavior. Regards,
So two things are happening, if I understand correctly:
All the runtimes above are for displaying the DAG (dry run). But since the DAG was never resolved with 60 cores, the demultiplex workflow doesn't start. So with a small number of threads the DAG is resolved, but as the thread count increases it is not.
That's so interesting. Thanks for letting me know, I'll look into it. Update: I can reproduce the error on my system.
Update 2:
I'll get this fixed.
@bpenaud alright, I think I fixed the issue by setting proper

# OUTDIR being your output directory
harpy resume OUTDIR

demultiplex_gen1.smk:

containerized: "docker://pdimens/harpy:latest"
import os
import logging
outdir = config["output_directory"]
envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
samplefile = config["inputs"]["demultiplex_schema"]
skip_reports = config["reports"]["skip"]
keep_unknown = config["keep_unknown"]
onstart:
logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
os.makedirs(f"{outdir}/reports/data", exist_ok = True)
onsuccess:
os.remove(logger.logfile)
onerror:
os.remove(logger.logfile)
wildcard_constraints:
sample = r"[a-zA-Z0-9._-]+",
FR = r"[12]",
part = r"\d{3}"
def parse_schema(smpl, keep_unknown):
d = {}
with open(smpl, "r") as f:
for i in f.readlines():
# a casual way to ignore empty lines or lines with !=2 fields
try:
sample, bc = i.split()
id_segment = bc[0]
if sample not in d:
d[sample] = [bc]
else:
d[sample].append(bc)
except ValueError:
continue
if keep_unknown:
d["_unknown_sample"] = f"{id_segment}00"
return d
samples = parse_schema(samplefile, keep_unknown)
samplenames = [i for i in samples]
print(samplenames)
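# one chunk id per available core, zero-padded to three digits to satisfy the {part} wildcard constraint (capped at 999)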
fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
rule barcode_segments:
output:
collect(outdir + "/workflow/segment_{letter}.bc", letter = ["A","C","B","D"])
params:
f"{outdir}/workflow"
container:
None
shell:
"haplotag_acbd.py {params}"
rule partition_reads:
input:
r1 = config["inputs"]["R1"],
r2 = config["inputs"]["R2"]
output:
r1 = temp(f"{outdir}/reads.R1.fq.gz"),
r2 = temp(f"{outdir}/reads.R2.fq.gz"),
parts = temp(collect(outdir + "/reads_chunks/reads.R{FR}.part_{part}.fq.gz", part = fastq_parts, FR = [1,2]))
log:
outdir + "/logs/partition.reads.log"
threads:
workflow.cores
params:
chunks = min(workflow.cores, 999),
outdir = f"{outdir}/reads_chunks"
conda:
f"{envdir}/demultiplex.yaml"
shell:
"""
ln -sr {input.r1} {output.r1}
ln -sr {input.r2} {output.r2}
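# split the symlinked pair into as many synchronized chunks as requested cores (params.chunks)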
seqkit split2 -f --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}
"""
use rule partition_reads as partition_index with:
input:
r1 = config["inputs"]["I1"],
r2 = config["inputs"]["I2"]
output:
r1 = temp(f"{outdir}/reads.I1.fq.gz"),
r2 = temp(f"{outdir}/reads.I2.fq.gz"),
parts = temp(collect(outdir + "/index_chunks/reads.I{FR}.part_{part}.fq.gz", part = fastq_parts, FR = [1,2]))
log:
outdir + "/logs/partition.index.log"
params:
chunks = min(workflow.cores, 999),
outdir = f"{outdir}/index_chunks"
rule demultiplex:
input:
R1 = outdir + "/reads_chunks/reads.R1.part_{part}.fq.gz",
R2 = outdir + "/reads_chunks/reads.R2.part_{part}.fq.gz",
I1 = outdir + "/index_chunks/reads.I1.part_{part}.fq.gz",
I2 = outdir + "/index_chunks/reads.I2.part_{part}.fq.gz",
segment_a = f"{outdir}/workflow/segment_A.bc",
segment_b = f"{outdir}/workflow/segment_B.bc",
segment_c = f"{outdir}/workflow/segment_C.bc",
segment_d = f"{outdir}/workflow/segment_D.bc",
schema = samplefile
output:
temp(collect(outdir + "/{sample}.{{part}}.R{FR}.fq", sample = samplenames, FR = [1,2])),
bx_info = temp(f"{outdir}/logs/part.{{part}}.barcodes")
log:
f"{outdir}/logs/demultiplex.{{part}}.log"
params:
outdir = outdir,
qxrx = config["include_qx_rx_tags"],
keep_unknown = keep_unknown,
part = lambda wc: wc.get("part")
conda:
f"{envdir}/demultiplex.yaml"
script:
"scripts/demultiplex_gen1.py"
rule merge_partitions:
input:
collect(outdir + "/{{sample}}.{part}.R{{FR}}.fq", part = fastq_parts)
output:
outdir + "/{sample}.R{FR}.fq.gz"
log:
outdir + "/logs/{sample}.{FR}.concat.log"
container:
None
shell:
"cat {input} | gzip > {output} 2> {log}"
rule merge_barcode_logs:
input:
bc = collect(outdir + "/logs/part.{part}.barcodes", part = fastq_parts)
output:
log = f"{outdir}/logs/barcodes.log"
run:
bc_dict = {}
for i in input.bc:
with open(i, "r") as bc_log:
# skip first row of column names
_ = bc_log.readline()
for line in bc_log:
barcode,total,correct,corrected = line.split()
bc_stats = [int(total), int(correct), int(corrected)]
if barcode not in bc_dict:
bc_dict[barcode] = bc_stats
else:
bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
with open(output.log, "w") as f:
f.write("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads\n")
for k,v in bc_dict.items():
f.write(k + "\t" + "\t".join([str(i) for i in v]) + "\n")
rule assess_quality:
input:
outdir + "/{sample}.R{FR}.fq.gz"
output:
outdir + "/reports/data/{sample}.R{FR}.fastqc"
log:
outdir + "/logs/{sample}.R{FR}.qc.log"
threads:
1
conda:
f"{envdir}/qc.yaml"
shell:
"""
( falco --quiet --threads {threads} -skip-report -skip-summary -data-filename {output} {input} ) > {log} 2>&1 ||
cat <<EOF > {output}
##Falco 1.2.4
>>Basic Statistics fail
#Measure Value
Filename {wildcards.sample}.R{wildcards.FR}.fq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 0
Sequences flagged as poor quality 0
Sequence length 0
%GC 0
>>END_MODULE
EOF
"""
rule report_config:
output:
outdir + "/workflow/multiqc.yaml"
run:
import yaml
configs = {
"sp": {"fastqc/data": {"fn" : "*.fastqc"}},
"table_sample_merge": {
"R1": ".R1",
"R2": ".R2"
},
"title": "Quality Assessment of Demultiplexed Samples",
"subtitle": "This report aggregates the QA results created by falco",
"report_comment": "Generated as part of the Harpy demultiplex workflow",
"report_header_info": [
{"Submit an issue": "https://github.com/pdimens/harpy/issues/new/choose"},
{"Read the Docs": "https://pdimens.github.io/harpy/"},
{"Project Homepage": "https://github.com/pdimens/harpy"}
]
}
with open(output[0], "w", encoding="utf-8") as yml:
yaml.dump(configs, yml, default_flow_style= False, sort_keys=False, width=float('inf'))
rule quality_report:
input:
fqc = collect(outdir + "/reports/data/{sample}.R{FR}.fastqc", sample = samplenames, FR = [1,2]),
mqc_yaml = outdir + "/workflow/multiqc.yaml"
output:
outdir + "/reports/demultiplex.QA.html"
log:
f"{outdir}/logs/multiqc.log"
params:
options = "--no-version-check --force --quiet --no-data-dir",
module = " --module fastqc",
logdir = outdir + "/reports/data/"
conda:
f"{envdir}/qc.yaml"
shell:
"multiqc --filename {output} --config {input.mqc_yaml} {params} 2> {log}"
rule workflow_summary:
default_target: True
input:
fq = collect(outdir + "/{sample}.R{FR}.fq.gz", sample = samplenames, FR = [1,2]),
barcode_logs = f"{outdir}/logs/barcodes.log",
reports = outdir + "/reports/demultiplex.QA.html" if not skip_reports else []
params:
R1 = config["inputs"]["R1"],
R2 = config["inputs"]["R2"],
I1 = config["inputs"]["I1"],
I2 = config["inputs"]["I2"]
run:
summary = ["The harpy demultiplex workflow ran using these parameters:"]
summary.append("Linked Read Barcode Design: Generation I")
inputs = "The multiplexed input files:\n"
inputs += f"\tread 1: {params.R1}\n"
inputs += f"\tread 2: {params.R2}\n"
inputs += f"\tindex 1: {params.I1}\n"
inputs += f"\tindex 2: {params.I2}"
inputs += f"Sample demultiplexing schema: {samplefile}"
summary.append(inputs)
demux = "Samples were demultiplexed using:\n"
demux += "\tworkflow/scripts/demultiplex_gen1.py"
summary.append(demux)
qc = "QC checks were performed on demultiplexed FASTQ files using:\n"
qc += "\tfalco -skip-report -skip-summary -data-filename output input.fq.gz"
summary.append(qc)
sm = "The Snakemake workflow was called via command line:\n"
sm += f"\t{config['workflow_call']}"
summary.append(sm)
with open(outdir + "/workflow/demux.gen1.summary", "w") as f:
f.write("\n\n".join(summary)) |
The runtime seems to have improved substantially, so that's a comfort. Regarding disk space, the peak should make sense though, right? The input files are first symlinked (free), then split into chunks (doubling disk usage), and usage keeps climbing as the per-sample fastq files are created. Finally, when a chunk is done being processed, its R1/R2 and I1/I2 parts get deleted and the sample files get compressed (one per thread). Since there are exactly as many chunks as cores provided, disk usage should max out while all chunks are undergoing this simultaneously. A small workaround would be to create more chunks than threads provided, which would give a smaller peak because some chunks finish and are deleted before a new chunk is processed. Thank you (genuinely) for continuing to test this, I'm glad to see we are making significant progress on it. Would you like to try the more-chunks-than-cores solution? I can whip that up relatively quickly.
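A sketch of that more-chunks-than-cores sizing; the 2.5x multiplier and the 999 cap mirror what the workflow in the later comment ends up using, but the numbers are otherwise illustrative:

```python
# oversubscribing chunks relative to cores lets finished chunks be deleted
# while later ones are still queued, which lowers the peak disk usage
def chunk_count(cores: int, oversubscribe: float = 2.5, cap: int = 999) -> int:
    return min(int(cores * oversubscribe), cap)

print(chunk_count(60))  # 150 chunks for a 60-core run
```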
Good ideas here @pdimens. One way is to change the chunking procedure.
Ad hoc chunking would also be a big boost to this. I'm not sure how to accomplish that. There must be some way to arbitrarily extract a section of a FASTQ file, because the point of FAI files is to allow random access. So, it would take getting the sequence count of the source FASTQ file(s), dividing by
@bpenaud I'm struggling to convince snakemake to execute the partitions simultaneously, such that

containerized: "docker://pdimens/harpy:latest"
import os
import logging
import subprocess
outdir = config["output_directory"]
envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs")
samplefile = config["inputs"]["demultiplex_schema"]
skip_reports = config["reports"]["skip"]
keep_unknown = config["keep_unknown"]
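# oversubscribe: ~2.5x more chunks than cores, so finished chunks can be removed while later ones are still queued, lowering peak disk usage; capped at 999 for the three-digit {part} wildcard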
n_chunks = min(int(workflow.cores * 2.5), 999)
onstart:
logger.logger.addHandler(logging.FileHandler(config["snakemake_log"]))
os.makedirs(f"{outdir}/reports/data", exist_ok = True)
onsuccess:
os.remove(logger.logfile)
try:
os.remove(f"{outdir}/reads.R1.fq.gz")
os.remove(f"{outdir}/reads.R2.fq.gz")
os.remove(f"{outdir}/reads.I1.fq.gz")
os.remove(f"{outdir}/reads.I2.fq.gz")
except FileNotFoundError:
pass
onerror:
os.remove(logger.logfile)
wildcard_constraints:
sample = r"[a-zA-Z0-9._-]+",
FR = r"[12]",
part = r"\d{3}"
def parse_schema(smpl: str, keep_unknown: bool) -> dict:
d = {}
with open(smpl, "r") as f:
for i in f.readlines():
# a casual way to ignore empty lines or lines with !=2 fields
try:
sample, bc = i.split()
id_segment = bc[0]
if sample not in d:
d[sample] = [bc]
else:
d[sample].append(bc)
except ValueError:
continue
if keep_unknown:
d["_unknown_sample"] = f"{id_segment}00"
return d
samples = parse_schema(samplefile, keep_unknown)
samplenames = [i for i in samples]
def setup_chunks(fq1: str, fq2: str, parts: int) -> dict:
# find the minimum number of reads between R1 and R2 files
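# (zgrep -c -x '+' counts lines that are exactly '+', i.e. one per FASTQ record, assuming bare separator lines)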
count_r1 = subprocess.check_output(["zgrep", "-c", "-x", '+', fq1])
count_r2 = subprocess.check_output(["zgrep", "-c", "-x", '+', fq2])
read_min = min(int(count_r1), int(count_r2))
chunks_length = read_min // parts
starts = list(range(1, read_min, max(chunks_length,20)))
ends = [i-1 for i in starts[1:]]
# the last end should be -1, which is the "end" in a seqkit range
ends[-1] = -1
formatted_parts = [f"{i:03d}" for i in range(1, parts + 1)]
# format {part : (start, end)}
# example {"001": (1, 5000)}
return dict(zip(formatted_parts, zip(starts, ends)))
chunk_dict = setup_chunks(config["inputs"]["R1"], config["inputs"]["R2"], n_chunks)
fastq_parts = list(chunk_dict.keys())
rule barcode_segments:
output:
collect(outdir + "/workflow/segment_{letter}.bc", letter = ["A","C","B","D"])
params:
f"{outdir}/workflow"
container:
None
shell:
"haplotag_acbd.py {params}"
rule link_input:
input:
r1 = config["inputs"]["R1"],
r2 = config["inputs"]["R2"],
i1 = config["inputs"]["I1"],
i2 = config["inputs"]["I2"]
output:
r1 = temp(f"{outdir}/reads.R1.fq.gz"),
r2 = temp(f"{outdir}/reads.R2.fq.gz"),
i1 = temp(f"{outdir}/reads.I1.fq.gz"),
i2 = temp(f"{outdir}/reads.I2.fq.gz")
run:
for i,o in zip(input,output):
if os.path.exists(o) or os.path.islink(o):
os.remove(o)
os.symlink(i, o)
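# chunks are now extracted by record range with seqkit range (one job per chunk) instead of a single seqkit split2 call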
rule partition_reads:
group: "partition"
input:
outdir + "/reads.R{FR}.fq.gz"
output:
temp(outdir + "/reads_chunks/reads.R{FR}.part_{part}.fq.gz")
params:
lambda wc: f"-r {chunk_dict[wc.part][0]}:{chunk_dict[wc.part][1]}"
conda:
f"{envdir}/demultiplex.yaml"
shell:
"seqkit range {params} -o {output} {input}"
use rule partition_reads as partition_index with:
group: "partition"
input:
outdir + "/reads.I{FR}.fq.gz"
output:
temp(outdir + "/index_chunks/reads.I{FR}.part_{part}.fq.gz")
checkpoint demultiplex:
group: "partition"
priority: 100
input:
R1 = outdir + "/reads_chunks/reads.R1.part_{part}.fq.gz",
R2 = outdir + "/reads_chunks/reads.R2.part_{part}.fq.gz",
I1 = outdir + "/index_chunks/reads.I1.part_{part}.fq.gz",
I2 = outdir + "/index_chunks/reads.I2.part_{part}.fq.gz",
segment_a = f"{outdir}/workflow/segment_A.bc",
segment_b = f"{outdir}/workflow/segment_B.bc",
segment_c = f"{outdir}/workflow/segment_C.bc",
segment_d = f"{outdir}/workflow/segment_D.bc",
schema = samplefile
output:
temp(collect(outdir + "/{sample}.{{part}}.R{FR}.fq", sample = samplenames, FR = [1,2])),
bx_info = temp(f"{outdir}/logs/part.{{part}}.barcodes")
log:
f"{outdir}/logs/demultiplex.{{part}}.log"
params:
outdir = outdir,
qxrx = config["include_qx_rx_tags"],
keep_unknown = keep_unknown,
part = lambda wc: wc.get("part")
conda:
f"{envdir}/demultiplex.yaml"
script:
"scripts/demultiplex_gen1.py"
rule merge_partitions:
input:
collect(outdir + "/{{sample}}.{part}.R{{FR}}.fq", part = fastq_parts)
output:
outdir + "/{sample}.R{FR}.fq.gz"
log:
outdir + "/logs/{sample}.{FR}.concat.log"
container:
None
shell:
"cat {input} | gzip > {output} 2> {log}"
rule merge_barcode_logs:
input:
bc = collect(outdir + "/logs/part.{part}.barcodes", part = fastq_parts)
output:
log = f"{outdir}/logs/barcodes.log"
run:
bc_dict = {}
for i in input.bc:
with open(i, "r") as bc_log:
# skip first row of column names
_ = bc_log.readline()
for line in bc_log:
barcode,total,correct,corrected = line.split()
bc_stats = [int(total), int(correct), int(corrected)]
if barcode not in bc_dict:
bc_dict[barcode] = bc_stats
else:
bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
with open(output.log, "w") as f:
f.write("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads\n")
for k,v in bc_dict.items():
f.write(k + "\t" + "\t".join([str(i) for i in v]) + "\n")
rule assess_quality:
input:
outdir + "/{sample}.R{FR}.fq.gz"
output:
outdir + "/reports/data/{sample}.R{FR}.fastqc"
log:
outdir + "/logs/{sample}.R{FR}.qc.log"
threads:
1
conda:
f"{envdir}/qc.yaml"
shell:
"""
( falco --quiet --threads {threads} -skip-report -skip-summary -data-filename {output} {input} ) > {log} 2>&1 ||
cat <<EOF > {output}
##Falco 1.2.4
>>Basic Statistics fail
#Measure Value
Filename {wildcards.sample}.R{wildcards.FR}.fq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 0
Sequences flagged as poor quality 0
Sequence length 0
%GC 0
>>END_MODULE
EOF
"""
rule report_config:
output:
outdir + "/workflow/multiqc.yaml"
run:
import yaml
configs = {
"sp": {"fastqc/data": {"fn" : "*.fastqc"}},
"table_sample_merge": {
"R1": ".R1",
"R2": ".R2"
},
"title": "Quality Assessment of Demultiplexed Samples",
"subtitle": "This report aggregates the QA results created by falco",
"report_comment": "Generated as part of the Harpy demultiplex workflow",
"report_header_info": [
{"Submit an issue": "https://github.com/pdimens/harpy/issues/new/choose"},
{"Read the Docs": "https://pdimens.github.io/harpy/"},
{"Project Homepage": "https://github.com/pdimens/harpy"}
]
}
with open(output[0], "w", encoding="utf-8") as yml:
yaml.dump(configs, yml, default_flow_style= False, sort_keys=False, width=float('inf'))
rule quality_report:
input:
fqc = collect(outdir + "/reports/data/{sample}.R{FR}.fastqc", sample = samplenames, FR = [1,2]),
mqc_yaml = outdir + "/workflow/multiqc.yaml"
output:
outdir + "/reports/demultiplex.QA.html"
log:
f"{outdir}/logs/multiqc.log"
params:
options = "--no-version-check --force --quiet --no-data-dir",
module = " --module fastqc",
logdir = outdir + "/reports/data/"
conda:
f"{envdir}/qc.yaml"
shell:
"multiqc --filename {output} --config {input.mqc_yaml} {params} 2> {log}"
rule workflow_summary:
default_target: True
input:
fq = collect(outdir + "/{sample}.R{FR}.fq.gz", sample = samplenames, FR = [1,2]),
barcode_logs = f"{outdir}/logs/barcodes.log",
reports = outdir + "/reports/demultiplex.QA.html" if not skip_reports else []
params:
R1 = config["inputs"]["R1"],
R2 = config["inputs"]["R2"],
I1 = config["inputs"]["I1"],
I2 = config["inputs"]["I2"]
run:
summary = ["The harpy demultiplex workflow ran using these parameters:"]
summary.append("Linked Read Barcode Design: Generation I")
inputs = "The multiplexed input files:\n"
inputs += f"\tread 1: {params.R1}\n"
inputs += f"\tread 2: {params.R2}\n"
inputs += f"\tindex 1: {params.I1}\n"
inputs += f"\tindex 2: {params.I2}"
inputs += f"Sample demultiplexing schema: {samplefile}"
summary.append(inputs)
chunking = "Input data was partitioned into smaller chunks using:\n"
chunking += "\tseqkit -r start:stop -o output.fq input.fq"
summary.append(chunking)
demux = "Samples were demultiplexed using:\n"
demux += "\tworkflow/scripts/demultiplex_gen1.py"
summary.append(demux)
qc = "QC checks were performed on demultiplexed FASTQ files using:\n"
qc += "\tfalco -skip-report -skip-summary -data-filename output input.fq.gz"
summary.append(qc)
sm = "The Snakemake workflow was called via command line:\n"
sm += f"\t{config['workflow_call']}"
summary.append(sm)
with open(outdir + "/workflow/demux.gen1.summary", "w") as f:
f.write("\n\n".join(summary)) |
Damn, I thought this would have been the winner. So the previous one seems like the best option thus far. It seems like more chunks, even if smaller, add more overhead. I'd like to try one more thing: keep n_cores chunks, but have each demuxing job write directly to a gzip file via subprocess. I'll notify you when I get that working.
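A minimal sketch of what "write directly to a gzip file via subprocess" could look like; the filenames and the exact wiring are assumptions, not the final implementation:

```python
import subprocess

def gzip_writer(path: str):
    # stream records through an external gzip process so compression happens
    # while the demuxing job writes, instead of in a separate compress rule
    out = open(path, "wb")
    proc = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE, stdout=out)
    return proc, out

proc, out = gzip_writer("Sample_01.R1.fq.gz")  # illustrative filename
proc.stdin.write(b"@read1\nACGT\n+\nFFFF\n")   # one FASTQ record
proc.stdin.close()
proc.wait()
out.close()
```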
@bpenaud okay, let's try this as a possibly final option:
Description of feature
Hello Pavel,
I work in the same lab as BELKHIR, and on my side I tried to speed up the demultiplexing step by parallelizing the rule with the GNU parallel tool.
My method uses seqkit split2 to split the Undetermined_[RS][12].fastq.gz files into a number of files equal to the number of CPUs given to the rule.
Then I use GNU parallel to run the demuxGen1 binary on the split sequence blocks.
Once the BX tag has been added, I rebuild the files in the same way as the Undetermined files.
This method considerably speeds up the rule: on 2 x 2.5 billion reads, the demultiplexing step in harpy ran for 6d 20h 30min 43s, while with my method it ran for 8h 54min 13s with 60 cores (time for the demultiplex_barcodes rule only).
On the other hand, this method requires temporary files to be written, which is going to take up a lot of disk space and can be a problem.
I've modified the rule:
By
My rule needs GNU parallel and seqkit installed in the conda environment.
Best regards,
Benjamin