
Stats for multiple infection samples and reports workflow #116

Merged

TimHHH merged 30 commits into master from quanttb-stats on Sep 27, 2022

Conversation

abhi18av
Member

This PR addresses:

  • The need to collect various statistics for the samples which fail at the QC stage
  • The need to have a unified report for a researcher's reference (REPORTS_WF)

@abhi18av abhi18av added the hold label Aug 21, 2022
@abhi18av abhi18av changed the title Stats for contaminated samples and reports workflow [HOLD] Stats for contaminated samples and reports workflow Aug 21, 2022
@abhi18av abhi18av self-assigned this Aug 21, 2022
@abhi18av abhi18av changed the title [HOLD] Stats for contaminated samples and reports workflow Stats for contaminated samples and reports workflow Aug 21, 2022
@abhi18av abhi18av added hold and removed hold labels Aug 21, 2022
@abhi18av abhi18av changed the title Stats for contaminated samples and reports workflow [HOLD] Stats for contaminated samples and reports workflow Aug 21, 2022
@abhi18av
Member Author

abhi18av commented Aug 21, 2022

@TimHHH, I've implemented what we discussed in #106. As of now, the failed samples are published in the same parent folder as the normal workflow output, but within a rejected subfolder.

Here's an example of the published stats for samples which didn't pass the QC threshold:

$ tree  -f | grep 'rejected'
    ./libraries/mapped_sequences/mapped_libraries/rejected/Votintseva2015.M6076.L1.A1.1.1.1.sorted_reads.bam 
    ./QC_statistics/coverage/rejected/Votintseva2015.M6076.L1.A1.1.1.1.WgsMetrics.txt 
    ./QC_statistics/mapping/rejected/Votintseva2015.M6076.L1.A1.1.1.1.FlagStat.txt 
    ./QC_statistics/mapping/rejected/Votintseva2015.M6076.L1.A1.1.1.1.SamtoolStats.txt 
    ./samples/mapped_sequences/marked_duplicates/rejected/Votintseva2015.M6076.L1.A1.1.1.1.dedup_reads.bam 
    ./samples/mapped_sequences/marked_duplicates/rejected/Votintseva2015.M6076.L1.A1.1.1.1.dedup_reads.bam.bai 
    ./samples/mapped_sequences/marked_duplicates/rejected/Votintseva2015.M6076.L1.A1.1.1.1.MarkDupMetrics.txt 
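
For reference, here is a minimal sketch of how publishDir can route the failed-QC outputs into such a rejected subfolder; the process name, params.outdir, and paths below are illustrative assumptions, not the actual module code:

```nextflow
// Sketch only: process name, params.outdir and the output pattern are illustrative.
process SAMTOOLS_SORT__REJECTED {
    publishDir "${params.outdir}/libraries/mapped_sequences/mapped_libraries/rejected", mode: 'copy'

    input:
        tuple val(sampleName), path(bam)

    output:
        tuple val(sampleName), path("*.sorted_reads.bam")

    script:
        """
        samtools sort -o ${sampleName}.sorted_reads.bam ${bam}
        """
}
```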

@abhi18av
Member Author

Could you please run this branch and see if it works as expected?

If so, I'd also like to address #99 within this branch, and I'd like to discuss with you in our next meeting which results need to be published in the consolidated report.

.join( GATK_COLLECT_WGS_METRICS.out )
.join( GATK_FLAG_STAT.out )

//TODO: If these stats are needed, only then implement the following processes to accommodate the missing NTM stats
Member Author

@abhi18av abhi18av Aug 21, 2022


//TODO: If these stats are needed, only then implement the following processes to accommodate the missing NTM stats

This could be accommodated better if we convert the shell scripts into proper Python scripts, as documented in #39.

CC @LennertVerboven

@TimHHH
Collaborator

TimHHH commented Aug 29, 2022

We had/have some serious cluster issues, but I am trying this again at the moment.
In the meantime, please note that we should update the name of this process and its variables. This process is not about detecting contamination but about detecting multiple infection. Let's fix this after the test runs.

@TimHHH TimHHH changed the title [HOLD] Stats for contaminated samples and reports workflow [HOLD] Stats for multiple infection samples and reports workflow Aug 29, 2022
@TimHHH
Collaborator

TimHHH commented Sep 1, 2022

@abhi18av
Hi Abhi. I have done the requested check. I ran these 10 samples through this branch and the master:
EarlyMGIT.10Diffidult.Samples.csv
You should have these data lying around if you want to run them yourselves.

On the master branch run 3iYH8lZDKhrJQG small_watson, everything goes according to plan. Once this run finishes I do a -resume and nothing odd happens (5EF6z0qHcbfYbf fabulous_colden), except that the whole MERGE workflow starts again, which was not necessary since nothing changed.

On the quanttb-stats branch run 2rrGyBXeyiWIUs high_volta, however, one MAP_WF is not started for sample Votintseva2015.F24817 despite its QC being all good; the rest of the run is fine. When I -resume this run (1SzSRdeFTfOLkj agitated_plateau), this sample does get started on the MAP_WF and gets incorporated in subsequent analyses, as it should.

Do you need any other data to check this issue?

@abhi18av
Member Author

abhi18av commented Sep 1, 2022

I think this much should be enough for me to continue investigating this - thanks Tim!

@abhi18av
Member Author

abhi18av commented Sep 11, 2022

@TimHHH, I have tried to reproduce the same issues on my SLURM cluster; however, I haven't seen a caching problem.

Hypothesis

  1. One reason I can think of is related to the file system/storage problems your cluster had recently, because there is otherwise no reason why (i) a sample wouldn't be picked up at the start of the workflow and (ii) the caching wouldn't work across various runs.

     In your position I'd take this forward with the sys-admin team responsible for the file system setup and operations. The node where the head job is running is responsible for maintaining the cache database in the .nextflow/cache folder and writing it out after the workflow has completed.

     From the Nextflow side, we can use the lenient caching behavior (a config sketch follows below this list):

     process.cache = 'lenient'

  2. Another reason might be the version of Nextflow you're using, i.e. 21.10.3, which is somewhat dated. I've used the latest stable series, 22.04.5. Theoretically this shouldn't make a difference, and if it does break the cache then it is a bug in the Nextflow runtime.

     I've not seen this issue from any other community sources; therefore, I consider the NF version being the root cause highly unlikely.

For me, the caching is working as expected and causes zero reruns.
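
As referenced in point 1 above, a minimal nextflow.config sketch of the lenient caching option, applied globally or scoped to individual processes; the withName pattern below is only an example, not one of the pipeline's actual process names:

```nextflow
// nextflow.config -- sketch only; the process name in withName is illustrative.
process {
    cache = 'lenient'            // key the cache on input file path and size, ignoring timestamps

    // Alternatively, restrict the lenient policy to selected processes:
    withName: 'QUANTTB_QUANT' {
        cache = 'lenient'
    }
}
```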


I'm adding below my experimentation notebook logs to keep a history.

Tower workspace (TORCH/XBS-Nextflow)

Run-1 happy_blackwell

Tower link
Pipeline revision quanttb-stats

[2a/abe39d] process > REPORTS_WF:MULTIQC                                                             [100%]
Completed at: 11-Sep-2022 07:44:16
Duration    : 41m 6s
CPU hours   : 10.3 (8.9% cached)
Succeeded   : 253
Cached      : 13

WARN: Unexpected HTTP response.
Failed to send message to https://api.tower.nf -- received
- status code : 400
- response msg: Oops... Unable to process request - Error ID: 1jucjaxx1n2xMajgcaztXJ
- error cause : Oops... Unable to process request - Error ID: 1jucjaxx1n2xMajgcaztXJ

Run-2 modest_lovelace

Tower link
Pipeline revision - quanttb-stats

  • Rerun the pipeline, NO CHANGE using -resume
  • The pipeline resumed until MERGE_WF

Run-3 drunk_bhabha

Tower link
Pipeline revision - quanttb-stats

  • Add -dump-hashes
  • Use NF 22.04.5
  • This time results were cached until MERGE_WF:SNP_ANALYSIS:GATK_SELECT_VARIANTS__SNP

Run-4 infallible_magritte

Tower link
Pipeline revision - quanttb-stats

  • No change at all
  • This time results were cached until MERGE_WF:SNP_ANALYSIS:GATK_SELECT_VARIANTS__SNP
  • The results of previous run were not cached within the OPTIMIZATION workflow

Run-5 scruffy_poincare

Tower link
Pipeline revision - quanttb-stats

  • Disable the optimized recalibration workflow
  • SSH connection to cluster broke due to change of internet connection

Run-6 distraught_crick

Tower link
Pipeline revision quanttb-stats

  • Resumed the previous run
  • Processes NOT cached
    • MERGE_WF:RESISTANCE_ANALYSIS:TBPROFILER_VCF_PROFILE__*
    • MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:GATK_VARIANTS_TO_TABLE
  • Run completed successfully on SLURM and Tower

Run-7 angry_austin

Tower link
Pipeline revision quanttb-stats

  • All processes were cached

Run-8 silly_noyce

Tower link

  • Completed on SLURM, not on Tower
Completed at: 11-Sep-2022 08:56:40
Duration    : 11m 38s
CPU hours   : 10.3 (96.7% cached)
Succeeded   : 30
Cached      : 236

WARN: Unexpected HTTP response.
Failed to send message to https://api.tower.nf -- received
- status code : 400
- response msg: Oops... Unable to process request - Error ID: 3rPI5OdpkSz8nSJ53DsHfn
- error cause : Oops... Unable to process request - Error ID: 3rPI5OdpkSz8nSJ53DsHfn


Run-9 special_gilbert

Tower link
Pipeline revision quanttb-stats

  • No change, simple rerun
  • All processes were cached
  • Completed on SLURM, not on Tower

Run-10 crazy_northcutt

Tower link
Pipeline revision quanttb-stats

  • Did another test for good measure, same results as before
  • No change, simple rerun
  • All processes were cached
  • Completed on SLURM, not on Tower

Run-11 adoring_meitner

Tower link
Pipeline revision quanttb-stats

  • Disable the optimization workflow again
  • Completed on SLURM and Tower
  • All processes were cached

@abhi18av
Member Author

Therefore, from my side: if the overall results from the new workflow are as expected, we can take this PR forward and merge it.

@abhi18av
Member Author

abhi18av commented Sep 13, 2022

Discussion from the meeting (13-09-2022)

  1. We have made the following changes during the call
    a) Add lenient caching strategy a7f924b
    b) Add sleep before and after quanttb process bb26aea
    c) Check against 0 value in filtering out rejected samples 4effce2

  2. @TimHHH to provide a systematic analysis of the caching behavior (runs on resume) and the missed-samples behavior (samples not entering the analysis) on
    a) master branch
    b) quanttb branch

  3. The change of sample status between APPROVED and REJECTED is a different issue, and we need to investigate the filtering behavior more deeply (an illustrative filtering sketch follows this list), though points 2 and 3 might be related.

  4. @abhi18av to try and reproduce these issues on SLURM cluster.

  5. @TimHHH to continue the paper after successive runs of the pipeline on all samples, as samples have been observed to change status for the better.

NOTE: This is somewhat hacky and needs to be done carefully, by saving the results of QUANTTB/COHORT_STATS

  6. @abhi18av to incorporate the samplesheet validation script ([Input samplesheet validation] Sample names with a point can result in accidental sample merge, #102)
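
To make the filtering change in 1(c) and the status-flipping concern in point 3 concrete, here is an illustrative sketch of splitting the QuantTB cohort stats into approved and rejected samples; the channel name and the num_strains column are assumptions, not the actual QuantTB output headers:

```nextflow
// Illustrative sketch only -- channel and column names are assumptions.
quanttb_cohort_stats_ch
    .splitCsv(header: true)
    .branch { row ->
        // treat a missing/0 value as "not a multiple infection" instead of rejecting the sample
        rejected: ((row.num_strains ?: 0) as int) > 1
        approved: true
    }
    .set { samples_by_status }

// samples_by_status.approved / samples_by_status.rejected would then feed the
// BWA_MEM__APPROVED and BWA_MEM__REJECTED branches of MAP_WF respectively.
```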

@TimHHH
Collaborator

TimHHH commented Sep 13, 2022

Run 1: friendly_hopper
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/1pexlGzmLs0FU7
https://github.com/TORCH-Consortium/xbs-nf/commit/4037bc40c6f811c11ec29a3dc9b9ba8a61948f74
initial default run with quanttb branch commits from the 13-09-22 meeting.
manually forced early finish

Results:

Run 2: tender_hoover
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/4vFnDa1pwDOtgF
commit as above
manually forced early finish

-resume for run 1, no changes to default_params.config

Results:

  • 04-rejected-lenient.quanttb_cohort_stats.csv: cached.
  • MAP_WF:BWA_MEM__REJECTED: cached.
  • MAP_WF:BWA_MEM__APPROVED: resuming Votintseva2015.F24817.L1.A1.1.1.1

Run 3: mad_austin
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/1LXhtrGIbXO6Jp
commit as above
manually forced early finish

Fresh run exactly as Run 1 to see if the same pattern persists; updated default.config outdir and vcf names.

Results:

Run 4: sharp_poisson
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/3fAFcf8UuOGD8U
commit as above
manually forced early finish

-resume for run 3, no changes to default_params.config

Results:

  • 05-rejected-lenient.quanttb_cohort_stats.csv: cached.
  • MAP_WF:BWA_MEM__REJECTED: resuming Votintseva2017.L26556-m.L1.A1.1.1.1.
  • MAP_WF:BWA_MEM__APPROVED: cached.

Run 5: sleepy_stonebraker
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/3IllDVjLj5FnS8
master branch https://github.com/TORCH-Consortium/xbs-nf/pull/122
manually forced early finish

Run 6: prickly_archimedes
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/22RoXfMFHCCPpx
independent replication of run 5

Run 7: goofy_agnesi
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/2RwFf4FmGb0FD0
independent replication of run 5

Results runs 4/5/6: all three runs behave as expected and send the 8 approved samples to BWA mapping.

Final remarks
The last three runs show that the master branch behaves correctly; previous runs further confirm this.
As for the multiple infection branch: the first four runs show that quanttb is always properly executed and summarised; it is the subsequent mapping workflow that is started with some degree of randomness in terms of samples. Samples randomly drop out and are not assigned to either workflow, or samples can occasionally end up in the wrong workflow (previous observation). Either way, it seems the -resume runs fix things and correctly analyse the missed or wrongly assigned samples.

@abhi18av @LennertVerboven your opinions / suggestions are welcomed.

@abhi18av
Member Author

As for the multiple infection branch: the first four runs show that quanttb is always properly executed and summarised, it is the subsequent mapping workflow that is started with some degree of randomness in terms of samples.

Hmm, then perhaps to test this we can narrow it down to MAP_WF - let me try and come up with a simpler test. I think this should take us closer to the overall issue - thanks for the effort @TimHHH

@TimHHH
Collaborator

TimHHH commented Sep 15, 2022

I have done some additional runs on the dataset for the publication and made another important observation. This was with the same version/branch.

I started off with this run:
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/4bbqZ8gOigd3C7

And then kept using -resume to finish all the MERGE_WF:RESISTANCE_ANALYSIS:TBPROFILER_VCF_PROFILE__LOFREQ jobs:
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/4bbqZ8gOigd3C7
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/5JNfx1k1ZQq9uL
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/4rJeXrDwaH0mIk

All jobs then successfully finished here:
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/3H1YMeYgWrB3QY

Then for the sake of it I did another bunch of -resume runs:
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/hj7O8d1vkvf5O
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/5aBrE6Z7AJIfvt
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/Tea1ROYpfGbNw
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/YC1LJRWkCbbqM

@abhi18av, as you will see, the number of samples in the MAP_WF:BWA_MEM__APPROVED flow constantly fluctuates, even when I resume the already finished run. This branch of the pipeline therefore does not gradually settle on the ideal run, but rather keeps mucking around with the number of approved samples.

@LennertVerboven
Collaborator

LennertVerboven commented Sep 20, 2022

Hi, I added a Python script here and changed the quanttb cohort stats process here to run the new Python script. The output is two files, approved_samples.quanttb_cohort_stats.tsv and rejected_samples.quanttb_cohort_stats.tsv, but they still need to be integrated into the rest of the pipeline, i.e., they still need to be used to start the proper workflows for the samples. @abhi18av, can you look into this?

The new TSV files have the same layout as the previous quanttb cohort TSV file.
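
For that integration step, a rough sketch of one possible approach, assuming the TSV has a `sample` column that matches the sample names keyed in the reads channel; the column name, file path, and channel names below are assumptions, not the pipeline's actual identifiers:

```nextflow
// Rough sketch only: the `sample` column, file path and channel names are assumptions.
approved_ids_ch = Channel
    .fromPath("results/approved_samples.quanttb_cohort_stats.tsv")
    .splitCsv(header: true, sep: '\t')
    .map { row -> tuple(row.sample, 'approved') }

reads_ch
    .join(approved_ids_ch)                                  // inner join drops samples absent from the TSV
    .map { sampleName, reads, status -> tuple(sampleName, reads) }
    .set { approved_reads_ch }

// MAP_WF(approved_reads_ch)   // only the approved samples would enter the mapping workflow
```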

@abhi18av
Member Author

Thanks @LennertVerboven, I have now accommodated the script in the workflow, and I think we should give it another go on your cluster 🤞

@abhi18av abhi18av changed the title [HOLD] Stats for multiple infection samples and reports workflow Stats for multiple infection samples and reports workflow Sep 26, 2022
@abhi18av abhi18av removed the hold label Sep 26, 2022
@TimHHH TimHHH merged commit 0e937fb into master Sep 27, 2022
@TimHHH TimHHH deleted the quanttb-stats branch September 27, 2022 10:21