
Stats for multiple infection samples and reports workflow #116

Merged

TimHHH merged 30 commits into master from quanttb-stats on Sep 27, 2022

Conversation

abhi18av
Member

This PR addresses:

  • The need to collect various statistics for the samples which fail at the QC stage
  • The need to have a unified report for a researcher's reference (REPORTS_WF)

@abhi18av abhi18av added the hold label Aug 21, 2022
@abhi18av abhi18av changed the title Stats for contaminated samples and reports workflow [HOLD] Stats for contaminated samples and reports workflow Aug 21, 2022
@abhi18av abhi18av self-assigned this Aug 21, 2022
@abhi18av abhi18av changed the title [HOLD] Stats for contaminated samples and reports workflow Stats for contaminated samples and reports workflow Aug 21, 2022
@abhi18av abhi18av added hold and removed hold labels Aug 21, 2022
@abhi18av abhi18av changed the title Stats for contaminated samples and reports workflow [HOLD] Stats for contaminated samples and reports workflow Aug 21, 2022
@abhi18av
Member Author

abhi18av commented Aug 21, 2022

@TimHHH, I've implemented what we discussed in #106. As of now, the failed samples are published in the same parent folder as the normal workflow output, but within a rejected subfolder.

Here's an example of the published stats for samples which didn't pass the QC threshold:

$ tree  -f | grep 'rejected'
    ./libraries/mapped_sequences/mapped_libraries/rejected/Votintseva2015.M6076.L1.A1.1.1.1.sorted_reads.bam 
    ./QC_statistics/coverage/rejected/Votintseva2015.M6076.L1.A1.1.1.1.WgsMetrics.txt 
    ./QC_statistics/mapping/rejected/Votintseva2015.M6076.L1.A1.1.1.1.FlagStat.txt 
    ./QC_statistics/mapping/rejected/Votintseva2015.M6076.L1.A1.1.1.1.SamtoolStats.txt 
    ./samples/mapped_sequences/marked_duplicates/rejected/Votintseva2015.M6076.L1.A1.1.1.1.dedup_reads.bam 
    ./samples/mapped_sequences/marked_duplicates/rejected/Votintseva2015.M6076.L1.A1.1.1.1.dedup_reads.bam.bai 
    ./samples/mapped_sequences/marked_duplicates/rejected/Votintseva2015.M6076.L1.A1.1.1.1.MarkDupMetrics.txt 
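
For reference, here is a minimal sketch of how publishDir can route the failed-QC outputs into such a rejected subfolder; the process name, params.outdir, and paths below are illustrative assumptions, not the actual module code:

```nextflow
// Sketch only: process name, params.outdir and the output pattern are illustrative.
process SAMTOOLS_SORT__REJECTED {
    publishDir "${params.outdir}/libraries/mapped_sequences/mapped_libraries/rejected", mode: 'copy'

    input:
        tuple val(sampleName), path(bam)

    output:
        tuple val(sampleName), path("*.sorted_reads.bam")

    script:
        """
        samtools sort -o ${sampleName}.sorted_reads.bam ${bam}
        """
}
```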

@abhi18av
Member Author

Could you please run this branch and see if it works as expected?

If so, I'd also like to address #99 within this branch, and I'd like to discuss with you in our next meeting which results need to be published in the consolidated report.

.join( GATK_COLLECT_WGS_METRICS.out )
.join( GATK_FLAG_STAT.out )

//TODO: If these stats are needed, only then implement the following processes to accommodate the missing NTM stats
Member Author

@abhi18av abhi18av Aug 21, 2022


//TODO: If these stats are needed, only then implement the following processes to accommodate the missing NTM stats

This could be accommodated better if we convert the shell scripts into proper Python scripts, as documented in #39.

CC @LennertVerboven

@TimHHH
Collaborator

TimHHH commented Aug 29, 2022

We had/have some serious cluster issues, but I am trying this again at the moment.
In the meantime, please note that we should update the name of this process and its variables. This process is not about detecting contamination but about detecting multiple infection. Let's fix this after the test runs.

@TimHHH TimHHH changed the title [HOLD] Stats for contaminated samples and reports workflow [HOLD] Stats for multiple infection samples and reports workflow Aug 29, 2022
@TimHHH
Collaborator

TimHHH commented Sep 1, 2022

@abhi18av
Hi Abhi. I have done the requested check. I ran these 10 samples through this branch and the master:
EarlyMGIT.10Diffidult.Samples.csv
You should have these data lying around if you want to run them yourselves.

On the master branch run 3iYH8lZDKhrJQG small_watson, everything goes according to plan. Once this run finishes I do a -resume and nothing odd happens (5EF6z0qHcbfYbf fabulous_colden), except that the whole MERGE workflow starts again, which was not necessary since nothing changed.

On the quanttb-stats branch run 2rrGyBXeyiWIUs high_volta, however, one MAP_WF is not started for sample Votintseva2015.F24817 despite its QC being all good; the rest of the run is fine. When I -resume this run (1SzSRdeFTfOLkj agitated_plateau), this sample does get started on the MAP_WF and gets incorporated in subsequent analyses, as it should.

Do you need any other data to check this issue?

@abhi18av
Member Author

abhi18av commented Sep 1, 2022

I think this much should be enough for me to continue investigating this - thanks Tim!

@abhi18av
Member Author

abhi18av commented Sep 11, 2022

@TimHHH, I have tried to reproduce the same issues on my SLURM cluster; however, I haven't seen a caching problem.

Hypothesis

  1. One reason I can think of is related to the file system/storage problems your cluster had recently, because there is otherwise no reason why (i) a sample wouldn't be picked up at the start of the workflow and (ii) the caching wouldn't work across various runs.

     In your position I'd take this forward with the sys-admin team responsible for the file system setup and operations. The node where the head job is running is responsible for maintaining the cache database in the .nextflow/cache folder and writing it out after the workflow has completed.

     From the Nextflow side, we can use the lenient caching behavior (a config sketch follows below this list):

     process.cache = 'lenient'

  2. Another reason might be the version of Nextflow you're using, i.e. 21.10.3, which is somewhat dated. I've used the latest stable series, 22.04.5. Theoretically this shouldn't make a difference, and if it does break the cache then it is a bug in the Nextflow runtime.

     I've not seen this issue from any other community sources; therefore, I consider the NF version being the root cause highly unlikely.

For me, the caching is working as expected and causes zero reruns.
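
As referenced in point 1 above, a minimal nextflow.config sketch of the lenient caching option, applied globally or scoped to individual processes; the withName pattern below is only an example, not one of the pipeline's actual process names:

```nextflow
// nextflow.config -- sketch only; the process name in withName is illustrative.
process {
    cache = 'lenient'            // key the cache on input file path and size, ignoring timestamps

    // Alternatively, restrict the lenient policy to selected processes:
    withName: 'QUANTTB_QUANT' {
        cache = 'lenient'
    }
}
```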


I'm adding below my experimentation notebook logs to keep a history.

Tower workspace (TORCH/XBS-Nextflow)

Run-1 happy_blackwell

Tower link
Pipeline revision quanttb-stats

[2a/abe39d] process > REPORTS_WF:MULTIQC                                                             [100%]
Completed at: 11-Sep-2022 07:44:16
Duration    : 41m 6s
CPU hours   : 10.3 (8.9% cached)
Succeeded   : 253
Cached      : 13

WARN: Unexpected HTTP response.
Failed to send message to https://api.tower.nf -- received
- status code : 400
- response msg: Oops... Unable to process request - Error ID: 1jucjaxx1n2xMajgcaztXJ
- error cause : Oops... Unable to process request - Error ID: 1jucjaxx1n2xMajgcaztXJ

Run-2 modest_lovelace

Tower link
Pipeline revision - quanttb-stats

  • Rerun the pipeline, NO CHANGE using -resume
  • The pipeline resumed until MERGE_WF

Run-3 drunk_bhabha

Tower link
Pipeline revision - quanttb-stats

  • Add -dump-hashes
  • Use NF 22.04.5
  • This time results were cached until MERGE_WF:SNP_ANALYSIS:GATK_SELECT_VARIANTS__SNP

Run-4 infallible_magritte

Tower link
Pipeline revision - quanttb-stats

  • No change at all
  • This time results were cached until MERGE_WF:SNP_ANALYSIS:GATK_SELECT_VARIANTS__SNP
  • The results of previous run were not cached within the OPTIMIZATION workflow

Run-5 scruffy_poincare

Tower link
Pipeline revision - quanttb-stats

  • Disable the optimized recalibration workflow
  • SSH connection to cluster broke due to change of internet connection

Run-6 distraught_crick

Tower link
Pipeline revision quanttb-stats

  • Resumed the previous run
  • Processes NOT cached
    • MERGE_WF:RESISTANCE_ANALYSIS:TBPROFILER_VCF_PROFILE__*
    • MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:GATK_VARIANTS_TO_TABLE
  • Run completed successfully on SLURM and Tower

Run-7 angry_austin

Tower link
Pipeline revision quanttb-stats

  • All processes were cached

Run-8 silly_noyce

Tower link

  • Completed on SLURM, not on Tower
Completed at: 11-Sep-2022 08:56:40
Duration    : 11m 38s
CPU hours   : 10.3 (96.7% cached)
Succeeded   : 30
Cached      : 236

WARN: Unexpected HTTP response.
Failed to send message to https://api.tower.nf -- received
- status code : 400
- response msg: Oops... Unable to process request - Error ID: 3rPI5OdpkSz8nSJ53DsHfn
- error cause : Oops... Unable to process request - Error ID: 3rPI5OdpkSz8nSJ53DsHfn


Run-9 special_gilbert

Tower link
Pipeline revision quanttb-stats

  • No change, simple rerun
  • All processes were cached
  • Completed on SLURM, not on Tower

Run-10 crazy_northcutt

Tower link
Pipeline revision quanttb-stats

  • Did another test for good measure, same results as before
  • No change, simple rerun
  • All processes were cached
  • Completed on SLURM, not on Tower

Run-11 adoring_meitner

Tower link
Pipeline revision quanttb-stats

  • Disable the optimization workflow again
  • Completed on SLURM and Tower
  • All processes were cached

@abhi18av
Member Author

Therefore, from my side: if the overall results from the new workflow are as expected, we can take this PR forward and merge it.

@abhi18av
Member Author

abhi18av commented Sep 13, 2022

Discussion from the meeting (13-09-2022)

  1. We have made the following changes during the call
    a) Add lenient caching strategy a7f924b
    b) Add sleep before and after quanttb process bb26aea
    c) Check against 0 value in filtering out rejected samples 4effce2

  2. @TimHHH to provide a systematic analysis of the caching behavior (runs on resume) and the missed-samples behavior (samples not entering the analysis) on
    a) master branch
    b) quanttb branch

  3. The change of sample status between APPROVED and REJECTED is a different issue, and we need to investigate the filtering behavior more deeply (an illustrative filtering sketch follows this list), though points 2 and 3 might be related.

  4. @abhi18av to try and reproduce these issues on SLURM cluster.

  5. @TimHHH to continue the paper after successive runs of the pipeline on all samples, as samples have been observed to change status for the better.

NOTE: This is somewhat hacky and needs to be done carefully, by saving the results of QUANTTB/COHORT_STATS

  6. @abhi18av to incorporate the samplesheet validation script ([Input samplesheet validation] Sample names with a point can result in accidental sample merge, #102)
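
To make the filtering change in 1(c) and the status-flipping concern in point 3 concrete, here is an illustrative sketch of splitting the QuantTB cohort stats into approved and rejected samples; the channel name and the num_strains column are assumptions, not the actual QuantTB output headers:

```nextflow
// Illustrative sketch only -- channel and column names are assumptions.
quanttb_cohort_stats_ch
    .splitCsv(header: true)
    .branch { row ->
        // treat a missing/0 value as "not a multiple infection" instead of rejecting the sample
        rejected: ((row.num_strains ?: 0) as int) > 1
        approved: true
    }
    .set { samples_by_status }

// samples_by_status.approved / samples_by_status.rejected would then feed the
// BWA_MEM__APPROVED and BWA_MEM__REJECTED branches of MAP_WF respectively.
```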

@TimHHH
Collaborator

TimHHH commented Sep 13, 2022

Run 1: friendly_hopper
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/1pexlGzmLs0FU7
https://github.com/TORCH-Consortium/xbs-nf/commit/4037bc40c6f811c11ec29a3dc9b9ba8a61948f74
initial default run with quanttb branch commits from the 13-09-22 meeting.
manually forced early finish

Results:

Run 2: tender_hoover
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/4vFnDa1pwDOtgF
commit as above
manually forced early finish

-resume for run 1, no changes to default_params.config

Results:

  • 04-rejected-lenient.quanttb_cohort_stats.csv: cached.
  • MAP_WF:BWA_MEM__REJECTED: cached.
  • MAP_WF:BWA_MEM__APPROVED: resuming Votintseva2015.F24817.L1.A1.1.1.1

Run 3: mad_austin
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/1LXhtrGIbXO6Jp
commit as above
manually forced early finish

Fresh run exactly as Run 1 to see if the same pattern persists; updated default.config outdir and vcf names.

Results:

Run 4: sharp_poisson
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/3fAFcf8UuOGD8U
commit as above
manually forced early finish

-resume for run 3, no changes to default_params.config

Results:

  • 05-rejected-lenient.quanttb_cohort_stats.csv: cached.
  • MAP_WF:BWA_MEM__REJECTED: resuming Votintseva2017.L26556-m.L1.A1.1.1.1.
  • MAP_WF:BWA_MEM__APPROVED: cached.

Run 5: sleepy_stonebraker
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/3IllDVjLj5FnS8
master branch https://github.com/TORCH-Consortium/xbs-nf/pull/122
manually forced early finish

Run 6: prickly_archimedes
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/22RoXfMFHCCPpx
independent replication of run 5

Run 7: goofy_agnesi
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/2RwFf4FmGb0FD0
independent replication of run 5

Results runs 4/5/6: all three runs behave as expected and send the 8 approved samples to BWA mapping.

Final remarks
The last three runs show that the master branch behaves correctly; previous runs further confirm this.
As for the multiple infection branch: the first four runs show that quanttb is always properly executed and summarised; it is the subsequent mapping workflow that is started with some degree of randomness in terms of samples. Samples randomly drop out and are not assigned to either workflow, or samples can occasionally end up in the wrong workflow (previous observation). Either way, it seems the -resume runs fix things and correctly analyse the missed or wrongly assigned samples.

@abhi18av @LennertVerboven your opinions / suggestions are welcomed.

@abhi18av
Member Author

As for the multiple infection branch: the first four runs show that quanttb is always properly executed and summarised, it is the subsequent mapping workflow that is started with some degree of randomness in terms of samples.

Hmm, then perhaps to test this we can narrow it down to MAP_WF - let me try and come up with a simpler test. I think this should take us closer to the overall issue - thanks for the effort @TimHHH

@TimHHH
Collaborator

TimHHH commented Sep 15, 2022

I have done some additional runs on the dataset for the publication and made another important observation. This was with the same version/branch.

I started off with this run:
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/4bbqZ8gOigd3C7

And then kept using -resume to finish all the MERGE_WF:RESISTANCE_ANALYSIS:TBPROFILER_VCF_PROFILE__LOFREQ jobs:
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/4bbqZ8gOigd3C7
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/5JNfx1k1ZQq9uL
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/4rJeXrDwaH0mIk

All jobs then successfully finished here:
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/3H1YMeYgWrB3QY

Then for the sake of it I did another bunch of -resume runs:
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/hj7O8d1vkvf5O
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/5aBrE6Z7AJIfvt
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/Tea1ROYpfGbNw
https://tower.nf/orgs/TORCH/workspaces/XBS-Nextflow/watch/YC1LJRWkCbbqM

@abhi18av, as you will see, the number of samples in the MAP_WF:BWA_MEM__APPROVED flow constantly fluctuates, even when I resume the already finished run. This branch of the pipeline therefore does not gradually settle on the ideal run, but rather keeps mucking around with the number of approved samples.

@LennertVerboven
Collaborator

LennertVerboven commented Sep 20, 2022

Hi, I added a Python script here and changed the quanttb cohort stats process here to run the new Python script. The output is two files, approved_samples.quanttb_cohort_stats.tsv and rejected_samples.quanttb_cohort_stats.tsv, but they still need to be integrated into the rest of the pipeline, i.e., they still need to be used to start the proper workflows for the samples. @abhi18av, can you look into this?

The new TSV files have the same layout as the previous quanttb cohort TSV file.
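
For that integration step, a rough sketch of one possible approach, assuming the TSV has a `sample` column that matches the sample names keyed in the reads channel; the column name, file path, and channel names below are assumptions, not the pipeline's actual identifiers:

```nextflow
// Rough sketch only: the `sample` column, file path and channel names are assumptions.
approved_ids_ch = Channel
    .fromPath("results/approved_samples.quanttb_cohort_stats.tsv")
    .splitCsv(header: true, sep: '\t')
    .map { row -> tuple(row.sample, 'approved') }

reads_ch
    .join(approved_ids_ch)                                  // inner join drops samples absent from the TSV
    .map { sampleName, reads, status -> tuple(sampleName, reads) }
    .set { approved_reads_ch }

// MAP_WF(approved_reads_ch)   // only the approved samples would enter the mapping workflow
```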

@abhi18av
Member Author

Thanks @LennertVerboven, I have now accommodated the script in the workflow, and I think we should give it another go on your cluster 🤞

@abhi18av abhi18av changed the title [HOLD] Stats for multiple infection samples and reports workflow Stats for multiple infection samples and reports workflow Sep 26, 2022
@abhi18av abhi18av removed the hold label Sep 26, 2022
@TimHHH TimHHH merged commit 0e937fb into master Sep 27, 2022
@TimHHH TimHHH deleted the quanttb-stats branch September 27, 2022 10:21