Merge pull request #61 from broadinstitute/array_prober_yg
Array prober WDL
MicahR-Y authored Aug 30, 2024
2 parents d507386 + c4c9ba4 commit 700697e
Showing 4 changed files with 161 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .dockstore.yml
@@ -129,3 +129,8 @@ workflows:
primaryDescriptorPath: /PECGS-QUICviz/QUICviz.wdl
testParameterFiles:
- /PECGS-QUICviz/QUICviz.inputs.json
- name: cnvArrayProber
subclass: WDL
primaryDescriptorPath: /CNV_Array_Prober/cnvArrayProber.wdl
testParameterFiles:
- /CNV_Array_Prober/cnvArrayProber.inputs.json
77 changes: 77 additions & 0 deletions CNV_Array_Prober/README.md
@@ -0,0 +1,77 @@
# cnvArrayProber

## Overview

The `cnvArrayProber` workflow analyzes CNV (copy number variation) intervals from a BED file and maps them to probe information from two array support files (CytoSNP-850K and GDA). It generates an XLSX file listing each interval and the number of probes it contains in each array, and a PDF file with plots showing where those probes fall in the CytoSNP-850K and GDA arrays.

This script was developed upon a request from Greg Nakashian in [TAG1994](https://github.com/broadinstitute/TAG/issues/1994).

## Features

1. **Input Processing:**
- **BED File:** The script reads CNV intervals from a specified BED file.
- **Array Support Files:** It also processes two array support files, CytoSNP-850K and GDA, to gather probe information.

2. **Data Analysis:**
- **Interval Analysis:** The [cnvArrayProber](https://dockstore.org/workflows/github.com/broadinstitute/TAG-public/cnvArrayProber:array_prober_yg?tab=info) WDL analyzes the CNV intervals to determine the number of probes from each array (CytoSNP-850K and GDA) that fall within each interval.

3. **Output Generation:**
- **XLSX File:** An XLSX file is generated, containing the CNV intervals and the corresponding count of probes from each array.
- **PDF File:** A PDF file is produced, featuring plots that visually represent the locations of the probes within the CytoSNP and GDA arrays.
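
The interval analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `cnvArrayProber.py`: the function names `index_probes` and `count_probes` are hypothetical, the probe positions are made up, and the interval is assumed to be half-open (`start <= position < end`) following the usual BED convention. The actual script may treat boundaries differently.

```python
from bisect import bisect_left
from collections import defaultdict


def index_probes(probes):
    """Group probe positions by chromosome and sort them for binary search."""
    by_chrom = defaultdict(list)
    for chrom, pos in probes:
        by_chrom[chrom].append(pos)
    return {chrom: sorted(positions) for chrom, positions in by_chrom.items()}


def count_probes(index, chrom, start, end):
    """Count probes with start <= position < end on the given chromosome."""
    positions = index.get(chrom, [])
    return bisect_left(positions, end) - bisect_left(positions, start)


# Toy probe positions (made up for illustration, not real array content).
toy_probes = [
    ("chr4", 143920000),
    ("chr4", 143921000),
    ("chr4", 143950000),
    ("chr4", 144022444),
    ("chr9", 33150000),
]
index = index_probes(toy_probes)
print(count_probes(index, "chr4", 143920938, 144022444))
```

Sorting once per chromosome keeps each interval lookup logarithmic in the number of probes, which matters for arrays with hundreds of thousands of markers.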

## Usage

To use the `cnvArrayProber` WDL, follow these steps:

1. **Prepare Input Files:**
- Ensure you have a BED file containing the CNV intervals.
```
chr2 97220584 130400286
chr4 143920938 144022444
chr9 33140790 33261063
```

The CytoSNP-850K and GDA array support files are available at the following Google Cloud Storage paths. (**Note: Ensure the BED file and both support files use the same genome build.**)


| GDACyto_hg19_SupportFile | CytoSNP_850k_v1_4_hg38_SupportFile | GDACyto_hg38_SupportFile | CytoSNP_850k_v1_4_hg19_SupportFile |
|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------|
| gs://fc-d2c7d48c-9433-4a1f-bdeb-100265b01a63/GDA_SupportFile/GDACyto_20047166_A1.csv | gs://fc-d2c7d48c-9433-4a1f-bdeb-100265b01a63/CytoSNP-850Kv1-4_SupportFile/CytoSNP-850Kv1-4_iScan_B2.csv | gs://fc-d2c7d48c-9433-4a1f-bdeb-100265b01a63/GDA_SupportFile/GDACyto_20047166_A2.csv | gs://fc-d2c7d48c-9433-4a1f-bdeb-100265b01a63/CytoSNP-850Kv1-4_SupportFile/CytoSNP-850Kv1-4_iScan_B1.csv |
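
One way to avoid mixing genome builds is to look up both support files from a single build key. The sketch below copies the paths from the table above; the `SUPPORT_FILES` mapping and the `support_files_for` helper are hypothetical names used only for illustration, and the dictionary keys mirror the workflow input names.

```python
BUCKET = "gs://fc-d2c7d48c-9433-4a1f-bdeb-100265b01a63"

# Paths copied from the table above, grouped by genome build.
SUPPORT_FILES = {
    "hg19": {
        "GDA_Support_Csv": f"{BUCKET}/GDA_SupportFile/GDACyto_20047166_A1.csv",
        "CytoSNP850K_Support_Csv": f"{BUCKET}/CytoSNP-850Kv1-4_SupportFile/CytoSNP-850Kv1-4_iScan_B1.csv",
    },
    "hg38": {
        "GDA_Support_Csv": f"{BUCKET}/GDA_SupportFile/GDACyto_20047166_A2.csv",
        "CytoSNP850K_Support_Csv": f"{BUCKET}/CytoSNP-850Kv1-4_SupportFile/CytoSNP-850Kv1-4_iScan_B2.csv",
    },
}


def support_files_for(build):
    """Return the matching GDA and CytoSNP-850K support files for one build."""
    try:
        return SUPPORT_FILES[build]
    except KeyError:
        known = ", ".join(sorted(SUPPORT_FILES))
        raise ValueError(f"Unknown genome build {build!r}; expected one of: {known}") from None


print(support_files_for("hg38"))
```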



2. **Execute the Workflow:**
- Execute the `cnvArrayProber` WDL with inputs defined by a data table.
- The WDL will process the files and generate the output XLSX and PDF files.
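
If the workflow is run outside a data table, the same required inputs can be supplied as a JSON file. The sketch below assembles one in Python. The input keys follow the workflow's input names; the sample name and BED path are hypothetical placeholders, and `build_inputs` is an illustrative helper rather than part of this repository.

```python
import json

BUCKET = "gs://fc-d2c7d48c-9433-4a1f-bdeb-100265b01a63"


def build_inputs(sample_name, bed_path, cytosnp_csv, gda_csv):
    """Assemble the four required workflow inputs as a JSON-ready dict."""
    return {
        "cnvArrayProber.sampleName": sample_name,
        "cnvArrayProber.cnvBedFile": bed_path,
        "cnvArrayProber.CytoSNP850K_Support_Csv": cytosnp_csv,
        "cnvArrayProber.GDA_Support_Csv": gda_csv,
    }


inputs = build_inputs(
    sample_name="sample01",                      # hypothetical sample name
    bed_path="gs://my-bucket/sample01.cnv.bed",  # hypothetical BED location
    cytosnp_csv=f"{BUCKET}/CytoSNP-850Kv1-4_SupportFile/CytoSNP-850Kv1-4_iScan_B2.csv",
    gda_csv=f"{BUCKET}/GDA_SupportFile/GDACyto_20047166_A2.csv",
)
print(json.dumps(inputs, indent=2))
```

Both support files in this example are the hg38 versions, so the BED file is assumed to be in hg38 coordinates as well.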

3. **Review Outputs:**

- **XLSX File:** An Excel file with the following information in two separate sheets, one each for the CytoSNP-850K and GDA arrays:

CytoSNP850K:

| | left_padding | interval | right_padding |
|:-------------------------|---------------:|-----------:|----------------:|
| chr2:97220584-130400286 | 239 | 8051 | 171 |
| chr4:143920938-144022444 | 0 | 4 | 1 |
| chr9:33140790-33261063 | 2 | 38 | 2 |

GDA:

| | left_padding | interval | right_padding |
|:-------------------------|---------------:|-----------:|----------------:|
| chr2:97220584-130400286 | 897 | 19284 | 683 |
| chr4:143920938-144022444 | 2 | 41 | 1 |
| chr9:33140790-33261063 | 9 | 81 | 18 |


- **PDF File:** A PDF document with plots showing:
- The distribution of CytoSNP-850K probes within each interval.
- The distribution of GDA probes within each interval.
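
The three columns in the tables above can be read as probe counts in a flank to the left of the interval, in the interval itself, and in a flank to the right. The sketch below shows one way such a row could be computed. It is an illustration under stated assumptions, not the script's implementation: the `padding` width is a hypothetical parameter (this README does not state the value the script uses), the ranges are treated as half-open, and the probe positions are made up.

```python
from bisect import bisect_left


def count_in_range(sorted_positions, start, end):
    """Count positions p with start <= p < end in a sorted list."""
    start = max(start, 0)  # a left flank cannot extend before the chromosome start
    return bisect_left(sorted_positions, end) - bisect_left(sorted_positions, start)


def probe_row(sorted_positions, chrom, start, end, padding):
    """Build one table row: an interval label plus left/interval/right counts."""
    label = f"{chrom}:{start}-{end}"
    counts = {
        "left_padding": count_in_range(sorted_positions, start - padding, start),
        "interval": count_in_range(sorted_positions, start, end),
        "right_padding": count_in_range(sorted_positions, end, end + padding),
    }
    return label, counts


# Toy chr9 probe positions (made up), already sorted.
toy_positions = [33139000, 33140000, 33150000, 33200000, 33261063, 33270000]
label, row = probe_row(toy_positions, "chr9", 33140790, 33261063, padding=5000)
print(label, row)
```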


## Development and Contributions


The script was developed by Yueyao Gao ([email protected]) in response to a request from Greg Nakashian in [TAG1994](https://github.com/broadinstitute/TAG/issues/1994). Contributions and further improvements are welcome. Please refer to the TAG repo for more information.

11 changes: 11 additions & 0 deletions CNV_Array_Prober/cnvArrayProber.inputs.json
@@ -0,0 +1,11 @@
{
"cnvArrayProber.cnvArrayProber.cpu": "Int (optional, default = 2)",
"cnvArrayProber.cnvProberDocker": "String (optional, default = \"us.gcr.io/tag-public/cnv-array-prober:0.0.1\")",
"cnvArrayProber.cnvBedFile": "File",
"cnvArrayProber.CytoSNP850K_Support_Csv": "File",
"cnvArrayProber.sampleName": "String",
"cnvArrayProber.GDA_Support_Csv": "File",
"cnvArrayProber.cnvArrayProber.disk_size_gb": "Int (optional, default = 500)",
"cnvArrayProber.cnvArrayProber.memory": "Int (optional, default = 32)"
}

68 changes: 68 additions & 0 deletions CNV_Array_Prober/cnvArrayProber.wdl
@@ -0,0 +1,68 @@
version 1.0

workflow cnvArrayProber {
input{
String sampleName
File cnvBedFile
File CytoSNP850K_Support_Csv
File GDA_Support_Csv
String cnvProberDocker = "us.gcr.io/tag-public/cnv-array-prober:0.0.1"
}
call cnvArrayProber {
input:
sampleName = sampleName,
cnvBedFile = cnvBedFile,
CytoSNP850K_Support_Csv = CytoSNP850K_Support_Csv,
GDA_Support_Csv = GDA_Support_Csv,
cnvProberDocker = cnvProberDocker
}
output{
File cnvProbeAnnotation = cnvArrayProber.cnvProbeAnnotation
File cnvProbePlots = cnvArrayProber.cnvProbePlots
}
meta {
author: "Yueyao Gao"
email: "[email protected]"
description: "This workflow takes a CNV BED file and the CytoSNP-850K and GDA support files as input and outputs an XLSX file with the number of probes in each CNV interval for both arrays. It also outputs a PDF file with plots showing the probe locations in the CytoSNP-850K and GDA arrays for each CNV interval."
}
}

task cnvArrayProber {
input{
String sampleName
File cnvBedFile
File CytoSNP850K_Support_Csv
File GDA_Support_Csv
String cnvProberDocker
Int memory = 32
Int cpu = 2
Int disk_size_gb = 500
Boolean use_ssd = false
Int preemptible = 3
Int maxRetries = 3
}
command <<<
set -e
mkdir output
conda run --no-capture-output \
-n prober_env \
python3 /BaseImage/cnvArrayProber/scripts/cnvArrayProber.py \
-b ~{cnvBedFile} \
-c ~{CytoSNP850K_Support_Csv} \
-g ~{GDA_Support_Csv} \
-o output/~{sampleName}
>>>
output{
File cnvProbeAnnotation = "output/~{sampleName}CNV_Probe_Mappings.xlsx"
File cnvProbePlots = "output/~{sampleName}CNV_Probe_Mappings_Plots.pdf"
}
runtime {
docker: cnvProberDocker
memory: memory + " GB"
cpu: cpu
disks: "local-disk " + disk_size_gb + if use_ssd then " SSD" else " HDD"
preemptible: preemptible
maxRetries: maxRetries
}
}
