From db7531cbb931df1a8949fc69af20820eaad92509 Mon Sep 17 00:00:00 2001 From: nainathangaraj Date: Wed, 15 Aug 2018 17:27:43 -0700 Subject: [PATCH 01/10] updated machine spec recommendations --- README.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 628beb3..a00aa3c 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,12 @@ -dx-streaming-upload + +# dx-streaming-upload ========= [![Build Status](https://travis-ci.org/dnanexus-rnd/dx-streaming-upload.svg?branch=master)](https://travis-ci.org/dnanexus-rnd/dx-streaming-upload) The dx-streaming-upload Ansible role packages the streaming upload module for increamentally uploading a RUN directory from an Illumina sequencer onto the DNAnexus platform. -Instruments that this module support include the Illumina MiSeq, NextSeq, HiSeq-2500, HiSeq-4000 and HiSeq-X. +Instruments that this module support include the Illumina MiSeq, NextSeq, HiSeq-2500, HiSeq-4000, HiSeq-X and NovaSeq. Role Variables -------------- @@ -34,7 +35,7 @@ Python 2.7 is needed. This program is not compatible with Python 3.X. Minimal Ansible version: 2.0. -This program is intended for Ubuntu 14.04 (Trusty) and has been tested on the 15.10 (Wily) release. Most features should work on a Ubuntu 12.04 (Precise) system, but this has not been tested to date. +This program is intended for Ubuntu 14.04 (Trusty) and has been tested on the 15.10 (Wily) release. Requirements @@ -45,7 +46,8 @@ More information and tutorials about the DNAnexus platform can be found at the [ The `remote-user` that the role is run against must possess **READ** access to `monitored_folder` and **WRITE** access to disk for logging and temporary storage of tar files. These are typically stored under the `remote-user's` home directory, and is specified in the file `monitor_run_config.template` or as given explicitly by the variables `local_tar_directory` and `local_log_directory`. 
-The machine that this role is deployed to should have at least 500Mb of free RAM available for allocation by the upload module during the time of upload. +The machine that this role is deployed to should have sufficient free memory depending on the throughput of the sequencing instrument. For Novaseq and HiSeqs we recommend a machine with atleast 8 cores, 32 GB of RAM, and 500GB - 1TB of storage. + Example Playbook ---------------- @@ -193,4 +195,4 @@ Apache Author Information ------------------ -DNAnexus (email: support@dnanexus.com) +DNAnexus (email: support@dnanexus.com) \ No newline at end of file From 42d3f644adf84e9797064302628fc02a5f3ca196 Mon Sep 17 00:00:00 2001 From: nainathangaraj Date: Mon, 24 Sep 2018 17:31:47 -0700 Subject: [PATCH 02/10] modified the structure of the ReadMe, and updated installation instructions for Ubuntu and Red Hat --- README.md | 507 +++++++++++++++++++++++++++++++++--------------------- 1 file changed, 310 insertions(+), 197 deletions(-) diff --git a/README.md b/README.md index a00aa3c..582d7cc 100644 --- a/README.md +++ b/README.md @@ -1,198 +1,311 @@ - -# dx-streaming-upload -========= - -[![Build Status](https://travis-ci.org/dnanexus-rnd/dx-streaming-upload.svg?branch=master)](https://travis-ci.org/dnanexus-rnd/dx-streaming-upload) - -The dx-streaming-upload Ansible role packages the streaming upload module for increamentally uploading a RUN directory from an Illumina sequencer onto the DNAnexus platform. - -Instruments that this module support include the Illumina MiSeq, NextSeq, HiSeq-2500, HiSeq-4000, HiSeq-X and NovaSeq. - -Role Variables --------------- -- `mode`: `{deploy, debug}` In the *debug* mode, monitoring cron job is triggered every minute; in *deploy mode*, monitoring cron job is triggered every hour. -- `upload_project`: ID of the DNAnexus project that the RUN folders should be uploaded to. 
The ID is of the form `project-BpyQyjj0Y7V0Gbg7g52Pqf8q` -- `dx_token`: API token for the DNAnexus user to be used for data upload. The API token should give minimally UPLOAD access to the `{{ upload project }}`, or CONTRIBUTE access if `downstream_applet` is specified. Instructions for generating a API token can be found at [DNAnexus wiki](https://wiki.dnanexus.com/UI/API-Tokens). This value is overriden by `dx_user_token` in `monitored_users`. -- `monitored_users`: This is a list of objects, each representing a remote user, with its set of incremental upload parameters. For each `monitored_user`, the following values are accepted - - `username`: (Required) username of the remote user - - `monitored_directories`: (Required) Path to the local directory that should be monitored for RUN folders. Multiple directories can be listed. Suppose that the folder `20160101_M000001_0001_000000000-ABCDE` is the RUN directory, then the folder structure assumed is `{{monitored_dir}}/20160101_M000001_0001_000000000-ABCDE` - - `local_tar_directory`: (Optional) Path to a local folder where tarballs of RUN directory is temporarily stored. User specified in `username` need to have **WRITE** access to this folder. There should be sufficient disk space to accomodate a RUN directory in this location. This overwrites the default found in `templates/monitor_run_config.template`. - - `local_log_directory`: (Optional) Path to a local folder where logs of streaming upload is stored, persistently. User specified in `username` need to have **WRITE** access to this folder. User should not manually manipulate files found in this folder, as the streaming upload code make assumptions that the files in this folder are not manually manipulated. This overwites the default found in `templates/monitor_run_config.template`. - - `run_length`: (Optional) Expected duration of a sequencing run, corresponds to the -D paramter in incremental upload (For example, 24h). Acceptable suffix: s, m, h, d, w, M, y. 
- - `n_seq_intervals`: (Optional) Number of intervals to wait for run to complete. If the sequencing run has not completed within `n_seq_intervals` * `run_length`, it will be deemed as aborted and the program will not attempt to upload it. Corresponds to the -I parameter in incremental upload. - - `n_upload_threads`: (Optional) Number of upload threads used by Upload Agent. For sites with severe upload bandwidth limitations (<100kb/s), it is advised to reduce this to 1, to increase robustness of upload in face of possible network disruptions. Default=8. - - `script`: (Optional) File path to an executable script to be triggered after successful upload for the RUN directory. The script must be executable by the user specified by `username`. The script will be triggered in the with a single command line argument, correpsonding to the filepath of the RUN directory (see section *Example Script*). **If the file path to the script given does not point to a file, or if the file is not executable by the user, then the upload process will not commence.** - - `dx_user_token`: (Optional) API token associated with the specific `monitored_user`. This overrides the value `dx_token`. If `dx_user_token` is not specified, defaults to `dx_token`. - - `applet`: (Optional) ID of a DNAnexus applet to be triggered after successful upload of the RUN directory. This applet's I/O contract should accept a DNAnexus record with the name `upload_sentinel_record` as input. This applet will be triggered with only the `upload_sentinel_record` input. Additional input can be specified using the variable `downstream_input`. **Note that if the specified applet is not located, the upload process will not commence. Mutually exclusive with `workflow`. The role will raise an error and fail if both are specified.** - - `workflow`: (Optional) ID of a DNAnexus workflow to be triggered after successful upload of the RUN directory. 
This workflow's I/O contract should accept a DNAnexus record with the name `upload_sentinel_record` in the 1st stage (stage 0) of the workflow as input. Additional input can be specified using the variable `downstream_input`. **Note that if the specified workflow is not located, the upload process will not commence. Mutually exclusive with `applet`. The role will raise an error and fail if both are specified.** - - `downstream_input`: (Optional) A JSON string, parsable as a python `dict` of `str`:``str`, where the **key** is the input_name recognized by a DNAnexus applet/workflow and the **value** is the corresponding input. For examples and detailed explanation, see section titled `Downstream analysis`. **Note that the role will raise an error and fail if this string is not JSON-parsable as a dict of the expected format** - -**Note** DNAnexus login is persistent and the login environment is stored on disk in the the Ansible user's home directory. User of this playbook responsibility to make sure that every Ansible user (`monitored_user`) with a streaming upload job assigned has been logged into DNAnexus by either specifying a `dx_token` or `dx_user_token`. - -Dependencies ------------- -Python 2.7 is needed. This program is not compatible with Python 3.X. - -Minimal Ansible version: 2.0. - -This program is intended for Ubuntu 14.04 (Trusty) and has been tested on the 15.10 (Wily) release. - - -Requirements ------------- -Users of this module needs a DNAnexus account and its accompanying authentication. To register for a trial account, visit the [DNAnexus homepage](https://dnanexus.com). - -More information and tutorials about the DNAnexus platform can be found at the [DNAnexus wiki page](https://wiki.dnanexus.com). - -The `remote-user` that the role is run against must possess **READ** access to `monitored_folder` and **WRITE** access to disk for logging and temporary storage of tar files. 
These are typically stored under the `remote-user's` home directory, and is specified in the file `monitor_run_config.template` or as given explicitly by the variables `local_tar_directory` and `local_log_directory`. - -The machine that this role is deployed to should have sufficient free memory depending on the throughput of the sequencing instrument. For Novaseq and HiSeqs we recommend a machine with atleast 8 cores, 32 GB of RAM, and 500GB - 1TB of storage. - - -Example Playbook ----------------- -`dx-upload-play.yml` -```YAML ---- -- hosts: localhost - vars: - monitored_users: - - username: travis - local_tar_directory: ~/new_location/upload/TMP - local_log_directory: ~/another_location/upload/LOG - monitored_directories: - - ~/runs - applet: applet-Bq2Kkgj08FqbjV3J8xJ0K3gG - downstream_input: '{"sequencing_center": "CENTER_A"}' - - username: root - monitored_directories: - - ~/home/root/runs - workflow: workflow-BvFz31j0Y7V5QPf09x9y91pF - downstream_input: '{"0.sequencing_center: "CENTER_A"}' - mode: debug - upload_project: project-BpyQyjj0Y7V0Gbg7g52Pqf8q - - roles: - - dx-streaming-upload - -``` - -**Note**: For security reasons, you should refrain from storing the DNAnexus authentication token in a playbook that is open-access. One might trigger the playbook on the command line with extra-vars to supply the necessary authentication token, or store them in a closed-source yaml variable file. - -ie. `ansible-playbook dx-upload-play.yml -i inventory --extra-vars "dx_token="` - -We recommend that the token given is limited in scope to the upload project, and has no higher than **CONTRIBUTE** privileges. - -Example Script --------------- -The following is an example script that writes a flat file to the RUN directory once a RUN directory has been successfully streamed. - -Recall that the script will be triggered with a single command line parameter, where `$1` is the path to the local RUN directory that has been successfully streamed to DNAnexus. 
- -``` -#!/bin/bash - -set -e -x -o pipefail - -rundir="$1" -echo "Completed streaming run directory: $rundir" > "$rundir/COMPLETE.txt" -``` - -Actions performed by Role -------------------------- -The dx-streaming-upload role perform, broadly, the following: - -1. Installs the DNAnexus tools [dx-toolkit](https://wiki.dnanexus.com/Downloads#DNAnexus-Platform-SDK) and [upload agent](https://wiki.dnanexus.com/Downloads#Upload-Agent) on the remote machine. -2. Set up a CRON job that monitors a given directory for RUN directories periodically, and streams the RUN directory into a DNAnexus project, triggering an app(let)/workflow upon successful upload of the directory and a local script (when specified by user) - -Downstream analysis -------------------- -The dx-streaming-upload role can optionally trigger a DNAnexus applet/workflow upon completion of incremental upload. The desired DNAnexus applet or workflow can be specified (at a per `monitored_user` basis) using the Ansible variables `applet` or `workflow` respectively (mutually exclusive, see explanantion of variables for general explanations). - -More information about DNAnexus workflows can be found at the [DNAnexus wiki page](https://wiki.dnanexus.com/API-Specification-v1.0.0/Running-Analyses) - -### Authorization -The downstream analysis (applet or workflow) will be launched in the project into which the RUN directory is uploaded to (`project`). The DNAnexus user / associated `dx_token` or `dx_user_token` must have at least `CONTRIBUTE` access to the aforementioned project for the analysis to be launched successfully. Computational resources are billable and will be billed to the bill-to of the corresponding project. - -### Input and Options -The specified applet/workflow will be triggered using the `run` [API](http://autodoc.dnanexus.com/bindings/python/current/dxpy_apps.html?highlight=applet%20run#dxpy.bindings.dxapplet.DXExecutable.run) in the dxpy tool suite. 
- -For an applet, the `executable_input` hash to the `run` command will be prepopulated with the key-value pair {"`upload_sentinel_record`": `$record_id`} where `$record_id` is the DNAnexus file-id of the sentinel record generated for the uploaded RUN directory (see section titled **Files generated**). - -For a workflow the `executable_input` hash will be prepoluated with the key-value pair {"`0.upload_sentinel_record`": `$record_id`} where `$record_id` is the DNAnexus file-id of the sentinel record generated for the uploaded RUN directory (see section titled **Files generated**). - -**It is the user's responsibility to ensure that the specified applet/workflow has an appropriate input contract which accepts a DNAnexus record with the input name of `upload_sentinel_record`** - -Additional input/options can be specified, statically using the Ansible variable `downstream_input`. This should be provided as a JSON string, parsable, at the top level, as a Python dict of `str` to `str`. - -Example of a properly formatted `downstream_input` for an `applet` -- ```{"input_name1": "value1", "input_name2": "value2"}``` - -Example of a properly formatted `downstream_input` for a `workflow` -- ```{"0.step0_input": "value1", "1.step2_input": "value2"})``` - -*Note the numerical index prefix necessary when specifying input for an `workflow`, which disambiguates which step in the workflow an input is targeted to* - -Files generated ----------------- -We use a hypothetical example of a local RUN folder named `20160101_M000001_0001_000000000-ABCDE`, that was placed into the `monitored_directory`, after the `dx-streaming-upload` role has been set up. 
- -**Local Files Generated** -``` -path/to/LOG/directory -(specified in monitor_run_config.template file) -- 20160101_M000001_0001_000000000-ABCDE.lane.all.log - -path/to/TMP/directory -(specified in monitor_run_config.template file) -- no persistent files (tar files stored transiently, deleted upon successful upload to DNAnexus) -``` - -**Files Streamed to DNAnexus project** -``` -project - └───20160101_M000001_0001_000000000-ABCDE - │───runs - │ │ RunInfo.xml - │ │ SampleSheet.csv - │ │ run.20160101_M000001_0001_000000000-ABCDE.lane.all.log - │ │ run.20160101_M000001_0001_000000000-ABCDE.lane.all.upload_sentinel - │ │ run.20160101_M000001_0001_000000000-ABCDE.lane.all_000.tar.gz - │ │ run.20160101_M000001_0001_000000000-ABCDE.lane.all_001.tar.gz - │ │ ... - │ - └───reads (or analyses) - │ output files from downstream applet (e.g. demx) - │ "reads" folder will be created if an applet is triggered - │ "analyses" folder will be created if a workflow is triggered - │ ... -``` - -The `reads` folder (and subfolders) will only be created if `applet` is specified. -The `analyses` folder (and subfolder) will only be created if `workflow` is specified. - -`RunInfo.xml` and `SampleSheet.csv` will only be upladed if they can be located within the root of the local RUN directory. - -Logging, Notification and Error Handling ------------------------------------------- -**Uploading** - -A log of the CRON command (executed with `bash -e`) is written to the user's home folder `~/dx-stream_cron.log` and can be used to check the top level command triggered. - -The verbose log of the upload process (generated by the top-level `monitor_runs.py`) is written to the user's home folder `~/monitor.log`. - -These logs can be used to diagnose failures of upload from the local machine to DNAnexus. - -**Downstream applet** - -The downstream applet will be run in the project that the RUN directory is uploaded to (as specified in role variable `upload_project`). 
Users can log in to their DNAnexus account (corresponding to the `dx_token` or `dx_user_token`) and navigate to the upload project to monitor the progress of the applet triggered. Typically, on failure of a DNAnexus job, the user will receive a notification email, which will direct the user to check the log of the failed job for further diagnosis and debugging.
-
-License
--------
-
-Apache
-
-Author Information
-------------------
-
+
+[![Build Status](https://travis-ci.org/dnanexus-rnd/dx-streaming-upload.svg?branch=master)](https://travis-ci.org/dnanexus-rnd/dx-streaming-upload)
+
+dx-streaming-upload
+===================
+
+The dx-streaming-upload Ansible role packages the streaming upload module for incrementally uploading a RUN directory from an Illumina sequencer onto the DNAnexus platform.
+
+Instruments supported by this module include the Illumina MiSeq, NextSeq, HiSeq-2500, HiSeq-4000, HiSeq-X and NovaSeq.
+
+
+## Table of Contents
+1. [Dependencies](#dependencies)
+2. [Requirements](#requirements)
+3. [Installation](#installation)
+4. [Examples](#examples)
+5. [Example workflows](#example-workflows)
+6. [Troubleshooting](#troubleshooting)
+
+## Dependencies
+
+Python 2.7 is needed. This program is not compatible with Python 3.X.
+
+Minimal Ansible version: 2.0.
+
+This program is intended for Ubuntu 14.04 (Trusty) and has been tested on the 15.10 (Wily) release.
+
+## Requirements
+
+Users of this module need a DNAnexus account and its accompanying authentication. To register for a trial account, visit the [DNAnexus homepage](https://dnanexus.com).
+
+More information and tutorials about the DNAnexus platform can be found at the [DNAnexus wiki page](https://wiki.dnanexus.com).
+
+The remote-user that the role is run against must possess **READ** access to `monitored_folder` and **WRITE** access to disk for logging and temporary storage of tar files.
These are typically stored under the remote-user's home directory, and are specified in the file `monitor_run_config.template` or given explicitly by the variables `local_tar_directory` and `local_log_directory`.
+
+The machine that this role is deployed to should have sufficient free memory for the throughput of the sequencing instrument. For NovaSeq and HiSeq instruments we recommend a machine with at least 8 cores, 32 GB of RAM, and 500 GB to 1 TB of storage.
+
+## Installation
+##### Using Ubuntu (tested on 14.04/16.04)
+Create a working directory. The `/opt` folder is a reasonable choice; in this example we use `~/dx`:
+```
+mkdir ~/dx
+cd ~/dx
+```
+Install prerequisite -- git:
+```
+sudo apt-get install git
+```
+Install prerequisite -- wget:
+```
+sudo apt-get install wget
+```
+Enable the universe repositories:
+```
+sudo apt-get install software-properties-common
+sudo apt-add-repository universe
+sudo apt-get update
+```
+Install prerequisite -- pip and some essential packages:
+```
+curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
+sudo apt install python2.7
+sudo cp /usr/bin/python2.7 /usr/bin/python
+sudo python get-pip.py
+sudo pip install -U setuptools
+sudo pip install packaging
+sudo apt-get install build-essential -y
+```
+Install Ansible:
+```
+git clone https://github.com/ansible/ansible.git
+cd ansible/
+make
+sudo make install
+```
+Download or move some test sequencing data into the monitored folder (for example, `/opt/seq`).
+Clone the streaming repo:
+```
+cd ~/dx
+git clone https://github.com/dnanexus-rnd/dx-streaming-upload.git
+```
+Create the `dx-upload-play.yml` file inside the dx-streaming-upload folder:
+```YAML
+---
+- hosts: localhost
+  vars:
+    monitored_users:
+      - username: travis
+        local_tar_directory: ~/new_location/upload/TMP
+        local_log_directory: ~/another_location/upload/LOG
+        monitored_directories:
+          - ~/runs
+        applet: applet-Bq2Kkgj08FqbjV3J8xJ0K3gG
+        downstream_input:
'{"sequencing_center": "CENTER_A"}'
+      - username: root
+        monitored_directories:
+          - ~/home/root/runs
+        workflow: workflow-BvFz31j0Y7V5QPf09x9y91pF
+        downstream_input: '{"0.sequencing_center": "CENTER_A"}'
+    mode: debug
+    upload_project: project-BpyQyjj0Y7V0Gbg7g52Pqf8q
+
+  roles:
+    - dx-streaming-upload
+
+```
+Instructions for API token generation can be found at the [DNAnexus wiki](https://wiki.dnanexus.com/UI/API-Tokens).
+Launch the ansible-playbook:
+```
+sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml
+```
+Start the cron daemon if it is not already running:
+```
+sudo cron
+```
+##### Using Red Hat
+Create a working directory. The `/opt` folder is a reasonable choice; in this example we use `~/dx`:
+```
+mkdir ~/dx
+cd ~/dx
+```
+Install prerequisites such as git and wget:
+```
+sudo yum install git -y
+sudo yum install wget -y
+```
+Enable the EPEL repository for RHEL 7.*:
+```
+wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
+sudo rpm -ivh epel-release-latest-7.noarch.rpm
+```
+Install pip and some essential packages:
+```
+sudo yum install python-pip -y
+sudo pip install -U setuptools
+sudo pip install packaging
+sudo yum install gcc gcc-c++ kernel-devel -y
+```
+Install Ansible:
+```
+git clone https://github.com/ansible/ansible.git
+cd ansible/
+make
+sudo make install
+```
+Download or move some test sequencing data into the monitored folder (for example, `/opt/seq`).
+Clone the streaming repo:
+```
+cd ~/dx
+git clone https://github.com/dnanexus-rnd/dx-streaming-upload.git
+```
+Create the `dx-upload-play.yml` file inside the dx-streaming-upload folder:
+```YAML
+---
+- hosts: localhost
+  vars:
+    monitored_users:
+      - username: travis
+        local_tar_directory:
~/new_location/upload/TMP
+        local_log_directory: ~/another_location/upload/LOG
+        monitored_directories:
+          - ~/runs
+        applet: applet-Bq2Kkgj08FqbjV3J8xJ0K3gG
+        downstream_input: '{"sequencing_center": "CENTER_A"}'
+      - username: root
+        monitored_directories:
+          - ~/home/root/runs
+        workflow: workflow-BvFz31j0Y7V5QPf09x9y91pF
+        downstream_input: '{"0.sequencing_center": "CENTER_A"}'
+    mode: debug
+    upload_project: project-BpyQyjj0Y7V0Gbg7g52Pqf8q
+
+  roles:
+    - dx-streaming-upload
+
+```
+Instructions for API token generation can be found at the [DNAnexus wiki](https://wiki.dnanexus.com/UI/API-Tokens).
+Launch the ansible-playbook:
+```
+sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml
+```
+Start the cron daemon if it is not already running:
+```
+sudo cron
+```
+# Examples
+Role Variables
+--------------
+- `mode`: `{deploy, debug}` In *debug* mode, the monitoring cron job is triggered every minute; in *deploy* mode, the monitoring cron job is triggered every hour.
+- `upload_project`: ID of the DNAnexus project that the RUN folders should be uploaded to. The ID is of the form `project-BpyQyjj0Y7V0Gbg7g52Pqf8q`
+- `dx_token`: API token for the DNAnexus user to be used for data upload. The API token should give minimally UPLOAD access to the `{{ upload_project }}`, or CONTRIBUTE access if `downstream_applet` is specified. Instructions for generating an API token can be found at the [DNAnexus wiki](https://wiki.dnanexus.com/UI/API-Tokens). This value is overridden by `dx_user_token` in `monitored_users`.
+- `monitored_users`: This is a list of objects, each representing a remote user with its set of incremental upload parameters. For each `monitored_user`, the following values are accepted
+  - `username`: (Required) username of the remote user
+  - `monitored_directories`: (Required) Path to the local directory that should be monitored for RUN folders. Multiple directories can be listed.
Suppose that the folder `20160101_M000001_0001_000000000-ABCDE` is the RUN directory; then the folder structure assumed is `{{monitored_dir}}/20160101_M000001_0001_000000000-ABCDE`
+  - `local_tar_directory`: (Optional) Path to a local folder where tarballs of the RUN directory are temporarily stored. The user specified in `username` needs to have **WRITE** access to this folder. There should be sufficient disk space to accommodate a RUN directory in this location. This overwrites the default found in `templates/monitor_run_config.template`.
+  - `local_log_directory`: (Optional) Path to a local folder where logs of the streaming upload are stored, persistently. The user specified in `username` needs to have **WRITE** access to this folder. Users should not manually manipulate files found in this folder, as the streaming upload code assumes that the files in this folder are not manually manipulated. This overwrites the default found in `templates/monitor_run_config.template`.
+  - `run_length`: (Optional) Expected duration of a sequencing run; corresponds to the -D parameter in incremental upload (for example, 24h). Acceptable suffixes: s, m, h, d, w, M, y.
+  - `n_seq_intervals`: (Optional) Number of intervals to wait for a run to complete. If the sequencing run has not completed within `n_seq_intervals` * `run_length`, it will be deemed aborted and the program will not attempt to upload it. Corresponds to the -I parameter in incremental upload.
+  - `n_upload_threads`: (Optional) Number of upload threads used by the Upload Agent. For sites with severe upload bandwidth limitations (<100kb/s), it is advised to reduce this to 1, to increase robustness of the upload in the face of possible network disruptions. Default=8.
+  - `script`: (Optional) File path to an executable script to be triggered after successful upload of the RUN directory. The script must be executable by the user specified by `username`.
The script will be triggered with a single command line argument, corresponding to the filepath of the RUN directory (see section *Example Script*). **If the file path to the script given does not point to a file, or if the file is not executable by the user, then the upload process will not commence.**
+  - `dx_user_token`: (Optional) API token associated with the specific `monitored_user`. This overrides the value `dx_token`. If `dx_user_token` is not specified, defaults to `dx_token`.
+  - `applet`: (Optional) ID of a DNAnexus applet to be triggered after successful upload of the RUN directory. This applet's I/O contract should accept a DNAnexus record with the name `upload_sentinel_record` as input. This applet will be triggered with only the `upload_sentinel_record` input. Additional input can be specified using the variable `downstream_input`. **Note that if the specified applet is not located, the upload process will not commence. Mutually exclusive with `workflow`. The role will raise an error and fail if both are specified.**
+  - `workflow`: (Optional) ID of a DNAnexus workflow to be triggered after successful upload of the RUN directory. This workflow's I/O contract should accept a DNAnexus record with the name `upload_sentinel_record` in the first stage (stage 0) of the workflow as input. Additional input can be specified using the variable `downstream_input`. **Note that if the specified workflow is not located, the upload process will not commence. Mutually exclusive with `applet`. The role will raise an error and fail if both are specified.**
+  - `downstream_input`: (Optional) A JSON string, parsable as a Python `dict` of `str`:`str`, where the **key** is the input_name recognized by a DNAnexus applet/workflow and the **value** is the corresponding input. For examples and a detailed explanation, see the section titled `Downstream analysis`.
**Note that the role will raise an error and fail if this string is not JSON-parsable as a dict of the expected format.**
+
+**Note** DNAnexus login is persistent and the login environment is stored on disk in the Ansible user's home directory. It is the responsibility of the user of this playbook to make sure that every Ansible user (`monitored_user`) with a streaming upload job assigned has been logged into DNAnexus by specifying either a `dx_token` or `dx_user_token`.
+
+Example Playbook
+----------------
+`dx-upload-play.yml`
+```YAML
+---
+- hosts: localhost
+  vars:
+    monitored_users:
+      - username: travis
+        local_tar_directory: ~/new_location/upload/TMP
+        local_log_directory: ~/another_location/upload/LOG
+        monitored_directories:
+          - ~/runs
+        applet: applet-Bq2Kkgj08FqbjV3J8xJ0K3gG
+        downstream_input: '{"sequencing_center": "CENTER_A"}'
+      - username: root
+        monitored_directories:
+          - ~/home/root/runs
+        workflow: workflow-BvFz31j0Y7V5QPf09x9y91pF
+        downstream_input: '{"0.sequencing_center": "CENTER_A"}'
+    mode: debug
+    upload_project: project-BpyQyjj0Y7V0Gbg7g52Pqf8q
+  roles:
+    - dx-streaming-upload
+```
+
+**Note**: For security reasons, you should refrain from storing the DNAnexus authentication token in a playbook that is open-access. One might trigger the playbook on the command line with extra-vars to supply the necessary authentication token, or store it in a closed-source YAML variable file,
+e.g. `ansible-playbook dx-upload-play.yml -i inventory --extra-vars "dx_token="`
+
+We recommend that the token given is limited in scope to the upload project, and has no higher than **CONTRIBUTE** privileges.
+
+Example Script
+--------------
+The following is an example script that writes a flat file to the RUN directory once a RUN directory has been successfully streamed.
+Recall that the script will be triggered with a single command line parameter, where `$1` is the path to the local RUN directory that has been successfully streamed to DNAnexus.
+```
+#!/bin/bash
+
+set -e -x -o pipefail
+
+rundir="$1"
+echo "Completed streaming run directory: $rundir" > "$rundir/COMPLETE.txt"
+```
+
+Actions performed by Role
+-------------------------
+The dx-streaming-upload role performs, broadly, the following:
+1. Installs the DNAnexus tools [dx-toolkit](https://wiki.dnanexus.com/Downloads#DNAnexus-Platform-SDK) and [upload agent](https://wiki.dnanexus.com/Downloads#Upload-Agent) on the remote machine.
+2. Sets up a CRON job that periodically monitors a given directory for RUN directories, streams each RUN directory into a DNAnexus project, and triggers an app(let)/workflow upon successful upload of the directory, as well as a local script (when specified by the user).
+
+Downstream analysis
+-------------------
+The dx-streaming-upload role can optionally trigger a DNAnexus applet/workflow upon completion of incremental upload. The desired DNAnexus applet or workflow can be specified (on a per `monitored_user` basis) using the Ansible variables `applet` or `workflow` respectively (mutually exclusive; see the explanation of variables above for details).
+More information about DNAnexus workflows can be found at the [DNAnexus wiki page](https://wiki.dnanexus.com/API-Specification-v1.0.0/Running-Analyses)
+
+### Authorization
+The downstream analysis (applet or workflow) will be launched in the project into which the RUN directory is uploaded (`upload_project`). The DNAnexus user / associated `dx_token` or `dx_user_token` must have at least `CONTRIBUTE` access to the aforementioned project for the analysis to be launched successfully. Computational resources are billable and will be billed to the bill-to of the corresponding project.
+
+### Input and Options
+The specified applet/workflow will be triggered using the `run` [API](http://autodoc.dnanexus.com/bindings/python/current/dxpy_apps.html?highlight=applet%20run#dxpy.bindings.dxapplet.DXExecutable.run) in the dxpy tool suite.
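As a concrete sketch of the input hash handed to `run`, the following hypothetical helper (not code shipped with the role) assembles the `executable_input` described in this section, assuming the sentinel record ID is passed as a plain string:

```python
import json

def build_executable_input(record_id, downstream_input=None, is_workflow=False):
    """Assemble the executable_input hash for the downstream analysis.

    Hypothetical illustration: workflows need the '0.' stage prefix on the
    sentinel-record key; applets take the bare key. Extra inputs come from
    the downstream_input JSON string of str -> str pairs.
    """
    key = "0.upload_sentinel_record" if is_workflow else "upload_sentinel_record"
    exec_input = {key: record_id}
    if downstream_input:
        # downstream_input is the role variable: a JSON string parsed to a dict
        exec_input.update(json.loads(downstream_input))
    return exec_input
```

The role would then pass such a hash to the applet's or workflow's `run` call via dxpy.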
+For an applet, the `executable_input` hash to the `run` command will be prepopulated with the key-value pair {"`upload_sentinel_record`": `$record_id`} where `$record_id` is the DNAnexus file-id of the sentinel record generated for the uploaded RUN directory (see section titled **Files generated**).
+For a workflow, the `executable_input` hash will be prepopulated with the key-value pair {"`0.upload_sentinel_record`": `$record_id`} where `$record_id` is the DNAnexus file-id of the sentinel record generated for the uploaded RUN directory (see section titled **Files generated**).
+**It is the user's responsibility to ensure that the specified applet/workflow has an appropriate input contract which accepts a DNAnexus record with the input name of `upload_sentinel_record`**
+Additional input/options can be specified statically using the Ansible variable `downstream_input`. This should be provided as a JSON string, parsable, at the top level, as a Python dict of `str` to `str`.
+Example of a properly formatted `downstream_input` for an `applet`
+- ```{"input_name1": "value1", "input_name2": "value2"}```
+Example of a properly formatted `downstream_input` for a `workflow`
+- ```{"0.step0_input": "value1", "1.step2_input": "value2"}```
+*Note the numerical index prefix necessary when specifying input for a `workflow`, which disambiguates which step in the workflow an input is targeted at*
+Files generated
+----------------
+We use a hypothetical example of a local RUN folder named `20160101_M000001_0001_000000000-ABCDE` that was placed into the `monitored_directory` after the `dx-streaming-upload` role has been set up.
+**Local Files Generated**
+```
+path/to/LOG/directory
+(specified in monitor_run_config.template file)
+- 20160101_M000001_0001_000000000-ABCDE.lane.all.log
+path/to/TMP/directory
+(specified in monitor_run_config.template file)
+- no persistent files (tar files stored transiently, deleted upon successful upload to DNAnexus)
+```
+**Files Streamed to DNAnexus project**
+```
+project
+ └───20160101_M000001_0001_000000000-ABCDE
+ │───runs
+ │ │ RunInfo.xml
+ │ │ SampleSheet.csv
+ │ │ run.20160101_M000001_0001_000000000-ABCDE.lane.all.log
+ │ │ run.20160101_M000001_0001_000000000-ABCDE.lane.all.upload_sentinel
+ │ │ run.20160101_M000001_0001_000000000-ABCDE.lane.all_000.tar.gz
+ │ │ run.20160101_M000001_0001_000000000-ABCDE.lane.all_001.tar.gz
+ │ │ ...
+ │
+ └───reads (or analyses)
+ │ output files from downstream applet (e.g. demux)
+ │ "reads" folder will be created if an applet is triggered
+ │ "analyses" folder will be created if a workflow is triggered
+ │ ...
+```
+The `reads` folder (and subfolders) will only be created if `applet` is specified.
+The `analyses` folder (and subfolders) will only be created if `workflow` is specified.
+`RunInfo.xml` and `SampleSheet.csv` will only be uploaded if they can be located within the root of the local RUN directory.
+Logging, Notification and Error Handling
+------------------------------------------
+**Uploading**
+A log of the CRON command (executed with `bash -e`) is written to the user's home folder `~/dx-stream_cron.log` and can be used to check the top-level command triggered.
+The verbose log of the upload process (generated by the top-level `monitor_runs.py`) is written to the user's home folder `~/monitor.log`.
+These logs can be used to diagnose failures of upload from the local machine to DNAnexus.
+**Downstream applet**
+The downstream applet will be run in the project that the RUN directory is uploaded to (as specified in role variable `upload_project`).
Users can log in to their DNAnexus account (corresponding to the `dx_token` or `dx_user_token`) and navigate to the upload project to monitor the progress of the applet triggered. Typically, on failure of a DNAnexus job, the user will receive a notification email, which will direct the user to check the log of the failed job for further diagnosis and debugging. +## Troubleshooting +License +------- +Apache +Author Information +------------------ DNAnexus (email: support@dnanexus.com) \ No newline at end of file From 1655ab56643a0866d5879c69159abccd0f819d6e Mon Sep 17 00:00:00 2001 From: nainathangaraj Date: Mon, 24 Sep 2018 17:34:40 -0700 Subject: [PATCH 03/10] better formatting of code blocks --- README.md | 90 +++++++++++++++++++++++++------------------------------ 1 file changed, 40 insertions(+), 50 deletions(-) diff --git a/README.md b/README.md index 582d7cc..b477160 100644 --- a/README.md +++ b/README.md @@ -38,7 +38,8 @@ The machine that this role is deployed to should have sufficient free memory dep ## Installation ##### Using Ubuntu (tested on 14.04/16.04) Create a working directory, to do so, please select the /opt folder as working directory (in our case we will use ~/dx) -```mkdir ~/dx +``` +mkdir ~/dx cd ~/dx ``` Install prerequisite -- git @@ -49,7 +50,7 @@ Install prerequisite -- wget ``` sudo apt-get install wget ``` -Enable EPEL Repository for RH 7.* or universe repositories for Ubuntu +Enable universe repositories ``` sudo apt-get install software-properties-common sudo apt-add-repository universe @@ -106,47 +107,48 @@ Create dx-upload-play.yml file inside the dx-streaming-folder ``` Here are the instructions for token generation. 
Launch the ansible-playbook
-sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml
+```sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml ```
 Give the right permission to cron
-sudo cron (U)
+```sudo cron```
 ##### Using RedHat (tested on)
 Create a working directory, to do so, please select the /opt folder as working directory (in our case we will use ~/dx)
-```mkdir ~/dx
-cd ~/dx```
-Install prerequisite such as git, wget etc
-```sudo yum install git -y (Red Hat)
-
-```
-sudo yum install wget -y (RH)
-sudo apt-get install wget (U)
-Enable EPEL Repository for RH 7.* or universe repositories for Ubuntu
-wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm (RH)
-sudo rpm -ivh epel-release-latest-7.noarch.rpm (RH)
-sudo apt-get install software-properties-common (U)
-sudo apt-add-repository universe (U)
-sudo apt-get update (U)
-pip
-sudo yum install python-pip -y (RH)
-curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py (U)
-sudo apt install python2.7
+```
+mkdir ~/dx
+cd ~/dx
+```
+Install prerequisite -- git
+```
+sudo yum install git -y
+```
+Install prerequisite -- wget
+```
+sudo yum install wget -y
+```
+Enable EPEL Repository for RH 7.*
+```
+wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
+sudo rpm -ivh epel-release-latest-7.noarch.rpm
+```
+Install prerequisite -- pip and some essential packages
+```
+sudo yum install python-pip -y
 sudo cp /usr/bin/python2.7 /usr/bin/python
-sudo python get-pip.py (U)
-pip packages
 sudo pip install -U setuptools
 sudo pip install packaging
 make
-sudo yum install gcc gcc-c++ kernel-devel -y (RH)
-sudo apt-get install build-essential -y (U)
+sudo yum install gcc gcc-c++ kernel-devel -y
+```
 Install ansible
+```
 git clone https://github.com/ansible/ansible.git
 cd ansible/
 make
 sudo make install
-Download test sequencing data in /opt/seq folder --
+```
+Please download or move some test sequencing data in /opt/seq folder
 Clone streaming repo
+```
 cd ~/dx
 git clone
https://github.com/dnanexus-rnd/dx-streaming-upload.git +``` Create dx-upload-play.yml file inside the dx-streaming-folder `dx-upload-play.yml` ```YAML @@ -177,12 +179,8 @@ Here are the [instructions]() for token generation. Launch the ansible-playbook ```sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml ``` -Give the right permission to cron -```sudo cron (U) -``` -# Examples -Role Variables --------------- +## Examples +##### Role Variables - `mode`: `{deploy, debug}` In the *debug* mode, monitoring cron job is triggered every minute; in *deploy mode*, monitoring cron job is triggered every hour. - `upload_project`: ID of the DNAnexus project that the RUN folders should be uploaded to. The ID is of the form `project-BpyQyjj0Y7V0Gbg7g52Pqf8q` - `dx_token`: API token for the DNAnexus user to be used for data upload. The API token should give minimally UPLOAD access to the `{{ upload project }}`, or CONTRIBUTE access if `downstream_applet` is specified. Instructions for generating a API token can be found at [DNAnexus wiki](https://wiki.dnanexus.com/UI/API-Tokens). This value is overriden by `dx_user_token` in `monitored_users`. @@ -200,8 +198,7 @@ Role Variables - `workflow`: (Optional) ID of a DNAnexus workflow to be triggered after successful upload of the RUN directory. This workflow's I/O contract should accept a DNAnexus record with the name `upload_sentinel_record` in the 1st stage (stage 0) of the workflow as input. Additional input can be specified using the variable `downstream_input`. **Note that if the specified workflow is not located, the upload process will not commence. Mutually exclusive with `applet`. The role will raise an error and fail if both are specified.** - `downstream_input`: (Optional) A JSON string, parsable as a python `dict` of `str`:``str`, where the **key** is the input_name recognized by a DNAnexus applet/workflow and the **value** is the corresponding input. 
For examples and detailed explanation, see section titled `Downstream analysis`. **Note that the role will raise an error and fail if this string is not JSON-parsable as a dict of the expected format** **Note** DNAnexus login is persistent and the login environment is stored on disk in the the Ansible user's home directory. User of this playbook responsibility to make sure that every Ansible user (`monitored_user`) with a streaming upload job assigned has been logged into DNAnexus by either specifying a `dx_token` or `dx_user_token`. -Example Playbook ----------------- +##### Example Playbook `dx-upload-play.yml` ```YAML --- @@ -228,8 +225,7 @@ Example Playbook **Note**: For security reasons, you should refrain from storing the DNAnexus authentication token in a playbook that is open-access. One might trigger the playbook on the command line with extra-vars to supply the necessary authentication token, or store them in a closed-source yaml variable file. ie. `ansible-playbook dx-upload-play.yml -i inventory --extra-vars "dx_token="` We recommend that the token given is limited in scope to the upload project, and has no higher than **CONTRIBUTE** privileges. -Example Script --------------- +##### Example Script The following is an example script that writes a flat file to the RUN directory once a RUN directory has been successfully streamed. Recall that the script will be triggered with a single command line parameter, where `$1` is the path to the local RUN directory that has been successfully streamed to DNAnexus. ``` @@ -238,13 +234,11 @@ set -e -x -o pipefail rundir="$1" echo "Completed streaming run directory: $rundir" > "$rundir/COMPLETE.txt" ``` -Actions performed by Role -------------------------- +##### Actions performed by Role The dx-streaming-upload role perform, broadly, the following: 1. 
Installs the DNAnexus tools [dx-toolkit](https://wiki.dnanexus.com/Downloads#DNAnexus-Platform-SDK) and [upload agent](https://wiki.dnanexus.com/Downloads#Upload-Agent) on the remote machine. 2. Set up a CRON job that monitors a given directory for RUN directories periodically, and streams the RUN directory into a DNAnexus project, triggering an app(let)/workflow upon successful upload of the directory and a local script (when specified by user) -Downstream analysis -------------------- +##### Downstream analysis The dx-streaming-upload role can optionally trigger a DNAnexus applet/workflow upon completion of incremental upload. The desired DNAnexus applet or workflow can be specified (at a per `monitored_user` basis) using the Ansible variables `applet` or `workflow` respectively (mutually exclusive, see explanantion of variables for general explanations). More information about DNAnexus workflows can be found at the [DNAnexus wiki page](https://wiki.dnanexus.com/API-Specification-v1.0.0/Running-Analyses) ### Authorization @@ -260,8 +254,7 @@ Example of a properly formatted `downstream_input` for an `applet` Example of a properly formatted `downstream_input` for a `workflow` - ```{"0.step0_input": "value1", "1.step2_input": "value2"})``` *Note the numerical index prefix necessary when specifying input for an `workflow`, which disambiguates which step in the workflow an input is targeted to* -Files generated ----------------- +##### Files generated We use a hypothetical example of a local RUN folder named `20160101_M000001_0001_000000000-ABCDE`, that was placed into the `monitored_directory`, after the `dx-streaming-upload` role has been set up. **Local Files Generated** ``` @@ -294,8 +287,7 @@ project The `reads` folder (and subfolders) will only be created if `applet` is specified. The `analyses` folder (and subfolder) will only be created if `workflow` is specified. 
`RunInfo.xml` and `SampleSheet.csv` will only be upladed if they can be located within the root of the local RUN directory. -Logging, Notification and Error Handling ------------------------------------------- +##### Logging, Notification and Error Handling **Uploading** A log of the CRON command (executed with `bash -e`) is written to the user's home folder `~/dx-stream_cron.log` and can be used to check the top level command triggered. The verbose log of the upload process (generated by the top-level `monitor_runs.py`) is written to the user's home folder `~/monitor.log`. @@ -303,9 +295,7 @@ These logs can be used to diagnose failures of upload from the local machine to **Downstream applet** The downstream applet will be run in the project that the RUN directory is uploaded to (as specified in role variable `upload_project`). Users can log in to their DNAnexus account (corresponding to the `dx_token` or `dx_user_token`) and navigate to the upload project to monitor the progress of the applet triggered. Typically, on failure of a DNAnexus job, the user will receive a notification email, which will direct the user to check the log of the failed job for further diagnosis and debugging. ## Troubleshooting -License -------- +##### License Apache -Author Information ------------------- +##### Author Information DNAnexus (email: support@dnanexus.com) \ No newline at end of file From 766699b5abbfcdf387d54d447ec87e24ac5b120e Mon Sep 17 00:00:00 2001 From: nainathangaraj Date: Mon, 24 Sep 2018 17:37:17 -0700 Subject: [PATCH 04/10] better formatting of code blocks part 2 --- README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index b477160..8434fe8 100644 --- a/README.md +++ b/README.md @@ -107,9 +107,13 @@ Create dx-upload-play.yml file inside the dx-streaming-folder ``` Here are the instructions for token generation. 
Launch the ansible-playbook -```sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml ``` +``` +sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml +``` Give the right permission to cron -```sudo cron``` +``` +sudo cron +``` ##### Using RedHat (tested on) Create a working directory, to do so, please select the /opt folder as working directory (in our case we will use ~/dx) ``` From 104e2e296c2c8f9544e91bcdf11cadfe93911e52 Mon Sep 17 00:00:00 2001 From: nainathangaraj Date: Mon, 24 Sep 2018 17:39:37 -0700 Subject: [PATCH 05/10] included a link for auth token --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 8434fe8..ff809f9 100644 --- a/README.md +++ b/README.md @@ -179,9 +179,10 @@ Create dx-upload-play.yml file inside the dx-streaming-folder - dx-streaming-upload ``` -Here are the [instructions]() for token generation. +Here are the [instructions](https://wiki.dnanexus.com/Command-Line-Client/Login-and-Logout#Generating-an-authentication-token) for token generation. Launch the ansible-playbook -```sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml +``` +sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml ``` ## Examples ##### Role Variables From ecbf67e89908653c6f3fd525e4265a31f92f48d7 Mon Sep 17 00:00:00 2001 From: nainathangaraj Date: Tue, 25 Sep 2018 09:57:33 -0700 Subject: [PATCH 06/10] cleaned up some steps for the installation --- README.md | 23 ++++++++--------------- 1 file changed, 8 insertions(+), 15 deletions(-) diff --git a/README.md b/README.md index ff809f9..5e1bfd8 100644 --- a/README.md +++ b/README.md @@ -37,17 +37,14 @@ The machine that this role is deployed to should have sufficient free memory dep ## Installation ##### Using Ubuntu (tested on 14.04/16.04) -Create a working directory, to do so, please select the /opt folder as working directory (in our case we will use ~/dx) +Create a working directory. 
Please select the /opt folder as working directory (in our case we are using use ~/dx) ``` mkdir ~/dx cd ~/dx ``` -Install prerequisite -- git +Install prerequisites ``` sudo apt-get install git -``` -Install prerequisite -- wget -``` sudo apt-get install wget ``` Enable universe repositories @@ -56,13 +53,12 @@ sudo apt-get install software-properties-common sudo apt-add-repository universe sudo apt-get update ``` -Install prerequisite -- pip and some essential packages +Install pip and some essential packages ``` curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py sudo apt install python2.7 sudo cp /usr/bin/python2.7 /usr/bin/python sudo python get-pip.py -pip packages sudo pip install -U setuptools sudo pip install packaging make @@ -73,7 +69,7 @@ cd ansible/ make sudo make install ``` -Please download or move some test sequencing data in /opt/seq folder +Download or move some test sequencing data in /opt/seq folder Clone streaming repo ``` cd ~/dx @@ -120,19 +116,16 @@ Create a working directory, to do so, please select the /opt folder as working d mkdir ~/dx cd ~/dx ``` -Install prerequisite -- git +Install prerequisites ``` sudo yum install git -y -``` -Install prerequisite -- wget -``` sudo yum install wget -y ``` Enable EPEL Repository for RH 7.* ``` wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm sudo rpm -ivh epel-release-latest-7.noarch.rpm -Install prerequisite -- pip and some essential packages +Install pip and some essential packages sudo yum install python-pip -y sudo cp /usr/bin/python2.7 /usr/bin/python sudo pip install -U setuptools @@ -147,8 +140,8 @@ cd ansible/ make sudo make install ``` -Please download or move some test sequencing data in /opt/seq folder -Clone streaming repo +Download or move some test sequencing data in /opt/seq folder +Clone streaming repository ``` cd ~/dx git clone https://github.com/dnanexus-rnd/dx-streaming-upload.git From 7a8fa2c4e618b96e994ef8310169ed47ee4175a6 Mon Sep 17 00:00:00 2001 
From: nainathangaraj
Date: Tue, 25 Sep 2018 19:57:34 -0700
Subject: [PATCH 07/10] added troubleshooting steps

---
 README.md | 104 ++++++++++++++++++++++++------------------------------
 1 file changed, 46 insertions(+), 58 deletions(-)

diff --git a/README.md b/README.md
index 5e1bfd8..c8fd453 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Instruments that this module support include the Illumina MiSeq, NextSeq, HiSeq-
 ## Table of Contents
 1. [Dependencies](#dependencies)
-2. [Requirements](#requirements)
+2. [Requirements](#requirements)  
 3. [Installation](#installation)
 4. [Examples](#examples)
 5. [Example workflows](#example-workflows)
@@ -23,21 +23,21 @@ Python 2.7 is needed. This program is not compatible with Python 3.X.
 
 Minimal Ansible version: 2.0.
 
-This program is intended for Ubuntu 14.04 (Trusty) and has been tested on the 15.10 (Wily) release.
+This program is intended for Ubuntu 14.04 and 16.04, and has been tested on Red Hat 7.4/7.5 and OLE (Oracle Linux Enterprise) 7. It has not been tested on any other versions.
 
 ## Requirements
-Users of this module needs a DNAnexus account and its accompanying authentication. To register for a trial account, visit the DNAnexus homepage.
+Users of this module need a DNAnexus account and its accompanying authentication. To register for a trial account, visit the [DNAnexus homepage](https://platform.dnanexus.com/register).
 
-More information and tutorials about the DNAnexus platform can be found at the DNAnexus wiki page.
+More information and tutorials about the DNAnexus platform can be found at the [DNAnexus wiki page](https://wiki.dnanexus.com/Home).
 
 The remote-user that the role is run against must possess READ access to monitored_folder and WRITE access to disk for logging and temporary storage of tar files.
These are typically stored under the remote-user's home directory, and are specified in the file monitor_run_config.template or as given explicitly by the variables local_tar_directory and local_log_directory.
 
 The machine that this role is deployed to should have sufficient free memory depending on the throughput of the sequencing instrument. For NovaSeq and HiSeqs we recommend a machine with at least 8 cores, 32 GB of RAM, and 500GB - 1TB of storage.
 
 ## Installation
-##### Using Ubuntu (tested on 14.04/16.04)
-Create a working directory. Please select the /opt folder as working directory (in our case we are using use ~/dx)
+#### Using Ubuntu (tested on 14.04/16.04)
+Create a working directory. Select the /opt folder as working directory (in our case we are using ~/dx)
 ```
 mkdir ~/dx
 cd ~/dx
@@ -75,31 +75,24 @@ Clone streaming repo
 cd ~/dx
 git clone https://github.com/dnanexus-rnd/dx-streaming-upload.git
 ```
-Create dx-upload-play.yml file inside the dx-streaming-folder
+Create dx-upload-play.yml file inside the dx-streaming-folder.
+### dx-upload-play.yml
 `dx-upload-play.yml`
 ```YAML
 ---
 - hosts: localhost
   vars:
     monitored_users:
-      - username: travis
-        local_tar_directory: ~/new_location/upload/TMP
-        local_log_directory: ~/another_location/upload/LOG
-        monitored_directories:
-          - ~/runs
-        applet: applet-Bq2Kkgj08FqbjV3J8xJ0K3gG
-        downstream_input: '{"sequencing_center": "CENTER_A"}'
       - username: root
+        local_tar_directory: /opt/tmp
+        local_log_directory: /opt/log
         monitored_directories:
-          - ~/home/root/runs
-        workflow: workflow-BvFz31j0Y7V5QPf09x9y91pF
-        downstream_input: '{"0.sequencing_center": "CENTER_A"}'
+          - /opt/seq
+        dx_user_token: 
     mode: debug
-    upload_project: project-BpyQyjj0Y7V0Gbg7g52Pqf8q
-
+    upload_project: project-id
   roles:
     - dx-streaming-upload
 ```
 Here are the instructions for token generation.
 Launch the ansible-playbook
@@ -110,8 +103,8 @@ Give the right permission to cron
 ```
 sudo cron
 ```
-##### Using RedHat (tested on)
-Create a working directory, to do so, please select the /opt folder as working directory (in our case we will use ~/dx)
+#### Using RedHat (tested on 7.4 and 7.5)
+Create a working directory. Select the /opt folder as working directory (in our case we are using ~/dx)
 ```
 mkdir ~/dx
 cd ~/dx
@@ -146,39 +139,14 @@ Clone streaming repository
 ```
 cd ~/dx
 git clone https://github.com/dnanexus-rnd/dx-streaming-upload.git
 ```
-Create dx-upload-play.yml file inside the dx-streaming-folder
-`dx-upload-play.yml`
-```YAML
----
-- hosts: localhost
-  vars:
-    monitored_users:
-      - username: travis
-        local_tar_directory: ~/new_location/upload/TMP
-        local_log_directory: ~/another_location/upload/LOG
-        monitored_directories:
-          - ~/runs
-        applet: applet-Bq2Kkgj08FqbjV3J8xJ0K3gG
-        downstream_input: '{"sequencing_center": "CENTER_A"}'
-      - username: root
-        monitored_directories:
-          - ~/home/root/runs
-        workflow: workflow-BvFz31j0Y7V5QPf09x9y91pF
-        downstream_input: '{"0.sequencing_center": "CENTER_A"}'
-    mode: debug
-    upload_project: project-BpyQyjj0Y7V0Gbg7g52Pqf8q
-
-  roles:
-    - dx-streaming-upload
-
-```
+Create dx-upload-play.yml file inside the dx-streaming-folder. Example given [here](#dx-upload-play.yml)
 Here are the [instructions](https://wiki.dnanexus.com/Command-Line-Client/Login-and-Logout#Generating-an-authentication-token) for token generation.
 Launch the ansible-playbook
 ```
 sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml
 ```
 ## Examples
-##### Role Variables
+#### Role Variables
 - `mode`: `{deploy, debug}` In the *debug* mode, monitoring cron job is triggered every minute; in *deploy mode*, monitoring cron job is triggered every hour.
 - `upload_project`: ID of the DNAnexus project that the RUN folders should be uploaded to.
The ID is of the form `project-BpyQyjj0Y7V0Gbg7g52Pqf8q`
 - `dx_token`: API token for the DNAnexus user to be used for data upload. The API token should give minimally UPLOAD access to the `{{ upload project }}`, or CONTRIBUTE access if `downstream_applet` is specified. Instructions for generating an API token can be found at [DNAnexus wiki](https://wiki.dnanexus.com/UI/API-Tokens). This value is overridden by `dx_user_token` in `monitored_users`.
@@ -196,7 +164,7 @@
 - `workflow`: (Optional) ID of a DNAnexus workflow to be triggered after successful upload of the RUN directory. This workflow's I/O contract should accept a DNAnexus record with the name `upload_sentinel_record` in the 1st stage (stage 0) of the workflow as input. Additional input can be specified using the variable `downstream_input`. **Note that if the specified workflow is not located, the upload process will not commence. Mutually exclusive with `applet`. The role will raise an error and fail if both are specified.**
 - `downstream_input`: (Optional) A JSON string, parsable as a Python `dict` of `str`:`str`, where the **key** is the input_name recognized by a DNAnexus applet/workflow and the **value** is the corresponding input. For examples and detailed explanation, see section titled `Downstream analysis`. **Note that the role will raise an error and fail if this string is not JSON-parsable as a dict of the expected format**
 **Note** DNAnexus login is persistent and the login environment is stored on disk in the Ansible user's home directory. It is the responsibility of the user of this playbook to make sure that every Ansible user (`monitored_user`) with a streaming upload job assigned has been logged into DNAnexus by either specifying a `dx_token` or `dx_user_token`.
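As an illustrative aside on the `downstream_input` rule above — a JSON string that must parse, at the top level, as a dict of `str` to `str` — the check can be sketched in Python. This is a hypothetical validator, not code taken from the role's source; the function name is an assumption:

```python
import json

def validate_downstream_input(raw):
    """Hypothetical sketch of the documented rule: the string must be
    JSON-parsable, at the top level, as a dict mapping str to str."""
    try:
        parsed = json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        raise ValueError("downstream_input is not valid JSON: %s" % exc)
    if not isinstance(parsed, dict):
        raise ValueError("downstream_input must be a JSON object at the top level")
    for key, value in parsed.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise ValueError("downstream_input must map str keys to str values")
    return parsed

# A well-formed applet-style value passes:
validate_downstream_input('{"sequencing_center": "CENTER_A"}')
# A malformed value (a missing quote, or a non-str value) raises ValueError,
# mirroring the role's documented error-and-fail behaviour.
```

A workflow-style value such as `'{"0.sequencing_center": "CENTER_A"}'` passes the same check; the `0.` prefix only matters to the workflow's stage mapping, not to the JSON parsing.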
-##### Example Playbook +#### Example Playbook `dx-upload-play.yml` ```YAML --- @@ -223,7 +191,7 @@ sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml **Note**: For security reasons, you should refrain from storing the DNAnexus authentication token in a playbook that is open-access. One might trigger the playbook on the command line with extra-vars to supply the necessary authentication token, or store them in a closed-source yaml variable file. ie. `ansible-playbook dx-upload-play.yml -i inventory --extra-vars "dx_token="` We recommend that the token given is limited in scope to the upload project, and has no higher than **CONTRIBUTE** privileges. -##### Example Script +#### Example Script The following is an example script that writes a flat file to the RUN directory once a RUN directory has been successfully streamed. Recall that the script will be triggered with a single command line parameter, where `$1` is the path to the local RUN directory that has been successfully streamed to DNAnexus. ``` @@ -232,11 +200,11 @@ set -e -x -o pipefail rundir="$1" echo "Completed streaming run directory: $rundir" > "$rundir/COMPLETE.txt" ``` -##### Actions performed by Role +#### Actions performed by Role The dx-streaming-upload role perform, broadly, the following: 1. Installs the DNAnexus tools [dx-toolkit](https://wiki.dnanexus.com/Downloads#DNAnexus-Platform-SDK) and [upload agent](https://wiki.dnanexus.com/Downloads#Upload-Agent) on the remote machine. 2. Set up a CRON job that monitors a given directory for RUN directories periodically, and streams the RUN directory into a DNAnexus project, triggering an app(let)/workflow upon successful upload of the directory and a local script (when specified by user) -##### Downstream analysis +#### Downstream analysis The dx-streaming-upload role can optionally trigger a DNAnexus applet/workflow upon completion of incremental upload. 
The desired DNAnexus applet or workflow can be specified (at a per `monitored_user` basis) using the Ansible variables `applet` or `workflow` respectively (mutually exclusive, see explanantion of variables for general explanations). More information about DNAnexus workflows can be found at the [DNAnexus wiki page](https://wiki.dnanexus.com/API-Specification-v1.0.0/Running-Analyses) ### Authorization @@ -252,7 +220,7 @@ Example of a properly formatted `downstream_input` for an `applet` Example of a properly formatted `downstream_input` for a `workflow` - ```{"0.step0_input": "value1", "1.step2_input": "value2"})``` *Note the numerical index prefix necessary when specifying input for an `workflow`, which disambiguates which step in the workflow an input is targeted to* -##### Files generated +#### Files generated We use a hypothetical example of a local RUN folder named `20160101_M000001_0001_000000000-ABCDE`, that was placed into the `monitored_directory`, after the `dx-streaming-upload` role has been set up. **Local Files Generated** ``` @@ -285,7 +253,7 @@ project The `reads` folder (and subfolders) will only be created if `applet` is specified. The `analyses` folder (and subfolder) will only be created if `workflow` is specified. `RunInfo.xml` and `SampleSheet.csv` will only be upladed if they can be located within the root of the local RUN directory. -##### Logging, Notification and Error Handling +#### Logging, Notification and Error Handling **Uploading** A log of the CRON command (executed with `bash -e`) is written to the user's home folder `~/dx-stream_cron.log` and can be used to check the top level command triggered. The verbose log of the upload process (generated by the top-level `monitor_runs.py`) is written to the user's home folder `~/monitor.log`. 
@@ -293,7 +261,27 @@ These logs can be used to diagnose failures of upload from the local machine to **Downstream applet** The downstream applet will be run in the project that the RUN directory is uploaded to (as specified in role variable `upload_project`). Users can log in to their DNAnexus account (corresponding to the `dx_token` or `dx_user_token`) and navigate to the upload project to monitor the progress of the applet triggered. Typically, on failure of a DNAnexus job, the user will receive a notification email, which will direct the user to check the log of the failed job for further diagnosis and debugging. ## Troubleshooting -##### License +#### Quick upload test +You can run a quick test to see if you have the right configuration set up. Please install our Upload Agent from [here](https://wiki.dnanexus.com/Downloads#Upload-Agent) and run the command - +``` +./ua --test +``` +For more information, please refer to our documentation [here](https://wiki.dnanexus.com/Upload-Agent#Running-a-simple-diagnostic-test). 
+#### Check if cron job has initialized
+You can check the status of the cron job by trying the following command -
+```
+crontab -l
+```
+This should provide an output to -
+```
+#Ansible: DNAnexus monitor runs (debug)
+* * * * * flock -w 5 /var/lock/dnanexus_uploader.lock bash -ex -c 'source /opt/dx-toolkit/environment; PATH=/opt/dnanexus-upload-agent:$PATH; python /opt/dnanexus/scripts/monitor_runs.py -c ~/dnanexus/config/monitor_runs.config -p project-XXXXX -d /PROD/NGS_DATA/MY_ILLUMINA_MACHINE/DATE/NVSQ-RUN_ID -v > ~/monitor.log 2>&1' > ~/dx-stream_cron.log 2>&1
+```
+#### Check status of the upload
+The upload process is logged using these files -
+`~/dx-stream_cron.log` is the first log file to monitor to see if the appropriate scripts are being launched
+`~/monitor.log` is the log that contains the additional information about the upload process
+## License
 Apache
-##### Author Information
+## Author Information
 DNAnexus (email: support@dnanexus.com)
\ No newline at end of file

From fbcd4462e3b9d0812b46c7d2a441be1d73d4b111 Mon Sep 17 00:00:00 2001
From: nainathangaraj
Date: Tue, 25 Sep 2018 20:01:37 -0700
Subject: [PATCH 08/10] better formatting

---
 README.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index c8fd453..4911190 100644
--- a/README.md
+++ b/README.md
@@ -69,7 +69,7 @@ cd ansible/
 make
 sudo make install
-Download or move some test sequencing data in /opt/seq folder
+Download or move some test sequencing data in /opt/seq folder  
 Clone streaming repo
 ```
 cd ~/dx
 git clone https://github.com/dnanexus-rnd/dx-streaming-upload.git
 ```
@@ -94,8 +94,8 @@ Create dx-upload-play.yml file inside the dx-streaming-folder.
   roles:
     - dx-streaming-upload
 ```
-Here are the instructions for token generation.
-Launch the ansible-playbook
+Here are the instructions for token generation.  
+Launch the ansible-playbook  
 ```
 sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml
 ```
@@ -133,14 +133,14 @@ cd ansible/
 make
 sudo make install
 ```
-Download or move some test sequencing data in /opt/seq folder
+Download or move some test sequencing data in /opt/seq folder  
 Clone streaming repository
 ```
 cd ~/dx
 git clone https://github.com/dnanexus-rnd/dx-streaming-upload.git
 ```
-Create dx-upload-play.yml file inside the dx-streaming-folder. Example given [here](#dx-upload-play.yml)
-Here are the [instructions](https://wiki.dnanexus.com/Command-Line-Client/Login-and-Logout#Generating-an-authentication-token) for token generation.
+Create dx-upload-play.yml file inside the dx-streaming-folder. Example given [here](#dx-upload-play.yml)  
+Here are the [instructions](https://wiki.dnanexus.com/Command-Line-Client/Login-and-Logout#Generating-an-authentication-token) for token generation.  
 Launch the ansible-playbook
 ```
 sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml
@@ -272,14 +272,14 @@ You can check the status of the cron job by trying the following command -
 ```
 crontab -l
 ```
-This should provide an output to -
+This should provide an output such as -
 ```
 #Ansible: DNAnexus monitor runs (debug)
 * * * * * flock -w 5 /var/lock/dnanexus_uploader.lock bash -ex -c 'source /opt/dx-toolkit/environment; PATH=/opt/dnanexus-upload-agent:$PATH; python /opt/dnanexus/scripts/monitor_runs.py -c ~/dnanexus/config/monitor_runs.config -p project-XXXXX -d /PROD/NGS_DATA/MY_ILLUMINA_MACHINE/DATE/NVSQ-RUN_ID -v > ~/monitor.log 2>&1' > ~/dx-stream_cron.log 2>&1
 ```
 #### Check status of the upload
-The upload process is logged using these files -
-`~/dx-stream_cron.log` is the first log file to monitor to see if the appropriate scripts are being launched
+The upload process is logged using these files -  
+`~/dx-stream_cron.log` is the first log file to monitor to see if the appropriate scripts are being launched  
 `~/monitor.log` contains additional information about the upload process
 ## License
 Apache

From cef829620417063433890212c61251b454c391a3 Mon Sep 17 00:00:00 2001
From: nainathangaraj
Date: Tue, 25 Sep 2018 20:03:41 -0700
Subject: [PATCH 09/10] better formatting part 2

---
 README.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 4911190..ca11980 100644
--- a/README.md
+++ b/README.md
@@ -69,7 +69,8 @@ cd ansible/
 make
 sudo make install
 ```
-Download or move some test sequencing data in /opt/seq folder  
+Download or move some test sequencing data in `/opt/seq` folder
+
 Clone streaming repo
 ```
 cd ~/dx
@@ -133,7 +134,8 @@ cd ansible/
 make
 sudo make install
 ```
-Download or move some test sequencing data in /opt/seq folder  
+Download or move some test sequencing data in `/opt/seq` folder
+
 Clone streaming repository
 ```
 cd ~/dx

From de9ae29515a0ece3853ebb5f50d1be76e3af48e5 Mon Sep 17 00:00:00 2001
From: nainathangaraj
Date: Tue, 2 Oct 2018 15:27:30 -0700
Subject: [PATCH 10/10] incorporating Alpha's feedback on linux distributions
 supported and phrasing around remote user

---
 README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index ca11980..3b6d82c 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ Python 2.7 is needed. This program is not compatible with Python 3.X.
 
 Minimal Ansible version: 2.0.
 
-This program is intended for Ubuntu 14.04 and 16.04, and has been tested on Red Hat 7.4/7.5 and OLE (Oracle Linux Enterprise) 7. It has not been tested on any other versions.
+This program is intended for Ubuntu 14.04 and 16.04, and has been tested on Red Hat 7.4/7.5 and OLE (Oracle Linux Enterprise) 7. It has not been tested on any other versions, but it should work with most Linux releases.
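Before running the role against a host, the version requirements above can be checked with a short script. This is a sketch based on this README's stated prerequisites; the tool list is illustrative, and `dx` will only be present once dx-toolkit is installed:

```shell
# Minimal sketch: report the versions of the prerequisites named above.
check_prereqs() {
  for tool in python ansible dx; do
    if command -v "$tool" >/dev/null 2>&1; then
      # --version may print to stderr on old Pythons, hence 2>&1
      echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
    else
      echo "$tool: not found"
    fi
  done
}
check_prereqs
```

Any `not found` line flags a prerequisite to install before the playbook is run.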
 
 ## Requirements
 
@@ -31,7 +31,7 @@ Users of this module needs a DNAnexus account and its accompanying authenticatio
 
 More information and tutorials about the DNAnexus platform can be found at the [DNAnexus wiki page](https://wiki.dnanexus.com/Home).
 
-The remote-user that the role is run against must possess READ access to monitored_folder and WRITE access to disk for logging and temporary storage of tar files. These are typically stored under the remote-user's home directory, and is specified in the file monitor_run_config.template or as given explicitly by the variables local_tar_directory and local_log_directory.
+The local user utilizing this package should possess READ access to monitored_folder and WRITE access to disk for logging and temporary storage of tar files. These are typically stored under the local user's home directory, and are specified in the file monitor_run_config.template or as given explicitly by the variables local_tar_directory and local_log_directory.
 
 The machine that this role is deployed to should have sufficient free memory depending on the throughput of the sequencing instrument. For NovaSeq and HiSeq instruments we recommend a machine with at least 8 cores, 32 GB of RAM, and 500 GB - 1 TB of storage.
 
@@ -135,7 +135,7 @@ make
 sudo make install
 ```
 Download or move some test sequencing data in `/opt/seq` folder
-
+
 Clone streaming repository
 ```
 cd ~/dx
@@ -152,8 +152,8 @@ sudo ansible-playbook dx-streaming-upload/dx-upload-play.yml
 - `mode`: `{deploy, debug}` In *debug* mode, the monitoring cron job is triggered every minute; in *deploy* mode, the monitoring cron job is triggered every hour.
 - `upload_project`: ID of the DNAnexus project that the RUN folders should be uploaded to. The ID is of the form `project-BpyQyjj0Y7V0Gbg7g52Pqf8q`
 - `dx_token`: API token for the DNAnexus user to be used for data upload. The API token should give minimally UPLOAD access to the `{{ upload_project }}`, or CONTRIBUTE access if `downstream_applet` is specified. Instructions for generating an API token can be found at the [DNAnexus wiki](https://wiki.dnanexus.com/UI/API-Tokens). This value is overridden by `dx_user_token` in `monitored_users`.
-- `monitored_users`: This is a list of objects, each representing a remote user, with its set of incremental upload parameters. For each `monitored_user`, the following values are accepted -
-  - `username`: (Required) username of the remote user
+- `monitored_users`: This is a list of objects, each representing a local user, with its set of incremental upload parameters. For each `monitored_user`, the following values are accepted -
+  - `username`: (Required) username of the local user
 - `monitored_directories`: (Required) Path to the local directory that should be monitored for RUN folders. Multiple directories can be listed. Suppose that the folder `20160101_M000001_0001_000000000-ABCDE` is the RUN directory; then the folder structure assumed is `{{monitored_dir}}/20160101_M000001_0001_000000000-ABCDE`
 - `local_tar_directory`: (Optional) Path to a local folder where tarballs of the RUN directory are temporarily stored. The user specified in `username` needs to have **WRITE** access to this folder. There should be sufficient disk space to accommodate a RUN directory in this location. This overwrites the default found in `templates/monitor_run_config.template`.
 - `local_log_directory`: (Optional) Path to a local folder where logs of the streaming upload are stored, persistently. The user specified in `username` needs to have **WRITE** access to this folder. Users should not manually manipulate files found in this folder, as the streaming upload code assumes that the files in this folder are not manually manipulated. This overwrites the default found in `templates/monitor_run_config.template`.
@@ -204,7 +204,7 @@ echo "Completed streaming run directory: $rundir" > "$rundir/COMPLETE.txt"
 ```
 #### Actions performed by Role
 The dx-streaming-upload role performs, broadly, the following:
-1. Installs the DNAnexus tools [dx-toolkit](https://wiki.dnanexus.com/Downloads#DNAnexus-Platform-SDK) and [upload agent](https://wiki.dnanexus.com/Downloads#Upload-Agent) on the remote machine.
+1. Installs the DNAnexus tools [dx-toolkit](https://wiki.dnanexus.com/Downloads#DNAnexus-Platform-SDK) and [upload agent](https://wiki.dnanexus.com/Downloads#Upload-Agent) on the local machine.
 2. Sets up a CRON job that monitors a given directory for RUN directories periodically, and streams each RUN directory into a DNAnexus project, triggering an app(let)/workflow upon successful upload of the directory and a local script (when specified by the user)
 
 #### Downstream analysis
 The dx-streaming-upload role can optionally trigger a DNAnexus applet/workflow upon completion of incremental upload. The desired DNAnexus applet or workflow can be specified (on a per-`monitored_user` basis) using the Ansible variables `applet` or `workflow` respectively (mutually exclusive; see the explanation of the variables above).
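As a sketch of how such a trigger might be wired into the playbook: the project ID below is the example ID from this README, while the username, monitored directory, token lookup, and applet ID are placeholders rather than real values:

```yaml
# Hypothetical playbook fragment -- IDs, paths, and the env-var name are placeholders.
- hosts: localhost
  vars:
    mode: deploy
    upload_project: project-BpyQyjj0Y7V0Gbg7g52Pqf8q   # example ID from this README
    dx_token: "{{ lookup('env', 'DX_API_TOKEN') }}"    # assumes the token is exported
    monitored_users:
      - username: seqbot                               # placeholder local user
        monitored_directories:
          - /opt/seq                                   # directory watched for RUN folders
        applet: applet-xxxx                            # placeholder; runs after each completed upload
  roles:
    - dx-streaming-upload
```

Because `applet` and `workflow` are mutually exclusive, specify only one of them per monitored user.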