106 changes: 95 additions & 11 deletions 02-importing_with_tables.Rmd → 01-importing_with_tables.Rmd
@@ -1,19 +1,68 @@
# (PART\*) Uploading Your Own Data {-}
# (PART\*) Bringing Your Own Data {-}


```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```

# Temporary Stub
# Uploading from your desktop

Data Tables provide a way to organize data and metadata, including URI links to storage buckets. These tables are a convenient way to organize input for analyses as well as tracking workflow outputs.
In this example, we'll upload some genomic data into AnVIL.

```{r, echo=FALSE, fig.alt="Image shows a schematic of the data storage locations in an AnVIL Workspace. The Data Table is highlighted with a number 'three'."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf982a3c0cd_0_0")
TODO: Add information about what the data is

The starting point for bringing your own data to AnVIL is the Workspace Dashboard. At the bottom right, you'll find the full path to the Google Bucket information corresponding to your Workspace. You can click the clipboard icon on the right to copy the name of your Workspace Bucket. You will be able to see any uploaded files by clicking the "Open in browser" link.

```{r, echo=FALSE, fig.alt="Image shows a screenshot of the Workspace Dashboard. Google Bucket information, including the Google Bucket name, location, and 'Open in browser' link, at the bottom right of the screen is highlighted."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf5172664d7_0_142")
```

::: {.dictionary}
**Buckets** are the containers used to store files and objects on Google Cloud. Everything you store on Google Cloud _must_ be in a bucket. Each bucket has its own unique name and location (URI). When we move data files into AnVIL workspaces, we use the URI to tell AnVIL where the data should be stored. (We can also use a URI to tell AnVIL where to find the data we want to upload.)

You can read more about Google Cloud buckets [here](https://docs.cloud.google.com/storage/docs/buckets).
:::
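
To make the idea of a bucket URI concrete, here is a minimal sketch in a POSIX shell. The helper name `bucket_uri`, the bucket name `my-workspace-bucket`, and the object path are hypothetical placeholders; only the `gs://` prefix is standard for Google Cloud Storage.

```shell
# Build a gs:// URI from a bucket name and an optional object path.
# All names used below are hypothetical placeholders.
bucket_uri() {
  if [ -n "${2:-}" ]; then
    printf 'gs://%s/%s\n' "$1" "$2"
  else
    printf 'gs://%s\n' "$1"
  fi
}

bucket_uri "my-workspace-bucket"                      # the bucket itself
bucket_uri "my-workspace-bucket" "data/sample.fastq"  # one object inside it
```

The first call prints the URI of the bucket, and the second prints the URI of a single file stored in it.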

You can also see any uploaded files by clicking the "Files" directory at the bottom left in the Data Tab.

```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Workspace Data tab. The Files directory and link on the bottom left is highlighted.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf55fadc51c_0_3")
```


## Clone Workspace

## Identify bucket path

## Upload file

## Check file status

## Bring file into a workspace

## Summary


# Uploading from the cloud

In this example, we'll upload some genomic data into AnVIL that is currently stored in the cloud (specifically, in a Google bucket).

::: {.dictionary}
**Buckets** are the containers used to store files and objects on Google Cloud. Everything you store on Google Cloud _must_ be in a bucket. Each bucket has its own unique name and location (URI). When we move data files into AnVIL workspaces, we use the URI to tell AnVIL where the data should be stored. (We can also use a URI to tell AnVIL where to find the data we want to upload.)

You can read more about Google Cloud buckets [here](https://docs.cloud.google.com/storage/docs/buckets).
:::

We're going to upload some fastq files for a SARS-CoV-2 sample. The bucket we're accessing contains five files: two compressed fastq files, a fasta file for a SARS-CoV-2 reference genome, and two uncompressed fastq files. The bucket ID is `fc-80d0e1cd-61e9-472f-b1bd-c6a8223bd1cd` (URI: `gs://fc-80d0e1cd-61e9-472f-b1bd-c6a8223bd1cd`). For this activity, you will retrieve the two uncompressed fastq files and upload them into your workspace.
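
For readers who prefer the command line, the retrieval described above might look roughly like the following sketch. It assumes `gsutil` is installed and authenticated. The destination bucket `fc-your-workspace-bucket` and the two file names are hypothetical placeholders, since the actual file names are not listed here. By default the script only prints the commands; set `DRY_RUN=0` to run them.

```shell
# Sketch: copy two fastq files from the shared bucket into a Workspace Bucket.
# SRC_BUCKET is the bucket ID given in the text. DEST_BUCKET and the file
# names passed at the bottom are hypothetical placeholders.
SRC_BUCKET="fc-80d0e1cd-61e9-472f-b1bd-c6a8223bd1cd"
DEST_BUCKET="${DEST_BUCKET:-fc-your-workspace-bucket}"

# Print a command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Copy each named file from the source bucket to the destination bucket.
copy_from_bucket() {
  for name in "$@"; do
    run gsutil cp "gs://${SRC_BUCKET}/${name}" "gs://${DEST_BUCKET}/"
  done
}

# List the source bucket first to find the real file names.
run gsutil ls "gs://${SRC_BUCKET}"
copy_from_bucket "sample_R1.fastq" "sample_R2.fastq"
```

Replace the placeholder file names with the ones shown by `gsutil ls`, and set `DEST_BUCKET` to the bucket ID copied from your Workspace Dashboard.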

```{r, echo=FALSE, fig.alt="Image shows the contents of a Google bucket used in the SARS-CoV-2 on Galaxy activity."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit?slide=id.g3ad3a8a2073_0_0#slide=id.g3ad3a8a2073_0_0")
```

## Step One: Create your workspace



The starting point for bringing your own data to AnVIL is the Workspace Dashboard. At the bottom right, you'll find the full path to the Google Bucket information corresponding to your Workspace. You can click the clipboard icon on the right to copy the name of your Workspace Bucket. You will be able to see any uploaded files by clicking the "Open in browser" link.

```{r, echo=FALSE, fig.alt="Image shows a screenshot of the Workspace Dashboard. Google Bucket information, including the Google Bucket name, location, and 'Open in browser' link, at the bottom right of the screen is highlighted."}
@@ -26,6 +75,46 @@ You can also see any uploaded files by clicking the "Files" directory at the bot
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf55fadc51c_0_3")
```


Data Tables provide a way to organize data and metadata, including URI links to storage buckets. These tables are a convenient way to organize input for analyses as well as tracking workflow outputs.

```{r, echo=FALSE, fig.alt="Image shows a schematic of the data storage locations in an AnVIL Workspace. The Data Table is highlighted with a number 'three'."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf982a3c0cd_0_0")
```


## Access Data Uploader

## Create a data collection

## Upload data collection

## Upload data table with metadata

## Summary


# Uploading from a remote cluster (HPC)

## Install `gsutil` on your local server

## Copy files

## Check file status

## Bring file into a workspace

## Summary


# Additional Resources

You can read documentation about bringing your own data to AnVIL on the [Portal](https://anvilproject.org/learn/find-data/bringing-your-own-data).

More details can be found in the [Terra documentation](https://support.terra.bio/hc/en-us/sections/360004147951).

## Information from Getting Started guide

## Browser: Upload Single Files

Click the "Files" directory at the bottom left of the Data Tab. Then click the "+" button in the bottom right corner of the screen. This will prompt a file browser on your local machine.
@@ -48,7 +137,7 @@ Here, you can upload files and manage your data and folders. You can also upload
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf57004a098_0_9")
```

# `gsutil`: Local to Cloud
## `gsutil`: Local to Cloud

`gsutil` is a Python application that lets you access Cloud Storage from the command line in a terminal. The terminal you use can be run on your local machine (local instance) or built into the Workspace Cloud Environment.
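
Before copying anything, it can help to confirm that `gsutil` is available in the terminal you are using. This is a minimal sketch, assuming a POSIX shell. `check_tool` is a hypothetical helper name and is not part of `gsutil`.

```shell
# Report whether a command-line tool is on the PATH.
# check_tool is a hypothetical helper, not a gsutil feature.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found"
  fi
}

check_tool gsutil
```

If the tool is found, running `gsutil version` prints the installed version.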

@@ -100,8 +189,3 @@ gsutil cp users/name/data/test.bam gs://ab5-27x
Remember that you can easily copy the Workspace Bucket ID using the clipboard button on the [Workspace Dashboard]({#bring-data-overview}). Please see the [`gsutil cp` documentation](https://cloud.google.com/storage/docs/gsutil/commands/cp) for more details, such as how to do parallel multi-threaded/multi-processing copying or copying an entire directory tree. The `gsutil cp` command can also be used to copy files from one Workspace Bucket to another (cloud-to-cloud copying).
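
The directory and cloud-to-cloud cases mentioned above can be sketched as follows. This assumes `gsutil` is installed and authenticated. `ab5-27x` and `users/name/data` are the example bucket and path used earlier, while `another-workspace-bucket` is a hypothetical second Workspace Bucket. By default the script only prints the commands; set `DRY_RUN=0` to run them.

```shell
# Example bucket ID from the text; OTHER_BUCKET is a hypothetical placeholder.
BUCKET="ab5-27x"
OTHER_BUCKET="${OTHER_BUCKET:-another-workspace-bucket}"

# Print a command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Copy an entire directory tree: -m parallelizes, -r recurses into folders.
upload_dir() {
  run gsutil -m cp -r "$1" "gs://${BUCKET}/"
}

# Copy one object from one bucket to another (cloud-to-cloud).
bucket_to_bucket() {
  run gsutil cp "gs://${BUCKET}/$1" "gs://${OTHER_BUCKET}/"
}

upload_dir "users/name/data"
bucket_to_bucket "test.bam"
```

The `-m` option goes before `cp` because it is a top-level `gsutil` option, while `-r` belongs to the `cp` command itself.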


# Additional Resources

You can read documentation about bringing your own data to AnVIL on the [Portal](https://anvilproject.org/learn/find-data/bringing-your-own-data)

More details can be found in the [Terra documentation](https://support.terra.bio/hc/en-us/sections/360004147951)
18 changes: 0 additions & 18 deletions 01-intro.Rmd

This file was deleted.

2 changes: 1 addition & 1 deletion 03-data_explorer.Rmd → 02-data_explorer.Rmd
@@ -33,7 +33,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fC

# AnVIL Data Explorer

The [AnVIL Data Explorer](https://explore.anvilproject.org/datasets) enables faceted searches of open and managed access datasets hosted in AnVIL, making it easier for researchers to find and custom-build cohorts.
The AnVIL Data Explorer enables faceted searches of open and managed access datasets hosted in AnVIL, making it easier for researchers to find and custom-build cohorts.

```{r, echo=FALSE, fig.alt='Image shows a screenshot of the AnVIL Data Explorer website landing page.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.g30d935bde8e_0_0")
10 changes: 7 additions & 3 deletions 04-importing_with_SRA.Rmd → 03-importing_with_SRA.Rmd
@@ -1,17 +1,21 @@

# (PART\*) SRA ON AnVIL {-}
# (PART\*) Importing Data from SRA {-}


```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```

# Quick Start {#quick-start}
# Quick Start: Importing a single file {#quick-start-sra}

In this module, we'll bring some metagenomic data into AnVIL.
In this example, we'll bring some metagenomic data into AnVIL.

This data comes from [this BioProject](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA904247), which collected soil samples to study bacterial communities in tallgrass prairie. Bacteria play an important role in this ecosystem, but can be changed by disturbance, management, and the presence of herbivores.

We will bring this data into AnVIL from the **Sequence Read Archive**, or SRA. You can check out the [SRA website](https://www.ncbi.nlm.nih.gov/sra) to learn more:

> Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.

The SRA Data corresponding to this project is located [here](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP409181&o=acc_s%3Aa).

```{r, fig.align='center', echo = FALSE, fig.alt= "Microbiome diversity has many beneficial properties, ranging from soil health to plant health.", out.width = '100%'}
File renamed without changes.
17 changes: 17 additions & 0 deletions Feedback.Rmd
@@ -0,0 +1,17 @@
# (PART\*) Appendix {-}


```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```

# Give us Feedback

Thank you for your interest in this book! There are a few ways you can suggest improvements:

<br>
<!-- The capital letter above alters the formatting for the numbered points below -->

1. Fill out this [Google form](https://docs.google.com/forms/d/e/1FAIpQLScrDVb_utm55pmb_SHx-RgELTEbCCWdLea0T3IzS0Oj00GE4w/viewform?usp=pp_url&entry.1565230805=AnVIL+Book+Getting+Started){target="_blank"}.
1. If you have a GitHub account, you can [raise an issue](https://github.com/fhdsl/Data_on_AnVIL/issues){target="_blank"} in our repository.
1. Submit a pull request! Click the pencil icon on any page (top left) to view the source `.Rmd` for the page and suggest changes.
10 changes: 5 additions & 5 deletions _bookdown.yml
@@ -2,11 +2,11 @@ book_filename: "Data on AnVIL"
chapter_name: "Chapter"
repo: https://github.com/jhudsl/AnVIL_Template/
rmd_files: ["index.Rmd",
"01-intro.Rmd",
"02-importing_with_tables.Rmd",
"03-data_explorer.Rmd",
"04-importing_with_SRA.Rmd",
"05-controlled_access_data.Rmd",
"01-importing_with_tables.Rmd",
"02-data_explorer.Rmd",
"03-importing_with_SRA.Rmd",
"04-controlled_access_data.Rmd",
"Feedback.Rmd",
"About.Rmd",
"References.Rmd"]
new_session: yes
27 changes: 14 additions & 13 deletions index.Rmd
@@ -1,40 +1,41 @@
---
title: "AnVIL Book Name"
title: "Data on AnVIL"
date: "`r format(Sys.time(), '%B %d, %Y')`"
site: bookdown::bookdown_site
documentclass: book
bibliography: book.bib
biblio-style: apalike
link-citations: yes
description: Description about Course/Book.
description: This book contains vignettes on how to upload, find, and use data within an AnVIL workspace.
favicon: assets/AnVIL_style/anvil_favicon.ico
---


# About this Book {-}

This book is part of a series of books for the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) of the National Human Genome Research Institute (NHGRI). Learn more about AnVIL by visiting https://anvilproject.org or reading the [article in Cell Genomics](https://www.sciencedirect.com/science/article/pii/S2666979X21001063).
The chapters within this book contain hands-on activities to demonstrate how users can access and use data within an AnVIL workspace. Topics include bringing your own data from an HPC, finding data already hosted on AnVIL with tools like the Data Explorer, importing data from online data repositories like SRA, and getting access to protected data stored in places like dbGaP.

It can be very exciting to learn how much data is at your fingertips! Once you have settled on some data to use, you'll want to bring it into AnVIL if it's not already there.

Navigate to the menu on the left to get started!

## Skills Level {-}

::: {.notice}
_Genetics_
<!-- **Novice**: no genetics knowledge needed -->

**Novice**: no genetics knowledge needed

_Programming skills_
<!-- **Novice**: no programming experience needed -->

**Novice**: no programming experience needed
:::

## AnVIL Collection {-}

Please check out our full collection of AnVIL and related resources: https://hutchdatascience.org/AnVIL_Collection/

# Learning Objectives {-}
This module is part of a series of books for the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) of the National Human Genome Research Institute (NHGRI).

<!-- Learning objectives for this activity come from the [Genetics Core Competencies](https://genetics-gsa.org/education/genetics-learning-framework/): -->
Please check out our full collection of AnVIL and related resources: https://hutchdatascience.org/AnVIL_Collection/

<!-- - Objective 1 -->
<!-- - Objective 2 -->
<!-- - Objective 3 -->
Learn more about AnVIL by visiting https://anvilproject.org or reading the [article in Cell Genomics](https://www.sciencedirect.com/science/article/pii/S2666979X21001063).

<!-- Please also see the Bioinformatics core competencies for undergraduate life sciences education from NIBLSE: https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0196878.t002 -->
3 changes: 3 additions & 0 deletions resources/dictionary.txt
@@ -22,6 +22,7 @@ Glimma
glimmaMDS
Gmail
GTEx
HPC
impactful
Inclusivity
ingressing
@@ -54,7 +55,9 @@ timeframe
TSA
TSV
underserved
Uploader
URI
workspaces
Workspaces
Workspace's
www