Extract gene-count matrices from plated-based SMART-Seq2 data
Run SMART-Seq2 Workflow
Follow the steps below to extract gene-count matrices from SMART-Seq2 data on Terra. This WDL aligns reads using STAR, HISAT2, or Bowtie 2 and estimates expression levels using RSEM.
Copy your sequencing output to your workspace bucket using gsutil in your unix terminal.
You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.
Use
gsutil cp [OPTION]... src_url dst_url
to copy data to your workspace bucket. For example, the following command copies the directory at /foo/bar/nextseq/Data/VK18WBC6Z4 to a Google bucket:gsutil -m cp -r /foo/bar/nextseq/Data/VK18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4
-m
means copy in parallel,-r
means copy the directory recursively.Create a sample sheet.
Please note that the columns in the TSV can be in any order, but that the column names must match the recognized headings.
The sample sheet provides metadata for each cell:
Column
Description
entity:sample
Cell name.
plate
Plate name. Cells with the same plate name are from the same plate.
read1
Location of the FASTQ file for read1 in the cloud (gsurl).
read2
(Optional). Location of the FASTQ file for read2 in the cloud (gsurl). This field can be skipped for single-end reads.
Example:
entity:sample plate read1 read2 cell-1 plate-1 gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-1_L001_R1_001.fastq.gz gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-1_L001_R2_001.fastq.gz cell-2 plate-1 gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-2_L001_R1_001.fastq.gz gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-2_L001_R2_001.fastq.gz cell-3 plate-2 gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-3_L001_R1_001.fastq.gz cell-4 plate-2 gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-4_L001_R1_001.fastq.gz
Upload your sample sheet to the workspace bucket.
Example:
gsutil cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/
Import smartseq2 workflow to your workspace.
Import by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Smart-Seq2 to import.
Moreover, in the workflow page, click
Export to Workspace...
button, and select the workspace to which you want to export smartseq2 workflow in the drop-down menu.In your workspace, open
smartseq2
inWORKFLOWS
tab. SelectRun workflow with inputs defined by file paths
as belowand click
SAVE
button.
Inputs:
Please see the description of inputs below. Note that required inputs are shown in bold.
Name |
Description |
Example |
Default |
---|---|---|---|
input_tsv_file |
Sample Sheet (contains entity:sample, plate, read1, read2) |
“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.tsv” |
|
output_directory |
Output directory |
“gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2_output” |
|
reference |
Reference transcriptome to align reads to. Acceptable values:
|
“GRCh38_ens93filt”, or
“gs://fc-e0000000-0000-0000-0000-000000000000/rsem_ref.tar.gz”
|
|
aligner |
Which aligner to use for read alignment. Options are “hisat2-hca”, “star” and “bowtie” |
“star” |
“hisat2-hca” |
output_genome_bam |
Whether to output bam file with alignments mapped to genomic coordinates and annotated with their posterior probabilities. |
false |
false |
smartseq2_version |
SMART-Seq2 version to use. Versions available: 1.3.0. |
“1.3.0” |
“1.3.0” |
docker_registry |
Docker registry to use. Options:
|
“quay.io/cumulus” |
“quay.io/cumulus” |
zones |
Google cloud zones |
“us-east1-d us-west1-a us-west1-b” |
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
num_cpu |
Number of cpus to request for one node |
4 |
4 |
memory |
Memory size string |
“3.60G” |
If aligner is bowtie2 or hisat2-hca, “3.6G”; otherwise “32G” |
disk_space_multiplier |
Factor to multiply size of R1 and R2 by for RSEM |
Float |
11 |
generate_count_matrix_disk_space |
Disk space for count matrix generation task in GB |
Integer |
10 |
backend |
Cloud infrastructure backend to use. Available options:
|
“gcp” |
“gcp” |
preemptible |
Number of preemptible tries. This works only when backend is |
2 |
2 |
awsMaxRetries |
Number of maximum retries when running on AWS. This works only when backend is |
5 |
5 |
Outputs:
Name |
Type |
Description |
---|---|---|
output_count_matrix |
String |
Point to a Google bucket URL for count matrix in matrix market format. |
rsem_trans_bam |
Array[String?] |
An array of Google bucket URLs for RSEM transcriptomic BAM files |
rsem_genome_bam |
Array[String?] |
An array of Google bucket URLs for RSEM genomic BAM files if |
rsem_gene |
Array[File?] |
An array of RSEM gene expression estimation files. |
rsem_isoform |
Array[File?] |
An array of RSEM isoform expression estimation files. |
rsem_time |
Array[File?] |
An array of RSEM execution time log files. |
aligner_log |
Array[File?] |
An array of Aligner log files. |
rsem_cnt |
Array[File?] |
An array of RSEM count files. |
rsem_model |
Array[File?] |
An array of RSEM model files. |
rsem_theta |
Array[File?] |
An array of RSEM generated theta files. |
This WDL generates one gene-count matrix in matrix market format:
output_count_matrix is a folder containing three files: matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz.
matrix.mtx.gz is a gzipped matrix in matrix market format.
barcodes.tsv.gz is a gzipped TSV file, containing 5 columns. ‘barcodekey’ is cell name. ‘plate’ is the plate name, which can be used for batch correction. ‘total_reads’ is the total number of reads. ‘alignment_rate’ is the alignment rate obtained from the aligner. ‘unique_rate’ is the percentage of reads aligned uniquely to a gene. Cells sequenced with single-end reads appear first in ‘barcodekey’.
features.tsv.gz is a gzipped TSV file, containing 2 columns. ‘featurekey’ is gene symbol. ‘featureid’ is Ensembl ID.
The gene-count matrix can be fed directly into cumulus for downstream analysis.
TPM-normalized counts are calculated as follows:
Estimate the gene expression levels in TPM using RSEM.
Suppose
c
reads are achieved for one cell, then calculate TPM-normalized count for genei
asTPM_i / 1e6 * c
.
TPM-normalized counts reflect both the relative expression levels and the cell sequencing depth.
Custom Genome
We also provide a way of generating user-customized Genome references for SMART-Seq2 workflow.
Import smartseq2_create_reference workflow to your workspace.
Import by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Smart-Seq2_create_reference to import.
Moreover, in the workflow page, click
Export to Workflow...
button, and select the workspace to which you want to exportsmartseq2_create_reference
in the drop-down menu.In your workspace, open
smartseq2_create_reference
inWORKFLOWS
tab. SelectRun workflow with inputs defined by file paths
as belowand click
SAVE
button.
Inputs:
Please see the description of inputs below. Note that required inputs are shown in bold.
Name |
Description |
Type or Example |
Default |
---|---|---|---|
fasta |
Genome fasta file |
File.
For example, “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.primary_assembly.fa”
|
|
gtf |
GTF gene annotation file (e.g. Homo_sapiens.GRCh38.83.gtf) |
File.
For example, “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.83.gtf”
|
|
output_directory |
Google bucket url for the output folder |
“gs://fc-e0000000-0000-0000-0000-000000000000/output_refs” |
|
genome |
Output reference genome name. Output reference is a gzipped tarball with name genome_aligner.tar.gz |
“GRCm38_ens97filt” |
|
aligner |
Build indices for which aligner, choices are hisat2-hca, star, or bowtie2. |
“hisat2-hca” |
“hisat2-hca” |
smartseq2_version |
SMART-Seq2 version to use.
Versions available: 1.3.0.
|
“1.3.0” |
“1.3.0” |
docker_registry |
Docker registry to use. Options:
|
“quay.io/cumulus” |
“quay.io/cumulus” |
zones |
Google cloud zones |
“us-central1-c” |
“us-central1-b” |
cpu |
Number of CPUs |
Integer |
If aligner is bowtie2 or hisat2-hca, 8; otherwise 32 |
memory |
Memory size string |
String |
If aligner is bowtie2 or hisat2-hca, “7.2G”; otherwise “120G” |
disk_space |
Disk space in GB |
Integer |
If aligner is bowtie2 or hisat2-hca, 40; otherwise 120 |
backend |
Cloud infrastructure backend to use. Available options:
|
“gcp” |
“gcp” |
preemptible |
Number of preemptible tries. This works only when backend is |
2 |
2 |
awsMaxRetries |
Number of maximum retries when running on AWS. This works only when backend is |
5 |
5 |
Outputs
Name |
Type |
Description |
---|---|---|
output_reference |
File |
The custom Genome reference generated. Its default file name is |
monitoring_log |
File |
CPU and memory profiling log. |