Extract gene-count matrices from plate-based SMART-Seq2 data

Run SMART-Seq2 Workflow

Follow the steps below to extract gene-count matrices from SMART-Seq2 data on Terra. This WDL aligns reads using Bowtie 2 and estimates expression levels using RSEM.

  1. Copy your sequencing output to your workspace bucket using gsutil in your Unix terminal.

    You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.

    _images/google_bucket_link.png

    Note: Broad users need to be on an UGER node (not a login node) to use the -m flag.

    Request an UGER node:

    reuse UGER
    qrsh -q interactive -l h_vmem=4g -pe smp 8 -binding linear:8 -P regevlab
    

    The above command requests an interactive node with 4G memory per thread and 8 threads. Feel free to change the memory, thread, and project parameters.

    Once you’re connected to an UGER node, you can make gsutil available by running:

    reuse Google-Cloud-SDK
    

    Use gsutil cp [OPTION]... src_url dst_url to copy data to your workspace bucket. For example, the following command copies the directory at /foo/bar/nextseq/Data/VK18WBC6Z4 to a Google bucket:

    gsutil -m cp -r /foo/bar/nextseq/Data/VK18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4
    

    The -m flag copies files in parallel; the -r flag copies the directory recursively.

  2. Create a sample sheet.

    Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.

    The sample sheet provides metadata for each cell:

    Column Description
    Cell Cell name.
    Plate Plate name. Cells with the same plate name are from the same plate.
    Read1 Location of the FASTQ file for read1 in the cloud (gsurl).
    Read2 Location of the FASTQ file for read2 in the cloud (gsurl).

    Example:

    Cell,Plate,Read1,Read2
    cell-1,plate-1,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-1_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-1_L001_R2_001.fastq.gz
    cell-2,plate-1,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-2_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-2_L001_R2_001.fastq.gz
    cell-3,plate-2,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-3_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-3_L001_R2_001.fastq.gz
    cell-4,plate-2,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-4_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-4_L001_R2_001.fastq.gz
    
  3. Upload your sample sheet to the workspace bucket.

    Example:

    gsutil cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/
    
  4. Import the smartseq2 workflow into your workspace.

    See the Terra documentation for adding a workflow. The smartseq2 workflow is in the Broad Methods Repository under the name “cumulus/smartseq2”.

    On the workflow page, click the Export to Workspace... button and select, in the drop-down menu, the workspace to which you want to export the smartseq2 workflow.

  5. In your workspace, open smartseq2 in the WORKFLOWS tab. Select Process single workflow from files, as shown in the screenshot below, and click the SAVE button.

    _images/single_workflow.png
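The sample sheet described in step 2 can also be generated programmatically. The following is a minimal Python sketch, not part of the workflow itself. It assumes FASTQ files are named `<cell>_L001_R1_001.fastq.gz` and `<cell>_L001_R2_001.fastq.gz`, as in the example above; the functions `build_sample_sheet` and `write_sample_sheet` and the cell-to-plate mapping are hypothetical names introduced for illustration.

```python
import csv
import io
import re

# Assumed naming pattern: <cell>_L001_R1_001.fastq.gz / <cell>_L001_R2_001.fastq.gz
FASTQ_PATTERN = re.compile(r"(?P<cell>.+)_L\d{3}_R(?P<read>[12])_\d{3}\.fastq\.gz$")


def build_sample_sheet(fastq_urls, plate_of_cell):
    """Return one row (dict) per cell, pairing its read1 and read2 URLs."""
    reads = {}  # cell name -> {"1": read1 URL, "2": read2 URL}
    for url in fastq_urls:
        file_name = url.rsplit("/", 1)[-1]
        match = FASTQ_PATTERN.search(file_name)
        if match is None:
            raise ValueError(f"Unrecognized FASTQ name: {url}")
        reads.setdefault(match["cell"], {})[match["read"]] = url

    rows = []
    for cell in sorted(reads):
        pair = reads[cell]
        if set(pair) != {"1", "2"}:
            raise ValueError(f"Cell {cell} is missing read1 or read2")
        rows.append({
            "Cell": cell,
            "Plate": plate_of_cell[cell],
            "Read1": pair["1"],
            "Read2": pair["2"],
        })
    return rows


def write_sample_sheet(rows, handle):
    """Write rows as CSV using the column headings the workflow recognizes."""
    writer = csv.DictWriter(
        handle, fieldnames=["Cell", "Plate", "Read1", "Read2"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Example with placeholder URLs and a hypothetical cell-to-plate mapping.
bucket = "gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2"
urls = [f"{bucket}/cell-{i}_L001_R{r}_001.fastq.gz" for i in (1, 2) for r in (1, 2)]
rows = build_sample_sheet(urls, {"cell-1": "plate-1", "cell-2": "plate-1"})

buffer = io.StringIO()
write_sample_sheet(rows, buffer)
print(buffer.getvalue())
```

To produce a file for upload, pass an open file handle (for example `open("sample_sheet.csv", "w", newline="")`) instead of the in-memory buffer.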

Inputs:

Please see the description of inputs below. Note that required inputs are shown in bold.

Name Description Example Default
input_csv_file Sample Sheet (contains Cell, Plate, Read1, Read2) “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”  
output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2_output”  
reference Reference transcriptome to align reads to. Acceptable values: a pre-created genome reference (“GRCh38” for human; “GRCm38” or “mm10” for mouse), or the Google bucket URL of a custom genome reference created with the smartseq2_create_reference workflow. “GRCh38”, or “gs://fc-e0000000-0000-0000-0000-000000000000/rsem_ref.tar.gz”  
smartseq2_version SMART-Seq2 version to use. Versions available: 1.0.0. “1.0.0” “1.0.0”
docker_registry Docker registry to use. Options: “cumulusprod” for Docker Hub images; “quay.io/cumulus” for backup images on the Red Hat registry. “cumulusprod” “cumulusprod”
zones Google cloud zones “us-east1-d us-west1-a us-west1-b” “us-east1-d us-west1-a us-west1-b”
num_cpu Number of CPUs to request for one node 4 4
memory Memory size string “3.60G” “3.60G”
disk_space Disk space in GB 10 10
preemptible Number of preemptible tries 2 2
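If you prefer to keep the inputs in a JSON file, they could be assembled as in the sketch below. This is an illustration only: it assumes the usual WDL convention of `<workflow name>.<input name>` keys and that the workflow name is `smartseq2`; check the input names shown in your workspace before relying on it. The values are the examples and defaults from the table above.

```python
import json

# Assumption: keys follow the WDL "<workflow>.<input>" convention with a
# workflow named "smartseq2"; verify against the inputs shown in Terra.
inputs = {
    "smartseq2.input_csv_file": "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "smartseq2.output_directory": "gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2_output",
    "smartseq2.reference": "GRCh38",
    # The remaining entries repeat the defaults listed in the table.
    "smartseq2.smartseq2_version": "1.0.0",
    "smartseq2.docker_registry": "cumulusprod",
    "smartseq2.zones": "us-east1-d us-west1-a us-west1-b",
    "smartseq2.num_cpu": 4,
    "smartseq2.memory": "3.60G",
    "smartseq2.disk_space": 10,
    "smartseq2.preemptible": 2,
}

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```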

Outputs:

See the table below for important outputs.

Name Type Description
output_count_matrix Array[String] A list of Google bucket URLs pointing to the gene-count matrices, one per plate. Each gene-count matrix file has the suffix .dge.txt.gz.

This WDL generates one gene-count matrix per SMART-Seq2 plate. The gene-count matrix uses Drop-Seq format:

  • The first line starts with "Gene" and then gives cell barcodes separated by tabs.
  • Starting from the second line, each line describes one gene. The first item in the line is the gene name, and the remaining items are the TPM-normalized count values of this gene for each cell.

The gene-count matrices can be fed directly into cumulus for downstream analysis.
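For a quick look at a matrix outside of cumulus, the format described above can be read with the Python standard library. The sketch below is an assumption-laden illustration: `read_dge` is a hypothetical helper, and the small matrix it reads is synthetic (made-up gene names and values) so that the example is self-contained.

```python
import csv
import gzip
import os
import tempfile


def read_dge(path):
    """Read a gzipped, tab-separated matrix: header of cells, one gene per row."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        cells = header[1:]  # the first header item is the "Gene" label
        values = {}
        for row in reader:
            if row:  # skip empty lines
                values[row[0]] = [float(item) for item in row[1:]]
    return cells, values


# Write a tiny synthetic matrix, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "plate-1.dge.txt.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("Gene\tcell-1\tcell-2\n")
        handle.write("GeneA\t12.5\t0\n")
        handle.write("GeneB\t3\t7.25\n")
    cells, values = read_dge(path)

print(cells)
print(values["GeneA"])
```

For real outputs, first copy the .dge.txt.gz file from the output directory to local disk (for example with gsutil cp) and pass its local path to `read_dge`.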

TPM-normalized counts are calculated as follows:

  1. Estimate the gene expression levels in TPM using RSEM.
  2. Suppose c reads are obtained for one cell; then calculate the TPM-normalized count for gene i as TPM_i / 1e6 * c.

TPM-normalized counts reflect both the relative expression levels and the cell sequencing depth.
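The scaling in step 2 can be illustrated with a short Python sketch. The function name `tpm_normalized_counts` and the numbers are made up for illustration; in the workflow, the TPM values come from RSEM.

```python
def tpm_normalized_counts(tpm_values, total_reads):
    """Apply TPM_i / 1e6 * c to every gene of one cell."""
    return [tpm / 1e6 * total_reads for tpm in tpm_values]


# Made-up TPM values for three genes in one cell; they sum to 1e6.
tpm_values = [500000.0, 250000.0, 250000.0]
total_reads = 2000  # c: number of reads obtained for this cell

counts = tpm_normalized_counts(tpm_values, total_reads)
print(counts)
```

Because the TPM values of a cell sum to one million, the TPM-normalized counts of that cell sum to c, which is how the values carry the sequencing depth as well as the relative expression levels.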


Custom Genome

We also provide a way to generate custom genome references for the SMART-Seq2 workflow.

  1. Import the smartseq2_create_reference workflow into your workspace.

    See the Terra documentation for adding a workflow. The smartseq2_create_reference workflow is in the Broad Methods Repository under the name “cumulus/smartseq2_create_reference”.

    On the workflow page, click the Export to Workspace... button and select, in the drop-down menu, the workspace to which you want to export smartseq2_create_reference.

  2. In your workspace, open smartseq2_create_reference in the WORKFLOWS tab. Select Process single workflow from files, as shown in the screenshot below, and click the SAVE button.

    _images/single_workflow.png

Inputs:

Please see the description of inputs below. Note that required inputs are shown in bold.

Name Description Type or Example Default
fasta Genome FASTA file File. For example, “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.primary_assembly.fa”  
gtf GTF gene annotation file (e.g. Homo_sapiens.GRCh38.83.gtf) File. For example, “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.83.gtf”  
smartseq2_version SMART-Seq2 version to use. Versions available: 1.0.0. String “1.0.0”
docker_registry Docker registry to use. Options: “cumulusprod” for Docker Hub images; “quay.io/cumulus” for backup images on the Red Hat registry. String “cumulusprod”
zones Google cloud zones String “us-east1-b us-east1-c us-east1-d”
cpu Number of CPUs Integer 8
memory Memory size string String “7.2G”
extra_disk_space Extra disk space in GB Integer 15
preemptible Number of preemptible tries Integer 2

Outputs:

Name Type Description
reference File The generated custom genome reference. Its default file name is rsem_ref.tar.gz.