Extract gene-count matrices from plated-based SMART-Seq2 data
-------------------------------------------------------------

Run SMART-Seq2 Workflow
~~~~~~~~~~~~~~~~~~~~~~~~

Follow the steps below to extract gene-count matrices from SMART-Seq2 data on Terra_. This WDL aligns reads using *STAR*, *HISAT2*, or *Bowtie 2* and estimates expression levels using *RSEM*.

#. Copy your sequencing output to your workspace bucket using gsutil_ in your unix terminal.

    You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.

    .. image:: images/google_bucket_link.png


    Use ``gsutil cp [OPTION]... src_url dst_url`` to copy data to your workspace bucket.
    For example, the following command copies the directory at /foo/bar/nextseq/Data/VK18WBC6Z4 to a Google bucket::

        gsutil -m cp -r /foo/bar/nextseq/Data/VK18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4

    ``-m`` means copy in parallel, ``-r`` means copy the directory recursively.


#. Create a sample sheet.

    The sample sheet provides metadata for each cell:

    .. list-table::
        :widths: 5 30
        :header-rows: 1

        * - Column
          - Description
        * - entity:sample
          - Cell name.
        * - plate
          - Plate name. Cells with the same plate name are from the same plate.
        * - read1
          - Location of the FASTQ file for read1 in the cloud (gsurl).
        * - read2
          - (Optional). Location of the FASTQ file for read2 in the cloud (gsurl). This field can be skipped for single-end reads.

    Example::

        entity:sample	plate	read1	read2
        cell-1	plate-1	gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-1_L001_R1_001.fastq.gz	gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-1_L001_R2_001.fastq.gz
        cell-2	plate-1	gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-2_L001_R1_001.fastq.gz	gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-2_L001_R2_001.fastq.gz
        cell-3	plate-2	gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-3_L001_R1_001.fastq.gz
        cell-4	plate-2	gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2/cell-4_L001_R1_001.fastq.gz


#. Upload your sample sheet to the workspace bucket.

    Example::

        gsutil cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/


#. Import *smartseq2* workflow to your workspace.

    Import by following instructions in `Import workflows to Terra`_. You should choose **github.com/lilab-bcb/cumulus/Smart-Seq2** to import.

    Moreover, in the workflow page, click ``Export to Workspace...`` button, and select the workspace to which you want to export *smartseq2* workflow in the drop-down menu.

#. In your workspace, open ``smartseq2`` in ``WORKFLOWS`` tab. Select ``Run workflow with inputs defined by file paths`` as below

    .. image:: images/single_workflow.png

   and click ``SAVE`` button.


Inputs:
^^^^^^^

Please see the description of inputs below. Note that required inputs are shown in bold.

.. list-table::
    :widths: 5 30 30 5
    :header-rows: 1

    * - Name
      - Description
      - Example
      - Default
    * - **input_tsv_file**
      - Sample Sheet (contains entity:sample, plate, read1, read2)
      - "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.tsv"
      -
    * - **output_directory**
      - Output directory
      - "gs://fc-e0000000-0000-0000-0000-000000000000/smartseq2_output"
      -
    * - **reference**
      - Reference transcriptome to align reads to. Acceptable values:

        - Pre-created genome references:

          - "GRCh38_ens93filt" for human, genome version is GRCh38, gene annotation is generated using human Ensembl 93 GTF according to `cellranger mkgtf`_;

          - "GRCm38_ens93filt" for mouse, genome version is GRCm38, gene annotation is generated using mouse Ensembl 93 GTF according to `cellranger mkgtf`_;

        - Create a custom genome reference using `smartseq2_create_reference workflow <./smart_seq_2.html#custom-genome>`_, and specify its Google bucket URL here.
      - | "GRCh38_ens93filt", or
        | "gs://fc-e0000000-0000-0000-0000-000000000000/rsem_ref.tar.gz"
      -
    * - aligner
      - Which aligner to use for read alignment. Options are "hisat2-hca", "star" and "bowtie"
      - "star"
      - "hisat2-hca"
    * - output_genome_bam
      - Whether to output bam file with alignments mapped to genomic coordinates and annotated with their posterior probabilities.
      - false
      - false
    * - smartseq2_version
      - SMART-Seq2 version to use. Versions available: 1.3.0.
      - "1.3.0"
      - "1.3.0"
    * - docker_registry
      - Docker registry to use. Options:

        - "quay.io/cumulus" for images on Red Hat registry;

        - "cumulusprod" for backup images on Docker Hub.
      - "quay.io/cumulus"
      - "quay.io/cumulus"
    * - zones
      - Google cloud zones
      - "us-east1-d us-west1-a us-west1-b"
      - "us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c"
    * - num_cpu
      - Number of cpus to request for one node
      - 4
      - 4
    * - memory
      - Memory size string
      - "3.60G"
      - If aligner is bowtie2 or hisat2-hca, "3.6G"; otherwise "32G"
    * - disk_space_multiplier
      - Factor to multiply size of R1 and R2 by for RSEM
      - Float
      - 11
    * - generate_count_matrix_disk_space
      - Disk space for count matrix generation task in GB
      - Integer
      - 10
    * - backend
      - Cloud infrastructure backend to use. Available options:

        - "gcp" for Google Cloud;
        - "aws" for Amazon AWS;
        - "local" for local machine.
      - "gcp"
      - "gcp"
    * - preemptible
      - Number of preemptible tries. This works only when *backend* is ``gcp``.
      - 2
      - 2
    * - awsMaxRetries
      - Number of maximum retries when running on AWS. This works only when *backend* is ``aws``.
      - 5
      - 5

---------------------------------

Outputs:
^^^^^^^^

.. list-table::
    :widths: 5 5 10
    :header-rows: 1

    * - Name
      - Type
      - Description
    * - output_count_matrix
      - String
      - Point to a Google bucket URL for count matrix in matrix market format.
    * - rsem_trans_bam
      - Array[String?]
      - An array of Google bucket URLs for RSEM transcriptomic BAM files
    * - rsem_genome_bam
      - Array[String?]
      - An array of Google bucket URLs for RSEM genomic BAM files if ``output_genome_bam`` is ``true``.
    * - rsem_gene
      - Array[File?]
      - An array of RSEM gene expression estimation files.
    * - rsem_isoform
      - Array[File?]
      - An array of RSEM isoform expression estimation files.
    * - rsem_time
      - Array[File?]
      - An array of RSEM execution time log files.
    * - aligner_log
      - Array[File?]
      - An array of Aligner log files.
    * - rsem_cnt
      - Array[File?]
      - An array of RSEM count files.
    * - rsem_model
      - Array[File?]
      - An array of RSEM model files.
    * - rsem_theta
      - Array[File?]
      - An array of RSEM generated theta files.


This WDL generates one gene-count matrix in matrix market format:

- output_count_matrix is a folder containing three files: matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz.
- matrix.mtx.gz is a gzipped matrix in matrix market format.
- barcodes.tsv.gz is a gzipped TSV file, containing 5 columns. 'barcodekey' is cell name. 'plate' is the plate name, which can be used for batch correction. 'total_reads' is the total number of reads. 'alignment_rate' is the alignment rate obtained from the aligner. 'unique_rate' is the percentage of reads aligned uniquely to a gene. Cells sequenced with single-end reads appear first in 'barcodekey'.
- features.tsv.gz is a gzipped TSV file, containing 2 columns. 'featurekey' is gene symbol. 'featureid' is Ensembl ID.

The gene-count matrix can be fed directly into **cumulus** for downstream analysis.

TPM-normalized counts are calculated as follows:

#. Estimate the gene expression levels in TPM using *RSEM*.

#. Suppose ``c`` reads are achieved for one cell, then calculate TPM-normalized count for gene ``i`` as ``TPM_i / 1e6 * c``.

TPM-normalized counts reflect both the relative expression levels and the cell sequencing depth.


---------------------------------

Custom Genome
~~~~~~~~~~~~~~~~

We also provide a way of generating user-customized Genome references for SMART-Seq2 workflow.

#. Import smartseq2_create_reference workflow to your workspace.

    Import by following instructions in `Import workflows to Terra`_. You should choose **github.com/lilab-bcb/cumulus/Smart-Seq2_create_reference** to import.

    Moreover, in the workflow page, click ``Export to Workflow...`` button, and select the workspace to which you want to export ``smartseq2_create_reference`` in the drop-down menu.

#. In your workspace, open ``smartseq2_create_reference`` in ``WORKFLOWS`` tab. Select ``Run workflow with inputs defined by file paths`` as below

    .. image:: images/single_workflow.png

   and click ``SAVE`` button.


Inputs:
^^^^^^^

Please see the description of inputs below. Note that required inputs are shown in bold.

.. list-table::
    :widths: 5 30 30 5
    :header-rows: 1

    * - Name
      - Description
      - Type or Example
      - Default
    * - **fasta**
      - Genome fasta file
      - | File.
        | For example, "gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
      -
    * - **gtf**
      - GTF gene annotation file (e.g. Homo_sapiens.GRCh38.83.gtf)
      - | File.
        | For example, "gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.83.gtf"
      -
    * - **output_directory**
      - Google bucket url for the output folder
      - "gs://fc-e0000000-0000-0000-0000-000000000000/output_refs"
      -
    * - **genome**
      - Output reference genome name. Output reference is a gzipped tarball with name genome_aligner.tar.gz
      - "GRCm38_ens97filt"
      -
    * - aligner
      - Build indices for which aligner, choices are hisat2-hca, star, or bowtie2.
      - "hisat2-hca"
      - "hisat2-hca"
    * - smartseq2_version
      - | SMART-Seq2 version to use.
        | Versions available: 1.3.0.
      - "1.3.0"
      - "1.3.0"
    * - docker_registry
      - Docker registry to use. Options:

        - "quay.io/cumulus" for images on Red Hat registry;

        - "cumulusprod" for backup images on Docker Hub.
      - "quay.io/cumulus"
      - "quay.io/cumulus"
    * - zones
      - Google cloud zones
      - "us-central1-c"
      - "us-central1-b"
    * - cpu
      - Number of CPUs
      - Integer
      - If aligner is bowtie2 or hisat2-hca, 8; otherwise 32
    * - memory
      - Memory size string
      - String
      - If aligner is bowtie2 or hisat2-hca, "7.2G"; otherwise "120G"
    * - disk_space
      - Disk space in GB
      - Integer
      - If aligner is bowtie2 or hisat2-hca, 40; otherwise 120
    * - backend
      - Cloud infrastructure backend to use. Available options:

        - "gcp" for Google Cloud;
        - "aws" for Amazon AWS;
        - "local" for local machine.
      - "gcp"
      - "gcp"
    * - preemptible
      - Number of preemptible tries. This works only when *backend* is ``gcp``.
      - 2
      - 2
    * - awsMaxRetries
      - Number of maximum retries when running on AWS. This works only when *backend* is ``aws``.
      - 5
      - 5

Outputs
^^^^^^^^

.. list-table::
    :widths: 5 5 10
    :header-rows: 1

    * - Name
      - Type
      - Description
    * - output_reference
      - File
      - The custom Genome reference generated. Its default file name is ``genome_aligner.tar.gz``.
    * - monitoring_log
      - File
      - CPU and memory profiling log.

---------------------------------


.. _gsutil: https://cloud.google.com/storage/docs/gsutil
.. _Import workflows to Terra: ./cumulus_import.html
.. _cellranger mkgtf: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references
.. _Terra: https://app.terra.bio/