Run STARsolo to generate gene-count matrices from FASTQ files

This star_solo workflow generates gene-count matrices from FASTQ data using STARsolo.

Prepare input data and import workflow

1. Run cellranger_workflow to generate FASTQ data

You can skip this step if your data are already in FASTQ format.

Otherwise, for 10X data, you first need to run cellranger_workflow to generate FASTQ files from the raw BCL data for each sample. Please follow the cellranger_workflow manual.

Note that you should set run_mkfastq to true to get FASTQ output. You can also set run_count to false to skip the Cell Ranger count step.

Non-Broad users need to build their own Docker image for the bcl2fastq step. Instructions are here.

2. Import star_solo

Import the star_solo workflow to your workspace.

See the Terra documentation for adding a workflow. The star_solo workflow is in the Broad Methods Repository under the name “cumulus/star_solo”.

Then, on the workflow page, click the Export to Workspace... button and select, in the drop-down menu, the workspace to which you want to export the star_solo workflow.

3. Prepare a sample sheet

3.1 Sample sheet format:

The sample sheet for star_solo workflow should be in TSV format, i.e. columns are separated by tabs (NOT commas). Please note that the columns in the TSV can be in any order, but that the column names must match the recognized headings.

The sample sheet describes how to identify flowcells and generate sample/channel-specific count matrices.

A brief description of the sample sheet format is listed below (required column headers are shown in bold).

Column | Description
Sample | Contains sample names. Each sample or 10X channel should have a unique sample name.
Flowcells | Indicates the Google bucket URLs of folder(s) holding FASTQ files of this sample.

For 10X data, the sample sheet supports sequencing the same 10X channel across multiple flowcells. If a sample is sequenced across multiple flowcells, list all of its flowcells in the Flowcells column, separated by commas. In the following example, sample_1 is sequenced across two flowcells and sample_2 on one.

Example:

Sample  Flowcells
sample_1        gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/sample_1_fastqs,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/sample_1_fastqs
sample_2        gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/sample_2_fastqs
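
A sheet in this format can also be generated programmatically, which helps avoid accidentally mixing tabs and commas. The following is a minimal Python sketch, not part of the workflow itself: the `build_sample_sheet` helper and the `samples` mapping are hypothetical names, and the bucket URLs simply mirror the example above.

```python
import csv
import io

# Hypothetical mapping from sample name to flowcell folders (URLs mirror the example above).
BUCKET = "gs://fc-e0000000-0000-0000-0000-000000000000"
samples = {
    "sample_1": [
        f"{BUCKET}/VK18WBC6Z4/sample_1_fastqs",
        f"{BUCKET}/VK10WBC9Z2/sample_1_fastqs",
    ],
    "sample_2": [f"{BUCKET}/VK18WBC6Z4/sample_2_fastqs"],
}


def build_sample_sheet(sample_to_flowcells):
    """Return TSV text with Sample and Flowcells columns.

    Columns are separated by tabs; multiple flowcells of one sample
    are joined with commas inside the Flowcells column.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Sample", "Flowcells"])
    for sample, flowcells in sample_to_flowcells.items():
        writer.writerow([sample, ",".join(flowcells)])
    return buffer.getvalue()


sheet_text = build_sample_sheet(samples)
print(sheet_text, end="")
```

Writing `sheet_text` to a file such as sample_sheet.tsv gives a sheet ready to upload to the workspace bucket.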

Alternatively, if you want to specify the Read 1 and Read 2 FASTQ files yourself, prepare the sample sheet in the following format:

Sample  R1      R2
sample_1        gs://your-bucket/sample_1_L001_R1.fastq.gz,gs://your-bucket/sample_1_L002_R1.fastq.gz   gs://your-bucket/sample_1_L001_R2.fastq.gz,gs://your-bucket/sample_1_L002_R2.fastq.gz
sample_2        gs://your-bucket/sample_2_L001_R1.fastq.gz      gs://your-bucket/sample_2_L001_R2.fastq.gz

Here, the FASTQ files listed in R1 and R2 must be in one-to-one correspondence when a sample has multiple R1 FASTQ files.
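
Because a mismatch between the R1 and R2 lists is easy to introduce by hand, it may be worth checking the sheet before uploading it. The sketch below is only an illustration: `check_paired_fastqs` is a hypothetical helper, and it verifies only that each row lists the same number of R1 and R2 files, not that the files themselves belong together. The `sample_3` row is made up to show a failing case.

```python
import csv
import io


def check_paired_fastqs(tsv_text):
    """Return (sample, n_r1, n_r2) for every row whose R1 and R2 counts differ."""
    problems = []
    # DictReader looks columns up by header name, so column order does not matter.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        r1 = [url for url in row["R1"].split(",") if url]
        r2 = [url for url in row["R2"].split(",") if url]
        if len(r1) != len(r2):
            problems.append((row["Sample"], len(r1), len(r2)))
    return problems


good_sheet = (
    "Sample\tR1\tR2\n"
    "sample_1\t"
    "gs://your-bucket/sample_1_L001_R1.fastq.gz,gs://your-bucket/sample_1_L002_R1.fastq.gz\t"
    "gs://your-bucket/sample_1_L001_R2.fastq.gz,gs://your-bucket/sample_1_L002_R2.fastq.gz\n"
    "sample_2\t"
    "gs://your-bucket/sample_2_L001_R1.fastq.gz\t"
    "gs://your-bucket/sample_2_L001_R2.fastq.gz\n"
)

# Made-up row with two R1 files but only one R2 file.
bad_sheet = good_sheet + (
    "sample_3\t"
    "gs://your-bucket/a_R1.fastq.gz,gs://your-bucket/b_R1.fastq.gz\t"
    "gs://your-bucket/a_R2.fastq.gz\n"
)

print(check_paired_fastqs(good_sheet))
print(check_paired_fastqs(bad_sheet))
```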

3.2 Upload your sample sheet to the workspace bucket:

Use gsutil (you already have it if you’ve installed the Google Cloud SDK) in your Unix terminal to upload your sample sheet to the workspace bucket.

Example:

gsutil cp /foo/bar/projects/sample_sheet.tsv gs://fc-e0000000-0000-0000-0000-000000000000/

4. Launch analysis

In your workspace, open star_solo in the WORKFLOWS tab. Select the desired snapshot version (e.g. latest). Select Process single workflow from files, as shown in the screenshot (_images/single_workflow1.png), and click the SAVE button. Select Use call caching and click INPUTS. Then fill in appropriate values in the Attribute column. Alternatively, you can upload a JSON file to configure the inputs by clicking Drag or click to upload json.

Once the INPUTS are appropriately filled in, click RUN ANALYSIS and then click LAUNCH.
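
If you prefer the JSON upload route, the file can be produced with a few lines of Python. This is a sketch under assumptions: the input names and example values come from the Workflow inputs table below, while the `star_solo.` key prefix follows the usual `<workflow name>.<input name>` convention for WDL inputs and should be confirmed against the names shown in the INPUTS tab.

```python
import json

# Assumption: keys are namespaced by the workflow name, as is conventional for
# WDL inputs JSON. Check the exact names in the INPUTS tab before uploading.
WORKFLOW = "star_solo"

# Values taken from the Example column of the Workflow inputs table.
inputs = {
    "input_tsv_file": "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.tsv",
    "genome": "GRCh38",
    "chemistry": "tenX_v3",
    "output_directory": "gs://fc-e0000000-0000-0000-0000-000000000000/count_result",
}

inputs_json = json.dumps(
    {f"{WORKFLOW}.{name}": value for name, value in inputs.items()},
    indent=2,
)
print(inputs_json)
```

Saving `inputs_json` to a file gives something that can be dropped onto Drag or click to upload json; inputs left out of the file keep their defaults.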


Workflow inputs

Below are the inputs for the star_solo workflow. Note that required inputs are in bold.

Name | Description | Example | Default
input_tsv_file | Input TSV sample sheet describing metadata of each sample. | “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.tsv” |
genome | Genome reference. It can be in either of the following two formats: a string naming a pre-built genome reference, or the Google bucket URL of a custom reference, which must be a .tar.gz file. | “GRCh38”, or “gs://user-bucket/starsolo.tar.gz” |
chemistry | Chemistry name. Available options: “tenX_v3” (for 10X V3 chemistry), “tenX_v2” (for 10X V2 chemistry), “DropSeq”, “SeqWell”, “SlideSeq” and “custom”. For “DropSeq”, “SeqWell” and “SlideSeq”, CBstart=1, CBlen=12, UMIstart=13, UMIlen=8. | “tenX_v3” |
output_directory | GS URL of output directory. | “gs://fc-e0000000-0000-0000-0000-000000000000/count_result” |
CBstart | Cell barcode start position (1-based coordinate). Only matters if chemistry is “custom”. | 1 |
CBlen | Cell barcode length. Only matters if chemistry is “custom”. | 16 |
UMIstart | UMI start position (1-based coordinate). Only matters if chemistry is “custom”. | 17 |
UMIlen | UMI length. Only matters if chemistry is “custom”. | 12 |
CBwhitelist | Cell barcode white list. Only matters if chemistry is “custom”. | gs://my_bucket/my_white_list.txt |
star_version | STAR version to use. Currently only 2.7.6a is supported. | “2.7.6a” | “2.7.6a”
docker_registry | Docker registry to use: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus”
zones | Google cloud zones to consider for execution. | “us-east1-d us-west1-a us-west1-b” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu | Number of CPUs to request for count per sample. | 32 | 32
memory | Memory size string for count per sample. | “120G” | “120G”
disk_space | Disk space in GB needed for count per sample. | 500 | 500
backend | Cloud infrastructure backend to use. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine. | “gcp” | “gcp”
preemptible | Maximum number of preemptible tries allowed. Only applies when backend is gcp. | 2 | 2
awsMaxRetries | Maximum number of retries when running on AWS. Only applies when backend is aws. | 5 | 5
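
To make the 1-based CBstart, CBlen, UMIstart and UMIlen coordinates concrete, the following Python sketch slices a cell barcode and a UMI out of a barcode read. It only illustrates the coordinate convention and is not how STARsolo is implemented; `split_barcode_read` and the 28-base read are made up for the example, while the two layouts use the values given in the table.

```python
def split_barcode_read(read, cb_start, cb_len, umi_start, umi_len):
    """Slice the cell barcode and UMI from a read using 1-based start positions."""
    # Convert 1-based start positions to Python's 0-based slice offsets.
    cell_barcode = read[cb_start - 1 : cb_start - 1 + cb_len]
    umi = read[umi_start - 1 : umi_start - 1 + umi_len]
    if len(cell_barcode) != cb_len or len(umi) != umi_len:
        raise ValueError("read is shorter than the barcode/UMI layout requires")
    return cell_barcode, umi


# Layout stated in the table for DropSeq, SeqWell and SlideSeq.
DROPSEQ_LAYOUT = dict(cb_start=1, cb_len=12, umi_start=13, umi_len=8)
# Layout from the Example column for a custom chemistry.
CUSTOM_EXAMPLE_LAYOUT = dict(cb_start=1, cb_len=16, umi_start=17, umi_len=12)

read = "ACGTACGTACGTTTTTGGGGCCCCAAAA"  # made-up 28-base barcode read

print(split_barcode_read(read, **DROPSEQ_LAYOUT))
print(split_barcode_read(read, **CUSTOM_EXAMPLE_LAYOUT))
```

With the DropSeq-style layout the barcode covers bases 1 to 12 and the UMI bases 13 to 20; with the custom example the barcode covers bases 1 to 16 and the UMI bases 17 to 28.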

Workflow outputs

See the table below for star_solo workflow outputs.

Name | Type | Description
output_folder | String | Google Bucket URL of output directory. Within it, each folder is for one sample in the input sample sheet.

Prebuilt genome references

We’ve built the following scRNA-seq references for users’ convenience:

Keyword | Description
GRCh38-2020-A | Human GRCh38, comparable to cellranger reference 2020-A (GENCODE v32/Ensembl 98)
mm10-2020-A | Mouse mm10, comparable to cellranger reference 2020-A (GENCODE vM23/Ensembl 98)
GRCh38 | Human GRCh38, comparable to cellranger reference 3.0.0, Ensembl v93 gene annotation
mm10 | Mouse mm10, comparable to cellranger reference 3.0.0, Ensembl v93 gene annotation

We’ve built the following snRNA-seq references for users’ convenience:

Keyword | Description
GRCh38-2020-A-premrna | Human, introns included, built from GRCh38 cellranger reference 2020-A, GENCODE v32/Ensembl 98 gene annotation, treating annotated transcripts as exons
mm10-2020-A-premrna | Mouse, introns included, built from mm10 cellranger reference 2020-A, GENCODE vM23/Ensembl 98 gene annotation, treating annotated transcripts as exons
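
Since the genome input accepts either one of these keywords or a bucket URL of a custom .tar.gz reference, a value can be sanity-checked before launching. The sketch below is an illustration only: `classify_genome` is a hypothetical helper, and its keyword set is copied from the two tables above, so it would need updating if the workflow adds references.

```python
# Keywords copied from the scRNA-seq and snRNA-seq reference tables above.
PREBUILT_GENOMES = {
    "GRCh38-2020-A",
    "mm10-2020-A",
    "GRCh38",
    "mm10",
    "GRCh38-2020-A-premrna",
    "mm10-2020-A-premrna",
}


def classify_genome(genome):
    """Return "prebuilt" for a known keyword or "custom" for a gs:// .tar.gz URL."""
    if genome in PREBUILT_GENOMES:
        return "prebuilt"
    if genome.startswith("gs://") and genome.endswith(".tar.gz"):
        return "custom"
    raise ValueError(
        f"{genome!r} is neither a listed keyword nor a gs:// URL ending in .tar.gz"
    )


print(classify_genome("GRCh38-2020-A"))
print(classify_genome("gs://user-bucket/starsolo.tar.gz"))
```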