Run STARsolo to generate gene-count matrices from FASTQ files
This starsolo_workflow
workflow generates gene-count matrices from FASTQ data using STARsolo.
Prepare input data and import workflow
1. Run cellranger_workflow
to generate FASTQ data
You can skip this step if your data are already in FASTQ format.
Otherwise, for 10X data, you need to first run cellranger_workflow to generate FASTQ files from BCL raw data for each sample. Please follow cellranger_workflow manual.
Notice that you should set run_mkfastq to true
to get FASTQ output. You can also set run_count to false
to skip Cell Ranger count step.
For Non-Broad users, you’ll need to build your own docker for bcl2fastq step. Instructions are here.
2. Import starsolo_workflow
Import starsolo_workflow workflow to your workspace by following instructions in Import workflows to Terra. You should choose workflow github.com/lilab-bcb/cumulus/STARsolo to import.
Moreover, in the workflow page, click the Export to Workspace...
button, and select the workspace to which you want to export starsolo_workflow in the drop-down menu.
3. Prepare a sample sheet
3.1 Sample sheet format:
Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.
The sample sheet describes how to identify flowcells and generate sample/channel-specific count matrices.
A brief description of the sample sheet format is listed below (required column headers are shown in bold).
Column |
Description |
---|---|
Sample |
Contains the sample name. Each sample should have a unique sample name. |
Reference |
Provides the reference genome used by STARSolo for each sample.
The elements in this column can be either Cloud bucket URIs to reference tarballs or keywords such as GRCh38-2020-A.
A full list of available keywords is included in genome reference section below.
|
Location |
Indicates the Cloud bucket URI of the folder holding FASTQ files of each sample. |
Assay |
Indicates the assay type of each sample. Available options:
If not specified, use the default |
3.2 Assay-specific preset STARsolo options
If tenX_v3, The following STARsolo options would be applied (could be overwritten by user-specified options):
--soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12 --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30 --outSAMtype BAM SortedByCoordinate --outSAMattributes CR UR CY UY CB UB
If tenX_multiome, use the same STARsolo options as for tenX_v3 assay, but with the 10X ARC Multiome Gene Expression whitelist.
If tenX_v2, the following STARsolo options would be applied (could be overwritten by user-specified options):
--soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10 --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30 --outSAMtype BAM SortedByCoordinate --outSAMattributes CR UR CY UY CB UB
If tenX_5p, the following STARsolo options would be applied (could be overwritten by user-specified options):
--soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10 --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --soloStrand Reverse --soloUMIdedup 1MM_CR --outFilterScoreMin 30 --outSAMtype BAM SortedByCoordinate --outSAMattributes CR UR CY UY CB UB
If tenX_5p_pe, the following STARsolo options would be applied (could be overwritten by user-specified options):
--soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10 --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --soloBarcodeMate 1 --clip5pNbases 39 0 --soloUMIdedup 1MM_CR --outFilterScoreMin 30 --outSAMtype BAM SortedByCoordinate --outSAMattributes CR UR CY UY CB UB
If ShareSeq, the following STARsolo options would be applied (could be overwritten by user-specific options):
--soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 24 --soloUMIstart 25 --soloUMIlen 10 --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR --clipAdapterType CellRanger4 --outFilterScoreMin 30 --outSAMtype BAM SortedByCoordinate --outSAMattributes CR UR CY UY CB UB
If SeqWell or DropSeq, the following STARsolo options would be applied (could be overwritten by user-specified options):
--soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 12 --soloUMIstart 13 --soloUMIlen 8 --outSAMtype BAM SortedByCoordinate --outSAMattributes CR UR CY UY CB UB
If None, no preset option would be applied.
The sample sheet supports sequencing the same sample across multiple flowcells. In case of multiple flowcells, you should specify one line for each flowcell using the same sample name. In the following example, we have 2 samples and sample_1
is sequenced in two flowcells.
Example:
Sample,Reference,Location,Assay
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/sample_1_fastqs,tenX_v3
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/sample_1_fastqs,tenX_v3
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/sample_2_fastqs,tenX_v2
3.2 Upload your sample sheet to the workspace bucket:
Example:
gsutil cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/
1. Launch analysis
In your workspace, open starsolo_workflow
in WORKFLOWS
tab. Select the desired snapshot version (e.g. latest). Select Process single workflow from files
as below
and click SAVE
button. Select Use call caching
and click INPUTS
. Then fill in appropriate values in the Attribute
column. Alternative, you can upload a JSON file to configure input by clicking Drag or click to upload json
.
Once INPUTS are appropriated filled, click RUN ANALYSIS
and then click LAUNCH
.
Workflow inputs
Below are inputs for count workflow. Notice that required inputs are in bold.
Name |
Description |
Example |
Default |
---|---|---|---|
input_csv_file |
Input CSV sample sheet describing metadata of each sample. |
“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.tsv” |
|
output_directory |
Cloud bucket URI of output directory. |
“gs://fc-e0000000-0000-0000-0000-000000000000/count_result” |
|
read1_fastq_pattern |
Filename suffix pattern in wildcards for Read 1. This is used for looking for Read 1 fastq files.
If fastq files are generated by CellRanger count, use
_S*_L*_R1_001.fastq.gz , which means Read 1 files must have names such as “<Sample>_S1_L1_R1_001.fastq.gz”, where <Sample> is specified in input_csv_file.If fastq files are Sequence Read Archive (SRA) data, use something like
_1.fastq.gz , where _1 refers to the first reads, so that Read 1 files must have names such as “<Sample>_1.fastq.gz” where <Sample> is specified in input_csv_file.If fastq files are not zipped, substitute
.fastq for .fastq.gz in the corresponding pattern above. |
“_S*_L*_R1_001.fastq.gz” |
“_S*_L*_R1_001.fastq.gz” |
read2_fastq_pattern |
Filename suffix pattern in wildcards for Read 2. This is used for looking for Read 2 fastq files.
If fastq files are generated by CellRanger count, use
_S*_L*_R2_001.fastq.gz , which means Read 2 files must have names such as “<Sample>_S1_L1_R2_001.fastq.gz”, where <Sample> is specified in input_csv_file.If fastq files are Sequence Read Archive (SRA) data, use something like
_2.fastq.gz , where _2 refers to the second reads, so that Read 2 files must have names such as “<Sample>_2.fastq.gz” where <Sample> is specified in input_csv_file.If fastq files are not zipped, substitute
.fastq for .fastq.gz in the corresponding pattern above. |
“_S*_L*_R2_001.fastq.gz” |
“_S*_L*_R2_001.fastq.gz” |
barcode_read |
Specify which read contains cell barcodes and UMIs: either
read1 or read2 . This only applies to samples with Assay None in input_csv_file.Otherwise, samples with Assay type
ShareSeq automatically specify read2 for cell barcodes and UMIs, while read1 for cDNAs;samples of all the other know Assay types automatically specify
read1 for cell barcodes and UMIs, while read2 for cDNAs. |
“read1” |
“read1” |
soloType |
[STARsolo option] Type of single-cell RNA-seq, choosing from CB_UMI_Simple, CB_UMI_Complex, CB_samTagOut, SmartSeq. |
“CB_UMI_Simple” |
None |
soloCBwhitelist |
[STARsolo option] Cell barcode white list in either plain text or gzipped format.
Notice: If specified, it will overwrite the white lists for ALL the samples in your sample sheet.
|
gs://my_bucket/my_white_list.txt |
None |
soloFeatures |
[STARsolo option] Genomic features for which the UMI counts per Cell Barcode are collected (can choose multiple items):
|
“Gene GeneFull SJ Velocyto” |
“Gene” |
soloMultiMappers |
[STARsolo option] Counting method for reads mapping to multiple genes (can choose multiple items):
|
“Unique” |
“Unique” |
soloCBstart |
[STARsolo option] Cell barcode start position (1-based coordinate). |
1 |
1 |
soloCBlen |
[STARsolo option] Cell barcode length. |
16 |
16 |
soloUMIstart |
[STARsolo option] UMI start position (1-based coordinate). |
17 |
17 |
soloUMIlen |
[STARsolo option] UMI length. |
10 |
10 |
soloBarcodeReadLength |
[STARsolo option] Length of the barcode read
- 1: equals to sum of soloCBlen and soloUMIlen.
- 0: not defined, do not check.
Notice:
0 is set to be default, which is different from STAR. This is in case users have barcode read sequenced of length 28 nt (standard for 10x 3’), but assay is 5’ (CB+UMI length is 26 nt). |
0 |
0 |
soloBarcodeMate |
[STARsolo option] Identifies which read mate contains the barcode (CB+UMI) sequence:
|
0 |
0 |
soloCBposition |
[STARsolo option] Position of Cell Barcode(s) on the barcode read.
Presently only works when solo_type is
CB_UMI_Complex , and barcodes are assumed to be on Read2.Format for each barcode: “startAnchor_startPosition_endAnchor_endPosition”
start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Position is the 0-based position with of the CB start(end) with respect to the Anchor Base
String for different barcodes are separated by space.
|
“0_0_2_-1 3_1_3_8” |
|
soloUMIposition |
[STARsolo option] Position of the UMI on the barcode read, same as soloCBposition |
“3_9_3_14” |
|
soloAdapterSequence |
[STARsolo option] Adapter sequence to anchor barcodes. |
||
soloAdapterMismatchesNmax |
[STARsolo option] Maximum number of mismatches allowed in adapter sequence. |
1 |
1 |
soloCBmatchWLtype |
[STARsolo option] Matching the Cell Barcodes to the WhiteList, choosing from
|
“1MM_multi” |
“1MM_multi” |
soloInputSAMattrBarcodeSeq |
[STARsolo option] When inputting reads from a SAM file ( |
“CR UR” |
|
soloInputSAMattrBarcodeQual |
[STARsolo option] When inputting reads from a SAM file ( |
“CY UY” |
|
soloStrand |
[STARsolo option] Strandedness of the solo libraries:
|
“Forward” |
“Forward” |
soloUMIdedup |
[STARsolo option] Type of UMI deduplication (collapsing) algorithm:
|
“1MM_All” |
“1MM_All” |
soloUMIfiltering |
[STARsolo option] Type of UMI filtering (for reads uniquely mapping to genes):
|
“MultiGeneUMI” |
“-” |
soloCellFilter |
[STARsolo option] Cell filtering type and parameters:
|
“CellRanger2.2 3000 0.99 10” |
“CellRanger2.2 3000 0.99 10” |
soloOutFormatFeaturesGeneField3 |
[STARsolo option] Field 3 in the Gene features.tsv file. If “-”, then no 3rd field is output. |
“Gene Expression” |
“Gene Expression” |
outSAMtype |
[STAR option] Type of SAM/BAM output. |
“BAM SortedByCoordinate” |
“BAM SortedByCoordinate” for tenX_v3, tenX_v2, SeqWell and DropSeq assay types,
“BAM Unsorted” otherwise.
|
limitBAMsortRAM |
[STAR option] Maximum available RAM (bytes) for sorting BAM. If |
0 |
0 |
outBAMsortingBinsN |
[STAR option] Number of genome bins fo coordinate-sorting. |
50 |
50 |
star_version |
STAR version to use. Currently support: |
“2.7.10b” |
“2.7.10b” |
docker_registry |
Docker registry to use:
|
“quay.io/cumulus” |
“quay.io/cumulus” |
zones |
Google cloud zones to consider for execution. |
“us-east1-d us-west1-a us-west1-b” |
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
num_cpu |
Number of CPUs to request for count per sample. |
32 |
32 |
memory |
Memory size string for count per sample. |
“120G” |
“120G” |
disk_space |
Disk space in GB needed for count per sample. |
500 |
500 |
backend |
Cloud infrastructure backend to use. Available options:
|
“gcp” |
“gcp” |
preemptible |
Number of maximum preemptible tries allowed. This works only when backend is |
2 |
2 |
awsQueueArn |
The AWS ARN string of the job queue to be used. This only works for |
“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” |
“” |
Workflow outputs
See the table below for star_solo workflow outputs.
Name |
Type |
Description |
---|---|---|
count_outputs |
Array[String] |
Google Bucket URI of output directories of all samples. Each folder is for one sample in the input sample sheet.
For the count matrices generated, taking
Gene solo feature for example, they are in <output_folder>/<sample_id>/Solo.out/Gene/raw/ and <output_folder>/<sample_id>/Solo.out/Gene/filtered/ subfolders. |
starsoloLogs |
Array[File] |
Google Bucket URIs of STAR logs for each sample, respectively. This is the |
Prebuilt genome references
We’ve built the following scRNA-seq references for users’ convenience:
Keyword |
Description |
---|---|
GRCh38-2020-A |
Human GRCh38, comparable to cellranger reference 2020-A (GENCODE v32/Ensembl 98) |
mm10-2020-A |
Mouse mm10, comparable to cellranger reference 2020-A (GENCODE vM23/Ensembl 98) |
GRCh38-and-mm10-2020-A |
Human GRCh38 (GENCODE v32/Ensembl 98) and mouse mm10 (GENCODE vM23/Ensembl 98) |
Note
For snRNA-seq data, please choose the corresponding scRNA-seq reference above, and add GeneFull
in the soloFeatures input.
Build STARSolo References
We provide a wrapper of STAR to build sc/snRNA-seq references. Please follow the instructions below.
1. Import starsolo_create_reference
Import starsolo_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/STARsolo_create_reference to import.
Moreover, in the workflow page, click the Export to Workspace...
button, and select the workspace to which you want to export starsolo_create_reference workflow in the drop-down menu.
2. Upload required data to Cloud bucket
Required data include the genome FASTA file and gene annotation GTF file of the target genome reference.
3. Workflow input
Required inputs are highlighted in bold.
Name |
Description |
Example |
Default |
---|---|---|---|
input_fasta |
Input genome reference in FASTA format. |
“gs://fc-e0000000-0000-0000-0000-000000000000/mm-10/genome.fa” |
|
input_gtf |
Input gene annotation file in GTF format. |
“gs://fc-e0000000-0000-0000-0000-000000000000/mm-10/genes.gtf” |
|
genome |
Genome reference name. This is used for specifying the name of the genome index generated. |
“mm-10” |
|
output_directory |
Cloud bucket URI of the output directory. |
“gs://fc-e0000000-0000-0000-0000-000000000000/starsolo-reference” |
|
docker_registry |
Docker registry to use:
|
“quay.io/cumulus” |
“quay.io/cumulus” |
star_version |
STAR version to use. Currently support: |
“2.7.10a” |
“2.7.10a” |
num_cpu |
Number of CPUs to request for count per sample. |
32 |
32 |
memory |
Memory size string for count per sample. |
“80G” |
“80G” |
disk_space |
Disk space in GB needed for count per sample. |
100 |
100 |
zones |
Google cloud zones to consider for execution. |
“us-east1-d us-west1-a us-west1-b” |
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
backend |
Cloud infrastructure backend to use. Available options:
|
“gcp” |
“gcp” |
preemptible |
Number of maximum preemptible tries allowed. This works only when backend is |
2 |
2 |
awsQueueArn |
The AWS ARN string of the job queue to be used. This only works for |
“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” |
“” |
4. Workflow Output
Name |
Type |
Description |
---|---|---|
output_reference |
File |
Gzipped reference folder with name “<genome>-starsolo.tar.gz”, where <genome> is specified by workflow input genome above. The workflow will save a copy of it under output_directory specified in workflow input above. |