Run Cell Ranger tools using cellranger_workflow
cellranger_workflow wraps Cell Ranger to process single-cell/nucleus RNA-seq, single-cell ATAC-seq and single-cell immune profiling data, and supports feature barcoding (cell/nucleus hashing, CITE-seq, Perturb-seq). It also provides routines to build cellranger references.
A general step-by-step instruction
The workflow starts with FASTQ files.
Note
Starting from v3.0.0, Cumulus cellranger_workflow drops support for mkfastq. If your data start from BCL files, please first run BCL Convert to demultiplex flowcells and generate FASTQ files.
1. Import cellranger_workflow
Import cellranger_workflow workflow to your workspace by following instructions in Import workflows to Terra. You should choose workflow github.com/lilab-bcb/cumulus/CellRanger to import.
Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export the cellranger_workflow workflow in the drop-down menu.
2. Upload sequencing data to Google bucket
Copy your FASTQ files to your workspace bucket using the gcloud storage command (you already have it if you’ve installed the Google Cloud SDK) in your Unix terminal.
You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.
There are three cases:
Case 1: All the FASTQ files are in one top-level folder. Then you can simply upload this folder to Cloud, and in your sample sheet, make sure Sample names are consistent with the filename prefix of their corresponding FASTQ files.
Case 2: In the top-level folder, each sample has a dedicated subfolder containing its FASTQ files. In this case, you need to upload the whole top-level folder, and in your sample sheet, make sure Sample names and their corresponding subfolder names are identical.
Case 3: Each sample’s FASTQ files are wrapped in a TAR file. In this case, upload the folder which contains this TAR file. Also, make sure Sample names are consistent with the filename prefix of their corresponding FASTQ files inside the TAR files.
Notice that if your FASTQ files are downloaded from the Sequence Read Archive (SRA) from NCBI, you must rename your FASTQs to follow the Illumina file naming conventions.
Example:
gcloud storage cp -r /foo/bar/K18WBC6Z4/Fastq gs://fc-e0000000-0000-0000-0000-000000000000/K18WBC6Z4_fastq
where -r means copy the directory recursively, and fc-e0000000-0000-0000-0000-000000000000 should be replaced by your own workspace Google bucket name.
Alternatively, users can submit jobs through command line interface (CLI) using altocumulus, which will smartly upload FASTQ files to cloud.
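The renaming required for FASTQ files downloaded from SRA can be scripted. The sketch below is only an illustration: it assumes SRA files are named like SRR000001_1.fastq.gz and that mate 1 corresponds to R1 and mate 2 to R2. The helper names and this mapping are hypothetical and should be checked against your own data, since 10x runs dumped from SRA may also contain index reads as extra files.

```python
import re

# Illumina-style FASTQ name: <Sample>_S<n>_L<lane>_<read>_001.fastq.gz
ILLUMINA_PATTERN = re.compile(
    r"^(?P<sample>[A-Za-z0-9_-]+)_S(?P<index>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_001\.fastq\.gz$"
)


def illumina_name(sample, read, lane=1, sample_index=1):
    """Build an Illumina-style FASTQ filename for one read file."""
    if read not in {"R1", "R2", "I1", "I2"}:
        raise ValueError(f"unexpected read label: {read}")
    return f"{sample}_S{sample_index}_L{lane:03d}_{read}_001.fastq.gz"


def sra_to_illumina(sra_filename, sample, read_map=None):
    """Map an SRA-style name (e.g. SRR000001_1.fastq.gz) to an Illumina-style name.

    read_map is an assumption about which SRA mate number holds which read.
    """
    read_map = read_map or {"1": "R1", "2": "R2"}
    match = re.match(r"^[A-Za-z0-9]+_(\d)\.fastq\.gz$", sra_filename)
    if match is None:
        raise ValueError(f"not an SRA-style FASTQ name: {sra_filename}")
    return illumina_name(sample, read_map[match.group(1)])
```

The sample argument should be the same value you later put in the Sample column, so that the filename prefix and the sample sheet stay consistent.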
3. Prepare a sample sheet
3.1 Sample sheet format:
Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.
The sample sheet describes how to generate count matrices from sequencing reads. A brief description of the sample sheet format is listed below (required column headers are shown in bold).
Column
Description
Sample
Sample name. This name must be consistent with its corresponding FASTQ filename prefix in the folder specified in the Flowcell column. Sample names can only contain characters from [a-zA-Z0-9_-] to be recognized by Cell Ranger.
Notice that if a sample has multiple sequencing runs, each of which has FASTQ files stored in a dedicated location, you can specify multiple entries in the sample sheet with the same name in the Sample column, and each entry accounts for one FASTQ folder location.
Reference
Provides the reference genome used by Cell Ranger for processing the sample.
The reference can be a keyword of a prebuilt reference (e.g. GRCh38-2020-A) that is stored in the Cumulus bucket, or a user-specified cloud URI to a custom reference (in tarball .tar.gz format).
A full list of available keywords is included in each of the following data type sections (e.g. sc/snRNA-seq) below.
Flowcell
Indicates the cloud URI of the uploaded folder containing FASTQ files for each sample.
Chemistry
Keywords to describe the 10x chemistry used for the sample. This column is optional. Check data type sections (e.g. sc/snRNA-seq) below for the corresponding list of available keywords.
DataType
Describes the data type of each sample, with keywords chosen from the list below. This column is optional, and the default is rna.
rna: Gene expression (GEX) data
vdj: V(D)J data
citeseq: CITE-Seq tag data
hashing: Cell-hashing or nucleus-hashing tag data
adt: For the case where hashing and citeseq reads are in the same sample library
cmo: Cell multiplexing oligos used in 10x Genomics’ CellPlex assay
crispr: Perturb-seq guide tag data
atac: scATAC-Seq data
frp: 10x Flex gene expression (old name is Fixed RNA Profiling) data
AuxFile
The Cloud URI pointing to auxiliary files of the corresponding samples, with different usage depending on DataType values:
For rna: It’s used by Sample Multiplexing methods, which specifies the sample name to multiplexing barcode mapping.
For frp: It’s used by Flex data, which specifies the sample name to Flex probe barcode mapping.
For citeseq, hashing, adt, and crispr: It’s the feature barcode file, which contains the antibody information for CITE-Seq, cell-hashing, or nucleus-hashing, or the gRNA information for Perturb-Seq.
If analyzing using cumulus_feature_barcoding, the feature barcode file should be in the format specified in the Feature barcoding assays section below;
If analyzing as part of the Sample Multiplexing data using cellranger multi, the feature barcode file should be in 10x Feature Reference format.
For cmo: It’s the CMO reference file (cmo-set option) when using custom CMOs in CellPlex data.
For vdj_t_gd: It’s the inner enrichment primer file (inner-enrichment-primers option) for VDJ-T-GD data.
Notice: This is the FeatureBarcodeFile column in previous versions of Cellranger workflow. This old name is still accepted for backward compatibility.
Link
Designed for Single Cell Multiome ATAC + Gene Expression, Feature Barcoding, Sample Multiplexing, or Flex.
Link multiple modalities together using a single link name. cellranger-arc count, cellranger count, or cellranger multi will be triggered automatically depending on the modalities.
If an empty string is provided, no link is assumed.
Link names can only contain characters from [a-zA-Z0-9_-] for Cell Ranger to recognize them.
Notice: Link names must be different from all Sample values to avoid overwriting each other’s settings.
The sample sheet supports sequencing the same 10x channels across multiple flowcells. If a sample is sequenced across multiple flowcells, simply list it in multiple rows, with one flowcell per row. In the following example, we have 4 samples sequenced in two flowcells.
Example:
Sample,Reference,Flowcell,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
3.2 Upload your sample sheet to the workspace bucket:
Example:
gcloud storage cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/
Alternatively, users can submit jobs through command line interface (CLI) using altocumulus, which will smartly upload FASTQ files to cloud.
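Before uploading, the naming rules from section 3.1 can be checked locally. The following is a minimal sketch, not part of cellranger_workflow: the function name is hypothetical, and treating Sample, Reference and Flowcell as the required columns is an assumption made for this example.

```python
import csv
import io
import re

# Assumption for this sketch: these three headers are the required ones.
REQUIRED_COLUMNS = {"Sample", "Reference", "Flowcell"}
# Characters accepted for Sample and Link names, as stated in section 3.1.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def check_sample_sheet(text):
    """Return a list of human-readable problems found in sample sheet CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    rows = list(reader)
    samples = {row["Sample"] for row in rows}
    problems = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        if not NAME_PATTERN.match(row["Sample"] or ""):
            problems.append(f"line {line_no}: invalid Sample name {row['Sample']!r}")
        link = (row.get("Link") or "").strip()
        if link and not NAME_PATTERN.match(link):
            problems.append(f"line {line_no}: invalid Link name {link!r}")
        if link and link in samples:
            problems.append(f"line {line_no}: Link {link!r} clashes with a Sample name")
    return problems
```

Because csv.DictReader looks columns up by header name, the check works regardless of column order, which matches the note that columns may appear in any order.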
4. Launch analysis
In your workspace, open cellranger_workflow in the WORKFLOWS tab. Select the desired snapshot version (e.g. latest). Select Run workflow with inputs defined by file paths and click the SAVE button. Select Use call caching and click INPUTS. Then fill in appropriate values in the Attribute column. Alternatively, you can upload a JSON file to configure inputs by clicking Drag or click to upload json.
Once INPUTS are appropriately filled, click RUN ANALYSIS and then click LAUNCH.
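If you prefer the JSON upload route, the inputs file can be generated with a few lines of Python. This is a sketch under one assumption: that input keys follow the usual WDL convention of prefixing each input name with the workflow name (here cellranger_workflow.). Please verify the exact keys against the INPUTS tab of your snapshot; the helper name is hypothetical.

```python
import json


def build_inputs(input_csv_file, output_directory,
                 workflow_name="cellranger_workflow", **optional):
    """Assemble a workflow inputs dictionary keyed as '<workflow>.<input>'."""
    inputs = {
        f"{workflow_name}.input_csv_file": input_csv_file,
        f"{workflow_name}.output_directory": output_directory,
    }
    # Optional inputs are passed through unchanged under the same prefix.
    for name, value in optional.items():
        inputs[f"{workflow_name}.{name}"] = value
    return inputs


inputs = build_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output",
    include_introns=True,
)
inputs_json = json.dumps(inputs, indent=4)
```

Writing inputs_json to a file gives something that can be dropped onto Drag or click to upload json.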
5. Workflow outputs
See the table below for workflow level outputs.
Name | Type | Description |
---|---|---|
count_outputs | Map[String, Array[String]?] | A modality-to-output map showing output URIs for all samples, organized by modality and one URI per sample. |
Single-cell and single-nucleus RNA-seq
To process sc/snRNA-seq data, follow the specific instructions below.
Sample sheet
Reference column.
Pre-built scRNA-seq references are summarized below.
Keyword | Description |
---|---|
GRCh38-2024-A | Human GRCh38, comparable to cellranger reference 2024-A (GENCODE v44/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+. |
GRCm39-2024-A | Mouse GRCm39, comparable to cellranger reference 2024-A (GENCODE vM33/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+. |
GRCh38_and_GRCm39-2024-A | Human GRCh38 (GENCODE v44/Ensembl 110) and mouse GRCm39 (GENCODE vM33/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+. |
mRatBN7.2-2024-A | Rat mRatBN7.2 reference. |
GRCh38-2020-A | Human GRCh38 (GENCODE v32/Ensembl 98) |
mm10-2020-A | Mouse mm10 (GENCODE vM23/Ensembl 98) |
GRCh38_and_mm10-2020-A | Human GRCh38 (GENCODE v32/Ensembl 98) and mouse mm10 (GENCODE vM23/Ensembl 98) |
Chemistry column.
The cellranger workflow fully supports all 10x assay configurations. The most widely used ones are listed below:
Chemistry | Explanation |
---|---|
auto | autodetection (default). If the index read has extra bases besides cell barcode and UMI, autodetection might fail. In this case, please specify the chemistry. |
threeprime | Single Cell 3′ |
fiveprime | Single Cell 5′ |
ARC-v1 | Gene Expression portion of 10x Multiome data |
Please refer to the --chemistry option section in Cell Ranger Command Line Arguments for all other valid chemistry keywords.
Flowcell column.
See the table in general steps section above.
Note
The workflow accepts input in TAR files which contain FASTQ files inside, and can automatically handle such cases.
DataType column.
This column is optional with a default rna. If you want to put a value, put rna here.
Example:
Sample,Reference,Flowcell,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,rna
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,threeprime,rna
sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
sample_3,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna
Workflow input
For sc/snRNA-seq data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger count. Relevant workflow inputs are described below, with required inputs highlighted in bold.
Name | Description | Example | Default |
---|---|---|---|
**input_csv_file** | Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType) in CSV format | “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv” | |
**output_directory** | Cloud URI of the output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output” | Results are written under directory output_directory and will overwrite any existing files at this location. |
include_introns | Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references. Note that if this option is set, cellranger_version must be >= 5.0.0. | true | true |
no_bam | Turn this option on to disable BAM file generation. This option is only available if cellranger_version >= 5.0.0. | false | false |
expect_cells | Expected number of recovered cells. Mutually exclusive with force_cells | 3000 | |
force_cells | Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells | 6000 | |
secondary | Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.) | false | false |
cellranger_version | cellranger version, could be: 9.0.1, 8.0.1, 7.2.0 | “9.0.1” | “9.0.1” |
docker_registry | Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
acronym_file | The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names. Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines. | “s3://xxxx/index.tsv” | “gs://cumulus-ref/resources/cellranger/index.tsv” |
zones | Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
num_cpu | Number of cpus to request for one node for cellranger count | 32 | 32 |
memory | Memory size string for cellranger count | “120G” | “120G” |
count_disk_space | Disk space in GB needed for cellranger count | 500 | 500 |
preemptible | Number of preemptible tries. Only works for GCP | 2 | 2 |
awsQueueArn | The AWS ARN string of the job queue to be used. Only works for AWS | “arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” | “” |
Workflow output
See the table below for important sc/snRNA-seq outputs.
Name | Type | Description |
---|---|---|
cellranger_count.output_count_directory | Array[String] | Subworkflow output. A list of cloud URIs containing gene count matrices, one URI per sample. |
cellranger_count.output_web_summary | Array[File] | Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger count output). |
collect_summaries.metrics_summaries | File | Task output. An excel spreadsheet containing QCs for each sample. |
Feature barcoding assays (cell & nucleus hashing, CITE-seq and Perturb-seq)
cellranger_workflow can extract feature-barcode count matrices in CSV format for feature barcoding assays such as cell and nucleus hashing, CellPlex, CITE-seq, and Perturb-seq.
For cell and nucleus hashing as well as CITE-seq, the feature refers to antibody. For Perturb-seq, the feature refers to guide RNA. Please follow the instructions below to configure cellranger_workflow.
The workflow uses Cumulus Feature Barcoding to process antibody and Perturb-Seq data.
Prepare feature barcode files
Prepare a CSV file with the following format: feature_barcode,feature_name. See below for an example:
TTCCTGCCATTACTA,sample_1
CCGTACCTCATTGTT,sample_2
GGTAGATGTCCTCAG,sample_3
TGGTGTCATTCTTGA,sample_4
The above file describes a cell hashing application with 4 samples.
If cell hashing and CITE-seq data share the same sample index, you should concatenate hashing and CITE-seq barcodes together and add a third column indicating the feature type. See below for an example:
TTCCTGCCATTACTA,sample_1,hashing
CCGTACCTCATTGTT,sample_2,hashing
GGTAGATGTCCTCAG,sample_3,hashing
TGGTGTCATTCTTGA,sample_4,hashing
CTCATTGTAACTCCT,CD3,citeseq
GCGCAACTTGATGAT,CD8,citeseq
Then upload it to your Google bucket:
gcloud storage cp antibody_index.csv gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
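The feature barcode file can also be produced programmatically. The sketch below writes the two-column or three-column layout described above; the function name and the two sanity checks (a uniform column count and unique barcodes) are assumptions of this example rather than documented requirements of Cumulus Feature Barcoding.

```python
import csv
import io


def feature_barcode_csv(records):
    """Render feature barcode rows as header-less CSV text.

    records: sequence of (barcode, name) or (barcode, name, feature_type) tuples.
    """
    widths = {len(record) for record in records}
    if widths not in ({2}, {3}):
        raise ValueError("all rows need 2 columns, or all rows need 3 columns")
    barcodes = [record[0] for record in records]
    if len(set(barcodes)) != len(barcodes):
        raise ValueError("duplicate feature barcode")

    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()
```

The returned text can be saved as antibody_index.csv and uploaded with the gcloud storage command shown above.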
Sample sheet
Reference column.
This column is not used for extracting feature-barcode count matrix. To be consistent, you can put the reference for the associated scRNA-seq assay here.
Chemistry column.
The following keywords are accepted for Chemistry column:
Chemistry | Explanation |
---|---|
auto | Default. Auto-detect the chemistry of your data from all possible 10x assay types. |
threeprime | Auto-detect the chemistry of your data from all 3’ assay types. |
fiveprime | Auto-detect the chemistry of your data from all 5’ assay types. |
SC3Pv4 | Single Cell 3’ v4. The workflow will auto-detect if Poly-A or CS1 capture method was applied to your data. Notice: This is a GEM-X chemistry, and only works for Cell Ranger v8.0.0+ |
SC3Pv3 | Single Cell 3′ v3. This is a Next GEM chemistry. The workflow will auto-detect if Poly-A or CS1 capture method was applied to your data. |
SC3Pv2 | Single Cell 3′ v2 |
SC5Pv3 | Single Cell 5’ v3. Notice: This is a GEM-X chemistry, and only works for Cell Ranger v8.0.0+ |
SC5Pv2 | Single Cell 5′ v2 |
multiome | 10x Multiome barcodes |
Note
Not all 10x chemistry names are supported for feature barcoding, as the workflow uses Cumulus Feature Barcoding to process the data.
DataType column.
The following keywords are accepted for DataType column:
DataType | Explanation |
---|---|
citeseq | CITE-seq |
hashing | Cell or nucleus hashing |
cmo | CellPlex |
adt | Hashing and CITE-seq are in the same library |
crispr | Perturb-seq/CROP-seq. If neither crispr_barcode_pos nor scaffold_sequence (see Workflow input) is set, crispr refers to 10x CRISPR assays. If in addition Chemistry is set to be SC3Pv3 or its aliases, Cumulus automatically complements the middle two bases to convert 10x feature barcoding cell barcodes back to 10x RNA cell barcodes. Otherwise, crispr refers to non-10x CRISPR assays, such as CROP-Seq. In this case, we assume feature barcoding cell barcodes are the same as the RNA cell barcodes and no cell barcode conversion will be conducted. |
AuxFile column.
Put cloud URI of the feature barcode file here.
Below is an example sample sheet:
Sample,Reference,Flowcell,Chemistry,DataType,AuxFile
sample_1_rna,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna,
sample_1_adt,,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,hashing,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
sample_2_gex,GRCh38-2024-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna,
sample_2_adt,GRCh38-2024-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,SC3Pv3,adt,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index2.csv
sample_3_crispr,,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,crispr,gs://fc-e0000000-0000-0000-0000-000000000000/crispr_index.csv
In the sample sheet above, not counting the header row,
Rows 1 and 2 specify the GEX and Hashing libraries of the same sample.
Rows 3 and 4 specify a sample which has GEX and adt (contains both Hashing and CITE-Seq data) libraries.
Row 5 describes one gRNA guide data for Perturb-seq (see crispr in DataType field).
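The cell barcode conversion mentioned for the crispr DataType (complementing the middle two bases) can be illustrated as follows. This is a sketch only: it assumes an even-length barcode (16 bases for 10x v3), so that the "middle two bases" are the 8th and 9th; the function name is hypothetical and this is not the Cumulus implementation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def complement_middle_two(barcode):
    """Complement the two central bases of an even-length cell barcode."""
    if len(barcode) % 2 != 0:
        raise ValueError("expected an even-length barcode")
    mid = len(barcode) // 2
    left, right = mid - 1, mid + 1  # slice covering the two central bases
    middle = "".join(COMPLEMENT[base] for base in barcode[left:right])
    return barcode[:left] + middle + barcode[right:]
```

Complementing is its own inverse, so applying the function twice returns the original barcode.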
Workflow input
For feature barcoding data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cumulus adt. Relevant workflow inputs are described below, with required inputs highlighted in bold.
Name | Description | Example | Default |
---|---|---|---|
**input_csv_file** | Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType, and AuxFile) | “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv” | |
**output_directory** | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output” | |
crispr_barcode_pos | Barcode start position at Read 2 (0-based coordinate) for CRISPR | 19 | 0 |
scaffold_sequence | Scaffold sequence in sgRNA for Perturb-seq, only used for crispr data type. | “GTTTAAGAGCTAAGCTGGAA” | “” |
max_mismatch | Maximum hamming distance in feature barcodes for the adt task (changed to 2 as default) | 2 | 2 |
min_read_ratio | Minimum read count ratio (non-inclusive) to justify a feature given a cell barcode and feature combination, only used for the adt task and crispr data type | 0.1 | 0.1 |
cumulus_feature_barcoding_version | Cumulus_feature_barcoding version for extracting feature barcode matrix. | “1.0.0” | “1.0.0” |
docker_registry | Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
zones | Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
feature_num_cpu | Number of cpus for extracting feature count matrix | 4 | 4 |
feature_memory | Optional memory string for extracting feature count matrix | “32G” | “32G” |
feature_disk_space | Disk space in GB needed for extracting feature count matrix | 100 | 100 |
preemptible | Number of preemptible tries. Only works for GCP | 2 | 2 |
awsQueueArn | The AWS ARN string of the job queue to be used. Only works for AWS | “arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” | “” |
Parameters used for feature count matrix extraction
Cell barcode inclusion lists (previously known as whitelists) are automatically decided based on the Chemistry specified in the sample sheet. The association table is here.
Cell barcode matching settings are also automatically decided based on the chemistry specified:
For 10x V3 and V4 chemistry: a hamming distance of 0 is allowed for matching cell barcodes, and the UMI length is 12;
For multiome: a hamming distance of 1 is allowed for matching cell barcodes, and the UMI length is 12;
For 10x V2 chemistry: a hamming distance of 1 is allowed for matching cell barcodes, and the UMI length is 10.
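To make the matching settings above concrete, here is a small sketch of Hamming-distance matching against an inclusion list. The dictionary keys are hypothetical labels for the three cases listed, and returning a match only when exactly one inclusion-list barcode is within range is an assumption of this example; the actual tool may resolve ambiguity differently.

```python
# (allowed Hamming distance for cell barcodes, UMI length), as listed above.
MATCH_SETTINGS = {
    "10x_v3_v4": (0, 12),
    "multiome": (1, 12),
    "10x_v2": (1, 10),
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed, inclusion_list, max_mismatch):
    """Return the single inclusion-list barcode within max_mismatch, else None."""
    hits = [bc for bc in inclusion_list if hamming(observed, bc) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None
```

A linear scan like this is only practical for toy inputs; it is meant to show the rule, not how to process a full inclusion list efficiently.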
For Perturb-seq data, a small number of sgRNA protospacer sequences will be sequenced ultra-deeply and we may have PCR chimeric reads. Therefore, we generate filtered feature count matrices as well in a data-driven manner:
First, plot the histogram of UMIs with certain number of read counts. The number of UMIs with x supporting reads decreases when x increases. We start from x = 1, and a valley between two peaks is detected if we find count[x] < count[x + 1] < count[x + 2]. We filter out all UMIs with < x supporting reads since they are likely formed due to chimeric reads.
In addition, we also filter out barcode-feature-UMI combinations that have their read count ratio, which is defined as total reads supporting barcode-feature-UMI over total reads supporting barcode-UMI, no larger than min_read_ratio parameter set above.
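The two filtering rules can be sketched in a few lines. This is a literal reading of the description above, written for illustration and not taken from the Cumulus Feature Barcoding source; the function names and input layouts are hypothetical.

```python
from collections import Counter


def find_valley(umi_read_counts):
    """Return the smallest x >= 1 with count[x] < count[x+1] < count[x+2], or None.

    umi_read_counts: iterable holding the number of supporting reads of each UMI.
    """
    hist = Counter(umi_read_counts)  # hist[x] = number of UMIs with x reads
    if not hist:
        return None
    for x in range(1, max(hist) - 1):
        if hist[x] < hist[x + 1] < hist[x + 2]:
            return x
    return None


def filter_by_read_ratio(stats, min_read_ratio=0.1):
    """Keep barcode-UMI-feature combinations whose read ratio exceeds the threshold.

    stats: dict mapping (barcode, umi, feature) -> read count.
    """
    totals = Counter()
    for (barcode, umi, _feature), count in stats.items():
        totals[(barcode, umi)] += count
    return {
        key: count
        for key, count in stats.items()
        if count / totals[key[:2]] > min_read_ratio
    }
```

With a detected valley x, UMIs supported by fewer than x reads would be dropped before the ratio filter is applied. The strict comparison in filter_by_read_ratio mirrors the "non-inclusive" wording of min_read_ratio.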
Workflow outputs
The table below lists important feature barcoding output when using Cumulus Feature Barcoding:
Name | Type | Description |
---|---|---|
cumulus_adt.output_count_directory | Array[String] | Subworkflow output. A list of cloud URIs containing feature-barcode count matrices, one URI per sample. |
In addition, for each antibody tag or crispr tag sample, a folder with the sample ID is generated under output_directory. In the folder, two files, sample_id.csv and sample_id.stat.csv.gz, are generated.
sample_id.csv is the feature count matrix. It has the following format. The first line describes the column names: Antibody/CRISPR,cell_barcode_1,cell_barcode_2,...,cell_barcode_n. The following lines describe UMI counts for each feature barcode, with the following format: feature_name,umi_count_1,umi_count_2,...,umi_count_n.
sample_id.stat.csv.gz stores the gzipped sufficient statistics. It has the following format. The first line describes the column names: Barcode,UMI,Feature,Count. The following lines describe the read counts for every barcode-umi-feature combination.
If the feature barcode file has a third column, there will be two files for each feature type in the third column. For example, if hashing is present, sample_id.hashing.csv and sample_id.hashing.stat.csv.gz will be generated.
sample_id.report.txt is a summary report in TXT format. The first lines describe the total number of reads parsed, the number of reads with valid cell barcodes (and percentage over all parsed reads), the number of reads with valid feature barcodes (and percentage over all parsed reads) and the number of reads with both valid cell and feature barcodes (and percentage over all parsed reads). It is then followed by sections describing each feature type. In each section, 7 lines are shown: section title, number of valid cell barcodes (with matching cell barcode and feature barcode) in this section, number of reads for these cell barcodes, mean number of reads per cell barcode, number of UMIs for these cell barcodes, mean number of UMIs per cell barcode and sequencing saturation.
If data type is crispr, three additional files, sample_id.umi_count.pdf, sample_id.filt.csv and sample_id.filt.stat.csv.gz, are generated.
sample_id.umi_count.pdf plots number of UMIs against UMI with certain number of reads and colors UMIs with high likelihood of being chimeric in blue and other UMIs in red. This plot is generated purely based on number of reads each UMI has. For better visualization, we do not show UMIs with > 50 read counts (rare in data).
sample_id.filt.csv is the filtered feature count matrix. It has the same format as sample_id.csv.
sample_id.filt.stat.csv.gz is the filtered sufficient statistics. It has the same format as sample_id.stat.csv.gz.
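As an illustration of the sample_id.csv layout described above (features in rows, cell barcodes in columns), the following sketch parses such a matrix from text using only the standard library. The function name is hypothetical and the barcodes in the usage note are made up.

```python
import csv
import io


def read_feature_matrix(text):
    """Parse a feature count matrix into {feature_name: {cell_barcode: umi_count}}."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    barcodes = header[1:]  # the first header field is Antibody or CRISPR
    matrix = {}
    for row in reader:
        if not row:
            continue  # tolerate trailing blank lines
        matrix[row[0]] = dict(zip(barcodes, (int(value) for value in row[1:])))
    return matrix
```

For example, passing the text "Antibody,AAAC,AAAG\nCD3,5,0\n" yields one feature, CD3, with counts for two cell barcodes. The same function applies to sample_id.filt.csv, since it shares the format.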
Single-cell immune profiling
To process single-cell immune profiling (scIR-seq) data, follow the specific instructions below.
Sample sheet
Reference column.
Pre-built scIR-seq references are summarized below.
Keyword | Description |
---|---|
GRCh38_vdj_v7.1.0 | Human GRCh38 V(D)J sequences, cellranger reference 7.1.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf |
GRCh38_vdj_v7.0.0 | Human GRCh38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf |
GRCm38_vdj_v7.0.0 | Mouse GRCm38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf |
Chemistry column.
This column is not used for scIR-seq data. Put fiveprime here as a placeholder if you decide to include the Chemistry column.
DataType column.
Choose one from the available types below:
vdj: The VDJ library. Let the workflow auto-detect the chain type.
vdj_t: The VDJ-T library for T-cell receptor sequences.
vdj_b: The VDJ-B library for B-cell receptor sequences.
vdj_t_gd: The VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains.
AuxFile column.
Only needed for vdj_t_gd type samples, which use primer sequences to enrich cDNA for V(D)J sequences. In this case, provide a .txt file containing such sequences, one per line. This file is then passed to the --inner-enrichment-primers option of cellranger vdj.
Note
The --chain option in cellranger vdj is automatically decided based on the DataType value specified:
For vdj: set to --chain auto
For vdj_t and vdj_t_gd: set to --chain TR
For vdj_b: set to --chain IG
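The DataType to --chain mapping in the note above amounts to a small lookup table. The sketch below only restates that mapping; the names are hypothetical and it is not how the workflow itself is implemented.

```python
# DataType value -> value passed to the --chain option of cellranger vdj.
CHAIN_BY_DATATYPE = {
    "vdj": "auto",
    "vdj_t": "TR",
    "vdj_t_gd": "TR",
    "vdj_b": "IG",
}


def chain_arguments(data_type):
    """Return the --chain arguments implied by a V(D)J DataType value."""
    if data_type not in CHAIN_BY_DATATYPE:
        raise ValueError(f"not a V(D)J DataType: {data_type}")
    return ["--chain", CHAIN_BY_DATATYPE[data_type]]
```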
An example sample sheet is below:
Sample,Reference,Flowcell,Chemistry,DataType,AuxFile
sample1,GRCh38_vdj_v7.1.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,fiveprime,vdj,
sample2,GRCh38_vdj_v7.1.0,gs://my-bucket/s2_fastqs,,vdj_t_gd,gs://my-bucket/s2_enrich_primers.txt
Workflow input
For scIR-seq data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger vdj. Relevant workflow inputs are described below, with required inputs highlighted in bold.
Name | Description | Example | Default |
---|---|---|---|
**input_csv_file** | Sample Sheet (contains Sample, Reference, Flowcell, DataType, Chemistry, and AuxFile) | “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv” | |
**output_directory** | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output” | |
vdj_denovo | Do not align reads to reference V(D)J sequences before de novo assembly | false | false |
cellranger_version | cellranger version, could be: 9.0.1, 8.0.1, 7.2.0 | “9.0.1” | “9.0.1” |
docker_registry | Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
acronym_file | The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names. Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines. | “s3://xxxx/index.tsv” | “gs://cumulus-ref/resources/cellranger/index.tsv” |
zones | Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
num_cpu | Number of cpus to request for one node for cellranger vdj | 32 | 32 |
memory | Memory size string for cellranger vdj | “120G” | “120G” |
vdj_disk_space | Disk space in GB needed for cellranger vdj | 500 | 500 |
preemptible | Number of preemptible tries. Only works for GCP | 2 | 2 |
awsQueueArn | The AWS ARN string of the job queue to be used. Only works for AWS | “arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” | “” |
Workflow output
See the table below for important scIR-seq outputs.
Name | Type | Description |
---|---|---|
cellranger_vdj.output_count_directory | Array[String] | Subworkflow output. A list of cloud URIs containing vdj results, one URI per sample. |
cellranger_vdj.output_web_summary | Array[File] | Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger vdj output). |
collect_summaries_vdj.metrics_summaries | File | Task output. An excel spreadsheet containing QCs for each sample. |
Single-cell ATAC-seq
To process scATAC-seq data, follow the specific instructions below.
Sample sheet
Reference column.
Pre-built scATAC-seq references are summarized below.
Keyword | Description |
---|---|
GRCh38-2020-A_arc_v2.0.0 | Human GRCh38, cellranger-arc/atac reference 2.0.0 |
mm10-2020-A_arc_v2.0.0 | Mouse mm10, cellranger-arc/atac reference 2.0.0 |
GRCh38_and_mm10-2020-A_atac_v2.0.0 | Human GRCh38 and mouse mm10, cellranger-atac reference 2.0.0 |
Chemistry column.
The default is auto, which does not specify a given chemistry. To analyze just the individual ATAC library from a 10x multiome assay using cellranger-atac count, use ARC-v1 in the Chemistry column.
DataType column.
Set it to atac.
AuxFile column.
Leave it blank for scATAC-seq.
An example sample sheet is below:
Sample,Reference,Flowcell,DataType
sample_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9YB/Fastq,atac
Workflow input
cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger-atac count. Please see the description of inputs below. Note that required inputs are shown in bold.
Name | Description | Example | Default |
---|---|---|---|
**input_csv_file** | Sample Sheet (contains Sample, Reference, Flowcell, DataType, and Chemistry) | “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv” | |
**output_directory** | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_output” | |
force_cells | Force pipeline to use this number of cells, bypassing the cell detection algorithm | 6000 | |
atac_dim_reduce | Choose the algorithm for dimensionality reduction prior to clustering and tsne: “lsa”, “plsa”, or “pca” | “lsa” | “lsa” |
peaks | A 3-column BED file of peaks to override cellranger atac peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with | “gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed” | |
cellranger_atac_version | cellranger-atac version. Available options: 2.1.0, 2.0.0 | “2.1.0” | “2.1.0” |
docker_registry | Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
acronym_file | The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names. Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines. | “s3://xxxx/index.tsv” | “gs://cumulus-ref/resources/cellranger/index.tsv” |
zones | Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
atac_num_cpu | Number of cpus for cellranger-atac count | 64 | 64 |
atac_memory | Memory string for cellranger-atac count | “57.6G” | “57.6G” |
atac_disk_space | Disk space in GB needed for cellranger-atac count | 500 | 500 |
preemptible | Number of preemptible tries. Only works for GCP | 2 | 2 |
awsQueueArn | The AWS ARN string of the job queue to be used. Only works for AWS | “arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” | “” |
Workflow output
See the table below for important scATAC-seq outputs.
Name |
Type |
Description |
---|---|---|
cellranger_atac_count.output_count_directory |
Array[String] |
Subworkflow output. A list of cloud URIs containing cellranger-atac count outputs, one URI per sample. |
cellranger_atac_count.output_web_summary |
Array[File] |
Subworkflow output. A list of HTML files visualizing QC metrics for each sample (cellranger-atac count output). |
collect_summaries_atac.metrics_summaries |
File |
Task output. An Excel spreadsheet containing QCs for each sample. |
Single-cell Multiome (GEX + ATAC)
To process 10x Multiome (GEX + ATAC) data, follow the instructions below:
Sample sheet
Reference column.
Pre-built single-cell Multiome ATAC + Gene Expression references are summarized below.
Keyword
Description
GRCh38-2020-A_arc_v2.0.0
Human GRCh38 sequences (GENCODE v32/Ensembl 98), cellranger arc reference 2.0.0
mm10-2020-A_arc_v2.0.0
Mouse GRCm38 sequences (GENCODE vM23/Ensembl 98), cellranger arc reference 2.0.0
Chemistry column.
The default is auto, which leaves the chemistry unspecified so that it is detected automatically.
DataType column.
For each sample, choose a data type from the table below:
DataType
Description
rna
For scRNA-Seq modality of the data
atac
For scATAC-Seq modality of the data
AuxFile column.
Leave it blank.
Link column.
Assign the same unique Link name to all modalities that are linked. Notice: The Link name must be different from all Sample column values.
Example:
Link,Sample,Reference,Flowcell,DataType
sample1,s1_rna,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,rna
sample1,s1_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,atac
In the above example, the linked samples will be processed together, and the output will be one subfolder named sample1.
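The Link rule can be checked before submitting a job. The following is a minimal sketch, not part of the workflow itself: the function name group_by_link is hypothetical, and it assumes the sample sheet has a header row with Link, Sample, and DataType columns as in the example above.

```python
import csv
import io
from collections import defaultdict


def group_by_link(sheet_text):
    """Group sample sheet rows by Link and reject Link names that equal a Sample name."""
    rows = list(csv.DictReader(io.StringIO(sheet_text)))
    samples = {row["Sample"] for row in rows}
    groups = defaultdict(list)
    for row in rows:
        link = row["Link"]
        if link in samples:
            raise ValueError(f"Link name '{link}' collides with a Sample name")
        groups[link].append((row["Sample"], row["DataType"]))
    return dict(groups)


sheet = """Link,Sample,Reference,Flowcell,DataType
sample1,s1_rna,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,rna
sample1,s1_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,atac
"""
print(group_by_link(sheet))
```

Each key of the returned dictionary corresponds to one linked run, and therefore to one output subfolder.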
Workflow input
For single-cell multiomics data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files). Relevant workflow inputs are described below, with required inputs highlighted in bold.
Name |
Description |
Example |
Default |
---|---|---|---|
input_csv_file |
Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType, and Link) |
“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv” |
|
output_directory |
Output directory |
“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output” |
|
include_introns |
Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references. |
true |
true |
no_bam |
Turn this option on to disable BAM file generation. |
false |
false |
arc_gex_exclude_introns |
Disable counting of intronic reads. In this mode, only reads that are exonic and compatible with annotated splice junctions in the reference are counted.
Note: using this mode will reduce the UMI counts in the feature-barcode matrix.
|
false |
false |
arc_min_atac_count |
Cell caller override to define the minimum number of ATAC transposition events in peaks (ATAC counts) for a cell barcode.
Note: this input must be specified in conjunction with the arc_min_gex_count input. With both inputs set, a barcode is defined as a cell if it contains at least arc_min_atac_count ATAC counts AND at least arc_min_gex_count GEX UMI counts. |
100 |
|
arc_min_gex_count |
Cell caller override to define the minimum number of GEX UMI counts for a cell barcode.
Note: this input must be specified in conjunction with arc_min_atac_count. See the description of the arc_min_atac_count input for details. |
200 |
|
peaks |
A 3-column BED file of peaks to override the cellranger-arc peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with # are allowed |
“gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed” |
|
cellranger_arc_version |
cellranger-arc version, could be: |
“2.0.2.strato” |
“2.0.2.strato” |
docker_registry |
Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.
|
“quay.io/cumulus” |
“quay.io/cumulus” |
acronym_file |
The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names.
Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.
|
“s3://xxxx/index.tsv” |
“gs://cumulus-ref/resources/cellranger/index.tsv” |
zones |
Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. |
“us-central1-a us-west1-a” |
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
arc_num_cpu |
Number of cpus to request for one link |
64 |
64 |
arc_memory |
Memory size string for one link |
“160G” |
“160G” |
arc_disk_space |
Disk space in GB needed for one link |
700 |
700 |
preemptible |
Number of preemptible tries. Only works for GCP |
2 |
2 |
awsQueueArn |
The AWS ARN string of the job queue to be used. Only works for AWS |
“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” |
“” |
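The relationship between arc_min_atac_count and arc_min_gex_count described in the table above can be illustrated with a short sketch. It only restates the documented rule (both inputs set together, and a barcode must pass both thresholds); the function names are hypothetical and this is not how Cell Ranger ARC itself is implemented.

```python
def validate_overrides(arc_min_atac_count=None, arc_min_gex_count=None):
    """Both thresholds must be given together, or both left unset."""
    if (arc_min_atac_count is None) != (arc_min_gex_count is None):
        raise ValueError(
            "arc_min_atac_count and arc_min_gex_count must be specified together"
        )


def is_cell(atac_count, gex_count, arc_min_atac_count, arc_min_gex_count):
    """A barcode counts as a cell only if it meets BOTH minimum thresholds."""
    return atac_count >= arc_min_atac_count and gex_count >= arc_min_gex_count


validate_overrides(arc_min_atac_count=100, arc_min_gex_count=200)
print(is_cell(150, 250, 100, 200))  # True: both thresholds met
print(is_cell(150, 50, 100, 200))   # False: GEX UMI count below 200
```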
Workflow output
See the table below for important output:
Name |
Type |
Description |
---|---|---|
cellranger_arc_count.output_count_directory |
Array[String] |
A list of cloud URIs to output, one URI per link |
cellranger_arc_count.output_web_summary |
Array[File] |
A list of HTML files visualizing QC metrics for each link |
collect_summaries_arc.metrics_summaries |
File |
An Excel spreadsheet containing QC metrics for each link |
Flex, Sample Multiplexing and Multiomics
The cellranger workflow supports processing data of 10x Flex and Sample Multiplexing type, as well as multiomics data. Follow the corresponding sections below based on your data type:
Flex Gene Expression
This section covers preparing the sample sheet for Flex (previously named Fixed RNA Profiling) data.
Sample and Link column.
Sample column is for specifying the name of each sample in your data. Sample names must be unique within the sample sheet.
Link column is for specifying the name of your whole dataset, so that the workflow knows which samples should be put together to run cellranger multi.
Notice 1: You should use a unique Link name for all samples belonging to the same data/experiment. Moreover, the Link name must be different from all Sample names.
Notice 2: If there is only a scRNA-Seq sample in the data, you don’t need to specify Link name. Then the workflow would use its Sample name for the whole data.
DataType, Reference, and AuxFile column.
For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:
DataType |
Reference |
AuxFile |
Description |
---|---|---|---|
frp |
Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format. |
Path to a text file including the sample name to Flex probe barcode association (see an example below this table). |
For RNA-Seq samples |
Choose one from: citeseq, crispr |
No need to specify a reference |
Path to its feature reference file of 10x Feature Reference format. Notice: If there are multiple antibody capture samples, you need to combine the feature barcodes used in all of them in one reference file. |
For antibody capture samples:
citeseq: For CITE-Seq samples.
crispr: For Perturb-Seq samples. Notice: This data type used in Flex is supported only in Cell Ranger v8.0+. |
An example sample name to Flex probe barcode association file is the following (see here for examples of different Flex experiment settings):
sample_id,probe_barcode_ids,description
sample1,BC001,Control
sample2,BC002,Treated
The description column is optional and specifies the description of the samples.
Note
In the sample name to Flex probe barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, probe_barcode_ids, and description (optional).
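The optional-header rule can be made concrete with a small parser. This is a sketch under stated assumptions: the function name parse_barcode_file is hypothetical, and the header is detected simply by checking whether the first field of the first line equals sample_id, which may differ from how the workflow or Cell Ranger detects it.

```python
import csv
import io

# Fixed column order that applies when the header line is omitted.
DEFAULT_COLUMNS = ["sample_id", "probe_barcode_ids", "description"]


def parse_barcode_file(text, default_columns=DEFAULT_COLUMNS):
    """Parse a sample-to-probe-barcode file with or without a header line."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    if not rows:
        return []
    if rows[0][0] == default_columns[0]:
        header, data = rows[0], rows[1:]
    else:
        header, data = default_columns, rows
    # zip stops at the shorter sequence, so a missing description is simply omitted.
    return [dict(zip(header, r)) for r in data]


with_header = "sample_id,probe_barcode_ids,description\nsample1,BC001,Control\nsample2,BC002,Treated\n"
without_header = "sample1,BC001,Control\nsample2,BC002,Treated\n"
print(parse_barcode_file(with_header) == parse_barcode_file(without_header))  # True
```

The same approach applies to the other association files in this section by passing a different default_columns list.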
Below is an example sample sheet for Flex data:
Sample,Reference,Flowcell,DataType,AuxFile
s1,GRCh38-2020-A,gs://my-bucket/s1_fastqs,frp,gs://my-bucket/s1_flex.csv
Notice that Link column is not required for this case.
An example sample sheet for a more complex Flex dataset:
Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,frp,gs://my-bucket/s2_flex.csv
s2,s2_citeseq,,gs://my-bucket/s2_fastqs,citeseq,gs://my-bucket/s2_fbc.csv
s2,s2_crispr,,gs://my-bucket/s2_fastqs,crispr,gs://my-bucket/s2_fbc.csv
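The two sample sheets above differ in whether a Link column is present. A minimal sketch of how run names could be derived follows, assuming (as Notice 2 states) that a sample without a Link name is run under its own Sample name. The function name resolve_links is hypothetical and the workflow's actual logic may differ.

```python
import csv
import io


def resolve_links(sheet_text):
    """Map each run name to its samples, falling back to the Sample name when Link is absent or empty."""
    groups = {}
    for row in csv.DictReader(io.StringIO(sheet_text)):
        link = (row.get("Link") or "").strip() or row["Sample"]
        groups.setdefault(link, []).append(row["Sample"])
    return groups


simple = """Sample,Reference,Flowcell,DataType,AuxFile
s1,GRCh38-2020-A,gs://my-bucket/s1_fastqs,frp,gs://my-bucket/s1_flex.csv
"""
linked = """Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,frp,gs://my-bucket/s2_flex.csv
s2,s2_citeseq,,gs://my-bucket/s2_fastqs,citeseq,gs://my-bucket/s2_fbc.csv
s2,s2_crispr,,gs://my-bucket/s2_fastqs,crispr,gs://my-bucket/s2_fbc.csv
"""
print(resolve_links(simple))
print(resolve_links(linked))
```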
Flex Probe Set.
Flex uses probes that target protein-coding genes in the human or mouse transcriptome. The probe set is automatically determined by the genome reference that users specify for the scRNA-Seq sample, following the table below:
Genome Reference |
Probe Set |
Cell Ranger version |
---|---|---|
GRCh38-2024-A |
|
v9.0+ |
GRCh38-2020-A |
|
v7.1+ |
GRCm39-2024-A |
|
v9.0+ |
mm10-2020-A |
|
v7.1+ |
See Flex probe sets overview for details on these probe sets.
On Chip Multiplexing
This section covers preparing the sample sheet for On-Chip Multiplexing (OCM) data.
Sample and Link column.
Sample column is for specifying the name of each sample in your data. Sample names must be unique within the sample sheet.
Link column is for specifying the name of your whole dataset, so that the workflow knows which samples should be put together to run cellranger multi. Notice: You should use a unique Link name for all samples belonging to the same data/experiment. Moreover, the Link name must be different from all Sample names.
DataType, Reference, and AuxFile column.
For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:
DataType |
Reference |
AuxFile |
Description |
---|---|---|---|
rna |
Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format. |
Path to a text file including the sample name to OCM barcode association (see an example below this table). |
For RNA-Seq samples |
Choose one from: vdj, vdj_t, vdj_b, vdj_t_gd |
Select one from prebuilt VDJ references in Single-cell immune profiling section. |
Optional. For vdj_t_gd type samples only: path to a text file containing inner enrichment primers info. This is the inner-enrichment-primers option in VDJ section of Cell Ranger multi config CSV. |
For each VDJ sample, choose one from the 4 provided VDJ data types:
vdj: Leave the workflow to auto-detect.
vdj_t: VDJ-T library for T-cell receptor sequences.
vdj_b: VDJ-B library for B-cell receptor sequences.
vdj_t_gd: VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains. Notice: For such a sample, a text file containing inner enrichment primers info must be provided in AuxFile column. |
Choose one from: citeseq, adt |
No need to specify a reference |
Path to its feature reference file of 10x Feature Reference format. Notice: If adt type, you need to combine feature barcodes of both CITE-Seq and Hashing modalities in one file. |
For antibody capture samples:
citeseq: For samples only containing CITE-Seq modality.
adt: For samples containing both CITE-Seq and Hashing modalities. |
An example sample name to OCM barcode association file is the following:
sample_id,ocm_barcode_ids,description
sample1,OB1,Control
sample2,OB2,Treated
where the description column is optional and specifies the description of the samples.
Note
In the sample name to OCM barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, ocm_barcode_ids, and description (optional).
Below is an example sample sheet for OCM:
Sample,Reference,Flowcell,DataType,AuxFile,Link
s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_ocm.csv,s1
s1_vdj,GRCh38_vdj_v7.1.0,gs://my-bucket/s1_fastqs,vdj,,s1
s1_adt,,gs://my-bucket/s1_fastqs,citeseq,gs://my-bucket/s1_fbc.csv,s1
In the case where there is only scRNA-Seq library in your data, the Link column is optional:
Sample,Reference,Flowcell,DataType,AuxFile
s2,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_ocm.csv
Hashing with Antibody Capture
This section covers preparing the sample sheet for non-OCM hashtag oligo (HTO) data.
Sample and Link column.
Sample column is for specifying the name of each sample in your data. Sample names must be unique within the sample sheet.
Link column is for specifying the name of your whole dataset, so that the workflow knows which samples should be put together to run cellranger multi. Notice: You should use a unique Link name for all samples belonging to the same data/experiment. Moreover, the Link name must be different from all Sample names.
DataType, Reference, and AuxFile column.
For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:
DataType |
Reference |
AuxFile |
Description |
---|---|---|---|
rna |
Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format. |
Path to a text file including the sample name to HTO barcode association (see an example below this table). |
For RNA-Seq samples |
Choose one from: vdj, vdj_t, vdj_b, vdj_t_gd |
Select one from prebuilt VDJ references in Single-cell immune profiling section. |
Optional. For vdj_t_gd type samples only: path to a text file containing inner enrichment primers info. This is the inner-enrichment-primers option in VDJ section of Cell Ranger multi config CSV. |
For each VDJ sample, choose one from the 4 provided VDJ data types:
vdj: Leave the workflow to auto-detect.
vdj_t: VDJ-T library for T-cell receptor sequences.
vdj_b: VDJ-B library for B-cell receptor sequences.
vdj_t_gd: VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains. Notice: For such a sample, a text file containing inner enrichment primers info must be provided in AuxFile column. |
hashing |
No need to specify a reference |
Path to its feature reference file of 10x Feature Reference format, which specifies the oligonucleotide sequences used in the data. |
For antibody capture samples |
An example sample name to HTO barcode association file is the following:
sample_id,hashtag_ids,description
sample1,TotalSeqB_Hashtag_1,Control
sample2,CD3_TotalSeqB,Treated
where names in the hashtag_ids column must be consistent with the id column in the feature reference file. The description column is optional and specifies the description of the samples.
Note
In the sample name to HTO barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, hashtag_ids, and description (optional).
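The consistency requirement between hashtag_ids and the id column of the feature reference file can be checked up front. The sketch below is illustrative only: missing_hashtags is a hypothetical helper, both files are assumed to include a header line, each hashtag_ids value is treated as a single id, and the feature reference row uses a placeholder sequence rather than a real oligo.

```python
import csv
import io


def missing_hashtags(assoc_text, feature_ref_text):
    """Return hashtag_ids that have no matching id in the feature reference."""
    ref_ids = {row["id"] for row in csv.DictReader(io.StringIO(feature_ref_text))}
    assoc = csv.DictReader(io.StringIO(assoc_text))
    return [row["hashtag_ids"] for row in assoc if row["hashtag_ids"] not in ref_ids]


assoc = """sample_id,hashtag_ids,description
sample1,TotalSeqB_Hashtag_1,Control
sample2,CD3_TotalSeqB,Treated
"""
# Placeholder feature reference: the sequence below is NOT a real hashtag oligo.
feature_ref = """id,name,read,pattern,sequence,feature_type
TotalSeqB_Hashtag_1,Hashtag_1,R2,5PNNNNNNNNNN(BC),AAAAAAAAAAAAAAA,Antibody Capture
"""
print(missing_hashtags(assoc, feature_ref))  # ['CD3_TotalSeqB']
```

An empty list means every hashtag named in the association file is defined in the feature reference.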
Below is an example sample sheet for HTO:
Link,Sample,Reference,Flowcell,DataType,AuxFile
s1,s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_hto.csv
s1,s1_vdj,GRCh38_vdj_v7.1.0,gs://my-bucket/s1_fastqs,vdj,
s1,s1_hto,,gs://my-bucket/s1_fastqs,hashing,gs://my-bucket/s1_fbc_ref.csv
Or if your data contain only scRNA-Seq and antibody capture libraries:
Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_hto.csv
s2,s2_hto,,gs://my-bucket/s2_fastqs,hashing,gs://my-bucket/s2_fbc_ref.csv
Cell Multiplexing with CMO (CellPlex)
This section covers preparing the sample sheet for CellPlex data using Cell Multiplexing Oligos (CMO).
Sample and Link column.
Sample column is for specifying the name of each sample in your data. Sample names must be unique within the sample sheet.
Link column is for specifying the name of your whole dataset, so that the workflow knows which samples should be put together to run cellranger multi. Notice: You should use a unique Link name for all samples belonging to the same data/experiment. Moreover, the Link name must be different from all Sample names.
DataType, Reference, and AuxFile column.
For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:
DataType |
Reference |
AuxFile |
Description |
---|---|---|---|
rna |
Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format. |
Path to a text file including the sample name to CMO barcode association (see an example below this table). |
For RNA-Seq samples |
cmo |
No need to specify a reference |
Optional. If using custom CMOs, provide the path to their cmo-set reference file of 10x Feature Reference format. See here for an example. |
For CMO samples. |
citeseq |
No need to specify a reference |
Path to its feature reference file of 10x Feature Reference format. |
For CITE-Seq samples. |
An example sample name to CMO barcode association file is the following:
sample_id,cmo_ids,description
sample1,CMO301,Control
sample2,CMO302,Treated
If using a cmo-set reference file, the names in cmo_ids must be consistent with the id column in the CMO reference file. The description column is optional and specifies the description of the samples.
Note
In the sample name to CMO barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, cmo_ids, and description (optional).
Below is an example sample sheet for CellPlex:
Link,Sample,Reference,Flowcell,DataType,AuxFile
s1,s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_cmo.csv
s1,s1_cellplex,,gs://my-bucket/s1_fastqs,cmo,
Or if a CITE-Seq sample/library is also included in the data:
Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_cmo.csv
s2,s2_cellplex,,gs://my-bucket/s2_fastqs,cmo,
s2,s2_citeseq,,gs://my-bucket/s2_fastqs,citeseq,gs://my-bucket/s2_fbc.csv
Multiomics
To analyze multiomics (GEX + CITE-Seq/CRISPR) data, prepare your sample sheet as follows:
Link column.
A unique link name for all modalities of the same data
Chemistry column.
The workflow supports all 10x assay configurations. The most widely used ones are listed below:
Chemistry
Explanation
auto
autodetection (default). If the index read has extra bases besides cell barcode and UMI, autodetection might fail. In this case, please specify the chemistry
threeprime
Single Cell 3′
fiveprime
Single Cell 5′
ARC-v1
Gene Expression portion of 10x Multiome data
Please refer to the section of the --chemistry option in Cell Ranger Command Line Arguments for all other valid chemistry keywords.
DataType column.
The following keywords are accepted for DataType column:
DataType
Explanation
rna
For scRNA-seq samples
citeseq
For CITE-seq samples
crispr
For 10x CRISPR samples
AuxFile column.
Prepare your feature reference file in 10x Feature Reference format.
Notice: If multiple antibody samples are used, you need to merge them into one feature reference file, and assign it for each of the samples.
Below is an example sample sheet:
Link,Sample,Reference,DataType,Flowcell,Chemistry,AuxFile
sample_4,s4_gex,GRCh38-2020-A,rna,gs://my-bucket/s4_fastqs,auto,
sample_4,s4_citeseq,,citeseq,gs://my-bucket/s4_fastqs,SC3Pv4,gs://my-bucket/s4_feature_ref.csv
Here, by specifying sample_4 in the Link column, the two modalities will be processed together. The output will be one subfolder named sample_4.
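When several antibody samples each come with their own feature reference, the files need to be merged into one before being assigned to every sample. The following is a rough sketch of such a merge, assuming all inputs share the same header and that an id appearing twice must describe the same feature. The function name merge_feature_refs, the ids ab1 to ab3, and the sequences are placeholders for illustration.

```python
import csv
import io


def merge_feature_refs(*texts):
    """Merge feature reference CSVs, keeping one row per id."""
    header, seen, merged = None, {}, []
    for text in texts:
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError("feature reference files have different headers")
        for row in reader:
            if row["id"] in seen:
                if seen[row["id"]] != row:
                    raise ValueError(f"conflicting definitions for id '{row['id']}'")
                continue  # identical duplicate, keep the first copy
            seen[row["id"]] = row
            merged.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(merged)
    return out.getvalue()


# Placeholder rows: ids and sequences are made up for illustration.
ref_a = """id,name,read,pattern,sequence,feature_type
ab1,ab1,R2,5PNNNNNNNNNN(BC),AAAAAAAAAAAAAAA,Antibody Capture
ab2,ab2,R2,5PNNNNNNNNNN(BC),CCCCCCCCCCCCCCC,Antibody Capture
"""
ref_b = """id,name,read,pattern,sequence,feature_type
ab2,ab2,R2,5PNNNNNNNNNN(BC),CCCCCCCCCCCCCCC,Antibody Capture
ab3,ab3,R2,5PNNNNNNNNNN(BC),GGGGGGGGGGGGGGG,Antibody Capture
"""
print(merge_feature_refs(ref_a, ref_b))
```

Raising an error on conflicting duplicates, rather than silently picking one, is a design choice of this sketch so that inconsistent barcode definitions are noticed early.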
Workflow Input
All the sample multiplexing assays share the same workflow input settings. cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger multi. Relevant workflow inputs are described below, with required inputs highlighted in bold:
Name |
Description |
Example |
Default |
---|---|---|---|
input_csv_file |
Sample Sheet (contains Link, Sample, Reference, DataType, Flowcell, and AuxFile columns) |
“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv” |
|
output_directory |
Output directory |
“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output” |
|
include_introns |
Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references |
true |
true |
no_bam |
Turn this option on to disable BAM file generation |
false |
false |
force_cells |
Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells |
6000 |
|
expect_cells |
Expected number of recovered cells. Mutually exclusive with force_cells |
3000 |
|
secondary |
Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.) |
false |
false |
cellranger_version |
Cell Ranger version to use. Available versions: 9.0.1, 8.0.1, 7.2.0. |
“9.0.1” |
“9.0.1” |
docker_registry |
Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.
|
“quay.io/cumulus” |
“quay.io/cumulus” |
acronym_file |
The link/path of an index file in TSV format for fetching preset genome references, probe set references, chemistry whitelists, etc. by their names.
Set a GS URI if running on GCP; an S3 URI if running on AWS; an absolute file path if running on HPC or local machines.
|
“s3://xxxx/index.tsv” |
“gs://cumulus-ref/resources/cellranger/index.tsv” |
zones |
Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. |
“us-central1-a us-west1-a” |
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
num_cpu |
Number of cpus to request per link |
32 |
32 |
memory |
Memory size string to request per link |
“120G” |
“120G” |
multi_disk_space |
Used by Flex and Sample Multiplexing data. Disk space in GB to request per link. |
1500 |
1500 |
count_disk_space |
Only used by Multiomics data. Disk space in GB to request per link |
500 |
500 |
preemptible |
Number of preemptible tries. This only works for GCP. |
2 |
2 |
awsQueueArn |
The AWS ARN string of the job queue to be used. This only works for AWS. |
“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” |
“” |
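As an illustration of how these inputs fit together, the sketch below assembles an inputs dictionary and enforces the mutual exclusivity of force_cells and expect_cells. Several points are assumptions: the helper build_inputs is hypothetical, keys are prefixed with the workflow name following the usual WDL inputs JSON convention, and input_csv_file and output_directory are treated as the required inputs because they are the ones without defaults or optional semantics in the table.

```python
import json


def build_inputs(input_csv_file, output_directory,
                 force_cells=None, expect_cells=None, **optional):
    """Build a WDL-style inputs mapping for cellranger_workflow (sketch only)."""
    if force_cells is not None and expect_cells is not None:
        raise ValueError("force_cells and expect_cells are mutually exclusive")
    inputs = {"input_csv_file": input_csv_file, "output_directory": output_directory}
    if force_cells is not None:
        inputs["force_cells"] = force_cells
    if expect_cells is not None:
        inputs["expect_cells"] = expect_cells
    inputs.update(optional)
    # Assumed convention: each key is qualified with the workflow name.
    return {f"cellranger_workflow.{key}": value for key, value in inputs.items()}


inputs = build_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output",
    expect_cells=3000,
    cellranger_version="9.0.1",
)
print(json.dumps(inputs, indent=2))
```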
Workflow Output
All the sample multiplexing assays share the same workflow output structure. See the table below for important outputs:
Name |
Type |
Description |
---|---|---|
cellranger_multi.output_multi_directory |
Array[String] |
Flex and Sample Multiplexing output. A list of cloud URIs to output folders, one URI per link. |
cellranger_count_fbc.output_count_directory |
Array[String] |
Multiomics output. A list of cloud URIs to output folders, one URI per link. |
Build Cell Ranger References
We provide routines wrapping Cell Ranger tools to build references for sc/snRNA-seq, scATAC-seq and single-cell immune profiling data.
Build references for sc/snRNA-seq
We provide a wrapper of cellranger mkref to build sc/snRNA-seq references. Please follow the instructions below.
1. Import cellranger_create_reference
Import cellranger_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_create_reference to import.
Moreover, in the workflow page, click the
Export to Workspace...
button, and select the workspace to which you want to export cellranger_create_reference workflow in the drop-down menu.
2. Upload required data to Google Bucket
Required data may include input sample sheet, genome FASTA files and gene annotation GTF files.
3. Input sample sheet
If multiple species are specified, a sample sheet in CSV format is required. We describe the sample sheet format below, with required columns highlighted in bold:
Column
Description
Genome
Genome name
Fasta
Location to the genome assembly in FASTA/FASTA.gz format
Genes
Location to the gene annotation file in GTF/GTF.gz format
Attributes
Optional. A list of key:value pairs separated by ;. If set, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details.
Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.
See below for an example:
Genome,Fasta,Genes,Attributes
GRCh38,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.gtf.gz,gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense
mm10,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.gtf.gz
If multiple species are specified, the reference will be built under the Genome names concatenated by ‘_and_’. In the above example, the reference is stored under ‘GRCh38_and_mm10’.
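The naming rule and the Attributes format can be sketched in a few lines. The helper names are hypothetical. Rendering each pair as an --attribute=key:value argument mirrors the cellranger mkgtf command line, but whether the workflow builds its call this way is an assumption.

```python
def reference_name(genomes):
    """Join Genome names with '_and_' as described for multi-species references."""
    return "_and_".join(genomes)


def parse_attributes(attributes):
    """Split a 'key:value;key:value' string into (key, value) pairs."""
    pairs = []
    for item in attributes.split(";"):
        if item:
            key, value = item.split(":", 1)
            pairs.append((key, value))
    return pairs


def mkgtf_attribute_args(attributes):
    """Render the pairs as mkgtf-style filter arguments (assumed usage)."""
    return [f"--attribute={key}:{value}" for key, value in parse_attributes(attributes)]


print(reference_name(["GRCh38", "mm10"]))  # GRCh38_and_mm10
print(mkgtf_attribute_args(
    "gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense"
))
```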
4. Workflow input
Required inputs are highlighted in bold. Note that input_sample_sheet on the one hand, and input_fasta, input_gtf, genome and attributes on the other, are mutually exclusive.
Name
Description
Example
Default
input_sample_sheet
A sample sheet in CSV format that allows users to specify more than one genome to build references (e.g. human and mouse). If a sample sheet is provided, input_fasta, input_gtf, and attributes will be ignored.
“gs://fc-e0000000-0000-0000-0000-000000000000/input_sample_sheet.csv”
input_fasta
Input genome reference in either FASTA or FASTA.gz format
“gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz”
input_gtf
Input gene annotation file in either GTF or GTF.gz format
“gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz”
genome
Genome reference name. New reference will be stored in a folder named genome
refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0
output_directory
Output directory
“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_reference”
attributes
A list of key:value pairs separated by ;. If this option is not None, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details
“gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense”
pre_mrna
Whether to build a pre-mRNA reference, in which full-length transcripts are used as exons in the annotation file. The workflow follows 10x build Cell Ranger compatible pre-mRNA Reference Package to build pre-mRNA references
true
false
ref_version
reference version string
Ensembl v94
cellranger_version
cellranger version, could be: 9.0.1, 8.0.1, 7.2.0
“9.0.1”
“9.0.1”
docker_registry
Docker registry to use for cellranger_workflow. Options:
“quay.io/cumulus” for images on Red Hat registry;
“cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus”
“quay.io/cumulus”
zones
Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.
“us-central1-a us-west1-a”
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu
Number of cpus to request for one node for building indices
1
1
memory
Memory size string for cellranger mkref
“32G”
“32G”
disk_space
Optional disk space in GB
100
100
preemptible
Number of preemptible tries. Only works for GCP
2
2
awsQueueArn
The AWS ARN string of the job queue to be used. Only works for AWS
“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”
“”
5. Workflow output
Name
Type
Description
output_reference
File
Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input.
Build references for scATAC-seq
We provide a wrapper of cellranger-atac mkref to build scATAC-seq references. Please follow the instructions below.
1. Import cellranger_atac_create_reference
Import cellranger_atac_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_atac_create_reference to import.
Moreover, in the workflow page, click the
Export to Workspace...
button, and select the workspace to which you want to export cellranger_atac_create_reference workflow in the drop-down menu.
2. Upload required data to Google Bucket
Required data include config JSON file, genome FASTA file, gene annotation file (GTF or GFF3 format) and motif input file (JASPAR format).
3. Workflow input
Required inputs are highlighted in bold.
Name
Description
Example
Default
genome
Genome reference name. New reference will be stored in a folder named genome
refdata-cellranger-atac-mm10-1.1.0
input_fasta
GSURL for input fasta file
“gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa”
input_gtf
GSURL for input GTF file
“gs://fc-e0000000-0000-0000-0000-000000000000/annotation.gtf”
organism
Name of the organism
“human”
non_nuclear_contigs
A comma separated list of names of contigs that are not in nucleus
“chrM”
“chrM”
input_motifs
Optional file containing transcription factor motifs in JASPAR format
“gs://fc-e0000000-0000-0000-0000-000000000000/motifs.pfm”
output_directory
Output directory
“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_reference”
cellranger_atac_version
cellranger-atac version, could be: 2.1.0, 2.0.0
“2.1.0”
“2.1.0”
docker_registry
Docker registry to use for cellranger_workflow. Options:
“quay.io/cumulus” for images on Red Hat registry;
“cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus”
“quay.io/cumulus”
zones
Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.
“us-central1-a us-west1-a”
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
memory
Memory size string for cellranger-atac mkref
“32G”
“32G”
disk_space
Optional disk space in GB
100
100
preemptible
Number of preemptible tries. Only works for GCP
2
2
awsQueueArn
The AWS ARN string of the job queue to be used. Only works for AWS
“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”
“”
4. Workflow output
Name
Type
Description
output_reference
File
Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input.
Build references for single-cell immune profiling data
We provide a wrapper of cellranger mkvdjref to build single-cell immune profiling references. Please follow the instructions below.
1. Import cellranger_vdj_create_reference
Import cellranger_vdj_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_vdj_create_reference to import.
Moreover, in the workflow page, click the
Export to Workspace...
button, and select the workspace to which you want to export cellranger_vdj_create_reference workflow in the drop-down menu.
2. Upload required data to Google Bucket
Required data include genome FASTA file and gene annotation file (GTF format).
3. Workflow input
Required inputs are highlighted in bold.
Name
Description
Example
Default
input_fasta
Input genome reference in either FASTA or FASTA.gz format
“gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz”
input_gtf
Input gene annotation file in either GTF or GTF.gz format
“gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz”
genome
Genome reference name. New reference will be stored in a folder named genome
refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0
output_directory
Output directory
“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_vdj_reference”
ref_version
reference version string
Ensembl v94
cellranger_version
cellranger version, could be: 9.0.1, 8.0.1, 7.2.0
“9.0.1”
“9.0.1”
docker_registry
Docker registry to use for cellranger_workflow. Options:
“quay.io/cumulus” for images on Red Hat registry;
“cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus”
“quay.io/cumulus”
zones
Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.
“us-central1-a us-west1-a”
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
memory
Memory size string for cellranger mkvdjref
“32G”
“32G”
disk_space
Optional disk space in GB
100
100
preemptible
Number of preemptible tries. Only works for GCP
2
2
awsQueueArn
The AWS ARN string of the job queue to be used. Only works for AWS
“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”
“”
4. Workflow output
Name
Type
Description
output_reference
File
Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input.