Run Cell Ranger tools using cellranger_workflow¶

cellranger_workflow wraps Cell Ranger to process single-cell/nucleus RNA-seq, single-cell ATAC-seq and single-cell immune profiling data, and supports feature barcoding (cell/nucleus hashing, CITE-seq, Perturb-seq). It also provides routines to build cellranger references.

A general step-by-step instruction¶

This section mainly considers jobs starting from BCL files. If your job starts with FASTQ files and you only need to run the cellranger count part, please refer to this subsection.

1. Import `cellranger_workflow`¶

Import cellranger_workflow workflow to your workspace by following instructions in Import workflows to Terra. You should choose workflow github.com/lilab-bcb/cumulus/CellRanger to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_workflow workflow in the drop-down menu.

2. Upload sequencing data to Google bucket¶

Copy your sequencing output to your workspace bucket using gsutil (you already have it if you’ve installed Google cloud SDK) in your unix terminal.

You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.

Use gsutil cp [OPTION]... src_url dst_url to copy data to your workspace bucket. For example, the following command copies the directory at /foo/bar/nextseq/Data/VK18WBC6Z4 to a Google bucket:
gsutil -m cp -r /foo/bar/nextseq/Data/VK18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4
-m means copy in parallel, -r means copy the directory recursively, and gs://fc-e0000000-0000-0000-0000-000000000000 should be replaced by your own workspace Google bucket URL.

Note

If input is a folder of BCL files, users do not need to upload the whole folder to the Google bucket. Instead, they only need to upload the following files:

RunInfo.xml
RTAComplete.txt
runParameters.xml
Data/Intensities/s.locs
Data/Intensities/BaseCalls

If data are generated using MiSeq or NextSeq, the location files are inside lane subfolders (e.g. L001) under Data/Intensities/. In addition, if users’ data only come from a subset of lanes (e.g. L001 and L002), users only need to upload lane subfolders from the subset (e.g. Data/Intensities/BaseCalls/L001, Data/Intensities/BaseCalls/L002 and Data/Intensities/L001, Data/Intensities/L002 if sequencer is MiSeq or NextSeq).

Alternatively, users can submit jobs through command line interface (CLI) using altocumulus, which will smartly upload BCL folders according to the above rules.
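The upload rules in the note above can be sketched in Python. This is only an illustration of the rules as stated here, not the logic altocumulus actually uses; the function name `required_bcl_paths` and its parameters are hypothetical.

```python
def required_bcl_paths(lanes=None, per_lane_locs=False):
    """Return paths (relative to the BCL run folder) that need uploading.

    lanes: lane numbers to keep (e.g. [1, 2]); None means all lanes.
    per_lane_locs: True for MiSeq/NextSeq, where location files sit in
    lane subfolders under Data/Intensities/ instead of s.locs.
    """
    paths = ["RunInfo.xml", "RTAComplete.txt", "runParameters.xml"]
    if not per_lane_locs:
        paths.append("Data/Intensities/s.locs")
    if lanes is None:
        if per_lane_locs:
            # Lane subfolder names cannot be guessed without listing the run folder.
            raise ValueError("list the lanes explicitly when location files are per lane")
        paths.append("Data/Intensities/BaseCalls")
        return paths
    for lane in sorted(set(lanes)):
        name = f"L{lane:03d}"
        paths.append(f"Data/Intensities/BaseCalls/{name}")
        if per_lane_locs:
            paths.append(f"Data/Intensities/{name}")
    return paths


# Example: a NextSeq run where only lanes 1 and 2 are needed.
for path in required_bcl_paths([1, 2], per_lane_locs=True):
    print(path)
```

Each returned path can then be passed to gsutil -m cp -r as in the example above.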

3. Prepare a sample sheet¶

3.1 Sample sheet format:

Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.

The sample sheet describes how to demultiplex flowcells and generate channel-specific count matrices. Note that Sample, Lane, and Index columns are defined exactly the same as in 10x’s simple CSV layout file.

A brief description of the sample sheet format is listed below (required column headers are shown in bold).

Column Description

Sample Contains sample names. Each 10x channel should have a unique sample name. Sample name can only contain characters from [a-zA-Z0-9_-].

Reference

Provides the reference genome used by Cell Ranger for each 10x channel.

The elements in the reference column can be either Google bucket URLs to reference tarballs or keywords such as GRCh38-2020-A.

A full list of available keywords is included in each of the following data type sections (e.g. sc/snRNA-seq) below.

Flowcell

Indicates the Google bucket URLs of uploaded BCL folders.

If starting from FASTQ files, this should be the Google bucket URLs of the uploaded FASTQ folders.

The FASTQ folders should contain one subfolder for each sample in the flowcell with the sample name as the subfolder name.

Each subfolder contains FASTQ files for that sample.

Lane

Tells which lanes the sample was pooled into.

Can be either single lane (e.g. 8) or a range (e.g. 7-8) or all (e.g. *).

Index Sample index (e.g. SI-GA-A12).

Chemistry Describes the 10x chemistry used for the sample. This column is optional.

DataType

Describes the data type of the sample: rna, vdj, citeseq, hashing, adt, cmo, crispr, atac.

rna refers to gene expression data (cellranger count),

vdj refers to V(D)J data (cellranger vdj),

citeseq refers to CITE-Seq tag data,

hashing refers to cell-hashing or nucleus-hashing tag data,

adt refers to the case where hashing and citeseq reads are in the same sample library,

cmo refers to cell multiplexing oligos used in 10x Genomics’ CellPlex assay,

crispr refers to Perturb-seq guide tag data,

atac refers to scATAC-Seq data (cellranger-atac count),

This column is optional and the default data type is rna.

FeatureBarcodeFile

Google bucket urls pointing to feature barcode files for rna, citeseq, hashing, cmo and crispr data.

Features can be either targeted genes for targeted gene expression analysis, antibodies for CITE-Seq, cell-hashing and nucleus-hashing, or gRNAs for Perturb-seq.

If cmo data is analyzed separately using cumulus_feature_barcoding, file format should follow the guide in Feature barcoding assays section, otherwise follow the guide in Single-cell multiomics section.

This column is only required for targeted gene expression analysis (rna), CITE-Seq (citeseq), cell-hashing or nucleus-hashing (hashing), CellPlex (cmo) and Perturb-seq (crispr).

Link

Designed for Single Cell Multiome ATAC + Gene Expression, Feature Barcoding, or CellPlex.

Link multiple modalities together using a single link name.

cellranger-arc count, cellranger count, or cellranger multi will be triggered automatically depending on the modalities.

If empty string is provided, no link is assumed.

Link name can only contain characters from [a-zA-Z0-9_-].

The sample sheet supports sequencing the same 10x channels across multiple flowcells. If a sample is sequenced across multiple flowcells, simply list it in multiple rows, with one flowcell per row. In the following example, we have 4 samples sequenced in two flowcells.

Example:
Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,SI-GA-A8,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,3-4,SI-GA-B8,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,5-6,SI-GA-C8,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,7-8,SI-GA-D8,fiveprime,rna
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,1-2,SI-GA-A8,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,3-4,SI-GA-B8,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,5-6,SI-GA-C8,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,7-8,SI-GA-D8,fiveprime,rna
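Before uploading, a sample sheet can be checked locally against the rules listed above (recognized column names, the [a-zA-Z0-9_-] restriction on sample names, and the lane formats). The following is a minimal sketch; `check_sample_sheet` is a hypothetical helper, not part of cellranger_workflow, and the default list of required columns is taken from the input_csv_file description on this page.

```python
import csv
import io
import re

NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")
# A single lane (8), a range (7-8), or all lanes (*).
LANE_RE = re.compile(r"^(\*|\d+(-\d+)?)$")


def check_sample_sheet(text, required=("Sample", "Reference", "Flowcell", "Lane", "Index")):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        sample = row["Sample"] or ""
        if not NAME_RE.match(sample):
            problems.append(f"line {line_no}: invalid sample name {sample!r}")
        lane = row.get("Lane")
        if lane is not None and not LANE_RE.match(lane):
            problems.append(f"line {line_no}: invalid lane {lane!r}")
    return problems


sheet = (
    "Sample,Reference,Flowcell,Lane,Index\n"
    "sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,SI-GA-A8\n"
)
print(check_sample_sheet(sheet) or "sample sheet looks fine")
```

For FASTQ-only runs, where Lane and Index are not needed, pass a shorter `required` tuple.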
3.2 Upload your sample sheet to the workspace bucket:
Example:
gsutil cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/

4. Launch analysis¶

In your workspace, open cellranger_workflow in the WORKFLOWS tab. Select the desired snapshot version (e.g. latest). Select Run workflow with inputs defined by file paths and click the SAVE button. Select Use call caching and click INPUTS. Then fill in appropriate values in the Attribute column. Alternatively, you can upload a JSON file to configure the inputs by clicking Drag or click to upload json.

Once the INPUTS are appropriately filled in, click RUN ANALYSIS and then click LAUNCH.
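If you prefer the JSON upload route, the file can be generated with a few lines of Python. This sketch assumes the usual WDL convention of prefixing each input with the workflow name (`cellranger_workflow.<input name>`); compare the keys against a JSON downloaded from the INPUTS tab before relying on them. The helper `make_inputs` is hypothetical.

```python
import json


def make_inputs(sample_sheet_url, output_url, workflow="cellranger_workflow", **optional):
    """Build an inputs dictionary keyed as '<workflow>.<input name>' (assumed convention)."""
    inputs = {
        f"{workflow}.input_csv_file": sample_sheet_url,
        f"{workflow}.output_directory": output_url,
    }
    for name, value in optional.items():
        inputs[f"{workflow}.{name}"] = value
    return inputs


bucket = "gs://fc-e0000000-0000-0000-0000-000000000000"
inputs = make_inputs(
    f"{bucket}/sample_sheet.csv",
    f"{bucket}/cellranger_output",
    run_mkfastq=True,
)
# Write this text to a .json file and upload it on the INPUTS tab.
print(json.dumps(inputs, indent=2))
```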

5. Notice: run `cellranger mkfastq` if you are a non-Broad Institute user¶

Non-Broad Institute users who wish to run cellranger mkfastq must create a custom docker image that contains bcl2fastq.

See bcl2fastq instructions.

6. Run `cellranger count` only¶

Sometimes, users might want to perform demultiplexing locally and only run the count part on the cloud. This section describes how to only run the count part via cellranger_workflow.
Copy your FASTQ files to the workspace using gsutil in your unix terminal. There are two cases:
Case 1: All the FASTQ files are in one top-level folder. Then you can simply upload this folder to Cloud, and in your sample sheet, make sure Sample names are consistent with the filename prefix of their corresponding FASTQ files.

Case 2: In the top-level folder, each sample has a dedicated subfolder containing its FASTQ files. In this case, you need to upload the whole top-level folder, and in your sample sheet, make sure Sample names and their corresponding subfolder names are identical.

Notice that if your FASTQ files are downloaded from the Sequence Read Archive (SRA) from NCBI, you must rename your FASTQs to follow the bcl2fastq file naming conventions.

Example:
gsutil -m cp -r /foo/bar/fastq_path/K18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/K18WBC6Z4_fastq
Create a sample sheet with a structure similar to the one above, with the following differences:
Flowcell column should list Google bucket URLs of the FASTQ folders for flowcells.

Lane and Index columns are NOT required in this case.

Example:
Sample,Reference,Flowcell
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/K18WBC6Z4_fastq
Set optional input run_mkfastq to false.
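Because Sample names must agree with the FASTQ filename prefixes, it can help to check the filenames before uploading. The sketch below assumes the common bcl2fastq naming pattern `<sample>_S<n>_L<lane>_<read>_001.fastq.gz`; both helper functions are hypothetical and are not part of the workflow.

```python
import re

# Assumed bcl2fastq-style pattern, e.g. sample_1_S1_L001_R1_001.fastq.gz
FASTQ_RE = re.compile(r"^(?P<sample>.+)_S\d+_L\d{3}_(?P<read>[RI]\d)_001\.fastq\.gz$")


def fastq_sample(filename):
    """Return the sample-name prefix of a FASTQ file, or None if the name does not fit."""
    match = FASTQ_RE.match(filename)
    return match.group("sample") if match else None


def unmatched_files(filenames, sample_names):
    """List files whose prefix is not one of the sample names in the sample sheet."""
    wanted = set(sample_names)
    return [name for name in filenames if fastq_sample(name) not in wanted]


files = ["sample_1_S1_L001_R1_001.fastq.gz", "SRR1234567_1.fastq.gz"]
# Files listed here would need renaming (for example, SRA downloads).
print(unmatched_files(files, ["sample_1"]))
```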

7. Workflow outputs¶

See the table below for workflow level outputs.

Name Type Description

fastq_outputs Array[Array[String]?] The top-level array contains results (as arrays) for different data modalities. The inner-level array contains cloud locations of FASTQ files, one url per flowcell.

count_outputs Array[Array[String]?] The top-level array contains results (as arrays) for different data modalities. The inner-level array contains cloud locations of count matrices, one url per sample.

count_matrix String Cloud url for a template count_matrix.csv to run Cumulus. It only contains sc/snRNA-Seq samples.

Single-cell and single-nucleus RNA-seq¶

To process sc/snRNA-seq data, follow the specific instructions below.

Sample sheet¶

Reference column.

Pre-built scRNA-seq references are summarized below.

Keyword Description

GRCh38-2020-A Human GRCh38 (GENCODE v32/Ensembl 98)

mm10-2020-A Mouse mm10 (GENCODE vM23/Ensembl 98)

GRCh38_and_mm10-2020-A Human GRCh38 (GENCODE v32/Ensembl 98) and mouse mm10 (GENCODE vM23/Ensembl 98)

GRCh38_v3.0.0 Human GRCh38, cellranger reference 3.0.0, Ensembl v93 gene annotation

hg19_v3.0.0 Human hg19, cellranger reference 3.0.0, Ensembl v87 gene annotation

mm10_v3.0.0 Mouse mm10, cellranger reference 3.0.0, Ensembl v93 gene annotation

GRCh38_and_mm10_v3.1.0 Human (GRCh38) and mouse (mm10), cellranger references 3.1.0, Ensembl v93 gene annotations for both human and mouse

hg19_and_mm10_v3.0.0 Human (hg19) and mouse (mm10), cellranger reference 3.0.0, Ensembl v93 gene annotations for both human and mouse

GRCh38_v1.2.0 or GRCh38 Human GRCh38, cellranger reference 1.2.0, Ensembl v84 gene annotation

hg19_v1.2.0 or hg19 Human hg19, cellranger reference 1.2.0, Ensembl v82 gene annotation

mm10_v1.2.0 or mm10 Mouse mm10, cellranger reference 1.2.0, Ensembl v84 gene annotation

GRCh38_and_mm10_v1.2.0 or GRCh38_and_mm10 Human and mouse, built from GRCh38 and mm10 cellranger references, Ensembl v84 gene annotations are used

GRCh38_and_SARSCoV2 Human GRCh38 and SARS-COV-2 RNA genome, cellranger reference 3.0.0, generated by Carly Ziegler. The SARS-COV-2 viral sequence and gtf are as described in [Kim et al. Cell 2020] (https://github.com/hyeshik/sars-cov-2-transcriptome, BetaCov/South Korea/KCDC03/2020 based on NC_045512.2). The GTF was edited to include only CDS regions, and regions were added to describe the 5’ UTR (“SARSCoV2_5prime”), the 3’ UTR (“SARSCoV2_3prime”), and reads aligning to anywhere within the Negative Strand(“SARSCoV2_NegStrand”). Additionally, trailing A’s at the 3’ end of the virus were excluded from the SARSCoV2 fasta, as these were found to drive spurious viral alignment in pre-COVID19 samples.

Pre-built snRNA-seq references are summarized below.

Keyword Description

GRCh38_premrna_v3.0.0 Human, introns included, built from GRCh38 cellranger reference 3.0.0, Ensembl v93 gene annotation, treating annotated transcripts as exons

GRCh38_premrna_v1.2.0 or GRCh38_premrna Human, introns included, built from GRCh38 cellranger reference 1.2.0, Ensembl v84 gene annotation, treating annotated transcripts as exons

mm10_premrna_v1.2.0 or mm10_premrna Mouse, introns included, built from mm10 cellranger reference 1.2.0, Ensembl v84 gene annotation, treating annotated transcripts as exons

GRCh38_premrna_and_mm10_premrna_v1.2.0 or GRCh38_premrna_and_mm10_premrna Human and mouse, introns included, built from GRCh38_premrna_v1.2.0 and mm10_premrna_v1.2.0

GRCh38_premrna_and_SARSCoV2 Human, introns included, built from GRCh38_premrna_v3.0.0, and SARS-COV-2 RNA genome. This reference was generated by Carly Ziegler. The SARS-COV-2 RNA genome is from [Kim et al. Cell 2020] (https://github.com/hyeshik/sars-cov-2-transcriptome, BetaCov/South Korea/KCDC03/2020 based on NC_045512.2). Please see the description of GRCh38_and_SARSCoV2 above for details.

Index column.

Put 10x single cell RNA-seq sample index set names (e.g. SI-GA-A12) here.

Chemistry column.

According to cellranger count’s documentation, chemistry can be

Chemistry Explanation

auto autodetection (default). If the index read has extra bases besides cell barcode and UMI, autodetection might fail. In this case, please specify the chemistry

threeprime Single Cell 3′

fiveprime Single Cell 5′

SC3Pv1 Single Cell 3′ v1

SC3Pv2 Single Cell 3′ v2

SC3Pv3 Single Cell 3′ v3. You should set cellranger version input parameter to >= 3.0.2

SC5P-PE Single Cell 5′ paired-end (both R1 and R2 are used for alignment)

SC5P-R2 Single Cell 5′ R2-only (where only R2 is used for alignment)

DataType column.

This column is optional with a default rna. If you want to put a value, put rna here.
FeatureBarcodeFile column.

Put the target panel CSV file here for targeted expression data. Note that if a target panel CSV is present, the Cell Ranger version must be >= 4.0.0.

Example:

Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType,FeatureBarcodeFile
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,SI-GA-A8,threeprime,rna
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,1-2,SI-GA-A8,threeprime,rna
sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,5-6,SI-GA-C8,fiveprime,rna
sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,5-6,SI-GA-C8,fiveprime,rna
sample_3,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,3,SI-TT-A1,auto,rna,gs://fc-e0000000-0000-0000-0000-000000000000/immunology_v1.0_GRCh38-2020-A.target_panel.csv

Workflow input¶

For sc/snRNA-seq data, cellranger_workflow takes Illumina outputs as input and runs cellranger mkfastq and cellranger count. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name Description Example Default

input_csv_file Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile as optional) “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”

output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output” Results are written under directory output_directory and will overwrite any existing files at this location.

run_mkfastq If you want to run cellranger mkfastq true true

run_count If you want to run cellranger count true true

delete_input_bcl_directory Whether to delete BCL directories after demultiplexing. If false, you should delete this folder yourself so as to not incur storage charges false false

mkfastq_barcode_mismatches Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1) 0

mkfastq_force_single_index If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair false false

mkfastq_filter_single_index Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed false false

mkfastq_use_bases_mask Override the read lengths as specified in RunInfo.xml “Y28n*,I8n*,N10,Y90n*”

mkfastq_delete_undetermined Delete undetermined FASTQ files generated by bcl2fastq2 true false

force_cells Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells 6000

expect_cells Expected number of recovered cells. Mutually exclusive with force_cells 3000

include_introns Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references. Note that if this option is set, cellranger_version must be >= 5.0.0. true true

no_bam Turn this option on to disable BAM file generation. This option is only available if cellranger_version >= 5.0.0. false false

secondary Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.) false false

cellranger_version cellranger version, could be: 7.0.0, 6.1.2, 6.1.1, 6.0.2, 6.0.1, 6.0.0, 5.0.1, 5.0.0 “7.0.0” “7.0.0”

config_version config docker version used for processing sample sheets, could be 0.2, 0.1 “0.2” “0.2”

docker_registry
Docker registry to use for cellranger_workflow. Options:

“quay.io/cumulus” for images on Red Hat registry;

“cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus” “quay.io/cumulus”

mkfastq_docker_registry Docker registry to use for cellranger mkfastq. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry. “gcr.io/broad-cumulus” “gcr.io/broad-cumulus”

acronym_file

The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names.

Set a GS URI if backend is gcp; an S3 URI for aws backend; an absolute file path for local backend.

“s3://xxxx/index.tsv” “gs://regev-lab/resources/cellranger/index.tsv”

zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

num_cpu Number of cpus to request for one node for cellranger mkfastq and cellranger count 32 32

memory Memory size string for cellranger mkfastq and cellranger count “120G” “120G”

mkfastq_disk_space Optional disk space in GB for mkfastq 1500 1500

count_disk_space Disk space in GB needed for cellranger count 500 500

backend
Cloud backend for file transfer. Available options:

“gcp” for Google Cloud;

“aws” for Amazon AWS;

“local” for local machine.

“gcp” “gcp”

preemptible Number of preemptible tries 2 2

awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5

Workflow output¶

See the table below for important sc/snRNA-seq outputs.

Name	Type	Description
cellranger_mkfastq.output_fastqs_directory	Array[String]?	Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cellranger_count.output_count_directory	Array[String]?	Subworkflow output. A list of cloud urls containing gene count matrices, one url per sample.
cellranger_count.output_web_summary	Array[File]?	Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger count output).
collect_summaries.metrics_summaries	File?	Task output. An Excel spreadsheet containing QCs for each sample.
count_matrix	String	Workflow output. Cloud url for a template count_matrix.csv to run Cumulus.

Feature barcoding assays (cell & nucleus hashing, CITE-seq and Perturb-seq)¶

cellranger_workflow can extract feature-barcode count matrices in CSV format for feature barcoding assays such as cell and nucleus hashing, CellPlex, CITE-seq, and Perturb-seq. For cell and nucleus hashing as well as CITE-seq, the feature refers to antibody. For Perturb-seq, the feature refers to guide RNA. Please follow the instructions below to configure cellranger_workflow.

Prepare feature barcode files¶

Prepare a CSV file with the following format: feature_barcode,feature_name. See below for an example:
TTCCTGCCATTACTA,sample_1
CCGTACCTCATTGTT,sample_2
GGTAGATGTCCTCAG,sample_3
TGGTGTCATTCTTGA,sample_4
The above file describes a cell hashing application with 4 samples.

If cell hashing and CITE-seq data share the same sample index, you should concatenate hashing and CITE-seq barcodes together and add a third column indicating the feature type. See below for an example:
TTCCTGCCATTACTA,sample_1,hashing
CCGTACCTCATTGTT,sample_2,hashing
GGTAGATGTCCTCAG,sample_3,hashing
TGGTGTCATTCTTGA,sample_4,hashing
CTCATTGTAACTCCT,CD3,citeseq
GCGCAACTTGATGAT,CD8,citeseq
Then upload it to your google bucket:
gsutil cp antibody_index.csv gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
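The two layouts described above (two columns for a single feature type, three columns when hashing and CITE-seq share a sample index) can be generated as follows. This is a minimal sketch; `feature_barcode_csv` is a hypothetical helper, and the duplicate-barcode check is an added precaution rather than a documented requirement.

```python
import csv
import io


def feature_barcode_csv(hashing, citeseq=None):
    """Return feature barcode file content.

    hashing, citeseq: dicts mapping feature barcode -> feature name.
    With only `hashing`, rows are barcode,name; with both, a third
    column names the feature type.
    """
    citeseq = citeseq or {}
    shared = set(hashing) & set(citeseq)
    if shared:
        raise ValueError(f"barcodes used for both feature types: {sorted(shared)}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for barcode, name in hashing.items():
        writer.writerow([barcode, name, "hashing"] if citeseq else [barcode, name])
    for barcode, name in citeseq.items():
        writer.writerow([barcode, name, "citeseq"])
    return buffer.getvalue()


content = feature_barcode_csv(
    {"TTCCTGCCATTACTA": "sample_1", "CCGTACCTCATTGTT": "sample_2"},
    {"CTCATTGTAACTCCT": "CD3"},
)
print(content, end="")
```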

Sample sheet¶

Reference column.

This column is not used for extracting feature-barcode count matrix. To be consistent, please put the reference for the associated scRNA-seq assay here.
Index column.

The ADT/HTO index can be either Illumina index primer sequence (e.g. ATTACTCG, also known as D701), or 10x single cell RNA-seq sample index set names (e.g. SI-GA-A12).

Note 1: All ADT/HTO index sequences (including 10x’s) should have the same length (8 bases). If one index sequence is shorter (e.g. ATCACG), pad it with P7 sequence (e.g. ATCACGAT).

Note 2: It is users’ responsibility to avoid index collision between 10x genomics’ RNA indexes (e.g. SI-GA-A8) and Illumina index sequences used here (e.g. ATTACTCG).

Note 3: For NextSeq runs, please reverse complement the ADT/HTO index primer sequence (e.g. use reverse complement CGAGTAAT instead of ATTACTCG).

Chemistry column.

The following keywords are accepted for Chemistry column:

Chemistry Explanation

auto Default. This is an alias for Single Cell 3’ v3 (SC3Pv3)

threeprime This is another alias for Single Cell 3’ v3

SC3Pv3 Single Cell 3′ v3

SC3Pv2 Single Cell 3′ v2

fiveprime Single Cell 5′

SC5P-PE Single Cell 5′ paired-end (both R1 and R2 are used for alignment)

SC5P-R2 Single Cell 5′ R2-only (where only R2 is used for alignment)

multiome 10x Multiome barcodes

DataType column.

The following keywords are accepted for DataType column:

DataType Explanation

citeseq CITE-seq

hashing Cell or nucleus hashing

cmo CellPlex

adt Hashing and CITE-seq are in the same library

crispr

Perturb-seq/CROP-seq

If neither crispr_barcode_pos nor scaffold_sequence (see Workflow input) is set, crispr refers to 10x CRISPR assays. If in addition Chemistry is set to be SC3Pv3 or its aliases, Cumulus automatically complements the middle two bases to convert 10x feature barcoding cell barcodes back to 10x RNA cell barcodes.

Otherwise, crispr refers to non-10x CRISPR assays, such as CROP-Seq. In this case, we assume feature barcoding cell barcodes are the same as the RNA cell barcodes and no cell barcode conversion will be conducted.

FeatureBarcodeFile column.

Put Google Bucket URL of the feature barcode file here.

Example:

Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType,FeatureBarcodeFile
sample_1_rna,GRCh38_v3.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,SI-GA-A8,threeprime,rna
sample_1_adt,GRCh38_v3.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,ATTACTCG,SC3Pv3,adt,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
sample_2_adt,GRCh38_v3.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,3-4,TCCGGAGA,SC3Pv3,adt,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
sample_3_crispr,GRCh38_v3.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,5-6,CGCTCATT,SC3Pv3,crispr,gs://fc-e0000000-0000-0000-0000-000000000000/crispr_index.csv

In the sample sheet above, aside from the header row,

First row describes the normal 3’ RNA assay;

Second row describes its associated antibody tag data, which can come from either a CITE-seq, cell hashing, or nucleus hashing experiment.

Third row describes another tag data, which is in 10x genomics’ V3 chemistry. For tag and crispr data, it is important to explicitly state the chemistry (e.g. SC3Pv3).

Last row describes one gRNA guide data for Perturb-seq (see crispr in DataType field).
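The sequence manipulations mentioned in the Index and DataType notes of this section (reverse complementing an index for NextSeq runs, padding a short index to 8 bases, and complementing the middle two bases of a cell barcode) can be sketched as below. The function names are hypothetical; the filler "AT" is taken only from the ATCACG example above, and "middle two bases" is assumed to mean positions 8 and 9 of a 16-base barcode.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement a DNA sequence written in upper-case A/C/G/T."""
    return seq.translate(_COMPLEMENT)[::-1]


def pad_index(index, filler, length=8):
    """Append leading bases of `filler` until the index has `length` bases."""
    if len(index) >= length:
        return index
    return index + filler[: length - len(index)]


def complement_middle_two(barcode):
    """Complement the two central bases of an even-length cell barcode."""
    mid = len(barcode) // 2
    middle = barcode[mid - 1 : mid + 1].translate(_COMPLEMENT)
    return barcode[: mid - 1] + middle + barcode[mid + 1 :]


print(reverse_complement("ATTACTCG"))             # CGAGTAAT, as in Note 3
print(pad_index("ATCACG", "AT"))                  # ATCACGAT, as in Note 1
print(complement_middle_two("AAACCCAAGAAACACT"))  # AAACCCATCAAACACT
```

Complementing the middle two bases is its own inverse, so the same function converts in either direction.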

Workflow input¶

For feature barcoding data, cellranger_workflow takes Illumina outputs as input and runs cellranger mkfastq and cumulus adt. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name Description Example Default

input_csv_file Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile as optional) “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”

output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”

run_mkfastq If you want to run cellranger mkfastq true true

run_count If you want to run cumulus adt true true

delete_input_bcl_directory Whether to delete BCL directories after demultiplexing. If false, you should delete this folder yourself so as to not incur storage charges false false

mkfastq_barcode_mismatches Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1) 0

mkfastq_force_single_index If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair false false

mkfastq_filter_single_index Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed false false

mkfastq_use_bases_mask Override the read lengths as specified in RunInfo.xml “Y28n*,I8n*,N10,Y90n*”

mkfastq_delete_undetermined Delete undetermined FASTQ files generated by bcl2fastq2 true false

crispr_barcode_pos Barcode start position at Read 2 (0-based coordinate) for CRISPR 19 0

scaffold_sequence Scaffold sequence in sgRNA for Perturb-seq, only used for crispr data type. “GTTTAAGAGCTAAGCTGGAA” “”

max_mismatch Maximum hamming distance in feature barcodes for the adt task (changed to 2 as default) 2 2

min_read_ratio Minimum read count ratio (non-inclusive) to justify a feature given a cell barcode and feature combination, only used for the adt task and crispr data type 0.1 0.1

cellranger_version cellranger version, could be 7.0.0, 6.1.2, 6.1.1, 6.0.2, 6.0.1, 6.0.0, 5.0.1, 5.0.0 “7.0.0” “7.0.0”

cumulus_feature_barcoding_version Cumulus_feature_barcoding version for extracting feature barcode matrix. Version available: 0.9.0, 0.8.0, 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0. “0.9.0” “0.9.0”

docker_registry
Docker registry to use for cellranger_workflow. Options:

“quay.io/cumulus” for images on Red Hat registry;

“cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus” “quay.io/cumulus”

mkfastq_docker_registry Docker registry to use for cellranger mkfastq. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry. “gcr.io/broad-cumulus” “gcr.io/broad-cumulus”

acronym_file

The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names.

Set a GS URI if backend is gcp; an S3 URI for aws backend; an absolute file path for local backend.

“s3://xxxx/index.tsv” “gs://regev-lab/resources/cellranger/index.tsv”

zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

num_cpu Number of cpus to request for one node for cellranger mkfastq 32 32

memory Memory size string for cellranger mkfastq “120G” “120G”

feature_num_cpu Number of cpus for extracting feature count matrix 4 4

feature_memory Optional memory string for extracting feature count matrix “32G” “32G”

mkfastq_disk_space Optional disk space in GB for mkfastq 1500 1500

feature_disk_space Disk space in GB needed for extracting feature count matrix 100 100

backend
Cloud backend for file transfer. Available options:

“gcp” for Google Cloud;

“aws” for Amazon AWS;

“local” for local machine.

“gcp” “gcp”

preemptible Number of preemptible tries 2 2

awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5

Parameters used for feature count matrix extraction¶

If the chemistry is V2, 10x genomics v2 cell barcode white list will be used, a hamming distance of 1 is allowed for matching cell barcodes, and the UMI length is 10. If the chemistry is V3, 10x genomics v3 cell barcode white list will be used, a hamming distance of 0 is allowed for matching cell barcodes, and the UMI length is 12.
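To make the matching rule concrete, the sketch below looks up an observed cell barcode in a whitelist while allowing a chemistry-dependent number of mismatches. It is an illustration only: the names are hypothetical, the whitelist is a toy set, and returning no match when several whitelist entries are equally close is an assumption, not documented behavior.

```python
# Per-chemistry settings as stated in the paragraph above.
CHEMISTRY_PARAMS = {
    "SC3Pv2": {"max_mismatch": 1, "umi_length": 10},
    "SC3Pv3": {"max_mismatch": 0, "umi_length": 12},
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(barcode, whitelist, max_mismatch):
    """Return the whitelist entry matching `barcode`, or None.

    An exact hit wins; otherwise a single entry within `max_mismatch`
    is accepted, and ambiguous cases are rejected (assumed policy).
    """
    if barcode in whitelist:
        return barcode
    hits = [entry for entry in whitelist if hamming(barcode, entry) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None


toy_whitelist = {"AAAA", "CCCC"}
params = CHEMISTRY_PARAMS["SC3Pv2"]
print(match_barcode("AAAT", toy_whitelist, params["max_mismatch"]))
```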

For Perturb-seq data, a small number of sgRNA protospacer sequences will be sequenced ultra-deeply and we may have PCR chimeric reads. Therefore, we generate filtered feature count matrices as well in a data-driven manner:

First, plot the histogram of UMIs with certain number of read counts. The number of UMIs with x supporting reads decreases when x increases. We start from x = 1, and a valley between two peaks is detected if we find count[x] < count[x + 1] < count[x + 2]. We filter out all UMIs with < x supporting reads since they are likely formed due to chimeric reads.
In addition, we also filter out barcode-feature-UMI combinations that have their read count ratio, which is defined as total reads supporting barcode-feature-UMI over total reads supporting barcode-UMI, no larger than min_read_ratio parameter set above.
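The two filtering steps described above can be sketched as follows. This is a simplified reading of the description, not the cumulus_feature_barcoding implementation; the function names and the in-memory data layout are assumptions.

```python
from collections import Counter, defaultdict


def valley_threshold(reads_per_umi):
    """Return the smallest x >= 1 with count[x] < count[x+1] < count[x+2], or None.

    reads_per_umi: one read count per UMI. UMIs with fewer than the
    returned x supporting reads would be filtered out.
    """
    hist = Counter(reads_per_umi)
    if not hist:
        return None
    for x in range(1, max(hist) - 1):
        if hist[x] < hist[x + 1] < hist[x + 2]:
            return x
    return None


def filter_by_read_ratio(counts, min_read_ratio=0.1):
    """Keep barcode-UMI-feature entries whose read ratio exceeds min_read_ratio.

    counts: dict mapping (barcode, umi, feature) -> read count. The ratio
    is reads for barcode-feature-UMI over reads for barcode-UMI.
    """
    totals = defaultdict(int)
    for (barcode, umi, _feature), reads in counts.items():
        totals[(barcode, umi)] += reads
    return {
        key: reads
        for key, reads in counts.items()
        if reads / totals[key[:2]] > min_read_ratio
    }


reads = [1] * 10 + [2] * 4 + [3] * 6 + [4] * 9 + [5] * 3
print(valley_threshold(reads))  # 2, because count[2] < count[3] < count[4]

counts = {("BC1", "U1", "g1"): 95, ("BC1", "U1", "g2"): 5, ("BC1", "U2", "g2"): 3}
print(filter_by_read_ratio(counts))  # drops ("BC1", "U1", "g2") at ratio 0.05
```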

Workflow outputs¶

See the table below for important outputs.

Name	Type	Description
cellranger_mkfastq.output_fastqs_directory	Array[String]?	Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cumulus_adt.output_count_directory	Array[String]?	Subworkflow output. A list of cloud urls containing feature-barcode count matrices, one url per sample.

In addition, for each antibody tag or crispr tag sample, a folder with the sample ID is generated under output_directory. In the folder, two files (sample_id.csv and sample_id.stat.csv.gz) are generated.

sample_id.csv is the feature count matrix. It has the following format. The first line describes the column names: Antibody/CRISPR,cell_barcode_1,cell_barcode_2,...,cell_barcode_n. The following lines describe UMI counts for each feature barcode, with the following format: feature_name,umi_count_1,umi_count_2,...,umi_count_n.

sample_id.stat.csv.gz stores the gzipped sufficient statistics. It has the following format. The first line describes the column names: Barcode,UMI,Feature,Count. The following lines describe the read counts for every barcode-umi-feature combination.

If the feature barcode file has a third column, there will be two files for each feature type in the third column. For example, if hashing is present, sample_id.hashing.csv and sample_id.hashing.stat.csv.gz will be generated.

sample_id.report.txt is a summary report in TXT format. The first lines describe the total number of reads parsed, the number of reads with valid cell barcodes (and percentage over all parsed reads), the number of reads with valid feature barcodes (and percentage over all parsed reads) and the number of reads with both valid cell and feature barcodes (and percentage over all parsed reads). It is then followed by sections describing each feature type. In each section, 7 lines are shown: section title, number of valid cell barcodes (with matching cell barcode and feature barcode) in this section, number of reads for these cell barcodes, mean number of reads per cell barcode, number of UMIs for these cell barcodes, mean number of UMIs per cell barcode and sequencing saturation.

If data type is crispr, three additional files, sample_id.umi_count.pdf, sample_id.filt.csv and sample_id.filt.stat.csv.gz, are generated.

sample_id.umi_count.pdf plots the number of UMIs against the number of reads per UMI, coloring UMIs with a high likelihood of being chimeric in blue and other UMIs in red. This plot is generated purely from the number of reads each UMI has. For better visualization, UMIs with more than 50 reads (rare in data) are not shown.

sample_id.filt.csv is the filtered feature count matrix. It has the same format as sample_id.csv.

sample_id.filt.stat.csv.gz is the filtered sufficient statistics. It has the same format as sample_id.stat.csv.gz.
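The quantity shown in sample_id.umi_count.pdf can be approximated from the Count column of the statistics file. The sketch below only tabulates how many UMIs have a given number of reads and applies the 50-read display cutoff mentioned above; it does not reproduce the chimeric-UMI classification, and the function name `umi_read_histogram` is hypothetical.

```python
from collections import Counter

MAX_READS_SHOWN = 50  # UMIs with more than 50 reads are not shown in the plot


def umi_read_histogram(read_counts, max_reads=MAX_READS_SHOWN):
    """Map reads-per-UMI to the number of UMIs with that many reads."""
    hist = Counter(c for c in read_counts if c <= max_reads)
    return dict(sorted(hist.items()))


# Made-up read counts, one entry per barcode-UMI-feature combination.
counts = [1, 1, 1, 2, 2, 5, 51, 120]
hist = umi_read_histogram(counts)
```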

Single-cell ATAC-seq¶

To process scATAC-seq data, follow the specific instructions below.

Sample sheet¶

Reference column.

Pre-built scATAC-seq references are summarized below.

Keyword Description

GRCh38-2020-A_arc_v2.0.0 Human GRCh38, cellranger-arc/atac reference 2.0.0

mm10-2020-A_arc_v2.0.0 Mouse mm10, cellranger-arc/atac reference 2.0.0

GRCh38_and_mm10-2020-A_atac_v2.0.0 Human GRCh38 and mouse mm10, cellranger-atac reference 2.0.0

GRCh38_atac_v1.2.0 Human GRCh38, cellranger-atac reference 1.2.0

mm10_atac_v1.2.0 Mouse mm10, cellranger-atac reference 1.2.0

hg19_atac_v1.2.0 Human hg19, cellranger-atac reference 1.2.0

b37_atac_v1.2.0 Human b37 build, cellranger-atac reference 1.2.0

GRCh38_and_mm10_atac_v1.2.0 Human GRCh38 and mouse mm10, cellranger-atac reference 1.2.0

hg19_and_mm10_atac_v1.2.0 Human hg19 and mouse mm10, cellranger-atac reference 1.2.0

GRCh38_atac_v1.1.0 Human GRCh38, cellranger-atac reference 1.1.0

mm10_atac_v1.1.0 Mouse mm10, cellranger-atac reference 1.1.0

hg19_atac_v1.1.0 Human hg19, cellranger-atac reference 1.1.0

b37_atac_v1.1.0 Human b37 build, cellranger-atac reference 1.1.0

GRCh38_and_mm10_atac_v1.1.0 Human GRCh38 and mouse mm10, cellranger-atac reference 1.1.0

hg19_and_mm10_atac_v1.1.0 Human hg19 and mouse mm10, cellranger-atac reference 1.1.0

Index column.

Put 10x single cell ATAC sample index set names (e.g. SI-NA-B1) here.
Chemistry column.

The default is auto, which does not specify a given chemistry. To analyze just the individual ATAC library from a 10x multiome assay using cellranger-atac count, use ARC-v1 in the Chemistry column.
DataType column.

Set it to atac.
FeatureBarcodeFile column.

Leave it blank for scATAC-seq.

Example:

Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType
sample_atac,GRCh38_atac_v1.1.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9YB,*,SI-NA-A1,auto,atac
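Before uploading a sample sheet, it can help to check that the required columns (Sample, Reference, Flowcell, Lane, Index) are present and filled in. The following is a minimal sketch of such a check; `check_sample_sheet` is a hypothetical helper, not something the workflow provides.

```python
import csv
import io

REQUIRED_COLUMNS = ["Sample", "Reference", "Flowcell", "Lane", "Index"]


def check_sample_sheet(text):
    """Return a list of problems found in a sample sheet given as CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in fieldnames]
    if problems:
        return problems
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
    return problems


sheet = (
    "Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType\n"
    "sample_atac,GRCh38_atac_v1.1.0,"
    "gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9YB,*,SI-NA-A1,auto,atac\n"
)
ok = check_sample_sheet(sheet)
bad = check_sample_sheet("Sample,Reference,Flowcell,Lane\ns1,ref,gs://bucket/fc,*\n")
```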

Workflow input¶

cellranger_workflow takes Illumina outputs as input and runs cellranger-atac mkfastq and cellranger-atac count. Please see the description of inputs below. Note that required inputs are shown in bold.

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile as optional)	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_output”
run_mkfastq	Whether to run `cellranger-atac mkfastq`	true	true
run_count	Whether to run `cellranger-atac count`	true	true
delete_input_directory	Whether to delete BCL directories after demux. If false, you should delete this folder yourself to avoid incurring storage charges	false	false
mkfastq_barcode_mismatches	Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1)	0
mkfastq_force_single_index	If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair	false	false
mkfastq_filter_single_index	Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed	false	false
mkfastq_use_bases_mask	Override the read lengths as specified in RunInfo.xml	“Y28n,I8n,N10,Y90n*”
mkfastq_delete_undetermined	Delete undetermined FASTQ files generated by bcl2fastq2	true	false
force_cells	Force pipeline to use this number of cells, bypassing the cell detection algorithm	6000
atac_dim_reduce	Choose the algorithm for dimensionality reduction prior to clustering and tsne: “lsa”, “plsa”, or “pca”	“lsa”	“lsa”
peaks	A 3-column BED file of peaks to override cellranger atac peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with `#` are allowed	“gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”
cellranger_atac_version	cellranger-atac version. Available options: 2.1.0, 2.0.0, 1.2.0, 1.1.0	“2.1.0”	“2.1.0”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
mkfastq_docker_registry	Docker registry to use for `cellranger-atac mkfastq`. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry.	“gcr.io/broad-cumulus”	“gcr.io/broad-cumulus”
acronym_file	The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names. Set a GS URI if backend is `gcp`; an S3 URI for `aws` backend; an absolute file path for `local` backend.	“s3://xxxx/index.tsv”	“gs://regev-lab/resources/cellranger/index.tsv”
zones	Google cloud zones	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
atac_num_cpu	Number of cpus for cellranger-atac count	64	64
atac_memory	Memory string for cellranger-atac count	“57.6G”	“57.6G”
mkfastq_disk_space	Optional disk space in GB for cellranger-atac mkfastq	1500	1500
atac_disk_space	Disk space in GB needed for cellranger-atac count	500	500
backend	Cloud backend for file transfer. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine.	“gcp”	“gcp”
preemptible	Number of preemptible tries	2	2
awsMaxRetries	Number of maximum retries when running on AWS. This works only when backend is `aws`.	5	5
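When the workflow is launched from an inputs JSON file rather than through the Terra form, WDL runners expect keys of the form `<workflow name>.<input name>`. The sketch below assembles such a file for a scATAC-seq run using the example values from the table above. It assumes the WDL workflow name is `cellranger_workflow`; the helper `build_inputs` is hypothetical.

```python
import json

BUCKET = "gs://fc-e0000000-0000-0000-0000-000000000000"  # replace with your bucket


def build_inputs(workflow_name, **params):
    """Prefix each input with the workflow name, as WDL inputs JSON requires."""
    return {f"{workflow_name}.{key}": value for key, value in params.items()}


inputs = build_inputs(
    "cellranger_workflow",  # assumed workflow name
    input_csv_file=f"{BUCKET}/sample_sheet.csv",
    output_directory=f"{BUCKET}/cellranger_atac_output",
    run_mkfastq=True,
    run_count=True,
    cellranger_atac_version="2.1.0",
    atac_num_cpu=64,
    atac_memory="57.6G",
)
inputs_json = json.dumps(inputs, indent=2)
```

Inputs left out of the dictionary keep the defaults listed in the table.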

Workflow output¶

See the table below for important scATAC-seq outputs.

Name	Type	Description
cellranger_atac_mkfastq.output_fastqs_directory	Array[String]?	Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cellranger_atac_count.output_count_directory	Array[String]?	Subworkflow output. A list of cloud urls containing cellranger-atac count outputs, one url per sample.
cellranger_atac_count.output_web_summary	Array[File]?	Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger-atac count output).
collect_summaries_atac.metrics_summaries	File?	Task output. An Excel spreadsheet containing QCs for each sample.

Aggregate scATAC-Seq Samples¶

To aggregate multiple scATAC-Seq samples, follow the instructions below:

Import the cellranger_atac_aggr workflow. Please see Step 1 here; the name of the workflow is “cumulus/cellranger_atac_aggr”.
Set the inputs of the workflow. Please see the description of inputs below. Note that required inputs are shown in bold:

Name	Description	Example	Default
aggr_id	Aggregate ID.	“aggr_sample”
input_counts_directories	A string containing comma-separated URLs to directories of samples to be aggregated.	“gs://fc-e0000000-0000-0000-0000-000000000000/data/sample1,gs://fc-e0000000-0000-0000-0000-000000000000/data/sample2”
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/aggregate_result”
genome	The reference genome name used by Cell Ranger, can be either a keyword of pre-built genome, or a Google Bucket URL. See this table for the list of keywords of pre-built genomes.	“GRCh38_atac_v1.2.0”
normalize	Sample normalization mode. Options are: `none`, `depth`, or `signal`.	“none”	“none”
secondary	Perform secondary analysis (dimensionality reduction, clustering and visualization).	false	false
dim_reduce	Choose the algorithm for dimensionality reduction prior to clustering and tsne. Options are: `lsa`, `plsa`, or `pca`.	“lsa”	“lsa”
peaks	A 3-column BED file of peaks to override cellranger atac peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with # are allowed	“gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”
cellranger_atac_version	Cell Ranger ATAC version to use. Options: 2.1.0, 2.0.0, 1.2.0, 1.1.0.	“2.1.0”	“2.1.0”
zones	Google cloud zones	“us-central1-a us-west1-a”	“us-central1-b”
num_cpu	Number of cpus to request for cellranger atac aggr.	64	64
backend	Cloud backend for file transfer. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine.	“gcp”	“gcp”
memory	Memory size string for cellranger atac aggr.	“57.6G”	“57.6G”
disk_space	Disk space in GB needed for cellranger atac aggr.	500	500
preemptible	Number of preemptible tries.	2	2
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”

Check out the output in output_directory/aggr_id folder, where output_directory and aggr_id are the inputs you set in Step 2.
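Because input_counts_directories is a single comma-separated string, it is easy to introduce stray spaces or empty entries when typing it by hand. The sketch below builds the string from a list of URLs and derives the location of the aggregated results from output_directory and aggr_id as described above; both helper names are hypothetical.

```python
def join_count_directories(urls):
    """Build the comma-separated string expected by input_counts_directories."""
    cleaned = [u.strip() for u in urls if u.strip()]
    if not cleaned:
        raise ValueError("at least one count directory is required")
    for url in cleaned:
        if "," in url:
            raise ValueError(f"URL must not contain a comma: {url}")
    return ",".join(cleaned)


def aggr_result_path(output_directory, aggr_id):
    """Results are written to the aggr_id folder under output_directory."""
    return f"{output_directory.rstrip('/')}/{aggr_id}"


BUCKET = "gs://fc-e0000000-0000-0000-0000-000000000000"
counts = join_count_directories([f"{BUCKET}/data/sample1", f" {BUCKET}/data/sample2 "])
result = aggr_result_path(f"{BUCKET}/aggregate_result", "aggr_sample")
```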

Single-cell immune profiling¶

To process single-cell immune profiling (scIR-seq) data, follow the specific instructions below.

Sample sheet¶

Reference column.

Pre-built scIR-seq references are summarized below.

Keyword Description

GRCh38_vdj_v7.0.0 Human GRCh38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf

GRCm38_vdj_v7.0.0 Mouse GRCm38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf

GRCh38_vdj_v5.0.0 Human GRCh38 V(D)J sequences, cellranger reference 5.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf

GRCm38_vdj_v5.0.0 Mouse GRCm38 V(D)J sequences, cellranger reference 5.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf

GRCh38_vdj_v4.0.0 Human GRCh38 V(D)J sequences, cellranger reference 4.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf

GRCm38_vdj_v4.0.0 Mouse GRCm38 V(D)J sequences, cellranger reference 4.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf

GRCh38_vdj_v3.1.0 Human GRCh38 V(D)J sequences, cellranger reference 3.1.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf

GRCm38_vdj_v3.1.0 Mouse GRCm38 V(D)J sequences, cellranger reference 3.1.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf

GRCh38_vdj_v2.0.0 or GRCh38_vdj Human GRCh38 V(D)J sequences, cellranger reference 2.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf and vdj_GRCh38_alts_ensembl_10x_genes-2.0.0.gtf

GRCm38_vdj_v2.2.0 or GRCm38_vdj Mouse GRCm38 V(D)J sequences, cellranger reference 2.2.0, annotation built from Ensembl Mus_musculus.GRCm38.90.chr_patch_hapl_scaff.gtf

Index column.

Put 10x single cell V(D)J sample index set names (e.g. SI-GA-A3) here.
Chemistry column.

This column is not used for scIR-seq data. Put fiveprime here as a placeholder if you decide to include the Chemistry column.
DataType column.

Set it to vdj.
FeatureBarcodeFile column.

Leave it blank for scIR-seq.

Example:

Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType
sample_vdj,GRCh38_vdj_v3.1.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ,1,SI-GA-A1,fiveprime,vdj

Workflow input¶

For scIR-seq data, cellranger_workflow takes Illumina outputs as input and runs cellranger mkfastq and cellranger vdj. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile as optional)	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”
run_mkfastq	Whether to run `cellranger mkfastq`	true	true
run_count	Whether to run `cellranger vdj`	true	true
delete_input_bcl_directory	Whether to delete BCL directories after demux. If false, you should delete this folder yourself to avoid incurring storage charges	false	false
mkfastq_barcode_mismatches	Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1)	0
mkfastq_force_single_index	If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair	false	false
mkfastq_filter_single_index	Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed	false	false
mkfastq_use_bases_mask	Override the read lengths as specified in RunInfo.xml	“Y28n,I8n,N10,Y90n*”
mkfastq_delete_undetermined	Delete undetermined FASTQ files generated by bcl2fastq2	true	false
vdj_denovo	Do not align reads to reference V(D)J sequences before de novo assembly	false	false
vdj_chain	Force the analysis to be carried out for a particular chain type. The accepted values are: “auto” for auto detection based on TR vs IG representation; “TR” for T cell receptors; “IG” for B cell receptors.	“auto”	“auto”
cellranger_version	cellranger version, could be 7.0.0, 6.1.2, 6.1.1, 6.0.2, 6.0.1, 6.0.0, 5.0.1, 5.0.0	“7.0.0”	“7.0.0”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
mkfastq_docker_registry	Docker registry to use for `cellranger mkfastq`. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry.	“gcr.io/broad-cumulus”	“gcr.io/broad-cumulus”
acronym_file	The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names. Set a GS URI if backend is `gcp`; an S3 URI for `aws` backend; an absolute file path for `local` backend.	“s3://xxxx/index.tsv”	“gs://regev-lab/resources/cellranger/index.tsv”
zones	Google cloud zones	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu	Number of cpus to request for one node for cellranger mkfastq and cellranger vdj	32	32
memory	Memory size string for cellranger mkfastq and cellranger vdj	“120G”	“120G”
mkfastq_disk_space	Optional disk space in GB for mkfastq	1500	1500
vdj_disk_space	Disk space in GB needed for cellranger vdj	500	500
backend	Cloud backend for file transfer. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine.	“gcp”	“gcp”
preemptible	Number of preemptible tries	2	2
awsMaxRetries	Number of maximum retries when running on AWS. This works only when backend is `aws`.	5	5

Workflow output¶

See the table below for important scIR-seq outputs.

Name	Type	Description
cellranger_mkfastq.output_fastqs_directory	Array[String]?	Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cellranger_vdj.output_vdj_directory	Array[String]?	Subworkflow output. A list of cloud urls containing vdj results, one url per sample.
cellranger_vdj.output_web_summary	Array[File]?	Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger vdj output).
collect_summaries_vdj.metrics_summaries	File?	Task output. An Excel spreadsheet containing QCs for each sample.

Single-cell multiomics¶

To use cellranger arc/cellranger multi/cellranger count for single-cell multiomics, follow the specific instructions below. In particular, each single modality goes on its own line in the sample sheet, as described above. The Link column is then used to link multiple modalities together. Depending on the modalities included, cellranger arc (Multiome ATAC + Gene Expression), cellranger multi (CellPlex), or cellranger count (Feature Barcode) will be triggered. Note that cumulus_feature_barcoding/demuxEM will not be triggered for hashing/citeseq in this setting.

Sample sheet¶

Reference column.

Pre-built Multiome ATAC + Gene Expression references are summarized below. CellPlex and Feature Barcode use the same reference as in Single-cell and single-nucleus RNA-seq.

Keyword Description

GRCh38-2020-A_arc_v2.0.0 Human GRCh38 sequences (GENCODE v32/Ensembl 98), cellranger arc reference 2.0.0

mm10-2020-A_arc_v2.0.0 Mouse GRCm38 sequences (GENCODE vM23/Ensembl 98), cellranger arc reference 2.0.0

GRCh38-2020-A_arc_v1.0.0 Human GRCh38 sequences (GENCODE v32/Ensembl 98), cellranger arc reference 1.0.0

mm10-2020-A_arc_v1.0.0 Mouse GRCm38 sequences (GENCODE vM23/Ensembl 98), cellranger arc reference 1.0.0

DataType column.

For each modality, set it to the corresponding data type.
FeatureBarcodeFile column.

For the RNA-seq modality, only set this if a target panel is provided. For CMO (CellPlex), provide a CSV file with the sample name to CMO tag association, one sample per line, as follows:

sample1,CMO301|CMO302
sample2,CMO303

For CITE-seq, Perturb-seq and hashing, provide one CSV file as defined in Feature Barcode Reference. Note that a single feature barcode reference should be provided for all feature-barcode related modalities (e.g. citeseq, hashing, crispr), and all of these modalities should list the same reference file in the FeatureBarcodeFile column.
Link column.

Put a unique link name here; all modalities that are linked together must share the same name.
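The sample name to CMO tag association described under the FeatureBarcodeFile column is a small CSV in which multiple tags for one sample are joined by `|`. As a minimal sketch, the lines of such a file can be generated from a mapping; the helper name `cmo_assignment_lines` is hypothetical and the sample and tag names are taken from the example above.

```python
def cmo_assignment_lines(sample_to_tags):
    """Format one 'sample,TAG1|TAG2' line per sample."""
    lines = []
    for sample, tags in sample_to_tags.items():
        if not tags:
            raise ValueError(f"sample {sample} has no CMO tags")
        lines.append(f"{sample},{'|'.join(tags)}")
    return lines


assignments = {"sample1": ["CMO301", "CMO302"], "sample2": ["CMO303"]}
lines = cmo_assignment_lines(assignments)
cmo_csv_text = "\n".join(lines) + "\n"  # content to save and upload to the bucket
```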

Example:

Sample,Reference,Flowcell,Lane,Index,DataType,FeatureBarcodeFile,Link
sample1_rna,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ,*,SI-TT-A1,rna,,sample1
sample1_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ,*,SI-TT-N1,atac,,sample1
sample2_rna,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZX,*,SI-TT-A2,rna,,sample2
sample2_cmo,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZX,*,SI-TT-N2,cmo,gs://fc-e0000000-0000-0000-0000-000000000000/cmo.csv,sample2
sample3_rna,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZY,*,SI-TT-A3,rna,,sample3
sample3_citeseq,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZY,*,SI-TT-N3,citeseq,gs://fc-e0000000-0000-0000-0000-000000000000/feature_ref.csv,sample3

In the above example, three linked samples are provided. cellranger arc, cellranger multi and cellranger count will be triggered respectively.
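To make the linking rule concrete, the sketch below groups sample sheet rows by their Link value and guesses which pipeline each group would trigger. The decision rule is a simplification inferred from the description in this section (rna + atac gives cellranger-arc count, a cmo modality gives cellranger multi, a feature-barcode modality gives cellranger count). It is not the workflow's actual implementation, and the bucket paths are placeholders.

```python
import csv
import io
from collections import defaultdict

FEATURE_BARCODE_TYPES = {"citeseq", "hashing", "crispr"}


def pipeline_for(data_types):
    """Guess the pipeline for one linked group (illustrative rule only)."""
    types = set(data_types)
    if {"rna", "atac"} <= types:
        return "cellranger-arc count"
    if "cmo" in types:
        return "cellranger multi"
    if types & FEATURE_BARCODE_TYPES:
        return "cellranger count"
    raise ValueError(f"unrecognized modality combination: {sorted(types)}")


def linked_pipelines(sheet_text):
    """Map each Link name to the pipeline its modalities would trigger."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        if row.get("Link"):
            groups[row["Link"]].append(row["DataType"])
    return {link: pipeline_for(types) for link, types in groups.items()}


sheet = """Sample,Reference,Flowcell,Lane,Index,DataType,FeatureBarcodeFile,Link
sample1_rna,GRCh38-2020-A_arc_v2.0.0,gs://bucket/fc1,*,SI-TT-A1,rna,,sample1
sample1_atac,GRCh38-2020-A_arc_v2.0.0,gs://bucket/fc1,*,SI-TT-N1,atac,,sample1
sample2_rna,GRCh38-2020-A,gs://bucket/fc2,*,SI-TT-A2,rna,,sample2
sample2_cmo,GRCh38-2020-A,gs://bucket/fc2,*,SI-TT-N2,cmo,gs://bucket/cmo.csv,sample2
sample3_rna,GRCh38-2020-A,gs://bucket/fc3,*,SI-TT-A3,rna,,sample3
sample3_citeseq,GRCh38-2020-A,gs://bucket/fc3,*,SI-TT-N3,citeseq,gs://bucket/feature_ref.csv,sample3
"""
pipelines = linked_pipelines(sheet)
```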

Workflow input¶

For single-cell multiomics data, cellranger_workflow takes Illumina outputs as input and runs cellranger-arc mkfastq/cellranger mkfastq and cellranger-arc count/cellranger multi/cellranger count. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile, Link as optional)	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”
run_mkfastq	Whether to run `cellranger-arc mkfastq/cellranger mkfastq`	true	true
run_count	Whether to run `cellranger-arc count/cellranger multi/cellranger count`	true	true
delete_input_bcl_directory	Whether to delete BCL directories after demux. If false, you should delete this folder yourself to avoid incurring storage charges	false	false
mkfastq_barcode_mismatches	Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1)	0
mkfastq_force_single_index	If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair	false	false
mkfastq_filter_single_index	Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed	false	false
mkfastq_use_bases_mask	Override the read lengths as specified in RunInfo.xml	“Y28n,I8n,N10,Y90n*”
mkfastq_delete_undetermined	Delete undetermined FASTQ files generated by bcl2fastq2	true	false
force_cells	Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells. This option is used by cellranger multi and cellranger count.	6000
expect_cells	Expected number of recovered cells. Mutually exclusive with force_cells. This option is used by cellranger multi and cellranger count.	3000
include_introns	Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references. Note that if this option is set, cellranger_version must be >= 5.0.0. This option is used by cellranger multi and cellranger count.	true	true
arc_gex_exclude_introns	Disable counting of intronic reads. In this mode, only reads that are exonic and compatible with annotated splice junctions in the reference are counted. Note: using this mode will reduce the UMI counts in the feature-barcode matrix.	false	false
no_bam	Turn this option on to disable BAM file generation. This option is only available if cellranger_version >= 5.0.0. This option is used by cellranger-arc count, cellranger multi and cellranger count.	false	false
arc_min_atac_count	Cell caller override to define the minimum number of ATAC transposition events in peaks (ATAC counts) for a cell barcode. Note: this input must be specified in conjunction with `arc_min_gex_count` input. With both inputs set, a barcode is defined as a cell if it contains at least `arc_min_atac_count` ATAC counts AND at least `arc_min_gex_count` GEX UMI counts.	100
arc_min_gex_count	Cell caller override to define the minimum number of GEX UMI counts for a cell barcode. Note: this input must be specified in conjunction with `arc_min_atac_count`. See the description of `arc_min_atac_count` input for details.	200
peaks	A 3-column BED file of peaks to override cellranger arc peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with `#` are allowed	“gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”
secondary	Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.). This option is used by cellranger multi and cellranger count.	false	false
cmo_set	CMO set CSV file, declaring CMO constructs and associated barcodes. See CMO reference for details. Used only for cellranger multi.	“gs://fc-e0000000-0000-0000-0000-000000000000/cmo_set.csv”
cellranger_version	cellranger version, could be 7.0.0, 6.1.2, 6.1.1, 6.0.2, 6.0.1, 6.0.0, 5.0.1, 5.0.0	“7.0.0”	“7.0.0”
cellranger_arc_version	cellranger-arc version, could be 2.0.1, 2.0.0, 1.0.1, 1.0.0	“2.0.1”	“2.0.1”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
mkfastq_docker_registry	Docker registry to use for `cellranger-arc mkfastq/cellranger mkfastq`. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry.	“gcr.io/broad-cumulus”	“gcr.io/broad-cumulus”
acronym_file	The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names. Set a GS URI if backend is `gcp`; an S3 URI for `aws` backend; an absolute file path for `local` backend.	“s3://xxxx/index.tsv”	“gs://regev-lab/resources/cellranger/index.tsv”
zones	Google cloud zones	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu	Number of cpus to request for one node for cellranger mkfastq and cellranger vdj	32	32
memory	Memory size string for cellranger/cellranger-arc mkfastq and cellranger vdj	“120G”	“120G”
mkfastq_disk_space	Optional disk space in GB for mkfastq	1500	1500
count_disk_space	Disk space in GB needed for cellranger count	500	500
arc_num_cpu	Number of cpus to request for one node for cellranger-arc count	64	64
arc_memory	Memory size string for cellranger-arc count	“160G”	“160G”
arc_disk_space	Disk space in GB needed for cellranger-arc count	700	700
backend	Cloud backend for file transfer. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine.	“gcp”	“gcp”
preemptible	Number of preemptible tries	2	2
awsMaxRetries	Number of maximum retries when running on AWS. This works only when backend is `aws`.	5	5

Workflow output¶

See the table below for important single-cell multiomics outputs.

Name	Type	Description
cellranger_arc_mkfastq.output_fastqs_directory / cellranger_mkfastq.output_fastqs_directory	Array[String]?	Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cellranger_arc_count.output_count_directory / cellranger_multi.output_multi_directory / cellranger_count_fbc.output_count_directory	Array[String]?	Subworkflow output. A list of cloud urls containing cellranger-arc count, cellranger multi or cellranger count outputs, one url per sample.
cellranger_arc_count.output_web_summary / cellranger_count_fbc.output_web_summary	Array[File]?	A list of htmls visualizing QCs for each sample (cellranger-arc count / cellranger count output).
collect_summaries_arc.metrics_summaries / collect_summaries_fbc.metrics_summaries	File?	An Excel spreadsheet containing QCs for each sample.

Build Cell Ranger References¶

We provide routines wrapping Cell Ranger tools to build references for sc/snRNA-seq, scATAC-seq and single-cell immune profiling data.

Build references for sc/snRNA-seq¶

We provide a wrapper of cellranger mkref to build sc/snRNA-seq references. Please follow the instructions below.

1. Import `cellranger_create_reference`¶

Import cellranger_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/kalarman-cell-observatory/cumulus/Cellranger_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket¶

Required data may include input sample sheet, genome FASTA files and gene annotation GTF files.

3. Input sample sheet¶

If multiple species are specified, a sample sheet in CSV format is required. We describe the sample sheet format below, with required columns highlighted in bold:

Column Description

Genome Genome name

Fasta Location to the genome assembly in FASTA/FASTA.gz format

Genes Location to the gene annotation file in GTF/GTF.gz format

Attributes Optional. A list of key:value pairs separated by ;. If set, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details

Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.

See below for an example sample sheet that builds a two-species reference:
Genome,Fasta,Genes,Attributes
GRCh38,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.gtf.gz,gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense
mm10,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.gtf.gz
If multiple species are specified, the reference will be built under the Genome names concatenated by ‘_and_’. In the above example, the reference is stored under ‘GRCh38_and_mm10’.
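The naming rule for multi-species references can be checked with a few lines of Python. This sketch reads the Genome column of a sample sheet (rows may leave the optional Attributes column out) and joins the names with `_and_`; the function name `reference_name` is hypothetical and the bucket paths are placeholders.

```python
import csv
import io


def reference_name(sheet_text):
    """Join the Genome names of a sample sheet with '_and_', in sheet order."""
    genomes = [row["Genome"] for row in csv.DictReader(io.StringIO(sheet_text))]
    if not genomes:
        raise ValueError("sample sheet has no genomes")
    return "_and_".join(genomes)


sheet = (
    "Genome,Fasta,Genes,Attributes\n"
    "GRCh38,gs://bucket/GRCh38.fa.gz,gs://bucket/GRCh38.gtf.gz,gene_biotype:protein_coding\n"
    "mm10,gs://bucket/mm10.fa.gz,gs://bucket/mm10.gtf.gz\n"
)
name = reference_name(sheet)
```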

4. Workflow input¶

Required inputs are highlighted in bold. Note that input_sample_sheet is mutually exclusive with input_fasta, input_gtf, genome and attributes.

Name Description Example Default

input_sample_sheet A sample sheet in CSV format that allows users to specify more than one genome to build references (e.g. human and mouse). If a sample sheet is provided, input_fasta, input_gtf, and attributes will be ignored. “gs://fc-e0000000-0000-0000-0000-000000000000/input_sample_sheet.csv”

input_fasta Input genome reference in either FASTA or FASTA.gz format “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz”

input_gtf Input gene annotation file in either GTF or GTF.gz format “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz”

genome Genome reference name. New reference will be stored in a folder named genome refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0

output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_reference”

attributes A list of key:value pairs separated by ;. If this option is not None, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details “gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense”

pre_mrna Whether to build pre-mRNA references, in which full-length transcripts are used as exons in the annotation file. We follow 10x build Cell Ranger compatible pre-mRNA Reference Package to build pre-mRNA references true false

ref_version Reference version string Ensembl v94

cellranger_version cellranger version, could be: 7.0.0, 6.1.2, 6.1.1 “7.0.0” “7.0.0”

docker_registry Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. “quay.io/cumulus” “quay.io/cumulus”

zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

num_cpu Number of cpus to request for one node for building indices 1 1

memory Memory size string for cellranger mkref “32G” “32G”

disk_space Optional disk space in GB 100 100

backend Cloud backend for file transfer. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine. “gcp” “gcp”

preemptible Number of preemptible tries 2 2

awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5
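The attributes input is a `;`-separated list of key:value pairs. The sketch below splits such a string and renders each pair as a `--attribute=key:value` argument, which is the flag form that cellranger mkgtf accepts. How the workflow itself passes these values to mkgtf is an assumption here, and both function names are hypothetical.

```python
def parse_attributes(attributes):
    """Split 'key:value;key:value' into a list of (key, value) pairs."""
    pairs = []
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        key, sep, value = item.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"expected key:value, got {item!r}")
        pairs.append((key, value))
    return pairs


def mkgtf_flags(attributes):
    """Render each pair as a --attribute flag for cellranger mkgtf."""
    return [f"--attribute={key}:{value}" for key, value in parse_attributes(attributes)]


example = "gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense"
flags = mkgtf_flags(example)
```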

5. Workflow output¶

Name Type Description

output_reference File Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input.

Build references for scATAC-seq¶

We provide a wrapper of cellranger-atac mkref to build scATAC-seq references. Please follow the instructions below.

1. Import `cellranger_atac_create_reference`¶

Import cellranger_atac_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_atac_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_atac_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket¶

Required data include config JSON file, genome FASTA file, gene annotation file (GTF or GFF3 format) and motif input file (JASPAR format).

3. Workflow input¶

Required inputs are highlighted in bold.

Name Description Example Default

genome Genome reference name. New reference will be stored in a folder named genome refdata-cellranger-atac-mm10-1.1.0

input_fasta GSURL for input fasta file “gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa”

input_gtf GSURL for input GTF file “gs://fc-e0000000-0000-0000-0000-000000000000/annotation.gtf”

organism Name of the organism “human”

non_nuclear_contigs A comma separated list of names of contigs that are not in nucleus “chrM” “chrM”

input_motifs Optional file containing transcription factor motifs in JASPAR format “gs://fc-e0000000-0000-0000-0000-000000000000/motifs.pfm”

output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_reference”

cellranger_atac_version cellranger-atac version. Available options: 2.1.0, 2.0.0, 1.2.0, 1.1.0 “2.1.0” “2.1.0”

docker_registry
Docker registry to use for cellranger_workflow. Options:

“quay.io/cumulus” for images on Red Hat registry;

“cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus” “quay.io/cumulus”

zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

memory Memory size string for cellranger-atac mkref “32G” “32G”

disk_space Optional disk space in GB 100 100

backend
Cloud backend for file transfer. Available options:

“gcp” for Google Cloud;

“aws” for Amazon AWS;

“local” for local machine.

“gcp” “gcp”

preemptible Number of preemptible tries 2 2

awsMaxRetries Maximum number of retries when running on AWS. This only takes effect when backend is aws. 5 5
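The inputs in the table above can also be collected in an inputs JSON file. The following is a minimal sketch, assuming the WDL workflow is named cellranger_atac_create_reference so that each key takes the usual workflow_name.input_name form. The values are copied from the Example column, and make_inputs is a hypothetical helper.

```python
import json

WORKFLOW = "cellranger_atac_create_reference"  # assumed WDL workflow name


def make_inputs(workflow, values):
    """Prefix every input name with the workflow name, as a WDL inputs JSON expects."""
    return {f"{workflow}.{name}": value for name, value in values.items()}


if __name__ == "__main__":
    bucket = "gs://fc-e0000000-0000-0000-0000-000000000000"  # placeholder bucket
    inputs = make_inputs(WORKFLOW, {
        "genome": "refdata-cellranger-atac-mm10-1.1.0",
        "input_fasta": f"{bucket}/GRCh38.fa",
        "input_gtf": f"{bucket}/annotation.gtf",
        "organism": "human",
        "non_nuclear_contigs": "chrM",
        "input_motifs": f"{bucket}/motifs.pfm",
        "output_directory": f"{bucket}/cellranger_atac_reference",
        "cellranger_atac_version": "2.1.0",
    })
    # Print the JSON; redirect it to a file to keep it alongside the workspace.
    print(json.dumps(inputs, indent=2))
```

Inputs left out of the dictionary fall back to the defaults listed in the table.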

4. Workflow output¶

Name Type Description

output_reference File Gzipped reference folder named genome.tar.gz, where genome is the reference name given in the input. A copy of the gzipped tarball is also stored under the output_directory specified in the input.
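Once the tarball has been downloaded, a quick look at its contents can confirm that it holds the expected reference folder. The sketch below only lists the top-level entries and extracts nothing. It assumes the tarball has already been copied to a local path, and top_level_entries is a hypothetical helper.

```python
import sys
import tarfile


def top_level_entries(tarball_path):
    """Return the sorted top-level names stored in a gzipped tarball."""
    entries = set()
    with tarfile.open(tarball_path, "r:gz") as tar:
        for name in tar.getnames():
            # Skip empty and "." components so "./genome/x" counts as "genome".
            parts = [p for p in name.split("/") if p not in ("", ".")]
            if parts:
                entries.add(parts[0])
    return sorted(entries)


if __name__ == "__main__":
    # Usage: python inspect_reference.py <local path to the downloaded tarball>
    if len(sys.argv) > 1:
        print(top_level_entries(sys.argv[1]))
```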

Build references for single-cell immune profiling data¶

We provide a wrapper of cellranger mkvdjref to build single-cell immune profiling references. Please follow the instructions below.

1. Import `cellranger_vdj_create_reference`¶

Import cellranger_vdj_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_vdj_create_reference to import.

Then, on the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export the cellranger_vdj_create_reference workflow from the drop-down menu.

2. Upload required data to Google Bucket¶

Required data include a genome FASTA file and a gene annotation file (GTF format).

3. Workflow input¶

Required inputs are highlighted in bold.

Name Description Example Default

input_fasta Input genome reference in either FASTA or FASTA.gz format “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz”

input_gtf Input gene annotation file in either GTF or GTF.gz format “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz”

genome Genome reference name. The new reference will be stored in a folder with this name refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0

output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_vdj_reference”

ref_version Reference version string Ensembl v94

cellranger_version cellranger version. Available options: 7.0.0, 6.1.2, 6.1.1 “7.0.0” “7.0.0”

docker_registry
Docker registry to use for cellranger_workflow. Options:

“quay.io/cumulus” for images on Red Hat registry;

“cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus” “quay.io/cumulus”

zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

memory Memory size string for cellranger mkvdjref “32G” “32G”

disk_space Optional disk space in GB 100 100

backend
Cloud backend for file transfer. Available options:

“gcp” for Google Cloud;

“aws” for Amazon AWS;

“local” for local machine.

“gcp” “gcp”

preemptible Number of preemptible tries 2 2

awsMaxRetries Maximum number of retries when running on AWS. This only takes effect when backend is aws. 5 5

4. Workflow output¶

Name Type Description

output_reference File Gzipped reference folder named genome.tar.gz, where genome is the reference name given in the input. A copy of the gzipped tarball is also stored under the output_directory specified in the input.
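To locate the stored copy after a run, the expected URL can be derived from the two inputs. This is only a sketch: it assumes the tarball is placed directly under output_directory with no extra subfolder, which should be checked against the bucket contents. expected_reference_url is a hypothetical helper.

```python
def expected_reference_url(output_directory, genome):
    """Guess where the tarball copy lands: <output_directory>/<genome>.tar.gz."""
    # Assumption: the copy sits directly under output_directory.
    return f"{output_directory.rstrip('/')}/{genome}.tar.gz"


if __name__ == "__main__":
    # Values taken from the Example column of the input table.
    print(expected_reference_url(
        "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_vdj_reference",
        "refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0",
    ))
```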
