Run Cell Ranger tools using cellranger_workflow

cellranger_workflow wraps Cell Ranger to process single-cell/nucleus RNA-seq, single-cell ATAC-seq and single-cell immune profiling data, and supports feature barcoding (cell/nucleus hashing, CITE-seq, Perturb-seq). It also provides routines to build Cell Ranger references.

General step-by-step instructions

The workflow starts with FASTQ files.

Note

Starting from v3.0.0, Cumulus cellranger_workflow drops support for mkfastq. If your data start from BCL files, please first run BCL Convert to demultiplex flowcells to generate FASTQ files.

1. Import `cellranger_workflow`

Import the cellranger_workflow workflow to your workspace by following the instructions in Import workflows to Terra. Choose the workflow github.com/lilab-bcb/cumulus/CellRanger to import.

Then, on the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export the cellranger_workflow workflow in the drop-down menu.

2. Upload sequencing data to Google bucket

Copy your FASTQ files to your workspace bucket using the gcloud storage command (available once you have installed the Google Cloud SDK) in your Unix terminal.

You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.

There are three cases:
Case 1: All the FASTQ files are in one top-level folder. Then you can simply upload this folder to Cloud, and in your sample sheet, make sure Sample names are consistent with the filename prefix of their corresponding FASTQ files.

Case 2: In the top-level folder, each sample has a dedicated subfolder containing its FASTQ files. In this case, you need to upload the whole top-level folder, and in your sample sheet, make sure Sample names and their corresponding subfolder names are identical.

Case 3: Each sample’s FASTQ files are wrapped in a TAR file. In this case, upload the folder which contains this TAR file. Also, make sure Sample names are consistent with the filename prefix of their corresponding FASTQ files inside the TAR files.

Notice that if your FASTQ files are downloaded from the Sequence Read Archive (SRA) from NCBI, you must rename your FASTQs to follow the Illumina file naming conventions.

Example:
gcloud storage cp -r /foo/bar/K18WBC6Z4/Fastq gs://fc-e0000000-0000-0000-0000-000000000000/K18WBC6Z4_fastq
where -r means copy the directory recursively, and fc-e0000000-0000-0000-0000-000000000000 should be replaced by your own workspace Google bucket name.
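Before uploading FASTQ files that came from SRA, it can help to check their names locally. The sketch below tests whether a filename follows the Illumina pattern `<Sample>_S<n>_L<lane>_<read>_001.fastq.gz` and builds a compliant name. The helper names, the default sample number and lane, and the example filenames are illustrative assumptions, not part of cellranger_workflow.

```python
import re

# Illumina-style FASTQ name: <sample>_S<n>_L<lane>_<read>_001.fastq.gz,
# where <read> is R1/R2/R3 or I1/I2.
ILLUMINA_FASTQ = re.compile(
    r"(?P<sample>[A-Za-z0-9_-]+)_S\d+_L\d{3}_(?:R[123]|I[12])_001\.fastq\.gz"
)


def sample_prefix(filename):
    """Return the sample name prefix if the filename follows the convention, else None."""
    match = ILLUMINA_FASTQ.fullmatch(filename)
    return match.group("sample") if match else None


def illumina_name(sample, read, sample_number=1, lane=1):
    """Build an Illumina-style FASTQ filename for one read file (hypothetical helper)."""
    return f"{sample}_S{sample_number}_L{lane:03d}_{read}_001.fastq.gz"


# An SRA-style name is not recognized; the rebuilt name is.
print(sample_prefix("SRR1234567_1.fastq.gz"))
print(illumina_name("SRR1234567", "R1"))
```

The prefix returned by `sample_prefix` is the value that should appear in the Sample column of the sample sheet.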

Alternatively, users can submit jobs through command line interface (CLI) using altocumulus, which will smartly upload FASTQ files to cloud.

3. Prepare a sample sheet

3.1 Sample sheet format:

Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.

The sample sheet describes how to generate count matrices from sequencing reads. A brief description of the sample sheet format is listed below (required column headers are shown in bold).

Column

Description

Sample

Sample name. This name must be consistent with its corresponding FASTQ filename prefix in the folder specified in the Flowcell column. Sample names can only contain characters from [a-zA-Z0-9_-] to be recognized by Cell Ranger.

Notice that if a sample has multiple sequencing runs, each with FASTQ files stored in a dedicated location, you can specify multiple entries in the sample sheet with the same name in the Sample column, with each entry accounting for one FASTQ folder location.

Reference

Provides the reference genome used by Cell Ranger for processing the sample.

The reference can be a keyword for a prebuilt reference (e.g. GRCh38-2020-A) stored in the Cumulus bucket, or a user-specified cloud URI to a custom reference (in tarball .tar.gz format).

A full list of available keywords is included in each of the following data type sections (e.g. sc/snRNA-seq) below.

Flowcell

Indicates the cloud URI of the uploaded folder containing FASTQ files for each sample.

Chemistry

Keywords to describe the 10x chemistry used for the sample. This column is optional. Check data type sections (e.g. sc/snRNA-seq) below for the corresponding list of available keywords.

DataType

Describes the data type of each sample, with keywords chosen from the list below. This column is optional, and the default is rna.

rna: Gene expression (GEX) data

vdj: V(D)J data

citeseq: CITE-Seq tag data

hashing: Cell-hashing or nucleus-hashing tag data

adt: For the case where hashing and citeseq reads are in the same sample library

cmo: Cell multiplexing oligos used in 10x Genomics’ CellPlex assay

crispr: Perturb-seq guide tag data

atac: scATAC-Seq data

frp: 10x Flex gene expression (old name is Fixed RNA Profiling) data

AuxFile

The Cloud URI pointing to auxiliary files of the corresponding samples, with different usage depending on DataType values:

For rna: It’s used by Sample Multiplexing methods, which specifies the sample name to multiplexing barcode mapping.

For frp: It’s used by Flex data, which specifies the sample name to Flex probe barcode mapping.

For citeseq, hashing, adt, and crispr: It’s the feature barcode file, which contains the antibody information for CITE-Seq, cell hashing, or nucleus hashing, or the gRNA information for Perturb-Seq.

If analyzing using cumulus_feature_barcoding, the feature barcode file should be in format specified in Feature barcoding assays section below;

If analyzing as part of the Sample Multiplexing data using cellranger multi, the feature barcode file should be in 10x Feature Reference format.

For cmo: It’s the CMO reference file (cmo-set option) when using custom CMOs in CellPlex data.

For vdj_t_gd: It’s the inner enrichment primer file (inner-enrichment-primers option) for VDJ-T-GD data.

Notice: This is the FeatureBarcodeFile column in previous versions of Cellranger workflow. This old name is still accepted for backward compatibility.

Link

Designed for Single Cell Multiome ATAC + Gene Expression, Feature Barcoding, Sample Multiplexing, or Flex.

Link multiple modalities together using a single link name.

cellranger-arc count, cellranger count, or cellranger multi will be triggered automatically depending on the modalities.

If an empty string is provided, no link is assumed.

Link names can only contain characters from [a-zA-Z0-9_-] to be recognized by Cell Ranger.

Notice: The Link names must be unique to Sample values to avoid overwriting each other’s settings.
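The character restriction on Sample and Link names is easy to check before submitting a job. The following is a minimal sketch; the function name is hypothetical, and it treats an empty string as invalid, so empty Link values (which mean "no link") should be filtered out before calling it.

```python
import re

# Characters accepted in Sample and Link names, per the table above.
VALID_NAME = re.compile(r"[a-zA-Z0-9_-]+")


def invalid_names(names):
    """Return the names that contain characters outside [a-zA-Z0-9_-]."""
    return [name for name in names if not VALID_NAME.fullmatch(name)]


print(invalid_names(["sample_1", "sample.2", "s 3", "ok-4"]))
```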

The sample sheet supports sequencing the same 10x channels across multiple flowcells. If a sample is sequenced across multiple flowcells, simply list it in multiple rows, with one flowcell per row. In the following example, we have 4 samples sequenced in two flowcells.

Example:
Sample,Reference,Flowcell,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
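To see which flowcell folders each sample draws from, the sheet can be grouped by the Sample column. This is a small sketch using only the standard library; the function name and the `gs://my-bucket` URIs are placeholders. Because rows are read by header name, the column order does not matter.

```python
import csv
import io
from collections import defaultdict


def flowcells_by_sample(sheet_text):
    """Map each Sample to the list of Flowcell URIs it appears with, in row order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        grouped[row["Sample"]].append(row["Flowcell"])
    return dict(grouped)


# Placeholder sample sheet: sample_1 was sequenced on two flowcells.
sheet = """Sample,Reference,Flowcell,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://my-bucket/flowcell_A/Fastq,threeprime,rna
sample_1,GRCh38-2020-A,gs://my-bucket/flowcell_B/Fastq,threeprime,rna
sample_2,mm10-2020-A,gs://my-bucket/flowcell_A/Fastq,fiveprime,rna
"""
print(flowcells_by_sample(sheet))
```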
3.2 Upload your sample sheet to the workspace bucket:
Example:
gcloud storage cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/

Alternatively, users can submit jobs through command line interface (CLI) using altocumulus, which will smartly upload FASTQ files to cloud.

4. Launch analysis

In your workspace, open cellranger_workflow in the WORKFLOWS tab. Select the desired snapshot version (e.g. latest). Select Run workflow with inputs defined by file paths, and click the SAVE button. Select Use call caching and click INPUTS. Then fill in appropriate values in the Attribute column. Alternatively, you can upload a JSON file to configure inputs by clicking Drag or click to upload json.

Once the INPUTS are appropriately filled, click RUN ANALYSIS and then click LAUNCH.
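If you prefer to upload a JSON file, it can be generated with a short script. The sketch below assumes the usual WDL convention that each key is `<workflow name>.<input name>`; confirm the exact keys against the INPUTS tab of your workflow before uploading. The helper name is hypothetical, and the bucket URIs are the placeholders used throughout this page.

```python
import json


def build_inputs(input_csv_file, output_directory, **optional):
    """Return an inputs dict keyed as 'cellranger_workflow.<input name>' (assumed convention)."""
    inputs = {
        "cellranger_workflow.input_csv_file": input_csv_file,
        "cellranger_workflow.output_directory": output_directory,
    }
    for name, value in optional.items():
        inputs[f"cellranger_workflow.{name}"] = value
    return inputs


inputs = build_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output",
    include_introns=True,
)
print(json.dumps(inputs, indent=4))
```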

5. Workflow outputs

See the table below for workflow level outputs.

Name	Type	Description
count_outputs	Map[String, Array[String]?]	A modality-to-output map showing output URIs for all samples, organized by modality and one URI per sample.
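Since the map values are optional arrays, downstream scripts should tolerate missing entries. A minimal sketch of flattening such a map into (modality, URI) pairs follows; the function name, the modality keys, and the URI in the example are placeholders, not actual workflow output.

```python
def list_outputs(count_outputs):
    """Return (modality, uri) pairs sorted by modality, skipping modalities with no output."""
    pairs = []
    for modality in sorted(count_outputs):
        for uri in count_outputs[modality] or []:  # an optional array may be None
            pairs.append((modality, uri))
    return pairs


# Hypothetical example value; keys and URIs are placeholders.
example = {
    "rna": ["gs://my-bucket/cellranger_output/sample_1"],
    "adt": None,
}
print(list_outputs(example))
```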

Single-cell and single-nucleus RNA-seq

Note

Starting from v9.0, Cell Ranger sends anonymized telemetry data to 10x Genomics. See Cell Ranger Pipeline Telemetry for details.

This option has been turned off in this cellranger_workflow, thus no data will be sent to 10x Genomics.

To process sc/snRNA-seq data, follow the specific instructions below.

Sample sheet

Reference column.

Pre-built scRNA-seq references are summarized below.

Keyword	Description
GRCh38-2024-A	Human GRCh38, comparable to cellranger reference 2024-A (GENCODE v44/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+.
GRCm39-2024-A	Mouse GRCm39, comparable to cellranger reference 2024-A (GENCODE vM33/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+.
GRCh38_and_GRCm39-2024-A	Human GRCh38 (GENCODE v44/Ensembl 110) and mouse GRCm39 (GENCODE vM33/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+.
mRatBN7.2-2024-A	Rat mRatBN7.2 reference.
GRCh38-2020-A	Human GRCh38 (GENCODE v32/Ensembl 98)
mm10-2020-A	Mouse mm10 (GENCODE vM23/Ensembl 98)
GRCh38_and_mm10-2020-A	Human GRCh38 (GENCODE v32/Ensembl 98) and mouse mm10 (GENCODE vM23/Ensembl 98)

Chemistry column.

The cellranger workflow fully supports all 10x assay configurations. The most widely used ones are listed below:

Chemistry	Explanation
auto	autodetection (default). If the index read has extra bases besides cell barcode and UMI, autodetection might fail. In this case, please specify the chemistry.
threeprime	Single Cell 3′
fiveprime	Single Cell 5′
ARC-v1	Gene Expression portion of 10x Multiome data

Please refer to the section of --chemistry option in Cell Ranger Command Line Arguments for all other valid chemistry keywords.

Flowcell column.

See the table in general steps section above.

Note

The workflow accepts TAR files containing FASTQ files as input and handles them automatically.

DataType column.

This column is optional and defaults to rna. If you set a value, it must be rna.

Example:

Sample,Reference,Flowcell,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,rna
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,threeprime,rna
sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
sample_3,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna

Workflow input

For sc/snRNA-seq data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger count. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType) in CSV format	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”	
output_directory	Cloud URI of the output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”	Results are written under directory output_directory and will overwrite any existing files at this location.
include_introns	Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references. Note that if this option is set, cellranger_version must be >= 5.0.0.	true	true
no_bam	Turn this option on to disable BAM file generation. This option is only available if cellranger_version >= 5.0.0.	false	false
expect_cells	Expected number of recovered cells. Mutually exclusive with force_cells	3000	
force_cells	Force pipeline to use this number of cells, bypassing the cell detection algorithm. Mutually exclusive with expect_cells	6000	
secondary	Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.)	false	false
cellranger_version	cellranger version, could be: 10.0.0, 9.0.1, 8.0.1, 7.2.0	“10.0.0”	“10.0.0”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
acronym_file	The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names. Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.	“s3://xxxx/index.tsv”	“gs://cumulus-ref/resources/cellranger/index.tsv”
zones	Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu	Number of cpus to request for one node for cellranger count	32	32
memory	Memory size string for cellranger count	“120G”	“120G”
count_disk_space	Disk space in GB needed for cellranger count	500	500
preemptible	Number of preemptible tries. Only works for GCP	2	2
awsQueueArn	The AWS ARN string of the job queue to be used. Only works for AWS	“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”	“”
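Because expect_cells and force_cells are mutually exclusive, a configuration script can reject conflicting settings before a job is launched. The check below is only an illustrative sketch; the function is hypothetical and is not part of the workflow itself.

```python
def check_cell_options(expect_cells=None, force_cells=None):
    """Return the cell-number options that are set; raise ValueError if both are given."""
    if expect_cells is not None and force_cells is not None:
        raise ValueError("expect_cells and force_cells are mutually exclusive")
    options = {"expect_cells": expect_cells, "force_cells": force_cells}
    return {name: value for name, value in options.items() if value is not None}


print(check_cell_options(expect_cells=3000))
```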

Workflow output

See the table below for important sc/snRNA-seq outputs.

Name	Type	Description
cellranger_count.output_count_directory	Array[String]	Subworkflow output. A list of cloud URIs containing gene count matrices, one URI per sample.
cellranger_count.output_web_summary	Array[File]	Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger count output).
collect_summaries.metrics_summaries	File	Task output. An excel spreadsheet containing QCs for each sample.

Feature barcoding assays (cell & nucleus hashing, CITE-seq and Perturb-seq)

cellranger_workflow can extract feature-barcode count matrices in CSV format for feature barcoding assays such as cell and nucleus hashing, CellPlex, CITE-seq, and Perturb-seq. For cell and nucleus hashing as well as CITE-seq, the feature refers to antibody. For Perturb-seq, the feature refers to guide RNA. Please follow the instructions below to configure cellranger_workflow.

The workflow uses Cumulus Feature Barcoding to process antibody and Perturb-Seq data.

Prepare feature barcode files

Prepare a CSV file with the following format: feature_barcode,feature_name. See below for an example:
TTCCTGCCATTACTA,sample_1
CCGTACCTCATTGTT,sample_2
GGTAGATGTCCTCAG,sample_3
TGGTGTCATTCTTGA,sample_4
The above file describes a cell hashing application with 4 samples.

If cell hashing and CITE-seq data share the same sample index, you should concatenate the hashing and CITE-seq barcodes and add a third column indicating the feature type. See below for an example:
TTCCTGCCATTACTA,sample_1,hashing
CCGTACCTCATTGTT,sample_2,hashing
GGTAGATGTCCTCAG,sample_3,hashing
TGGTGTCATTCTTGA,sample_4,hashing
CTCATTGTAACTCCT,CD3,citeseq
GCGCAACTTGATGAT,CD8,citeseq
Then upload it to your google bucket:
gcloud storage cp antibody_index.csv gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
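The two file layouts above can also be generated programmatically. The sketch below produces two-column rows for a hashing-only file and adds the third feature type column when CITE-seq barcodes are supplied; the function names are hypothetical, and the barcodes are the ones from the examples above.

```python
import csv
import io


def feature_barcode_rows(hashing, citeseq=None):
    """Return CSV rows; a third feature type column is added only when citeseq is given."""
    if citeseq is None:
        return [(barcode, name) for barcode, name in hashing]
    rows = [(barcode, name, "hashing") for barcode, name in hashing]
    rows += [(barcode, name, "citeseq") for barcode, name in citeseq]
    return rows


def to_csv_text(rows):
    """Serialize rows as header-less CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


hashing = [("TTCCTGCCATTACTA", "sample_1"), ("CCGTACCTCATTGTT", "sample_2")]
citeseq = [("CTCATTGTAACTCCT", "CD3"), ("GCGCAACTTGATGAT", "CD8")]
print(to_csv_text(feature_barcode_rows(hashing, citeseq)))
```

Writing the returned text to a file such as antibody_index.csv gives a file that can be uploaded with the command above.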

Sample sheet

Reference column.

Put the reference for the associated scRNA-seq assay here, so that the generated count matrix can convey this information.

Chemistry column.

The following keywords are accepted for Chemistry column:

Chemistry	Explanation
auto	Default. Auto-detect the chemistry of your data from all possible 10x assay types.
threeprime	Auto-detect the chemistry of your data from all 3′ assay types.
fiveprime	Auto-detect the chemistry of your data from all 5′ assay types.
SC3Pv4	Single Cell 3′ v4. The workflow will auto-detect if Poly-A or CS1 capture method was applied to your data. Notice: This is a GEM-X chemistry, and only works for Cell Ranger v8.0.0+.
SC3Pv3	Single Cell 3′ v3. This is a Next GEM chemistry. The workflow will auto-detect if Poly-A or CS1 capture method was applied to your data.
SC3Pv2	Single Cell 3′ v2
SC5Pv3	Single Cell 5′ v3. Notice: This is a GEM-X chemistry, and only works for Cell Ranger v8.0.0+.
SC5Pv2	Single Cell 5′ v2
multiome	10x Multiome barcodes

Note

Not all 10x chemistry names are supported for feature barcoding, as the workflow uses Cumulus Feature Barcoding to process the data.

DataType column.

The following keywords are accepted for DataType column:

DataType	Explanation
citeseq	CITE-seq
hashing	Cell or nucleus hashing
cmo	CellPlex
adt	Hashing and CITE-seq are in the same library
crispr	Perturb-seq/CROP-seq

If neither crispr_barcode_pos nor scaffold_sequence (see Workflow input) is set, crispr refers to 10x CRISPR assays. If, in addition, Chemistry is set to SC3Pv3 or one of its aliases, Cumulus automatically complements the middle two bases to convert 10x feature barcoding cell barcodes back to 10x RNA cell barcodes.

Otherwise, crispr refers to non-10x CRISPR assays, such as CROP-Seq. In this case, we assume feature barcoding cell barcodes are the same as the RNA cell barcodes, and no cell barcode conversion is performed.
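The barcode conversion described above for SC3Pv3 can be pictured with a short sketch. It assumes 16-nt cell barcodes and reads "middle two bases" as the two central positions (the 8th and 9th bases); it illustrates the idea and is not the Cumulus implementation. The example barcode is illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def complement_middle_two(barcode):
    """Complement the two central bases of an even-length cell barcode."""
    if len(barcode) % 2 != 0:
        raise ValueError("expected an even-length barcode")
    right = len(barcode) // 2  # 0-based index of the second central base
    left = right - 1           # 0-based index of the first central base
    return (
        barcode[:left]
        + COMPLEMENT[barcode[left]]
        + COMPLEMENT[barcode[right]]
        + barcode[right + 1:]
    )


print(complement_middle_two("AAACCCAAGAAACACT"))
```

Applying the function twice returns the original barcode, so the same helper converts in either direction.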

AuxFile column.

Put cloud URI of the feature barcode file here.

Below is an example sample sheet:

Sample,Reference,Flowcell,Chemistry,DataType,AuxFile
sample_1_rna,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna,
sample_1_adt,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,hashing,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
sample_2_gex,GRCh38-2024-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna,
sample_2_adt,GRCh38-2024-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,SC3Pv3,adt,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index2.csv
sample_3_crispr,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,crispr,gs://fc-e0000000-0000-0000-0000-000000000000/crispr_index.csv

In the sample sheet above, excluding the header row,

Rows 1 and 2 specify the GEX and Hashing libraries of the same sample.

Rows 3 and 4 specify a sample which has GEX and adt (containing both Hashing and CITE-Seq data) libraries.

Row 5 describes gRNA data for Perturb-seq (see crispr in the DataType field).

Workflow input

For feature barcoding data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cumulus adt. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType, and AuxFile)	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”	
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”	
crispr_barcode_pos	Barcode start position at Read 2 (0-based coordinate) for CRISPR	19	0
scaffold_sequence	Scaffold sequence in sgRNA for Perturb-seq, only used for crispr data type.	“GTTTAAGAGCTAAGCTGGAA”	“”
max_mismatch	Maximum hamming distance in feature barcodes for the adt task (changed to 2 as default)	2	2
read_ratio_cutoff	PCR chimeric filtering parameter. Minimum read count ratio cutoff (non-inclusive) to justify a feature per cell barcode and UMI combination. Notice: Only enabled for crispr samples.	0.5	0.5
cumulus_feature_barcoding_version	Cumulus_feature_barcoding version for extracting feature barcode matrix.	“2.0.0”	“2.0.0”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
zones	Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
feature_num_cpu	Number of cpus for extracting feature count matrix	4	4
feature_memory	Optional memory string for extracting feature count matrix	“32G”	“32G”
feature_disk_space	Disk space in GB needed for extracting feature count matrix	100	100
preemptible	Number of preemptible tries. Only works for GCP	2	2
awsQueueArn	The AWS ARN string of the job queue to be used. Only works for AWS	“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”	“”

Parameters used for feature count matrix extraction

Cell barcode inclusion lists (previously known as whitelists) are automatically decided based on the Chemistry specified in the sample sheet. The association table is here.

Cell barcode matching settings are also automatically decided based on the chemistry specified:

For 10x V3 and V4 chemistry: a hamming distance of 0 is allowed for matching cell barcodes, and the UMI length is 12;

For multiome: a hamming distance of 1 is allowed for matching cell barcodes, and the UMI length is 12;

For 10x V2 chemistry: a hamming distance of 1 is allowed for matching cell barcodes, and the UMI length is 10.
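To make the matching rule concrete, the sketch below matches an observed cell barcode against an inclusion list within a given Hamming distance. It is a linear-scan illustration only: the function names and the tiny barcode list are made up, and returning no match when several candidates are equally plausible is an assumption rather than documented behavior of Cumulus Feature Barcoding.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed, inclusion_list, max_distance):
    """Return the matching inclusion-list barcode, or None if there is no unique match."""
    if observed in inclusion_list:
        return observed  # exact matches always win
    hits = [bc for bc in inclusion_list if hamming(observed, bc) <= max_distance]
    return hits[0] if len(hits) == 1 else None


barcodes = ["AAAA", "CCCC", "GGGG"]  # toy inclusion list
print(match_barcode("AAAT", barcodes, max_distance=1))
print(match_barcode("AAAT", barcodes, max_distance=0))
```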

For Perturb-seq data, a small number of sgRNA protospacer sequences will be sequenced ultra-deeply, and PCR chimeric reads may occur. Therefore, we also generate filtered feature count matrices in a data-driven manner:

First, build a histogram of the number of UMIs supported by each read count. The number of UMIs with x supporting reads decreases as x increases. Starting from x = 1, a valley between two peaks is detected if count[x] < count[x + 1] < count[x + 2]. All UMIs with fewer than x supporting reads are filtered out, since they are likely formed by chimeric reads.

In addition, we filter out barcode-feature-UMI combinations whose read count ratio, defined as the total reads supporting the barcode-feature-UMI combination over the total reads supporting the barcode-UMI combination, is no larger than the read_ratio_cutoff parameter set above.
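The two filtering rules can be sketched in a few lines. The code below follows the description literally: it scans the read count histogram for the first x satisfying count[x] < count[x + 1] < count[x + 2], and it keeps a barcode-feature-UMI combination only when its read ratio is strictly above the cutoff. The function names are hypothetical, the input data are made up, and returning None when no valley is found is an assumption; the actual implementation may differ.

```python
from collections import Counter


def find_valley(read_counts):
    """Return the smallest x with count[x] < count[x+1] < count[x+2], or None if absent.

    read_counts holds one entry per UMI: its number of supporting reads.
    """
    histogram = Counter(read_counts)  # count[x] = number of UMIs with x reads
    if not histogram:
        return None
    for x in range(1, max(histogram) - 1):
        if histogram[x] < histogram[x + 1] < histogram[x + 2]:
            return x
    return None


def passes_read_ratio(feature_reads, total_reads, cutoff=0.5):
    """Keep a barcode-feature-UMI combination only if its read ratio exceeds the cutoff."""
    return feature_reads / total_reads > cutoff


# Made-up data: 5 UMIs with 1 read, 2 with 2, 1 with 3, 3 with 4, 6 with 5.
reads_per_umi = [1] * 5 + [2] * 2 + [3] + [4] * 3 + [5] * 6
valley = find_valley(reads_per_umi)
kept = [r for r in reads_per_umi if valley is None or r >= valley]
print(valley, len(kept))
```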

Workflow outputs

The table below lists important feature barcoding output when using Cumulus Feature Barcoding:

Name	Type	Description
cumulus_adt.output_count_directory	Array[String]	Subworkflow output. A list of cloud URIs containing feature-barcode count matrices, one URI per sample.

In addition, for each feature barcoding sample, a folder named after the sample ID is generated under output_directory. The folder contains the following output files:

Modality: Along with sample name specified in Sample column of the sample sheet, the modality name is also part of the name prefix of the output files:
- If the feature barcode file provided in AuxFile column of sample sheet has only 2 columns, the sample’s modality is the DataType column value in the sample sheet. So the name prefix is <sample_id>.<modality>.*.
- If the feature barcode file has a 3rd column for modality names, then each modality will have its own sets of output files with name prefix <sample_id>.<modality>.*.
If the sample has crispr type in DataType, there are 3 sets of count matrices and sufficient statistics tables with different name prefixes:
- <sample_id>.<modality>.raw.* for raw count matrix,
- <sample_id>.<modality>.umi_correct.* for count matrix after UMI correction,
- <sample_id>.<modality>.chimeric_filtered.* for count matrix after UMI correction and PCR chimeric filtering.
If the sample has other DataType, there are 2 sets of count matrices and sufficient statistics tables with different name prefixes:
- <sample_id>.<modality>.raw.* for raw count matrix,
- <sample_id>.<modality>.umi_correct.* for count matrix after UMI correction.
Count Matrix: <sample_id>.<modality>.raw.h5, <sample_id>.<modality>.umi_correct.h5 and <sample_id>.<modality>.chimeric_filtered.h5. The feature count matrix is in sparse matrix format, and in 10x HDF5 format. It can be loaded by Pegasus via the following example code:
```
import pegasus as pg
mdata = pg.read_input("<sample_id>.<modality>.umi_correct.h5")
```

or by SCANPY via the following example code:

```
import scanpy as sc
adata = sc.read_10x_h5("<sample_id>.<modality>.umi_correct.h5", gex_only=False)
```

Sufficient Statistics: <sample_id>.<modality>.raw.molecule_info.h5, <sample_id>.<modality>.umi_correct.molecule_info.h5 and <sample_id>.<modality>.chimeric_filtered.molecule_info.h5. In the table, each entry is a molecule, defined as a Barcode + Feature + UMI combination. The table is stored in a simplified HDF5 format derived from the 10x molecule_info file, and contains the following HDF5 DataSets:
- /barcode_idx: Integer array of length n_mol (number of molecules). Each entry is the index of the molecule’s cell barcode, which can be found in /barcodes;
- /barcodes: String array of length n_cell (number of cell barcodes). Each entry is a cell barcode;
- /feature_idx: Integer array of length n_mol. Each entry is the index of the molecule’s feature name, which can be found in /features;
- /features: String array of length n_feature (number of features). Each entry is a feature name;
- /umi: String array of length n_mol. Each entry is the molecule’s UMI barcode;
- /count: Integer array of length n_mol. Each entry is the molecule’s count of reads.

This sufficient statistics table can be loaded by PegasusIO (v0.10.0 or above) via the following example code:

```
import pegasusio as pio
df_mol = pio.read_molecule_info("<sample_id>.<modality>.umi_correct.molecule_info.h5")
```

The resulting df_mol is a Pandas data frame of n_mol rows, with 4 columns:

Barcode: The molecule’s cell barcode.

Feature: The molecule’s feature name.

UMI: The molecule’s UMI barcode.

Count: The molecule’s count of reads.

Alternatively, you can load the *.molecule_info.h5 file yourself using the h5py package.

Report: <sample_id>.report.txt is a summary report in TXT format.
- The first lines describe
  Total number of reads parsed
  
  Number of reads with valid cell barcodes (and percentage over all parsed reads)
  
  Number of reads with valid feature barcodes (and percentage over all parsed reads)
  
  Number of reads with both valid cell and feature barcodes (and percentage over all parsed reads)
  
  Number of reads with valid cell, feature and UMI barcodes (and percentage over all parsed reads). Notice: A valid UMI should not contain N in its barcode.
- Then each modality has its own section:
  Number of valid cell barcodes
  
  Number of valid reads (with matching cell and feature barcodes)
  
  Mean number of valid reads per cell barcode
  
  Number of valid UMIs (with matching cell and feature barcodes)
  
  Mean number of valid UMIs per cell barcode
  
  Sequencing saturation
- For each section, if UMI correction and/or PCR chimeric filtering is performed, the stats above will be shown again after each of such steps.

Single-cell immune profiling

Note

Starting from v9.0, Cell Ranger sends anonymized telemetry data to 10x Genomics. See Cell Ranger Pipeline Telemetry for details.

This option has been turned off in this cellranger_workflow, thus no data will be sent to 10x Genomics.

To process single-cell immune profiling (scIR-seq) data, follow the specific instructions below.

Sample sheet

Reference column.

Pre-built scIR-seq references are summarized below.

Keyword	Description
GRCh38_vdj_v7.1.0	Human GRCh38 V(D)J sequences, cellranger reference 7.1.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
GRCh38_vdj_v7.0.0	Human GRCh38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
GRCm38_vdj_v7.0.0	Mouse GRCm38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf

Chemistry column.

This column is not used for scIR-seq data. Put fiveprime here as a placeholder if you decide to include the Chemistry column.

DataType column.

Choose one of the available types below:
- vdj: The VDJ library. Let the workflow auto-detect the chain type.
- vdj_t: The VDJ-T library for T-cell receptor sequences.
- vdj_b: The VDJ-B library for B-cell receptor sequences.
- vdj_t_gd: The VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains.

AuxFile column.

Only needed for vdj_t_gd type samples, which use primer sequences to enrich cDNA for V(D)J sequences. In this case, provide a .txt file containing these sequences, one per line. This file is then passed to the --inner-enrichment-primers option of cellranger vdj.

Note

The --chain option in cellranger vdj is automatically decided based on the DataType value specified:

For vdj: set to --chain auto
For vdj_t and vdj_t_gd: set to --chain TR
For vdj_b: set to --chain IG
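The mapping in this note amounts to a simple lookup. The sketch below restates it as code; the function name is hypothetical and the snippet is not taken from the workflow source.

```python
# DataType keyword -> value passed to the --chain option of cellranger vdj
CHAIN_BY_DATATYPE = {
    "vdj": "auto",
    "vdj_t": "TR",
    "vdj_t_gd": "TR",
    "vdj_b": "IG",
}


def vdj_chain_args(data_type):
    """Return the --chain arguments for a V(D)J DataType keyword."""
    if data_type not in CHAIN_BY_DATATYPE:
        raise ValueError(f"not a V(D)J DataType: {data_type!r}")
    return ["--chain", CHAIN_BY_DATATYPE[data_type]]


print(vdj_chain_args("vdj_t_gd"))
```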

An example sample sheet is below:

Sample,Reference,Flowcell,Chemistry,DataType,AuxFile
sample1,GRCh38_vdj_v7.1.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,fiveprime,vdj,
sample2,GRCh38_vdj_v7.1.0,gs://my-bucket/s2_fastqs,,vdj_t_gd,gs://my-bucket/s2_enrich_primers.txt

Workflow input

For scIR-seq data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger vdj. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Sample, Reference, Flowcell, DataType, Chemistry, and AuxFile)	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”
vdj_denovo	Do not align reads to reference V(D)J sequences before de novo assembly	false	false
cellranger_version	cellranger version, could be: 10.0.0, 9.0.1, 8.0.1, 7.2.0	“10.0.0”	“10.0.0”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
acronym_file	The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names. Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.	“s3://xxxx/index.tsv”	“gs://cumulus-ref/resources/cellranger/index.tsv”
zones	Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu	Number of cpus to request for one node for cellranger vdj	32	32
memory	Memory size string for cellranger vdj	“120G”	“120G”
vdj_disk_space	Disk space in GB needed for cellranger vdj	500	500
preemptible	Number of preemptible tries. Only works for GCP	2	2
awsQueueArn	The AWS ARN string of the job queue to be used. Only works for AWS	“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”	“”

Workflow output

See the table below for important scIR-seq outputs.

Name	Type	Description
cellranger_vdj.output_count_directory	Array[String]	Subworkflow output. A list of cloud URIs containing vdj results, one URI per sample.
cellranger_vdj.output_web_summary	Array[File]	Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger vdj output).
collect_summaries_vdj.metrics_summaries	File	Task output. An excel spreadsheet containing QCs for each sample.

Single-cell ATAC-seq

To process scATAC-seq data, follow the specific instructions below.

Sample sheet

Reference column.

Pre-built scATAC-seq references are summarized below.

Keyword	Description
GRCh38-2024-A_arc	Human GRCh38 (GENCODE v44/Ensembl 110) for cellranger arc/atac
GRCm39-2024-A_arc	Mouse GRCm39 (GENCODE vM33/Ensembl 110) for cellranger arc/atac
GRCh38-2020-A_arc_v2.0.0	Human GRCh38, cellranger-arc/atac reference 2.0.0
mm10-2020-A_arc_v2.0.0	Mouse mm10, cellranger-arc/atac reference 2.0.0
GRCh38_and_mm10-2020-A_atac_v2.0.0	Human GRCh38 and mouse mm10, cellranger-atac reference 2.0.0

Chemistry column.

The default is auto, which does not specify a particular chemistry. To analyze just the individual ATAC library from a 10x multiome assay using cellranger-atac count, use ARC-v1 in the Chemistry column.

DataType column.

Set it to atac.

AuxFile column.

Leave it blank for scATAC-seq.

An example sample sheet is below:

Sample,Reference,Flowcell,DataType
sample_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9YB/Fastq,atac
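
A sample sheet like the one above can also be generated programmatically. The following is a minimal sketch, not part of cellranger_workflow: the helper name build_atac_sample_sheet is hypothetical, and it assumes that Chemistry and AuxFile are left out so that the defaults described above apply.

```python
import csv
import io


def build_atac_sample_sheet(samples):
    """Return CSV text for a scATAC-seq sample sheet.

    `samples` is a list of (sample_name, reference, flowcell_uri) tuples.
    DataType is always "atac"; Chemistry and AuxFile columns are omitted.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Sample", "Reference", "Flowcell", "DataType"])
    for name, reference, flowcell in samples:
        writer.writerow([name, reference, flowcell, "atac"])
    return buffer.getvalue()


sheet = build_atac_sample_sheet([
    (
        "sample_atac",
        "GRCh38-2020-A_arc_v2.0.0",
        "gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9YB/Fastq",
    ),
])
print(sheet)
```

The resulting text can be saved as sample_sheet.csv and uploaded to the workspace bucket.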

Workflow input

cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger-atac count. Please see the description of inputs below. Note that required inputs are shown in bold.

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Sample, Reference, Flowcell, DataType, and Chemistry)	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_output”
secondary	Perform secondary analysis of the gene-barcode matrix (dimensionality reduction, clustering and visualization). Note: This parameter works only for cellranger_atac_version `2.2.0` or later.	false	false
force_cells	Force pipeline to use this number of cells, bypassing the cell detection algorithm. Note: Users can specify any positive integer since cellranger_atac_version `2.2.0`. For older versions, it has to be less than 20,000.	6000	Estimation from data
atac_dim_reduce	Choose the algorithm for dimensionality reduction prior to clustering and t-SNE: “lsa”, “plsa”, or “pca”	“lsa”	“lsa”
peaks	A 3-column BED file of peaks to override cellranger atac peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with `#` are allowed	“gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”
cellranger_atac_version	cellranger-atac version. Available options: 2.2.0, 2.1.0, 2.0.0	“2.2.0”	“2.2.0”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
acronym_file	The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names. Set an GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.	“s3://xxxx/index.tsv”	“gs://cumulus-ref/resources/cellranger/index.tsv”
zones	Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
atac_num_cpu	Number of cpus for cellranger-atac count	64	64
atac_memory	Memory string for cellranger-atac count	“57.6G”	“57.6G”
atac_disk_space	Disk space in GB needed for cellranger-atac count	500	500
preemptible	Number of preemptible tries. Only works for GCP	2	2
awsQueueArn	The AWS ARN string of the job queue to be used. Only works for AWS	“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”	“”

Workflow output

See the table below for important scATAC-seq outputs.

Name	Type	Description
cellranger_atac_count.output_count_directory	Array[String]	Subworkflow output. A list of cloud URIs containing cellranger-atac count outputs, one URI per sample.
cellranger_atac_count.output_web_summary	Array[File]	Subworkflow output. A list of HTML files visualizing QCs for each sample (cellranger-atac count output).
collect_summaries_atac.metrics_summaries	File	Task output. An Excel spreadsheet containing QCs for each sample.

Single-cell Multiome (GEX + ATAC)

Note

Starting from v2.1, Cell Ranger ARC sends anonymized telemetry data to 10x Genomics. See Cell Ranger ARC Pipeline Telemetry for details.

Telemetry has been turned off in cellranger_workflow, so no data will be sent to 10x Genomics.

To process 10x Multiome (GEX + ATAC) data, follow the instructions below:

Sample sheet

Reference column.

Pre-built single-cell Multiome ATAC + Gene Expression references are summarized below.

Keyword	Description
GRCh38-2024-A_arc	Human GRCh38 (GENCODE v44/Ensembl 110) for cellranger arc
GRCm39-2024-A_arc	Mouse GRCm39 (GENCODE vM33/Ensembl 110) for cellranger arc
GRCh38-2020-A_arc_v2.0.0	Human GRCh38 sequences (GENCODE v32/Ensembl 98), cellranger arc reference 2.0.0
mm10-2020-A_arc_v2.0.0	Mouse GRCm38 sequences (GENCODE vM23/Ensembl 98), cellranger arc reference 2.0.0

Chemistry column.

The default is auto, which does not specify a particular chemistry.

DataType column.

For each sample, choose a data type from the table below:

DataType	Description
rna	For scRNA-Seq modality of the data
atac	For scATAC-Seq modality of the data

AuxFile column.

Leave it blank.

Link column.

Use the same unique link name for all modalities that are linked. Notice: The Link name must be different from all Sample column values.

Example:

Link,Sample,Reference,Flowcell,DataType
sample1,s1_rna,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,rna
sample1,s1_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,atac

In the above example, the linked samples will be processed together, and the output will be one subfolder named sample1.
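
The Link rules above can be checked before submitting a job. This is a minimal sketch under stated assumptions: group_by_link is a hypothetical helper, not a function of cellranger_workflow, and it only checks that no Link name collides with a Sample name while collecting the modalities that share a Link.

```python
import csv
import io
from collections import defaultdict


def group_by_link(sheet_text):
    """Group (Sample, DataType) pairs by Link and reject Link/Sample clashes."""
    rows = list(csv.DictReader(io.StringIO(sheet_text)))
    sample_names = {row["Sample"] for row in rows}
    groups = defaultdict(list)
    for row in rows:
        link = row["Link"]
        if link in sample_names:
            raise ValueError(f"Link name {link!r} clashes with a Sample name")
        groups[link].append((row["Sample"], row["DataType"]))
    return dict(groups)


example_sheet = (
    "Link,Sample,Reference,Flowcell,DataType\n"
    "sample1,s1_rna,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,rna\n"
    "sample1,s1_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,atac\n"
)
print(group_by_link(example_sheet))
```

Each key of the returned dictionary corresponds to one output subfolder.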

Workflow input

For single-cell multiomics data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files). Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType, and Link)	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”
include_introns	Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references.	true	true
secondary	Perform Cell Ranger ARC secondary analysis (e.g. clustering). Note: This parameter works only for cellranger_arc_version `2.1.0` or later.	false	false
no_bam	Turn this option on to disable BAM file generation.	false	false
arc_gex_exclude_introns	Disable counting of intronic reads. In this mode, only reads that are exonic and compatible with annotated splice junctions in the reference are counted. Note: using this mode will reduce the UMI counts in the feature-barcode matrix.	false	false
arc_min_atac_count	Cell caller override to define the minimum number of ATAC transposition events in peaks (ATAC counts) for a cell barcode. Note: this input must be specified in conjunction with `arc_min_gex_count` input. With both inputs set, a barcode is defined as a cell if it contains at least `arc_min_atac_count` ATAC counts AND at least `arc_min_gex_count` GEX UMI counts.	100
arc_min_gex_count	Cell caller override to define the minimum number of GEX UMI counts for a cell barcode. Note: this input must be specified in conjunction with `arc_min_atac_count`. See the description of `arc_min_atac_count` input for details.	200
peaks	A 3-column BED file of peaks to override cellranger arc peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with `#` are allowed	“gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”
cellranger_arc_version	cellranger-arc version, could be: `2.2.0`, `2.1.0`, `2.0.2.strato` (compatible with workflow v2.6.1+), `2.0.2.custom-max-cell` (with max_cell threshold set to 80,000), `2.0.2` (compatible with workflow v2.6.0 or earlier), `2.0.1`, `2.0.0`. Note: The 20,000 total cell limit has been removed since version `2.1.0`.	“2.2.0”	“2.2.0”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
acronym_file	The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names. Set an GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.	“s3://xxxx/index.tsv”	“gs://cumulus-ref/resources/cellranger/index.tsv”
zones	Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
arc_num_cpu	Number of cpus to request for one link	64	64
arc_memory	Memory size string for one link	“160G”	“160G”
arc_disk_space	Disk space in GB needed for one link	700	700
preemptible	Number of preemptible tries. Only works for GCP	2	2
awsQueueArn	The AWS ARN string of the job queue to be used. Only works for AWS	“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”	“”
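
The cell caller override described for arc_min_atac_count and arc_min_gex_count can be written out as a small rule. The sketch below only restates the table's description in Python; both function names are hypothetical, and the actual cell calling is done inside Cell Ranger ARC.

```python
def check_cell_caller_overrides(arc_min_atac_count=None, arc_min_gex_count=None):
    """The two overrides must be set together or left unset together."""
    if (arc_min_atac_count is None) != (arc_min_gex_count is None):
        raise ValueError(
            "arc_min_atac_count and arc_min_gex_count must be specified together"
        )


def is_cell(atac_count, gex_count, arc_min_atac_count, arc_min_gex_count):
    """A barcode is a cell only if it meets BOTH minimum counts."""
    return atac_count >= arc_min_atac_count and gex_count >= arc_min_gex_count


check_cell_caller_overrides(arc_min_atac_count=100, arc_min_gex_count=200)
print(is_cell(150, 250, 100, 200))  # True: both thresholds met
print(is_cell(150, 100, 100, 200))  # False: GEX UMI count below 200
```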

Workflow output

See the table below for important output:

Name	Type	Description
cellranger_arc_count.output_count_directory	Array[String]	A list of cloud URIs to output, one URI per link
cellranger_arc_count.output_web_summary	Array[File]	A list of HTML files visualizing QCs for each link
collect_summaries_arc.metrics_summaries	File	An Excel spreadsheet containing QCs for each link

Flex, Sample Multiplexing and Multiomics

Note

Starting from v9.0, Cell Ranger sends anonymized telemetry data to 10x Genomics. See Cell Ranger Pipeline Telemetry for details.

Telemetry has been turned off in cellranger_workflow, so no data will be sent to 10x Genomics.

The cellranger workflow supports processing data of 10x Flex and Sample Multiplexing type, as well as multiomics data. Follow the corresponding sections below based on your data type:

Flex Gene Expression

This section covers preparing the sample sheet for Flex (previously named Fixed RNA Profiling) data.

Sample and Link column.

The Sample column specifies the name of each sample in your data. Sample names must be unique within the sample sheet.

The Link column specifies a name for the whole dataset, so that the workflow knows which samples should be put together to run cellranger multi.

Notice 1: Use one Link name for all samples belonging to the same data/experiment, and do not reuse it for any other dataset. Moreover, the Link name must be different from all Sample names.

Notice 2: If the data contain only an scRNA-Seq sample, you don’t need to specify a Link name. The workflow then uses the Sample name for the whole dataset.

DataType, Reference, and AuxFile column.

For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:

DataType	Reference	AuxFile	Description
Choose one from: frp, flex-v1, flex-v2	Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format.	Path to a text file including the sample name to Flex probe barcode association (see an example below this table).	For Flex RNA-Seq samples. flex-v1 or frp: For Flex v1 data. flex-v2: For Flex v2 data. Notice: This data type is supported only in Cell Ranger v10.0+.
Choose one from: citeseq, crispr	No need to specify a reference	Path to its feature reference file of 10x Feature Reference format. Notice: If there are multiple antibody capture samples, you need to combine the feature barcodes used in all of them in one reference file.	For antibody capture samples. citeseq: For CITE-Seq samples. crispr: For Perturb-Seq samples. Notice: This data type used in Flex is supported only in Cell Ranger v8.0+.

An example sample name to Flex probe barcode association file is the following (see here for examples of different Flex experiment settings):
sample_id,probe_barcode_ids,description
sample1,BC001,Control
sample2,BC002,Treated
The description column is optional, which specifies the description of the samples.

Note

In the sample name to Flex probe barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, probe_barcode_ids, and description (optional).
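
The optional-header rule in the note above can be illustrated with a small reader. This is a sketch only: read_probe_barcode_file is a hypothetical helper, and treating a first cell equal to sample_id as the header marker is an assumption, not the documented behavior of the workflow.

```python
import csv
import io

# Fixed column order used when the file has no header line.
DEFAULT_COLUMNS = ["sample_id", "probe_barcode_ids", "description"]


def read_probe_barcode_file(text):
    """Parse a sample-to-probe-barcode file with or without a header line."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return []
    if rows[0][0].strip() == "sample_id":  # assumed header marker
        header, data = rows[0], rows[1:]
    else:
        header, data = DEFAULT_COLUMNS, rows
    # zip() stops at the shorter sequence, so a missing description is fine.
    return [dict(zip(header, row)) for row in data]


with_header = "sample_id,probe_barcode_ids,description\nsample1,BC001,Control\n"
without_header = "sample1,BC001\nsample2,BC002\n"
print(read_probe_barcode_file(with_header))
print(read_probe_barcode_file(without_header))
```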

Below is an example sample sheet for Flex data:
Sample,Reference,Flowcell,DataType,AuxFile
s1,GRCh38-2020-A,gs://my-bucket/s1_fastqs,frp,gs://my-bucket/s1_flex.csv
Notice that Link column is not required for this case.

An example sample sheet for a more complex Flex dataset:
Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,flex-v2,gs://my-bucket/s2_flex.csv
s2,s2_citeseq,,gs://my-bucket/s2_fastqs,citeseq,gs://my-bucket/s2_fbc.csv
s2,s2_crispr,,gs://my-bucket/s2_fastqs,crispr,gs://my-bucket/s2_fbc.csv

Flex Probe Set.

Flex uses probes that target protein-coding genes in the human or mouse transcriptome. The probe set is determined automatically from the genome reference and the Flex chemistry version that users specify for the scRNA-Seq sample, according to the table below:

Genome Reference	Flex chemistry version	Probe Set	Cell Ranger version
GRCh38-2024-A	v2	Flex_human_probe_v2.0	v10.0+
GRCh38-2024-A	v1	Flex_human_probe_v1.1	v9.0+
GRCh38-2020-A	v1	Flex_human_probe_v1.0.1	v7.1+
GRCm39-2024-A	v2	Flex_mouse_probe_v2.0	v10.0+
GRCm39-2024-A	v1	Flex_mouse_probe_v1.1	v9.0+
mm10-2020-A	v1	Flex_mouse_probe_v1.0.1	v7.1+

See Flex probe sets overview for details on these probe sets.
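
The probe set selection described in this section amounts to a lookup keyed by genome reference and Flex chemistry version. The sketch below encodes the table as a Python dictionary for illustration; select_probe_set is a hypothetical helper, and the workflow's own implementation (which resolves names through the acronym_file index) may differ.

```python
# (genome reference, Flex chemistry version) -> (probe set, minimum Cell Ranger version)
PROBE_SETS = {
    ("GRCh38-2024-A", "v2"): ("Flex_human_probe_v2.0", "10.0"),
    ("GRCh38-2024-A", "v1"): ("Flex_human_probe_v1.1", "9.0"),
    ("GRCh38-2020-A", "v1"): ("Flex_human_probe_v1.0.1", "7.1"),
    ("GRCm39-2024-A", "v2"): ("Flex_mouse_probe_v2.0", "10.0"),
    ("GRCm39-2024-A", "v1"): ("Flex_mouse_probe_v1.1", "9.0"),
    ("mm10-2020-A", "v1"): ("Flex_mouse_probe_v1.0.1", "7.1"),
}

# DataType keywords from the Flex sample sheet and the chemistry version they imply.
FLEX_VERSION_BY_DATATYPE = {"frp": "v1", "flex-v1": "v1", "flex-v2": "v2"}


def select_probe_set(reference, data_type):
    """Return the probe set name for a genome reference and Flex DataType."""
    version = FLEX_VERSION_BY_DATATYPE.get(data_type)
    if version is None:
        raise ValueError(f"{data_type!r} is not a Flex data type")
    entry = PROBE_SETS.get((reference, version))
    if entry is None:
        raise ValueError(f"no probe set listed for {reference} with Flex {version}")
    return entry[0]


print(select_probe_set("GRCh38-2020-A", "frp"))      # Flex_human_probe_v1.0.1
print(select_probe_set("GRCm39-2024-A", "flex-v2"))  # Flex_mouse_probe_v2.0
```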

Chemistry column

By default, the chemistry is detected automatically, which is the officially recommended setting, so this column is usually omitted.

However, for the cases in which auto-detection fails (e.g. MFRP-RNA + MFRP-Ab-R1 for Flex Multiplex with Antibody design, because probe barcodes are on different read pairs), users can specify this Chemistry column, and give the sample-level chemistry values. Notice: This sample-level chemistry feature requires cellranger_version 8.0.1 or later.

On Chip Multiplexing

This section covers preparing the sample sheet for On-Chip Multiplexing (OCM) data.

Sample and Link column.

The Sample column specifies the name of each sample in your data. Sample names must be unique within the sample sheet.

The Link column specifies a name for the whole dataset, so that the workflow knows which samples should be put together to run cellranger multi. Notice: Use one Link name for all samples belonging to the same data/experiment, and do not reuse it for any other dataset. Moreover, the Link name must be different from all Sample names.

DataType, Reference, and AuxFile column.

For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:

DataType	Reference	AuxFile	Description
rna	Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format.	Path to a text file including the sample name to OCM barcode association (see an example below this table).	For RNA-Seq samples
Choose one from: vdj, vdj_t, vdj_b, vdj_t_gd	Select one from prebuilt VDJ references in Single-cell immune profiling section.	Optional. For vdj_t_gd type samples only: path to a text file containing inner enrichment primers info. This is the inner-enrichment-primers option in VDJ section of Cell Ranger multi config CSV.	For each VDJ sample, choose one from the 4 provided VDJ data types. vdj: Leave the workflow to auto-detect. vdj_t: VDJ-T library for T-cell receptor sequences. vdj_b: VDJ-B library for B-cell receptor sequences. vdj_t_gd: VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains. Notice: For such a sample, a text file containing inner enrichment primers info must be provided in the AuxFile column.
Choose one from: citeseq, adt	No need to specify a reference	Path to its feature reference file of 10x Feature Reference format. Notice: If adt type, you need to combine feature barcodes of both CITE-Seq and Hashing modalities in one file.	For antibody capture samples. citeseq: For samples only containing CITE-Seq modality. adt: For samples containing both CITE-Seq and Hashing modalities.

An example sample name to OCM barcode association file is the following:
sample_id,ocm_barcode_ids,description
sample1,OB1,Control
sample2,OB2,Treated
The description column is optional and gives a description of each sample.

Note

In the sample name to OCM barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, ocm_barcode_ids, and description (optional).

Below is an example sample sheet for OCM:

Sample,Reference,Flowcell,DataType,AuxFile,Link
s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_ocm.csv,s1
s1_vdj,GRCh38_vdj_v7.1.0,gs://my-bucket/s1_fastqs,vdj,,s1
s1_adt,,gs://my-bucket/s1_fastqs,citeseq,gs://my-bucket/s1_fbc.csv,s1

If your data contain only an scRNA-Seq library, the Link column is optional:

Sample,Reference,Flowcell,DataType,AuxFile
s2,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_ocm.csv
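
The vdj_t_gd requirement stated in the DataType table (an inner enrichment primers file in the AuxFile column) is easy to miss. Below is a minimal pre-flight check, assuming a sample sheet with a header line; samples_missing_primers is a hypothetical helper and not part of the workflow.

```python
import csv
import io


def samples_missing_primers(sheet_text):
    """Return names of vdj_t_gd samples whose AuxFile column is empty."""
    missing = []
    for row in csv.DictReader(io.StringIO(sheet_text)):
        aux_file = (row.get("AuxFile") or "").strip()
        if row["DataType"] == "vdj_t_gd" and not aux_file:
            missing.append(row["Sample"])
    return missing


ocm_sheet = (
    "Sample,Reference,Flowcell,DataType,AuxFile,Link\n"
    "s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_ocm.csv,s1\n"
    "s1_vdj,GRCh38_vdj_v7.1.0,gs://my-bucket/s1_fastqs,vdj,,s1\n"
    "s1_adt,,gs://my-bucket/s1_fastqs,citeseq,gs://my-bucket/s1_fbc.csv,s1\n"
)
print(samples_missing_primers(ocm_sheet))  # [] because no vdj_t_gd sample is present
```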

Hashing with Antibody Capture

This section covers preparing the sample sheet for non-OCM hashtag oligo (HTO) data.

Sample and Link column.

The Sample column specifies the name of each sample in your data. Sample names must be unique within the sample sheet.

The Link column specifies a name for the whole dataset, so that the workflow knows which samples should be put together to run cellranger multi. Notice: Use one Link name for all samples belonging to the same data/experiment, and do not reuse it for any other dataset. Moreover, the Link name must be different from all Sample names.

DataType, Reference, and AuxFile column.

For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:

DataType	Reference	AuxFile	Description
rna	Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format.	Path to a text file including the sample name to HTO barcode association (see an example below this table).	For RNA-Seq samples
Choose one from: vdj, vdj_t, vdj_b, vdj_t_gd	Select one from prebuilt VDJ references in Single-cell immune profiling section.	Optional. For vdj_t_gd type samples only: path to a text file containing inner enrichment primers info. This is the inner-enrichment-primers option in VDJ section of Cell Ranger multi config CSV.	For each VDJ sample, choose one from the 4 provided VDJ data types. vdj: Leave the workflow to auto-detect. vdj_t: VDJ-T library for T-cell receptor sequences. vdj_b: VDJ-B library for B-cell receptor sequences. vdj_t_gd: VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains. Notice: For such a sample, a text file containing inner enrichment primers info must be provided in the AuxFile column.
hashing	No need to specify a reference	Path to its feature reference file of 10x Feature Reference format, which specifies the oligonucleotide sequences used in the data.	For antibody capture samples

An example sample name to HTO barcode association file is the following:

sample_id,hashtag_ids,description
sample1,TotalSeqB_Hashtag_1,Control
sample2,CD3_TotalSeqB,Treated

where names in hashtag_ids column must be consistent with id column in the feature reference file. The description column is optional, which specifies the description of the samples.

Note

In the sample name to HTO barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, hashtag_ids, and description (optional).
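
The requirement that hashtag_ids match the id column of the feature reference file can be verified before launching the workflow. The following sketch makes several simplifying assumptions: both files have a header line, each row lists a single hashtag, and unknown_hashtags is a hypothetical helper name. Only the id column of the feature reference is inspected, so the other columns are left out of the toy input.

```python
import csv
import io


def unknown_hashtags(association_text, feature_ref_text):
    """Return hashtag_ids values that are absent from the feature reference ids."""
    known_ids = {
        row["id"] for row in csv.DictReader(io.StringIO(feature_ref_text))
    }
    unknown = []
    for row in csv.DictReader(io.StringIO(association_text)):
        if row["hashtag_ids"] not in known_ids:
            unknown.append(row["hashtag_ids"])
    return unknown


association = (
    "sample_id,hashtag_ids,description\n"
    "sample1,TotalSeqB_Hashtag_1,Control\n"
    "sample2,CD3_TotalSeqB,Treated\n"
)
# Toy feature reference containing only the id column (illustrative input).
feature_ref = "id\nTotalSeqB_Hashtag_1\n"
print(unknown_hashtags(association, feature_ref))  # ['CD3_TotalSeqB']
```

An empty list means every hashtag in the association file is defined in the feature reference.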

Below is an example sample sheet for HTO:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s1,s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_hto.csv
s1,s1_vdj,GRCh38_vdj_v7.1.0,gs://my-bucket/s1_fastqs,vdj,
s1,s1_hto,,gs://my-bucket/s1_fastqs,hashing,gs://my-bucket/s1_fbc_ref.csv

Or if your data contain only scRNA-Seq and antibody capture libraries:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_hto.csv
s2,s2_hto,,gs://my-bucket/s2_fastqs,hashing,gs://my-bucket/s2_fbc_ref.csv

Cell Multiplexing with CMO (CellPlex)

This section covers preparing the sample sheet for CellPlex data using Cell Multiplexing Oligos (CMO).

Sample and Link column.

The Sample column specifies the name of each sample in your data. Sample names must be unique within the sample sheet.

The Link column specifies a name for the whole dataset, so that the workflow knows which samples should be put together to run cellranger multi. Notice: Use one Link name for all samples belonging to the same data/experiment, and do not reuse it for any other dataset. Moreover, the Link name must be different from all Sample names.

DataType, Reference, and AuxFile column.

For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:

DataType	Reference	AuxFile	Description
rna	Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format.	Path to a text file including the sample name to CMO barcode association (see an example below this table).	For RNA-Seq samples
cmo	No need to specify a reference	Optional. If using custom CMOs, provide the path to their cmo-set reference file of 10x Feature Reference format. See here for an example.	For CMO samples.
citeseq	No need to specify a reference	Path to its feature reference file of 10x Feature Reference format.	For CITE-Seq samples.

An example sample name to CMO barcode association file is the following:

sample_id,cmo_ids,description
sample1,CMO301,Control
sample2,CMO302,Treated

If using a cmo-set reference file, the names in cmo_ids must be consistent with id column in the CMO reference file. The description column is optional, which specifies the description of the samples.

Note

In the sample name to CMO barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, cmo_ids, and description (optional).

Below is an example sample sheet for CellPlex:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s1,s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_cmo.csv
s1,s1_cellplex,,gs://my-bucket/s1_fastqs,cmo,

Or if a CITE-Seq sample/library is also included in the data:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_cmo.csv
s2,s2_cellplex,,gs://my-bucket/s2_fastqs,cmo,
s2,s2_citeseq,,gs://my-bucket/s2_fastqs,citeseq,gs://my-bucket/s2_fbc.csv

Multiomics

To analyze multiomics (GEX + CITE-Seq/CRISPR) data, prepare your sample sheet as follows:

Link column.

A unique link name for all modalities of the same data

Chemistry column.

The workflow supports all 10x assay configurations. The most widely used ones are listed below:

Chemistry	Explanation
auto	Autodetection (default). If the index read has extra bases besides cell barcode and UMI, autodetection might fail. In this case, please specify the chemistry.
threeprime	Single Cell 3′
fiveprime	Single Cell 5′
ARC-v1	Gene Expression portion of 10x Multiome data

Please refer to the section of --chemistry option in Cell Ranger Command Line Arguments for all other valid chemistry keywords.

DataType column.

The following keywords are accepted for DataType column:

DataType	Explanation
rna	For scRNA-seq samples
citeseq	For CITE-seq samples
crispr	For 10x CRISPR samples

AuxFile column.

Prepare your feature reference file in 10x Feature Reference format.

Notice: If multiple antibody samples are used, you need to merge their feature barcodes into one feature reference file, and assign that file to each of the samples.

Below is an example sample sheet:

Link,Sample,Reference,DataType,Flowcell,Chemistry,AuxFile
sample_4,s4_gex,GRCh38-2020-A,rna,gs://my-bucket/s4_fastqs,auto,
sample_4,s4_citeseq,,citeseq,gs://my-bucket/s4_fastqs,SC3Pv4,gs://my-bucket/s4_feature_ref.csv

Here, by specifying sample_4 in Link column, the two modalities will be processed together. The output will be one subfolder named sample_4.
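
When several antibody samples are used, their feature references have to be merged into one file, as noted in the AuxFile description above. One possible way to do this is sketched below. The helper name merge_feature_references is hypothetical, the column names follow the 10x Feature Reference CSV layout, and the feature rows (ids, pattern and sequences) are placeholder values for illustration only.

```python
import csv
import io


def merge_feature_references(*csv_texts):
    """Merge feature reference CSV texts, keeping one row per feature id."""
    if not csv_texts:
        raise ValueError("at least one feature reference is required")
    header = None
    merged = {}
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError("feature reference files have different columns")
        for row in reader:
            previous = merged.setdefault(row["id"], row)
            if previous != row:
                raise ValueError(f"conflicting definitions for {row['id']!r}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(merged.values())
    return out.getvalue()


HEADER = "id,name,read,pattern,sequence,feature_type\n"
# Placeholder rows, not real antibody barcodes.
ref_a = HEADER + "Ab1,Ab1,R2,5PNNNNNNNNNN(BC),AAAAAAAAAAAAAAA,Antibody Capture\n"
ref_b = (
    HEADER
    + "Ab1,Ab1,R2,5PNNNNNNNNNN(BC),AAAAAAAAAAAAAAA,Antibody Capture\n"
    + "Ab2,Ab2,R2,5PNNNNNNNNNN(BC),CCCCCCCCCCCCCCC,Antibody Capture\n"
)
merged_ref = merge_feature_references(ref_a, ref_b)
print(merged_ref)
```

Raising an error on conflicting rows is a design choice of this sketch: silently keeping one of two different definitions for the same id could hide a mistake in the input files.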

Workflow Input

All the sample multiplexing assays share the same workflow input settings. cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger multi. Relevant workflow inputs are described below, with required inputs highlighted in bold:

Name	Description	Example	Default
input_csv_file	Sample Sheet (contains Link, Sample, Reference, DataType, Flowcell, and AuxFile columns)	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”
output_directory	Output directory	“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”
include_introns	Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references	true	true
no_bam	Turn this option on to disable BAM file generation. Notice: For Flex data, if this option is turned on, the genome reference will not be used in the process. (requires `cellranger_version >= "8.0.0"`)	false	false
force_cells	Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells	6000
expect_cells	Expected number of recovered cells. Mutually exclusive with force_cells	3000
secondary	Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.)	false	false
cellranger_version	Cell Ranger version to use. Available versions: 10.0.0, 9.0.1, 8.0.1, 7.2.0.	“10.0.0”	“10.0.0”
docker_registry	Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
acronym_file	The link/path of an index file in TSV format for fetching preset genome references, probe set references, chemistry whitelists, etc. by their names. Set an GS URI if running on GCP; an S3 URI if running on AWS; an absolute file path if running on HPC or local machines.	“s3://xxxx/index.tsv”	“gs://cumulus-ref/resources/cellranger/index.tsv”
zones	Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.	“us-central1-a us-west1-a”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu	Number of cpus to request per link	32	32
memory	Memory size string to request per link	“120G”	“120G”
multi_disk_space	Used by Flex and Sample Multiplexing data. Disk space in GB to request per link.	1500	1500
count_disk_space	Only used by Multiomics data. Disk space in GB to request per link	500	500
preemptible	Number of preemptible tries. This only works for GCP.	2	2
awsQueueArn	The AWS ARN string of the job queue to be used. This only works for AWS.	“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”	“”
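
As an illustration of how these inputs fit together, the sketch below assembles an inputs dictionary and enforces the mutual exclusion between force_cells and expect_cells described in the table. The helper build_multi_inputs is hypothetical, and the cellranger_workflow. key prefix is an assumption based on the usual WDL inputs JSON convention; check the inputs page of your own workspace for the exact names.

```python
import json


def build_multi_inputs(input_csv_file, output_directory,
                       force_cells=None, expect_cells=None):
    """Assemble workflow inputs; force_cells and expect_cells cannot both be set."""
    if force_cells is not None and expect_cells is not None:
        raise ValueError("force_cells and expect_cells are mutually exclusive")
    inputs = {
        "cellranger_workflow.input_csv_file": input_csv_file,
        "cellranger_workflow.output_directory": output_directory,
    }
    if force_cells is not None:
        inputs["cellranger_workflow.force_cells"] = force_cells
    if expect_cells is not None:
        inputs["cellranger_workflow.expect_cells"] = expect_cells
    return inputs


multi_inputs = build_multi_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output",
    expect_cells=3000,
)
print(json.dumps(multi_inputs, indent=2))
```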

Workflow Output

All the sample multiplexing assays share the same workflow output structure. See the table below for important outputs:

Name	Type	Description
cellranger_multi.output_multi_directory	Array[String]	Flex and Sample Multiplexing output. A list of cloud URIs to output folders, one URI per link.
cellranger_count_fbc.output_count_directory	Array[String]	Multiomics output. A list of cloud URIs to output folders, one URI per link.

Build Cell Ranger References

We provide routines wrapping Cell Ranger tools to build references for sc/snRNA-seq, scATAC-seq and single-cell immune profiling data.

Build references for sc/snRNA-seq

Note

Starting from v9.0, Cell Ranger sends anonymized telemetry data to 10x Genomics. See Cell Ranger Pipeline Telemetry for details.

Telemetry has been turned off in cellranger_workflow, so no data will be sent to 10x Genomics.

We provide a wrapper of cellranger mkref to build sc/snRNA-seq references. Please follow the instructions below.

1. Import `cellranger_create_reference`

Import cellranger_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket

Required data may include input sample sheet, genome FASTA files and gene annotation GTF files.

3. Input sample sheet

If multiple species are specified, a sample sheet in CSV format is required. We describe the sample sheet format below, with required columns highlighted in bold:

Column	Description
Genome	Genome name
Fasta	Location of the genome assembly in FASTA/FASTA.gz format
Genes	Location of the gene annotation file in GTF/GTF.gz format
Attributes	Optional. A list of key:value pairs separated by ;. If set, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details.

Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.
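
To make the Attributes format concrete, the sketch below splits a key:value list on ; and turns each pair into an --attribute=key:value argument, which is the filter syntax of cellranger mkgtf. How the workflow itself invokes mkgtf is not shown on this page, so treat the helper mkgtf_attribute_args as a hypothetical illustration.

```python
def mkgtf_attribute_args(attributes):
    """Turn 'key:value;key:value' into a list of --attribute=key:value arguments."""
    args = []
    for pair in attributes.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        key, sep, value = pair.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"expected key:value, got {pair!r}")
        args.append(f"--attribute={key}:{value}")
    return args


attribute_args = mkgtf_attribute_args(
    "gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense"
)
print(attribute_args)
```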

See below for an example sample sheet:
Genome,Fasta,Genes,Attributes
GRCh38,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.gtf.gz,gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense
mm10,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.gtf.gz
If multiple species are specified, the reference will be built under the Genome names concatenated by ‘_and_’. In the above example, the reference is stored under ‘GRCh38_and_mm10’.
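
The naming rule for multi-species references can be sketched in a few lines. The helper combined_reference_name is hypothetical, and the bucket paths in the toy sample sheet are placeholders.

```python
import csv
import io


def combined_reference_name(sheet_text):
    """Join the Genome column values with '_and_' in sample sheet order."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    return "_and_".join(row["Genome"] for row in reader)


reference_sheet = (
    "Genome,Fasta,Genes,Attributes\n"
    "GRCh38,gs://my-bucket/GRCh38.fa.gz,gs://my-bucket/GRCh38.gtf.gz,gene_biotype:protein_coding\n"
    "mm10,gs://my-bucket/mm10.fa.gz,gs://my-bucket/mm10.gtf.gz\n"
)
print(combined_reference_name(reference_sheet))  # GRCh38_and_mm10
```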

4. Workflow input

Required inputs are highlighted in bold. Note that input_sample_sheet is mutually exclusive with input_fasta, input_gtf, genome and attributes.

| Name | Description | Example | Default |
|---|---|---|---|
| input_sample_sheet | A sample sheet in CSV format that lets users specify more than one genome to build references (e.g. human and mouse). If a sample sheet is provided, input_fasta, input_gtf, and attributes will be ignored. | “gs://fc-e0000000-0000-0000-0000-000000000000/input_sample_sheet.csv” | |
| input_fasta | Input genome reference in either FASTA or FASTA.gz format | “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz” | |
| input_gtf | Input gene annotation file in either GTF or GTF.gz format | “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz” | |
| genome | Genome reference name. The new reference will be stored in a folder named genome | refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0 | |
| output_directory | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_reference” | |
| attributes | A list of key:value pairs separated by ;. If this option is not None, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details | “gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense” | |
| pre_mrna | Whether to build a pre-mRNA reference, in which full-length transcripts are used as exons in the annotation file. We follow 10x build Cell Ranger compatible pre-mRNA Reference Package to build pre-mRNA references | true | false |
| ref_version | Reference version string | Ensembl v94 | |
| cellranger_version | cellranger version, could be: 10.0.0, 9.0.1, 8.0.1, 7.2.0 | “10.0.0” | “10.0.0” |
| docker_registry | Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
| zones | Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
| num_cpu | Number of CPUs to request for one node for building indices | 1 | 1 |
| memory | Memory size string for cellranger mkref | “32G” | “32G” |
| disk_space | Optional disk space in GB | 100 | 100 |
| preemptible | Number of preemptible tries. Only works for GCP | 2 | 2 |
| awsQueueArn | The AWS ARN string of the job queue to be used. Only works for AWS | “arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” | “” |
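As a rough pre-submission check, the relationship between input_sample_sheet and the single-genome inputs, and the format of attributes, can be sketched as below. Both helper names are hypothetical and this is not the workflow's own code. The sketch assumes the attributes string is passed to cellranger mkgtf as repeated --attribute=key:value flags, which is how the mkgtf command line accepts filters; how the workflow builds that command internally is an assumption.

```python
def resolve_build_mode(inputs):
    """Return (mode, overlapping_keys) for a dict of workflow inputs.

    mode is 'sample_sheet' when input_sample_sheet is set, else 'single_genome'.
    overlapping_keys lists single-genome inputs that were also set alongside a
    sample sheet, i.e. the inputs the table above describes as exclusive with it.
    """
    single_keys = ("input_fasta", "input_gtf", "genome", "attributes")
    provided = [key for key in single_keys if inputs.get(key)]
    if inputs.get("input_sample_sheet"):
        return "sample_sheet", provided
    return "single_genome", []


def mkgtf_attribute_flags(attributes):
    """Turn 'key:value;key:value' into a list of --attribute=key:value flags."""
    if not attributes:
        return []
    flags = []
    for pair in attributes.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        key, sep, value = pair.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"expected key:value, got {pair!r}")
        flags.append(f"--attribute={key}:{value}")
    return flags


mode, overlapping = resolve_build_mode(
    {"input_sample_sheet": "gs://bucket/sheet.csv", "input_fasta": "gs://bucket/a.fa"}
)
print(mode, overlapping)  # sample_sheet ['input_fasta']
print(mkgtf_attribute_flags("gene_biotype:protein_coding;gene_biotype:lincRNA"))
```

Running such a check before launching the workflow makes it obvious when a sample sheet and single-genome inputs have been set at the same time.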

5. Workflow output

| Name | Type | Description |
|---|---|---|
| output_reference | File | Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input. |

Build references for scATAC-seq

We provide a wrapper of cellranger-atac mkref to build scATAC-seq references. Please follow the instructions below.

1. Import `cellranger_atac_create_reference`

Import cellranger_atac_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_atac_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_atac_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket

Required data include config JSON file, genome FASTA file, gene annotation file (GTF or GFF3 format) and motif input file (JASPAR format).

3. Workflow input

Required inputs are highlighted in bold.

| Name | Description | Example | Default |
|---|---|---|---|
| genome | Genome reference name. The new reference will be stored in a folder named genome | refdata-cellranger-atac-mm10-1.1.0 | |
| input_fasta | GSURL for input FASTA file | “gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa” | |
| input_gtf | GSURL for input GTF file | “gs://fc-e0000000-0000-0000-0000-000000000000/annotation.gtf” | |
| organism | Name of the organism | “human” | |
| non_nuclear_contigs | A comma-separated list of names of contigs that are not in the nucleus | “chrM” | “chrM” |
| input_motifs | Optional file containing transcription factor motifs in JASPAR format | “gs://fc-e0000000-0000-0000-0000-000000000000/motifs.pfm” | |
| output_directory | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_reference” | |
| cellranger_atac_version | cellranger-atac version, could be: 2.2.0, 2.1.0, 2.0.0 | “2.2.0” | “2.2.0” |
| docker_registry | Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
| zones | Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
| memory | Memory size string for cellranger-atac mkref | “32G” | “32G” |
| disk_space | Optional disk space in GB | 100 | 100 |
| preemptible | Number of preemptible tries. Only works for GCP | 2 | 2 |
| awsQueueArn | The AWS ARN string of the job queue to be used. Only works for AWS | “arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” | “” |
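Because non_nuclear_contigs must name contigs that actually occur in the genome FASTA, it can help to compare the two before uploading. The following is a minimal sketch with hypothetical helper names. It assumes a local copy of the FASTA file (a gs:// URL would have to be downloaded first) and takes the contig name to be the first whitespace-delimited token of each header line.

```python
import gzip
import os
import tempfile


def fasta_contig_names(path):
    """Collect contig names (first token after '>') from a FASTA or FASTA.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    names = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def missing_non_nuclear_contigs(non_nuclear_contigs, contig_names):
    """Return the requested contigs (comma-separated string) absent from the FASTA."""
    requested = [c.strip() for c in non_nuclear_contigs.split(",") if c.strip()]
    return [c for c in requested if c not in contig_names]


# Toy FASTA file, used only to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "toy.fa")
    with open(fasta_path, "w") as out:
        out.write(">chr1 toy contig\nACGT\n>chrM\nACGT\n")
    contigs = fasta_contig_names(fasta_path)

print(missing_non_nuclear_contigs("chrM", contigs))        # []
print(missing_non_nuclear_contigs("chrM,chrMT", contigs))  # ['chrMT']
```

An empty list means every name given in non_nuclear_contigs was found among the FASTA headers.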

4. Workflow output

| Name | Type | Description |
|---|---|---|
| output_reference | File | Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input. |

Build references for single-cell immune profiling data

Note

Starting from v9.0, Cell Ranger sends anonymized telemetry data to 10x Genomics. See Cell Ranger Pipeline Telemetry for details.

This option has been turned off in cellranger_workflow, so no data will be sent to 10x Genomics.

We provide a wrapper of cellranger mkvdjref to build single-cell immune profiling references. Please follow the instructions below.

1. Import `cellranger_vdj_create_reference`

Import cellranger_vdj_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_vdj_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_vdj_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket

Required data include genome FASTA file and gene annotation file (GTF format).

3. Workflow input

Required inputs are highlighted in bold.

| Name | Description | Example | Default |
|---|---|---|---|
| input_fasta | Input genome reference in either FASTA or FASTA.gz format | “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz” | |
| input_gtf | Input gene annotation file in either GTF or GTF.gz format | “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz” | |
| genome | Genome reference name. New reference will be stored in a folder named genome | refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0 | |
| output_directory | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_vdj_reference” | |
| ref_version | reference version string | Ensembl v94 | |
| cellranger_version | cellranger version, could be: 10.0.0, 9.0.1, 8.0.1, 7.2.0 | “10.0.0” | “10.0.0” |
| docker_registry | Docker registry to use for cellranger_workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
| zones | Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings. | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
| memory | Memory size string for cellranger mkvdjref | “32G” | “32G” |
| disk_space | Optional disk space in GB | 100 | 100 |
| preemptible | Number of preemptible tries. Only works for GCP | 2 | 2 |
| awsQueueArn | The AWS ARN string of the job queue to be used. Only works for AWS | “arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” | “” |
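When the workflow is launched from an inputs JSON file instead of the Terra form, the inputs above are written as fully qualified workflow_name.input_name keys. The sketch below assembles such a file from the Example column. The helper build_inputs is hypothetical, and the sketch assumes the workflow name inside the WDL is cellranger_vdj_create_reference, matching the name used for import.

```python
import json


def build_inputs(workflow_name, **values):
    """Prefix each input with the workflow name; inputs left as None are omitted."""
    return {
        f"{workflow_name}.{key}": value
        for key, value in values.items()
        if value is not None
    }


bucket = "gs://fc-e0000000-0000-0000-0000-000000000000"
inputs = build_inputs(
    "cellranger_vdj_create_reference",  # assumed to match the WDL workflow name
    input_fasta=f"{bucket}/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
    input_gtf=f"{bucket}/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz",
    genome="refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0",
    output_directory=f"{bucket}/cellranger_vdj_reference",
    ref_version="Ensembl v94",
    awsQueueArn=None,  # left unset here, so the key is omitted
)

print(json.dumps(inputs, indent=2))
```

Inputs that are omitted fall back to the values in the Default column.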

4. Workflow output

| Name | Type | Description |
|---|---|---|
| output_reference | File | Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input. |
