Run Cell Ranger tools using cellranger_workflow

cellranger_workflow wraps Cell Ranger to process single-cell/nucleus RNA-seq, single-cell ATAC-seq and single-cell immune profiling data, and supports feature barcoding (cell/nucleus hashing, CITE-seq, Perturb-seq). It also provides routines to build Cell Ranger references.

A general step-by-step instruction

The workflow starts with FASTQ files.

Note

Starting from v3.0.0, Cumulus cellranger_workflow drops support for mkfastq. If your data start from BCL files, please first run BCL Convert to demultiplex flowcells to generate FASTQ files.

1. Import cellranger_workflow

Import the cellranger_workflow workflow into your workspace by following the instructions in Import workflows to Terra. Choose the workflow github.com/lilab-bcb/cumulus/CellRanger to import.

Then, on the workflow page, click the Export to Workspace... button and select the workspace to which you want to export the cellranger_workflow workflow in the drop-down menu.

2. Upload sequencing data to Google bucket

Copy your FASTQ files to your workspace bucket using the gcloud storage command (available once you have installed the Google Cloud SDK) in your Unix terminal.

You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.

../_images/google_bucket_link.png

There are three cases:

  • Case 1: All the FASTQ files are in one top-level folder. Then you can simply upload this folder to Cloud, and in your sample sheet, make sure Sample names are consistent with the filename prefix of their corresponding FASTQ files.

  • Case 2: In the top-level folder, each sample has a dedicated subfolder containing its FASTQ files. In this case, you need to upload the whole top-level folder, and in your sample sheet, make sure Sample names and their corresponding subfolder names are identical.

  • Case 3: Each sample’s FASTQ files are wrapped in a TAR file. In this case, upload the folder which contains this TAR file. Also, make sure Sample names are consistent with the filename prefix of their corresponding FASTQ files inside the TAR files.

Notice that if your FASTQ files are downloaded from the Sequence Read Archive (SRA) from NCBI, you must rename your FASTQs to follow the Illumina file naming conventions.

Example:

gcloud storage cp -r /foo/bar/K18WBC6Z4/Fastq gs://fc-e0000000-0000-0000-0000-000000000000/K18WBC6Z4_fastq

where -r means copy the directory recursively, and fc-e0000000-0000-0000-0000-000000000000 should be replaced by your own workspace Google bucket name.

Alternatively, users can submit jobs through command line interface (CLI) using altocumulus, which will smartly upload FASTQ files to cloud.
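As a quick local check before uploading, the sketch below tests whether FASTQ filenames follow the usual Illumina pattern (SampleName_S1_L001_R1_001.fastq.gz) and whether their prefix matches a name from the sample sheet. The regular expression and the helper names are illustrative assumptions, not part of cellranger_workflow, and Cell Ranger itself may accept more filename variants than this pattern does.

```python
import re

# Assumed pattern for the common Illumina naming scheme:
#   <Sample>_S<number>_L<3-digit lane>_<R1|R2|I1|I2>_001.fastq.gz
ILLUMINA_FASTQ = re.compile(
    r"^(?P<sample>[A-Za-z0-9_-]+)_S\d+_L\d{3}_[RI][12]_001\.fastq\.gz$"
)


def sample_prefix(filename):
    """Return the sample prefix if the name fits the pattern, else None."""
    match = ILLUMINA_FASTQ.match(filename)
    return match.group("sample") if match else None


def misnamed_fastqs(filenames, sample_names):
    """List files whose prefix is missing or not among the expected samples."""
    expected = set(sample_names)
    return [name for name in filenames if sample_prefix(name) not in expected]
```

Files downloaded from SRA (for example SRR1234567_1.fastq.gz) would be reported by misnamed_fastqs and need renaming before upload.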

3. Prepare a sample sheet

3.1 Sample sheet format:

Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.

The sample sheet describes how to generate count matrices from sequencing reads. A brief description of the sample sheet format is listed below (required column headers are shown in bold).

Column

Description

Sample

Sample name. This name must be consistent with its corresponding FASTQ filename prefix in the folder specified in Flowcell column. Sample names can only contain characters from [a-zA-Z0-9\_-] to be recognized by Cell Ranger.
Notice that if a sample has multiple sequencing runs, each of which has FASTQ files stored in dedicated location, you can specify multiple entries in the sample sheet with the same name in Sample column, and each entry accounts for one FASTQ folder location.

Reference

Provides the reference genome used by Cell Ranger for processing the sample.
The reference can be a keyword of a prebuilt reference (e.g. GRCh38-2020-A) stored in the Cumulus bucket, or a user-specified cloud URI to a custom reference (in tarball .tar.gz format).
A full list of available keywords is included in each of the following data type sections (e.g. sc/snRNA-seq) below.

Flowcell

Indicates the cloud URI of the uploaded folder containing FASTQ files for each sample.

Chemistry

Keywords to describe the 10x chemistry used for the sample. This column is optional. Check data type sections (e.g. sc/snRNA-seq) below for the corresponding list of available keywords.

DataType

Describes the data type of each sample, with keywords chosen from the list below. This column is optional, and the default is rna.

  • rna: Gene expression (GEX) data

  • vdj: V(D)J data

  • citeseq: CITE-Seq tag data

  • hashing: Cell-hashing or nucleus-hashing tag data

  • adt: For the case where hashing and citeseq reads are in the same sample library

  • cmo: Cell multiplexing oligos used in 10x Genomics’ CellPlex assay

  • crispr: Perturb-seq guide tag data

  • atac: scATAC-Seq data

  • frp: 10x Flex gene expression (old name is Fixed RNA Profiling) data

AuxFile

The Cloud URI pointing to auxiliary files of the corresponding samples, with different usage depending on DataType values:

  • For rna: It’s used by Sample Multiplexing methods, which specifies the sample name to multiplexing barcode mapping.

  • For frp: It’s used by Flex data, which specifies the sample name to Flex probe barcode mapping.

  • For citeseq, hashing, adt, and crispr: It’s the feature barcode file, which contains the antibody information for CITE-Seq, cell-hashing, or nucleus-hashing, or the gRNA information for Perturb-Seq.

    • If analyzing using cumulus_feature_barcoding, the feature barcode file should be in format specified in Feature barcoding assays section below;

    • If analyzing as part of the Sample Multiplexing data using cellranger multi, the feature barcode file should be in 10x Feature Reference format.

  • For cmo: It’s the CMO reference file (cmo-set option) when using custom CMOs in CellPlex data.

  • For vdj_t_gd: It’s the inner enrichment primer file (inner-enrichment-primers option) for VDJ-T-GD data.

Notice: This is the FeatureBarcodeFile column in previous versions of Cellranger workflow. This old name is still accepted for backward compatibility.

Link

Designed for Single Cell Multiome ATAC + Gene Expression, Feature Barcoding, Sample Multiplexing, or Flex.
Link multiple modalities together using a single link name.
cellranger-arc count, cellranger count, or cellranger multi will be triggered automatically depending on the modalities.
If empty string is provided, no link is assumed.
Link name can only contain characters from [a-zA-Z0-9\_-] for Cell Ranger to recognize.
Notice: The Link names must be unique to Sample values to avoid overwriting each other’s settings.

The sample sheet supports sequencing the same 10x channels across multiple flowcells. If a sample is sequenced across multiple flowcells, simply list it in multiple rows, with one flowcell per row. In the following example, we have 4 samples sequenced in two flowcells.

Example:

Sample,Reference,Flowcell,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
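The sample sheet rules above (recognized headers, allowed characters in sample names) can be checked locally before launching a run. The following is a minimal sketch, assuming that Sample, Reference and Flowcell are the required headers; the function name check_sample_sheet is hypothetical and the workflow performs its own validation independently of this code.

```python
import csv
import io
import re

REQUIRED = ("Sample", "Reference", "Flowcell")  # assumed required headers
OPTIONAL = ("Chemistry", "DataType", "AuxFile", "FeatureBarcodeFile", "Link")
NAME_OK = re.compile(r"^[a-zA-Z0-9_-]+$")  # characters accepted by Cell Ranger


def check_sample_sheet(text):
    """Return a list of human-readable problems found in a sample sheet."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    for column in REQUIRED:
        if column not in header:
            problems.append(f"missing required column: {column}")
    for column in header:
        if column not in REQUIRED + OPTIONAL:
            problems.append(f"unrecognized column: {column}")
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        name = row.get("Sample") or ""
        if not NAME_OK.match(name):
            problems.append(f"line {line_no}: invalid sample name {name!r}")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the referenced FASTQ folders exist.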

3.2 Upload your sample sheet to the workspace bucket:

Example:

gcloud storage cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/

Alternatively, users can submit jobs through command line interface (CLI) using altocumulus, which will smartly upload FASTQ files to cloud.

4. Launch analysis

In your workspace, open cellranger_workflow in WORKFLOWS tab. Select the desired snapshot version (e.g. latest). Select Run workflow with inputs defined by file paths as below

../_images/single_workflow.png

and click the SAVE button. Select Use call caching and click INPUTS. Then fill in appropriate values in the Attribute column. Alternatively, you can upload a JSON file to configure the inputs by clicking Drag or click to upload json.

Once the INPUTS are appropriately filled in, click RUN ANALYSIS and then click LAUNCH.
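For the JSON upload option, the input file can be generated with a few lines of Python. This is a sketch under the assumption that input keys take the usual WDL form workflow_name.input_name with cellranger_workflow as the workflow name; the helper build_inputs is hypothetical, and the bucket name is the placeholder used throughout this page.

```python
import json

WORKFLOW = "cellranger_workflow"  # assumed WDL workflow name used as key prefix


def build_inputs(bucket, sample_sheet="sample_sheet.csv",
                 output_folder="cellranger_output", **extra):
    """Return a dict of workflow inputs keyed as '<workflow>.<input name>'."""
    values = {
        "input_csv_file": f"gs://{bucket}/{sample_sheet}",
        "output_directory": f"gs://{bucket}/{output_folder}",
    }
    values.update(extra)  # any additional inputs, e.g. include_introns=True
    return {f"{WORKFLOW}.{name}": value for name, value in values.items()}


if __name__ == "__main__":
    inputs = build_inputs("fc-e0000000-0000-0000-0000-000000000000",
                          include_introns=True)
    # Print the JSON; redirect it to a file and upload that file in Terra.
    print(json.dumps(inputs, indent=2))
```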

5. Workflow outputs

See the table below for workflow level outputs.

Name | Type | Description
count_outputs | Map[String, Array[String]?] | A modality-to-output map showing output URIs for all samples, organized by modality and one URI per sample.


Single-cell and single-nucleus RNA-seq

To process sc/snRNA-seq data, follow the specific instructions below.

Sample sheet

  1. Reference column.

    Pre-built scRNA-seq references are summarized below.

    Keyword | Description
    GRCh38-2024-A | Human GRCh38, comparable to cellranger reference 2024-A (GENCODE v44/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+.
    GRCm39-2024-A | Mouse GRCm39, comparable to cellranger reference 2024-A (GENCODE vM33/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+.
    GRCh38_and_GRCm39-2024-A | Human GRCh38 (GENCODE v44/Ensembl 110) and mouse GRCm39 (GENCODE vM33/Ensembl 110). Notice: This reference only supports Cell Ranger v6.0.0+.
    mRatBN7.2-2024-A | Rat mRatBN7.2 reference.
    GRCh38-2020-A | Human GRCh38 (GENCODE v32/Ensembl 98)
    mm10-2020-A | Mouse mm10 (GENCODE vM23/Ensembl 98)
    GRCh38_and_mm10-2020-A | Human GRCh38 (GENCODE v32/Ensembl 98) and mouse mm10 (GENCODE vM23/Ensembl 98)

  2. Chemistry column.

    The cellranger workflow fully supports all 10x assay configurations. The most widely used ones are listed below:

    Chemistry | Explanation
    auto | autodetection (default). If the index read has extra bases besides cell barcode and UMI, autodetection might fail. In this case, please specify the chemistry.
    threeprime | Single Cell 3′
    fiveprime | Single Cell 5′
    ARC-v1 | Gene Expression portion of 10x Multiome data

    Please refer to the section of --chemistry option in Cell Ranger Command Line Arguments for all other valid chemistry keywords.

  3. Flowcell column.

    See the table in general steps section above.

    Note

    The workflow accepts input in TAR files which contain FASTQ files inside, and can automatically handle such cases.

  4. DataType column.

    This column is optional with a default rna. If you want to put a value, put rna here.

  5. Example:

    Sample,Reference,Flowcell,Chemistry,DataType
    sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,rna
    sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,threeprime,rna
    sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,rna
    sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2/Fastq,fiveprime,rna
    sample_3,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna
    

Workflow input

For sc/snRNA-seq data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger count. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name

Description

Example

Default

input_csv_file

Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType) in CSV format

“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”

output_directory

Cloud URI of the output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”

Results are written under directory output_directory and will overwrite any existing files at this location.

include_introns

Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references. Note that if this option is set, cellranger_version must be >= 5.0.0.

true

true

no_bam

Turn this option on to disable BAM file generation. This option is only available if cellranger_version >= 5.0.0.

false

false

expect_cells

Expected number of recovered cells. Mutually exclusive with force_cells

3000

force_cells

Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells

6000

secondary

Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.)

false

false

cellranger_version

cellranger version, could be: 9.0.1, 8.0.1, 7.2.0

“9.0.1”

“9.0.1”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

acronym_file

The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names.
Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.

“s3://xxxx/index.tsv”

“gs://cumulus-ref/resources/cellranger/index.tsv”

zones

Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

num_cpu

Number of cpus to request for one node for cellranger count

32

32

memory

Memory size string for cellranger count

“120G”

“120G”

count_disk_space

Disk space in GB needed for cellranger count

500

500

preemptible

Number of preemptible tries. Only works for GCP

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. Only works for AWS

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”
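Two constraints from the input table above lend themselves to a quick pre-flight check: expect_cells and force_cells are mutually exclusive, and include_introns and no_bam require cellranger_version >= 5.0.0. A minimal sketch follows; the function check_count_inputs is hypothetical and only mirrors the rules stated in the table.

```python
def parse_version(version):
    """Turn a dotted version string such as '9.0.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def check_count_inputs(expect_cells=None, force_cells=None,
                       include_introns=True, no_bam=False,
                       cellranger_version="9.0.1"):
    """Return a list of violated constraints (empty if none were found)."""
    problems = []
    if expect_cells is not None and force_cells is not None:
        problems.append("expect_cells and force_cells are mutually exclusive")
    if parse_version(cellranger_version) < (5, 0, 0):
        if include_introns:
            problems.append("include_introns requires cellranger_version >= 5.0.0")
        if no_bam:
            problems.append("no_bam requires cellranger_version >= 5.0.0")
    return problems
```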

Workflow output

See the table below for important sc/snRNA-seq outputs.

Name | Type | Description
cellranger_count.output_count_directory | Array[String] | Subworkflow output. A list of cloud URIs containing gene count matrices, one URI per sample.
cellranger_count.output_web_summary | Array[File] | Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger count output).
collect_summaries.metrics_summaries | File | Task output. An excel spreadsheet containing QCs for each sample.


Feature barcoding assays (cell & nucleus hashing, CITE-seq and Perturb-seq)

cellranger_workflow can extract feature-barcode count matrices in CSV format for feature barcoding assays such as cell and nucleus hashing, CellPlex, CITE-seq, and Perturb-seq. For cell and nucleus hashing as well as CITE-seq, the feature refers to antibody. For Perturb-seq, the feature refers to guide RNA. Please follow the instructions below to configure cellranger_workflow.

The workflow uses Cumulus Feature Barcoding to process antibody and Perturb-Seq data.

Prepare feature barcode files

Prepare a CSV file with the following format: feature_barcode,feature_name. See below for an example:

TTCCTGCCATTACTA,sample_1
CCGTACCTCATTGTT,sample_2
GGTAGATGTCCTCAG,sample_3
TGGTGTCATTCTTGA,sample_4

The above file describes a cell hashing application with 4 samples.

If cell hashing and CITE-seq data share the same sample index, you should concatenate the hashing and CITE-seq barcodes together and add a third column indicating the feature type. See below for an example:

TTCCTGCCATTACTA,sample_1,hashing
CCGTACCTCATTGTT,sample_2,hashing
GGTAGATGTCCTCAG,sample_3,hashing
TGGTGTCATTCTTGA,sample_4,hashing
CTCATTGTAACTCCT,CD3,citeseq
GCGCAACTTGATGAT,CD8,citeseq

Then upload it to your google bucket:

gcloud storage cp antibody_index.csv gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
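The feature barcode files described above can also be generated programmatically, which makes it easier to catch duplicated or malformed barcodes before uploading. The sketch below is illustrative: the function feature_barcode_csv is hypothetical, and restricting barcodes to the bases A, C, G and T is an assumption rather than a documented requirement.

```python
import csv
import io

VALID_BASES = set("ACGT")  # assumption: only unambiguous bases are used


def feature_barcode_csv(rows):
    """Render (barcode, name[, feature_type]) tuples as CSV text."""
    widths = {len(row) for row in rows}
    if widths not in ({2}, {3}):
        raise ValueError("all rows must have 2 columns, or all must have 3")
    seen = set()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        barcode = row[0]
        if not barcode or set(barcode) - VALID_BASES:
            raise ValueError(f"invalid barcode: {barcode!r}")
        if barcode in seen:
            raise ValueError(f"duplicated barcode: {barcode}")
        seen.add(barcode)
        writer.writerow(row)
    return buffer.getvalue()
```

Writing the returned text to antibody_index.csv gives a file in the two- or three-column layout shown in the examples above.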

Sample sheet

  1. Reference column.

    This column is not used for extracting feature-barcode count matrix. To be consistent, you can put the reference for the associated scRNA-seq assay here.

  2. Chemistry column.

    The following keywords are accepted for Chemistry column:

    Chemistry | Explanation
    auto | Default. Auto-detect the chemistry of your data from all possible 10x assay types.
    threeprime | Auto-detect the chemistry of your data from all 3’ assay types.
    fiveprime | Auto-detect the chemistry of your data from all 5’ assay types.
    SC3Pv4 | Single Cell 3’ v4. The workflow will auto-detect if Poly-A or CS1 capture method was applied to your data. Notice: This is a GEM-X chemistry, and only works for Cell Ranger v8.0.0+
    SC3Pv3 | Single Cell 3′ v3. This is a Next GEM chemistry. The workflow will auto-detect if Poly-A or CS1 capture method was applied to your data.
    SC3Pv2 | Single Cell 3′ v2
    SC5Pv3 | Single Cell 5’ v3. Notice: This is a GEM-X chemistry, and only works for Cell Ranger v8.0.0+
    SC5Pv2 | Single Cell 5′ v2
    multiome | 10x Multiome barcodes

Note

Not all 10x chemistry names are supported for feature barcoding, as the workflow uses Cumulus Feature Barcoding to process the data.

  3. DataType column.

    The following keywords are accepted for DataType column:

    DataType

    Explanation

    citeseq

    CITE-seq

    hashing

    Cell or nucleus hashing

    cmo

    CellPlex

    adt

    Hashing and CITE-seq are in the same library

    crispr

    Perturb-seq/CROP-seq
    If neither crispr_barcode_pos nor scaffold_sequence (see Workflow input) is set, crispr refers to 10x CRISPR assays. If, in addition, Chemistry is set to SC3Pv3 or one of its aliases, Cumulus automatically complements the middle two bases to convert 10x feature barcoding cell barcodes back to 10x RNA cell barcodes.
    Otherwise, crispr refers to non-10x CRISPR assays, such as CROP-Seq. In this case, we assume the feature barcoding cell barcodes are the same as the RNA cell barcodes, and no cell barcode conversion is performed.

  4. AuxFile column.

    Put cloud URI of the feature barcode file here.
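The cell barcode conversion mentioned for crispr with SC3Pv3 (complementing the middle two bases) can be illustrated with a short sketch. It assumes 16-base cell barcodes whose "middle two bases" sit at 0-based positions 7 and 8; both the positions and the function name to_rna_barcode are assumptions made for illustration, not taken from the Cumulus source.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_rna_barcode(feature_barcoding_barcode, positions=(7, 8)):
    """Complement the bases at the given 0-based positions of a cell barcode."""
    bases = list(feature_barcoding_barcode)
    for position in positions:
        bases[position] = COMPLEMENT[bases[position]]
    return "".join(bases)
```

Because complementing is its own inverse, applying the function twice returns the original barcode, so the same helper maps in both directions.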

Below is an example sample sheet:

Sample,Reference,Flowcell,Chemistry,DataType,AuxFile
sample_1_rna,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna,
sample_1_adt,,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,threeprime,hashing,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
sample_2_gex,GRCh38-2024-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,auto,rna,
sample_2_adt,GRCh38-2024-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,SC3Pv3,adt,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index2.csv
sample_3_crispr,,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4/Fastq,fiveprime,crispr,gs://fc-e0000000-0000-0000-0000-000000000000/crispr_index.csv

In the sample sheet above, excluding the header row,

  • Rows 1 and 2 specify the GEX and Hashing libraries of the same sample.

  • Rows 3 and 4 specify a sample which has GEX and adt (containing both Hashing and CITE-Seq data) libraries.

  • Row 5 describes gRNA guide data for Perturb-seq (see crispr in the DataType field).

Workflow input

For feature barcoding data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cumulus adt. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name

Description

Example

Default

input_csv_file

Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType, and AuxFile)

“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”

output_directory

Output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”

crispr_barcode_pos

Barcode start position at Read 2 (0-based coordinate) for CRISPR

19

0

scaffold_sequence

Scaffold sequence in sgRNA for Perturb-seq, only used for crispr data type.

“GTTTAAGAGCTAAGCTGGAA”

“”

max_mismatch

Maximum hamming distance in feature barcodes for the adt task (changed to 2 as default)

2

2

min_read_ratio

Minimum read count ratio (non-inclusive) to justify a feature given a cell barcode and feature combination, only used for the adt task and crispr data type

0.1

0.1

cumulus_feature_barcoding_version

Cumulus_feature_barcoding version for extracting feature barcode matrix.

“1.0.0”

“1.0.0”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

zones

Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

feature_num_cpu

Number of cpus for extracting feature count matrix

4

4

feature_memory

Optional memory string for extracting feature count matrix

“32G”

“32G”

feature_disk_space

Disk space in GB needed for extracting feature count matrix

100

100

preemptible

Number of preemptible tries. Only works for GCP

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. Only works for AWS

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”

Parameters used for feature count matrix extraction

Cell barcode inclusion lists (previously known as whitelists) are automatically decided based on the Chemistry specified in the sample sheet. The association table is here.

Cell barcode matching settings are also automatically decided based on the chemistry specified:

  • For 10x V3 and V4 chemistry: a hamming distance of 0 is allowed for matching cell barcodes, and the UMI length is 12;

  • For multiome: a hamming distance of 1 is allowed for matching cell barcodes, and the UMI length is 12;

  • For 10x V2 chemistry: a hamming distance of 1 is allowed for matching cell barcodes, and the UMI length is 10.

For Perturb-seq data, a small number of sgRNA protospacer sequences are sequenced ultra-deeply, which may produce PCR chimeric reads. Therefore, we also generate filtered feature count matrices in a data-driven manner:

  1. First, plot the histogram of UMIs with certain number of read counts. The number of UMIs with x supporting reads decreases when x increases. We start from x = 1, and a valley between two peaks is detected if we find count[x] < count[x + 1] < count[x + 2]. We filter out all UMIs with < x supporting reads since they are likely formed due to chimeric reads.

  2. In addition, we also filter out barcode-feature-UMI combinations that have their read count ratio, which is defined as total reads supporting barcode-feature-UMI over total reads supporting barcode-UMI, no larger than min_read_ratio parameter set above.
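The two filtering steps above can be sketched in Python as follows. This is a literal reading of the description (the first x >= 1 with count[x] < count[x + 1] < count[x + 2], and a strict read-ratio comparison), not the actual Cumulus Feature Barcoding implementation; the function names are hypothetical.

```python
from collections import Counter


def find_read_threshold(reads_per_umi):
    """Return the smallest x >= 1 with count[x] < count[x+1] < count[x+2].

    reads_per_umi holds one supporting-read count per UMI. UMIs with fewer
    than x reads would be filtered out. Returns None if no valley is found.
    """
    count = Counter(reads_per_umi)  # count[x] = number of UMIs with x reads
    max_reads = max(count, default=0)
    for x in range(1, max_reads - 1):
        if count[x] < count[x + 1] < count[x + 2]:
            return x
    return None


def passes_read_ratio(feature_reads, barcode_umi_reads, min_read_ratio=0.1):
    """Keep a barcode-feature-UMI combination only if its ratio exceeds the minimum."""
    return feature_reads / barcode_umi_reads > min_read_ratio
```

For a histogram with 100, 40, 10, 5, 8 and 12 UMIs at 1 to 6 supporting reads, the first valley is at x = 4, so UMIs with fewer than 4 reads would be dropped.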

Workflow outputs

The table below lists important feature barcoding output when using Cumulus Feature Barcoding:

Name | Type | Description
cumulus_adt.output_count_directory | Array[String] | Subworkflow output. A list of cloud URIs containing feature-barcode count matrices, one URI per sample.

In addition, for each antibody tag or crispr tag sample, a folder named after the sample ID is generated under output_directory. This folder contains two files: sample_id.csv and sample_id.stat.csv.gz.

sample_id.csv is the feature count matrix. It has the following format. The first line describes the column names: Antibody/CRISPR,cell_barcode_1,cell_barcode_2,...,cell_barcode_n. The following lines describe UMI counts for each feature barcode, with the following format: feature_name,umi_count_1,umi_count_2,...,umi_count_n.

sample_id.stat.csv.gz stores the gzipped sufficient statistics. It has the following format. The first line describes the column names: Barcode,UMI,Feature,Count. The following lines describe the read counts for every barcode-umi-feature combination.
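Given the layout described above, sample_id.csv can be read with the standard csv module. The sketch below is a minimal reader with a hypothetical function name; the tiny matrix in the usage note is made-up illustration data, with barcodes shortened for readability.

```python
import csv
import io


def read_count_matrix(handle):
    """Parse a feature count matrix into (barcodes, {feature: {barcode: umis}})."""
    reader = csv.reader(handle)
    header = next(reader)  # first cell is the Antibody/CRISPR label
    barcodes = header[1:]
    counts = {}
    for row in reader:
        if not row:
            continue  # tolerate trailing blank lines
        counts[row[0]] = dict(zip(barcodes, (int(value) for value in row[1:])))
    return barcodes, counts
```

For example, read_count_matrix(io.StringIO("Antibody,AAAC,AAAG\nCD3,5,0\nCD8,1,7\n")) yields the barcodes AAAC and AAAG and a UMI count of 7 for CD8 in AAAG. The gzipped statistics file can be opened in the same way by passing gzip.open(path, "rt") to csv.reader.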

If the feature barcode file has a third column, two files are generated for each feature type listed in that column. For example, if hashing is present, sample_id.hashing.csv and sample_id.hashing.stat.csv.gz will be generated.

sample_id.report.txt is a summary report in TXT format. The first lines describe the total number of reads parsed, the number of reads with valid cell barcodes (and percentage over all parsed reads), the number of reads with valid feature barcodes (and percentage over all parsed reads) and the number of reads with both valid cell and feature barcodes (and percentage over all parsed reads). It is then followed by sections describing each feature type. In each section, 7 lines are shown: section title, number of valid cell barcodes (with matching cell barcode and feature barcode) in this section, number of reads for these cell barcodes, mean number of reads per cell barcode, number of UMIs for these cell barcodes, mean number of UMIs per cell barcode and sequencing saturation.

If data type is crispr, three additional files, sample_id.umi_count.pdf, sample_id.filt.csv and sample_id.filt.stat.csv.gz, are generated.

sample_id.umi_count.pdf plots the number of UMIs against the number of reads supporting each UMI, coloring UMIs with a high likelihood of being chimeric in blue and other UMIs in red. This plot is generated purely based on the number of reads each UMI has. For better visualization, we do not show UMIs with > 50 read counts (rare in data).

sample_id.filt.csv is the filtered feature count matrix. It has the same format as sample_id.csv.

sample_id.filt.stat.csv.gz is the filtered sufficient statistics. It has the same format as sample_id.stat.csv.gz.


Single-cell immune profiling

To process single-cell immune profiling (scIR-seq) data, follow the specific instructions below.

Sample sheet

  1. Reference column.

    Pre-built scIR-seq references are summarized below.

    Keyword | Description
    GRCh38_vdj_v7.1.0 | Human GRCh38 V(D)J sequences, cellranger reference 7.1.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
    GRCh38_vdj_v7.0.0 | Human GRCh38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
    GRCm38_vdj_v7.0.0 | Mouse GRCm38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf

  2. Chemistry column.

    This column is not used for scIR-seq data. Put fiveprime here as a placeholder if you decide to include the Chemistry column.

  3. DataType column.

    Choose one of the available types below:

    • vdj: The VDJ library. Let the workflow auto-detect the chain type.

    • vdj_t: The VDJ-T library for T-cell receptor sequences.

    • vdj_b: The VDJ-B library for B-cell receptor sequences.

    • vdj_t_gd: The VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains.

  4. AuxFile column.

    Only needed for vdj_t_gd type samples, which use primer sequences to enrich cDNA for V(D)J sequences. In this case, provide a .txt file containing these sequences, one per line. This file is then passed to the --inner-enrichment-primers option of cellranger vdj.

Note

The --chain option in cellranger vdj is automatically decided based on the DataType value specified:
  • For vdj: set to --chain auto

  • For vdj_t and vdj_t_gd: set to --chain TR

  • For vdj_b: set to --chain IG
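The mapping in the note above amounts to a small lookup table. The sketch below restates it in Python; the names CHAIN_BY_DATATYPE and vdj_chain_args are hypothetical and only mirror the three rules listed.

```python
# DataType value from the sample sheet -> value passed to `cellranger vdj --chain`
CHAIN_BY_DATATYPE = {
    "vdj": "auto",
    "vdj_t": "TR",
    "vdj_t_gd": "TR",
    "vdj_b": "IG",
}


def vdj_chain_args(data_type):
    """Return the --chain arguments implied by a V(D)J DataType value."""
    if data_type not in CHAIN_BY_DATATYPE:
        raise ValueError(f"not a V(D)J data type: {data_type!r}")
    return ["--chain", CHAIN_BY_DATATYPE[data_type]]
```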

An example sample sheet is below:

Sample,Reference,Flowcell,Chemistry,DataType,AuxFile
sample1,GRCh38_vdj_v7.1.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,fiveprime,vdj,
sample2,GRCh38_vdj_v7.1.0,gs://my-bucket/s2_fastqs,,vdj_t_gd,gs://my-bucket/s2_enrich_primers.txt

Workflow input

For scIR-seq data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger vdj. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name

Description

Example

Default

input_csv_file

Sample Sheet (contains Sample, Reference, Flowcell, DataType, Chemistry, and AuxFile)

“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”

output_directory

Output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”

vdj_denovo

Do not align reads to reference V(D)J sequences before de novo assembly

false

false

cellranger_version

cellranger version, could be: 9.0.1, 8.0.1, 7.2.0

“9.0.1”

“9.0.1”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

acronym_file

The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names.
Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.

“s3://xxxx/index.tsv”

“gs://cumulus-ref/resources/cellranger/index.tsv”

zones

Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

num_cpu

Number of cpus to request for one node for cellranger vdj

32

32

memory

Memory size string for cellranger vdj

“120G”

“120G”

vdj_disk_space

Disk space in GB needed for cellranger vdj

500

500

preemptible

Number of preemptible tries. Only works for GCP

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. Only works for AWS

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”

Workflow output

See the table below for important scIR-seq outputs.

Name | Type | Description
cellranger_vdj.output_count_directory | Array[String] | Subworkflow output. A list of cloud URIs containing vdj results, one URI per sample.
cellranger_vdj.output_web_summary | Array[File] | Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger vdj output).
collect_summaries_vdj.metrics_summaries | File | Task output. An excel spreadsheet containing QCs for each sample.


Single-cell ATAC-seq

To process scATAC-seq data, follow the specific instructions below.

Sample sheet

  1. Reference column.

    Pre-built scATAC-seq references are summarized below.

    Keyword | Description
    GRCh38-2020-A_arc_v2.0.0 | Human GRCh38, cellranger-arc/atac reference 2.0.0
    mm10-2020-A_arc_v2.0.0 | Mouse mm10, cellranger-arc/atac reference 2.0.0
    GRCh38_and_mm10-2020-A_atac_v2.0.0 | Human GRCh38 and mouse mm10, cellranger-atac reference 2.0.0

  2. Chemistry column.

    The default is auto, which does not specify a chemistry. To analyze just the individual ATAC library from a 10x multiome assay using cellranger-atac count, use ARC-v1 in the Chemistry column.

  3. DataType column.

    Set it to atac.

  4. AuxFile column.

    Leave it blank for scATAC-seq.

An example sample sheet is below:

Sample,Reference,Flowcell,DataType
sample_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9YB/Fastq,atac

Workflow input

cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger-atac count. Please see the description of inputs below. Note that required inputs are shown in bold.

Name

Description

Example

Default

input_csv_file

Sample Sheet (contains Sample, Reference, Flowcell, DataType, and Chemistry)

“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”

output_directory

Output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_output”

force_cells

Force pipeline to use this number of cells, bypassing the cell detection algorithm

6000

atac_dim_reduce

Choose the algorithm for dimensionality reduction prior to clustering and tsne: “lsa”, “plsa”, or “pca”

“lsa”

“lsa”

peaks

A 3-column BED file of peaks to override cellranger atac peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with # are allowed

“gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”

cellranger_atac_version

cellranger-atac version. Available options: 2.1.0, 2.0.0

“2.1.0”

“2.1.0”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

acronym_file

The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names.
Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.

“s3://xxxx/index.tsv”

“gs://cumulus-ref/resources/cellranger/index.tsv”

zones

Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

atac_num_cpu

Number of cpus for cellranger-atac count

64

64

atac_memory

Memory string for cellranger-atac count

“57.6G”

“57.6G”

atac_disk_space

Disk space in GB needed for cellranger-atac count

500

500

preemptible

Number of preemptible tries. Only works for GCP

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. Only works for AWS

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”
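
The peaks input above expects a 3-column BED file that is sorted by position, has no overlapping peaks, and may contain comment lines starting with #. A minimal pre-upload check might look like the sketch below. The function name check_peaks_bed is hypothetical, and the sketch assumes that "sorted by position" means ascending start coordinates within each contig with each contig's peaks listed contiguously; it is not the validation performed by cellranger-atac itself.

```python
def check_peaks_bed(lines):
    """Return a list of problems found in 3-column BED lines (hypothetical helper)."""
    problems = []
    previous = None  # (chrom, start, end) of the last peak seen
    seen_chroms = set()
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        # Comment lines are allowed; blank lines are skipped (an assumption).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(f"line {number}: expected 3 columns, got {len(fields)}")
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if previous is not None and previous[0] == chrom:
            if start < previous[1]:
                problems.append(f"line {number}: not sorted by position")
            elif start < previous[2]:
                # BED intervals are half-open, so start == previous end is fine.
                problems.append(f"line {number}: overlaps previous peak")
        elif chrom in seen_chroms:
            problems.append(f"line {number}: contig {chrom} is not contiguous")
        seen_chroms.add(chrom)
        previous = (chrom, start, end)
    return problems


example = ["# common peaks", "chr1\t100\t200", "chr1\t200\t300", "chr2\t50\t80"]
print(check_peaks_bed(example))
```

An empty list means no problem was found under these assumptions.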

Workflow output

See the table below for important scATAC-seq outputs.

Name

Type

Description

cellranger_atac_count.output_count_directory

Array[String]

Subworkflow output. A list of cloud URIs containing cellranger-atac count outputs, one URI per sample.

cellranger_atac_count.output_web_summary

Array[File]

Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger-atac count output).

collect_summaries_atac.metrics_summaries

File

Task output. An Excel spreadsheet containing QCs for each sample.


Single-cell Multiome (GEX + ATAC)

To process 10x Multiome (GEX + ATAC) data, follow the instructions below:

Sample sheet

  1. Reference column.

    Pre-built single-cell Multiome ATAC + Gene Expression references are summarized below.

    Keyword

    Description

    GRCh38-2020-A_arc_v2.0.0

    Human GRCh38 sequences (GENCODE v32/Ensembl 98), cellranger arc reference 2.0.0

    mm10-2020-A_arc_v2.0.0

    Mouse GRCm38 sequences (GENCODE vM23/Ensembl 98), cellranger arc reference 2.0.0

  2. Chemistry column.

    The default is auto, in which case no specific chemistry is set.

  3. DataType column.

    For each sample, choose a data type from the table below:

    DataType

    Description

    rna

    For scRNA-Seq modality of the data

    atac

    For scATAC-Seq modality of the data

  4. AuxFile column.

    Leave it blank.

  5. Link column.

    Assign the same Link name to all modalities that should be processed together. Notice: The Link name must be different from all Sample column values.

  6. Example:

    Link,Sample,Reference,Flowcell,DataType
    sample1,s1_rna,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,rna
    sample1,s1_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq,atac
    

In the above example, the linked samples will be processed together, and the output will be one subfolder named sample1.
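
The linking rule can be sketched in a few lines of Python: rows sharing a Link value form one group, and a Link name may not coincide with any Sample name. The function name group_by_link is a hypothetical illustration, not part of the workflow.

```python
import csv
import io
from collections import defaultdict


def group_by_link(sample_sheet_text):
    """Group (Sample, DataType) pairs by Link and reject Link/Sample name clashes."""
    rows = list(csv.DictReader(io.StringIO(sample_sheet_text)))
    sample_names = {row["Sample"] for row in rows}
    groups = defaultdict(list)
    for row in rows:
        link = row["Link"]
        if link in sample_names:
            raise ValueError(f"Link name {link!r} is also used as a Sample name")
        groups[link].append((row["Sample"], row["DataType"]))
    return dict(groups)


flowcell = "gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ/Fastq"
example = "\n".join([
    "Link,Sample,Reference,Flowcell,DataType",
    f"sample1,s1_rna,GRCh38-2020-A_arc_v2.0.0,{flowcell},rna",
    f"sample1,s1_atac,GRCh38-2020-A_arc_v2.0.0,{flowcell},atac",
])
print(group_by_link(example))
```

For the example sheet this yields a single group, sample1, containing the rna and atac samples.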

Workflow input

For single-cell multiomics data, cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files). Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name

Description

Example

Default

input_csv_file

Sample Sheet (contains Sample, Reference, Flowcell, Chemistry, DataType, and Link)

“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”

output_directory

Output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”

include_introns

Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references.

true

true

no_bam

Turn this option on to disable BAM file generation.

false

false

arc_gex_exclude_introns

Disable counting of intronic reads. In this mode, only reads that are exonic and compatible with annotated splice junctions in the reference are counted.
Note: using this mode will reduce the UMI counts in the feature-barcode matrix.

false

false

arc_min_atac_count

Cell caller override to define the minimum number of ATAC transposition events in peaks (ATAC counts) for a cell barcode.
Note: this input must be specified in conjunction with arc_min_gex_count input.
With both inputs set, a barcode is defined as a cell if it contains at least arc_min_atac_count ATAC counts AND at least arc_min_gex_count GEX UMI counts.

100

arc_min_gex_count

Cell caller override to define the minimum number of GEX UMI counts for a cell barcode.
Note: this input must be specified in conjunction with arc_min_atac_count. See the description of arc_min_atac_count input for details.

200

peaks

A 3-column BED file of peaks to override cellranger arc peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with # are allowed

“gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”

cellranger_arc_version

cellranger-arc version, could be: 2.0.2.strato (compatible with workflow v2.6.1+), 2.0.2.custom-max-cell (with max_cell threshold set to 80,000), 2.0.2 (compatible with workflow v2.6.0 or earlier), 2.0.1, 2.0.0

“2.0.2.strato”

“2.0.2.strato”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

acronym_file

The link/path of an index file in TSV format for fetching preset genome references, chemistry barcode inclusion lists, etc. by their names.
Set a GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines.

“s3://xxxx/index.tsv”

“gs://cumulus-ref/resources/cellranger/index.tsv”

zones

Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

arc_num_cpu

Number of cpus to request for one link

64

64

arc_memory

Memory size string for one link

“160G”

“160G”

arc_disk_space

Disk space in GB needed for one link

700

700

preemptible

Number of preemptible tries. Only works for GCP

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. Only works for AWS

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”
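
The arc_min_atac_count and arc_min_gex_count inputs described above must be given together, and a barcode counts as a cell only when both thresholds are met. The sketch below restates that rule in Python; the function names are hypothetical and the code illustrates the documented rule, not the internals of cellranger-arc.

```python
def check_arc_min_counts(arc_min_atac_count=None, arc_min_gex_count=None):
    """Raise if only one of the two cell caller overrides is set."""
    if (arc_min_atac_count is None) != (arc_min_gex_count is None):
        raise ValueError(
            "arc_min_atac_count and arc_min_gex_count must be specified together"
        )


def is_cell(atac_count, gex_umi_count, arc_min_atac_count, arc_min_gex_count):
    """A barcode is a cell if it reaches both thresholds (logical AND)."""
    return atac_count >= arc_min_atac_count and gex_umi_count >= arc_min_gex_count


check_arc_min_counts(arc_min_atac_count=100, arc_min_gex_count=200)
print(is_cell(150, 250, 100, 200))  # both thresholds reached
print(is_cell(150, 100, 100, 200))  # too few GEX UMI counts
```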

Workflow output

See the table below for important output:

Name

Type

Description

cellranger_arc_count.output_count_directory

Array[String]

A list of cloud URIs to output, one URI per link

cellranger_arc_count.output_web_summary

Array[File]

A list of htmls visualizing QCs for each link

collect_summaries_arc.metrics_summaries

File

An Excel spreadsheet containing QCs for each link


Flex, Sample Multiplexing and Multiomics

The cellranger workflow supports processing data of 10x Flex and Sample Multiplexing type, as well as multiomics data. Follow the corresponding sections below based on your data type:

Flex Gene Expression

This section covers preparing the sample sheet for Flex (previously named Fixed RNA Profiling) data.

  1. Sample and Link column.

Sample column is for specifying the name of each sample in your data. Sample names must be unique within the sample sheet.

Link column is for specifying the name of your whole data, so that the workflow knows which samples should be put together to run cellranger multi.

  • Notice 1: Use one Link name shared by all samples belonging to the same data/experiment. Moreover, the Link name must be different from all Sample names.

  • Notice 2: If the data contain only one scRNA-Seq sample, you don’t need to specify a Link name. The workflow then uses its Sample name for the whole data.

  2. DataType, Reference, and AuxFile column.

For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:

DataType

Reference

AuxFile

Description

frp

Select one from prebuilt genome references in scRNA-seq section,
or provide a cloud URI of a custom reference in .tar.gz format.

Path to a text file including the sample name to Flex probe barcode association (see an example below this table).

For RNA-Seq samples

Choose one from: citeseq, crispr

No need to specify a reference

Path to its feature reference file in 10x Feature Reference format. Notice: If there are multiple antibody capture samples, you need to combine the feature barcodes used in all of them into one reference file.

For antibody capture samples:

  • citeseq: For CITE-Seq samples.

  • crispr: For Perturb-Seq samples. Notice: This data type used in Flex is supported only in Cell Ranger v8.0+.

An example sample name to Flex probe barcode association file is the following (see here for examples of different Flex experiment settings):

sample_id,probe_barcode_ids,description
sample1,BC001,Control
sample2,BC002,Treated

The description column is optional, which specifies the description of the samples.

Note

In the sample name to Flex probe barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, probe_barcode_ids, and description (optional).
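
The optional-header rule can be illustrated with a small Python reader. This is only a sketch: the function name read_flex_barcode_file is hypothetical, and detecting the header by checking whether the first cell equals sample_id is an assumption made for illustration, not necessarily how the workflow does it.

```python
import csv
import io

# Fixed column order that applies when the header line is omitted.
FLEX_COLUMNS = ["sample_id", "probe_barcode_ids", "description"]


def read_flex_barcode_file(text):
    """Parse a sample-to-probe-barcode file with or without a header line."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return []
    if rows[0][0] == "sample_id":
        header, data = rows[0], rows[1:]
    else:
        header, data = FLEX_COLUMNS, rows
    # zip() stops at the shorter sequence, so a missing description is simply left out.
    return [dict(zip(header, row)) for row in data]


with_header = "sample_id,probe_barcode_ids,description\nsample1,BC001,Control\nsample2,BC002,Treated\n"
without_header = "sample1,BC001\nsample2,BC002\n"
print(read_flex_barcode_file(with_header))
print(read_flex_barcode_file(without_header))
```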

Below is an example sample sheet for Flex data:

Sample,Reference,Flowcell,DataType,AuxFile
s1,GRCh38-2020-A,gs://my-bucket/s1_fastqs,frp,gs://my-bucket/s1_flex.csv

Notice that Link column is not required for this case.

An example sample sheet for more complex Flex data:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,frp,gs://my-bucket/s2_flex.csv
s2,s2_citeseq,,gs://my-bucket/s2_fastqs,citeseq,gs://my-bucket/s2_fbc.csv
s2,s2_crispr,,gs://my-bucket/s2_fastqs,crispr,gs://my-bucket/s2_fbc.csv

  3. Flex Probe Set.

Flex uses probes that target protein-coding genes in the human or mouse transcriptome. The probe set is determined automatically from the genome reference specified for the scRNA-Seq sample, according to the table below:

Genome Reference    Probe Set                  Cell Ranger version
GRCh38-2024-A       Flex_human_probe_v1.1      v9.0+
GRCh38-2020-A       Flex_human_probe_v1.0.1    v7.1+
GRCm39-2024-A       Flex_mouse_probe_v1.1      v9.0+
mm10-2020-A         Flex_mouse_probe_v1.0.1    v7.1+

See Flex probe sets overview for details on these probe sets.
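
The reference-to-probe-set mapping can be written as a small lookup, shown here as a Python sketch. The dictionary restates the values listed in this section; the function names and the version comparison are illustrative assumptions rather than the workflow's actual code.

```python
# Genome reference -> (probe set, minimum Cell Ranger version), as listed in this section.
FLEX_PROBE_SETS = {
    "GRCh38-2024-A": ("Flex_human_probe_v1.1", "9.0"),
    "GRCh38-2020-A": ("Flex_human_probe_v1.0.1", "7.1"),
    "GRCm39-2024-A": ("Flex_mouse_probe_v1.1", "9.0"),
    "mm10-2020-A": ("Flex_mouse_probe_v1.0.1", "7.1"),
}


def version_tuple(version):
    """Turn '9.0.1' into (9, 0, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def select_probe_set(reference, cellranger_version):
    """Return the probe set for a reference, checking the minimum version."""
    probe_set, minimum = FLEX_PROBE_SETS[reference]
    if version_tuple(cellranger_version) < version_tuple(minimum):
        raise ValueError(
            f"{probe_set} needs Cell Ranger v{minimum}+, got {cellranger_version}"
        )
    return probe_set


print(select_probe_set("GRCh38-2024-A", "9.0.1"))
print(select_probe_set("mm10-2020-A", "7.2.0"))
```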

On Chip Multiplexing

This section covers preparing the sample sheet for On-Chip Multiplexing (OCM) data.

  1. Sample and Link column.

Sample column is for specifying the name of each sample in your data. Sample names must be unique within the sample sheet.

Link column is for specifying the name of your whole data, so that the workflow knows which samples should be put together to run cellranger multi. Notice: Use one Link name shared by all samples belonging to the same data/experiment. Moreover, the Link name must be different from all Sample names.

  2. DataType, Reference, and AuxFile column.

For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:

DataType

Reference

AuxFile

Description

rna

Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format.

Path to a text file including the sample name to OCM barcode association (see an example below this table).

For RNA-Seq samples

Choose one from: vdj, vdj_t, vdj_b, vdj_t_gd

Select one from prebuilt VDJ references in Single-cell immune profiling section.

Optional. For vdj_t_gd type samples only: path to a text file containing inner enrichment primers info. This is the inner-enrichment-primers option in VDJ section of Cell Ranger multi config CSV.

For each VDJ sample, choose one from the 4 provided VDJ data types:

  • vdj: Leave the workflow to auto-detect.

  • vdj_t: VDJ-T library for T-cell receptor sequences.

  • vdj_b: VDJ-B library for B-cell receptor sequences.

  • vdj_t_gd: VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains. Notice: For such a sample, a text file containing inner enrichment primers info must be provided in the AuxFile column.

Choose one from: citeseq, adt

No need to specify a reference

Path to its feature reference file in 10x Feature Reference format. Notice: For the adt type, you need to combine feature barcodes of both CITE-Seq and Hashing modalities in one file.

For antibody capture samples:

  • citeseq: For samples only containing CITE-Seq modality.

  • adt: For samples containing both CITE-Seq and Hashing modalities.

An example sample name to OCM barcode association file is the following:

sample_id,ocm_barcode_ids,description
sample1,OB1,Control
sample2,OB2,Treated

The description column is optional and gives a description of each sample.

Note

In the sample name to OCM barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, ocm_barcode_ids, and description (optional).

Below is an example sample sheet for OCM:

Sample,Reference,Flowcell,DataType,AuxFile,Link
s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_ocm.csv,s1
s1_vdj,GRCh38_vdj_v7.1.0,gs://my-bucket/s1_fastqs,vdj,,s1
s1_adt,,gs://my-bucket/s1_fastqs,citeseq,gs://my-bucket/s1_fbc.csv,s1

If your data contain only an scRNA-Seq library, the Link column is optional:

Sample,Reference,Flowcell,DataType,AuxFile
s2,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_ocm.csv

Hashing with Antibody Capture

This section covers preparing the sample sheet for non-OCM hashtag oligo (HTO) data.

  1. Sample and Link column.

Sample column is for specifying the name of each sample in your data. Sample names must be unique within the sample sheet.

Link column is for specifying the name of your whole data, so that the workflow knows which samples should be put together to run cellranger multi. Notice: Use one Link name shared by all samples belonging to the same data/experiment. Moreover, the Link name must be different from all Sample names.

  2. DataType, Reference, and AuxFile column.

For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:

DataType

Reference

AuxFile

Description

rna

Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format.

Path to a text file including the sample name to HTO barcode association (see an example below this table).

For RNA-Seq samples

Choose one from: vdj, vdj_t, vdj_b, vdj_t_gd

Select one from prebuilt VDJ references in Single-cell immune profiling section.

Optional. For vdj_t_gd type samples only: path to a text file containing inner enrichment primers info. This is the inner-enrichment-primers option in VDJ section of Cell Ranger multi config CSV.

For each VDJ sample, choose one from the 4 provided VDJ data types:

  • vdj: Leave the workflow to auto-detect.

  • vdj_t: VDJ-T library for T-cell receptor sequences.

  • vdj_b: VDJ-B library for B-cell receptor sequences.

  • vdj_t_gd: VDJ-T-GD library for T-cell receptor enriched for gamma (TRG) and delta (TRD) chains. Notice: For such a sample, a text file containing inner enrichment primers info must be provided in the AuxFile column.

hashing

No need to specify a reference

Path to its feature reference file of 10x Feature Reference format, which specifies the oligonucleotide sequences used in the data.

For antibody capture samples

An example sample name to HTO barcode association file is the following:

sample_id,hashtag_ids,description
sample1,TotalSeqB_Hashtag_1,Control
sample2,CD3_TotalSeqB,Treated

where names in hashtag_ids column must be consistent with id column in the feature reference file. The description column is optional, which specifies the description of the samples.

Note

In the sample name to HTO barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, hashtag_ids, and description (optional).
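
Because the names in hashtag_ids must match the id column of the feature reference file, it can help to cross-check the two files before launching the workflow. The sketch below is a hypothetical helper, not part of the workflow. It assumes both files carry a header line and that several hashtags for one sample are separated by |. The feature reference row is illustrative only, and its barcode sequence is a made-up placeholder.

```python
import csv
import io


def missing_hashtags(association_text, feature_ref_text):
    """Return (sample_id, hashtag) pairs whose hashtag is absent from the feature reference."""
    known_ids = {row["id"] for row in csv.DictReader(io.StringIO(feature_ref_text))}
    missing = []
    for row in csv.DictReader(io.StringIO(association_text)):
        # Assumption: multiple hashtags per sample are separated by "|".
        for hashtag in row["hashtag_ids"].split("|"):
            if hashtag not in known_ids:
                missing.append((row["sample_id"], hashtag))
    return missing


association = "\n".join([
    "sample_id,hashtag_ids,description",
    "sample1,TotalSeqB_Hashtag_1,Control",
    "sample2,CD3_TotalSeqB,Treated",
])
# Illustrative feature reference; the sequence below is a placeholder, not a real barcode.
feature_ref = "\n".join([
    "id,name,read,pattern,sequence,feature_type",
    "TotalSeqB_Hashtag_1,TotalSeqB_Hashtag_1,R2,5PNNNNNNNNNN(BC),ACGTACGTACGTACG,Antibody Capture",
])
print(missing_hashtags(association, feature_ref))
```

Here sample2 is reported because CD3_TotalSeqB has no matching id in the placeholder feature reference.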

Below is an example sample sheet for HTO:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s1,s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_hto.csv
s1,s1_vdj,GRCh38_vdj_v7.1.0,gs://my-bucket/s1_fastqs,vdj,
s1,s1_hto,,gs://my-bucket/s1_fastqs,hashing,gs://my-bucket/s1_fbc_ref.csv

Or if your data contain only scRNA-Seq and antibody capture libraries:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_hto.csv
s2,s2_hto,,gs://my-bucket/s2_fastqs,hashing,gs://my-bucket/s2_fbc_ref.csv

Cell Multiplexing with CMO (CellPlex)

This section covers preparing the sample sheet for CellPlex data using Cell Multiplexing Oligos (CMO).

  1. Sample and Link column.

Sample column is for specifying the name of each sample in your data. Sample names must be unique within the sample sheet.

Link column is for specifying the name of your whole data, so that the workflow knows which samples should be put together to run cellranger multi. Notice: Use one Link name shared by all samples belonging to the same data/experiment. Moreover, the Link name must be different from all Sample names.

  2. DataType, Reference, and AuxFile column.

For each sample, choose a data type from the table below, and prepare its corresponding auxiliary file if needed:

DataType

Reference

AuxFile

Description

rna

Select one from prebuilt genome references in scRNA-seq section, or provide a cloud URI of a custom reference in .tar.gz format.

Path to a text file including the sample name to CMO barcode association (see an example below this table).

For RNA-Seq samples

cmo

No need to specify a reference

Optional. If using custom CMOs, provide the path to their cmo-set reference file of 10x Feature Reference format. See here for an example.

For CMO samples.

citeseq

No need to specify a reference

Path to its feature reference file of 10x Feature Reference format.

For CITE-Seq samples.

An example sample name to CMO barcode association file is the following:

sample_id,cmo_ids,description
sample1,CMO301,Control
sample2,CMO302,Treated

If using a cmo-set reference file, the names in cmo_ids must be consistent with id column in the CMO reference file. The description column is optional, which specifies the description of the samples.

Note

In the sample name to CMO barcode file, the header line is optional. But if users don’t specify this header line, the order of columns must be fixed as sample_id, cmo_ids, and description (optional).

Below is an example sample sheet for CellPlex:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s1,s1_gex,GRCh38-2020-A,gs://my-bucket/s1_fastqs,rna,gs://my-bucket/s1_cmo.csv
s1,s1_cellplex,,gs://my-bucket/s1_fastqs,cmo,

Or if a CITE-Seq sample/library is also included in the data:

Link,Sample,Reference,Flowcell,DataType,AuxFile
s2,s2_gex,GRCh38-2020-A,gs://my-bucket/s2_fastqs,rna,gs://my-bucket/s2_cmo.csv
s2,s2_cellplex,,gs://my-bucket/s2_fastqs,cmo,
s2,s2_citeseq,,gs://my-bucket/s2_fastqs,citeseq,gs://my-bucket/s2_fbc.csv

Multiomics

To analyze multiomics (GEX + CITE-Seq/CRISPR) data, prepare your sample sheet as follows:

  1. Link column.

Use the same Link name for all modalities of the same data.

  2. Chemistry column.

The workflow supports all 10x assay configurations. The most widely used ones are listed below:

Chemistry     Explanation
auto          autodetection (default). If the index read has extra bases besides cell barcode and UMI, autodetection might fail. In this case, please specify the chemistry
threeprime    Single Cell 3′
fiveprime     Single Cell 5′
ARC-v1        Gene Expression portion of 10x Multiome data

Please refer to the section of --chemistry option in Cell Ranger Command Line Arguments for all other valid chemistry keywords.

  3. DataType column.

The following keywords are accepted for DataType column:

DataType    Explanation
rna         For scRNA-seq samples
citeseq     For CITE-seq samples
crispr      For 10x CRISPR samples

  4. AuxFile column.

Prepare your feature reference file in 10x Feature Reference format.

Notice: If multiple antibody samples are used, merge their feature barcodes into one feature reference file, and assign that file to each of the samples.

Below is an example sample sheet:

Link,Sample,Reference,DataType,Flowcell,Chemistry,AuxFile
sample_4,s4_gex,GRCh38-2020-A,rna,gs://my-bucket/s4_fastqs,auto,
sample_4,s4_citeseq,,citeseq,gs://my-bucket/s4_fastqs,SC3Pv4,gs://my-bucket/s4_feature_ref.csv

Here, by specifying sample_4 in Link column, the two modalities will be processed together. The output will be one subfolder named sample_4.
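
A quick consistency check of such a sample sheet might look like the Python sketch below. It assumes, based on the column descriptions above, that rna rows need a Reference and that citeseq and crispr rows need an AuxFile; the function name check_multiomics_rows is hypothetical and the check is not part of the workflow.

```python
import csv
import io

# DataType values that carry a feature reference file in the AuxFile column.
FEATURE_TYPES = {"citeseq", "crispr"}


def check_multiomics_rows(sample_sheet_text):
    """Return (Sample, problem) tuples; an empty list means no problem was found."""
    problems = []
    for row in csv.DictReader(io.StringIO(sample_sheet_text)):
        data_type = row["DataType"]
        if data_type == "rna" and not row["Reference"]:
            problems.append((row["Sample"], "rna sample without Reference"))
        elif data_type in FEATURE_TYPES and not row["AuxFile"]:
            problems.append((row["Sample"], "feature barcode sample without AuxFile"))
    return problems


example = "\n".join([
    "Link,Sample,Reference,DataType,Flowcell,Chemistry,AuxFile",
    "sample_4,s4_gex,GRCh38-2020-A,rna,gs://my-bucket/s4_fastqs,auto,",
    "sample_4,s4_citeseq,,citeseq,gs://my-bucket/s4_fastqs,SC3Pv4,gs://my-bucket/s4_feature_ref.csv",
])
print(check_multiomics_rows(example))
```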

Workflow Input

All the sample multiplexing assays share the same workflow input settings. cellranger_workflow takes sequencing reads as input (FASTQ files, or TAR files containing FASTQ files), and runs cellranger multi. Relevant workflow inputs are described below, with required inputs highlighted in bold:

Name

Description

Example

Default

input_csv_file

Sample Sheet (contains Link, Sample, Reference, DataType, Flowcell, and AuxFile columns)

“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”

output_directory

Output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”

include_introns

Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references

true

true

no_bam

Turn this option on to disable BAM file generation

false

false

force_cells

Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells

6000

expect_cells

Expected number of recovered cells. Mutually exclusive with force_cells

3000

secondary

Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.)

false

false

cellranger_version

Cell Ranger version to use. Available versions: 9.0.1, 8.0.1, 7.2.0.

“9.0.1”

“9.0.1”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

acronym_file

The link/path of an index file in TSV format for fetching preset genome references, probe set references, chemistry whitelists, etc. by their names.
Set a GS URI if running on GCP; an S3 URI if running on AWS; an absolute file path if running on HPC or local machines.

“s3://xxxx/index.tsv”

“gs://cumulus-ref/resources/cellranger/index.tsv”

zones

Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

num_cpu

Number of cpus to request per link

32

32

memory

Memory size string to request per link

“120G”

“120G”

multi_disk_space

Used by Flex and Sample Multiplexing data. Disk space in GB to request per link.

1500

1500

count_disk_space

Only used by Multiomics data. Disk space in GB to request per link

500

500

preemptible

Number of preemptible tries. This only works for GCP.

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. This only works for AWS.

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”
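
When the inputs above are prepared as a JSON file, the mutual exclusion of force_cells and expect_cells can be checked up front. The sketch below is a hypothetical helper; the cellranger_workflow. key prefix follows the common WDL convention of qualifying inputs with the workflow name and is an assumption here, so adjust it to match your setup.

```python
import json


def build_multi_inputs(input_csv_file, output_directory, force_cells=None, expect_cells=None):
    """Assemble a workflow inputs dictionary (key prefix is an assumption)."""
    if force_cells is not None and expect_cells is not None:
        raise ValueError("force_cells and expect_cells are mutually exclusive")
    inputs = {
        "cellranger_workflow.input_csv_file": input_csv_file,
        "cellranger_workflow.output_directory": output_directory,
    }
    # Optional inputs are written only when set, so workflow defaults apply otherwise.
    if force_cells is not None:
        inputs["cellranger_workflow.force_cells"] = force_cells
    if expect_cells is not None:
        inputs["cellranger_workflow.expect_cells"] = expect_cells
    return inputs


inputs = build_multi_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output",
    expect_cells=3000,
)
print(json.dumps(inputs, indent=2))
```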

Workflow Output

All the sample multiplexing assays share the same workflow output structure. See the table below for important outputs:

Name

Type

Description

cellranger_multi.output_multi_directory

Array[String]

Flex and Sample Multiplexing output. A list of cloud URIs to output folders, one URI per link.

cellranger_count_fbc.output_count_directory

Array[String]

Multiomics output. A list of cloud URIs to output folders, one URI per link.


Build Cell Ranger References

We provide routines wrapping Cell Ranger tools to build references for sc/snRNA-seq, scATAC-seq and single-cell immune profiling data.

Build references for sc/snRNA-seq

We provide a wrapper of cellranger mkref to build sc/snRNA-seq references. Please follow the instructions below.

1. Import cellranger_create_reference

Import cellranger_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket

Required data may include input sample sheet, genome FASTA files and gene annotation GTF files.

3. Input sample sheet

If multiple species are specified, a sample sheet in CSV format is required. We describe the sample sheet format below, with required columns highlighted in bold:

Column

Description

Genome

Genome name

Fasta

Location to the genome assembly in FASTA/FASTA.gz format

Genes

Location to the gene annotation file in GTF/GTF.gz format

Attributes

Optional. A list of key:value pairs separated by ;. If set, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details

Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.

Below is an example sample sheet for building a reference from two genomes:

Genome,Fasta,Genes,Attributes
GRCh38,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.gtf.gz,gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense
mm10,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.gtf.gz

If multiple species are specified, the reference will be built under the Genome names concatenated by ‘_and_’. In the above example, the reference is stored under ‘GRCh38_and_mm10’.
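
The naming rule and the Attributes format can be sketched in Python as follows. The function names are hypothetical. The --attribute=key:value form follows the 10x cellranger mkgtf documentation, but how the workflow assembles that command internally is not described here, so treat the last function as an illustration only.

```python
def combined_reference_name(genomes):
    """Join Genome names with '_and_', in sample sheet order."""
    return "_and_".join(genomes)


def parse_attributes(attributes):
    """Split 'key:value;key:value' into a list of (key, value) pairs."""
    return [tuple(pair.split(":", 1)) for pair in attributes.split(";") if pair]


def mkgtf_attribute_flags(attributes):
    """Render the pairs as --attribute flags for cellranger mkgtf (illustrative)."""
    return [f"--attribute={key}:{value}" for key, value in parse_attributes(attributes)]


print(combined_reference_name(["GRCh38", "mm10"]))
print(mkgtf_attribute_flags(
    "gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense"
))
```

An empty Attributes cell, as in the mm10 row of the example, yields no flags.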

4. Workflow input

Required inputs are highlighted in bold. Note that input_sample_sheet is mutually exclusive with input_fasta, input_gtf, genome and attributes.

Name

Description

Example

Default

input_sample_sheet

A sample sheet in CSV format that allows users to specify more than one genome to build references (e.g. human and mouse). If a sample sheet is provided, input_fasta, input_gtf, and attributes will be ignored.

“gs://fc-e0000000-0000-0000-0000-000000000000/input_sample_sheet.csv”

input_fasta

Input genome reference in either FASTA or FASTA.gz format

“gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz”

input_gtf

Input gene annotation file in either GTF or GTF.gz format

“gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz”

genome

Genome reference name. New reference will be stored in a folder named genome

refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0

output_directory

Output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_reference”

attributes

A list of key:value pairs separated by ;. If this option is not None, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details

“gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense”

pre_mrna

Set to true to build pre-mRNA references, in which full-length transcripts are used as exons in the annotation file. We follow 10x build Cell Ranger compatible pre-mRNA Reference Package to build pre-mRNA references

true

false

ref_version

reference version string

Ensembl v94

cellranger_version

cellranger version, could be: 9.0.1, 8.0.1, 7.2.0

“9.0.1”

“9.0.1”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

zones

Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

num_cpu

Number of cpus to request for one node for building indices

1

1

memory

Memory size string for cellranger mkref

“32G”

“32G”

disk_space

Optional disk space in GB

100

100

preemptible

Number of preemptible tries. Only works for GCP

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. Only works for AWS

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”

5. Workflow output

Name

Type

Description

output_reference

File

Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input.


Build references for scATAC-seq

We provide a wrapper of cellranger-atac mkref to build scATAC-seq references. Please follow the instructions below.

1. Import cellranger_atac_create_reference

Import cellranger_atac_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_atac_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_atac_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket

Required data include config JSON file, genome FASTA file, gene annotation file (GTF or GFF3 format) and motif input file (JASPAR format).

3. Workflow input

Required inputs are highlighted in bold.

Name

Description

Example

Default

genome

Genome reference name. New reference will be stored in a folder named genome

refdata-cellranger-atac-mm10-1.1.0

input_fasta

GSURL for input fasta file

“gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa”

input_gtf

GSURL for input GTF file

“gs://fc-e0000000-0000-0000-0000-000000000000/annotation.gtf”

organism

Name of the organism

“human”

non_nuclear_contigs

A comma separated list of names of contigs that are not in nucleus

“chrM”

“chrM”

input_motifs

Optional file containing transcription factor motifs in JASPAR format

“gs://fc-e0000000-0000-0000-0000-000000000000/motifs.pfm”

output_directory

Output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_reference”

cellranger_atac_version

cellranger-atac version, could be: 2.1.0, 2.0.0

“2.1.0”

“2.1.0”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

zones

Google cloud zones. For GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

memory

Memory size string for cellranger-atac mkref

“32G”

“32G”

disk_space

Optional disk space in GB

100

100

preemptible

Number of preemptible tries. Only works for GCP

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. Only works for AWS

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”

4. Workflow output

Name

Type

Description

output_reference

File

Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input.


Build references for single-cell immune profiling data

We provide a wrapper of cellranger mkvdjref to build single-cell immune profiling references. Please follow the instructions below.

1. Import cellranger_vdj_create_reference

Import cellranger_vdj_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_vdj_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_vdj_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket

Required data include the genome FASTA file and the gene annotation file (GTF format).
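
The upload step can be sketched as follows. This is a minimal example that only builds a gcloud storage cp command for the two required files; the local file names and the bucket URL are placeholders taken from the examples in this documentation, and upload_command is a hypothetical helper, not part of Cumulus.

```python
import shlex


def upload_command(local_files, bucket_url):
    """Build a `gcloud storage cp` command copying files to the bucket root."""
    destination = bucket_url.rstrip("/") + "/"
    return ["gcloud", "storage", "cp", *local_files, destination]


cmd = upload_command(
    [
        "Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
        "Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz",
    ],
    "gs://fc-e0000000-0000-0000-0000-000000000000",  # placeholder bucket URL
)

# Print the command for review; to execute it, pass `cmd` to
# subprocess.run(cmd, check=True) on a machine with Google Cloud SDK installed.
print(shlex.join(cmd))
```

Printing the command first lets you confirm the destination before any data are copied.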

3. Workflow input

Required inputs are highlighted in bold.

Name

Description

Example

Default

input_fasta

Input genome reference in either FASTA or FASTA.gz format

“gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz”

input_gtf

Input gene annotation file in either GTF or GTF.gz format

“gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz”

genome

Genome reference name. New reference will be stored in a folder named genome

refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0

output_directory

Output directory

“gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_vdj_reference”

ref_version

Reference version string

Ensembl v94

cellranger_version

cellranger version. Available options: 9.0.1, 8.0.1, 7.2.0

“9.0.1”

“9.0.1”

docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

zones

Google Cloud zones. For the GCP Batch backend, the zones are automatically restricted by the Batch settings.

“us-central1-a us-west1-a”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

memory

Memory size string for cellranger mkvdjref

“32G”

“32G”

disk_space

Optional disk space in GB

100

100

preemptible

Number of preemptible tries. Only works for GCP

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. Only works for AWS

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”
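
Before launching the workflow, it may help to sanity-check the values you plan to enter. The sketch below is one possible check, not something the workflow itself provides: check_inputs is a hypothetical helper, and the gs:// requirement assumes you are running on Google Cloud. The list of cellranger versions is copied from the table above.

```python
# Versions listed for cellranger_version in the table above.
SUPPORTED_CELLRANGER_VERSIONS = {"9.0.1", "8.0.1", "7.2.0"}

# Inputs assumed to be Google bucket URLs (assumption: Google Cloud backend).
GS_URL_INPUTS = ("input_fasta", "input_gtf", "output_directory")


def check_inputs(values):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name in GS_URL_INPUTS:
        url = values.get(name, "")
        if not url.startswith("gs://"):
            problems.append(f"{name} should be a gs:// URL, got {url!r}")
    # Fall back to the documented default when the version is not given.
    version = values.get("cellranger_version", "9.0.1")
    if version not in SUPPORTED_CELLRANGER_VERSIONS:
        problems.append(f"cellranger_version {version!r} is not listed as supported")
    return problems


bucket = "gs://fc-e0000000-0000-0000-0000-000000000000"  # placeholder bucket URL
values = {
    "input_fasta": f"{bucket}/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
    "input_gtf": f"{bucket}/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz",
    "genome": "refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0",
    "output_directory": f"{bucket}/cellranger_vdj_reference",
    "ref_version": "Ensembl v94",
}

for problem in check_inputs(values):
    print(problem)
```

The function reports every problem it finds instead of stopping at the first, so all values can be corrected in one pass.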

4. Workflow output

Name

Type

Description

output_reference

File

Gzipped tarball of the reference folder, named genome.tar.gz. A copy of the tarball is also stored under the output_directory specified in the inputs.
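
To locate the stored copy afterwards, the small sketch below derives its likely URL. It assumes the tarball is placed directly under output_directory and is named after the genome input; both points are assumptions based on the description above, and expected_reference_copy is a hypothetical helper.

```python
def expected_reference_copy(output_directory, genome):
    """Guess the URL of the tarball copy stored under output_directory."""
    return f"{output_directory.rstrip('/')}/{genome}.tar.gz"


url = expected_reference_copy(
    "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_vdj_reference",
    "refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0",
)
print(url)
```

If the file is not found at this URL, list the contents of output_directory to see where the workflow placed it.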