Run Cell Ranger tools using cellranger_workflow

cellranger_workflow wraps Cell Ranger to process single-cell/nucleus RNA-seq, single-cell ATAC-seq and single-cell immune profiling data, and supports feature barcoding (cell/nucleus hashing, CITE-seq, Perturb-seq). It also provides routines to build cellranger references.

A general step-by-step instruction

This section mainly considers jobs starting from BCL files. If your job starts with FASTQ files and you only need to run the cellranger count part, please refer to the "Run cellranger count only" subsection below.

1. Import cellranger_workflow

Import the cellranger_workflow workflow to your workspace by following the instructions in Import workflows to Terra. You should choose the workflow github.com/lilab-bcb/cumulus/CellRanger to import.

Then, on the workflow page, click the Export to Workspace... button and, in the drop-down menu, select the workspace to which you want to export the cellranger_workflow workflow.

2. Upload sequencing data to Google bucket

Copy your sequencing output to your workspace bucket using gsutil (you already have it if you’ve installed the Google Cloud SDK) in your Unix terminal.

You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.

../_images/google_bucket_link1.png

Use gsutil cp [OPTION]... src_url dst_url to copy data to your workspace bucket. For example, the following command copies the directory at /foo/bar/nextseq/Data/VK18WBC6Z4 to a Google bucket:

gsutil -m cp -r /foo/bar/nextseq/Data/VK18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4

-m means copy in parallel, -r means copy the directory recursively, and gs://fc-e0000000-0000-0000-0000-000000000000 should be replaced by your own workspace Google bucket URL.

Note

If input is a folder of BCL files, users do not need to upload the whole folder to the Google bucket. Instead, they only need to upload the following files:

RunInfo.xml
RTAComplete.txt
runParameters.xml
Data/Intensities/s.locs
Data/Intensities/BaseCalls

If data are generated using MiSeq or NextSeq, the location files are inside lane subfolders (e.g. L001) under Data/Intensities/. In addition, if users’ data only come from a subset of lanes (e.g. L001 and L002), users only need to upload lane subfolders from the subset (e.g. Data/Intensities/BaseCalls/L001, Data/Intensities/BaseCalls/L002 and Data/Intensities/L001, Data/Intensities/L002 if sequencer is MiSeq or NextSeq).

Alternatively, users can submit jobs through command line interface (CLI) using altocumulus, which will smartly upload BCL folders according to the above rules.
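
The upload rules in the note above can be sketched as a small Python helper. This is only an illustration of those rules, not how altocumulus implements them; the function name bcl_upload_paths and its parameters are hypothetical, and the sketch assumes lane subfolders are named L001, L002, and so on.

```python
from typing import List, Optional

# Files named in the note above, relative to the BCL run folder.
ALWAYS_NEEDED = ["RunInfo.xml", "RTAComplete.txt", "runParameters.xml"]


def bcl_upload_paths(lanes: Optional[List[int]] = None,
                     locs_in_lane_subfolders: bool = False) -> List[str]:
    """Return the relative paths to upload for one BCL run folder.

    lanes: lane numbers to keep (e.g. [1, 2]); None means all lanes.
    locs_in_lane_subfolders: True for MiSeq/NextSeq, where location files
        sit in Data/Intensities/L00x rather than Data/Intensities/s.locs.
    """
    paths = list(ALWAYS_NEEDED)
    if lanes is None:
        if locs_in_lane_subfolders:
            # Simplification of this sketch: without a lane list we cannot
            # name the lane subfolders, so the caller must supply one.
            raise ValueError("list the lanes explicitly for MiSeq/NextSeq runs")
        return paths + ["Data/Intensities/s.locs", "Data/Intensities/BaseCalls"]

    lane_dirs = [f"L{lane:03d}" for lane in sorted(set(lanes))]
    if locs_in_lane_subfolders:
        paths += [f"Data/Intensities/{d}" for d in lane_dirs]
    else:
        paths.append("Data/Intensities/s.locs")
    paths += [f"Data/Intensities/BaseCalls/{d}" for d in lane_dirs]
    return paths


# Example: a NextSeq run where only lanes 1 and 2 are of interest.
for path in bcl_upload_paths([1, 2], locs_in_lane_subfolders=True):
    print(path)
```

Each returned path can then be copied with gsutil -m cp -r as shown earlier.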

3. Prepare a sample sheet

3.1 Sample sheet format:

Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.

The sample sheet describes how to demultiplex flowcells and generate channel-specific count matrices. Note that Sample, Lane, and Index columns are defined exactly the same as in 10x’s simple CSV layout file.

A brief description of the sample sheet format is listed below (required column headers are shown in bold).

Column Description
Sample Contains sample names. Each 10x channel should have a unique sample name. Sample name can only contain characters from [a-zA-Z0-9_-].
Reference
Provides the reference genome used by Cell Ranger for each 10x channel.
The elements in the reference column can be either Google bucket URLs to reference tarballs or keywords such as GRCh38-2020-A.
A full list of available keywords is included in each of the following data type sections (e.g. sc/snRNA-seq) below.
Flowcell
Indicates the Google bucket URLs of uploaded BCL folders.
If the job starts with FASTQ files, this should be Google bucket URLs of uploaded FASTQ folders.
The FASTQ folders should contain one subfolder for each sample in the flowcell with the sample name as the subfolder name.
Each subfolder contains FASTQ files for that sample.
Lane
Tells which lanes the sample was pooled into.
Can be either single lane (e.g. 8) or a range (e.g. 7-8) or all (e.g. *).
Index Sample index (e.g. SI-GA-A12).
Chemistry Describes the 10x chemistry used for the sample. This column is optional.
DataType
Describes the data type of the sample: rna, vdj, citeseq, hashing, adt, cmo, crispr, or atac.
rna refers to gene expression data (cellranger count),
vdj refers to V(D)J data (cellranger vdj),
citeseq refers to CITE-Seq tag data,
hashing refers to cell-hashing or nucleus-hashing tag data,
adt refers to the case where hashing and citeseq reads are in the same library,
cmo refers to cell multiplexing oligos used in 10x Genomics’ CellPlex assay,
crispr refers to Perturb-seq guide tag data,
atac refers to scATAC-Seq data (cellranger-atac count),
This column is optional and the default data type is rna.
FeatureBarcodeFile
Google bucket urls pointing to feature barcode files for rna, citeseq, hashing, cmo and crispr data.
Features can be either targeted genes for targeted gene expression analysis, antibodies for CITE-Seq, cell-hashing or nucleus-hashing, or gRNAs for Perturb-seq.
If cmo data is analyzed separately using cumulus_feature_barcoding, file format should follow the guide in Feature barcoding assays section, otherwise follow the guide in Single-cell multiomics section.
This column is only required for targeted gene expression analysis (rna), CITE-Seq (citeseq), cell-hashing or nucleus-hashing (hashing), CellPlex (cmo) and Perturb-seq (crispr).
Link
Designed for Single Cell Multiome ATAC + Gene Expression, Feature Barcoding, or CellPlex.
Link multiple modalities together using a single link name.
cellranger-arc count, cellranger count, or cellranger multi will be triggered automatically depending on the modalities.
If empty string is provided, no link is assumed.
Link name can only contain characters from [a-zA-Z0-9_-].

The sample sheet supports sequencing the same 10x channels across multiple flowcells. If a sample is sequenced across multiple flowcells, simply list it in multiple rows, with one flowcell per row. In the following example, we have 4 samples sequenced in two flowcells.

Example:

Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,SI-GA-A8,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,3-4,SI-GA-B8,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,5-6,SI-GA-C8,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,7-8,SI-GA-D8,fiveprime,rna
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,1-2,SI-GA-A8,threeprime,rna
sample_2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,3-4,SI-GA-B8,SC3Pv3,rna
sample_3,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,5-6,SI-GA-C8,fiveprime,rna
sample_4,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,7-8,SI-GA-D8,fiveprime,rna
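
Before uploading, a sample sheet can be checked locally against the rules in the table above. The sketch below is a partial, unofficial check written for illustration: the function name check_sample_sheet is hypothetical, it assumes a job starting from BCL files (so Lane and Index are treated as required), and it only covers the Sample name characters, the Lane format and the DataType keywords.

```python
import csv
import io
import re

REQUIRED = {"Sample", "Reference", "Flowcell", "Lane", "Index"}
DATA_TYPES = {"rna", "vdj", "citeseq", "hashing", "adt", "cmo", "crispr", "atac"}
NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")      # allowed Sample characters
LANE_RE = re.compile(r"^(\*|\d+(-\d+)?)$")     # "8", "7-8" or "*"


def check_sample_sheet(text):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not NAME_RE.match(row["Sample"] or ""):
            problems.append(f"line {line_no}: invalid sample name {row['Sample']!r}")
        if not LANE_RE.match(row["Lane"] or ""):
            problems.append(f"line {line_no}: invalid lane {row['Lane']!r}")
        data_type = row.get("DataType") or "rna"     # rna is the default
        if data_type not in DATA_TYPES:
            problems.append(f"line {line_no}: unknown data type {data_type!r}")
    return problems


sheet = """Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType
sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,SI-GA-A8,threeprime,rna
sample 2,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,3+4,SI-GA-B8,SC3Pv3,rnaseq
"""
for problem in check_sample_sheet(sheet):
    print(problem)
```

Because the columns are looked up by header name, the check works regardless of column order.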

3.2 Upload your sample sheet to the workspace bucket:

Example:

gsutil cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/

4. Launch analysis

In your workspace, open cellranger_workflow in WORKFLOWS tab. Select the desired snapshot version (e.g. latest). Select Run workflow with inputs defined by file paths as below

../_images/single_workflow.png

and click the SAVE button. Select Use call caching and click INPUTS. Then fill in appropriate values in the Attribute column. Alternatively, you can upload a JSON file to configure inputs by clicking Drag or click to upload json.

Once INPUTS are appropriately filled, click RUN ANALYSIS and then click LAUNCH.
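
If you prefer the JSON upload route, the file can be generated with a few lines of Python. This is a minimal sketch under one assumption: that input keys take the usual WDL form of workflow name, a dot, then input name (e.g. cellranger_workflow.input_csv_file). The helper make_inputs is hypothetical; the input names input_csv_file, output_directory and run_mkfastq come from the workflow input tables in this document.

```python
import json


def make_inputs(input_csv_file, output_directory,
                workflow_name="cellranger_workflow", **optional):
    """Build an inputs dictionary keyed as '<workflow>.<input>' (assumed form)."""
    inputs = {
        "input_csv_file": input_csv_file,
        "output_directory": output_directory,
    }
    inputs.update(optional)  # any optional workflow inputs, passed by name
    return {f"{workflow_name}.{name}": value for name, value in inputs.items()}


inputs = make_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output",
    run_mkfastq=True,
)
print(json.dumps(inputs, indent=4))
```

Save the printed JSON to a file and upload it in the INPUTS tab.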

5. Notice: run cellranger mkfastq if you are a non-Broad Institute user

Non-Broad Institute users who wish to run cellranger mkfastq must create a custom docker image that contains bcl2fastq.

See bcl2fastq instructions.

6. Run cellranger count only

Sometimes, users might want to perform demultiplexing locally and only run the count part on the cloud. This section describes how to only run the count part via cellranger_workflow.

  1. Copy your FASTQ files to the workspace using gsutil in your unix terminal. There are two cases:

    • Case 1: All the FASTQ files are in one top-level folder. Then you can simply upload this folder to Cloud, and in your sample sheet, make sure Sample names are consistent with the filename prefix of their corresponding FASTQ files.
    • Case 2: In the top-level folder, each sample has a dedicated subfolder containing its FASTQ files. In this case, you need to upload the whole top-level folder, and in your sample sheet, make sure Sample names and their corresponding subfolder names are identical.

    Notice that if your FASTQ files are downloaded from the Sequence Read Archive (SRA) from NCBI, you must rename your FASTQs to follow the bcl2fastq file naming conventions.

    Example:

    gsutil -m cp -r /foo/bar/fastq_path/K18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/K18WBC6Z4_fastq
    
  2. Create a sample sheet with the same structure as above, except for the following differences:

    • Flowcell column should list Google bucket URLs of the FASTQ folders for flowcells.
    • Lane and Index columns are NOT required in this case.

    Example:

    Sample,Reference,Flowcell
    sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/K18WBC6Z4_fastq
    
  3. Set optional input run_mkfastq to false.
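
For Case 1 above, a quick local check can confirm that FASTQ file names follow the bcl2fastq naming convention and that their prefixes match the Sample names in the sample sheet. The sketch below is an illustration only: the regular expression encodes the common form SampleName_S1_L001_R1_001.fastq.gz, it does not handle files written without lane splitting, and both function names are hypothetical.

```python
import re

# <sample>_S<number>_L<3-digit lane>_<R1|R2|I1|I2>_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S\d+_L\d{3}_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def sample_from_fastq(filename):
    """Return the sample-name prefix, or None if the name does not conform."""
    match = FASTQ_RE.match(filename)
    return match.group("sample") if match else None


def fastqs_not_matching(sample_names, filenames):
    """List files that are misnamed or belong to no sample in the sheet."""
    known = set(sample_names)
    return [f for f in filenames if sample_from_fastq(f) not in known]


files = [
    "sample_1_S1_L001_R1_001.fastq.gz",
    "sample_1_S1_L001_R2_001.fastq.gz",
    "SRR0000000_1.fastq.gz",  # made-up SRA-style name that needs renaming
]
print(fastqs_not_matching(["sample_1"], files))
```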

7. Workflow outputs

See the table below for workflow level outputs.

Name Type Description
fastq_outputs Array[Array[String]?] The top-level array contains results (as arrays) for different data modalities. The inner-level array contains cloud locations of FASTQ files, one url per flowcell.
count_outputs Array[Array[String]?] The top-level array contains results (as arrays) for different data modalities. The inner-level array contains cloud locations of count matrices, one url per sample.
count_matrix String Cloud url for a template count_matrix.csv to run Cumulus. It only contains sc/snRNA-Seq samples.

Single-cell and single-nucleus RNA-seq

To process sc/snRNA-seq data, follow the specific instructions below.

Sample sheet

  1. Reference column.

    Pre-built scRNA-seq references are summarized below.

    Keyword Description
    GRCh38-2020-A Human GRCh38 (GENCODE v32/Ensembl 98)
    mm10-2020-A Mouse mm10 (GENCODE vM23/Ensembl 98)
    GRCh38_and_mm10-2020-A Human GRCh38 (GENCODE v32/Ensembl 98) and mouse mm10 (GENCODE vM23/Ensembl 98)
    GRCh38_v3.0.0 Human GRCh38, cellranger reference 3.0.0, Ensembl v93 gene annotation
    hg19_v3.0.0 Human hg19, cellranger reference 3.0.0, Ensembl v87 gene annotation
    mm10_v3.0.0 Mouse mm10, cellranger reference 3.0.0, Ensembl v93 gene annotation
    GRCh38_and_mm10_v3.1.0 Human (GRCh38) and mouse (mm10), cellranger references 3.1.0, Ensembl v93 gene annotations for both human and mouse
    hg19_and_mm10_v3.0.0 Human (hg19) and mouse (mm10), cellranger reference 3.0.0, Ensembl v93 gene annotations for both human and mouse
    GRCh38_v1.2.0 or GRCh38 Human GRCh38, cellranger reference 1.2.0, Ensembl v84 gene annotation
    hg19_v1.2.0 or hg19 Human hg19, cellranger reference 1.2.0, Ensembl v82 gene annotation
    mm10_v1.2.0 or mm10 Mouse mm10, cellranger reference 1.2.0, Ensembl v84 gene annotation
    GRCh38_and_mm10_v1.2.0 or GRCh38_and_mm10 Human and mouse, built from GRCh38 and mm10 cellranger references, Ensembl v84 gene annotations are used
    GRCh38_and_SARSCoV2 Human GRCh38 and SARS-COV-2 RNA genome, cellranger reference 3.0.0, generated by Carly Ziegler. The SARS-COV-2 viral sequence and gtf are as described in [Kim et al. Cell 2020] (https://github.com/hyeshik/sars-cov-2-transcriptome, BetaCov/South Korea/KCDC03/2020 based on NC_045512.2). The GTF was edited to include only CDS regions, and regions were added to describe the 5’ UTR (“SARSCoV2_5prime”), the 3’ UTR (“SARSCoV2_3prime”), and reads aligning to anywhere within the Negative Strand(“SARSCoV2_NegStrand”). Additionally, trailing A’s at the 3’ end of the virus were excluded from the SARSCoV2 fasta, as these were found to drive spurious viral alignment in pre-COVID19 samples.

    Pre-built snRNA-seq references are summarized below.

    Keyword Description
    GRCh38_premrna_v3.0.0 Human, introns included, built from GRCh38 cellranger reference 3.0.0, Ensembl v93 gene annotation, treating annotated transcripts as exons
    GRCh38_premrna_v1.2.0 or GRCh38_premrna Human, introns included, built from GRCh38 cellranger reference 1.2.0, Ensembl v84 gene annotation, treating annotated transcripts as exons
    mm10_premrna_v1.2.0 or mm10_premrna Mouse, introns included, built from mm10 cellranger reference 1.2.0, Ensembl v84 gene annotation, treating annotated transcripts as exons
    GRCh38_premrna_and_mm10_premrna_v1.2.0 or GRCh38_premrna_and_mm10_premrna Human and mouse, introns included, built from GRCh38_premrna_v1.2.0 and mm10_premrna_v1.2.0
    GRCh38_premrna_and_SARSCoV2 Human, introns included, built from GRCh38_premrna_v3.0.0, and SARS-COV-2 RNA genome. This reference was generated by Carly Ziegler. The SARS-COV-2 RNA genome is from [Kim et al. Cell 2020] (https://github.com/hyeshik/sars-cov-2-transcriptome, BetaCov/South Korea/KCDC03/2020 based on NC_045512.2). Please see the description of GRCh38_and_SARSCoV2 above for details.
  2. Index column.

  3. Chemistry column.

    According to cellranger count’s documentation, chemistry can be

    Chemistry Explanation
    auto autodetection (default). If the index read has extra bases besides cell barcode and UMI, autodetection might fail. In this case, please specify the chemistry
    threeprime Single Cell 3′
    fiveprime Single Cell 5′
    SC3Pv1 Single Cell 3′ v1
    SC3Pv2 Single Cell 3′ v2
    SC3Pv3 Single Cell 3′ v3. You should set cellranger version input parameter to >= 3.0.2
    SC5P-PE Single Cell 5′ paired-end (both R1 and R2 are used for alignment)
    SC5P-R2 Single Cell 5′ R2-only (where only R2 is used for alignment)
  4. DataType column.

    This column is optional and defaults to rna. If you want to put a value, put rna here.

  5. FeatureBarcodeFile column.

    Put the target panel CSV file here for targeted expression data. Note that if a target panel CSV is present, the Cell Ranger version must be >= 4.0.0.

  6. Example:

    Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType,FeatureBarcodeFile
    sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,SI-GA-A8,threeprime,rna
    sample_1,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,1-2,SI-GA-A8,threeprime,rna
    sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,5-6,SI-GA-C8,fiveprime,rna
    sample_2,mm10-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9Z2,5-6,SI-GA-C8,fiveprime,rna
    sample_3,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,3,SI-TT-A1,auto,rna,gs://fc-e0000000-0000-0000-0000-000000000000/immunology_v1.0_GRCh38-2020-A.target_panel.csv
    

Workflow input

For sc/snRNA-seq data, cellranger_workflow takes Illumina outputs as input and runs cellranger mkfastq and cellranger count. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name Description Example Default
input_csv_file Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile as optional) “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”  
output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output” Results are written under directory output_directory and will overwrite any existing files at this location.
run_mkfastq If you want to run cellranger mkfastq true true
run_count If you want to run cellranger count true true
delete_input_bcl_directory Whether to delete BCL directories after demux. If false, you should delete this folder yourself so as to not incur storage charges false false
mkfastq_barcode_mismatches Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1) 0  
mkfastq_force_single_index If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair false false
mkfastq_filter_single_index Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed false false
mkfastq_use_bases_mask Override the read lengths as specified in RunInfo.xml “Y28n*,I8n*,N10,Y90n*”  
mkfastq_delete_undetermined Delete undetermined FASTQ files generated by bcl2fastq2 true false
force_cells Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells 6000  
expect_cells Expected number of recovered cells. Mutually exclusive with force_cells 3000  
include_introns Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references. Note that if this option is set, cellranger_version must be >= 5.0.0. true true
no_bam Turn this option on to disable BAM file generation. This option is only available if cellranger_version >= 5.0.0. false false
secondary Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.) false false
cellranger_version cellranger version, could be: 7.0.0, 6.1.2, 6.1.1, 6.0.2, 6.0.1, 6.0.0, 5.0.1, 5.0.0 “7.0.0” “7.0.0”
config_version config docker version used for processing sample sheets, could be 0.2, 0.1 “0.2” “0.2”
docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;
  • “cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus” “quay.io/cumulus”
mkfastq_docker_registry Docker registry to use for cellranger mkfastq. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry. “gcr.io/broad-cumulus” “gcr.io/broad-cumulus”
acronym_file
The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names.
Set a GS URI if backend is gcp; an S3 URI for aws backend; an absolute file path for local backend.
“s3://xxxx/index.tsv” “gs://regev-lab/resources/cellranger/index.tsv”
zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu Number of cpus to request for one node for cellranger mkfastq and cellranger count 32 32
memory Memory size string for cellranger mkfastq and cellranger count “120G” “120G”
mkfastq_disk_space Optional disk space in GB for mkfastq 1500 1500
count_disk_space Disk space in GB needed for cellranger count 500 500
backend

Cloud backend for file transfer. Available options:

  • “gcp” for Google Cloud;
  • “aws” for Amazon AWS;
  • “local” for local machine.
“gcp” “gcp”
preemptible Number of preemptible tries 2 2
awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5
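
Several of the constraints stated in the table above can be checked before launching a job. The following is a minimal sketch, not part of cellranger_workflow: the function check_inputs and its has_target_panel flag are hypothetical, and the sketch only looks at values the user sets explicitly, not at workflow defaults.

```python
def version_tuple(version):
    """Turn a version string such as '6.1.2' into (6, 1, 2) for comparison."""
    return tuple(int(part) for part in version.split("."))


def check_inputs(inputs, has_target_panel=False):
    """Return a list of conflicts among explicitly set workflow inputs."""
    problems = []
    if inputs.get("force_cells") is not None and inputs.get("expect_cells") is not None:
        problems.append("force_cells and expect_cells are mutually exclusive")

    version = version_tuple(inputs.get("cellranger_version", "7.0.0"))
    for name in ("include_introns", "no_bam"):
        if inputs.get(name) and version < (5, 0, 0):
            problems.append(f"{name} requires cellranger_version >= 5.0.0")
    if has_target_panel and version < (4, 0, 0):
        problems.append("a target panel CSV requires cellranger_version >= 4.0.0")
    return problems


print(check_inputs({"force_cells": 6000, "expect_cells": 3000}))
```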

Workflow output

See the table below for important sc/snRNA-seq outputs.

Name Type Description
cellranger_mkfastq.output_fastqs_directory Array[String]? Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cellranger_count.output_count_directory Array[String]? Subworkflow output. A list of cloud urls containing gene count matrices, one url per sample.
cellranger_count.output_web_summary Array[File]? Subworkflow output. A list of htmls visualizing QCs for each sample (cellranger count output).
collect_summaries.metrics_summaries File? Task output. An Excel spreadsheet containing QCs for each sample.
count_matrix String Workflow output. Cloud url for a template count_matrix.csv to run Cumulus.

Feature barcoding assays (cell & nucleus hashing, CITE-seq and Perturb-seq)

cellranger_workflow can extract feature-barcode count matrices in CSV format for feature barcoding assays such as cell and nucleus hashing, CellPlex, CITE-seq, and Perturb-seq. For cell and nucleus hashing as well as CITE-seq, the feature refers to antibody. For Perturb-seq, the feature refers to guide RNA. Please follow the instructions below to configure cellranger_workflow.

Prepare feature barcode files

Prepare a CSV file with the following format: feature_barcode,feature_name. See below for an example:

TTCCTGCCATTACTA,sample_1
CCGTACCTCATTGTT,sample_2
GGTAGATGTCCTCAG,sample_3
TGGTGTCATTCTTGA,sample_4

The above file describes a cell hashing application with 4 samples.

If cell hashing and CITE-seq data share the same sample index, you should concatenate hashing and CITE-seq barcodes together and add a third column indicating the feature type. See below for an example:

TTCCTGCCATTACTA,sample_1,hashing
CCGTACCTCATTGTT,sample_2,hashing
GGTAGATGTCCTCAG,sample_3,hashing
TGGTGTCATTCTTGA,sample_4,hashing
CTCATTGTAACTCCT,CD3,citeseq
GCGCAACTTGATGAT,CD8,citeseq

Then upload it to your google bucket:

gsutil cp antibody_index.csv gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
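
A feature barcode file in the format described above can also be written programmatically. The sketch below is illustrative only: the function feature_barcode_csv is hypothetical, and the checks it performs (barcodes made of A, C, G, T only, no duplicate barcodes, a consistent number of columns) are sanity checks assumed for this example rather than rules stated by the workflow.

```python
import csv
import io


def feature_barcode_csv(features):
    """Render (barcode, name) or (barcode, name, feature_type) rows as CSV text."""
    widths = {len(row) for row in features}
    if widths not in ({2}, {3}):
        raise ValueError("every row needs the same number of fields (2 or 3)")

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    seen = set()
    for row in features:
        barcode = row[0]
        if set(barcode) - set("ACGT"):
            raise ValueError(f"unexpected characters in barcode {barcode!r}")
        if barcode in seen:
            raise ValueError(f"duplicate barcode {barcode!r}")
        seen.add(barcode)
        writer.writerow(row)
    return buffer.getvalue()


text = feature_barcode_csv([
    ("TTCCTGCCATTACTA", "sample_1", "hashing"),
    ("CTCATTGTAACTCCT", "CD3", "citeseq"),
])
print(text, end="")
```

Write the returned text to antibody_index.csv before copying it to the bucket.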

Sample sheet

  1. Reference column.

    This column is not used for extracting feature-barcode count matrix. To be consistent, please put the reference for the associated scRNA-seq assay here.

  2. Index column.

    The ADT/HTO index can be either Illumina index primer sequence (e.g. ATTACTCG, also known as D701), or 10x single cell RNA-seq sample index set names (e.g. SI-GA-A12).

    Note 1: All ADT/HTO index sequences (including 10x’s) should have the same length (8 bases). If one index sequence is shorter (e.g. ATCACG), pad it with P7 sequence (e.g. ATCACGAT).

    Note 2: It is users’ responsibility to avoid index collision between 10x Genomics’ RNA indexes (e.g. SI-GA-A8) and Illumina index sequences used here (e.g. ATTACTCG).

    Note 3: For NextSeq runs, please reverse complement the ADT/HTO index primer sequence (e.g. use reverse complement CGAGTAAT instead of ATTACTCG).

  3. Chemistry column.

    The following keywords are accepted for Chemistry column:

    Chemistry Explanation
    auto Default. This is an alias for Single Cell 3’ v3 (SC3Pv3)
    threeprime This is another alias for Single Cell 3’ v3
    SC3Pv3 Single Cell 3′ v3
    SC3Pv2 Single Cell 3′ v2
    fiveprime Single Cell 5′
    SC5P-PE Single Cell 5′ paired-end (both R1 and R2 are used for alignment)
    SC5P-R2 Single Cell 5′ R2-only (where only R2 is used for alignment)
    multiome 10x Multiome barcodes
  4. DataType column.

    The following keywords are accepted for DataType column:

    DataType Explanation
    citeseq CITE-seq
    hashing Cell or nucleus hashing
    cmo CellPlex
    adt Hashing and CITE-seq are in the same library
    crispr
    Perturb-seq/CROP-seq
    If neither crispr_barcode_pos nor scaffold_sequence (see Workflow input) is set, crispr refers to 10x CRISPR assays. If in addition Chemistry is set to be SC3Pv3 or its aliases, Cumulus automatically complements the middle two bases to convert 10x feature barcoding cell barcodes back to 10x RNA cell barcodes.
    Otherwise, crispr refers to non 10x CRISPR assays, such as CROP-Seq. In this case, we assume feature barcoding cell barcodes are the same as the RNA cell barcodes and no cell barcode conversion will be conducted.
  5. FeatureBarcodeFile column.

    Put Google Bucket URL of the feature barcode file here.

  6. Example:

    Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType,FeatureBarcodeFile
    sample_1_rna,GRCh38_v3.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,SI-GA-A8,threeprime,rna
    sample_1_adt,GRCh38_v3.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,1-2,ATTACTCG,SC3Pv3,adt,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
    sample_2_adt,GRCh38_v3.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,3-4,TCCGGAGA,SC3Pv3,adt,gs://fc-e0000000-0000-0000-0000-000000000000/antibody_index.csv
    sample_3_crispr,GRCh38_v3.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4,5-6,CGCTCATT,SC3Pv3,crispr,gs://fc-e0000000-0000-0000-0000-000000000000/crispr_index.csv
    

In the sample sheet above, excluding the header row,

  • The first row describes the normal 3’ RNA assay;
  • The second row describes its associated antibody tag data, which can be from either a CITE-seq, cell hashing, or nucleus hashing experiment;
  • The third row describes another tag data set, which is in 10x Genomics’ V3 chemistry. For tag and crispr data, it is important to explicitly state the chemistry (e.g. SC3Pv3);
  • The last row describes gRNA guide data for Perturb-seq (see crispr in the DataType field).
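
The reverse complement required for NextSeq runs (Note 3 under the Index column above) is easy to compute by hand for one index, but a short helper avoids mistakes when there are many. This is a generic sketch; the function name reverse_complement is ours and it assumes index sequences contain only A, C, G and T.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence made of A, C, G, T."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


# The example from Note 3: ATTACTCG becomes CGAGTAAT.
print(reverse_complement("ATTACTCG"))
```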

Workflow input

For feature barcoding data, cellranger_workflow takes Illumina outputs as input and runs cellranger mkfastq and cumulus adt. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name Description Example Default
input_csv_file Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile as optional) “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”  
output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”  
run_mkfastq If you want to run cellranger mkfastq true true
run_count If you want to run cumulus adt true true
delete_input_bcl_directory Whether to delete BCL directories after demux. If false, you should delete this folder yourself so as to not incur storage charges false false
mkfastq_barcode_mismatches Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1) 0  
mkfastq_force_single_index If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair false false
mkfastq_filter_single_index Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed false false
mkfastq_use_bases_mask Override the read lengths as specified in RunInfo.xml “Y28n*,I8n*,N10,Y90n*”  
mkfastq_delete_undetermined Delete undetermined FASTQ files generated by bcl2fastq2 true false
crispr_barcode_pos Barcode start position at Read 2 (0-based coordinate) for CRISPR 19 0
scaffold_sequence Scaffold sequence in sgRNA for Perturb-seq, only used for crispr data type. “GTTTAAGAGCTAAGCTGGAA” “”
max_mismatch Maximum hamming distance in feature barcodes for the adt task (changed to 2 as default) 2 2
min_read_ratio Minimum read count ratio (non-inclusive) to justify a feature given a cell barcode and feature combination, only used for the adt task and crispr data type 0.1 0.1
cellranger_version cellranger version, could be 7.0.0, 6.1.2, 6.1.1, 6.0.2, 6.0.1, 6.0.0, 5.0.1, 5.0.0 “7.0.0” “7.0.0”
cumulus_feature_barcoding_version Cumulus_feature_barcoding version for extracting feature barcode matrix. Version available: 0.9.0, 0.8.0, 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0. “0.9.0” “0.9.0”
docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;
  • “cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus” “quay.io/cumulus”
mkfastq_docker_registry Docker registry to use for cellranger mkfastq. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry. “gcr.io/broad-cumulus” “gcr.io/broad-cumulus”
acronym_file
The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names.
Set a GS URI if backend is gcp; an S3 URI for aws backend; an absolute file path for local backend.
“s3://xxxx/index.tsv” “gs://regev-lab/resources/cellranger/index.tsv”
zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu Number of cpus to request for one node for cellranger mkfastq 32 32
memory Memory size string for cellranger mkfastq “120G” “120G”
feature_num_cpu Number of cpus for extracting feature count matrix 4 4
feature_memory Optional memory string for extracting feature count matrix “32G” “32G”
mkfastq_disk_space Optional disk space in GB for mkfastq 1500 1500
feature_disk_space Disk space in GB needed for extracting feature count matrix 100 100
backend

Cloud backend for file transfer. Available options:

  • “gcp” for Google Cloud;
  • “aws” for Amazon AWS;
  • “local” for local machine.
“gcp” “gcp”
preemptible Number of preemptible tries 2 2
awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5

Parameters used for feature count matrix extraction

If the chemistry is V2, 10x genomics v2 cell barcode white list will be used, a hamming distance of 1 is allowed for matching cell barcodes, and the UMI length is 10. If the chemistry is V3, 10x genomics v3 cell barcode white list will be used, a hamming distance of 0 is allowed for matching cell barcodes, and the UMI length is 12.

For Perturb-seq data, a small number of sgRNA protospacer sequences will be sequenced ultra-deeply and we may have PCR chimeric reads. Therefore, we also generate filtered feature count matrices in a data-driven manner:

  1. First, plot the histogram of the number of UMIs with a given number of supporting reads. The number of UMIs with x supporting reads decreases when x increases. We start from x = 1, and a valley between two peaks is detected if we find count[x] < count[x + 1] < count[x + 2]. We filter out all UMIs with < x supporting reads since they are likely formed due to chimeric reads.
  2. In addition, we filter out barcode-feature-UMI combinations whose read count ratio, defined as the total reads supporting barcode-feature-UMI over the total reads supporting barcode-UMI, is no larger than the min_read_ratio parameter set above.
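
The two filtering steps above can be sketched in Python as follows. This is an illustration of the description, not the code used by cumulus_feature_barcoding: the function names are hypothetical, and returning a threshold of 1 (no filtering) when no valley is found is an assumption of this sketch.

```python
from collections import Counter


def chimeric_read_threshold(reads_per_umi):
    """Find the valley x; UMIs with fewer than x supporting reads are dropped.

    reads_per_umi: one entry per UMI giving its number of supporting reads.
    Returns 1 (drop nothing) if no valley is detected (assumed behavior).
    """
    hist = Counter(reads_per_umi)  # hist[x] = number of UMIs with x reads
    if not hist:
        return 1
    for x in range(1, max(hist) - 1):
        if hist[x] < hist[x + 1] < hist[x + 2]:
            return x
    return 1


def passes_read_ratio(feature_reads, barcode_umi_reads, min_read_ratio=0.1):
    """Keep a barcode-feature-UMI only if its ratio exceeds min_read_ratio."""
    return feature_reads / barcode_umi_reads > min_read_ratio


# Made-up data: counts fall from x = 1 to x = 3, then rise again.
reads = [1] * 100 + [2] * 50 + [3] * 20 + [4] * 30 + [5] * 40 + [6] * 35
print(chimeric_read_threshold(reads))
print(passes_read_ratio(1, 10), passes_read_ratio(2, 10))
```

In the made-up data, the first x satisfying count[x] < count[x + 1] < count[x + 2] is 3, so UMIs supported by 1 or 2 reads would be removed.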

Workflow outputs

See the table below for important outputs.

Name Type Description
cellranger_mkfastq.output_fastqs_directory Array[String]? Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cumulus_adt.output_count_directory Array[String]? Subworkflow output. A list of cloud urls containing feature-barcode count matrices, one url per sample.

In addition, for each antibody tag or crispr tag sample, a folder with the sample ID is generated under output_directory. In the folder, two files, sample_id.csv and sample_id.stat.csv.gz, are generated.

sample_id.csv is the feature count matrix. It has the following format. The first line describes the column names: Antibody/CRISPR,cell_barcode_1,cell_barcode_2,...,cell_barcode_n. The following lines describe UMI counts for each feature barcode, with the following format: feature_name,umi_count_1,umi_count_2,...,umi_count_n.

sample_id.stat.csv.gz stores the gzipped sufficient statistics. It has the following format. The first line describes the column names: Barcode,UMI,Feature,Count. The following lines describe the read counts for every barcode-umi-feature combination.
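
As an illustration of the sample_id.csv layout described above, the following sketch parses such a file into a nested dictionary. The function read_feature_matrix is hypothetical and the short cell barcodes in the example are made up for readability; real files contain full-length cell barcodes.

```python
import csv
import io


def read_feature_matrix(text):
    """Parse a feature count matrix into (cell barcodes, counts per feature).

    The first line holds the column names (a label followed by cell
    barcodes); each following line holds a feature name and its UMI counts.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    barcodes = header[1:]
    counts = {}
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        counts[row[0]] = dict(zip(barcodes, (int(value) for value in row[1:])))
    return barcodes, counts


example = "Antibody,AAAC,AAAG,AACT\nsample_1,5,0,2\nsample_2,1,7,0\n"
barcodes, counts = read_feature_matrix(example)
print(barcodes)
print(counts["sample_2"]["AAAG"])
```

For a file on disk, pass the result of open(path).read() as text.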

If the feature barcode file has a third column, there will be two files for each feature type in the third column. For example, if hashing is present, sample_id.hashing.csv and sample_id.hashing.stat.csv.gz will be generated.

sample_id.report.txt is a summary report in TXT format. The first lines describe the total number of reads parsed, the number of reads with valid cell barcodes (and percentage over all parsed reads), the number of reads with valid feature barcodes (and percentage over all parsed reads) and the number of reads with both valid cell and feature barcodes (and percentage over all parsed reads). It is then followed by sections describing each feature type. In each section, 7 lines are shown: section title, number of valid cell barcodes (with matching cell barcode and feature barcode) in this section, number of reads for these cell barcodes, mean number of reads per cell barcode, number of UMIs for these cell barcodes, mean number of UMIs per cell barcode and sequencing saturation.

If data type is crispr, three additional files, sample_id.umi_count.pdf, sample_id.filt.csv and sample_id.filt.stat.csv.gz, are generated.

sample_id.umi_count.pdf plots the number of UMIs against the number of reads per UMI, coloring UMIs with a high likelihood of being chimeric in blue and all other UMIs in red. The plot is based purely on the number of reads each UMI has. For better visualization, UMIs with more than 50 reads (rare in data) are not shown.

sample_id.filt.csv is the filtered feature count matrix. It has the same format as sample_id.csv.

sample_id.filt.stat.csv.gz is the filtered sufficient statistics. It has the same format as sample_id.stat.csv.gz.


Single-cell ATAC-seq

To process scATAC-seq data, follow the specific instructions below.

Sample sheet

  1. Reference column.

    Pre-built scATAC-seq references are summarized below.

    Keyword Description
    GRCh38-2020-A_arc_v2.0.0 Human GRCh38, cellranger-arc/atac reference 2.0.0
    mm10-2020-A_arc_v2.0.0 Mouse mm10, cellranger-arc/atac reference 2.0.0
    GRCh38_and_mm10-2020-A_atac_v2.0.0 Human GRCh38 and mouse mm10, cellranger-atac reference 2.0.0
    GRCh38_atac_v1.2.0 Human GRCh38, cellranger-atac reference 1.2.0
    mm10_atac_v1.2.0 Mouse mm10, cellranger-atac reference 1.2.0
    hg19_atac_v1.2.0 Human hg19, cellranger-atac reference 1.2.0
    b37_atac_v1.2.0 Human b37 build, cellranger-atac reference 1.2.0
    GRCh38_and_mm10_atac_v1.2.0 Human GRCh38 and mouse mm10, cellranger-atac reference 1.2.0
    hg19_and_mm10_atac_v1.2.0 Human hg19 and mouse mm10, cellranger-atac reference 1.2.0
    GRCh38_atac_v1.1.0 Human GRCh38, cellranger-atac reference 1.1.0
    mm10_atac_v1.1.0 Mouse mm10, cellranger-atac reference 1.1.0
    hg19_atac_v1.1.0 Human hg19, cellranger-atac reference 1.1.0
    b37_atac_v1.1.0 Human b37 build, cellranger-atac reference 1.1.0
    GRCh38_and_mm10_atac_v1.1.0 Human GRCh38 and mouse mm10, cellranger-atac reference 1.1.0
    hg19_and_mm10_atac_v1.1.0 Human hg19 and mouse mm10, cellranger-atac reference 1.1.0
  2. Index column.

  3. Chemistry column.

    The default is auto, which does not specify a chemistry. To analyze just the individual ATAC library from a 10x multiome assay using cellranger-atac count, use ARC-v1 in the Chemistry column.

  4. DataType column.

    Set it to atac.

  5. FeatureBarcodeFile column.

    Leave it blank for scATAC-seq.

  6. Example:

    Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType
    sample_atac,GRCh38_atac_v1.1.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9YB,*,SI-NA-A1,auto,atac
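Before uploading a sample sheet, it can help to check that the required columns (Sample, Reference, Flowcell, Lane, Index) are present and non-empty. The sketch below is one way to do that locally. The function name find_sample_sheet_problems is hypothetical, and the check is only an approximation of what the workflow itself validates.

```python
import csv
import io

REQUIRED_COLUMNS = ["Sample", "Reference", "Flowcell", "Lane", "Index"]


def find_sample_sheet_problems(handle):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return ["missing required columns: " + ", ".join(missing)]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for name in REQUIRED_COLUMNS:
            if not (row[name] or "").strip():
                problems.append(f"line {line_number}: empty {name}")
    return problems


sheet = io.StringIO(
    "Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType\n"
    "sample_atac,GRCh38_atac_v1.1.0,"
    "gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9YB,"
    "*,SI-NA-A1,auto,atac\n"
)
problems = find_sample_sheet_problems(sheet)
print(problems or "sample sheet looks fine")
```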
    

Workflow input

cellranger_workflow takes Illumina outputs as input and runs cellranger-atac mkfastq and cellranger-atac count. Please see the description of inputs below. Note that required inputs are shown in bold.

Name Description Example Default
input_csv_file Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile as optional) “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”  
output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_output”  
run_mkfastq If you want to run cellranger-atac mkfastq true true
run_count If you want to run cellranger-atac count true true
delete_input_directory Whether to delete BCL directories after demux. If false, you should delete this folder yourself so as to not incur storage charges false false
mkfastq_barcode_mismatches Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1) 0  
mkfastq_force_single_index If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair false false
mkfastq_filter_single_index Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed false false
mkfastq_use_bases_mask Override the read lengths as specified in RunInfo.xml “Y28n*,I8n*,N10,Y90n*”  
mkfastq_delete_undetermined Delete undetermined FASTQ files generated by bcl2fastq2 true false
force_cells Force pipeline to use this number of cells, bypassing the cell detection algorithm 6000  
atac_dim_reduce Choose the algorithm for dimensionality reduction prior to clustering and tsne: “lsa”, “plsa”, or “pca” “lsa” “lsa”
peaks A 3-column BED file of peaks to override cellranger atac peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with # are allowed “gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”  
cellranger_atac_version cellranger-atac version. Available options: 2.1.0, 2.0.0, 1.2.0, 1.1.0 “2.1.0” “2.1.0”
docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;
  • “cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus” “quay.io/cumulus”
mkfastq_docker_registry Docker registry to use for cellranger-atac mkfastq. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry. “gcr.io/broad-cumulus” “gcr.io/broad-cumulus”
acronym_file
The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names.
Set a GS URI if backend is gcp, an S3 URI if backend is aws, or an absolute file path if backend is local.
“s3://xxxx/index.tsv” “gs://regev-lab/resources/cellranger/index.tsv”
zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
atac_num_cpu Number of cpus for cellranger-atac count 64 64
atac_memory Memory string for cellranger-atac count “57.6G” “57.6G”
mkfastq_disk_space Optional disk space in GB for cellranger-atac mkfastq 1500 1500
atac_disk_space Disk space in GB needed for cellranger-atac count 500 500
backend

Cloud backend for file transfer. Available options:

  • “gcp” for Google Cloud;
  • “aws” for Amazon AWS;
  • “local” for local machine.
“gcp” “gcp”
preemptible Number of preemptible tries 2 2
awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5
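If you prefer to prepare the inputs as a JSON file rather than typing them into a form, the table above maps naturally onto a dictionary. The sketch below assumes the usual WDL convention of prefixing each input with the workflow name (here cellranger_workflow); the helper build_inputs is hypothetical and only a subset of the inputs is shown.

```python
import json

WORKFLOW_NAME = "cellranger_workflow"  # assumed prefix, per WDL inputs convention


def build_inputs(input_csv_file, output_directory, **optional):
    """Assemble an inputs dictionary; optional inputs set to None are skipped."""
    inputs = {
        f"{WORKFLOW_NAME}.input_csv_file": input_csv_file,
        f"{WORKFLOW_NAME}.output_directory": output_directory,
    }
    for name, value in optional.items():
        if value is not None:
            inputs[f"{WORKFLOW_NAME}.{name}"] = value
    return inputs


inputs = build_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
    "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_output",
    run_mkfastq=True,
    run_count=True,
    cellranger_atac_version="2.1.0",
    force_cells=None,  # left unset, so it is omitted from the JSON
)
print(json.dumps(inputs, indent=2))
```

Leaving optional inputs out of the dictionary lets the workflow fall back to the defaults listed in the table.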

Workflow output

See the table below for important scATAC-seq outputs.

Name Type Description
cellranger_atac_mkfastq.output_fastqs_directory Array[String]? Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cellranger_atac_count.output_count_directory Array[String]? Subworkflow output. A list of cloud urls containing cellranger-atac count outputs, one url per sample.
cellranger_atac_count.output_web_summary Array[File]? Subworkflow output. A list of HTML files visualizing QCs for each sample (cellranger-atac count output).
collect_summaries_atac.metrics_summaries File? Task output. An Excel spreadsheet containing QCs for each sample.

Aggregate scATAC-Seq Samples

To aggregate multiple scATAC-Seq samples, follow the instructions below:

  1. Import cellranger_atac_aggr workflow. Please see Step 1 here, and the name of workflow is “cumulus/cellranger_atac_aggr”.
  2. Set the inputs of workflow. Please see the description of inputs below. Notice that required inputs are shown in bold:
Name Description Example Default
aggr_id Aggregate ID. “aggr_sample”  
input_counts_directories A string containing comma-separated URLs to directories of samples to be aggregated. “gs://fc-e0000000-0000-0000-0000-000000000000/data/sample1,gs://fc-e0000000-0000-0000-0000-000000000000/data/sample2”  
output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/aggregate_result”  
genome The reference genome name used by Cell Ranger, can be either a keyword of pre-built genome, or a Google Bucket URL. See this table for the list of keywords of pre-built genomes. “GRCh38_atac_v1.2.0”  
normalize Sample normalization mode. Options are: none, depth, or signal. “none” “none”
secondary Perform secondary analysis (dimensionality reduction, clustering and visualization). false false
dim_reduce Choose the algorithm for dimensionality reduction prior to clustering and tsne. Options are: lsa, plsa, or pca. “lsa” “lsa”
peaks A 3-column BED file of peaks to override cellranger atac peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with # are allowed “gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”  
cellranger_atac_version Cell Ranger ATAC version to use. Options: 2.1.0, 2.0.0, 1.2.0, 1.1.0. “2.1.0” “2.1.0”
zones Google cloud zones “us-central1-a us-west1-a” “us-central1-b”
num_cpu Number of cpus to request for cellranger atac aggr. 64 64
backend

Cloud backend for file transfer. Available options:

  • “gcp” for Google Cloud;
  • “aws” for Amazon AWS;
  • “local” for local machine.
“gcp” “gcp”
memory Memory size string for cellranger atac aggr. “57.6G” “57.6G”
disk_space Disk space in GB needed for cellranger atac aggr. 500 500
preemptible Number of preemptible tries. 2 2
docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;
  • “cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus” “quay.io/cumulus”
  3. Check out the output in output_directory/aggr_id folder, where output_directory and aggr_id are the inputs you set in Step 2.
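The input_counts_directories string and the location of the aggregated result can be derived from a list of per-sample directories. This is a small sketch with hypothetical helper names (join_count_directories, aggr_result_location); it only formats strings and does not contact any bucket.

```python
def join_count_directories(directories):
    """Build the comma-separated string expected by input_counts_directories."""
    cleaned = [d.strip().rstrip("/") for d in directories if d.strip()]
    if not cleaned:
        raise ValueError("at least one count directory is required")
    return ",".join(cleaned)


def aggr_result_location(output_directory, aggr_id):
    """Return the folder where the aggregated output is expected to appear."""
    return f"{output_directory.rstrip('/')}/{aggr_id}"


bucket = "gs://fc-e0000000-0000-0000-0000-000000000000"
counts = join_count_directories([f"{bucket}/data/sample1", f"{bucket}/data/sample2/"])
result = aggr_result_location(f"{bucket}/aggregate_result", "aggr_sample")
print(counts)
print(result)
```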

Single-cell immune profiling

To process single-cell immune profiling (scIR-seq) data, follow the specific instructions below.

Sample sheet

  1. Reference column.

    Pre-built scIR-seq references are summarized below.

    Keyword Description
    GRCh38_vdj_v7.0.0 Human GRCh38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
    GRCm38_vdj_v7.0.0 Mouse GRCm38 V(D)J sequences, cellranger reference 7.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf
    GRCh38_vdj_v5.0.0 Human GRCh38 V(D)J sequences, cellranger reference 5.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
    GRCm38_vdj_v5.0.0 Mouse GRCm38 V(D)J sequences, cellranger reference 5.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf
    GRCh38_vdj_v4.0.0 Human GRCh38 V(D)J sequences, cellranger reference 4.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
    GRCm38_vdj_v4.0.0 Mouse GRCm38 V(D)J sequences, cellranger reference 4.0.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf
    GRCh38_vdj_v3.1.0 Human GRCh38 V(D)J sequences, cellranger reference 3.1.0, annotation built from Ensembl Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
    GRCm38_vdj_v3.1.0 Mouse GRCm38 V(D)J sequences, cellranger reference 3.1.0, annotation built from Ensembl Mus_musculus.GRCm38.94.gtf
    GRCh38_vdj_v2.0.0 or GRCh38_vdj Human GRCh38 V(D)J sequences, cellranger reference 2.0.0, annotation built from Ensembl Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf and vdj_GRCh38_alts_ensembl_10x_genes-2.0.0.gtf
    GRCm38_vdj_v2.2.0 or GRCm38_vdj Mouse GRCm38 V(D)J sequences, cellranger reference 2.2.0, annotation built from Ensembl Mus_musculus.GRCm38.90.chr_patch_hapl_scaff.gtf
  2. Index column.

  3. Chemistry column.

    This column is not used for scIR-seq data. Put fiveprime here as a placeholder if you decide to include the Chemistry column.

  4. DataType column.

    Set it to vdj.

  5. FeatureBarcodeFile column.

    Leave it blank for scIR-seq.

  6. Example:

    Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType
    sample_vdj,GRCh38_vdj_v3.1.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ,1,SI-GA-A1,fiveprime,vdj
    

Workflow input

For scIR-seq data, cellranger_workflow takes Illumina outputs as input and runs cellranger mkfastq and cellranger vdj. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name Description Example Default
input_csv_file Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile as optional) “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”  
output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”  
run_mkfastq If you want to run cellranger mkfastq true true
run_count If you want to run cellranger vdj true true
delete_input_bcl_directory Whether to delete BCL directories after demux. If false, you should delete this folder yourself so as to not incur storage charges false false
mkfastq_barcode_mismatches Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1) 0  
mkfastq_force_single_index If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair false false
mkfastq_filter_single_index Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed false false
mkfastq_use_bases_mask Override the read lengths as specified in RunInfo.xml “Y28n*,I8n*,N10,Y90n*”  
mkfastq_delete_undetermined Delete undetermined FASTQ files generated by bcl2fastq2 true false
vdj_denovo Do not align reads to reference V(D)J sequences before de novo assembly false false
vdj_chain

Force the analysis to be carried out for a particular chain type. The accepted values are:

  • “auto” for auto detection based on TR vs IG representation;
  • “TR” for T cell receptors;
  • “IG” for B cell receptors.
“auto” “auto”
cellranger_version cellranger version, could be 7.0.0, 6.1.2, 6.1.1, 6.0.2, 6.0.1, 6.0.0, 5.0.1, 5.0.0 “7.0.0” “7.0.0”
docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;
  • “cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus” “quay.io/cumulus”
mkfastq_docker_registry Docker registry to use for cellranger mkfastq. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry. “gcr.io/broad-cumulus” “gcr.io/broad-cumulus”
acronym_file
The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names.
Set a GS URI if backend is gcp, an S3 URI if backend is aws, or an absolute file path if backend is local.
“s3://xxxx/index.tsv” “gs://regev-lab/resources/cellranger/index.tsv”
zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu Number of cpus to request for one node for cellranger mkfastq and cellranger vdj 32 32
memory Memory size string for cellranger mkfastq and cellranger vdj “120G” “120G”
mkfastq_disk_space Optional disk space in GB for mkfastq 1500 1500
vdj_disk_space Disk space in GB needed for cellranger vdj 500 500
backend

Cloud backend for file transfer. Available options:

  • “gcp” for Google Cloud;
  • “aws” for Amazon AWS;
  • “local” for local machine.
“gcp” “gcp”
preemptible Number of preemptible tries 2 2
awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5

Workflow output

See the table below for important scIR-seq outputs.

Name Type Description
cellranger_mkfastq.output_fastqs_directory Array[String]? Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cellranger_vdj.output_vdj_directory Array[String]? Subworkflow output. A list of cloud urls containing vdj results, one url per sample.
cellranger_vdj.output_web_summary Array[File]? Subworkflow output. A list of HTML files visualizing QCs for each sample (cellranger vdj output).
collect_summaries_vdj.metrics_summaries File? Task output. An Excel spreadsheet containing QCs for each sample.

Single-cell multiomics

To utilize cellranger arc/cellranger multi/cellranger count for single-cell multiomics, follow the specific instructions below. In particular, we put each single modality on a separate line in the sample sheet as described above. We then use the Link column to link multiple modalities together. Depending on the modalities included, cellranger arc (Multiome ATAC + Gene Expression), cellranger multi (CellPlex), or cellranger count (Feature Barcode) will be triggered. Note that cumulus_feature_barcoding/demuxEM will not be triggered for hashing/citeseq in this setting.

Sample sheet

  1. Reference column.

    Pre-built Multiome ATAC + Gene Expression references are summarized below. CellPlex and Feature Barcode use the same reference as in Single-cell and single-nucleus RNA-seq.

    Keyword Description
    GRCh38-2020-A_arc_v2.0.0 Human GRCh38 sequences (GENCODE v32/Ensembl 98), cellranger arc reference 2.0.0
    mm10-2020-A_arc_v2.0.0 Mouse GRCm38 sequences (GENCODE vM23/Ensembl 98), cellranger arc reference 2.0.0
    GRCh38-2020-A_arc_v1.0.0 Human GRCh38 sequences (GENCODE v32/Ensembl 98), cellranger arc reference 1.0.0
    mm10-2020-A_arc_v1.0.0 Mouse GRCm38 sequences (GENCODE vM23/Ensembl 98), cellranger arc reference 1.0.0
  2. DataType column.

    For each modality, set it to the corresponding data type.

  3. FeatureBarcodeFile column.

    For RNA-seq modality, only set this if a target panel is provided. For CMO (CellPlex), provide sample name - CMO tag association as follows:

    sample1,CMO301|CMO302
    sample2,CMO303
    

    For CITESeq, Perturb-seq and hashing, provide one CSV file as defined in Feature Barcode Reference. Note that one feature barcode reference should be provided for all feature-barcode related modalities (e.g. citeseq, hashing, crispr) and all these modalities should put the same reference file in FeatureBarcodeFile column.

  4. Link column.

    Give all modalities that are linked the same link name, unique to that sample.

  5. Example:

    Sample,Reference,Flowcell,Lane,Index,DataType,FeatureBarcodeFile,Link
    sample1_rna,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ,*,SI-TT-A1,rna,,sample1
    sample1_atac,GRCh38-2020-A_arc_v2.0.0,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZZ,*,SI-TT-N1,atac,,sample1
    sample2_rna,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZX,*,SI-TT-A2,rna,,sample2
    sample2_cmo,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZX,*,SI-TT-N2,cmo,gs://fc-e0000000-0000-0000-0000-000000000000/cmo.csv,sample2
    sample3_rna,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZY,*,SI-TT-A3,rna,,sample3
    sample3_citeseq,GRCh38-2020-A,gs://fc-e0000000-0000-0000-0000-000000000000/VK10WBC9ZY,*,SI-TT-N3,citeseq,gs://fc-e0000000-0000-0000-0000-000000000000/feature_ref.csv,sample3
    

In the above example, three linked samples are provided. cellranger arc, cellranger multi and cellranger count will be triggered respectively.
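To make the linking rule concrete, the sketch below groups sample sheet rows by their Link value and reports which tool each group would trigger. The dispatch order (atac first, then cmo, then feature barcoding types) is an assumption inferred from the description above, not taken from the workflow source, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

FEATURE_BARCODE_TYPES = {"citeseq", "hashing", "crispr"}


def tool_for_link(data_types):
    """Guess the tool triggered by the data types sharing one Link value."""
    types = set(data_types)
    if "atac" in types:
        return "cellranger-arc count"  # Multiome ATAC + Gene Expression
    if "cmo" in types:
        return "cellranger multi"  # CellPlex
    if types & FEATURE_BARCODE_TYPES:
        return "cellranger count"  # Feature Barcode
    raise ValueError(f"no multiomics tool matches data types {sorted(types)}")


def tools_by_link(handle):
    """Map each Link value in a sample sheet to the tool it would trigger."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["Link"]].append(row["DataType"])
    return {link: tool_for_link(types) for link, types in grouped.items()}


BUCKET = "gs://fc-e0000000-0000-0000-0000-000000000000"
sheet = io.StringIO(
    "Sample,Reference,Flowcell,Lane,Index,DataType,FeatureBarcodeFile,Link\n"
    f"sample1_rna,GRCh38-2020-A_arc_v2.0.0,{BUCKET}/VK10WBC9ZZ,*,SI-TT-A1,rna,,sample1\n"
    f"sample1_atac,GRCh38-2020-A_arc_v2.0.0,{BUCKET}/VK10WBC9ZZ,*,SI-TT-N1,atac,,sample1\n"
    f"sample2_rna,GRCh38-2020-A,{BUCKET}/VK10WBC9ZX,*,SI-TT-A2,rna,,sample2\n"
    f"sample2_cmo,GRCh38-2020-A,{BUCKET}/VK10WBC9ZX,*,SI-TT-N2,cmo,{BUCKET}/cmo.csv,sample2\n"
    f"sample3_rna,GRCh38-2020-A,{BUCKET}/VK10WBC9ZY,*,SI-TT-A3,rna,,sample3\n"
    f"sample3_citeseq,GRCh38-2020-A,{BUCKET}/VK10WBC9ZY,*,SI-TT-N3,citeseq,{BUCKET}/feature_ref.csv,sample3\n"
)
triggered = tools_by_link(sheet)
print(triggered)
```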

Workflow input

For single-cell multiomics data, cellranger_workflow takes Illumina outputs as input and runs cellranger-arc mkfastq/cellranger mkfastq and cellranger-arc count/cellranger multi/cellranger count. Relevant workflow inputs are described below, with required inputs highlighted in bold.

Name Description Example Default
input_csv_file Sample Sheet (contains Sample, Reference, Flowcell, Lane, Index as required and Chemistry, DataType, FeatureBarcodeFile, Link as optional) “gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv”  
output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_output”  
run_mkfastq If you want to run cellranger-arc mkfastq/cellranger mkfastq true true
run_count If you want to run cellranger-arc count/cellranger multi/cellranger count true true
delete_input_bcl_directory Whether to delete BCL directories after demux. If false, you should delete this folder yourself so as to not incur storage charges false false
mkfastq_barcode_mismatches Number of mismatches allowed in matching barcode indices (bcl2fastq2 default is 1) 0  
mkfastq_force_single_index If 10x-supplied i7/i5 paired indices are specified, but the flowcell was run with only one sample index, allow the demultiplex to proceed using the i7 half of the sample index pair false false
mkfastq_filter_single_index Only demultiplex samples identified by an i7-only sample index, ignoring dual-indexed samples. Dual-indexed samples will not be demultiplexed false false
mkfastq_use_bases_mask Override the read lengths as specified in RunInfo.xml “Y28n*,I8n*,N10,Y90n*”  
mkfastq_delete_undetermined Delete undetermined FASTQ files generated by bcl2fastq2 true false
force_cells Force pipeline to use this number of cells, bypassing the cell detection algorithm, mutually exclusive with expect_cells. This option is used by cellranger multi and cellranger count. 6000  
expect_cells Expected number of recovered cells. Mutually exclusive with force_cells. This option is used by cellranger multi and cellranger count. 3000  
include_introns Turn this option on to also count reads mapping to intronic regions. With this option, users do not need to use pre-mRNA references. Note that if this option is set, cellranger_version must be >= 5.0.0. This option is used by cellranger multi and cellranger count. true true
arc_gex_exclude_introns
Disable counting of intronic reads. In this mode, only reads that are exonic and compatible with annotated splice junctions in the reference are counted.
Note: using this mode will reduce the UMI counts in the feature-barcode matrix.
false false
no_bam Turn this option on to disable BAM file generation. This option is only available if cellranger_version >= 5.0.0. This option is used by cellranger-arc count, cellranger multi and cellranger count. false false
arc_min_atac_count
Cell caller override to define the minimum number of ATAC transposition events in peaks (ATAC counts) for a cell barcode.
Note: this input must be specified in conjunction with arc_min_gex_count input.
With both inputs set, a barcode is defined as a cell if it contains at least arc_min_atac_count ATAC counts AND at least arc_min_gex_count GEX UMI counts.
100  
arc_min_gex_count
Cell caller override to define the minimum number of GEX UMI counts for a cell barcode.
Note: this input must be specified in conjunction with arc_min_atac_count. See the description of arc_min_atac_count input for details.
200  
peaks A 3-column BED file of peaks to override cellranger arc peak caller. Peaks must be sorted by position and not contain overlapping peaks; comment lines beginning with # are allowed “gs://fc-e0000000-0000-0000-0000-000000000000/common_peaks.bed”  
secondary Perform Cell Ranger secondary analysis (dimensionality reduction, clustering, etc.). This option is used by cellranger multi and cellranger count. false false
cmo_set CMO set CSV file, declaring CMO constructs and associated barcodes. See CMO reference for details. Used only for cellranger multi. “gs://fc-e0000000-0000-0000-0000-000000000000/cmo_set.csv”  
cellranger_version cellranger version, could be 7.0.0, 6.1.2, 6.1.1, 6.0.2, 6.0.1, 6.0.0, 5.0.1, 5.0.0 “7.0.0” “7.0.0”
cellranger_arc_version cellranger-arc version, could be 2.0.1, 2.0.0, 1.0.1, 1.0.0 “2.0.1” “2.0.1”
docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;
  • “cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus” “quay.io/cumulus”
mkfastq_docker_registry Docker registry to use for cellranger-arc mkfastq/cellranger mkfastq. Default is the registry to which only Broad users have access. See bcl2fastq for making your own registry. “gcr.io/broad-cumulus” “gcr.io/broad-cumulus”
acronym_file
The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names.
Set a GS URI if backend is gcp, an S3 URI if backend is aws, or an absolute file path if backend is local.
“s3://xxxx/index.tsv” “gs://regev-lab/resources/cellranger/index.tsv”
zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu Number of cpus to request for one node for cellranger mkfastq and cellranger vdj 32 32
memory Memory size string for cellranger/cellranger-arc mkfastq and cellranger vdj “120G” “120G”
mkfastq_disk_space Optional disk space in GB for mkfastq 1500 1500
count_disk_space Disk space in GB needed for cellranger count 500 500
arc_num_cpu Number of cpus to request for one node for cellranger-arc count 64 64
arc_memory Memory size string for cellranger-arc count “160G” “160G”
arc_disk_space Disk space in GB needed for cellranger-arc count 700 700
backend

Cloud backend for file transfer. Available options:

  • “gcp” for Google Cloud;
  • “aws” for Amazon AWS;
  • “local” for local machine.
“gcp” “gcp”
preemptible Number of preemptible tries 2 2
awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5
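Two constraints in the table above are easy to get wrong: force_cells and expect_cells are mutually exclusive, and arc_min_atac_count and arc_min_gex_count must be given together. The sketch below checks both and restates the cell-calling override as a predicate. The function names are hypothetical, and the comparison uses "at least" exactly as worded in the table.

```python
def cell_calling_problems(force_cells=None, expect_cells=None,
                          arc_min_atac_count=None, arc_min_gex_count=None):
    """Return a list of conflicts among the cell-calling inputs."""
    problems = []
    if force_cells is not None and expect_cells is not None:
        problems.append("force_cells and expect_cells are mutually exclusive")
    if (arc_min_atac_count is None) != (arc_min_gex_count is None):
        problems.append(
            "arc_min_atac_count and arc_min_gex_count must be set together"
        )
    return problems


def is_cell(atac_counts, gex_umi_counts, arc_min_atac_count, arc_min_gex_count):
    """A barcode is a cell only if it meets BOTH minimum thresholds."""
    return (atac_counts >= arc_min_atac_count
            and gex_umi_counts >= arc_min_gex_count)


print(cell_calling_problems(force_cells=6000, expect_cells=3000))
print(is_cell(150, 250, arc_min_atac_count=100, arc_min_gex_count=200))
```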

Workflow output

See the table below for important single-cell multiomics outputs.

Name Type Description
cellranger_arc_mkfastq.output_fastqs_directory / cellranger_mkfastq.output_fastqs_directory Array[String]? Subworkflow output. A list of cloud urls containing FASTQ files, one url per flowcell.
cellranger_arc_count.output_count_directory / cellranger_multi.output_multi_directory / cellranger_count_fbc.output_count_directory Array[String]? Subworkflow output. A list of cloud urls containing cellranger-arc count, cellranger multi or cellranger count outputs, one url per sample.
cellranger_arc_count.output_web_summary / cellranger_count_fbc.output_web_summary Array[File]? A list of HTML files visualizing QCs for each sample (cellranger-arc count / cellranger count output).
collect_summaries_arc.metrics_summaries / collect_summaries_fbc.metrics_summaries File? An Excel spreadsheet containing QCs for each sample.

Build Cell Ranger References

We provide routines wrapping Cell Ranger tools to build references for sc/snRNA-seq, scATAC-seq and single-cell immune profiling data.

Build references for sc/snRNA-seq

We provide a wrapper of cellranger mkref to build sc/snRNA-seq references. Please follow the instructions below.

1. Import cellranger_create_reference

Import cellranger_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/kalarman-cell-observatory/cumulus/Cellranger_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket

Required data may include input sample sheet, genome FASTA files and gene annotation GTF files.

3. Input sample sheet

If multiple species are specified, a sample sheet in CSV format is required. We describe the sample sheet format below, with required columns highlighted in bold:

Column Description
Genome Genome name
Fasta Location to the genome assembly in FASTA/FASTA.gz format
Genes Location to the gene annotation file in GTF/GTF.gz format
Attributes Optional, A list of key:value pairs separated by ;. If set, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details

Please note that the columns in the CSV can be in any order, but that the column names must match the recognized headings.

See below for an example:

Genome,Fasta,Genes,Attributes
GRCh38,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.gtf.gz,gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense
mm10,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.fa.gz,gs://fc-e0000000-0000-0000-0000-000000000000/mm10.gtf.gz

If multiple species are specified, the reference will be built under the Genome names concatenated by ‘_and_’. In the above example, the reference is stored under ‘GRCh38_and_mm10’.
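The naming rule for multi-species references can be reproduced from the sample sheet alone. The sketch below joins the Genome column with '_and_' in row order. The function name reference_name is hypothetical, and the assumption that row order determines the order of names is inferred from the example.

```python
import csv
import io


def reference_name(handle):
    """Join the Genome column of a reference sample sheet with '_and_'."""
    genomes = [
        row["Genome"].strip()
        for row in csv.DictReader(handle)
        if (row.get("Genome") or "").strip()
    ]
    return "_and_".join(genomes)


BUCKET = "gs://fc-e0000000-0000-0000-0000-000000000000"
sheet = io.StringIO(
    "Genome,Fasta,Genes,Attributes\n"
    f"GRCh38,{BUCKET}/GRCh38.fa.gz,{BUCKET}/GRCh38.gtf.gz,"
    "gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense\n"
    f"mm10,{BUCKET}/mm10.fa.gz,{BUCKET}/mm10.gtf.gz\n"
)
name = reference_name(sheet)
print(name)
```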

4. Workflow input

Required inputs are highlighted in bold. Note that input_sample_sheet is mutually exclusive with input_fasta, input_gtf, genome and attributes.

Name Description Example Default
input_sample_sheet A sample sheet in CSV format that allows users to specify more than one genome to build references (e.g. human and mouse). If a sample sheet is provided, input_fasta, input_gtf, and attributes will be ignored. “gs://fc-e0000000-0000-0000-0000-000000000000/input_sample_sheet.csv”  
input_fasta Input genome reference in either FASTA or FASTA.gz format “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz”  
input_gtf Input gene annotation file in either GTF or GTF.gz format “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz”  
genome Genome reference name. New reference will be stored in a folder named genome refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0  
output_directory Output directory “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_reference”  
attributes A list of key:value pairs separated by ;. If this option is not None, cellranger mkgtf will be called to filter the user-provided GTF file. See 10x filter with mkgtf for more details “gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense”  
pre_mrna Whether to build a pre-mRNA reference, in which full-length transcripts are used as exons in the annotation file. We follow 10x build Cell Ranger compatible pre-mRNA Reference Package to build pre-mRNA references true false
ref_version reference version string Ensembl v94  
cellranger_version cellranger version, could be: 7.0.0, 6.1.2, 6.1.1 “7.0.0” “7.0.0”
docker_registry

Docker registry to use for cellranger_workflow. Options:

  • “quay.io/cumulus” for images on Red Hat registry;
  • “cumulusprod” for backup images on Docker Hub.
“quay.io/cumulus” “quay.io/cumulus”
zones Google cloud zones “us-central1-a us-west1-a” “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
num_cpu Number of cpus to request for one node for building indices 1 1
memory Memory size string for cellranger mkref “32G” “32G”
disk_space Optional disk space in GB 100 100
backend

Cloud backend for file transfer. Available options:

  • “gcp” for Google Cloud;
  • “aws” for Amazon AWS;
  • “local” for local machine.
“gcp” “gcp”
preemptible Number of preemptible tries 2 2
awsMaxRetries Number of maximum retries when running on AWS. This works only when backend is aws. 5 5
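The attributes string is a list of key:value pairs separated by semicolons, each of which corresponds to one --attribute option of cellranger mkgtf. The sketch below shows that mapping. How the workflow assembles the command internally is not shown in this document, so treat the function mkgtf_command and the file names as illustrative only.

```python
def mkgtf_command(input_gtf, output_gtf, attributes):
    """Translate an attributes string into a cellranger mkgtf argument list."""
    pairs = [pair.strip() for pair in attributes.split(";") if pair.strip()]
    for pair in pairs:
        key, separator, value = pair.partition(":")
        if not (key and separator and value):
            raise ValueError(f"expected key:value, got {pair!r}")
    options = [f"--attribute={pair}" for pair in pairs]
    return ["cellranger", "mkgtf", input_gtf, output_gtf] + options


command = mkgtf_command(
    "GRCh38.gtf",           # hypothetical local file names
    "GRCh38.filtered.gtf",
    "gene_biotype:protein_coding;gene_biotype:lincRNA;gene_biotype:antisense",
)
print(" ".join(command))
```

Returning a list rather than a single string keeps each argument separate, which is the safer form to hand to subprocess.run if you run the command yourself.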

5. Workflow output

Name Type Description
output_reference File Gzipped reference folder with name genome.tar.gz. We will also store a copy of the gzipped tarball under output_directory specified in the input.

Build references for scATAC-seq

We provide a wrapper of cellranger-atac mkref to build scATAC-seq references. Please follow the instructions below.

1. Import cellranger_atac_create_reference

Import cellranger_atac_create_reference workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Cellranger_atac_create_reference to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export cellranger_atac_create_reference workflow in the drop-down menu.

2. Upload required data to Google Bucket

Required data include config JSON file, genome FASTA file, gene annotation file (GTF or GFF3 format) and motif input file (JASPAR format).

3. Workflow input

Required inputs are highlighted in bold.

| Name | Description | Example | Default |
|---|---|---|---|
| genome | Genome reference name. The new reference will be stored in a folder named genome | refdata-cellranger-atac-mm10-1.1.0 | |
| input_fasta | GSURL for input FASTA file | “gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa” | |
| input_gtf | GSURL for input GTF file | “gs://fc-e0000000-0000-0000-0000-000000000000/annotation.gtf” | |
| organism | Name of the organism | “human” | |
| non_nuclear_contigs | A comma-separated list of names of contigs that are not in the nucleus | “chrM” | “chrM” |
| input_motifs | Optional file containing transcription factor motifs in JASPAR format | “gs://fc-e0000000-0000-0000-0000-000000000000/motifs.pfm” | |
| output_directory | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_reference” | |
| cellranger_atac_version | cellranger-atac version, could be: 2.1.0, 2.0.0, 1.2.0, 1.1.0 | “2.1.0” | “2.1.0” |
| docker_registry | Docker registry to use for this workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
| zones | Google cloud zones | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
| memory | Memory size string for cellranger-atac mkref | “32G” | “32G” |
| disk_space | Optional disk space in GB | 100 | 100 |
| backend | Cloud backend for file transfer. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine. | “gcp” | “gcp” |
| preemptible | Number of preemptible tries | 2 | 2 |
| awsMaxRetries | Maximum number of retries when running on AWS. This applies only when backend is aws. | 5 | 5 |
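
For illustration, the inputs above could be collected in a WDL-style inputs JSON file. This is only a sketch: it assumes the usual "workflow name.input name" key convention and that the workflow is named cellranger_atac_create_reference. The output file name atac_reference_inputs.json and the genome value GRCh38_atac_ref are hypothetical. The remaining values are the examples from the table, and inputs left out fall back to their defaults.

```shell
# Write a hypothetical inputs file for cellranger_atac_create_reference.
# Assumption: keys follow the "<workflow name>.<input name>" convention.
# "GRCh38_atac_ref" is a made-up reference name; the gs:// URLs use the
# placeholder bucket from this page.
cat > atac_reference_inputs.json <<'EOF'
{
    "cellranger_atac_create_reference.genome": "GRCh38_atac_ref",
    "cellranger_atac_create_reference.input_fasta": "gs://fc-e0000000-0000-0000-0000-000000000000/GRCh38.fa",
    "cellranger_atac_create_reference.input_gtf": "gs://fc-e0000000-0000-0000-0000-000000000000/annotation.gtf",
    "cellranger_atac_create_reference.organism": "human",
    "cellranger_atac_create_reference.non_nuclear_contigs": "chrM",
    "cellranger_atac_create_reference.input_motifs": "gs://fc-e0000000-0000-0000-0000-000000000000/motifs.pfm",
    "cellranger_atac_create_reference.output_directory": "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_atac_reference"
}
EOF
```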

4. Workflow output

| Name | Type | Description |
|---|---|---|
| output_reference | File | Gzipped reference folder named genome.tar.gz. A copy of the gzipped tarball is also stored under the output_directory specified in the input. |

Build references for single-cell immune profiling data

We provide a wrapper of cellranger mkvdjref to build single-cell immune profiling references. Please follow the instructions below.

1. Import cellranger_vdj_create_reference

Import the cellranger_vdj_create_reference workflow to your workspace by following the instructions in Import workflows to Terra. Choose github.com/lilab-bcb/cumulus/Cellranger_vdj_create_reference to import.

Then, on the workflow page, click the Export to Workspace... button and select, from the drop-down menu, the workspace to which you want to export the cellranger_vdj_create_reference workflow.

2. Upload required data to Google Bucket

Required data include a genome FASTA file and a gene annotation file (GTF format).

3. Workflow input

Required inputs are highlighted in bold.

| Name | Description | Example | Default |
|---|---|---|---|
| input_fasta | Input genome reference in either FASTA or FASTA.gz format | “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz” | |
| input_gtf | Input gene annotation file in either GTF or GTF.gz format | “gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz” | |
| genome | Genome reference name. The new reference will be stored in a folder named genome | refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0 | |
| output_directory | Output directory | “gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_vdj_reference” | |
| ref_version | Reference version string | Ensembl v94 | |
| cellranger_version | cellranger version, could be: 7.0.0, 6.1.2, 6.1.1 | “7.0.0” | “7.0.0” |
| docker_registry | Docker registry to use for this workflow. Options: “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub. | “quay.io/cumulus” | “quay.io/cumulus” |
| zones | Google cloud zones | “us-central1-a us-west1-a” | “us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
| memory | Memory size string for cellranger mkvdjref | “32G” | “32G” |
| disk_space | Optional disk space in GB | 100 | 100 |
| backend | Cloud backend for file transfer. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine. | “gcp” | “gcp” |
| preemptible | Number of preemptible tries | 2 | 2 |
| awsMaxRetries | Maximum number of retries when running on AWS. This applies only when backend is aws. | 5 | 5 |
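
As a hedged illustration, the inputs for this workflow could be written to a WDL-style inputs JSON file. The sketch assumes the usual "workflow name.input name" key convention and that the workflow is named cellranger_vdj_create_reference. The file name vdj_reference_inputs.json is hypothetical. All values are the examples from the table, and inputs with defaults are omitted.

```shell
# Write a hypothetical inputs file for cellranger_vdj_create_reference.
# Assumption: keys follow the "<workflow name>.<input name>" convention.
# Values are the examples from the input table; replace the placeholder
# bucket with your own workspace bucket.
cat > vdj_reference_inputs.json <<'EOF'
{
    "cellranger_vdj_create_reference.input_fasta": "gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
    "cellranger_vdj_create_reference.input_gtf": "gs://fc-e0000000-0000-0000-0000-000000000000/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz",
    "cellranger_vdj_create_reference.genome": "refdata-cellranger-vdj-GRCh38-alts-ensembl-3.1.0",
    "cellranger_vdj_create_reference.output_directory": "gs://fc-e0000000-0000-0000-0000-000000000000/cellranger_vdj_reference",
    "cellranger_vdj_create_reference.ref_version": "Ensembl v94"
}
EOF
```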

4. Workflow output

| Name | Type | Description |
|---|---|---|
| output_reference | File | Gzipped reference folder named genome.tar.gz. A copy of the gzipped tarball is also stored under the output_directory specified in the input. |