snRNA-Seq data

This demultiplexing workflow generates gene-count matrices from cell-hashing/nucleus-hashing/genetic-pooling data by demultiplexing.

In the workflow, demuxEM is used for analyzing cell-hashing/nucleus-hashing data, while souporcell and popscle (including demuxlet and freemuxlet) are for genetic-pooling data.

Prepare input data and import workflow

1. Run `cellranger_workflow`

To demultiplex, you’ll need raw gene count and hashtag matrices for cell-hashing/nucleus-hashing data, or raw gene count matrices and genome BAM files for genetic-pooling data. You can generate these data by running the cellranger_workflow.

Please refer to the cellranger_workflow tutorial for details.

When finished, you should be able to find the raw gene count matrix (e.g. raw_gene_bc_matrices_h5.h5), hashtag matrix (e.g. sample_1_ADT.csv) / genome BAM file (e.g. possorted_genome_bam.bam) for each sample.

2. Import `demultiplexing`

Import demultiplexing workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Demultiplexing to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export demultiplexing workflow in the drop-down menu.

3. Prepare a sample sheet

3.1 Sample sheet format:

Create a sample sheet, sample_sheet_demux.csv, which describes the metadata for each pair of RNA and hashtag data. A brief description of the sample sheet format is listed below (required column headers are shown in bold).

Column

Description

OUTNAME

Output name for one pair of RNA and hashtag data. Must be unique per pair.

RNA

Google bucket url to the raw gene count matrix generated in Step 1.

TagFile/ADT

Google bucket url to the hashtag file generated in Step 1. The column name can be either TagFile or ADT, where ADT is for backward compatibility with older snapshots.

TYPE

Assay type, which can be cell-hashing, nucleus-hashing, or genetic-pooling.

Genotype

Google bucket url to the reference genotypes in vcf.gz format. This column is required in the following cases:

Run genetic-pooling assay with souporcell algorithm (i.e. TYPE is genetic-pooling, demultiplexing_algorithm input is souporcell):

Run with reference genotypes, i.e. souporcell_de_novo_mode is false.

Run in de novo mode (i.e. souporcell_de_novo_mode is true), but need to match the resulting cluster names by information from reference genotypes (see description of souporcell_rename_donors input below).

Run genetic-pooling assay with popscle algorithm (i.e. TYPE is genetic-pooling, demultiplexing_algorithm input is popscle):

popscle_num_samples input is 0. In this case, demuxlet will be run with reference genotypes.

popscle_num_samples input is larger than 0. In this case, reference genotypes will be only used to generate pileups, then freemuxlet will be used for demultiplexing without reference genotypes.

Example:
OUTNAME,RNA,TagFile,TYPE,Genotype
sample_1,gs://exp/data_1/raw_gene_bc_matrices_h5.h5,gs://exp/data_1/sample_1_ADT.csv,cell-hashing
sample_2,gs://exp/data_2/raw_gene_bc_matrices_h5.h5,gs://exp/data_2/sample_2_ADT.csv,nucleus-hashing
sample_3,gs://exp/data_3/raw_gene_bc_matrices_h5.h5,gs://exp/data_3/possorted_genome_bam.bam,genetic-pooling
sample_4,gs://exp/data_4/raw_gene_bc_matrices_h5.h5,gs://exp/data_4/possorted_genome_bam.bam,genetic-pooling,gs://exp/variants/ref_genotypes.vcf.gz
3.2 Upload your sample sheet to the workspace bucket:

Use gsutil (you already have it if you’ve installed gcloud CLI) in your unix terminal to upload your sample sheet to workspace bucket.

Example:
gsutil cp /foo/bar/projects/sample_sheet_demux.csv gs://fc-e0000000-0000-0000-0000-000000000000/

Workflow inputs

Below are inputs for demultiplexing workflow. We’ll first introduce global inputs, and then inputs for each of the demultiplexing tools. Notice that required inputs are in bold.

global inputs

Name	Description	Example	Default
input_sample_sheet	Input CSV file describing metadata of RNA and hashtag data pairing.	“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet_demux.csv”
output_directory	This is the output directory (gs url + path) for all results. There will be one folder per RNA-hashtag data pair under this directory.	“gs://fc-e0000000-0000-0000-0000-000000000000/demux_output”
genome	Reference genome name. Its usage depends on the assay type: For cell-hashing or nucleus-hashing, only write this name as an annotation into the resulting count matrix file. For genetic-pooling, if demultiplexing_algorithm input is `souporcell`, you should choose one name from this genome reference list. For genetic-pooling, if demultiplexing_algorithm input is `popscle`, reference genome name is not needed.	“GRCh38”
demultiplexing_algorithm	demultiplexing algorithm to use for genetic-pooling data. Options: “souporcell”: Use souporcell, a reference-genotypes-free algorithm for demultiplexing droplet scRNA-Seq data. “popscle”: Use popscle, a canonical algorithm for demultiplexing droplet scRNA-Seq data, including demuxlet (with reference genotypes) and freemuxlet (reference-genotype-free) components.	“souporcell”	“souporcell”
min_num_genes	Only demultiplex cells/nuclei with at least <min_num_genes> expressed genes	100	100
zones	Google cloud zones to consider for execution.	“us-east1-d us-west1-a us-west1-b”	“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”
docker_registry	Docker registry to use. “quay.io/cumulus” for images on Red Hat registry; “cumulusprod” for backup images on Docker Hub.	“quay.io/cumulus”	“quay.io/cumulus”
config_version	Version of config docker image to use. This docker is used for parsing the input sample sheet for downstream execution. Available options: `0.2`, `0.1`.	“0.2”	“0.2”
backend	Cloud infrastructure backend to use. Available options: “gcp” for Google Cloud; “aws” for Amazon AWS; “local” for local machine.	“gcp”	“gcp”
ref_index_file	The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names. Set an GS URI if backend is `gcp`; an S3 URI for `aws` backend; an absolute file path for `local` backend.	“s3://xxxx/index.tsv”	“gs://regev-lab/resources/cellranger/index.tsv”
preemptible	Number of maximum preemptible tries allowed. This works only when backend is `gcp`.	2	2
awsQueueArn	The AWS ARN string of the job queue to be used. This only works for `aws` backend.	“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”	“”

demuxEM inputs

Name	Description	Example	Default
demuxEM_alpha_on_samples	demuxEM parameter. The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse.	0.0	0.0
demuxEM_min_num_umis	demuxEM parameter. Only demultiplex cells/nuclei with at least <demuxEM_min_num_umis> of UMIs.	100	100
demuxEM_min_signal_hashtag	demuxEM parameter. Any cell/nucleus with less than <demuxEM_min_signal_hashtag> hashtags from the signal will be marked as unknown.	10.0	10.0
demuxEM_random_state	demuxEM parameter. The random seed used in the KMeans algorithm to separate empty ADT droplets from others.	0	0
demuxEM_generate_diagnostic_plots	demuxEM parameter. If generate a series of diagnostic plots, including the background/signal between HTO counts, estimated background probabilities, HTO distributions of cells and non-cells, etc.	true	true
demuxEM_generate_gender_plot	demuxEM parameter. If generate violin plots using gender-specific genes (e.g. Xist). <demuxEM_generate_gender_plot> is a comma-separated list of gene names	“XIST”
demuxEM_version	demuxEM version to use. Choose from “0.1.7”, “0.1.6” and “0.1.5”.	“0.1.7”	“0.1.7”
demuxEM_num_cpu	demuxEM parameter. Number of CPUs to request for demuxEM per pair.	8	8
demuxEM_memory	demuxEM parameter. Memory size string for demuxEM per pair.	“10G”	“10G”
demuxEM_disk_space	demuxEM parameter. Disk space (integer) in GB needed for demuxEM per pair.	20	20

souporcell inputs

Name	Description	Example	Default
souporcell_version	souporcell version to use. Available versions: `2.5` released on 2023/06/22. `2022.12`: Based on commitment 9fb527 on 2022/12/13. `2021.03`: Based on commitment 1bd9f1 on 2021/03/07. `2020.07`: Based on commitment 0d09fb on 2020/07/27. `2020.03`: Based on commitment eeddcd on 2020/03/31.	“2.5”	“2.5”
souporcell_num_clusters	souporcell parameter. Number of expected clusters when doing clustering. This needs to be set when running souporcell.	8	1
souporcell_de_novo_mode	souporcell parameter. If `true`, run souporcell in de novo mode without reference genotypes: If input souporcell_common_variants is further provided, use this common variants list instead of calling SNPs de novo. If a reference genotype vcf file is provided in the sample sheet, use it only for matching the cluster labels computed by souporcell. If `false`, run souporcell with `--known_genotypes` option using the reference genotype vcf file specified in sample sheet.	true	true
souporcell_num_clusters	souporcell parameter. Number of expected clusters when doing clustering. This needs to be set when running souporcell.	8	1
souporcell_common_variants	souporcell parameter. Users can provide a common variants list in VCF format for Souporcell to use, instead of calling SNPs de novo. Notice: This input is enabled only when souporcell_de_novo_mode is `false`.	“1000genome.common.variants.vcf.gz”
souporcell_skip_remap	souporcell parameter. Skip remap step. Only recommended in non denovo mode or common variants are provided.	true	false
souporcell_rename_donors	souporcell parameter. A comma-separated list of donor names for matching clusters achieved by souporcell. Must be consistent with souporcell_num_clusters input. If this input is empty, use cluster labels from the reference genotype vcf file if provided in the sample sheet; if this vcf file is not provided, simply name clusters as Donor1, Donor2, … If this input is not empty, and a reference genotype vcf file is provided in the sample sheet, first match the cluster labels using those from this vcf file, then rename to donor names specified in this input. If this input is not empty, and NO reference genotype vcf file is provided in the sample sheet, simply match the cluster labels in one-to-one correspondence with donor names specified in this input.	“CB1,CB2,CB3,CB4”
souporcell_num_cpu	souporcell parameter. Number of CPUs to request for souporcell per pair.	32	32
souporcell_memory	souporcell parameter. Memory size string for souporcell per pair.	“120G”	“120G”
souporcell_disk_space	souporcell parameter. Disk space (integer) in GB needed for souporcell per pair.	500	500

Popscle inputs

Name	Description	Example	Default
popscle_num_samples	popscle parameter. Number of samples to be multiplexed together: If `0`, run with demuxlet using reference genotypes. Otherwise, run with freemuxlet in de novo mode without reference genotypes.	4	0
popscle_min_MQ	popscle parameter. Minimum mapping quality to consider (lower MQ will be ignored).	20	20
popscle_min_TD	popscle parameter. Minimum distance to the tail (lower will be ignored).	0	0
popscle_tag_group	popscle parameter. Tag representing readgroup or cell barcodes, in the case to partition the BAM file into multiple groups. For 10x genomics, use `CB`.	“CB”	“CB”
popscle_tag_UMI	popscle parameter. Tag representing UMIs. For 10x genomics, use `UB`.	“UB”	“UB”
popscle_field	popscle parameter. FORMAT field to extract from: genotype (`GT`), genotype likelihood (`GL`), or posterior probability (`GP`).	“GT”	“GT”
popscle_alpha	popscle parameter. Grid of alpha to search for, in a comma separated list format of all alpha values to be considered.	“0.1,0.2,0.3,0.4,0.5”	“0.1,0.2,0.3,0.4,0.5”
popscle_rename_donors	popscle parameter. A comma-separated list of donor names for renaming clusters achieved by popscle. Must be consistent with popscle_num_samples input. By default, the resulting donors are Donor1, Donor2, …	“CB1,CB2,CB3,CB4”
popscle_version	popscle parameter. popscle version to use. Available options: `2021.05`: Based on commitment da70fc7 on 2021/05/05. `0.1b`: Based on version 0.1-beta released on 2019/10/03.	“2021.05”	“2021.05”
popscle_num_cpu	popscle parameter. Number of CPU used by popscle per pair.	1	1
popscle_memory	popscle parameter. Memory size string per pair.	“120G”	“120G”
popscle_extra_disk_space	popscle parameter. Extra disk space size (integer) in GB needed for popscle per pair, besides the disk size required to hold input files specified in the sample sheet.	100	100

Workflow outputs

See the table below for demultiplexing workflow outputs.

Name	Type	Description
output_folders	Array[String]	A list of Google Bucket URLs of the output folders. Each folder is associated with one RNA-hashtag pair in the given sample sheet.
output_zarr_files	Array[File]	A list of demultiplexed RNA count matrices in zarr format. Each zarr file is associated with one RNA-hashtag pair in the given sample sheet. Please refere to section load demultiplexing results into Python and R for its structure.

In the output subfolder of each cell-hashing/nuclei-hashing RNA-hashtag data pair, you can find the following files:

Name	Description
output_name_demux.zarr.zip	Demultiplexed RNA raw count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure.
output_name.out.demuxEM.zarr.zip	This file contains intermediate results for both RNA and hashing count matrices. To load this file into Python, you need to first install Pegasusio on your local machine. Then use `import pegasusio as io; data = io.read_input("output_name.out.demuxEM.zarr.zip")` in Python environment. It contains 2 UnimodalData objects: one with key name suffix `-hashing` is the hashtag count matrix, the other one with key name suffix `-rna` is the demultiplexed RNA count matrix. To load the hashtag count matrix, type `hash_data = data.get_data('<genome>-hashing')`, where `<genome>` is the genome name of the data. The count matrix is `hash_data.X`; cell barcode attributes are stored in `hash_data.obs`; sample names are in `hash_data.var_names`. Moreover, the estimated background probability regarding hashtags is in `hash_data.uns['background_probs']`. To load the RNA matrix, type `rna_data = data.get_data('<genome>-rna')`, where `<genome>` is the genome name of the data. It only contains cells which have estimated sample assignments. The count matrix is `rna_data.X`. Cell barcode attributes are stored in `rna_data.obs`: `rna_data.obs['demux_type']` stores the estimated droplet types (singlet/doublet/unknown) of cells; `rna_data.obs['assignment']` stores the estimated hashtag(s) that each cell belongs to. Moreover, for cell-hashing/nucleus-hashing data, you can find estimated sample fractions (sample1, sample2, …, samplen, background) for each droplet in `rna_data.obsm['raw_probs']`.
output_name.ambient_hashtag.hist.pdf	Optional output. A histogram plot depicting hashtag distributions of empty droplets and non-empty droplets.
output_name.background_probabilities.bar.pdf	Optional output. A bar plot visualizing the estimated hashtag background probability distribution.
output_name.real_content.hist.pdf	Optional output. A histogram plot depicting hashtag distributions of not-real-cells and real-cells as defined by total number of expressed genes in the RNA assay.
output_name.rna_demux.hist.pdf	Optional output. A histogram plot depicting RNA UMI distribution for singlets, doublets and unknown cells.
output_name.gene_name.violin.pdf	Optional outputs. Violin plots depicting gender-specific gene expression across samples. We can have multiple plots if a gene list is provided in `demuxEM_generate_gender_plot` field of cumulus_hashing_cite_seq inputs.

In the output subfolder of each genetic-pooling RNA-hashtag data pair generated by souporcell, you can find the following files:

Name	Description
output_name_demux.zarr.zip	Demultiplexed RNA count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure.
clusters.tsv	Inferred droplet type and cluster assignment for each cell barcode.
cluster_genotypes.vcf	Inferred genotypes for each cluster.
match_donors.log	Log of matching donors step, with information of donor matching included.

In the output subfolder of each genetic-pooling RNA-hashtag data pair generated by demuxlet, you can find the following files:

Name	Description
output_name_demux.zarr.zip	Demultiplexed RNA count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure.
output_name.best (demuxlet) or output_name.clust1.samples.gz (freemuxlet)	Inferred droplet type and cluster assignment for each cell barcode.

Load demultiplexing results into Python and R

To load demultiplexed RNA count matrix into Python, you need to install Python package pegasusio first. Then follow the codes below:

import pegasusio as io
data = io.read_input('output_name_demux.zarr.zip')

Once you load the data object, you can find estimated droplet types (singlet/doublet/unknown) in data.obs['demux_type']. Notices that there are cell barcodes with no sample associated, and therefore have no droplet type.

You can also find estimated sample assignments in data.obs['assignment'].

For cell-hashing/nucleus-hashing data, if one sample name can correspond to multiple feature barcodes, each feature barcode is assigned to a unique sample name, and this deduplicated sample assignment results are in data.obs['assignment.dedup'].

To load the results into R, you need to install R package reticulate in addition to Python package pegasusio. Then follow the codes below:

library(reticulate)
ad <- import("pegasusio", convert = FALSE)
data <- ad$read_input("output_name_demux.zarr.zip")

Results are in data$obs['demux_type'], data$obs['assignment'], and similarly as above, for cell-hashing/nucleus-hashing data, you’ll find an additional field data$obs['assignment.dedup'] for deduplicated sample assignment in the case that one sample name can correspond to multiple feature barcodes.