Demultiplex genetic-pooling/cell-hashing/nucleus-hashing sc/snRNA-Seq data

This demultiplexing workflow generates gene-count matrices from cell-hashing/nucleus-hashing/genetic-pooling data by demultiplexing.

In the workflow, demuxEM is used for analyzing cell-hashing/nucleus-hashing data, while souporcell and popscle (including demuxlet and freemuxlet) are for genetic-pooling data.

Prepare input data and import workflow

1. Run cellranger_workflow

To demultiplex, you’ll need raw gene count and hashtag matrices for cell-hashing/nucleus-hashing data, or raw gene count matrices and genome BAM files for genetic-pooling data. You can generate these data by running the cellranger_workflow.

Please refer to the cellranger_workflow tutorial for details.

When finished, you should be able to find the raw gene count matrix (e.g. raw_gene_bc_matrices_h5.h5), hashtag matrix (e.g. sample_1_ADT.csv) / genome BAM file (e.g. possorted_genome_bam.bam) for each sample.

2. Import demultiplexing

Import demultiplexing workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Demultiplexing to import.

Moreover, in the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export demultiplexing workflow in the drop-down menu.

3. Prepare a sample sheet

3.1 Sample sheet format:

Create a sample sheet, sample_sheet_demux.csv, which describes the metadata for each pair of RNA and hashtag data. A brief description of the sample sheet format is listed below (required column headers are shown in bold).

Column

Description

OUTNAME

Output name for one pair of RNA and hashtag data. Must be unique per pair.

RNA

Google bucket url to the raw gene count matrix generated in Step 1.

TagFile/ADT

Google bucket url to the hashtag file generated in Step 1. The column name can be either TagFile or ADT, where ADT is for backward compatibility with older snapshots.

TYPE

Assay type, which can be cell-hashing, nucleus-hashing, or genetic-pooling.

Genotype

Google bucket url to the reference genotypes in vcf.gz format. This column is required in the following cases:

  • Run genetic-pooling assay with souporcell algorithm (i.e. TYPE is genetic-pooling, demultiplexing_algorithm input is souporcell):

    • Run with reference genotypes, i.e. souporcell_de_novo_mode is false.

    • Run in de novo mode (i.e. souporcell_de_novo_mode is true), but need to match the resulting cluster names by information from reference genotypes (see description of souporcell_rename_donors input below).

  • Run genetic-pooling assay with popscle algorithm (i.e. TYPE is genetic-pooling, demultiplexing_algorithm input is popscle):

    • popscle_num_samples input is 0. In this case, demuxlet will be run with reference genotypes.

    • popscle_num_samples input is larger than 0. In this case, reference genotypes will be only used to generate pileups, then freemuxlet will be used for demultiplexing without reference genotypes.

Example:

OUTNAME,RNA,TagFile,TYPE,Genotype
sample_1,gs://exp/data_1/raw_gene_bc_matrices_h5.h5,gs://exp/data_1/sample_1_ADT.csv,cell-hashing
sample_2,gs://exp/data_2/raw_gene_bc_matrices_h5.h5,gs://exp/data_2/sample_2_ADT.csv,nucleus-hashing
sample_3,gs://exp/data_3/raw_gene_bc_matrices_h5.h5,gs://exp/data_3/possorted_genome_bam.bam,genetic-pooling
sample_4,gs://exp/data_4/raw_gene_bc_matrices_h5.h5,gs://exp/data_4/possorted_genome_bam.bam,genetic-pooling,gs://exp/variants/ref_genotypes.vcf.gz

3.2 Upload your sample sheet to the workspace bucket:

Use gsutil (you already have it if you’ve installed gcloud CLI) in your unix terminal to upload your sample sheet to workspace bucket.

Example:

gsutil cp /foo/bar/projects/sample_sheet_demux.csv gs://fc-e0000000-0000-0000-0000-000000000000/

Workflow inputs

Below are inputs for demultiplexing workflow. We’ll first introduce global inputs, and then inputs for each of the demultiplexing tools. Notice that required inputs are in bold.

global inputs

Name

Description

Example

Default

input_sample_sheet

Input CSV file describing metadata of RNA and hashtag data pairing.

“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet_demux.csv”

output_directory

This is the output directory (gs url + path) for all results. There will be one folder per RNA-hashtag data pair under this directory.

“gs://fc-e0000000-0000-0000-0000-000000000000/demux_output”

genome

Reference genome name. Its usage depends on the assay type:

  • For cell-hashing or nucleus-hashing, only write this name as an annotation into the resulting count matrix file.

  • For genetic-pooling, if demultiplexing_algorithm input is souporcell, you should choose one name from this genome reference list.

  • For genetic-pooling, if demultiplexing_algorithm input is popscle, reference genome name is not needed.

“GRCh38”

demultiplexing_algorithm

demultiplexing algorithm to use for genetic-pooling data. Options:

  • “souporcell”: Use souporcell, a reference-genotypes-free algorithm for demultiplexing droplet scRNA-Seq data.

  • “popscle”: Use popscle, a canonical algorithm for demultiplexing droplet scRNA-Seq data, including demuxlet (with reference genotypes) and freemuxlet (reference-genotype-free) components.

“souporcell”

“souporcell”

min_num_genes

Only demultiplex cells/nuclei with at least <min_num_genes> expressed genes

100

100

zones

Google cloud zones to consider for execution.

“us-east1-d us-west1-a us-west1-b”

“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c”

docker_registry

Docker registry to use.

  • “quay.io/cumulus” for images on Red Hat registry;

  • “cumulusprod” for backup images on Docker Hub.

“quay.io/cumulus”

“quay.io/cumulus”

config_version

Version of config docker image to use. This docker is used for parsing the input sample sheet for downstream execution. Available options: 0.2, 0.1.

“0.2”

“0.2”

backend

Cloud infrastructure backend to use. Available options:

  • “gcp” for Google Cloud;

  • “aws” for Amazon AWS;

  • “local” for local machine.

“gcp”

“gcp”

ref_index_file

The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names. Set an GS URI if backend is gcp; an S3 URI for aws backend; an absolute file path for local backend.

“s3://xxxx/index.tsv”

“gs://regev-lab/resources/cellranger/index.tsv”

preemptible

Number of maximum preemptible tries allowed. This works only when backend is gcp.

2

2

awsQueueArn

The AWS ARN string of the job queue to be used. This only works for aws backend.

“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf”

“”

demuxEM inputs

Name

Description

Example

Default

demuxEM_alpha_on_samples

demuxEM parameter. The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse.

0.0

0.0

demuxEM_min_num_umis

demuxEM parameter. Only demultiplex cells/nuclei with at least <demuxEM_min_num_umis> of UMIs.

100

100

demuxEM_min_signal_hashtag

demuxEM parameter. Any cell/nucleus with less than <demuxEM_min_signal_hashtag> hashtags from the signal will be marked as unknown.

10.0

10.0

demuxEM_random_state

demuxEM parameter. The random seed used in the KMeans algorithm to separate empty ADT droplets from others.

0

0

demuxEM_generate_diagnostic_plots

demuxEM parameter. If generate a series of diagnostic plots, including the background/signal between HTO counts, estimated background probabilities, HTO distributions of cells and non-cells, etc.

true

true

demuxEM_generate_gender_plot

demuxEM parameter. If generate violin plots using gender-specific genes (e.g. Xist). <demuxEM_generate_gender_plot> is a comma-separated list of gene names

“XIST”

demuxEM_version

demuxEM version to use. Choose from “0.1.7”, “0.1.6” and “0.1.5”.

“0.1.7”

“0.1.7”

demuxEM_num_cpu

demuxEM parameter. Number of CPUs to request for demuxEM per pair.

8

8

demuxEM_memory

demuxEM parameter. Memory size string for demuxEM per pair.

“10G”

“10G”

demuxEM_disk_space

demuxEM parameter. Disk space (integer) in GB needed for demuxEM per pair.

20

20

souporcell inputs

Name

Description

Example

Default

souporcell_version

souporcell version to use. Available versions:

  • 2.5 released on 2023/06/22.

  • 2022.12: Based on commitment 9fb527 on 2022/12/13.

    • 2021.03: Based on commitment 1bd9f1 on 2021/03/07.

  • 2020.07: Based on commitment 0d09fb on 2020/07/27.

  • 2020.03: Based on commitment eeddcd on 2020/03/31.

“2.5”

“2.5”

souporcell_num_clusters

souporcell parameter. Number of expected clusters when doing clustering.
This needs to be set when running souporcell.

8

1

souporcell_de_novo_mode

souporcell parameter.

  • If true, run souporcell in de novo mode without reference genotypes:

    • If input souporcell_common_variants is further provided, use this common variants list instead of calling SNPs de novo.

    • If a reference genotype vcf file is provided in the sample sheet, use it only for matching the cluster labels computed by souporcell.

  • If false, run souporcell with --known_genotypes option using the reference genotype vcf file specified in sample sheet.

true

true

souporcell_num_clusters

souporcell parameter. Number of expected clusters when doing clustering.
This needs to be set when running souporcell.

8

1

souporcell_common_variants

souporcell parameter. Users can provide a common variants list in VCF format for Souporcell to use, instead of calling SNPs de novo.
Notice: This input is enabled only when souporcell_de_novo_mode is false.

“1000genome.common.variants.vcf.gz”

souporcell_skip_remap

souporcell parameter. Skip remap step. Only recommended in non denovo mode or common variants are provided.

true

false

souporcell_rename_donors

souporcell parameter. A comma-separated list of donor names for matching clusters achieved by souporcell. Must be consistent with souporcell_num_clusters input.

  • If this input is empty, use cluster labels from the reference genotype vcf file if provided in the sample sheet; if this vcf file is not provided, simply name clusters as Donor1, Donor2, …

  • If this input is not empty, and a reference genotype vcf file is provided in the sample sheet, first match the cluster labels using those from this vcf file, then rename to donor names specified in this input.

  • If this input is not empty, and NO reference genotype vcf file is provided in the sample sheet, simply match the cluster labels in one-to-one correspondence with donor names specified in this input.

“CB1,CB2,CB3,CB4”

souporcell_num_cpu

souporcell parameter. Number of CPUs to request for souporcell per pair.

32

32

souporcell_memory

souporcell parameter. Memory size string for souporcell per pair.

“120G”

“120G”

souporcell_disk_space

souporcell parameter. Disk space (integer) in GB needed for souporcell per pair.

500

500

Popscle inputs

Name

Description

Example

Default

popscle_num_samples

popscle parameter. Number of samples to be multiplexed together:

  • If 0, run with demuxlet using reference genotypes.

  • Otherwise, run with freemuxlet in de novo mode without reference genotypes.

4

0

popscle_min_MQ

popscle parameter. Minimum mapping quality to consider (lower MQ will be ignored).

20

20

popscle_min_TD

popscle parameter. Minimum distance to the tail (lower will be ignored).

0

0

popscle_tag_group

popscle parameter. Tag representing readgroup or cell barcodes, in the case to partition the BAM file into multiple groups. For 10x genomics, use CB.

“CB”

“CB”

popscle_tag_UMI

popscle parameter. Tag representing UMIs. For 10x genomics, use UB.

“UB”

“UB”

popscle_field

popscle parameter. FORMAT field to extract from: genotype (GT), genotype likelihood (GL), or posterior probability (GP).

“GT”

“GT”

popscle_alpha

popscle parameter. Grid of alpha to search for, in a comma separated list format of all alpha values to be considered.

“0.1,0.2,0.3,0.4,0.5”

“0.1,0.2,0.3,0.4,0.5”

popscle_rename_donors

popscle parameter. A comma-separated list of donor names for renaming clusters achieved by popscle. Must be consistent with popscle_num_samples input.
By default, the resulting donors are Donor1, Donor2, …

“CB1,CB2,CB3,CB4”

popscle_version

popscle parameter. popscle version to use. Available options:

  • 2021.05: Based on commitment da70fc7 on 2021/05/05.

  • 0.1b: Based on version 0.1-beta released on 2019/10/03.

“2021.05”

“2021.05”

popscle_num_cpu

popscle parameter. Number of CPU used by popscle per pair.

1

1

popscle_memory

popscle parameter. Memory size string per pair.

“120G”

“120G”

popscle_extra_disk_space

popscle parameter. Extra disk space size (integer) in GB needed for popscle per pair, besides the disk size required to hold input files specified in the sample sheet.

100

100


Workflow outputs

See the table below for demultiplexing workflow outputs.

Name

Type

Description

output_folders

Array[String]

A list of Google Bucket URLs of the output folders. Each folder is associated with one RNA-hashtag pair in the given sample sheet.

output_zarr_files

Array[File]

A list of demultiplexed RNA count matrices in zarr format. Each zarr file is associated with one RNA-hashtag pair in the given sample sheet. Please refere to section load demultiplexing results into Python and R for its structure.

In the output subfolder of each cell-hashing/nuclei-hashing RNA-hashtag data pair, you can find the following files:

Name

Description

output_name_demux.zarr.zip

Demultiplexed RNA raw count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure.

output_name.out.demuxEM.zarr.zip

This file contains intermediate results for both RNA and hashing count matrices.
To load this file into Python, you need to first install Pegasusio on your local machine. Then use import pegasusio as io; data = io.read_input("output_name.out.demuxEM.zarr.zip") in Python environment.
It contains 2 UnimodalData objects: one with key name suffix -hashing is the hashtag count matrix, the other one with key name suffix -rna is the demultiplexed RNA count matrix.
To load the hashtag count matrix, type hash_data = data.get_data('<genome>-hashing'), where <genome> is the genome name of the data. The count matrix is hash_data.X; cell barcode attributes are stored in hash_data.obs; sample names are in hash_data.var_names. Moreover, the estimated background probability regarding hashtags is in hash_data.uns['background_probs'].
To load the RNA matrix, type rna_data = data.get_data('<genome>-rna'), where <genome> is the genome name of the data. It only contains cells which have estimated sample assignments. The count matrix is rna_data.X. Cell barcode attributes are stored in rna_data.obs: rna_data.obs['demux_type'] stores the estimated droplet types (singlet/doublet/unknown) of cells; rna_data.obs['assignment'] stores the estimated hashtag(s) that each cell belongs to. Moreover, for cell-hashing/nucleus-hashing data, you can find estimated sample fractions (sample1, sample2, …, samplen, background) for each droplet in rna_data.obsm['raw_probs'].

output_name.ambient_hashtag.hist.pdf

Optional output. A histogram plot depicting hashtag distributions of empty droplets and non-empty droplets.

output_name.background_probabilities.bar.pdf

Optional output. A bar plot visualizing the estimated hashtag background probability distribution.

output_name.real_content.hist.pdf

Optional output. A histogram plot depicting hashtag distributions of not-real-cells and real-cells as defined by total number of expressed genes in the RNA assay.

output_name.rna_demux.hist.pdf

Optional output. A histogram plot depicting RNA UMI distribution for singlets, doublets and unknown cells.

output_name.gene_name.violin.pdf

Optional outputs. Violin plots depicting gender-specific gene expression across samples. We can have multiple plots if a gene list is provided in demuxEM_generate_gender_plot field of cumulus_hashing_cite_seq inputs.

In the output subfolder of each genetic-pooling RNA-hashtag data pair generated by souporcell, you can find the following files:

Name

Description

output_name_demux.zarr.zip

Demultiplexed RNA count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure.

clusters.tsv

Inferred droplet type and cluster assignment for each cell barcode.

cluster_genotypes.vcf

Inferred genotypes for each cluster.

match_donors.log

Log of matching donors step, with information of donor matching included.

In the output subfolder of each genetic-pooling RNA-hashtag data pair generated by demuxlet, you can find the following files:

Name

Description

output_name_demux.zarr.zip

Demultiplexed RNA count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure.

output_name.best (demuxlet) or output_name.clust1.samples.gz (freemuxlet)

Inferred droplet type and cluster assignment for each cell barcode.


Load demultiplexing results into Python and R

To load demultiplexed RNA count matrix into Python, you need to install Python package pegasusio first. Then follow the codes below:

import pegasusio as io
data = io.read_input('output_name_demux.zarr.zip')

Once you load the data object, you can find estimated droplet types (singlet/doublet/unknown) in data.obs['demux_type']. Notices that there are cell barcodes with no sample associated, and therefore have no droplet type.

You can also find estimated sample assignments in data.obs['assignment'].

For cell-hashing/nucleus-hashing data, if one sample name can correspond to multiple feature barcodes, each feature barcode is assigned to a unique sample name, and this deduplicated sample assignment results are in data.obs['assignment.dedup'].

To load the results into R, you need to install R package reticulate in addition to Python package pegasusio. Then follow the codes below:

library(reticulate)
ad <- import("pegasusio", convert = FALSE)
data <- ad$read_input("output_name_demux.zarr.zip")

Results are in data$obs['demux_type'], data$obs['assignment'], and similarly as above, for cell-hashing/nucleus-hashing data, you’ll find an additional field data$obs['assignment.dedup'] for deduplicated sample assignment in the case that one sample name can correspond to multiple feature barcodes.