Demultiplex genetic-pooling/cell-hashing/nucleus-hashing sc/snRNA-Seq data
This demultiplexing
workflow generates gene-count matrices from cell-hashing/nucleus-hashing/genetic-pooling data by demultiplexing.
In the workflow, demuxEM
is used for analyzing cell-hashing/nucleus-hashing data, while souporcell
and popscle
(including demuxlet and freemuxlet) are for genetic-pooling data.
Prepare input data and import workflow
1. Run cellranger_workflow
To demultiplex, you’ll need raw gene count and hashtag matrices for cell-hashing/nucleus-hashing data, or raw gene count matrices and genome BAM files for genetic-pooling data. You can generate these data by running the
cellranger_workflow
.Please refer to the cellranger_workflow tutorial for details.
When finished, you should be able to find the raw gene count matrix (e.g.
raw_gene_bc_matrices_h5.h5
), hashtag matrix (e.g.sample_1_ADT.csv
) / genome BAM file (e.g.possorted_genome_bam.bam
) for each sample.
2. Import demultiplexing
Import demultiplexing workflow to your workspace by following instructions in Import workflows to Terra. You should choose github.com/lilab-bcb/cumulus/Demultiplexing to import.
Moreover, in the workflow page, click the
Export to Workspace...
button, and select the workspace to which you want to export demultiplexing workflow in the drop-down menu.
3. Prepare a sample sheet
3.1 Sample sheet format:
Create a sample sheet, sample_sheet_demux.csv, which describes the metadata for each pair of RNA and hashtag data. A brief description of the sample sheet format is listed below (required column headers are shown in bold).
Column
Description
OUTNAME
Output name for one pair of RNA and hashtag data. Must be unique per pair.
RNA
Google bucket url to the raw gene count matrix generated in Step 1.
TagFile/ADT
Google bucket url to the hashtag file generated in Step 1. The column name can be either TagFile or ADT, where ADT is for backward compatibility with older snapshots.
TYPE
Assay type, which can be
cell-hashing
,nucleus-hashing
, orgenetic-pooling
.Genotype
Google bucket url to the reference genotypes in
vcf.gz
format. This column is required in the following cases:
Run
genetic-pooling
assay withsouporcell
algorithm (i.e. TYPE isgenetic-pooling
, demultiplexing_algorithm input issouporcell
):
Run with reference genotypes, i.e. souporcell_de_novo_mode is
false
.Run in de novo mode (i.e. souporcell_de_novo_mode is
true
), but need to match the resulting cluster names by information from reference genotypes (see description of souporcell_rename_donors input below).Run
genetic-pooling
assay withpopscle
algorithm (i.e. TYPE isgenetic-pooling
, demultiplexing_algorithm input ispopscle
):
popscle_num_samples input is
0
. In this case, demuxlet will be run with reference genotypes.popscle_num_samples input is larger than
0
. In this case, reference genotypes will be only used to generate pileups, then freemuxlet will be used for demultiplexing without reference genotypes.Example:
OUTNAME,RNA,TagFile,TYPE,Genotype sample_1,gs://exp/data_1/raw_gene_bc_matrices_h5.h5,gs://exp/data_1/sample_1_ADT.csv,cell-hashing sample_2,gs://exp/data_2/raw_gene_bc_matrices_h5.h5,gs://exp/data_2/sample_2_ADT.csv,nucleus-hashing sample_3,gs://exp/data_3/raw_gene_bc_matrices_h5.h5,gs://exp/data_3/possorted_genome_bam.bam,genetic-pooling sample_4,gs://exp/data_4/raw_gene_bc_matrices_h5.h5,gs://exp/data_4/possorted_genome_bam.bam,genetic-pooling,gs://exp/variants/ref_genotypes.vcf.gz3.2 Upload your sample sheet to the workspace bucket:
Use gsutil (you already have it if you’ve installed gcloud CLI) in your unix terminal to upload your sample sheet to workspace bucket.
Example:
gsutil cp /foo/bar/projects/sample_sheet_demux.csv gs://fc-e0000000-0000-0000-0000-000000000000/
Workflow inputs
Below are inputs for demultiplexing workflow. We’ll first introduce global inputs, and then inputs for each of the demultiplexing tools. Notice that required inputs are in bold.
global inputs
Name |
Description |
Example |
Default |
---|---|---|---|
input_sample_sheet |
Input CSV file describing metadata of RNA and hashtag data pairing. |
“gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet_demux.csv” |
|
output_directory |
This is the output directory (gs url + path) for all results. There will be one folder per RNA-hashtag data pair under this directory. |
“gs://fc-e0000000-0000-0000-0000-000000000000/demux_output” |
|
genome |
Reference genome name. Its usage depends on the assay type:
|
“GRCh38” |
|
demultiplexing_algorithm |
demultiplexing algorithm to use for genetic-pooling data. Options:
|
“souporcell” |
“souporcell” |
min_num_genes |
Only demultiplex cells/nuclei with at least <min_num_genes> expressed genes |
100 |
100 |
zones |
Google cloud zones to consider for execution. |
“us-east1-d us-west1-a us-west1-b” |
“us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c” |
docker_registry |
Docker registry to use.
|
“quay.io/cumulus” |
“quay.io/cumulus” |
config_version |
Version of config docker image to use. This docker is used for parsing the input sample sheet for downstream execution. Available options: |
“0.2” |
“0.2” |
backend |
Cloud infrastructure backend to use. Available options:
|
“gcp” |
“gcp” |
ref_index_file |
The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names. Set an GS URI if backend is |
“s3://xxxx/index.tsv” |
“gs://regev-lab/resources/cellranger/index.tsv” |
preemptible |
Number of maximum preemptible tries allowed. This works only when backend is |
2 |
2 |
awsQueueArn |
The AWS ARN string of the job queue to be used. This only works for |
“arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf” |
“” |
demuxEM inputs
Name |
Description |
Example |
Default |
---|---|---|---|
demuxEM_alpha_on_samples |
demuxEM parameter. The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse. |
0.0 |
0.0 |
demuxEM_min_num_umis |
demuxEM parameter. Only demultiplex cells/nuclei with at least <demuxEM_min_num_umis> of UMIs. |
100 |
100 |
demuxEM_min_signal_hashtag |
demuxEM parameter. Any cell/nucleus with less than <demuxEM_min_signal_hashtag> hashtags from the signal will be marked as unknown. |
10.0 |
10.0 |
demuxEM_random_state |
demuxEM parameter. The random seed used in the KMeans algorithm to separate empty ADT droplets from others. |
0 |
0 |
demuxEM_generate_diagnostic_plots |
demuxEM parameter. If generate a series of diagnostic plots, including the background/signal between HTO counts, estimated background probabilities, HTO distributions of cells and non-cells, etc. |
true |
true |
demuxEM_generate_gender_plot |
demuxEM parameter. If generate violin plots using gender-specific genes (e.g. Xist). <demuxEM_generate_gender_plot> is a comma-separated list of gene names |
“XIST” |
|
demuxEM_version |
demuxEM version to use. Choose from “0.1.7”, “0.1.6” and “0.1.5”. |
“0.1.7” |
“0.1.7” |
demuxEM_num_cpu |
demuxEM parameter. Number of CPUs to request for demuxEM per pair. |
8 |
8 |
demuxEM_memory |
demuxEM parameter. Memory size string for demuxEM per pair. |
“10G” |
“10G” |
demuxEM_disk_space |
demuxEM parameter. Disk space (integer) in GB needed for demuxEM per pair. |
20 |
20 |
souporcell inputs
Name |
Description |
Example |
Default |
---|---|---|---|
souporcell_version |
souporcell version to use. Available versions:
|
“2.5” |
“2.5” |
souporcell_num_clusters |
souporcell parameter. Number of expected clusters when doing clustering.
This needs to be set when running souporcell.
|
8 |
1 |
souporcell_de_novo_mode |
souporcell parameter.
|
true |
true |
souporcell_num_clusters |
souporcell parameter. Number of expected clusters when doing clustering.
This needs to be set when running souporcell.
|
8 |
1 |
souporcell_common_variants |
souporcell parameter. Users can provide a common variants list in VCF format for Souporcell to use, instead of calling SNPs de novo.
Notice: This input is enabled only when souporcell_de_novo_mode is
false . |
“1000genome.common.variants.vcf.gz” |
|
souporcell_skip_remap |
souporcell parameter. Skip remap step. Only recommended in non denovo mode or common variants are provided. |
true |
false |
souporcell_rename_donors |
souporcell parameter. A comma-separated list of donor names for matching clusters achieved by souporcell. Must be consistent with souporcell_num_clusters input.
|
“CB1,CB2,CB3,CB4” |
|
souporcell_num_cpu |
souporcell parameter. Number of CPUs to request for souporcell per pair. |
32 |
32 |
souporcell_memory |
souporcell parameter. Memory size string for souporcell per pair. |
“120G” |
“120G” |
souporcell_disk_space |
souporcell parameter. Disk space (integer) in GB needed for souporcell per pair. |
500 |
500 |
Popscle inputs
Name |
Description |
Example |
Default |
---|---|---|---|
popscle_num_samples |
popscle parameter. Number of samples to be multiplexed together:
|
4 |
0 |
popscle_min_MQ |
popscle parameter. Minimum mapping quality to consider (lower MQ will be ignored). |
20 |
20 |
popscle_min_TD |
popscle parameter. Minimum distance to the tail (lower will be ignored). |
0 |
0 |
popscle_tag_group |
popscle parameter. Tag representing readgroup or cell barcodes, in the case to partition the BAM file into multiple groups. For 10x genomics, use |
“CB” |
“CB” |
popscle_tag_UMI |
popscle parameter. Tag representing UMIs. For 10x genomics, use |
“UB” |
“UB” |
popscle_field |
popscle parameter. FORMAT field to extract from: genotype ( |
“GT” |
“GT” |
popscle_alpha |
popscle parameter. Grid of alpha to search for, in a comma separated list format of all alpha values to be considered. |
“0.1,0.2,0.3,0.4,0.5” |
“0.1,0.2,0.3,0.4,0.5” |
popscle_rename_donors |
popscle parameter. A comma-separated list of donor names for renaming clusters achieved by popscle. Must be consistent with popscle_num_samples input.
By default, the resulting donors are Donor1, Donor2, …
|
“CB1,CB2,CB3,CB4” |
|
popscle_version |
popscle parameter. popscle version to use. Available options: |
“2021.05” |
“2021.05” |
popscle_num_cpu |
popscle parameter. Number of CPU used by popscle per pair. |
1 |
1 |
popscle_memory |
popscle parameter. Memory size string per pair. |
“120G” |
“120G” |
popscle_extra_disk_space |
popscle parameter. Extra disk space size (integer) in GB needed for popscle per pair, besides the disk size required to hold input files specified in the sample sheet. |
100 |
100 |
Workflow outputs
See the table below for demultiplexing workflow outputs.
Name |
Type |
Description |
---|---|---|
output_folders |
Array[String] |
A list of Google Bucket URLs of the output folders. Each folder is associated with one RNA-hashtag pair in the given sample sheet. |
output_zarr_files |
Array[File] |
A list of demultiplexed RNA count matrices in zarr format. Each zarr file is associated with one RNA-hashtag pair in the given sample sheet. Please refere to section load demultiplexing results into Python and R for its structure. |
In the output subfolder of each cell-hashing/nuclei-hashing RNA-hashtag data pair, you can find the following files:
Name |
Description |
---|---|
output_name_demux.zarr.zip |
Demultiplexed RNA raw count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure. |
output_name.out.demuxEM.zarr.zip |
This file contains intermediate results for both RNA and hashing count matrices.
To load this file into Python, you need to first install Pegasusio on your local machine. Then use
import pegasusio as io; data = io.read_input("output_name.out.demuxEM.zarr.zip") in Python environment.It contains 2 UnimodalData objects: one with key name suffix
-hashing is the hashtag count matrix, the other one with key name suffix -rna is the demultiplexed RNA count matrix.To load the hashtag count matrix, type
hash_data = data.get_data('<genome>-hashing') , where <genome> is the genome name of the data. The count matrix is hash_data.X ; cell barcode attributes are stored in hash_data.obs ; sample names are in hash_data.var_names . Moreover, the estimated background probability regarding hashtags is in hash_data.uns['background_probs'] .To load the RNA matrix, type
rna_data = data.get_data('<genome>-rna') , where <genome> is the genome name of the data. It only contains cells which have estimated sample assignments. The count matrix is rna_data.X . Cell barcode attributes are stored in rna_data.obs : rna_data.obs['demux_type'] stores the estimated droplet types (singlet/doublet/unknown) of cells; rna_data.obs['assignment'] stores the estimated hashtag(s) that each cell belongs to. Moreover, for cell-hashing/nucleus-hashing data, you can find estimated sample fractions (sample1, sample2, …, samplen, background) for each droplet in rna_data.obsm['raw_probs'] . |
output_name.ambient_hashtag.hist.pdf |
Optional output. A histogram plot depicting hashtag distributions of empty droplets and non-empty droplets. |
output_name.background_probabilities.bar.pdf |
Optional output. A bar plot visualizing the estimated hashtag background probability distribution. |
output_name.real_content.hist.pdf |
Optional output. A histogram plot depicting hashtag distributions of not-real-cells and real-cells as defined by total number of expressed genes in the RNA assay. |
output_name.rna_demux.hist.pdf |
Optional output. A histogram plot depicting RNA UMI distribution for singlets, doublets and unknown cells. |
output_name.gene_name.violin.pdf |
Optional outputs. Violin plots depicting gender-specific gene expression across samples. We can have multiple plots if a gene list is provided in |
In the output subfolder of each genetic-pooling RNA-hashtag data pair generated by souporcell, you can find the following files:
Name |
Description |
---|---|
output_name_demux.zarr.zip |
Demultiplexed RNA count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure. |
clusters.tsv |
Inferred droplet type and cluster assignment for each cell barcode. |
cluster_genotypes.vcf |
Inferred genotypes for each cluster. |
match_donors.log |
Log of matching donors step, with information of donor matching included. |
In the output subfolder of each genetic-pooling RNA-hashtag data pair generated by demuxlet, you can find the following files:
Name |
Description |
---|---|
output_name_demux.zarr.zip |
Demultiplexed RNA count matrix in zarr format. Please refer to section load demultiplexing results into Python and R for its structure. |
output_name.best (demuxlet) or output_name.clust1.samples.gz (freemuxlet) |
Inferred droplet type and cluster assignment for each cell barcode. |
Load demultiplexing results into Python and R
To load demultiplexed RNA count matrix into Python, you need to install Python package pegasusio first. Then follow the codes below:
import pegasusio as io
data = io.read_input('output_name_demux.zarr.zip')
Once you load the data object, you can find estimated droplet types (singlet/doublet/unknown) in data.obs['demux_type']
. Notices that there are cell barcodes with no sample associated, and therefore have no droplet type.
You can also find estimated sample assignments in data.obs['assignment']
.
For cell-hashing/nucleus-hashing data, if one sample name can correspond to multiple feature barcodes, each feature barcode is assigned to a unique sample name, and this deduplicated sample assignment results are in data.obs['assignment.dedup']
.
To load the results into R, you need to install R package reticulate
in addition to Python package pegasusio
. Then follow the codes below:
library(reticulate)
ad <- import("pegasusio", convert = FALSE)
data <- ad$read_input("output_name_demux.zarr.zip")
Results are in data$obs['demux_type']
, data$obs['assignment']
, and similarly as above, for cell-hashing/nucleus-hashing data, you’ll find an additional field data$obs['assignment.dedup']
for deduplicated sample assignment in the case that one sample name can correspond to multiple feature barcodes.