Demultiplex genetic-pooling/cell-hashing/nucleus-hashing sc/snRNA-Seq data -------------------------------------------------------------------------- This ``demultiplexing`` workflow generates gene-count matrices from cell-hashing/nucleus-hashing/genetic-pooling data by demultiplexing. In the workflow, ``demuxEM`` is used for analyzing cell-hashing/nucleus-hashing data, while ``souporcell`` and ``popscle`` (including *demuxlet* and *freemuxlet*) are for genetic-pooling data. Prepare input data and import workflow ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1. Run ``cellranger_workflow`` ++++++++++++++++++++++++++++++++ To demultiplex, you'll need raw gene count and hashtag matrices for cell-hashing/nucleus-hashing data, or raw gene count matrices and genome BAM files for genetic-pooling data. You can generate these data by running the ``cellranger_workflow``. Please refer to the `cellranger_workflow tutorial`_ for details. When finished, you should be able to find the raw gene count matrix (e.g. ``raw_gene_bc_matrices_h5.h5``), hashtag matrix (e.g. ``sample_1_ADT.umi_correct.h5``) / genome BAM file (e.g. ``possorted_genome_bam.bam``) for each sample. 2. Import ``demultiplexing`` ++++++++++++++++++++++++++++++ Import *demultiplexing* workflow to your workspace by following instructions in `Import workflows to Terra`_. You should choose **github.com/lilab-bcb/cumulus/Demultiplexing** to import. Moreover, in the workflow page, click the ``Export to Workspace...`` button, and select the workspace to which you want to export *demultiplexing* workflow in the drop-down menu. 3. Prepare a sample sheet ++++++++++++++++++++++++++++ **3.1 Sample sheet format:** Create a sample sheet, **sample_sheet_demux.csv**, which describes the metadata for each pair of RNA and hashtag data. A brief description of the sample sheet format is listed below **(required column headers are shown in bold)**. .. list-table:: :widths: 5 30 :header-rows: 1 * - Column - Description * - **OUTNAME** - Output name for one pair of RNA and hashtag data. Must be unique per pair. * - **RNA** - Google bucket url to the raw gene count matrix generated in Step 1. * - **TagFile/ADT** - Google bucket url to the hashtag file generated in Step 1. The column name can be either *TagFile* or *ADT*, where *ADT* is for backward compatibility with older snapshots. * - **TYPE** - Assay type, which can be ``cell-hashing``, ``nucleus-hashing``, or ``genetic-pooling``. * - Genotype - Google bucket url to the reference genotypes in ``vcf.gz`` format. This column is **required** in the following cases: - Run ``genetic-pooling`` assay with ``souporcell`` algorithm (i.e. *TYPE* is ``genetic-pooling``, *demultiplexing_algorithm* input is ``souporcell``): - Run with reference genotypes, i.e. *souporcell_de_novo_mode* is ``false``. - Run in *de novo* mode (i.e. *souporcell_de_novo_mode* is ``true``), but need to match the resulting cluster names by information from reference genotypes (see description of *souporcell_rename_donors* input below). - Run ``genetic-pooling`` assay with ``popscle`` algorithm (i.e. *TYPE* is ``genetic-pooling``, *demultiplexing_algorithm* input is ``popscle``): - *popscle_num_samples* input is ``0``. In this case, *demuxlet* will be run with reference genotypes. - *popscle_num_samples* input is larger than ``0``. In this case, reference genotypes will be only used to generate pileups, then *freemuxlet* will be used for demultiplexing **without** reference genotypes. Example:: OUTNAME,RNA,TagFile,TYPE,Genotype sample_1,gs://exp/data_1/raw_gene_bc_matrices_h5.h5,gs://exp/data_1/sample_1_ADT.umi_correct.h5,cell-hashing sample_2,gs://exp/data_2/raw_gene_bc_matrices_h5.h5,gs://exp/data_2/sample_2_ADT.umi_correct.h5,nucleus-hashing sample_3,gs://exp/data_3/raw_gene_bc_matrices_h5.h5,gs://exp/data_3/possorted_genome_bam.bam,genetic-pooling sample_4,gs://exp/data_4/raw_gene_bc_matrices_h5.h5,gs://exp/data_4/possorted_genome_bam.bam,genetic-pooling,gs://exp/variants/ref_genotypes.vcf.gz **3.2 Upload your sample sheet to the workspace bucket:** Use gsutil_ (you already have it if you've installed gcloud CLI) in your unix terminal to upload your sample sheet to workspace bucket. Example:: gsutil cp /foo/bar/projects/sample_sheet_demux.csv gs://fc-e0000000-0000-0000-0000-000000000000/ --------------- Workflow inputs ^^^^^^^^^^^^^^^^ Below are inputs for *demultiplexing* workflow. We'll first introduce global inputs, and then inputs for each of the demultiplexing tools. Notice that required inputs are in bold. global inputs +++++++++++++++ .. list-table:: :widths: 5 20 10 5 :header-rows: 1 * - Name - Description - Example - Default * - **input_sample_sheet** - Input CSV file describing metadata of RNA and hashtag data pairing. - "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet_demux.csv" - * - **output_directory** - This is the output directory (gs url + path) for all results. There will be one folder per RNA-hashtag data pair under this directory. - "gs://fc-e0000000-0000-0000-0000-000000000000/demux_output" - * - genome - Reference genome name. Its usage depends on the assay type: - For *cell-hashing* or *nucleus-hashing*, only write this name as an annotation into the resulting count matrix file. - For *genetic-pooling*, if *demultiplexing_algorithm* input is ``souporcell``, you should choose one name from this `genome reference`_ list. - For *genetic-pooling*, if *demultiplexing_algorithm* input is ``popscle``, reference genome name is not needed. - "GRCh38" - * - demultiplexing_algorithm - demultiplexing algorithm to use for *genetic-pooling* data. Options: - "souporcell": Use souporcell_, a reference-genotypes-free algorithm for demultiplexing droplet scRNA-Seq data. - "popscle": Use popscle_, a canonical algorithm for demultiplexing droplet scRNA-Seq data, including *demuxlet* (with reference genotypes) and *freemuxlet* (reference-genotype-free) components. - "souporcell" - "souporcell" * - min_num_genes - Only demultiplex cells/nuclei with at least expressed genes - 100 - 100 * - zones - Google cloud zones to consider for execution. - "us-east1-d us-west1-a us-west1-b" - "us-central1-a us-central1-b us-central1-c us-central1-f us-east1-b us-east1-c us-east1-d us-west1-a us-west1-b us-west1-c" * - docker_registry - Docker registry to use. - "quay.io/cumulus" for images on Red Hat registry; - "cumulusprod" for backup images on Docker Hub. - "quay.io/cumulus" - "quay.io/cumulus" * - ref_index_file - The link/path of an index file in TSV format for fetching preset genome references, chemistry whitelists, etc. by their names. Set an GS URI if running on GCP; an S3 URI for AWS; an absolute file path for HPC or local machines. - "s3://xxxx/index.tsv" - "gs://cumulus-ref/resources/cellranger/index.tsv" * - preemptible - Number of maximum preemptible tries allowed. Only works for GCP - 2 - 2 * - awsQueueArn - The AWS ARN string of the job queue to be used. Only works for AWS - "arn:aws:batch:us-east-1:xxx:job-queue/priority-gwf" - "" demuxEM inputs ++++++++++++++++ .. list-table:: :widths: 5 20 10 5 :header-rows: 1 * - Name - Description - Example - Default * - demuxEM_alpha_on_samples - demuxEM parameter. The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse. - 0.0 - 0.0 * - demuxEM_min_num_umis - demuxEM parameter. Only demultiplex cells/nuclei with at least of UMIs. - 100 - 100 * - demuxEM_min_signal_hashtag - demuxEM parameter. Any cell/nucleus with less than hashtags from the signal will be marked as unknown. - 10.0 - 10.0 * - demuxEM_random_state - demuxEM parameter. The random seed used in the KMeans algorithm to separate empty ADT droplets from others. - 0 - 0 * - demuxEM_generate_diagnostic_plots - demuxEM parameter. If generate a series of diagnostic plots, including the background/signal between HTO counts, estimated background probabilities, HTO distributions of cells and non-cells, etc. - true - true * - demuxEM_generate_gender_plot - demuxEM parameter. If generate violin plots using gender-specific genes (e.g. Xist). is a comma-separated list of gene names - "XIST" - * - demuxEM_version - demuxEM version to use. Choose from "0.1.8", "0.1.7", "0.1.6" and "0.1.5". - "0.1.8" - "0.1.8" * - demuxEM_num_cpu - demuxEM parameter. Number of CPUs to request for demuxEM per pair. - 8 - 8 * - demuxEM_memory - demuxEM parameter. Memory size string for demuxEM per pair. - "10G" - "10G" * - demuxEM_disk_space - demuxEM parameter. Disk space (integer) in GB needed for demuxEM per pair. - 20 - 20 souporcell inputs ++++++++++++++++++ .. list-table:: :widths: 5 20 10 5 :header-rows: 1 * - Name - Description - Example - Default * - souporcell_version - souporcell version to use. Available versions: - ``2.5`` released on 2023/06/22. - ``2022.12``: Based on commitment `9fb527 `_ on 2022/12/13. - ``2021.03``: Based on commitment `1bd9f1 `_ on 2021/03/07. - ``2020.07``: Based on commitment `0d09fb `_ on 2020/07/27. - ``2020.03``: Based on commitment `eeddcd `_ on 2020/03/31. - "2.5" - "2.5" * - souporcell_num_clusters - | souporcell parameter. Number of expected clusters when doing clustering. | **This needs to be set when running souporcell.** - 8 - 1 * - souporcell_de_novo_mode - souporcell parameter. - If ``true``, run souporcell in de novo mode without reference genotypes: - If input *souporcell_common_variants* is further provided, use this common variants list instead of calling SNPs de novo. - If a reference genotype vcf file is provided in the sample sheet, use it **only** for matching the cluster labels computed by souporcell. - If ``false``, run souporcell with ``--known_genotypes`` option using the reference genotype vcf file specified in sample sheet. - true - true * - souporcell_num_clusters - | souporcell parameter. Number of expected clusters when doing clustering. | **This needs to be set when running souporcell.** - 8 - 1 * - souporcell_common_variants - | souporcell parameter. Users can provide a common variants list in VCF format for Souporcell to use, instead of calling SNPs de novo. | **Notice:** This input is enabled only when *souporcell_de_novo_mode* is ``false``. - "1000genome.common.variants.vcf.gz" - * - souporcell_skip_remap - souporcell parameter. Skip remap step. Only recommended in non denovo mode or common variants are provided. - true - false * - souporcell_umi_tag - souporcell parameter. If the UMI tag in your data is not ``UB``, then set your UMI tag here to overwrite the default. - "UM" - "UB" * - souporcell_rename_donors - souporcell parameter. A comma-separated list of donor names for matching clusters achieved by souporcell. Must be consistent with *souporcell_num_clusters* input. - If this input is empty, use cluster labels from the reference genotype vcf file if provided in the sample sheet; if this vcf file is not provided, simply name clusters as *Donor1*, *Donor2*, ... - If this input is not empty, and a reference genotype vcf file is provided in the sample sheet, first match the cluster labels using those from this vcf file, then rename to donor names specified in this input. - If this input is not empty, and **NO** reference genotype vcf file is provided in the sample sheet, simply match the cluster labels in one-to-one correspondence with donor names specified in this input. - "CB1,CB2,CB3,CB4" - * - souporcell_num_cpu - souporcell parameter. Number of CPUs to request for souporcell per pair. - 32 - 32 * - souporcell_memory - souporcell parameter. Memory size string for souporcell per pair. - "120G" - "120G" * - souporcell_disk_space - souporcell parameter. Disk space (integer) in GB needed for souporcell per pair. - 500 - 500 Popscle inputs +++++++++++++++++ .. list-table:: :widths: 5 20 10 5 :header-rows: 1 * - Name - Description - Example - Default * - popscle_num_samples - popscle parameter. Number of samples to be multiplexed together: - If ``0``, run with *demuxlet* using reference genotypes. - Otherwise, run with *freemuxlet* in de novo mode without reference genotypes. - 4 - 0 * - popscle_min_MQ - popscle parameter. Minimum mapping quality to consider (lower MQ will be ignored). - 20 - 20 * - popscle_min_TD - popscle parameter. Minimum distance to the tail (lower will be ignored). - 0 - 0 * - popscle_tag_group - popscle parameter. Tag representing readgroup or cell barcodes, in the case to partition the BAM file into multiple groups. For 10x genomics, use ``CB``. - "CB" - "CB" * - popscle_tag_UMI - popscle parameter. Tag representing UMIs. For 10x genomics, use ``UB``. - "UB" - "UB" * - popscle_field - popscle parameter. FORMAT field to extract from: genotype (``GT``), genotype likelihood (``GL``), or posterior probability (``GP``). - "GT" - "GT" * - popscle_alpha - popscle parameter. Grid of alpha to search for, in a comma separated list format of all alpha values to be considered. - "0.1,0.2,0.3,0.4,0.5" - "0.1,0.2,0.3,0.4,0.5" * - popscle_rename_donors - | popscle parameter. A comma-separated list of donor names for renaming clusters achieved by popscle. Must be consistent with *popscle_num_samples* input. | By default, the resulting donors are *Donor1*, *Donor2*, ... - "CB1,CB2,CB3,CB4" - * - popscle_version - popscle parameter. popscle version to use. Available options: - ``2021.05``: Based on commitment `da70fc7 `_ on 2021/05/05. - ``0.1b``: Based on version `0.1-beta `_ released on 2019/10/03. - "2021.05" - "2021.05" * - popscle_num_cpu - popscle parameter. Number of CPU used by popscle per pair. - 1 - 1 * - popscle_memory - popscle parameter. Memory size string per pair. - "120G" - "120G" * - popscle_extra_disk_space - popscle parameter. Extra disk space size (integer) in GB needed for popscle per pair, besides the disk size required to hold input files specified in the sample sheet. - 100 - 100 --------------------- Workflow outputs ^^^^^^^^^^^^^^^^^^ See the table below for *demultiplexing* workflow outputs. .. list-table:: :widths: 5 5 10 :header-rows: 1 * - Name - Type - Description * - output_folders - Array[String] - A list of Google Bucket URLs of the output folders. Each folder is associated with one RNA-hashtag pair in the given sample sheet. * - output_zarr_files - Array[File] - A list of demultiplexed RNA count matrices in zarr format. Each zarr file is associated with one RNA-hashtag pair in the given sample sheet. Please refere to section `load demultiplexing results into Python and R`_ for its structure. In the output subfolder of each cell-hashing/nuclei-hashing RNA-hashtag data pair, you can find the following files: .. list-table:: :widths: 5 10 :header-rows: 1 * - Name - Description * - output_name_demux.zarr.zip - Demultiplexed RNA raw count matrix in zarr format. Please refer to section `load demultiplexing results into Python and R`_ for its structure. * - output_name.out.demuxEM.zarr.zip - | This file contains intermediate results for both RNA and hashing count matrices. | To load this file into Python, you need to first install `Pegasusio`_ on your local machine. Then use ``import pegasusio as io; data = io.read_input("output_name.out.demuxEM.zarr.zip")`` in Python environment. | It contains 2 UnimodalData objects: one with key name suffix ``-hashing`` is the hashtag count matrix, the other one with key name suffix ``-rna`` is the demultiplexed RNA count matrix. | To load the hashtag count matrix, type ``hash_data = data.get_data('-hashing')``, where ```` is the genome name of the data. The count matrix is ``hash_data.X``; cell barcode attributes are stored in ``hash_data.obs``; sample names are in ``hash_data.var_names``. Moreover, the estimated background probability regarding hashtags is in ``hash_data.uns['background_probs']``. | To load the RNA matrix, type ``rna_data = data.get_data('-rna')``, where ```` is the genome name of the data. It only contains cells which have estimated sample assignments. The count matrix is ``rna_data.X``. Cell barcode attributes are stored in ``rna_data.obs``: ``rna_data.obs['demux_type']`` stores the estimated droplet types (singlet/doublet/unknown) of cells; ``rna_data.obs['assignment']`` stores the estimated hashtag(s) that each cell belongs to. Moreover, for cell-hashing/nucleus-hashing data, you can find estimated sample fractions (sample1, sample2, ..., samplen, background) for each droplet in ``rna_data.obsm['raw_probs']``. * - output_name.ambient_hashtag.hist.pdf - Optional output. A histogram plot depicting hashtag distributions of empty droplets and non-empty droplets. * - output_name.background_probabilities.bar.pdf - Optional output. A bar plot visualizing the estimated hashtag background probability distribution. * - output_name.real_content.hist.pdf - Optional output. A histogram plot depicting hashtag distributions of not-real-cells and real-cells as defined by total number of expressed genes in the RNA assay. * - output_name.rna_demux.hist.pdf - Optional output. A histogram plot depicting RNA UMI distribution for singlets, doublets and unknown cells. * - output_name.gene_name.violin.pdf - Optional outputs. Violin plots depicting gender-specific gene expression across samples. We can have multiple plots if a gene list is provided in ``demuxEM_generate_gender_plot`` field of cumulus_hashing_cite_seq inputs. In the output subfolder of each genetic-pooling RNA-hashtag data pair generated by *souporcell*, you can find the following files: .. list-table:: :widths: 5 10 :header-rows: 1 * - Name - Description * - output_name_demux.zarr.zip - Demultiplexed RNA count matrix in zarr format. Please refer to section `load demultiplexing results into Python and R`_ for its structure. * - clusters.tsv - Inferred droplet type and cluster assignment for each cell barcode. * - cluster_genotypes.vcf - Inferred genotypes for each cluster. * - match_donors.log - Log of matching donors step, with information of donor matching included. In the output subfolder of each genetic-pooling RNA-hashtag data pair generated by *demuxlet*, you can find the following files: .. list-table:: :widths: 5 10 :header-rows: 1 * - Name - Description * - output_name_demux.zarr.zip - Demultiplexed RNA count matrix in zarr format. Please refer to section `load demultiplexing results into Python and R`_ for its structure. * - output_name.best (demuxlet) or output_name.clust1.samples.gz (freemuxlet) - Inferred droplet type and cluster assignment for each cell barcode. --------------------------------- Load demultiplexing results into Python and R ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To load demultiplexed RNA count matrix into Python, you need to install Python package pegasusio_ first. Then follow the codes below:: import pegasusio as io data = io.read_input('output_name_demux.zarr.zip') Once you load the data object, you can find estimated droplet types (singlet/doublet/unknown) in ``data.obs['demux_type']``. Notices that there are cell barcodes with no sample associated, and therefore have no droplet type. You can also find estimated sample assignments in ``data.obs['assignment']``. For cell-hashing/nucleus-hashing data, if one sample name can correspond to multiple feature barcodes, each feature barcode is assigned to a unique sample name, and this deduplicated sample assignment results are in ``data.obs['assignment.dedup']``. To load the results into R, you need to install R package ``reticulate`` in addition to Python package ``pegasusio``. Then follow the codes below:: library(reticulate) ad <- import("pegasusio", convert = FALSE) data <- ad$read_input("output_name_demux.zarr.zip") Results are in ``data$obs['demux_type']``, ``data$obs['assignment']``, and similarly as above, for cell-hashing/nucleus-hashing data, you'll find an additional field ``data$obs['assignment.dedup']`` for deduplicated sample assignment in the case that one sample name can correspond to multiple feature barcodes. .. _cellranger_workflow tutorial: ./cellranger/index.html .. _Import workflows to Terra: ./cumulus_import.html .. _gsutil: https://cloud.google.com/storage/docs/gsutil .. _genome reference: ./cellranger/index.html#sample-sheet .. _souporcell: https://github.com/wheaton5/souporcell .. _popscle: https://github.com/statgen/popscle .. _pegasusio: https://pypi.org/project/pegasusio/ .. _load demultiplexing results into Python and R: ./demultiplexing.html#load-demultiplexing-results-into-python-and-r