Example of 10X Genomics CellPlex Analysis on Cloud
In this example, you’ll learn how to perform Cellplex analysis on Cloud using Cromwell.
0. Prerequisite
You need to install the corresponding Cloud SDK tool on your local machine if not:
gcloud CLI for Google Cloud.
AWS CLI v2 for Amazon AWS Cloud.
And then install Altocumulus in your Python environment. This is the tool for data transfer between local machine and Cloud VM instance.
In this example, we assume that your Cromwell server is already deployed on Cloud at IP address 10.0.0.0 with port 8000, and also assume using Google Cloud with bucket gs://my-bucket.
1. Extract Genen-Count Matrices
First step is to extract gene-count matrices from sequencer output.
In this example, we have the following experiment setting:
A sample named
cellplex_gexby pooling all RNA data together for sequencing, with indexSI-TT-A1;A sample named
cellplex_barcodefor hashing data, with indexSI-NN-A1;Three samples to perform individual control:
Sample
Awith indexSI-TT-A2and CMO IDCMO_301,Sample
Bwith indexSI-TT-A3and CMO IDCMO_302,Sample
Cwith indexSI-TT-A4and CMO IDCMO_303
To extract feature barcodes for the hashing data, we need to create a feature barcoding file (say named feature_barcode.csv). Please refer to 10X Multi CMO Reference for the sequence information of these CMO IDs:
ATGAGGAATTCCTGC,A
CATGCCAATAGAGCG,B
CCGTCGTCCAAGCAT,C
After that, create a sample sheet in CSV format (say named cellranger_sample_sheet.csv) as the following:
Sample,Reference,Flowcell,Lane,Index,DataType,FeatureBarcodeFile
cellplex_gex,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A1,rna
cellplex_barcode,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-NN-A1,cmo,/path/to/feature_barcode.csv
A,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A2,rna
B,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A3,rna
C,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A4,rna
where
GRCh38-2020-Ais the Human GRCh38 (GENCODE v32/Ensembl 98) genome reference prebuilt by Cumulus. See Cumulus single-cell genome reference list for a complete list of genome references./path/to/flowcell/foldershould be replaced by the local path to the BCL folder of your sequencer output./path/to/feature_barcode.csvshould be replaced by the local path tofeature_barcode.csvfile we just created above.rnaandcmorefer to gene expression data and cell multiplexing oligos used in 10X Genomics CellPlex assay, respectively.Only the sample of
cmotype needs a feature barcode file for indexing.
For details on preparing this sample sheet, please refer to CellRanger workflow sample sheet format.
Now let’s prepare an input JSON file for cellranger_workflow WDL workflow to execute (say named cellranger_inputs.json):
{
"cellranger_workflow.input_csv_file": "/path/to/cellranger_sample_sheet.csv",
"cellranger_workflow.output_directory": "gs://my-bucket/cellplex/cellranger_output"
}
where
/path/to/cellranger_sample_sheet.csvshould be replaced by the local path to your sample sheet created above.gs://my-bucket/cellplex/cellranger_outputis the target folder on Google bucket to store your result when the workflow job is finished.
For details on these workflow inputs, please refer to CellRanger workflow inputs.
Now we are ready to submit a job to the Cromwell server on Cloud for computing. On your local machine, run the following command:
alto cromwell run -s 10.0.0.0 -p 8000 -m broadinstitute:cumulus:cellranger:master -i /path/to/cellranger_inputs.json -o cellranger_inputs_updated.json -b gs://my-bucket/cellplex
where
-sto specify the server’s IP address (or hostname),-pto specify the server’s port number.-mto specify which WDL workflow to use. You should use the Dockstore name of Cumulus cellranger_workflow. Here, the latest versionmasteris used. If omit the version info, i.e.broadinstitute:cumulus:cellranger, the default version will be used.-ito specify the workflow input JSON file.-oand-bare used when the input data are local and need to be uploaded to Cloud bucket first. This can be inferred from the workflow input JSON file and sample sheet CSV file.-oto specify the updated workflow input JSON file after uploading the input data, with all the local paths updated to Cloud bucket URLs.-bto specify which folder on Cloud bucket to upload the local input data.
Notice that -o and -b options can be dropped if all of your input data are already on Cloud bucket.
After submission, you’ll get the job’s ID for tracking its status:
alto cromwell check_status -s 10.0.0.0 -p 8000 --id <your-job-ID>
where <your-job-ID> should be replaced by the actual Cromwell job ID.
When the job is done, you’ll get results in gs://my-bucket/cellplex/cellranger_output. It should contain 6 subfolders, each of which is associated with one sample in cellranger_sample_sheet.csv.
2. Demultiplexing
Next, we need to demultiplex the resulting gene-count matrices. In this example, we perform both DemuxEM and Souporcell methods, respectively.
For DemuxEM, we’ll need the RNA raw count matrix in HDF5 format (gs://my-bucket/cellplex/cellranger_output/cellplex_gex/raw_feature_bc_matrix.h5) and the hashing count matrix in CSV format (gs://my-buckjet/cellplex/cellranger_output/cellplex_barcode/cellplex_barcode.csv).
For Souporcell, both the RNA raw count matrix above and its corresponding BAM file (gs://my-bucket/cellplex/cellranger_output/cellplex_gex/possorted_genome_bam.bam) are needed.
Prepare a sample sheet in CSV format (say named demux_sample_sheet.csv) for demultiplexing, one line for DemuxEM, one for Souporcell:
OUTNAME,RNA,TagFile,TYPE
cellplex_demux,gs://my-bucket/cellplex/cellranger_output/cellplex_gex/raw_feature_bc_matrix.h5,gs://my-buckjet/cellplex/cellranger_output/cellplex_barcode/cellplex_barcode.csv,cell-hashing
cellplex_souporcell,gs://my-bucket/cellplex/cellranger_output/cellplex_gex/raw_feature_bc_matrix.h5,gs://my-bucket/cellplex/cellranger_output/cellplex_gex/possorted_genome_bam.bam,genetic-pooling
where
cell-hashingindicates using DemuxEM for demultiplexing, whilegenetic-poolingindicates using genetic pooling methods for demultiplexing, with Souporcell being the default.
For details on this sample sheet, please refer to Demultiplexing workflow sample sheet format.
Then prepare a workflow input JSON file (say named demux_inputs.json) for demultiplexing:
{
"demultiplexing.input_sample_sheet": "/path/to/demux_sample_sheet.csv",
"demultiplexing.output_directory": "gs://my-bucket/cellplex/demux_output",
"demultiplexing.genome": "GRCh38-2020-A",
"demultiplexing.souporcell_num_clusters": 3
}
where
/path/to/demux_sample_sheet.csvshould be replaced by the local path to yourdemux_sample_sheet.csvcreated above.gs://my-bucket/cellplex/demux_outputis the Bucket folder to write the results when the job is finished.GRCh38-2020-Ais the genome reference used by Souporcell, which should be consistent with your settings in Step 1.souporcell_num_clustersis to set the number of clusters you expect to see for Souporcell clustering. Since we have 3 donors, so set it to 3.
For details, please refer to Demultiplexing workflow inputs.
Now submit the demultiplexing job to Cromwell server on Cloud:
alto cromwell run -s 10.0.0.0 -p 8000 -m broadinstitute:cumulus:demultiplexing:master -i demux_inputs.json -o demux_inputs_updated.json -b gs://my-bucket/cellplex
where
broadinstitute:cumulus:demultiplexingrefers to demultiplexing workflow published on Dockstore.We still need
-oand-boptions becausedemux_sample_sheet.csvis on the local machine.
Similarly, when the submission succeeds, you’ll get another job ID for demultiplexing. You can use it to track the job status.
When finished, below are the important output files:
DemuxEM output: In folder
gs://my-bucket/cellplex/demux_output/cellplex_demux,cellplex_demux_demux.zarr.zip: Demultiplexed RNA raw count matrix. This will be used for downstream analysis.cellplex_demux.out.demuxEM.zarr.zip: This file contains intermediate results for both RNA and hashing count matrices, which is useful for compare with other demultiplexing methods.DemuxEM plots in PDF format. They are used for estimating the performance of DemuxEM on the data.
Souporcell output: In folder
gs://my-bucket/cellplex/demux_output/cellplex_souporcell,cellplex_souporcell_demux.zarr.zip: Demultiplexed RNA raw count matrix. This will be used for downstream analysis.clusters.tsv: Inferred droplet type and cluster assignment for each cell barcode.cluster_genotypes.vcf: Inferred genotypes for each cluster.
3. Interactive Data Analysis
You may use Cumulus workflow to perform the downstream analysis in a batch way. Alternatively, you can also download the demultiplexing results from the Cloud bucket to your local machine, and perform the analysis interactively. This section introduces how to use Cumulus’ analysis module Pegasus to load demultiplexing results, perform quality control (QC), and compare the performance of the two methods.
You’ll need to first install Pegasus in your local Python environment. Also, download the demultiplexed raw counts in .zarr.zip format mentioned above to your local machine.
3.1. Extract Singlet/Doublet Type and Assignment
We can load the DemuxEM result, and perform QC by:
import pegasus as pg
data_demuxEM = pg.read_input("cellplex_demux_demux.zarr.zip")
pg.qc_metrics(data_demuxEM, min_genes=500, max_genes=6000, mito_prefix='MT-', percent_mito=20)
pg.filter_data(data_demuxEM)
where qc_metrics and filter_data are Pegasus functions to filter out low quality cells, and keep those with number of genes within range [500, 6000)
and having expression of mitochondrial genes < 20%. Please see Pegasus preprocess tools for details.
There are two columns in data_demuxEM.obs field related to demultiplexing results:
demux_type: This column stores the singlet/doublet type of each cell:
singlet,doublet, orunknown.assignment: This column stores the more detailed assignment of cells regarding samples/donors.
To get the distribution regarding these columns, e.g. demux_type:
data_demuxEM.obs['demux_type'].value_counts()
Besides, you can export the cell barcodes along with their singlet/doublet type and assignment as a CSV file by:
data_demuxEM.obs[['demux_type', 'assignment']].to_csv("demuxEM_assignment.csv")
We can also do it similarly for the Souporcell result as above, by reading cellplex_souporcell_demux.zarr.zip instead.
3.2. Compare the Two Demultiplexing Methods
We can compare the performance of DemuxEM and Souporcell by plotting a heatmap showing their singlet/doublet assignment results.
Assume we’ve already loaded the two results (data_demuxEM for DemuxEM result, data_souporcell for Souporcell result), and performed QC as in 3.1.
The following Python code will generate this heatmap in an interactive Python environment (e.g. in a Jupyter notebook):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def extract_assignment(data):
assign = data.obs['demux_type'].values.astype('object')
idx_singlet = (data.obs['demux_type'] == 'singlet').values
assign[idx_singlet] = data.obs.loc[idx_singlet, 'assignment'].values.astype(object)
return assign
assign_demuxEM = extract_assignment(data_demuxEM)
assign_souporcell = extract_assignment(data_souporcell)
df = pd.crosstab(assign_demuxEM, assign_souporcell)
df.columns.name = df.index.name = ""
ax = plt.gca()
ax.xaxis.tick_top()
ax = sns.heatmap(df, annot=True, fmt='d', cmap='inferno', ax=ax)
plt.tight_layout()
plt.gcf().dpi=500
3.3. Downstream Analysis
To perform further downstream analysis on the singlets, please refer to Pegasus tutorials.