Example of 10X Genomics CellPlex Analysis on Cloud
In this example, you’ll learn how to perform CellPlex analysis on Cloud using Cromwell.
0. Prerequisites
You need to install the corresponding Cloud SDK tool on your local machine, if you haven’t already:
- gcloud CLI for Google Cloud.
- AWS CLI v2 for Amazon AWS Cloud.
Then install Altocumulus (e.g., via pip install altocumulus) in your Python environment. This is the tool used to submit jobs and transfer data between your local machine and the Cloud VM instance.
In this example, we assume that your Cromwell server is already deployed on Cloud at IP address 10.0.0.0 with port 8000, and that you are using Google Cloud with bucket gs://my-bucket.
1. Extract Gene-Count Matrices
The first step is to extract gene-count matrices from the sequencer output.
In this example, we have the following experimental setting:
- A sample named cellplex_gex, created by pooling all RNA data together for sequencing, with index SI-TT-A1;
- A sample named cellplex_barcode for the hashing data, with index SI-NN-A1;
- Three samples to perform individual control:
  - Sample A with index SI-TT-A2 and CMO ID CMO_301,
  - Sample B with index SI-TT-A3 and CMO ID CMO_302,
  - Sample C with index SI-TT-A4 and CMO ID CMO_303.
To extract feature barcodes for the hashing data, we need to create a feature barcoding file (say named feature_barcode.csv). Please refer to the 10X Multi CMO Reference for the sequence information of these CMO IDs:
ATGAGGAATTCCTGC,A
CATGCCAATAGAGCG,B
CCGTCGTCCAAGCAT,C
After that, create a sample sheet in CSV format (say named cellranger_sample_sheet.csv) as follows:
Sample,Reference,Flowcell,Lane,Index,DataType,FeatureBarcodeFile
cellplex_gex,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A1,rna
cellplex_barcode,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-NN-A1,cmo,/path/to/feature_barcode.csv
A,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A2,rna
B,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A3,rna
C,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A4,rna
where
- GRCh38-2020-A is the Human GRCh38 (GENCODE v32/Ensembl 98) genome reference prebuilt by Cumulus. See the Cumulus single-cell genome reference list for a complete list of genome references.
- /path/to/flowcell/folder should be replaced by the local path to the BCL folder of your sequencer output.
- /path/to/feature_barcode.csv should be replaced by the local path to the feature_barcode.csv file we just created above.
- rna and cmo refer to gene expression data and Cell Multiplexing Oligos used in the 10X Genomics CellPlex assay, respectively. Only the sample of cmo type needs a feature barcode file for indexing.
For details on preparing this sample sheet, please refer to CellRanger workflow sample sheet format.
Now let’s prepare an input JSON file for the cellranger_workflow WDL workflow to execute (say named cellranger_inputs.json):
{
"cellranger_workflow.input_csv_file": "/path/to/cellranger_sample_sheet.csv",
"cellranger_workflow.output_directory": "gs://my-bucket/cellplex/cellranger_output"
}
where
- /path/to/cellranger_sample_sheet.csv should be replaced by the local path to your sample sheet created above.
- gs://my-bucket/cellplex/cellranger_output is the target folder on the Google bucket to store your results when the workflow job is finished.
For details on these workflow inputs, please refer to CellRanger workflow inputs.
Now we are ready to submit a job to the Cromwell server on Cloud for computing. On your local machine, run the following command:
alto cromwell run -s 10.0.0.0 -p 8000 -m broadinstitute:cumulus:cellranger:master -i /path/to/cellranger_inputs.json -o cellranger_inputs_updated.json -b gs://my-bucket/cellplex
where
- -s specifies the server’s IP address (or hostname).
- -p specifies the server’s port number.
- -m specifies which WDL workflow to use. You should use the Dockstore name of the Cumulus cellranger_workflow. Here, the latest version master is used. If you omit the version info, i.e. broadinstitute:cumulus:cellranger, the default version will be used.
- -i specifies the workflow input JSON file.
- -o and -b are used when the input data are local and need to be uploaded to the Cloud bucket first. This is inferred from the workflow input JSON file and the sample sheet CSV file:
  - -o specifies the updated workflow input JSON file after uploading the input data, with all the local paths replaced by Cloud bucket URLs.
  - -b specifies which folder on the Cloud bucket to upload the local input data to.
Notice that the -o and -b options can be dropped if all of your input data are already on the Cloud bucket.
After submission, you’ll get the job’s ID for tracking its status:
alto cromwell check_status -s 10.0.0.0 -p 8000 --id <your-job-ID>
where <your-job-ID> should be replaced by the actual Cromwell job ID.
When the job is done, you’ll get results in gs://my-bucket/cellplex/cellranger_output. It should contain one subfolder per sample in cellranger_sample_sheet.csv.
2. Demultiplexing
Next, we need to demultiplex the resulting gene-count matrices. In this example, we apply two methods: DemuxEM and Souporcell.
For DemuxEM, we’ll need the RNA raw count matrix in HDF5 format (gs://my-bucket/cellplex/cellranger_output/cellplex_gex/raw_feature_bc_matrix.h5) and the hashing count matrix in CSV format (gs://my-bucket/cellplex/cellranger_output/cellplex_barcode/cellplex_barcode.csv).
For Souporcell, both the RNA raw count matrix above and its corresponding BAM file (gs://my-bucket/cellplex/cellranger_output/cellplex_gex/possorted_genome_bam.bam) are needed.
Prepare a sample sheet in CSV format (say named demux_sample_sheet.csv) for demultiplexing, with one line for DemuxEM and one for Souporcell:
OUTNAME,RNA,TagFile,TYPE
cellplex_demux,gs://my-bucket/cellplex/cellranger_output/cellplex_gex/raw_feature_bc_matrix.h5,gs://my-bucket/cellplex/cellranger_output/cellplex_barcode/cellplex_barcode.csv,cell-hashing
cellplex_souporcell,gs://my-bucket/cellplex/cellranger_output/cellplex_gex/raw_feature_bc_matrix.h5,gs://my-bucket/cellplex/cellranger_output/cellplex_gex/possorted_genome_bam.bam,genetic-pooling
where
- cell-hashing indicates using DemuxEM for demultiplexing;
- genetic-pooling indicates using genetic pooling methods for demultiplexing, with Souporcell being the default.
For details on this sample sheet, please refer to Demultiplexing workflow sample sheet format.
Then prepare a workflow input JSON file (say named demux_inputs.json) for demultiplexing:
{
"demultiplexing.input_sample_sheet": "/path/to/demux_sample_sheet.csv",
"demultiplexing.output_directory": "gs://my-bucket/cellplex/demux_output",
"demultiplexing.genome": "GRCh38-2020-A",
"demultiplexing.souporcell_num_clusters": 3
}
where
- /path/to/demux_sample_sheet.csv should be replaced by the local path to the demux_sample_sheet.csv file created above.
- gs://my-bucket/cellplex/demux_output is the bucket folder to write the results to when the job is finished.
- GRCh38-2020-A is the genome reference used by Souporcell, which should be consistent with your settings in Step 1.
- souporcell_num_clusters sets the number of clusters you expect to see in Souporcell clustering. Since we have 3 donors, we set it to 3.
For details, please refer to Demultiplexing workflow inputs.
Now submit the demultiplexing job to the Cromwell server on Cloud:
alto cromwell run -s 10.0.0.0 -p 8000 -m broadinstitute:cumulus:demultiplexing:master -i demux_inputs.json -o demux_inputs_updated.json -b gs://my-bucket/cellplex
where
- broadinstitute:cumulus:demultiplexing refers to the demultiplexing workflow published on Dockstore.
- We still need the -o and -b options because demux_sample_sheet.csv is on the local machine.
Similarly, when the submission succeeds, you’ll get another job ID for demultiplexing. You can use it to track the job status.
When the job is finished, the important output files are:
- DemuxEM output, in folder gs://my-bucket/cellplex/demux_output/cellplex_demux:
  - cellplex_demux_demux.zarr.zip: the demultiplexed RNA raw count matrix. This will be used for downstream analysis.
  - cellplex_demux.out.demuxEM.zarr.zip: this file contains the intermediate results for both the RNA and hashing count matrices, which is useful for comparison with other demultiplexing methods.
  - DemuxEM plots in PDF format, used for assessing the performance of DemuxEM on the data.
- Souporcell output, in folder gs://my-bucket/cellplex/demux_output/cellplex_souporcell:
  - cellplex_souporcell_demux.zarr.zip: the demultiplexed RNA raw count matrix. This will be used for downstream analysis.
  - clusters.tsv: the inferred droplet type and cluster assignment for each cell barcode (see the short sketch after this list).
  - cluster_genotypes.vcf: the inferred genotypes for each cluster.
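If you want to quickly inspect Souporcell’s calls without loading the full count matrix, you can read clusters.tsv with pandas. Below is a minimal sketch; it assumes you have downloaded the file locally and that it follows the standard Souporcell layout with barcode, status, and assignment columns.
import pandas as pd

# Load Souporcell's per-barcode calls (tab-separated).
clusters = pd.read_csv("clusters.tsv", sep="\t")

# Count singlet / doublet / unassigned droplets.
print(clusters["status"].value_counts())

# For singlets, count how many barcodes were assigned to each inferred cluster.
singlets = clusters[clusters["status"] == "singlet"]
print(singlets["assignment"].value_counts())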
3. Interactive Data Analysis
You may use the Cumulus workflow to perform the downstream analysis in batch mode. Alternatively, you can download the demultiplexing results from the Cloud bucket to your local machine and perform the analysis interactively. This section introduces how to use Cumulus’ analysis module Pegasus to load the demultiplexing results, perform quality control (QC), and compare the performance of the two methods.
You’ll first need to install Pegasus (e.g., via pip install pegasuspy) in your local Python environment. Also, download the demultiplexed raw counts in .zarr.zip format mentioned above to your local machine.
3.1. Extract Singlet/Doublet Type and Assignment
We can load the DemuxEM result, and perform QC by:
import pegasus as pg
data_demuxEM = pg.read_input("cellplex_demux_demux.zarr.zip")
pg.qc_metrics(data_demuxEM, min_genes=500, max_genes=6000, mito_prefix='MT-', percent_mito=20)
pg.filter_data(data_demuxEM)
where qc_metrics and filter_data are Pegasus functions that filter out low-quality cells, keeping those with a number of genes within the range [500, 6000) and with mitochondrial gene expression < 20%. Please see Pegasus preprocess tools for details.
There are two columns in the data_demuxEM.obs field related to the demultiplexing results:
- demux_type: this column stores the singlet/doublet type of each cell: singlet, doublet, or unknown.
- assignment: this column stores the more detailed assignment of cells to samples/donors.
To see the distribution of these columns, e.g. demux_type:
data_demuxEM.obs['demux_type'].value_counts()
Besides, you can export the cell barcodes along with their singlet/doublet types and assignments to a CSV file by:
data_demuxEM.obs[['demux_type', 'assignment']].to_csv("demuxEM_assignment.csv")
We can process the Souporcell result in the same way, by reading cellplex_souporcell_demux.zarr.zip instead, as shown in the sketch below.
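Below is a minimal sketch that mirrors the DemuxEM code above; the output file name souporcell_assignment.csv is just an example choice.
import pegasus as pg

# Load the demultiplexed Souporcell result and apply the same QC as for DemuxEM.
data_souporcell = pg.read_input("cellplex_souporcell_demux.zarr.zip")
pg.qc_metrics(data_souporcell, min_genes=500, max_genes=6000, mito_prefix='MT-', percent_mito=20)
pg.filter_data(data_souporcell)

# Inspect the singlet/doublet calls and export them together with the donor assignment.
data_souporcell.obs['demux_type'].value_counts()
data_souporcell.obs[['demux_type', 'assignment']].to_csv("souporcell_assignment.csv")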
3.2. Compare the Two Demultiplexing Methods
We can compare the performance of DemuxEM and Souporcell by plotting a heatmap showing their singlet/doublet assignment results.
Assume we’ve already loaded the two results (data_demuxEM for the DemuxEM result, data_souporcell for the Souporcell result) and performed QC as in Section 3.1.
The following Python code will generate this heatmap in an interactive Python environment (e.g. in a Jupyter notebook):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def extract_assignment(data):
    # Use the detailed donor assignment for singlets, and keep the
    # 'doublet'/'unknown' labels as-is for all other droplets.
    assign = data.obs['demux_type'].values.astype('object')
    idx_singlet = (data.obs['demux_type'] == 'singlet').values
    assign[idx_singlet] = data.obs.loc[idx_singlet, 'assignment'].values.astype(object)
    return assign

assign_demuxEM = extract_assignment(data_demuxEM)
assign_souporcell = extract_assignment(data_souporcell)

# Cross-tabulate the two sets of calls and plot them as an annotated heatmap.
df = pd.crosstab(assign_demuxEM, assign_souporcell)
df.columns.name = df.index.name = ""
ax = plt.gca()
ax.xaxis.tick_top()
ax = sns.heatmap(df, annot=True, fmt='d', cmap='inferno', ax=ax)
plt.tight_layout()
plt.gcf().dpi = 500
3.3. Downstream Analysis
To perform further downstream analysis on the singlets, please refer to Pegasus tutorials.
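As a starting point, you will typically want to restrict the data to singlets before clustering and other downstream steps. Below is a minimal sketch of one way to do this; it assumes that your Pegasus version supports the select_singlets option of qc_metrics (otherwise you can subset on the demux_type column yourself).
import pegasus as pg

# Reload the demultiplexed counts, keep only barcodes called as singlets,
# and apply the same QC thresholds as before. (select_singlets is assumed
# to be available in your Pegasus version.)
data = pg.read_input("cellplex_demux_demux.zarr.zip")
pg.qc_metrics(data, select_singlets=True, min_genes=500, max_genes=6000,
              mito_prefix='MT-', percent_mito=20)
pg.filter_data(data)

# 'assignment' now tells you which sample/donor each remaining cell came from.
data.obs['assignment'].value_counts()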