Example of Gene expression, Hashing and CITE-Seq Analysis on Cloud¶
In this example, you’ll learn how to perform Gene expression, Hashing and CITE-Seq data analysis on Cloud.
This example covers the cases of both Terra platform and a custom cloud server running Cromwell. When reading through the tutorial, you may check out the corresponding part based on your working situation.
0. Prerequisite¶
0-a. Cromwell server¶
If you use a Cromwell server on Cloud, on your local machine, you need to install the corresponding Cloud SDK tool if not:
- gcloud CLI if your Cloud bucket is on Google Cloud.
- AWS CLI v2 if your Cloud bucket is on Amazon AWS Cloud.
And then install Altocumulus in your Python environment. This is the tool for data transfer between local machine and cloud bucket, as well as communication with the Cromwell server on cloud.
0-b. Terra Platform¶
If you use Terra, after registering on Terra and creating a workspace there, you’ll need the following information:
- Terra workspace name. This is shown on your Terra workspace webpage, with format “<workspace-namespace>/<workspace-name>”. For example, if your Terra workspace has full name
ws-lab/ws-01
, then ws-lab is the namespace and ws-01 is the workspace name winthin that namespace.- The corresponding Google Cloud Bucket of your Terra workspace. You can check it under “Google Bucket” title on the right panel of your Terra workspace’s Dashboard tab. The bucket name associated with your workspace starts with
fc-
followed by a sequence of heximal numbers. For example,gs://fc-e0000000
, where “gs://” is the header of GS URI.
Besides, install gcloud CLI and Altocumulus on your local machine for data uploading. These tools will be used for data transfer between local machine and Cloud bucket.
Alternatively, you can also use Terra web UI for job submission instead of command-line submission. This will be discussed in Section Run Analysis with Terra Web UI below.
1. Extract Gene-Count Matrices¶
This phase is to extract gene-count matrices from sequencing output.
There are two cases: (1) from BCL data, which includes mkfastq step to generate FASTQ files and count step to generate gene-count matrices; (2) from FASTQ files, which only runs the count step.
1-a. Extract Genen-Count Matrices from BCL data¶
This section covers the case starting from BCL data.
Step 1. Sample Sheet Preparation¶
First, prepare a feature index file for your dataset. Say its filename is antibody_index.csv
, which has format “feature_barcode, feature_name,feature_type”. See an example below:
TTCCTGCCATTACTA,HTO_1,hashing
CCGTACCTCATTGTT,HTO_2,hashing
GGTAGATGTCCTCAG,HTO_3,hashing
TGGTGTCATTCTTGA,Ab1,citeseq
CTCATTGTAACTCCT,Ab2,citeseq
GCGCAACTTGATGAT,Ab3,citeseq
... ...
where each line contains the barcode and the name of a Hashing/CITE-Seq index: hashing
indicates a Cell/Nucleus-Hashing index, while citeseq
indicates a CITE-Seq index.
Next, create a sample sheet cellranger_sample_sheet.csv
for Cell Ranger processing on your local machine. Below is an example:
Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType,FeatureBarcodeFile
sample_control,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A1,fiveprime,rna
sample_gex,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A2,fiveprime,rna
sample_cell_hashing,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-NN-A1,fiveprime,hashing,/path/to/antibody_index.csv
sample_cite_seq,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-NN-A2,fiveprime,citeseq,/path/to/antibody_index.csv
where
GRCh38-2020-A
is the is the Human GRCh38 (GENCODE v32/Ensembl 98) genome reference prebuilt by Cumulus. See Cumulus single-cell genome reference list for a complete list of genome references./path/to/flowcell/folder
should be replaced by the actual local path to the BCL folder of your sequencing data./path/to/antibody_index.csv
should be replaced by the actual local path toantibody_index.csv
file we just created above.rna
,hashing
andciteseq
refer to gene expression data, cell/nucleus-hashing data, and CITE-Seq data, respectively.- Samples of type
rna
do not need any feature barcode file for indexing.
For the details on how to prepare this sample sheet, please refer to Step 3 of Cell Ranger sample sheet instruction.
Step 2. Workflow Input Preparation¶
Now prepare a JSON file for cellranger_workflow WDL workflow input on your local machine (say named cellranger_inputs.json
):
{
"cellranger_workflow.input_csv_file": "/path/to/cellranger_sample_sheet.csv",
"cellranger_workflow.output_directory": "gs://my-bucket/cellranger_output"
}
where
/path/to/cellranger_sample_sheet.csv
should be replaced by the actual local path to your sample sheet created above.gs://my-bucket/cellranger_output
is the target folder on Google bucket to store your result when the workflow job is finished, wheremy-bucket
should be replaced by your own Google bucket name.
For details on the all the workflow inputs of cellranger_workflow, please refer to Cell Ranger workflow inputs.
Step 3. Job Submission¶
Now we are ready to submit a job to cloud for computing:
If you use a Cromwell server on cloud, run the following Altocumulus command:
alto cromwell run -s <server-address> -p <port-number> -m broadinstitute:cumulus:cellranger -i /path/to/cellranger_inputs.json -o cellranger_inputs_updated.json -b gs://my-bucket/data_source
where
-s
specifies the server’s IP address (or hostname), where<server-address>
should be replaced by the actual IP address (or hostname).-p
specifies the server’s port number for Cromwell, where<port-number>
should be replaced by the actual port number.-m
specifies which WDL workflow to use. You should use the Dockstore name of Cumulus cellranger_workflow. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:cellranger:master
to use its development version in master branch.-i
specifies the workflow input JSON file.-o
and-b
are used when the input data (which are specified in the workflow input JSON file and sample sheet CSV file) are local and need to be uploaded to Cloud bucket first.-o
specifies the updated workflow input JSON file after uploading the input data, with all the local paths updated to Cloud bucket URIs. This is useful when resubmitting jobs running the same input data, without uploading the same input data again.-b
specifies which folder on Cloud bucket to upload the local input data, wheremy-bucket
should be replaced by your own Google bucket name. Feel free to choose the folder name other thandata_source
.
Notice that -o
and -b
options can be dropped if all of your input data are already on Cloud bucket.
After submission, you’ll get the job’s ID for tracking its status:
alto cromwell check_status -s <server-address> -p <port-number> --id <your-job-ID>
where <your-job-ID>
should be replaced by the actual Cromwell job ID.
If you use Terra, run the following Altocumulus command:
alto terra run -m broadinstitute:cumulus:cellranger -w ws-lab/ws-01 --bucket-folder data_source -i /path/to/cellranger_inputs.json -o cellranger_inputs_updated.json
where
-m
specifies which WDL workflow to use. You should use the Dockstore name of Cumulus cellranger_workflow. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:cellranger:master
to use its development version in master branch.-w
specifies the Terra workspace full name to use, wherews-lab/ws-01
should be replaced by your own Terra workspace full name.--bucket-folder
specifies the folder name on the Google bucket associated with the Terra workspace to store the uploaded data. Feel free to choose folder name other thandata_source
.-i
specifies the workflow input JSON file, where/path/to/cellranger_inputs.json
should be replaced by the actual local path tocellranger_inputs.json
file.-o
specifies the updated workflow input JSON file after uploading the input data, with all the local paths updated to Cloud bucket URIs. This is useful when resubmitting jobs running the same input data, without uploading the same input data again.
Notice that --bucket-folder
and -o
options can be dropped if all of your input data are already on Cloud bucket.
After submission, you can check the job’s status in the Job History tab of your Terra workspace page.
When the job is done, you’ll get results in gs://my-bucket/cellranger_output
, which is specified in cellranger_inputs.json
above. It should contain 4 subfolders, each of which is associated with one sample specified in cellranger_sample_sheet.csv
above.
For the next phases, you’ll need 3 files from the output:
- RNA count matrix of the sample group of interest:
gs://my-bucket/cellranger_output/sample_gex/raw_feature_bc_matrix.h5
;- Cell-Hashing Antibody count matrix:
gs://my-bucket/cellranger_output/sample_cell_hashing/sample_cell_hashing.csv
;- CITE-Seq Antibody count matrix:
gs://my-bucket/cellranger_output/sample_cite_seq/sample_cite_seq.csv
.
1-b. Extract Gene-Cound Matrices from FASTQ files¶
This section covers the case starting from FASTQ files.
Similarly as above, First, prepare a feature index file for your dataset. Say its filename is antibody_index.csv
, which has format “feature_barcode, feature_name,feature_type”. See an example below:
TTCCTGCCATTACTA,HTO_1,hashing
CCGTACCTCATTGTT,HTO_2,hashing
GGTAGATGTCCTCAG,HTO_3,hashing
TGGTGTCATTCTTGA,Ab1,citeseq
CTCATTGTAACTCCT,Ab2,citeseq
GCGCAACTTGATGAT,Ab3,citeseq
... ...
where each line contains the barcode and the name of a Hashing/CITE-Seq index: hashing
indicates a Cell/Nucleus-Hashing index, while citeseq
indicates a CITE-Seq index.
Next, create a sample sheet cellranger_sample_sheet.csv
for Cell Ranger processing on your local machine. Below is an example:
Sample,Reference,Flowcell,Chemistry,DataType,FeatureBarcodeFile
sample_1_rna,GRCh38-2020-A,/path/to/fastq/gex,fiveprime,rna
sample_2_rna,GRCh38-2020-A,/path/to/fastq/gex,fiveprime,rna
sample_3_rna,GRCh38-2020-A,/path/to/fastq/gex,fiveprime,rna
sample_1_adt,GRCh38-2020-A,/path/to/fastq/hashing_citeseq,fiveprime,adt,/path/to/antibody_index.csv
sample_2_adt,GRCh38-2020-A,/path/to/fastq/hashing_citeseq,fiveprime,adt,/path/to/antibody_index.csv
sample_3_adt,GRCh38-2020-A,/path/to/fastq/hashing_citeseq,fiveprime,adt,/path/to/antibody_index.csv
where
GRCh38-2020-A
is the is the Human GRCh38 (GENCODE v32/Ensembl 98) genome reference prebuilt by Cumulus. See Cumulus single-cell genome reference list for a complete list of genome references./path/to/fastq/gex
should be replaced by the actual local path to the folder containing FASTQ files of RNA samples./path/to/fastq/hashing_citeseq
should be replaced by the actual local path to the folder containing FASTQ files of Cell/Nucleus-Hashing and CITE-Seq samples./path/to/antibody_index.csv
should be replaced by the actual local path toantibody_index.csv
file we just created above.rna
andadt
refer to gene expression data and antibody data, respectively. In specific,adt
covers bothciteseq
andhashing
types, i.e. it includes both Hashing and CITE-Seq data types.- Samples of type
rna
do not need any feature barcode file for indexing.- Columns Lane and Index are not needed if starting from FASTQ files, as mkfastq step will be skipped.
For the details on how to prepare this sample sheet, please refer to Step 3 of Cell Ranger sample sheet instruction.
Now prepare a JSON file for cellranger_workflow WDL workflow input on your local machine (say named cellranger_inputs.json
):
{
"cellranger_workflow.input_csv_file": "/path/to/cellranger_sample_sheet.csv",
"cellranger_workflow.output_directory": "gs://my-bucket/cellranger_output",
"cellranger_workflow.run_mkfastq": false
}
where
/path/to/cellranger_sample_sheet.csv
should be replaced by the actual local path to your sample sheet created above.gs://my-bucket/cellranger_output
is the target folder on Google bucket to store your result when the workflow job is finished, wheremy-bucket
should be replaced by your own Google bucket name.- Set run_mkfastq to
false
to skip the mkfastq step, as we start from FASTQ files.
For details on the all the workflow inputs of cellranger_workflow, please refer to Cell Ranger workflow inputs.
Now we are ready to submit a job to cloud for computing. Follow instructions in Section 1-a above.
When finished, you’ll get results in gs://my-bucket/cellranger_output
, which is specified in cellranger_inputs.json
above. It should contain 6 subfolders, each of which is associated with one sample specified in cellranger_sample_sheet.csv
above.
In specific, for each adt
type sample, there are both count matrix of Hashing data and that of CITE-Seq data generated inside its corresponding subfolder, with filename suffix .hashing.csv
and .citeseq.csv
, respectively.
2. Demultiplex Cell-Hashing Data using DemuxEM¶
Run Workflow on Cloud¶
Next, we need to demultiplex the resulting RNA gene-count matrices. We use DemuxEM method in this example.
To be brief, we use the output of Section 1-a for illustration:
On your local machine, prepare a CSV-format sample sheet
demux_sample_sheet.csv
with the following content:OUTNAME,RNA,TagFile,TYPE exp,gs://my-bucket/cellranger_output/sample_gex/raw_feature_bc_matrix.h5,gs://my-bucket/cellranger_output/sample_cell_hashing/sample_cell_hashing.csv,cell-hashingwhere OUTNAME specifies the subfolder and file names of output, which is free to be changed, RNA and TagFile columns specify the RNA and hashing tag meta-data of samples, and TYPE is
cell-hashing
for this phase.On your local machine, also prepare an input JSON file
demux_inputs.json
for demultiplexing WDL workflow,demux_inputs.json
with the following content:{ "demultiplexing.input_sample_sheet" : "/path/to/demux_sample_sheet.csv", "demultiplexing.output_directory" : "gs://my-bucket/demux_output" }where
/path/to/demux_sample_sheet.csv
should be replaced by the actual local path todemux_sample_sheet.csv
created above.For the details on these options, please refer to demultiplexing workflow inputs.
Submit a demultiplexing job with
demux_inputs.json
input above to cloud for execution.
For job submission:
If you use a Cromwell server on cloud, run the following Altocumulus command on your local machine:
alto cromwell run -s <server-address> -p <port-number> -m broadinstitute:cumulus:demultiplexing -i /path/to/demux_inputs.json -o demux_inputs_updated.json -b gs://my-bucket/data_source
where
broadinstitute:cumulus:demultiplexing
refers to demultiplexing WDL workflow published on Dockstore. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:demultiplexing:master
to use its development version in master branch./path/to/demux_inputs.json
should be replaced by the actual local path todemux_inputs.json
created above.- Replace
my-bucket
in-b
option by your own Google bucket name, and feel free to choose folder name other thandata_source
for uploading.- We still need
-o
and-b
options becausedemux_sample_sheet.csv
is on the local machine.
Similarly, when the submission succeeds, you’ll get another job ID for demultiplexing. You can use it to track the job status.
If you use Terra, run the following Altocumulus command:
alto terra run -m broadinstitute:cumulus:demultiplexing -w ws-lab/ws-01 --bucket-folder data_source -i /path/to/demux_inputs.json -o demux_inputs_updated.json
where
broadinstitute:cumulus:demultiplexing
refers to demultiplexing WDL workflow published on Dockstore. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:demultiplexing:master
to use its development version in master branch./path/to/demux_inputs.json
should be replaced by the actual local path todemux_inputs.json
created above.ws-lab/ws-01
should be replaced by your own Terra workspace full name.--bucket-folder
: Feel free to choose folder name other thandata_source
for uploading.- We still need
-o
and--bucket-folder
options becausedemux_sample_sheet.csv
is on the local machine.
After submission, you can check the job’s status in the Job History tab of your Terra workspace page.
When finished, demultiplexing results are in gs://my-bucket/demux_output/exp
folder, with the following important output files:
exp_demux.zarr.zip
: Demultiplexed RNA raw count matrix. This will be used for downstram analysis.exp.out.demuxEM.zarr.zip
: This file contains intermediate results for both RNA and hashing count matrices, which is useful for compare with other demultiplexing methods.- DemuxEM plots in PDF format. They are used for evaluating the performance of DemuxEM on the data.
(Optional) Extract Demultiplexing results¶
This is performed on your local machine with demultiplexing results downloaded from cloud to your machine.
To download the demultiplexed count matrix exp_demux.zarr.zip
, you can either do it in Google cloud console, or using gsutil in command line:
gsutil -m cp gs://my-bucket/demux_output/exp/exp_demux.zarr.zip .
After that, in your Python environment, install Pegasus package, and follow the steps below to extract the demultiplexing results:
Load Libraries:
import numpy as np import pandas as pd import pegasus as pg import matplotlib.pyplot as plt import seaborn as snsLoad demuxEM output. For demuxEM, load RNA expression matrix with demultiplexed sample identities in Zarr format. These can be found in Google cloud console. QC 500 <= # of genes < 6000, % mito <= 10%:
data = pg.read_input('exp_demux.zarr.zip') pg.qc_metrics(data, min_genes=500, max_genes=6000, mito_prefix='Mt-', percent_mito=10) pg.filter_data(data)Demultiplexing results showing singlets, doublets and unknown:
data.obs['demux_type'].value_counts()Show assignments in singlets:
idx = data.obs['demux_type'] == 'singlet' data.obs.loc[idx, 'assignment'].value_counts()[0:10]Write assignment outputs to CSV:
data.obs[['demux_type', 'assignment']].to_csv('demux_exp.csv')
3. Data Analysis on CITE-Seq Data¶
In this phase, we merge RNA and ADT matrices for CITE-Seq data, and perform the downstream analysis.
To be brief, we use the CITE-Seq count matrix generated from Section 1-a and demultiplexing results in Section 2 for illustraion here:
On your local machine, prepare a CSV-format sample sheet
count_matrix.csv
with the following content:Sample,Location,Modality exp,gs://my-bucket/demux_output/exp/exp_demux.zarr.zip,rna exp,gs://my-bucket/cellranger_output/sample_cite_seq/sample_cite_seq.csv,citeseqThis sample sheet describes the metadata for each modality (as one row in the sheet):
- Sample specifies the name of the modality, and all the modalities of the same sample should have one common name, as otherwise their count matrices won’t be aggregated together;
- Location specifies the file location. For RNA data, this is the output of Phase 2; for CITE-Seq antibody data, it’s the output of Phase 1.
- Modality specifies the modality type, which is either
rna
for RNA matrix, orciteseq
for CITE-Seq antibody matrix.On your local machine, also prepare a JSON file
cumulus_inputs.json
for cumulus WDL workflow, with the following content:{ "cumulus.input_file": "/path/to/count_matrix.csv", "cumulus.output_directory": "gs://my-bucket/cumulus_output", "cumulus.output_name": "exp_merged_out", "cumulus.select_only_singlets": true, "cumulus.run_louvain": true, "cumulus.run_umap": true, "cumulus.citeseq": true, "cumulus.citeseq_umap": true, "cumulus.citeseq_umap_exclude": "Mouse_IgG1,Mouse_IgG2a,Mouse_IgG2b,Rat_IgG2b", "cumulus.plot_composition": "louvain_labels:assignment", "cumulus.plot_umap": "louvain_labels,assignment", "cumulus.plot_citeseq_umap": "louvain_labels,assignment", "cumulus.cluster_labels": "louvain_labels", "cumulus.annotate_cluster": true, "cumulus.organism": "human_immune" }where
/path/to/count_matrix.csv
should be replaced by the actual local path tocount_matrix.csv
created above.A typical Cumulus WDL pipeline consists of 4 steps, which is given here. For details on Cumulus workflow inputs above, please refer to cumulus inputs.
Submit a demultiplexing job with
cumulus_inputs.json
input above to cloud for execution.
For job submission:
If you use a Cromwell server on cloud, run the following Altocumulus command to submit the job:
alto cromwell run -s <server-address> -p <port-number> -m broadinstitute:cumulus:cumulus -i /path/to/cumulus_inputs.json -o cumulus_inputs_updated.json -b gs://my-bucket/data_source
where
broadinstitute:cumulus:cumulus
refers to cumulus WDL workflow published on Dockstore. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:cumulus:master
to use its development version in master branch./path/to/cumulus_inputs.json
should be replaced by the actual local path tocumulus_inputs.json
created above.my-bucket
in-b
option should be replaced by your own Google bucket name, and feel free to choose folder name other thandata_source
for uploading data.- We still need
-o
and-b
options becausecount_matrix.csv
is on the local machine.
Similarly, when the submission succeeds, you’ll get another job ID for demultiplexing. You can use it to track the job status.
If you use Terra, run the following Altocumulus command:
alto terra run -m broadinstitute:cumulus:cumulus -w ws-lab/ws-01 --bucket-folder data_source -i /path/to/cumulus_inputs.json -o cumulus_inputs_updated.json
where
broadinstitute:cumulus:cumulus
refers to cumulus WDL workflow published on Dockstore. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:cumulus:master
to use its development version in master branch.ws-lab/ws-01
should be replaced by your own Terra workspace full name.--bucket-folder
: Feel free to choose folder name other thandata_source
for uploading data./path/to/cumulus_inputs.json
should be replaced by the actual local path tocumulus_inputs.json
created above.- We still need
-o
and--bucket-folder
options becausecount_matrix.csv
is on the local machine.
After submission, you can check the job’s status in the Job History tab of your Terra workspace page.
When finished, all the output files are in gs://my-bucket/cumulus_output
folder, with the following important files:
exp_merged_out.aggr.zarr.zip
: The ZARR format file containing both the aggregated count matrix in<genome>-rna
modality, as well as CITE-Seq antibody count matrix in<genome>-citeseq
modality, where<genome>
is the genome reference name of your count matrices, e.g. GRCh38-2020-A.exp_merged_out.zarr.zip
: The ZARR format file containing the analysis results in<genome>-rna
modality, and CITE-Seq antibody count matrix in<genome>-citeseq
modality.exp_merged_out.<genome>-rna.h5ad
: The processed RNA matrix data in H5AD format.exp_merged_out.<genome>-rna.filt.xlsx
: The Quality-Control (QC) summary of the raw data.exp_merged_out.<genome>-rna.filt.{UMI, gene, mito}.pdf
: The QC plots of the raw data.exp_merged_out.<genome>-rna.de.xlsx
: Differential Expression analysis result.exp_merged_out.<genome>-rna.anno.txt
: The putative cell type annotation output.exp_merged_out.<genome>-rna.umap.pdf
: UMAP plot.exp_merged_out.<genome>-rna.citeseq.umap.pdf
: CITE-Seq UMAP plot.exp_merged_out.<genome>-rna.louvain_labels.assignment.composition.pdf
: Composition plot.
Run Analysis with Terra Web UI¶
For Terra users, instead of using Altocumulus to submit jobs in command line, they can also use the Terra web UI.
First, upload the local BCL data or FASTQ files to the Google bucket associated with your Terra workspace (say gs://fc-e0000000
) using gsutil:
gsutil -m cp -r /path/to/your/data/folder gs://fc-e0000000/data_source/
where /path/to/your/data/folder
should be replaced by the actual local path to your data folder, and data_source
is the folder on Google bucket to store the uploaded data.
Then for each of the 3 phases above:
When preparing the sample sheet, remember to replace all the local paths by the GS URIs of the corresponding folders/files that you uploaded to Google bucket. Then upload it to Google bucket as well:
gsutil cp /path/to/sample/sheet gs://fc-e0000000/data_source/where
/path/to/sample/sheet
should be replaced by the actual local path to your sample sheet. Notice that for Phase 1,antibody_index.csv
file should also be uploaded to Google bucket, and its references in the sample sheet must be replaced by its GS URI.When preparing the workflow input JSON file, change the field of sample sheet to its GS URI on cloud.
Import the corresponding WDL workflow to your Terra workspace by following steps in import workflows tutorial.
In the workflow page (Workspace -> Workflows -> your WDL workflow), upload your input JSON file by clicking the “upload json” button:
Click “SAVE” button to save the configuration, and click “RUN ANALYSIS” button to submit the job:
You can check the job’s status in the Job History tab of your Terra workspace page.