Example of Gene expression, Hashing and CITE-Seq Analysis on Cloud
In this example, you’ll learn how to perform Gene expression, Hashing and CITE-Seq data analysis on Cloud.
This example covers the cases of both Terra platform and a custom cloud server running Cromwell. When reading through the tutorial, you may check out the corresponding part based on your working situation.
0. Prerequisite
0-a. Cromwell server
If you use a Cromwell server on Cloud, on your local machine, you need to install the corresponding Cloud SDK tool if not:
gcloud CLI if your Cloud bucket is on Google Cloud.
AWS CLI v2 if your Cloud bucket is on Amazon AWS Cloud.
And then install Altocumulus in your Python environment. This is the tool for data transfer between local machine and cloud bucket, as well as communication with the Cromwell server on cloud.
0-b. Terra Platform
If you use Terra, after registering on Terra and creating a workspace there, you’ll need the following information:
Terra workspace name. This is shown on your Terra workspace webpage, with format “<workspace-namespace>/<workspace-name>”. For example, if your Terra workspace has full name
ws-lab/ws-01, then ws-lab is the namespace and ws-01 is the workspace name winthin that namespace.The corresponding Google Cloud Bucket of your Terra workspace. You can check it under “Google Bucket” title on the right panel of your Terra workspace’s Dashboard tab. The bucket name associated with your workspace starts with
fc-followed by a sequence of heximal numbers. For example,gs://fc-e0000000, where “gs://” is the header of GS URI.
Besides, install gcloud CLI and Altocumulus on your local machine for data uploading. These tools will be used for data transfer between local machine and Cloud bucket.
Alternatively, you can also use Terra web UI for job submission instead of command-line submission. This will be discussed in Section Run Analysis with Terra Web UI below.
1. Extract Gene-Count Matrices
This phase is to extract gene-count matrices from sequencing output.
There are two cases: (1) from BCL data, which includes mkfastq step to generate FASTQ files and count step to generate gene-count matrices; (2) from FASTQ files, which only runs the count step.
1-a. Extract Genen-Count Matrices from BCL data
This section covers the case starting from BCL data.
Step 1. Sample Sheet Preparation
First, prepare a feature index file for your dataset. Say its filename is antibody_index.csv, which has format “feature_barcode, feature_name,feature_type”. See an example below:
TTCCTGCCATTACTA,HTO_1,hashing
CCGTACCTCATTGTT,HTO_2,hashing
GGTAGATGTCCTCAG,HTO_3,hashing
TGGTGTCATTCTTGA,Ab1,citeseq
CTCATTGTAACTCCT,Ab2,citeseq
GCGCAACTTGATGAT,Ab3,citeseq
... ...
where each line contains the barcode and the name of a Hashing/CITE-Seq index: hashing indicates a Cell/Nucleus-Hashing index, while citeseq indicates a CITE-Seq index.
Next, create a sample sheet cellranger_sample_sheet.csv for Cell Ranger processing on your local machine. Below is an example:
Sample,Reference,Flowcell,Lane,Index,Chemistry,DataType,FeatureBarcodeFile
sample_control,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A1,fiveprime,rna
sample_gex,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-TT-A2,fiveprime,rna
sample_cell_hashing,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-NN-A1,fiveprime,hashing,/path/to/antibody_index.csv
sample_cite_seq,GRCh38-2020-A,/path/to/flowcell/folder,*,SI-NN-A2,fiveprime,citeseq,/path/to/antibody_index.csv
where
GRCh38-2020-Ais the is the Human GRCh38 (GENCODE v32/Ensembl 98) genome reference prebuilt by Cumulus. See Cumulus single-cell genome reference list for a complete list of genome references.
/path/to/flowcell/foldershould be replaced by the actual local path to the BCL folder of your sequencing data.
/path/to/antibody_index.csvshould be replaced by the actual local path toantibody_index.csvfile we just created above.
rna,hashingandciteseqrefer to gene expression data, cell/nucleus-hashing data, and CITE-Seq data, respectively.Samples of type
rnado not need any feature barcode file for indexing.
For the details on how to prepare this sample sheet, please refer to Step 3 of Cell Ranger sample sheet instruction.
Step 2. Workflow Input Preparation
Now prepare a JSON file for cellranger_workflow WDL workflow input on your local machine (say named cellranger_inputs.json):
{
"cellranger_workflow.input_csv_file": "/path/to/cellranger_sample_sheet.csv",
"cellranger_workflow.output_directory": "gs://my-bucket/cellranger_output"
}
where
/path/to/cellranger_sample_sheet.csvshould be replaced by the actual local path to your sample sheet created above.
gs://my-bucket/cellranger_outputis the target folder on Google bucket to store your result when the workflow job is finished, wheremy-bucketshould be replaced by your own Google bucket name.
For details on the all the workflow inputs of cellranger_workflow, please refer to Cell Ranger workflow inputs.
Step 3. Job Submission
Now we are ready to submit a job to cloud for computing:
If you use a Cromwell server on cloud, run the following Altocumulus command:
alto cromwell run -s <server-address> -p <port-number> -m broadinstitute:cumulus:cellranger -i /path/to/cellranger_inputs.json -o cellranger_inputs_updated.json -b gs://my-bucket/data_source
where
-sspecifies the server’s IP address (or hostname), where<server-address>should be replaced by the actual IP address (or hostname).
-pspecifies the server’s port number for Cromwell, where<port-number>should be replaced by the actual port number.
-mspecifies which WDL workflow to use. You should use the Dockstore name of Cumulus cellranger_workflow. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:cellranger:masterto use its development version in master branch.
-ispecifies the workflow input JSON file.
-oand-bare used when the input data (which are specified in the workflow input JSON file and sample sheet CSV file) are local and need to be uploaded to Cloud bucket first.
-ospecifies the updated workflow input JSON file after uploading the input data, with all the local paths updated to Cloud bucket URIs. This is useful when resubmitting jobs running the same input data, without uploading the same input data again.
-bspecifies which folder on Cloud bucket to upload the local input data, wheremy-bucketshould be replaced by your own Google bucket name. Feel free to choose the folder name other thandata_source.
Notice that -o and -b options can be dropped if all of your input data are already on Cloud bucket.
After submission, you’ll get the job’s ID for tracking its status:
alto cromwell check_status -s <server-address> -p <port-number> --id <your-job-ID>
where <your-job-ID> should be replaced by the actual Cromwell job ID.
If you use Terra, run the following Altocumulus command:
alto terra run -m broadinstitute:cumulus:cellranger -w ws-lab/ws-01 --bucket-folder data_source -i /path/to/cellranger_inputs.json -o cellranger_inputs_updated.json
where
-mspecifies which WDL workflow to use. You should use the Dockstore name of Cumulus cellranger_workflow. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:cellranger:masterto use its development version in master branch.
-wspecifies the Terra workspace full name to use, wherews-lab/ws-01should be replaced by your own Terra workspace full name.
--bucket-folderspecifies the folder name on the Google bucket associated with the Terra workspace to store the uploaded data. Feel free to choose folder name other thandata_source.
-ispecifies the workflow input JSON file, where/path/to/cellranger_inputs.jsonshould be replaced by the actual local path tocellranger_inputs.jsonfile.
-ospecifies the updated workflow input JSON file after uploading the input data, with all the local paths updated to Cloud bucket URIs. This is useful when resubmitting jobs running the same input data, without uploading the same input data again.
Notice that --bucket-folder and -o options can be dropped if all of your input data are already on Cloud bucket.
After submission, you can check the job’s status in the Job History tab of your Terra workspace page.
When the job is done, you’ll get results in gs://my-bucket/cellranger_output, which is specified in cellranger_inputs.json above. It should contain 4 subfolders, each of which is associated with one sample specified in cellranger_sample_sheet.csv above.
For the next phases, you’ll need 3 files from the output:
RNA count matrix of the sample group of interest:
gs://my-bucket/cellranger_output/sample_gex/raw_feature_bc_matrix.h5;Cell-Hashing Antibody count matrix:
gs://my-bucket/cellranger_output/sample_cell_hashing/sample_cell_hashing.csv;CITE-Seq Antibody count matrix:
gs://my-bucket/cellranger_output/sample_cite_seq/sample_cite_seq.csv.
1-b. Extract Gene-Cound Matrices from FASTQ files
This section covers the case starting from FASTQ files.
Similarly as above, First, prepare a feature index file for your dataset. Say its filename is antibody_index.csv, which has format “feature_barcode, feature_name,feature_type”. See an example below:
TTCCTGCCATTACTA,HTO_1,hashing
CCGTACCTCATTGTT,HTO_2,hashing
GGTAGATGTCCTCAG,HTO_3,hashing
TGGTGTCATTCTTGA,Ab1,citeseq
CTCATTGTAACTCCT,Ab2,citeseq
GCGCAACTTGATGAT,Ab3,citeseq
... ...
where each line contains the barcode and the name of a Hashing/CITE-Seq index: hashing indicates a Cell/Nucleus-Hashing index, while citeseq indicates a CITE-Seq index.
Next, create a sample sheet cellranger_sample_sheet.csv for Cell Ranger processing on your local machine. Below is an example:
Sample,Reference,Flowcell,Chemistry,DataType,FeatureBarcodeFile
sample_1_rna,GRCh38-2020-A,/path/to/fastq/gex,fiveprime,rna
sample_2_rna,GRCh38-2020-A,/path/to/fastq/gex,fiveprime,rna
sample_3_rna,GRCh38-2020-A,/path/to/fastq/gex,fiveprime,rna
sample_1_adt,GRCh38-2020-A,/path/to/fastq/hashing_citeseq,fiveprime,adt,/path/to/antibody_index.csv
sample_2_adt,GRCh38-2020-A,/path/to/fastq/hashing_citeseq,fiveprime,adt,/path/to/antibody_index.csv
sample_3_adt,GRCh38-2020-A,/path/to/fastq/hashing_citeseq,fiveprime,adt,/path/to/antibody_index.csv
where
GRCh38-2020-Ais the is the Human GRCh38 (GENCODE v32/Ensembl 98) genome reference prebuilt by Cumulus. See Cumulus single-cell genome reference list for a complete list of genome references.
/path/to/fastq/gexshould be replaced by the actual local path to the folder containing FASTQ files of RNA samples.
/path/to/fastq/hashing_citeseqshould be replaced by the actual local path to the folder containing FASTQ files of Cell/Nucleus-Hashing and CITE-Seq samples.
/path/to/antibody_index.csvshould be replaced by the actual local path toantibody_index.csvfile we just created above.
rnaandadtrefer to gene expression data and antibody data, respectively. In specific,adtcovers bothciteseqandhashingtypes, i.e. it includes both Hashing and CITE-Seq data types.Samples of type
rnado not need any feature barcode file for indexing.Columns Lane and Index are not needed if starting from FASTQ files, as mkfastq step will be skipped.
For the details on how to prepare this sample sheet, please refer to Step 3 of Cell Ranger sample sheet instruction.
Now prepare a JSON file for cellranger_workflow WDL workflow input on your local machine (say named cellranger_inputs.json):
{
"cellranger_workflow.input_csv_file": "/path/to/cellranger_sample_sheet.csv",
"cellranger_workflow.output_directory": "gs://my-bucket/cellranger_output",
"cellranger_workflow.run_mkfastq": false
}
where
/path/to/cellranger_sample_sheet.csvshould be replaced by the actual local path to your sample sheet created above.
gs://my-bucket/cellranger_outputis the target folder on Google bucket to store your result when the workflow job is finished, wheremy-bucketshould be replaced by your own Google bucket name.Set run_mkfastq to
falseto skip the mkfastq step, as we start from FASTQ files.
For details on the all the workflow inputs of cellranger_workflow, please refer to Cell Ranger workflow inputs.
Now we are ready to submit a job to cloud for computing. Follow instructions in Section 1-a above.
When finished, you’ll get results in gs://my-bucket/cellranger_output, which is specified in cellranger_inputs.json above. It should contain 6 subfolders, each of which is associated with one sample specified in cellranger_sample_sheet.csv above.
In specific, for each adt type sample, there are both count matrix of Hashing data and that of CITE-Seq data generated inside its corresponding subfolder, with filename suffix .hashing.csv and .citeseq.csv, respectively.
2. Demultiplex Cell-Hashing Data using DemuxEM
Run Workflow on Cloud
Next, we need to demultiplex the resulting RNA gene-count matrices. We use DemuxEM method in this example.
To be brief, we use the output of Section 1-a for illustration:
On your local machine, prepare a CSV-format sample sheet
demux_sample_sheet.csvwith the following content:OUTNAME,RNA,TagFile,TYPE exp,gs://my-bucket/cellranger_output/sample_gex/raw_feature_bc_matrix.h5,gs://my-bucket/cellranger_output/sample_cell_hashing/sample_cell_hashing.csv,cell-hashingwhere OUTNAME specifies the subfolder and file names of output, which is free to be changed, RNA and TagFile columns specify the RNA and hashing tag meta-data of samples, and TYPE is
cell-hashingfor this phase.On your local machine, also prepare an input JSON file
demux_inputs.jsonfor demultiplexing WDL workflow,demux_inputs.jsonwith the following content:{ "demultiplexing.input_sample_sheet" : "/path/to/demux_sample_sheet.csv", "demultiplexing.output_directory" : "gs://my-bucket/demux_output" }where
/path/to/demux_sample_sheet.csvshould be replaced by the actual local path todemux_sample_sheet.csvcreated above.For the details on these options, please refer to demultiplexing workflow inputs.
Submit a demultiplexing job with
demux_inputs.jsoninput above to cloud for execution.
For job submission:
If you use a Cromwell server on cloud, run the following Altocumulus command on your local machine:
alto cromwell run -s <server-address> -p <port-number> -m broadinstitute:cumulus:demultiplexing -i /path/to/demux_inputs.json -o demux_inputs_updated.json -b gs://my-bucket/data_source
where
broadinstitute:cumulus:demultiplexingrefers to demultiplexing WDL workflow published on Dockstore. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:demultiplexing:masterto use its development version in master branch.
/path/to/demux_inputs.jsonshould be replaced by the actual local path todemux_inputs.jsoncreated above.Replace
my-bucketin-boption by your own Google bucket name, and feel free to choose folder name other thandata_sourcefor uploading.We still need
-oand-boptions becausedemux_sample_sheet.csvis on the local machine.
Similarly, when the submission succeeds, you’ll get another job ID for demultiplexing. You can use it to track the job status.
If you use Terra, run the following Altocumulus command:
alto terra run -m broadinstitute:cumulus:demultiplexing -w ws-lab/ws-01 --bucket-folder data_source -i /path/to/demux_inputs.json -o demux_inputs_updated.json
where
broadinstitute:cumulus:demultiplexingrefers to demultiplexing WDL workflow published on Dockstore. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:demultiplexing:masterto use its development version in master branch.
/path/to/demux_inputs.jsonshould be replaced by the actual local path todemux_inputs.jsoncreated above.
ws-lab/ws-01should be replaced by your own Terra workspace full name.
--bucket-folder: Feel free to choose folder name other thandata_sourcefor uploading.We still need
-oand--bucket-folderoptions becausedemux_sample_sheet.csvis on the local machine.
After submission, you can check the job’s status in the Job History tab of your Terra workspace page.
When finished, demultiplexing results are in gs://my-bucket/demux_output/exp folder, with the following important output files:
exp_demux.zarr.zip: Demultiplexed RNA raw count matrix. This will be used for downstram analysis.
exp.out.demuxEM.zarr.zip: This file contains intermediate results for both RNA and hashing count matrices, which is useful for compare with other demultiplexing methods.DemuxEM plots in PDF format. They are used for evaluating the performance of DemuxEM on the data.
(Optional) Extract Demultiplexing results
This is performed on your local machine with demultiplexing results downloaded from cloud to your machine.
To download the demultiplexed count matrix exp_demux.zarr.zip, you can either do it in Google cloud console, or using gsutil in command line:
gsutil -m cp gs://my-bucket/demux_output/exp/exp_demux.zarr.zip .
After that, in your Python environment, install Pegasus package, and follow the steps below to extract the demultiplexing results:
Load Libraries:
import numpy as np import pandas as pd import pegasus as pg import matplotlib.pyplot as plt import seaborn as snsLoad demuxEM output. For demuxEM, load RNA expression matrix with demultiplexed sample identities in Zarr format. These can be found in Google cloud console. QC 500 <= # of genes < 6000, % mito <= 10%:
data = pg.read_input('exp_demux.zarr.zip') pg.qc_metrics(data, min_genes=500, max_genes=6000, mito_prefix='Mt-', percent_mito=10) pg.filter_data(data)Demultiplexing results showing singlets, doublets and unknown:
data.obs['demux_type'].value_counts()Show assignments in singlets:
idx = data.obs['demux_type'] == 'singlet' data.obs.loc[idx, 'assignment'].value_counts()[0:10]Write assignment outputs to CSV:
data.obs[['demux_type', 'assignment']].to_csv('demux_exp.csv')
3. Data Analysis on CITE-Seq Data
In this phase, we merge RNA and ADT matrices for CITE-Seq data, and perform the downstream analysis.
To be brief, we use the CITE-Seq count matrix generated from Section 1-a and demultiplexing results in Section 2 for illustraion here:
On your local machine, prepare a CSV-format sample sheet
count_matrix.csvwith the following content:Sample,Location,Modality exp,gs://my-bucket/demux_output/exp/exp_demux.zarr.zip,rna exp,gs://my-bucket/cellranger_output/sample_cite_seq/sample_cite_seq.csv,citeseqThis sample sheet describes the metadata for each modality (as one row in the sheet):
Sample specifies the name of the modality, and all the modalities of the same sample should have one common name, as otherwise their count matrices won’t be aggregated together;
Location specifies the file location. For RNA data, this is the output of Phase 2; for CITE-Seq antibody data, it’s the output of Phase 1.
Modality specifies the modality type, which is either
rnafor RNA matrix, orciteseqfor CITE-Seq antibody matrix.On your local machine, also prepare a JSON file
cumulus_inputs.jsonfor cumulus WDL workflow, with the following content:{ "cumulus.input_file": "/path/to/count_matrix.csv", "cumulus.output_directory": "gs://my-bucket/cumulus_output", "cumulus.output_name": "exp_merged_out", "cumulus.select_only_singlets": true, "cumulus.run_louvain": true, "cumulus.run_umap": true, "cumulus.citeseq": true, "cumulus.citeseq_umap": true, "cumulus.citeseq_umap_exclude": "Mouse_IgG1,Mouse_IgG2a,Mouse_IgG2b,Rat_IgG2b", "cumulus.plot_composition": "louvain_labels:assignment", "cumulus.plot_umap": "louvain_labels,assignment", "cumulus.plot_citeseq_umap": "louvain_labels,assignment", "cumulus.cluster_labels": "louvain_labels", "cumulus.annotate_cluster": true, "cumulus.organism": "human_immune" }where
/path/to/count_matrix.csvshould be replaced by the actual local path tocount_matrix.csvcreated above.A typical Cumulus WDL pipeline consists of 4 steps, which is given here. For details on Cumulus workflow inputs above, please refer to cumulus inputs.
Submit a demultiplexing job with
cumulus_inputs.jsoninput above to cloud for execution.
For job submission:
If you use a Cromwell server on cloud, run the following Altocumulus command to submit the job:
alto cromwell run -s <server-address> -p <port-number> -m broadinstitute:cumulus:cumulus -i /path/to/cumulus_inputs.json -o cumulus_inputs_updated.json -b gs://my-bucket/data_source
where
broadinstitute:cumulus:cumulusrefers to cumulus WDL workflow published on Dockstore. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:cumulus:masterto use its development version in master branch.
/path/to/cumulus_inputs.jsonshould be replaced by the actual local path tocumulus_inputs.jsoncreated above.
my-bucketin-boption should be replaced by your own Google bucket name, and feel free to choose folder name other thandata_sourcefor uploading data.We still need
-oand-boptions becausecount_matrix.csvis on the local machine.
Similarly, when the submission succeeds, you’ll get another job ID for demultiplexing. You can use it to track the job status.
If you use Terra, run the following Altocumulus command:
alto terra run -m broadinstitute:cumulus:cumulus -w ws-lab/ws-01 --bucket-folder data_source -i /path/to/cumulus_inputs.json -o cumulus_inputs_updated.json
where
broadinstitute:cumulus:cumulusrefers to cumulus WDL workflow published on Dockstore. Here, the version is omitted, so that the default version will be used. Alternatively, you can explicitly specify which version to use, e.g.broadinstitute:cumulus:cumulus:masterto use its development version in master branch.
ws-lab/ws-01should be replaced by your own Terra workspace full name.
--bucket-folder: Feel free to choose folder name other thandata_sourcefor uploading data.
/path/to/cumulus_inputs.jsonshould be replaced by the actual local path tocumulus_inputs.jsoncreated above.We still need
-oand--bucket-folderoptions becausecount_matrix.csvis on the local machine.
After submission, you can check the job’s status in the Job History tab of your Terra workspace page.
When finished, all the output files are in gs://my-bucket/cumulus_output folder, with the following important files:
exp_merged_out.aggr.zarr.zip: The ZARR format file containing both the aggregated count matrix in<genome>-rnamodality, as well as CITE-Seq antibody count matrix in<genome>-citeseqmodality, where<genome>is the genome reference name of your count matrices, e.g. GRCh38-2020-A.
exp_merged_out.zarr.zip: The ZARR format file containing the analysis results in<genome>-rnamodality, and CITE-Seq antibody count matrix in<genome>-citeseqmodality.
exp_merged_out.<genome>-rna.h5ad: The processed RNA matrix data in H5AD format.
exp_merged_out.<genome>-rna.filt.xlsx: The Quality-Control (QC) summary of the raw data.
exp_merged_out.<genome>-rna.filt.{UMI, gene, mito}.pdf: The QC plots of the raw data.
exp_merged_out.<genome>-rna.de.xlsx: Differential Expression analysis result.
exp_merged_out.<genome>-rna.anno.txt: The putative cell type annotation output.
exp_merged_out.<genome>-rna.umap.pdf: UMAP plot.
exp_merged_out.<genome>-rna.citeseq.umap.pdf: CITE-Seq UMAP plot.
exp_merged_out.<genome>-rna.louvain_labels.assignment.composition.pdf: Composition plot.
Run Analysis with Terra Web UI
For Terra users, instead of using Altocumulus to submit jobs in command line, they can also use the Terra web UI.
First, upload the local BCL data or FASTQ files to the Google bucket associated with your Terra workspace (say gs://fc-e0000000) using gsutil:
gsutil -m cp -r /path/to/your/data/folder gs://fc-e0000000/data_source/
where /path/to/your/data/folder should be replaced by the actual local path to your data folder, and data_source is the folder on Google bucket to store the uploaded data.
Then for each of the 3 phases above:
When preparing the sample sheet, remember to replace all the local paths by the GS URIs of the corresponding folders/files that you uploaded to Google bucket. Then upload it to Google bucket as well:
gsutil cp /path/to/sample/sheet gs://fc-e0000000/data_source/where
/path/to/sample/sheetshould be replaced by the actual local path to your sample sheet. Notice that for Phase 1,antibody_index.csvfile should also be uploaded to Google bucket, and its references in the sample sheet must be replaced by its GS URI.When preparing the workflow input JSON file, change the field of sample sheet to its GS URI on cloud.
Import the corresponding WDL workflow to your Terra workspace by following steps in import workflows tutorial.
In the workflow page (Workspace -> Workflows -> your WDL workflow), upload your input JSON file by clicking the “upload json” button:
Click “SAVE” button to save the configuration, and click “RUN ANALYSIS” button to submit the job:
You can check the job’s status in the Job History tab of your Terra workspace page.