Topic modeling

Prepare input data

Follow the steps below to run topic_modeling on Terra.

Prepare your count matrix. Cumulus currently supports the following formats: ‘zarr’, ‘h5ad’, ‘loom’, ‘10x’, ‘mtx’, ‘csv’, ‘tsv’ and ‘fcs’ (for flow/mass cytometry data).
Upload your count matrix to the workspace.
Example:
gsutil cp /foo/bar/projects/dataset.h5ad gs://fc-e0000000-0000-0000-0000-000000000000/
where /foo/bar/projects/dataset.h5ad is the path to your dataset on your local machine, and gs://fc-e0000000-0000-0000-0000-000000000000/ is the Google bucket destination.
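The upload step can also be scripted. The following is a minimal sketch, not part of Cumulus: the helper name plan_upload is hypothetical, and it only assembles the gsutil arguments and the resulting object URL without running anything. It relies on gsutil keeping the file's base name when the destination ends with a slash.

```python
from pathlib import PurePosixPath


def plan_upload(local_path: str, bucket_url: str) -> tuple[list[str], str]:
    """Return the gsutil argument list and the URL the uploaded file will have.

    Hypothetical helper for illustration; it does not call gsutil itself.
    """
    if not bucket_url.startswith("gs://"):
        raise ValueError(f"not a Google bucket URL: {bucket_url!r}")
    # Normalize to exactly one trailing slash so the file lands inside the bucket.
    destination = bucket_url.rstrip("/") + "/"
    command = ["gsutil", "cp", local_path, destination]
    # With a destination ending in "/", the object keeps the file's base name.
    object_url = destination + PurePosixPath(local_path).name
    return command, object_url


command, input_file = plan_upload(
    "/foo/bar/projects/dataset.h5ad",
    "gs://fc-e0000000-0000-0000-0000-000000000000/",
)
print(" ".join(command))
print(input_file)
```

The returned object URL is the value to supply later as input_file. To perform the upload from Python, the argument list could be passed to subprocess.run(command, check=True), assuming gsutil is installed and authenticated.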
Import the topic_modeling workflow to your workspace.

See the Terra documentation for adding a workflow. The workflow is listed in the Broad Methods Repository under the name “cumulus/topic_modeling”.

Then, on the workflow page, click the Export to Workspace... button and, in the drop-down menu, select the workspace to which you want to export the topic_modeling workflow.
In your workspace, open topic_modeling in the WORKFLOWS tab. Select Run workflow with inputs defined by file paths, and click the SAVE button.

Inputs for the topic_modeling workflow are described below. Required inputs are in bold.

Name	Description	Example	Default
input_file	Google bucket URL of the input count matrix.	“gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad”
number_of_topics	Array of number of topics.	[10,15,20]
prefix_exclude	Comma-separated list of prefixes; features whose names start with any of them are excluded.	“mt-,Rpl,Rps”	“mt-,Rpl,Rps”
min_percent_expressed	Exclude features expressed below min_percent.	2
max_percent_expressed	Exclude features expressed above max_percent.	98
random_number_seed	Random number seed for reproducibility.	0	0
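
The inputs in the table can be collected into a JSON document before they are entered in Terra. The following is a minimal sketch under stated assumptions: the helper build_inputs is hypothetical, the key prefix "topic_modeling" assumes the WDL convention of qualifying each input with the workflow name, and the checks (positive topic counts, minimum below maximum) are illustrative rather than rules taken from Cumulus.

```python
import json


def build_inputs(
    input_file,
    number_of_topics,
    prefix_exclude="mt-,Rpl,Rps",
    min_percent_expressed=None,
    max_percent_expressed=None,
    random_number_seed=0,
    workflow_name="topic_modeling",  # assumed workflow name used as key prefix
):
    """Validate the values and return a dict ready to dump as inputs JSON."""
    if not input_file.startswith("gs://"):
        raise ValueError("input_file must be a Google bucket URL")
    topics = list(number_of_topics)
    if not topics or any(not isinstance(k, int) or k < 1 for k in topics):
        raise ValueError("number_of_topics must be a non-empty list of positive integers")
    if (
        min_percent_expressed is not None
        and max_percent_expressed is not None
        and min_percent_expressed >= max_percent_expressed
    ):
        raise ValueError("min_percent_expressed must be below max_percent_expressed")

    values = {
        "input_file": input_file,
        "number_of_topics": topics,
        "prefix_exclude": prefix_exclude,
        "random_number_seed": random_number_seed,
    }
    # Optional thresholds are left out entirely when they are not set.
    if min_percent_expressed is not None:
        values["min_percent_expressed"] = min_percent_expressed
    if max_percent_expressed is not None:
        values["max_percent_expressed"] = max_percent_expressed
    return {f"{workflow_name}.{name}": value for name, value in values.items()}


inputs = build_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad",
    [10, 15, 20],
    min_percent_expressed=2,
    max_percent_expressed=98,
)
print(json.dumps(inputs, indent=2))
```

Validating before submission catches simple mistakes, such as swapped thresholds, without waiting for a workflow run to fail.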

Outputs of the topic_modeling workflow are described below.

Name	Type	Description
coherence_plot	File	Plot of coherence scores vs. number of topics
perplexity_plot	File	Plot of perplexity values vs. number of topics
cell_scores	Array[File]	Topic by cells (one file for each topic number)
feature_topics	Array[File]	Topic by features (one file for each topic number)
report	Array[File]	HTML visualization report (one file for each topic number)
stats	Array[File]	Computed coherence and perplexity (one file for each topic number)
model	Array[File]	Serialized LDA model (one file for each topic number)
corpus	File	Serialized corpus
dictionary	File	Serialized dictionary
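
Because the workflow reports coherence for each requested number of topics, one common way to use these outputs is to prefer the run with the highest coherence. The following is a minimal sketch of that comparison. The layout of the stats files is not described here, so the helper (a hypothetical name) takes already-parsed values, and the scores in the example are made up for illustration.

```python
def pick_number_of_topics(coherence_by_topics: dict[int, float]) -> int:
    """Return the number of topics with the highest coherence score.

    Ties are broken in favor of the smaller number of topics.
    """
    if not coherence_by_topics:
        raise ValueError("no coherence scores given")
    return min(
        coherence_by_topics,
        key=lambda k: (-coherence_by_topics[k], k),
    )


# Hypothetical scores for illustration only; these are not real workflow output.
example_scores = {10: 0.41, 15: 0.47, 20: 0.47}
best = pick_number_of_topics(example_scores)
print(best)
```

Coherence is only one criterion; the perplexity plot and the HTML reports are worth inspecting before settling on a final number of topics.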