Topic modeling
Prepare input data
Follow the steps below to run topic_modeling on Terra.
1. Prepare your count matrix. Cumulus currently supports the following formats: ‘zarr’, ‘h5ad’, ‘loom’, ‘10x’, ‘mtx’, ‘csv’, ‘tsv’ and ‘fcs’ (for flow/mass cytometry data).

2. Upload your count matrix to the workspace.

   Example:

   gsutil cp /foo/bar/projects/dataset.h5ad gs://fc-e0000000-0000-0000-0000-000000000000/

   where /foo/bar/projects/dataset.h5ad is the path to your dataset on your local machine, and gs://fc-e0000000-0000-0000-0000-000000000000/ is the Google bucket destination.

3. Import the topic_modeling workflow to your workspace.

   See the Terra documentation for adding a workflow. The cumulus workflow is under Broad Methods Repository with name “cumulus/topic_modeling”.

   On the workflow page, click the Export to Workspace... button, and select the workspace to which you want to export the topic_modeling workflow in the drop-down menu.

4. In your workspace, open topic_modeling in the WORKFLOWS tab. Select Run workflow with inputs defined by file paths, and click the SAVE button.
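The upload step can also be scripted. The following is a minimal sketch, not part of Cumulus: the helper names `build_upload_command` and `uploaded_url` are hypothetical, and the paths are the placeholder values from the example above. It only assembles the `gsutil cp` command and derives the bucket URL the file should have afterwards, which is the value the workflow's input_file input expects; it does not run the upload itself.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def build_upload_command(local_path: str, bucket_url: str) -> List[str]:
    """Return the argument list for `gsutil cp <local_path> <bucket_url>`."""
    if not bucket_url.startswith("gs://"):
        raise ValueError(f"not a Google bucket URL: {bucket_url!r}")
    return ["gsutil", "cp", local_path, bucket_url]


def uploaded_url(local_path: str, bucket_url: str) -> str:
    """URL of the file after copying it into the bucket (or a folder in it)."""
    return bucket_url.rstrip("/") + "/" + PurePosixPath(local_path).name


# Placeholder paths taken from the example above.
local = "/foo/bar/projects/dataset.h5ad"
bucket = "gs://fc-e0000000-0000-0000-0000-000000000000/"

cmd = build_upload_command(local, bucket)
print(shlex.join(cmd))              # the command to run in a terminal
print(uploaded_url(local, bucket))  # value to use for input_file
```

To actually perform the upload from Python, the argument list could be passed to `subprocess.run(cmd, check=True)`, assuming `gsutil` is installed and authenticated on the local machine.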
Workflow input
Inputs for the topic_modeling workflow are described below. Required inputs are in bold.
Name | Description | Example | Default |
---|---|---|---|
input_file | Google bucket URL of the input count matrix. | “gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad” | |
number_of_topics | Array of topic numbers to try (outputs are produced for each value). | [10,15,20] | |
prefix_exclude | Comma-separated list of prefixes; features whose names start with any of them are excluded. | “mt-,Rpl,Rps” | “mt-,Rpl,Rps” |
min_percent_expressed | Exclude features expressed below min_percent_expressed. | 2 | |
max_percent_expressed | Exclude features expressed above max_percent_expressed. | 98 | |
random_number_seed | Random number seed for reproducibility. | 0 | 0 |
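As an illustration of how these inputs fit together, the sketch below builds an inputs JSON from the example values in the table. It rests on several assumptions: that input keys follow the usual WDL convention of `<workflow name>.<input name>` with `topic_modeling` as the workflow name, that input_file and number_of_topics are the required inputs, and that min_percent_expressed should be smaller than max_percent_expressed. The helper `make_inputs` is hypothetical and not part of Cumulus.

```python
import json

WORKFLOW = "topic_modeling"  # assumed workflow name used as the key prefix


def make_inputs(input_file, number_of_topics, **optional):
    """Assemble a WDL-style inputs dictionary for the workflow (sketch)."""
    if not input_file.startswith("gs://"):
        raise ValueError("input_file must be a Google bucket URL")
    if not number_of_topics:
        raise ValueError("number_of_topics must list at least one value")

    inputs = {
        "input_file": input_file,
        "number_of_topics": list(number_of_topics),
    }
    inputs.update(optional)

    # Assumed sanity check: the lower bound should sit below the upper bound.
    low = inputs.get("min_percent_expressed")
    high = inputs.get("max_percent_expressed")
    if low is not None and high is not None and low >= high:
        raise ValueError("min_percent_expressed must be below max_percent_expressed")

    # Prefix every input name with the workflow name.
    return {f"{WORKFLOW}.{name}": value for name, value in inputs.items()}


# Example values copied from the table above.
inputs = make_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad",
    [10, 15, 20],
    prefix_exclude="mt-,Rpl,Rps",
    min_percent_expressed=2,
    max_percent_expressed=98,
    random_number_seed=0,
)
print(json.dumps(inputs, indent=2))
```

The printed JSON could be saved to a file and uploaded on the workflow's inputs page in Terra, provided the key names match those shown there.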
Workflow output
Name | Type | Description |
---|---|---|
coherence_plot | File | Plot of coherence scores vs. number of topics |
perplexity_plot | File | Plot of perplexity values vs. number of topics |
cell_scores | Array[File] | Topic by cells (one file for each topic number) |
feature_topics | Array[File] | Topic by features (one file for each topic number) |
report | Array[File] | HTML visualization report (one file for each topic number) |
stats | Array[File] | Computed coherence and perplexity (one file for each topic number) |
model | Array[File] | Serialized LDA model (one file for each topic number) |
corpus | File | Serialized corpus |
dictionary | File | Serialized dictionary |
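Because coherence and perplexity are reported for each topic number, they can be used to compare the fitted models. The sketch below shows one possible selection rule, assuming that higher coherence is better and that the reported value is a true perplexity, where lower is better. The function `pick_number_of_topics` is hypothetical, the layout of the stats files is not specified here, and the numbers are made-up placeholders for illustration rather than real workflow results.

```python
from typing import Dict, Tuple


def pick_number_of_topics(stats: Dict[int, Tuple[float, float]]) -> int:
    """Pick a topic number from {number_of_topics: (coherence, perplexity)}.

    Highest coherence wins; ties are broken by lower perplexity,
    then by the smaller number of topics.
    """
    if not stats:
        raise ValueError("stats is empty")
    return min(stats, key=lambda k: (-stats[k][0], stats[k][1], k))


# Placeholder values, NOT real results: {number_of_topics: (coherence, perplexity)}
example_stats = {
    10: (0.41, 950.0),
    15: (0.47, 900.0),
    20: (0.47, 880.0),
}

best = pick_number_of_topics(example_stats)
print(best)
```

A rule like this is only a starting point; the coherence and perplexity plots and the HTML reports are worth inspecting before settling on a topic number.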