Topic modeling
Prepare input data
Follow the steps below to run topic_modeling on Terra.

1. Prepare your count matrix. Cumulus currently supports the following formats: ‘zarr’, ‘h5ad’, ‘loom’, ‘10x’, ‘mtx’, ‘csv’, ‘tsv’ and ‘fcs’ (for flow/mass cytometry data).

2. Upload your count matrix to the workspace. Example:

       gsutil cp /foo/bar/projects/dataset.h5ad gs://fc-e0000000-0000-0000-0000-000000000000/

   where /foo/bar/projects/dataset.h5ad is the path to your dataset on your local machine, and gs://fc-e0000000-0000-0000-0000-000000000000/ is the Google bucket destination.

3. Import the topic_modeling workflow to your workspace. See the Terra documentation for adding a workflow. The Cumulus workflow is in the Broad Methods Repository under the name “cumulus/topic_modeling”. On the workflow page, click the Export to Workspace... button and select the workspace to which you want to export the topic_modeling workflow in the drop-down menu.

4. In your workspace, open topic_modeling in the WORKFLOWS tab. Select Run workflow with inputs defined by file paths and click the SAVE button.
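
Terra's workflow configuration page can also take its inputs from an uploaded JSON file. The sketch below shows one way to assemble such a file in Python from the input names listed under Workflow input. The `topic_modeling.<name>` key prefix is an assumption based on the usual WDL naming convention, and `build_inputs` is a hypothetical helper, not part of Cumulus; check the keys against the ones Terra shows for your imported workflow.

```python
import json

# Assumption: input keys follow the WDL "<workflow>.<input>" convention.
WORKFLOW = "topic_modeling"


def build_inputs(input_file, number_of_topics, **optional):
    """Return workflow inputs keyed as '<workflow>.<name>' (hypothetical helper)."""
    if not input_file.startswith("gs://"):
        raise ValueError("input_file must be a Google bucket URL (gs://...)")
    if not number_of_topics:
        raise ValueError("number_of_topics must list at least one value")
    inputs = {
        "input_file": input_file,
        "number_of_topics": list(number_of_topics),
    }
    # Optional inputs are passed through unchanged.
    inputs.update(optional)
    return {f"{WORKFLOW}.{name}": value for name, value in inputs.items()}


inputs = build_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad",
    [10, 15, 20],
    prefix_exclude="mt-,Rpl,Rps",
    min_percent_expressed=2,
    max_percent_expressed=98,
    random_number_seed=0,
)

# Serialize; write this string to a .json file to upload it.
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Keeping the inputs in a file like this makes a run easier to repeat than re-entering values in the web form.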
Workflow input
Inputs for the topic_modeling workflow are described below. Required inputs are in bold.
Name | Description | Example | Default |
---|---|---|---|
input_file | Google bucket URL of the input count matrix. | “gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad” | |
number_of_topics | Array of topic numbers to fit (one model per value). | [10,15,20] | |
prefix_exclude | Comma-separated list of prefixes; features whose names start with any of them are excluded. | “mt-,Rpl,Rps” | “mt-,Rpl,Rps” |
min_percent_expressed | Exclude features expressed below min_percent. | 2 | |
max_percent_expressed | Exclude features expressed above max_percent. | 98 | |
random_number_seed | Random number seed for reproducibility. | 0 | 0 |
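
To make the three feature filters in the table concrete, here is a minimal sketch of how prefix_exclude, min_percent_expressed and max_percent_expressed could be applied to a small count matrix. It illustrates the parameter semantics only and is not the Cumulus implementation. It assumes that "percent expressed" means the percentage of cells with a non-zero count and that prefix matching is case-sensitive; `select_features`, the gene names and the counts are made up for illustration.

```python
def select_features(counts, feature_names, prefix_exclude="mt-,Rpl,Rps",
                    min_percent_expressed=None, max_percent_expressed=None):
    """Return the feature names that survive the three filters.

    counts: list of rows, one per cell; each row holds one count per feature.
    Assumption: a feature is "expressed" in a cell when its count is > 0.
    """
    prefixes = tuple(p for p in prefix_exclude.split(",") if p)
    n_cells = len(counts)
    kept = []
    for j, name in enumerate(feature_names):
        # Drop features whose name starts with an excluded prefix.
        if prefixes and name.startswith(prefixes):
            continue
        percent = 100.0 * sum(1 for row in counts if row[j] > 0) / n_cells
        # Drop features expressed in too few or too many cells.
        if min_percent_expressed is not None and percent < min_percent_expressed:
            continue
        if max_percent_expressed is not None and percent > max_percent_expressed:
            continue
        kept.append(name)
    return kept


# Toy data: 4 cells x 5 features (all values are illustrative).
feature_names = ["mt-Co1", "Rpl3", "Actb", "Gene1", "Gene2"]
counts = [
    [5, 2, 3, 0, 1],
    [1, 0, 4, 0, 0],
    [2, 1, 1, 0, 0],
    [0, 3, 2, 0, 1],
]

kept = select_features(counts, feature_names,
                       min_percent_expressed=2, max_percent_expressed=98)
print(kept)  # ['Gene2']
```

In this toy matrix the first two features are removed by prefix, Actb is removed because it is expressed in every cell, Gene1 because it is expressed in none, and only Gene2 (expressed in half of the cells) remains.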
Workflow output
Name | Type | Description |
---|---|---|
coherence_plot | File | Plot of coherence scores vs. number of topics |
perplexity_plot | File | Plot of perplexity values vs. number of topics |
cell_scores | Array[File] | Topic by cells (one file for each topic number) |
feature_topics | Array[File] | Topic by features (one file for each topic number) |
report | Array[File] | HTML visualization report (one file for each topic number) |
stats | Array[File] | Computed coherence and perplexity (one file for each topic number) |
model | Array[File] | Serialized LDA model (one file for each topic number) |
corpus | File | Serialized corpus |
dictionary | File | Serialized dictionary |
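
Because the workflow fits one model per entry in number_of_topics, the coherence and perplexity outputs are typically used to choose among them. A common heuristic is to prefer the topic number with the highest coherence. The sketch below applies that heuristic to a plain dictionary. The format of the stats files is not described here, so the scores are made-up placeholders rather than real output, and `best_number_of_topics` is a hypothetical helper.

```python
def best_number_of_topics(coherence_by_k):
    """Return the topic number with the highest coherence score.

    Ties are broken in favor of the smaller number of topics.
    """
    if not coherence_by_k:
        raise ValueError("no coherence scores given")
    return min(coherence_by_k, key=lambda k: (-coherence_by_k[k], k))


# Placeholder scores for illustration only; replace them with the
# values reported in your own stats output.
placeholder_scores = {10: 0.41, 15: 0.47, 20: 0.44}

best_k = best_number_of_topics(placeholder_scores)
print(best_k)  # 15
```

Coherence alone is only a guide; it is worth checking the perplexity plot and the HTML report for the chosen topic number as well.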