Topic modeling
Prepare input data
Follow the steps below to run topic_modeling on Terra.
Prepare your count matrix. Cumulus currently supports the following formats: ‘zarr’, ‘h5ad’, ‘loom’, ‘10x’, ‘mtx’, ‘csv’, ‘tsv’ and ‘fcs’ (for flow/mass cytometry data).
Upload your count matrix to the workspace.
Example:
    gsutil cp /foo/bar/projects/dataset.h5ad gs://fc-e0000000-0000-0000-0000-000000000000/
where /foo/bar/projects/dataset.h5ad is the path to your dataset on your local machine, and gs://fc-e0000000-0000-0000-0000-000000000000/ is the Google bucket destination.
Import the topic_modeling workflow to your workspace.
See the Terra documentation for adding a workflow. The cumulus workflow is under Broad Methods Repository with name “cumulus/topic_modeling”.
Then, on the workflow page, click the “Export to Workspace...” button and select the workspace to which you want to export the topic_modeling workflow in the drop-down menu.
In your workspace, open topic_modeling in the WORKFLOWS tab. Select “Run workflow with inputs defined by file paths” and click the SAVE button.
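If you prefer to prepare the inputs as a JSON file rather than typing them into the form, the sketch below shows one way to assemble such a file. It assumes the workflow is named topic_modeling, so keys follow the usual `<workflow>.<input>` pattern, and it treats input_file and number_of_topics as the two inputs that must always be set. The helper `build_inputs` is hypothetical and not part of Cumulus.

```python
import json

# Assumption: the workflow is named "topic_modeling", so input keys
# take the form "topic_modeling.<input name>".
WORKFLOW = "topic_modeling"


def build_inputs(input_file, number_of_topics, **optional):
    """Return workflow inputs keyed in '<workflow>.<name>' form."""
    if not input_file.startswith("gs://"):
        raise ValueError("input_file should be a Google bucket URL (gs://...)")
    if not number_of_topics:
        raise ValueError("number_of_topics needs at least one value")
    inputs = {
        "input_file": input_file,
        "number_of_topics": list(number_of_topics),
    }
    # Optional inputs (e.g. random_number_seed) are passed through unchanged.
    inputs.update(optional)
    return {f"{WORKFLOW}.{name}": value for name, value in inputs.items()}


inputs = build_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/dataset.h5ad",
    [10, 15, 20],
    random_number_seed=0,
)
print(json.dumps(inputs, indent=2))
```

Check the printed key names against the input names shown on the workflow page in your workspace before relying on the file.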
Workflow input
Inputs for the topic_modeling workflow are described below. Required inputs are in bold.
| Name | Description | Example | Default |
|---|---|---|---|
| input_file | Google bucket URL of the input count matrix. | “gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad” | |
| number_of_topics | Array of number of topics. | [10,15,20] | |
| prefix_exclude | Comma-separated list of prefixes; features whose names start with any of them are excluded. | “mt-,Rpl,Rps” | “mt-,Rpl,Rps” |
| min_percent_expressed | Exclude features expressed below this percentage. | 2 | |
| max_percent_expressed | Exclude features expressed above this percentage. | 98 | |
| random_number_seed | Random number seed for reproducibility. | 0 | 0 |
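To make the three filtering inputs concrete, here is a minimal sketch of how they might combine. It is not the workflow's actual implementation. It assumes the percentages refer to the share of cells in which a feature is detected, that prefix matching is case-sensitive, and that both thresholds are inclusive. The function `keep_feature` and the feature values are hypothetical, and the thresholds 2 and 98 are the example values from the table.

```python
def keep_feature(name, percent_expressed, prefixes="mt-,Rpl,Rps",
                 min_percent=2, max_percent=98):
    """Return True if a feature passes the prefix and expression filters.

    percent_expressed is assumed to be the percentage of cells in which
    the feature is detected (an assumption, not confirmed by the docs).
    """
    excluded = tuple(p for p in prefixes.split(",") if p)
    # Drop features whose name starts with any excluded prefix.
    if excluded and name.startswith(excluded):
        return False
    # Keep only features within the [min_percent, max_percent] range.
    return min_percent <= percent_expressed <= max_percent


# Hypothetical features with made-up expression percentages.
features = {
    "mt-Nd1": 90.0,  # removed: matches prefix "mt-"
    "Rpl13": 50.0,   # removed: matches prefix "Rpl"
    "Actb": 99.5,    # removed: above max_percent
    "Cd3e": 12.0,    # kept
    "Xist": 0.5,     # removed: below min_percent
}
kept = [name for name, pct in features.items() if keep_feature(name, pct)]
print(kept)
```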
Workflow output
| Name | Type | Description |
|---|---|---|
| coherence_plot | File | Plot of coherence scores vs. number of topics |
| perplexity_plot | File | Plot of perplexity values vs. number of topics |
| cell_scores | Array[File] | Topic by cells (one file for each topic number) |
| feature_topics | Array[File] | Topic by features (one file for each topic number) |
| report | Array[File] | HTML visualization report (one file for each topic number) |
| stats | Array[File] | Computed coherence and perplexity (one file for each topic number) |
| model | Array[File] | Serialized LDA model (one file for each topic number) |
| corpus | File | Serialized corpus |
| dictionary | File | Serialized dictionary |
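The coherence and perplexity outputs are typically used to choose among the values given in number_of_topics: higher coherence and lower perplexity are generally preferred. The sketch below picks the topic number with the highest coherence. The layout of the stats files is not described here, so the scores are entered by hand; the numbers are made up for illustration and `best_topic_number` is a hypothetical helper.

```python
def best_topic_number(coherence_by_topics):
    """Return the number of topics with the highest coherence score."""
    if not coherence_by_topics:
        raise ValueError("no coherence scores provided")
    return max(coherence_by_topics, key=coherence_by_topics.get)


# Made-up coherence scores, keyed by number of topics.
scores = {10: 0.41, 15: 0.47, 20: 0.44}
best = best_topic_number(scores)
print(best)
```

In practice it is worth reading the coherence and perplexity plots together, since the highest coherence alone does not always give the most interpretable model.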