Prepare input data
Follow the steps below to run topic_modeling on Terra.
Prepare your count matrix. Cumulus currently supports the following formats: ‘zarr’, ‘h5ad’, ‘loom’, ‘10x’, ‘mtx’, ‘csv’, ‘tsv’ and ‘fcs’ (for flow/mass cytometry data).
Upload your count matrix to the workspace.
gsutil cp /foo/bar/projects/dataset.h5ad gs://fc-e0000000-0000-0000-0000-000000000000/
/foo/bar/projects/dataset.h5ad is the path to your dataset on your local machine, and
gs://fc-e0000000-0000-0000-0000-000000000000/ is the Google bucket destination.
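The upload step can also be scripted. The sketch below is a minimal illustration, assuming gsutil is installed and authenticated on your machine. The helper name build_upload_command is hypothetical and is not part of Cumulus or Terra. It only assembles the command; the commented lines at the end show how it could be executed.

```python
import posixpath


def build_upload_command(local_path, bucket_url):
    """Return the gsutil command that copies local_path into bucket_url.

    Hypothetical helper for illustration; not part of Cumulus or Terra.
    """
    if not bucket_url.startswith("gs://"):
        raise ValueError("bucket_url must start with gs://")
    # Keep the original file name and make sure exactly one slash
    # separates the bucket from the file name.
    file_name = posixpath.basename(local_path)
    destination = bucket_url.rstrip("/") + "/" + file_name
    return ["gsutil", "cp", local_path, destination]


command = build_upload_command(
    "/foo/bar/projects/dataset.h5ad",
    "gs://fc-e0000000-0000-0000-0000-000000000000/",
)
print(" ".join(command))

# To run the upload (requires gsutil on PATH and valid credentials):
# import subprocess
# subprocess.run(command, check=True)
```

Spelling out the destination file name makes it easy to reuse the same URL later as the input_file value.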
Import the topic_modeling workflow to your workspace.
See the Terra documentation for adding a workflow. The Cumulus workflow is under
Broad Methods Repository with name “cumulus/topic_modeling”.
Moreover, in the workflow page, click the
Export to Workspace... button, and select the workspace to which you want to export the topic_modeling workflow in the drop-down menu.
In your workspace, open the topic_modeling workflow and select
Run workflow with inputs defined by file paths.
Inputs for the topic_modeling workflow are described below. Required inputs are in bold.
|Name||Description||Example||Default|
|input_file||Google bucket URL of the input count matrix.||“gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad”|
|number_of_topics||Array of number of topics.||[10,15,20]|
|prefix_exclude||Comma separated list of features to exclude that start with prefix.||“mt-,Rpl,Rps”||“mt-,Rpl,Rps”|
|min_percent_expressed||Exclude features expressed below min_percent.||2|
|max_percent_expressed||Exclude features expressed above max_percent.||98|
|random_number_seed||Random number seed for reproducibility.||0||0|
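The inputs listed above can also be collected in a JSON file. The sketch below is a minimal illustration rather than an official template: it assumes the WDL convention of naming each input as workflow name, a dot, then the input name, and it assumes the workflow is called topic_modeling inside the WDL. The helper build_inputs and its validation rules are hypothetical.

```python
import json


def build_inputs(input_file, number_of_topics, prefix_exclude=None,
                 min_percent_expressed=None, max_percent_expressed=None,
                 random_number_seed=None, workflow_name="topic_modeling"):
    """Assemble an inputs dictionary for the topic_modeling workflow.

    Hypothetical helper; the "<workflow_name>.<input>" key layout is the
    usual WDL inputs convention and is an assumption here.
    """
    if not input_file.startswith("gs://"):
        raise ValueError("input_file must be a Google bucket URL (gs://...)")
    if not number_of_topics or not all(
        isinstance(k, int) and k > 0 for k in number_of_topics
    ):
        raise ValueError("number_of_topics must be a non-empty list of positive integers")
    if (min_percent_expressed is not None
            and max_percent_expressed is not None
            and min_percent_expressed >= max_percent_expressed):
        raise ValueError("min_percent_expressed must be below max_percent_expressed")

    values = {
        "input_file": input_file,
        "number_of_topics": list(number_of_topics),
        "prefix_exclude": prefix_exclude,
        "min_percent_expressed": min_percent_expressed,
        "max_percent_expressed": max_percent_expressed,
        "random_number_seed": random_number_seed,
    }
    # Leave out options that were not set, so the workflow decides them.
    return {
        f"{workflow_name}.{name}": value
        for name, value in values.items()
        if value is not None
    }


inputs = build_inputs(
    "gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad",
    [10, 15, 20],
    min_percent_expressed=2,
    max_percent_expressed=98,
)
print(json.dumps(inputs, indent=2))
```

Checking the values before submission catches simple mistakes, such as a local path passed as input_file, before a job is launched.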
Outputs of the topic_modeling workflow are described below.
|Name||Type||Description|
|coherence_plot||File||Plot of coherence scores vs. number of topics|
|perplexity_plot||File||Plot of perplexity values vs. number of topics|
|cell_scores||Array[File]||Topic by cells (one file for each topic number)|
|feature_topics||Array[File]||Topic by features (one file for each topic number)|
|report||Array[File]||HTML visualization report (one file for each topic number)|
|stats||Array[File]||Computed coherence and perplexity (one file for each topic number)|
|model||Array[File]||Serialized LDA model (one file for each topic number)|
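Because each Array[File] output holds one file per topic number, it can help to pair the files with the values passed in number_of_topics. The sketch below assumes the files are returned in the same order as number_of_topics, which this page does not state, so verify it against your own run. The helper name and the file names are hypothetical placeholders.

```python
def pair_outputs_with_topics(number_of_topics, files):
    """Map each requested topic number to its output file.

    Assumes files are ordered like number_of_topics (an unverified assumption).
    """
    if len(number_of_topics) != len(files):
        raise ValueError("expected one file per topic number")
    return dict(zip(number_of_topics, files))


# Placeholder names; real outputs are files in your workspace bucket.
models = pair_outputs_with_topics(
    [10, 15, 20],
    ["model_a", "model_b", "model_c"],
)
print(models[15])
```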