Topic modeling

Prepare input data

Follow the steps below to run topic_modeling on Terra.

  1. Prepare your count matrix. Cumulus currently supports the following formats: ‘zarr’, ‘h5ad’, ‘loom’, ‘10x’, ‘mtx’, ‘csv’, ‘tsv’ and ‘fcs’ (for flow/mass cytometry data).

  2. Upload your count matrix to the workspace.

    Example:

    gsutil cp /foo/bar/projects/dataset.h5ad gs://fc-e0000000-0000-0000-0000-000000000000/
    

    where /foo/bar/projects/dataset.h5ad is the path to your dataset on your local machine, and gs://fc-e0000000-0000-0000-0000-000000000000/ is the Google bucket destination.

  3. Import topic_modeling workflow to your workspace.

    See the Terra documentation for adding a workflow. The Cumulus workflow is in the Broad Methods Repository under the name “cumulus/topic_modeling”.

    On the workflow page, click the Export to Workspace... button, then select the workspace to which you want to export the topic_modeling workflow from the drop-down menu.

  4. In your workspace, open topic_modeling in the WORKFLOWS tab. Select Run workflow with inputs defined by file paths, as shown below, and click the SAVE button.

    [Figure: _images/single_workflow.png]
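The upload in step 2 can also be scripted. The following is a minimal sketch, not part of Cumulus: build_upload_command is a hypothetical helper that assembles the same gsutil cp command shown above, adding the -r flag when the input is a directory. It only builds the command; running it (for example with subprocess.run) is left to the caller.

```python
import shlex
from pathlib import Path


def build_upload_command(local_path: str, bucket_url: str) -> list[str]:
    """Return the gsutil command that copies local_path into bucket_url.

    Hypothetical helper for illustration; not part of Cumulus or gsutil.
    """
    if not bucket_url.startswith("gs://"):
        raise ValueError(f"not a Google bucket URL: {bucket_url!r}")
    command = ["gsutil", "cp"]
    # Directory inputs need a recursive copy.
    if Path(local_path).is_dir():
        command.append("-r")
    command += [local_path, bucket_url]
    return command


# Paths taken from the example in step 2.
command = build_upload_command(
    "/foo/bar/projects/dataset.h5ad",
    "gs://fc-e0000000-0000-0000-0000-000000000000/",
)
print(shlex.join(command))
```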

Workflow input

Inputs for the topic_modeling workflow are described below. Required inputs are in bold.

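=====================  =========================================================  ==============================================================  =============
Name                   Description                                                Example                                                         Default
=====================  =========================================================  ==============================================================  =============
input_file             Google bucket URL of the input count matrix.               “gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad”
number_of_topics       Array of topic numbers to try.                             [10,15,20]
prefix_exclude         Comma-separated list of feature-name prefixes to exclude.  “mt-,Rpl,Rps”                                                   “mt-,Rpl,Rps”
min_percent_expressed  Exclude features expressed below min_percent.              2
max_percent_expressed  Exclude features expressed above max_percent.              98
random_number_seed     Random number seed for reproducibility.                    0                                                               0
=====================  =========================================================  ==============================================================  =============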
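To make the three feature filters concrete, here is a small sketch of how prefix_exclude, min_percent_expressed and max_percent_expressed could act together. It is an illustration under stated assumptions, not the Cumulus implementation: filter_features and the toy count data are hypothetical, "expressed" is assumed to mean a count greater than zero, the percentage is assumed to be taken over cells, and prefix matching is assumed to be case-sensitive.

```python
def filter_features(counts, prefix_exclude, min_percent, max_percent):
    """Return the names of the features that pass all three filters.

    counts maps a feature name to a non-empty list of per-cell counts.
    Hypothetical helper for illustration; not the Cumulus implementation.
    """
    prefixes = tuple(p for p in prefix_exclude.split(",") if p)
    kept = []
    for name, values in counts.items():
        # Drop features whose name starts with any excluded prefix.
        if prefixes and name.startswith(prefixes):
            continue
        # Percentage of cells in which the feature has a non-zero count.
        percent = 100.0 * sum(1 for v in values if v > 0) / len(values)
        if percent < min_percent or percent > max_percent:
            continue
        kept.append(name)
    return kept


# Toy matrix with four cells; names and counts are made up for illustration.
toy_counts = {
    "mt-Nd1": [5, 3, 0, 2],  # dropped: matches the "mt-" prefix
    "Rpl13": [1, 1, 1, 0],   # dropped: matches the "Rpl" prefix
    "Actb": [4, 2, 7, 1],    # dropped: expressed in 100% of cells (> 98)
    "Cd3e": [0, 2, 0, 1],    # kept: expressed in 50% of cells
    "GeneX": [0, 0, 0, 0],   # dropped: expressed in 0% of cells (< 2)
}
kept = filter_features(toy_counts, "mt-,Rpl,Rps", min_percent=2, max_percent=98)
print(kept)  # ['Cd3e']
```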

Workflow output

===============  ===========  ==================================================================
Name             Type         Description
===============  ===========  ==================================================================
coherence_plot   File         Plot of coherence scores vs. number of topics
perplexity_plot  File         Plot of perplexity values vs. number of topics
cell_scores      Array[File]  Topic by cells (one file for each topic number)
feature_topics   Array[File]  Topic by features (one file for each topic number)
report           Array[File]  HTML visualization report (one file for each topic number)
stats            Array[File]  Computed coherence and perplexity (one file for each topic number)
model            Array[File]  Serialized LDA model (one file for each topic number)
corpus           File         Serialized corpus
dictionary       File         Serialized dictionary
===============  ===========  ==================================================================
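
The coherence and perplexity outputs are typically used to choose among the values passed in number_of_topics. The sketch below shows one common heuristic, assuming that higher coherence and lower perplexity are preferred. The layout of the stats files is not described here, so pick_number_of_topics is a hypothetical helper that takes values already read into a dictionary, and the numbers are placeholders rather than results from any dataset.

```python
def pick_number_of_topics(stats):
    """Return the number of topics with the highest coherence.

    stats maps a number of topics to a (coherence, perplexity) pair.
    Ties in coherence are broken by the lower perplexity.
    Hypothetical helper for illustration only.
    """
    return max(stats, key=lambda k: (stats[k][0], -stats[k][1]))


# Placeholder values for illustration only; not output of the workflow.
example_stats = {
    10: (0.41, 950.0),
    15: (0.47, 900.0),
    20: (0.47, 920.0),
}
best = pick_number_of_topics(example_stats)
print(best)  # 15
```

The coherence and perplexity plots remain the better guide when the scores are close; a single maximum can hide a plateau across several topic numbers.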