Topic modeling

Prepare input data

Follow the steps below to run topic_modeling on Terra.

  1. Prepare your count matrix. Cumulus currently supports the following formats: ‘zarr’, ‘h5ad’, ‘loom’, ‘10x’, ‘mtx’, ‘csv’, ‘tsv’, and ‘fcs’ (for flow/mass cytometry data).

  2. Upload your count matrix to the workspace.

    Example:

    gsutil cp /foo/bar/projects/dataset.h5ad gs://fc-e0000000-0000-0000-0000-000000000000/
    

    where /foo/bar/projects/dataset.h5ad is the path to your dataset on your local machine, and gs://fc-e0000000-0000-0000-0000-000000000000/ is the Google bucket destination.

  3. Import topic_modeling workflow to your workspace.

    See the Terra documentation for adding a workflow. The workflow is in the Broad Methods Repository under the name “cumulus/topic_modeling”.

    On the workflow page, click the Export to Workspace... button, then select from the drop-down menu the workspace to which you want to export the topic_modeling workflow.

  4. In your workspace, open topic_modeling in the WORKFLOWS tab. Select Run workflow with inputs defined by file paths, as shown in the screenshot below, and click the SAVE button.

    [Screenshot: _images/single_workflow.png]
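
The upload in step 2 can be scripted. The following is a minimal sketch, not part of Cumulus: the helper name build_upload_command and the extension list are assumptions made for illustration. It checks that a local file has an extension matching one of the single-file formats from step 1 and composes the same gsutil cp command shown above. The ‘10x’ and ‘mtx’ formats are left out of the check because their file layouts vary.

```python
import shlex
from pathlib import PurePosixPath

# Assumed mapping from the formats in step 1 to file extensions.
# '10x' and 'mtx' inputs are not covered by this simple check.
SUPPORTED_SUFFIXES = {".zarr", ".h5ad", ".loom", ".csv", ".tsv", ".fcs"}


def build_upload_command(local_path: str, bucket_url: str) -> str:
    """Return a gsutil cp command string that uploads local_path to bucket_url."""
    suffix = PurePosixPath(local_path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unrecognized count matrix extension: {suffix!r}")
    if not bucket_url.startswith("gs://"):
        raise ValueError("Bucket destination must start with gs://")
    # Quote both arguments so paths containing spaces survive the shell.
    return " ".join(["gsutil", "cp", shlex.quote(local_path), shlex.quote(bucket_url)])


command = build_upload_command(
    "/foo/bar/projects/dataset.h5ad",
    "gs://fc-e0000000-0000-0000-0000-000000000000/",
)
print(command)
```

The printed string can be pasted into a terminal where gsutil is installed and authenticated.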

Workflow input

Inputs for the topic_modeling workflow are described below. Required inputs are in bold.

Name | Description | Example | Default
--- | --- | --- | ---
input_file | Google bucket URL of the input count matrix. | "gs://fc-e0000000-0000-0000-0000-000000000000/my_dataset.h5ad" |
number_of_topics | Array of topic numbers; outputs are produced for each value. | [10,15,20] |
prefix_exclude | Comma-separated list of prefixes; features whose names start with any of them are excluded. | "mt-,Rpl,Rps" | "mt-,Rpl,Rps"
min_percent_expressed | Exclude features expressed below this percentage. | 2 |
max_percent_expressed | Exclude features expressed above this percentage. | 98 |
random_number_seed | Random number seed for reproducibility. | 0 | 0
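
To make the three feature filters concrete, here is a minimal sketch of how they might combine. It is not the Cumulus implementation. It assumes that "percent expressed" means the percentage of cells with a nonzero count for the feature, that prefix matching is case-sensitive, and that both percentage bounds are inclusive. The function name keep_feature and the feature values are hypothetical.

```python
def keep_feature(name, percent_expressed, prefix_exclude="mt-,Rpl,Rps",
                 min_percent_expressed=2.0, max_percent_expressed=98.0):
    """Return True if a feature passes the prefix and percentage filters.

    percent_expressed is assumed to be the percentage of cells (0-100)
    in which the feature has a nonzero count.
    """
    prefixes = tuple(p for p in prefix_exclude.split(",") if p)
    if name.startswith(prefixes):  # case-sensitive match (assumption)
        return False
    # Inclusive bounds are an assumption; the table does not specify them.
    return min_percent_expressed <= percent_expressed <= max_percent_expressed


# Hypothetical features with made-up percentages, for illustration only.
features = {"mt-Co1": 95.0, "Rpl13": 80.0, "Actb": 99.5, "Cd3e": 12.0, "Xist": 0.5}
kept = [name for name, pct in features.items() if keep_feature(name, pct)]
print(kept)
```

In this example only Cd3e survives: mt-Co1 and Rpl13 match an excluded prefix, Actb exceeds the upper bound, and Xist falls below the lower bound.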

Workflow output

Name | Type | Description
--- | --- | ---
coherence_plot | File | Plot of coherence scores vs. number of topics
perplexity_plot | File | Plot of perplexity values vs. number of topics
cell_scores | Array[File] | Topic by cells (one file for each topic number)
feature_topics | Array[File] | Topic by features (one file for each topic number)
report | Array[File] | HTML visualization report (one file for each topic number)
stats | Array[File] | Computed coherence and perplexity (one file for each topic number)
model | Array[File] | Serialized LDA model (one file for each topic number)
corpus | File | Serialized corpus
dictionary | File | Serialized dictionary
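
One common use of the coherence and perplexity outputs is to compare the candidate topic numbers. The sketch below assumes the values from the stats files have already been read into a dictionary keyed by number of topics. The stats file format is not described here, so the loading step is omitted. The numbers are invented for illustration, and the helper best_by_coherence is hypothetical. The sketch picks the topic number with the highest coherence, on the usual convention that higher coherence is better.

```python
# Hypothetical values keyed by number of topics. These numbers are invented;
# real values would come from the workflow's stats output.
stats = {
    10: {"coherence": 0.41, "perplexity": 310.0},
    15: {"coherence": 0.47, "perplexity": 295.0},
    20: {"coherence": 0.44, "perplexity": 290.0},
}


def best_by_coherence(stats_by_k):
    """Return the number of topics with the highest coherence score."""
    return max(stats_by_k, key=lambda k: stats_by_k[k]["coherence"])


print(best_by_coherence(stats))
```

Coherence alone is only one criterion; the perplexity plot and the HTML reports give further context for the choice.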