Drop-seq pipeline
This workflow follows the steps outlined in the Drop-seq alignment cookbook from the McCarroll lab, except that the default STAR aligner flags are --limitOutSJcollapsed 1000000 --twopassMode Basic. Additionally, the pipeline provides the option to generate count matrices using dropEst.
Copy your sequencing output to your workspace bucket using gsutil from your Unix terminal.
You can obtain your bucket URL in the dashboard tab of your Terra workspace under the information panel.
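Before copying anything, you can confirm that gsutil can see your bucket. The bucket URL below is the placeholder used throughout this guide; substitute your own:

gsutil ls gs://fc-e0000000-0000-0000-0000-000000000000/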
Note: Broad users need to be on an UGER node (not a login node) in order to use the -m flag.

Request an UGER node:

reuse UGER
qrsh -q interactive -l h_vmem=4g -pe smp 8 -binding linear:8 -P regevlab

The above command requests an interactive node with 4G memory per thread and 8 threads. Feel free to change the memory, thread, and project parameters.
Once you’re connected to an UGER node, you can make gsutil available by running:
reuse Google-Cloud-SDK
Use

gsutil cp [OPTION]... src_url dst_url

to copy data to your workspace bucket. For example, the following command copies the directory at /foo/bar/nextseq/Data/VK18WBC6Z4 to a Google bucket:

gsutil -m cp -r /foo/bar/nextseq/Data/VK18WBC6Z4 gs://fc-e0000000-0000-0000-0000-000000000000/VK18WBC6Z4

-m means copy in parallel, -r means copy the directory recursively.

Non-Broad Institute users who wish to run bcl2fastq must create a custom Docker image. See the bcl2fastq-docker instructions.
Create a sample sheet.
Please note that the columns in the CSV must be in the order shown below and that the file must not contain a header line. The sample sheet provides either the FASTQ files for each sample (if you've already run bcl2fastq) or a list of BCL directories (if you're starting from BCL directories). Please note that BCL directories must contain a valid bcl2fastq sample sheet (SampleSheet.csv). A shell sketch for assembling the sample sheet appears after the examples below:
| Column | Description |
| --- | --- |
| Name | Sample name. |
| Read1 | Location of the FASTQ file for read1 in the cloud (gsurl). |
| Read2 | Location of the FASTQ file for read2 in the cloud (gsurl). |
Example using FASTQ input files:
sample-1,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-1/sample-1_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-1/sample-1_L001_R2_001.fastq.gz
sample-2,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-1/sample-2_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-1/sample-2_L001_R2_001.fastq.gz
sample-1,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-2/sample-1_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-2/sample-1_L001_R2_001.fastq.gz
Note that in this example, sample-1 was sequenced across two flowcells.
Example using BCL input directories:
gs://fc-e0000000-0000-0000-0000-000000000000/flowcell-1
gs://fc-e0000000-0000-0000-0000-000000000000/flowcell-2
Note that the flow cell directory must contain a bcl2fastq sample sheet named SampleSheet.csv.
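A minimal sketch for building and checking a FASTQ-style sample sheet from the command line; the sample names and gs:// paths are the hypothetical ones from the example above, so substitute your own:

# Write a header-less, three-column CSV (name,read1,read2).
cat > sample_sheet.csv <<'EOF'
sample-1,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-1/sample-1_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-1/sample-1_L001_R2_001.fastq.gz
sample-2,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-1/sample-2_L001_R1_001.fastq.gz,gs://fc-e0000000-0000-0000-0000-000000000000/dropseq-1/sample-2_L001_R2_001.fastq.gz
EOF
# Optional sanity check: each FASTQ URL listed in the sheet should exist in the bucket.
cut -d, -f2,3 sample_sheet.csv | tr ',' '\n' | while read -r url; do gsutil ls "$url"; done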
Upload your sample sheet to the workspace bucket.
Example:
gsutil cp /foo/bar/projects/sample_sheet.csv gs://fc-e0000000-0000-0000-0000-000000000000/
Import the dropseq_workflow workflow to your workspace.

See the Terra documentation for adding a workflow. The dropseq_workflow is under Broad Methods Repository with the name “cumulus/dropseq_workflow”.

On the workflow page, click the Export to Workspace... button and, in the drop-down menu, select the workspace to which you want to export the dropseq_workflow workflow.

In your workspace, open dropseq_workflow in the WORKFLOWS tab. Select Run workflow with inputs defined by file paths as below and click the SAVE button.
Inputs
Please see the description of important inputs below.
| Name | Description |
| --- | --- |
| input_csv_file | CSV file containing sample name, read1, and read2, or a list of BCL directories. |
| output_directory | Pipeline output directory (gs URL, e.g. “gs://fc-e0000000-0000-0000-0000-000000000000/dropseq_output”) |
| reference | hg19, GRCh38, mm10, hg19_mm10, mmul_8.0.1, or a path to a custom reference JSON file |
| run_bcl2fastq | Whether to run bcl2fastq; set to true if your sample sheet contains one BCL directory per line rather than one sample per line (default false) |
| run_dropseq_tools | Whether to generate count matrices using Drop-seq tools from the McCarroll lab (default true) |
| run_dropest | Whether to generate count matrices using dropEst (default false) |
| cellular_barcode_whitelist | Optional whitelist of known cellular barcodes |
| drop_seq_tools_force_cells | If supplied, bypass the cell detection algorithm (the elbow method) and use this number of cells. |
| dropest_cells_max | Maximal number of output cells |
| dropest_genes_min | Minimal number of genes for cells after the merge procedure (default 100) |
| dropest_min_merge_fraction | Threshold for the merge procedure (default 0.2) |
| dropest_max_cb_merge_edit_distance | Max edit distance between barcodes (default 2) |
| dropest_max_umi_merge_edit_distance | Max edit distance between UMIs (default 1) |
| dropest_min_genes_before_merge | Minimal number of genes for cells before the merge procedure; used mostly for optimization (default 10) |
| dropest_merge_barcodes_precise | Use the precise merge strategy (can be slow); recommended when the list of real barcodes is not available (default true) |
| dropest_velocyto | Save separate count matrices for exons, introns, and exon/intron spanning reads (default true) |
| trim_sequence | The sequence to look for at the start of reads for trimming (default “AAGCAGTGGTATCAACGCAGAGTGAATGGG”) |
| trim_num_bases | How many bases at the beginning of the sequence must match before trimming occurs (default 5) |
| umi_base_range | The base location of the molecular barcode (default 13-20) |
| cellular_barcode_base_range | The base location of the cell barcode (default 1-12) |
| star_flags | Additional options to pass to the STAR aligner |
Please note that run_bcl2fastq must be set to true if you’re starting from BCL files instead of FASTQs.
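If you prefer to populate the workflow configuration from a JSON file (Terra's inputs page lets you upload one) rather than typing values into the UI, a minimal sketch is shown below. It assumes the usual WDL workflowName.inputName naming (here dropseq_workflow.*) and reuses the placeholder bucket from this guide; check the input table shown in Terra for the exact names your workflow snapshot expects.

# Hypothetical inputs file; adjust names and values to match your workspace.
cat > dropseq_inputs.json <<'EOF'
{
  "dropseq_workflow.input_csv_file": "gs://fc-e0000000-0000-0000-0000-000000000000/sample_sheet.csv",
  "dropseq_workflow.output_directory": "gs://fc-e0000000-0000-0000-0000-000000000000/dropseq_output",
  "dropseq_workflow.reference": "hg19",
  "dropseq_workflow.run_bcl2fastq": false,
  "dropseq_workflow.run_dropest": true
}
EOF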
Custom Genome JSON
If your reference is not one of the predefined choices, you can create a custom JSON file. Example:
{
"refflat": "gs://fc-e0000000-0000-0000-0000-000000000000/human_mouse/hg19_mm10_transgenes.refFlat",
"genome_fasta": "gs://fc-e0000000-0000-0000-0000-000000000000/human_mouse/hg19_mm10_transgenes.fasta",
"star_genome": "gs://fc-e0000000-0000-0000-0000-000000000000/human_mouse/STAR2_5_index_hg19_mm10.tar.gz",
"gene_intervals": "gs://fc-e0000000-0000-0000-0000-000000000000/human_mouse/hg19_mm10_transgenes.genes.intervals",
"genome_dict": "gs://fc-e0000000-0000-0000-0000-000000000000/human_mouse/hg19_mm10_transgenes.dict",
"star_cpus": 32,
"star_memory": "120G"
}
The fields star_cpus and star_memory are optional and are used as the default cpus and memory for running STAR with your genome.
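To use a custom genome, upload the JSON to a location the workflow can read (for example your workspace bucket) and pass its gs:// URL as the reference input. A sketch, assuming the hypothetical filename hg19_mm10.json:

# Upload the custom genome JSON, then set reference = gs://fc-e0000000-0000-0000-0000-000000000000/hg19_mm10.json
gsutil cp /foo/bar/projects/hg19_mm10.json gs://fc-e0000000-0000-0000-0000-000000000000/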
Outputs
The pipeline outputs a list of Google bucket URLs containing one gene-count matrix per sample. Each gene-count matrix file produced by Drop-seq tools has the suffix dge.txt.gz; matrices produced by dropEst have the extension .rds.
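To inspect a Drop-seq tools matrix locally, you can copy it down and preview the first few rows. The path below is hypothetical; the actual layout depends on your sample names and output_directory:

# Copy one gene-count matrix from the (hypothetical) output location and preview it.
gsutil -m cp gs://fc-e0000000-0000-0000-0000-000000000000/dropseq_output/sample-1/sample-1_dge.txt.gz .
# The file is a gzipped, tab-delimited gene-by-cell table; show the first rows and columns.
zcat sample-1_dge.txt.gz | head -n 5 | cut -f 1-5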
Building a Custom Genome
The tool dropseq_bundle can be used to build a custom genome. Please see the description of important inputs below.
| Name | Description |
| --- | --- |
| fasta_file | Array of fasta files. If more than one species, fasta and gtf files must be in the same order. |
| gtf_file | Array of gtf files. If more than one species, fasta and gtf files must be in the same order. |
| genomeSAindexNbases | Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, must be scaled down to min(14, log2(GenomeLength)/2 - 1) |
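As a worked example of that scaling rule, the snippet below computes the recommended value for a hypothetical 1 Mb genome (genomeSAindexNbases must be an integer, so the result is rounded down here):

# min(14, log2(GenomeLength)/2 - 1) for a 1 Mb genome
awk 'BEGIN { len = 1e6; v = log(len)/log(2)/2 - 1; if (v > 14) v = 14; printf "%.2f -> use %d\n", v, int(v) }'
# Prints "8.97 -> use 8"; for a full-size mammalian genome the cap of 14 applies.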
dropseq_workflow Terra Release Notes
Version 11
Added fastq_to_sam_memory and trim_bam_memory workflow inputs
Version 10
Updated workflow to WDL version 1.0
Version 9
Changed input bcl2fastq_docker_registry from optional to required
Version 8
Added additional parameters for bcl2fastq
Version 7
Added support for multi-species genomes (Barnyard experiments)
Version 6
Added star_extra_disk_space and star_disk_space_multiplier workflow inputs to adjust disk space allocated for STAR alignment task.
Version 5
Split preprocessing steps into separate tasks (FastqToSam, TagBam, FilterBam, and TrimBam).
Version 4
Handle uncompressed fastq files as workflow input.
Added optional prepare_fastq_disk_space_multiplier input.
Version 3
Set default value for docker_registry input.
Version 2
Added docker_registry input.
Version 1
Renamed sccloud to cumulus
Added use_bases_mask option when running bcl2fastq
Version 18
Created a separate docker image for running bcl2fastq
Version 17
Fixed bug that ignored WDL input star_flags (thanks to Carly Ziegler for reporting)
Changed default value of star_flags to the empty string (Prior versions of the WDL incorrectly indicated that basic 2-pass mapping was done)
Version 16
Use cumulus dockerhub organization
Changed default dropEst version to 0.8.6
Version 15
Added drop_seq_tools_prep_bam_memory and drop_seq_tools_dge_memory options
Version 14
Fix for downloading files from user pays buckets
Version 13
Set GCLOUD_PROJECT_ID for user pays buckets
Version 12
Changed default dropEst memory from 52G to 104G
Version 11
Updated formula for computing disk size for dropseq_count
Version 10
Added option to specify merge_bam_alignment_memory and sort_bam_max_records_in_ram
Version 9
Updated default drop_seq_tools_version from 2.2.0 to 2.3.0
Version 8
Made additional options available for running dropEst
Version 7
Changed default dropEst memory from 104G to 52G
Version 6
Added option to run dropEst
Version 5
Specify full version for bcl2fastq (2.20.0.422-2 instead of 2.20.0.422)
Version 4
Fixed issue that prevented bcl2fastq from running
Version 3
Set default run_bcl2fastq to false
Create shortcuts for commonly used genomes
Version 2
Updated QC report
Version 1
Initial release
dropseq_bundle Terra Release Notes
Version 4
Added create_intervals_memory and extra_star_flags inputs
Version 3
Added extra disk space inputs
Fixed bug that prevented creating multi-genome bundles
Version 2
Added docker_registry input
Version 1
Renamed sccloud to cumulus
Version 1
Changed docker organization
Version 1
Initial release