Human de novo Assembly

Generate de novo assemblies from PacBio read data. Supports single-sample and trio-based assemblies.

This workflow runs de novo assembly on human PacBio whole genome sequencing (WGS) data and is written in Workflow Description Language (WDL). It can assemble individual samples and, when a trio is present in the cohort, perform trio-based assembly.

Human de novo assembly workflow diagram

Workflow Inputs

Single-sample de novo assembly can be enabled independently for each sample. If the cohort contains a trio, trio-based assembly can also be run.

Input Description
cohort

A cohort can include one or more samples. Samples need not be related.

cohort_id                  A unique name for the cohort; used to name outputs
samples                    The set of samples for the cohort. At least one sample must be defined.
run_de_novo_assembly_trio  Run trio binned de novo assembly.
samples

Sample information for each sample in the workflow run.

sample_id             A unique name for the sample; used to name outputs
movie_bams            The set of unaligned movie BAMs associated with this sample
sex                   Sample sex
father_id             Paternal sample_id
mother_id             Maternal sample_id
run_de_novo_assembly  If true, run single-sample de novo assembly for this sample
reference

Files associated with the reference genome.

name   Reference name; used to name outputs (e.g., “GRCh38”)
fasta  Reference genome and index
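
As an illustration, the cohort, samples, and reference inputs described above could be assembled into an inputs file with a short script. This is a minimal sketch, not the workflow's own tooling: the workflow name prefix `de_novo_assembly`, the file paths, and the `data`/`data_index` keys used for the reference FASTA and its index are all assumptions made for the example, and the accepted values for `sex` are not specified in this document.

```python
import json


def make_sample(sample_id, movie_bams, sex=None, father_id=None,
                mother_id=None, run_de_novo_assembly=False):
    """Build one cohort.samples entry using the field names documented above."""
    return {
        "sample_id": sample_id,
        "movie_bams": list(movie_bams),
        "sex": sex,  # accepted values are not specified here; passed through as given
        "father_id": father_id,
        "mother_id": mother_id,
        "run_de_novo_assembly": run_de_novo_assembly,
    }


def make_inputs(cohort_id, samples, reference_name, reference_fasta,
                run_de_novo_assembly_trio=False,
                workflow_name="de_novo_assembly"):  # hypothetical workflow name
    """Assemble the cohort and reference inputs into one JSON-serializable dict."""
    if not samples:
        raise ValueError("At least one sample must be defined")
    sample_ids = [s["sample_id"] for s in samples]
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("sample_id values must be unique")
    return {
        f"{workflow_name}.cohort": {
            "cohort_id": cohort_id,
            "samples": list(samples),
            "run_de_novo_assembly_trio": run_de_novo_assembly_trio,
        },
        f"{workflow_name}.reference": {
            "name": reference_name,
            "fasta": reference_fasta,  # genome plus index; exact shape is an assumption
        },
    }


if __name__ == "__main__":
    sample = make_sample("sample1", ["sample1.movie1.bam"], run_de_novo_assembly=True)
    # Placeholder paths and hypothetical keys for the FASTA and its index.
    fasta = {"data": "reference.fasta", "data_index": "reference.fasta.fai"}
    print(json.dumps(make_inputs("cohort1", [sample], "GRCh38", fasta), indent=2))
```

Checking for an empty sample list and duplicate sample IDs before submission mirrors the two constraints stated above (at least one sample, unique names) and avoids discovering them only after the run has started.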
backend

Backend where the workflow will be executed

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, tasks run on preemptible instances where possible; when the backend is GCP, only tasks that run for more than 24 hours use on-demand VMs. If set to false, every task uses on-demand VMs. Ignored if the backend is HPC.
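
The dependencies between backend, zones, preemptible, and the two AWS queue ARNs can be checked before a run is submitted. The following is a minimal sketch of such a check based only on the rules stated above; the function name `check_backend_inputs` is hypothetical and is not part of the workflow.

```python
def check_backend_inputs(backend, zones=None, preemptible=False,
                         aws_spot_queue_arn=None, aws_on_demand_queue_arn=None):
    """Return a list of problems with the backend-related inputs (empty if none)."""
    problems = []
    # zones is required for the two cloud backends named above.
    if backend in ("AWS", "GCP") and not zones:
        problems.append("zones is required when backend is AWS or GCP")
    if backend == "AWS":
        # Which queue ARN is required depends on whether tasks run preemptibly.
        if preemptible and not aws_spot_queue_arn:
            problems.append("aws_spot_queue_arn is required when preemptible is true")
        if not preemptible and not aws_on_demand_queue_arn:
            problems.append("aws_on_demand_queue_arn is required when preemptible is false")
    return problems


if __name__ == "__main__":
    # An AWS run with preemptible tasks but neither zones nor a spot queue ARN.
    for problem in check_backend_inputs("AWS", preemptible=True):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every missing input at once.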

Workflow Outputs

The output set will depend on whether single-sample or trio-based de novo assembly is run.

Output Description
Sample de novo assembly

These files will be output if cohort.samples[sample].run_de_novo_assembly is set to true for any sample.

zipped_assembly_fastas  De novo dual assembly generated by hifiasm
assembly_noseq_gfas     Assembly graphs in GFA format.
assembly_lowQ_beds      Coordinates of low quality regions in BED format.
assembly_stats          Assembly size and NG50 stats generated by calN50.
asm_bam                 minimap2 alignment of assembly to reference.
htsbox_vcf              Naive pileup variant calling of assembly against reference with htsbox
htsbox_vcf_stats        bcftools stats summary statistics for htsbox variant calls
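
For readers unfamiliar with the NG50 statistic reported in assembly_stats, the following sketch shows the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the genome size. It illustrates the definition only and is not calN50's implementation; the function name `ng50` and the toy contig lengths are made up for the example.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50 for the given contig lengths, or None if it is undefined.

    Contigs are added from longest to shortest until their cumulative length
    reaches half of genome_size; the length of the last contig added is NG50.
    """
    target = genome_size / 2
    cumulative = 0
    for length in sorted(contig_lengths, reverse=True):
        cumulative += length
        if cumulative >= target:
            return length
    return None  # the assembly covers less than half of the genome size


if __name__ == "__main__":
    contigs = [50, 40, 30, 20, 10]
    # With a genome size of 200, half is 100: 50 + 40 = 90, then + 30 = 120.
    print(ng50(contigs, genome_size=200))
    # Using the assembly's own total length (150) gives the ordinary N50.
    print(ng50(contigs, genome_size=sum(contigs)))
```

Because NG50 is measured against a fixed genome size rather than the assembly's own total length, it allows assemblies of different total sizes to be compared on the same scale.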
Trio de novo assembly

These files will be output if cohort.run_de_novo_assembly_trio is set to true and there is at least one parent-parent-kid trio in the cohort.

trio_zipped_assembly_fastas  Haplotype-resolved de novo assembly of the trio kid generated by hifiasm with trio binning
trio_assembly_noseq_gfas     Assembly graphs in GFA format.
trio_assembly_lowQ_beds      Coordinates of low quality regions in BED format.
trio_assembly_stats          Assembly size and NG50 stats generated by calN50.
trio_asm_bams                minimap2 alignment of assembly to reference.
haplotype_key                Indication of which haplotype (hap1/hap2) corresponds to which parent.
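
Whether a cohort contains a parent-parent-kid trio can be checked from the father_id and mother_id fields of each sample. The sketch below shows one plausible reading of that condition, in which a sample counts as a trio kid when both of its parent IDs match other samples in the cohort. The workflow's actual detection logic is not described here, and the function name `find_trios` is hypothetical.

```python
def find_trios(samples):
    """Return (kid, father, mother) sample_id tuples for complete trios.

    A sample is treated as a trio kid only if both its father_id and its
    mother_id refer to samples that are present in the same cohort.
    """
    sample_ids = {s["sample_id"] for s in samples}
    trios = []
    for sample in samples:
        father = sample.get("father_id")
        mother = sample.get("mother_id")
        # A missing parent ID (None) is never in sample_ids, so it fails the check.
        if father in sample_ids and mother in sample_ids:
            trios.append((sample["sample_id"], father, mother))
    return trios


if __name__ == "__main__":
    cohort_samples = [
        {"sample_id": "kid", "father_id": "dad", "mother_id": "mom"},
        {"sample_id": "dad"},
        {"sample_id": "mom"},
    ]
    print(find_trios(cohort_samples))
```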

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s Quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.
