The genomic pipeline, end to end (v5)

Every stage your raw DNA file passes through, in order, from upload through encryption, parsing, variant enrichment, the seventeen analyser modules, and final result persistence.

8 min read · updated Apr 23, 2026

The genomic pipeline is the part of Haeckel that turns a raw DNA file into the populated Genome page you see when the analysis completes. It runs in stages with checkpoints at each step so a partial failure does not lose the work that came before, and so re-processing a file (when an analyser improves, for example) skips the stages that have not changed.
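The checkpointing idea can be illustrated with a minimal sketch. It assumes a checkpoint is keyed on the stage name, the stage version, and a hash of the stage input; the in-memory store and the names `checkpoint_key` and `run_stage` are hypothetical and are not taken from the Haeckel codebase.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical in-memory checkpoint store; a real pipeline would use durable storage.
checkpoints: Dict[str, bytes] = {}

def checkpoint_key(stage: str, stage_version: str, input_bytes: bytes) -> str:
    """Build a key that changes whenever the stage version or its input changes."""
    digest = hashlib.sha256(input_bytes).hexdigest()
    return f"{stage}:{stage_version}:{digest}"

def run_stage(stage: str, stage_version: str, input_bytes: bytes,
              fn: Callable[[bytes], bytes]) -> bytes:
    """Run a stage, or reuse its checkpoint if the same work was already done."""
    key = checkpoint_key(stage, stage_version, input_bytes)
    if key in checkpoints:
        return checkpoints[key]      # unchanged stage and input: skip the work
    output = fn(input_bytes)
    checkpoints[key] = output        # save progress so a later failure keeps it
    return output
```

With this keying, bumping a stage's version forces that stage to run again, while stages whose version and input are unchanged are served from their checkpoints.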

Stage 0: upload and validation

You upload a chip-array text file, a VCF, or a FASTQ file. The pipeline auto-detects the format from the magic bytes and the column structure, decompresses gzip when needed, and validates that the file is well-formed before persisting it. Failed validation aborts the run with an actionable error message rather than letting downstream stages produce garbage.
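As an illustration of format detection, the sketch below checks the gzip magic bytes, then the VCF header line, then the four-line FASTQ record shape, and finally the column count of a tab-separated chip export. The four-or-five-column assumption for chip-array files and the function name `detect_format` are ours for the example; the production detector may use different rules.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def detect_format(raw: bytes) -> str:
    """Return 'vcf', 'fastq', or 'chip-array', or raise ValueError."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8", errors="replace")
    if text.startswith("##fileformat=VCF"):
        return "vcf"
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # A FASTQ record is four lines: '@' header, sequence, '+' separator, qualities.
    if len(lines) >= 4 and lines[0].startswith("@") and lines[2].startswith("+"):
        return "fastq"
    # Assumed chip-array layout: '#' comment lines, then tab-separated rows of
    # rsID, chromosome, position, and one or two genotype columns.
    data = [ln for ln in lines if not ln.startswith("#")]
    if data and len(data[0].split("\t")) in (4, 5):
        return "chip-array"
    raise ValueError("Unrecognised file format: expected chip-array text, VCF, or FASTQ")
```

Raising a specific error at this point is what lets the run abort early with a message the user can act on.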

Stage 1: encryption and persistence

The validated file is encrypted with a per-user symmetric key and stored in encrypted object storage. The key itself is held in our key-management service (KMS) and is referenced by an opaque identifier rather than by the file path, so an attacker with access to storage alone cannot decrypt the file; they would also have to breach the KMS.
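The separation between stored data and key material can be sketched as follows. This is only an illustration of the indirection: `ToyKMS`, `StoredObject`, and `store_encrypted` are hypothetical names, the KMS is an in-memory stand-in, and the cipher is passed in as a parameter because a real deployment would use an authenticated cipher from a vetted library rather than anything defined here.

```python
import secrets
from dataclasses import dataclass
from typing import Callable, Dict

Cipher = Callable[[bytes, bytes], bytes]  # (key, plaintext) -> ciphertext

@dataclass(frozen=True)
class StoredObject:
    key_id: str        # opaque reference into the KMS, never the key itself
    ciphertext: bytes

class ToyKMS:
    """In-memory stand-in for a key-management service (hypothetical)."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def create_key(self) -> str:
        key_id = secrets.token_hex(16)          # random, unrelated to any file path
        self._keys[key_id] = secrets.token_bytes(32)
        return key_id

    def get_key(self, key_id: str) -> bytes:
        return self._keys[key_id]

def store_encrypted(kms: ToyKMS, key_id: str, plaintext: bytes,
                    encrypt: Cipher) -> StoredObject:
    """Encrypt with the key behind key_id; only the identifier is stored."""
    return StoredObject(key_id=key_id, ciphertext=encrypt(kms.get_key(key_id), plaintext))
```

The stored record carries the identifier and the ciphertext only, so reading the object store yields nothing that can be decrypted without a separate call to the key service.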

Stage 2: parsing and ref/alt alignment

A format-specific parser reads the file and emits a stream of typed variant objects with rsID, chromosome, position, and genotype. The variants are then passed through ref/alt alignment, which standardises the orientation of each variant against a public reference. Palindromic A/T and C/G SNPs read the same on both strands, so without this step they can flip silently between platforms and corrupt every downstream polygenic risk score (PRS).
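To show why palindromic SNPs need special care, here is a minimal sketch of strand alignment for single-base variants. It flips a genotype to the reference strand when the alleles make the strand unambiguous, and only flags A/T and C/G sites, since alleles alone cannot resolve them. The function names and status labels are hypothetical, and how the production step resolves the flagged sites is not covered here.

```python
from typing import Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(ref: str, alt: str) -> bool:
    """A/T and C/G SNPs look identical on both strands."""
    return COMPLEMENT.get(ref) == alt

def align_genotype(genotype: str, ref: str, alt: str) -> Tuple[str, str]:
    """Return (genotype on the reference strand, status)."""
    if any(allele not in COMPLEMENT for allele in genotype):
        return genotype, "mismatch"      # no-calls or indels: out of scope here
    if is_palindromic(ref, alt):
        return genotype, "ambiguous"     # strand cannot be inferred from alleles
    allowed = {ref, alt}
    if set(genotype) <= allowed:
        return genotype, "ok"
    flipped = "".join(COMPLEMENT[allele] for allele in genotype)
    if set(flipped) <= allowed:
        return flipped, "flipped"        # reported on the opposite strand
    return genotype, "mismatch"
```

For example, a `TC` call at a site whose reference alleles are A and G is returned as `AG` with status `flipped`, while an `AT` call at an A/T site is returned unchanged with status `ambiguous`.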

Stage 3: variant enrichment

The aligned variants are joined against a large reference database that carries population-frequency data, ClinVar significance, GWAS-catalog effect sizes, ancestry-informativeness scores, and any structural-variant overlap. The enriched variants are what downstream analysers consume; a variant with no match in the reference database falls back to default values rather than failing the run.
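The join-with-defaults behaviour can be sketched as a lookup keyed on rsID. The `Enrichment` fields shown are a subset chosen for illustration, and the default values are assumptions rather than the ones the pipeline actually uses.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional

@dataclass(frozen=True)
class Enrichment:
    # Illustrative subset of fields; defaults are assumed, not the real ones.
    population_frequency: Optional[float] = None
    clinvar_significance: str = "not_provided"
    gwas_effect_size: Optional[float] = None

DEFAULT_ENRICHMENT = Enrichment()

def enrich(variants: Iterable[dict],
           reference: Dict[str, Enrichment]) -> Iterator[dict]:
    """Attach reference data to each variant, falling back to defaults."""
    for variant in variants:
        enrichment = reference.get(variant["rsid"], DEFAULT_ENRICHMENT)
        yield {**variant, "enrichment": enrichment}
```

Because the lookup never raises on a missing rsID, an unannotated variant still flows through to the analysers, which can treat the default values as "no information".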

Stage 4: parallel analyser execution

The seventeen analyser modules run in parallel (where their dependencies allow). Each analyser receives the enriched variant stream and produces a typed result object. Failures in one analyser are caught and reported in a `failedAnalyzers` array on the final result, rather than aborting the whole run. This means that a user with a chip-array file (which lacks the SNPs needed for some analyses) still gets a Genome page populated with everything that did succeed, plus an honest list of what did not.

AncestryAnalyzer (v4.0) — MLE-EM with spatial thinning, bootstrap Wald, AMR deconvolution.
HealthAnalyzer — 89-gene ClinVar matching with sex-aware X-linked.
TraitAnalyzer — GWAS-based phenotype mapping.
HaplogroupAnalyzer (v2.1) — recursive phylogenetic traversal with Bayesian confidence.
APOEAnalyzer — 4-tier resolution.
HLAAnalyzer — tag-SNP imputation.
PCACalculator — 3D centroid projection.
ArchaicAnalyzer — 53-marker Bayesian log-likelihood.
Archaic introgression: S* statistic + D-statistics + tract HMM.
ROHDetector — sliding-window with segment merging.
PhenotypePredictor — HIrisPlex-S plus GIANT-derived height.
PharmacogenomicsAnalyzer — star alleles + CPIC, with FDA Black Box flag.
NutrigenomicsAnalyzer — 46-SNP panel across 10 categories.
KinshipEstimator — KING-robust + IBD0/1/2.
DataQualityAnalyzer — call rate, het rate, Ti/Tv, per-chromosome counts.
PRS engine — six methods (Basic, C+T, PRS-CS, LDpred2, Lassosum, SBayesR) plus ensemble.
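The failure isolation described for Stage 4 can be sketched with a thread pool. This is a simplified illustration: it ignores dependencies between analysers, and the shape of each `failedAnalyzers` entry (a name plus an error string) is an assumption, as is the function name `run_analyzers`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Analyzer = Callable[[List[dict]], Any]

def run_analyzers(analyzers: Dict[str, Analyzer],
                  variants: List[dict]) -> Dict[str, Any]:
    """Run every analyser in parallel; collect failures instead of aborting."""
    results: Dict[str, Any] = {}
    failed: List[Dict[str, str]] = []
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, variants) for name, fn in analyzers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()   # re-raises the analyser's exception
            except Exception as exc:
                failed.append({"analyzer": name, "error": str(exc)})
    return {"results": results, "failedAnalyzers": failed}
```

An analyser that raises, for instance because the chip lacks the SNPs it needs, ends up in `failedAnalyzers` while every other result is still returned.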

Stage 5: result persistence and notification

All analyser outputs are upserted into the result tables. Upsert (rather than insert) means that re-processing a file overwrites the previous result rather than creating duplicates. The user receives a notification when the run completes, and the Genome page becomes populated.
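The difference between upsert and insert can be shown with a small SQLite example. The table and column names are hypothetical, and SQLite stands in for whatever database the platform actually uses; the point is only that a primary key on file and analyser, combined with `ON CONFLICT ... DO UPDATE`, keeps one row per result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE analyzer_results (
        file_id          TEXT NOT NULL,
        analyzer         TEXT NOT NULL,
        pipeline_version TEXT NOT NULL,
        payload          TEXT NOT NULL,
        PRIMARY KEY (file_id, analyzer)
    )
    """
)

def upsert_result(conn: sqlite3.Connection, file_id: str, analyzer: str,
                  pipeline_version: str, payload: str) -> None:
    """Insert a result, or overwrite the existing row for this file and analyser."""
    conn.execute(
        """
        INSERT INTO analyzer_results (file_id, analyzer, pipeline_version, payload)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (file_id, analyzer) DO UPDATE SET
            pipeline_version = excluded.pipeline_version,
            payload = excluded.payload
        """,
        (file_id, analyzer, pipeline_version, payload),
    )
```

Calling `upsert_result` twice for the same file and analyser leaves a single row holding the most recent payload and version.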

Idempotency and re-processing

When a method improves or a new analyser ships, the platform offers to re-process your existing file rather than asking you to re-upload. Re-processing reads the encrypted file from storage, runs the new pipeline version, and overwrites the previous result. The Genome page exposes the pipeline version that produced each result so you can see whether any given metric is up to date with the latest method.
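Deciding which results are out of date can be sketched as a comparison between the version recorded with each result and the version currently shipped. The function name `stale_results` and the idea of tracking a version string per analyser are assumptions made for the example.

```python
from typing import Dict, List

def stale_results(result_versions: Dict[str, str],
                  current_versions: Dict[str, str]) -> List[str]:
    """Return analysers whose stored result is missing or from an older version."""
    return sorted(
        name
        for name, current in current_versions.items()
        if result_versions.get(name) != current   # missing key covers new analysers
    )
```

A newly shipped analyser has no stored version at all, so it is reported as stale alongside analysers whose method has changed, which matches the two cases that trigger an offer to re-process.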
