How your ancestry composition is computed
A short tour of the v4.0 inference pipeline: spatial thinning of the AIM panel, an MLE-EM solver, a bootstrap Wald test for signal versus noise, and a deconvolution step that fixes the admixed AMR reference panel.
Ancestry inference asks a deceptively simple question: given your genotypes at a panel of ancestry-informative markers, what mixture of reference populations best explains them? The answer is a vector of percentages, one per population, that sum to 100. The trick lies in how that vector is estimated, because almost every step has a place where naive choices produce systematically wrong answers.
Step 1: pick the right markers
The starting point is a curated set of ancestry-informative markers (AIMs): variants whose allele frequencies differ sharply between populations. The panel draws from public reference datasets (1000 Genomes plus the HGDP+TGP subset of gnomAD) and covers over 60 subpopulations across all major continental ancestries.
Raw markers are not all independent. Many sit close together on the same chromosome and are inherited together as a block, so they carry overlapping rather than additive information. Treating them as independent inflates statistical confidence and biases the inferred percentages. The panel is therefore thinned spatially, keeping markers spaced far enough apart along each chromosome that they can be treated as approximately independent.
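To make the idea of spatial thinning concrete, here is a minimal sketch of a greedy, distance-based thinning pass. The function name `thin_markers`, the tuple layout, and the 250 kb gap are all assumptions chosen for illustration; the actual spacing rule used by the pipeline is not stated here.

```python
from collections import defaultdict

def thin_markers(markers, min_gap_bp=250_000):
    """Greedy spatial thinning (illustrative sketch, not the production rule).

    markers: iterable of (marker_id, chromosome, position_bp) tuples.
    min_gap_bp: hypothetical minimum spacing between kept markers.
    Returns the ids of kept markers, ordered by chromosome label and position.
    """
    by_chrom = defaultdict(list)
    for marker_id, chrom, pos in markers:
        by_chrom[chrom].append((pos, marker_id))

    kept = []
    for chrom in sorted(by_chrom):
        last_pos = None
        # Walk each chromosome left to right and keep a marker only if it
        # is at least min_gap_bp away from the previously kept marker.
        for pos, marker_id in sorted(by_chrom[chrom]):
            if last_pos is None or pos - last_pos >= min_gap_bp:
                kept.append(marker_id)
                last_pos = pos
    return kept

# Toy panel with made-up marker ids and positions.
panel = [
    ("rs_a", "1", 100_000),
    ("rs_b", "1", 150_000),  # only 50 kb after rs_a, so it is dropped
    ("rs_c", "1", 400_000),
    ("rs_d", "2", 50_000),
    ("rs_e", "2", 60_000),   # only 10 kb after rs_d, so it is dropped
]
thinned = thin_markers(panel)
print(thinned)
```

Physical distance is only a proxy for independence. Pruning on measured linkage disequilibrium is a common alternative, at the cost of needing genotype data for the reference samples.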
Step 2: solve the mixture
With a thinned, approximately independent AIM panel in hand, the mixture is fit by maximum-likelihood expectation-maximisation (MLE-EM). The EM iteration is started from multiple random initial points so that a single run stuck in a local maximum does not determine the result, and the run with the highest likelihood is retained.
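The following is a minimal sketch of what an EM fit with random restarts can look like for a single individual. It assumes a simple binomial model in which each genotype (0, 1 or 2 copies of an allele) is drawn from the mixture-weighted reference frequency. The names `em_fit` and `log_likelihood`, the simulated data, and the number of restarts are illustrative assumptions, not the analyser's actual code.

```python
import numpy as np

def log_likelihood(q, g, F):
    """Binomial log-likelihood of genotypes g (0/1/2) under mixture q."""
    p = q @ F  # expected allele frequency at each marker
    return float(np.sum(g * np.log(p) + (2 - g) * np.log1p(-p)))

def em_fit(g, F, n_starts=5, n_iter=500, tol=1e-9, seed=0):
    """EM for mixture proportions with random restarts (illustrative sketch).

    g: (J,) genotype counts; F: (K, J) reference allele frequencies,
    assumed to lie strictly between 0 and 1.
    """
    rng = np.random.default_rng(seed)
    K = F.shape[0]
    best_q, best_ll = None, -np.inf
    for _ in range(n_starts):
        q = rng.dirichlet(np.ones(K))  # random point on the simplex
        ll = log_likelihood(q, g, F)
        for _ in range(n_iter):
            p = q @ F
            # E-step: probability that an allele copy came from population k,
            # separately for copies carrying the allele and copies lacking it.
            resp_alt = q[:, None] * F / p
            resp_ref = q[:, None] * (1 - F) / (1 - p)
            # M-step: average the responsibilities over all 2J allele copies.
            q = (resp_alt * g + resp_ref * (2 - g)).sum(axis=1) / (2 * g.size)
            new_ll = log_likelihood(q, g, F)
            converged = new_ll - ll < tol
            ll = new_ll
            if converged:
                break
        if ll > best_ll:  # retain the best run across restarts
            best_q, best_ll = q, ll
    return best_q, best_ll

# Simulated example: 3 reference populations, 5000 markers.
rng = np.random.default_rng(42)
F = rng.uniform(0.05, 0.95, size=(3, 5000))
true_q = np.array([0.6, 0.3, 0.1])
g = rng.binomial(2, true_q @ F)
q_hat, ll = em_fit(g, F)
print(np.round(100 * q_hat, 1))  # percentages, one per population
```

Each update keeps the proportions non-negative and summing to one, so no separate projection onto the simplex is needed.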
Step 3: tell signal from noise
Even after thinning, an MLE solver will hand back a small non-zero percentage for almost every reference population, because sampling noise produces tiny apparent contributions from populations that are not really there. To separate real ancestry from that noise, the analyser runs bootstrap resampling and computes the empirical standard error of each population's percentage. A component is considered real only when it is statistically distinguishable from zero under a Wald test, that is, when the estimate divided by its bootstrap standard error exceeds the standard significance threshold. Components that fail the test are merged into an "Unassigned" bucket on your display, rather than being shown with a misleading 1% or 2%.
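A minimal sketch of a bootstrap Wald filter follows. Several details are assumptions made for illustration: the bootstrap resamples markers with replacement, the critical value defaults to 1.96, and `fit_mixture` is a compact single-start EM standing in for the real solver. The source does not specify any of these.

```python
import numpy as np

def fit_mixture(g, F, n_iter=150):
    """Compact single-start EM for mixture proportions (stand-in solver)."""
    q = np.full(F.shape[0], 1.0 / F.shape[0])
    for _ in range(n_iter):
        p = q @ F
        resp = F * g / p + (1 - F) * (2 - g) / (1 - p)
        q = (q[:, None] * resp).sum(axis=1) / (2 * g.size)
    return q

def bootstrap_wald(g, F, n_boot=100, z_crit=1.96, seed=0):
    """Keep components whose Wald z-score exceeds z_crit (illustrative sketch).

    Returns (reported, unassigned, z): the proportions that pass the test,
    the total mass of those that fail, and the per-population z-scores.
    """
    rng = np.random.default_rng(seed)
    q_hat = fit_mixture(g, F)
    J = g.size
    boots = np.empty((n_boot, F.shape[0]))
    for b in range(n_boot):
        idx = rng.integers(0, J, size=J)  # resample markers with replacement
        boots[b] = fit_mixture(g[idx], F[:, idx])
    se = boots.std(axis=0, ddof=1)        # empirical standard error
    z = q_hat / np.maximum(se, 1e-6)      # floor guards against a zero SE
    keep = z > z_crit
    reported = np.where(keep, q_hat, 0.0)
    unassigned = float(q_hat[~keep].sum())
    return reported, unassigned, z

# Simulated example: two real components, two populations that are absent.
rng = np.random.default_rng(7)
F = rng.uniform(0.05, 0.95, size=(4, 2000))
true_q = np.array([0.7, 0.3, 0.0, 0.0])
g = rng.binomial(2, true_q @ F)
reported, unassigned, z = bootstrap_wald(g, F)
print(np.round(100 * reported, 1), round(100 * unassigned, 1))
```

The reported percentages plus the Unassigned share still add up to 100, so filtering changes how the total is labelled rather than how much of it there is.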
Step 4: fix the admixed AMR reference
The 1000 Genomes AMR (Admixed American) panel is itself ancestrally heterogeneous. On average it is roughly 53% European, 9% African, and 38% Native American. Without a correction for that, every user with even a small Native American component shows correlated ghost contributions from EUR and AFR that are entirely an artefact of the reference panel.
Haeckel performs an in-silico deconvolution on the AMR centroid, solving for the underlying Native American vertex by orthogonalising against the EUR and AFR centroids. The result is a cleaner picture for users with admixed American ancestry, and it removes the spurious ghost ancestry that earlier versions of the analyser produced for users with no AMR component at all.
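To illustrate the idea, here is a minimal sketch that recovers a Native American allele-frequency vector by plain linear unmixing of the AMR centroid, using the rough average proportions quoted above. This is a simplified stand-in for the orthogonalisation step, not a reproduction of it. The function name `deconvolve_amr`, the fixed weights, and the simulated centroids are assumptions for illustration.

```python
import numpy as np

def deconvolve_amr(amr, eur, afr, w_eur=0.53, w_afr=0.09):
    """Solve amr = w_eur*eur + w_afr*afr + w_nat*nat for nat (sketch).

    amr, eur, afr: (J,) allele-frequency centroids over the same markers.
    The weights are the rough panel averages and are treated as known.
    """
    w_nat = 1.0 - w_eur - w_afr  # 0.38 with the default weights
    nat = (amr - w_eur * eur - w_afr * afr) / w_nat
    # Noise in real centroids can push frequencies outside [0, 1].
    return np.clip(nat, 0.0, 1.0)

# Simulated centroids: build an admixed AMR panel from known components.
rng = np.random.default_rng(1)
eur = rng.uniform(0.05, 0.95, size=1000)
afr = rng.uniform(0.05, 0.95, size=1000)
nat_true = rng.uniform(0.05, 0.95, size=1000)
amr = 0.53 * eur + 0.09 * afr + 0.38 * nat_true

nat_est = deconvolve_amr(amr, eur, afr)
print(float(np.max(np.abs(nat_est - nat_true))))  # recovery error on noise-free data
```

With noise-free inputs the recovery is exact up to floating-point error. On real data the admixture proportions vary between individuals and subpopulations, which is one reason a fixed-weight subtraction like this is only an approximation.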
- Alexander DH et al. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Research.
- Patterson N et al. (2006). Population structure and eigenanalysis. PLOS Genetics.
- 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature.
- Bergström A et al. (2020). Insights into human genetic variation and population history from 929 diverse genomes. Science.