Methods

Ancestry inference v4.0

The full pipeline that turns your raw genotypes into a 62-population ancestry vector: spatial thinning, MLE-EM, bootstrap Wald testing, and AMR deconvolution.

8 min read · updated Apr 19, 2026

Ancestry inference looks like a one-step problem from the outside ("what is my mix?") but is in fact a pipeline of four stages, each addressing a specific failure mode of the naive approach. This article walks through each stage, pairing the failure mode it fixes with the chosen solution.

Stage 1: spatial thinning of the AIM panel

The reference panel begins with a large set of ancestry-informative markers (AIMs) drawn from public reference datasets, covering over 60 subpopulations across all major continental ancestries. Many of those markers sit physically close together on the same chromosome and are inherited together in linkage blocks, so the information they carry overlaps rather than adds up. Treating linked markers as independent inflates statistical confidence and biases the inferred ancestry vector. We thin the panel with a spatial window, retaining the marker with the highest cross-population variance in each window; the resulting near-independent set is the input to the EM solver.
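The thinning rule described above can be sketched in a few lines. This is an illustration only: the fixed-grid windows, the 500 kb window size, the dictionary layout of a marker, and the name thin_markers are all assumptions, since the article does not state the actual window definition or data format.

```python
from statistics import pvariance

def thin_markers(markers, window_bp=500_000):
    """Keep one marker per (chromosome, window): the one whose allele
    frequency has the highest variance across reference populations.
    window_bp is an assumed value, not the pipeline's real setting."""
    best = {}  # (chrom, window index) -> (variance, marker)
    for m in markers:
        key = (m["chrom"], m["pos"] // window_bp)
        var = pvariance(m["freqs"])  # cross-population variance
        if key not in best or var > best[key][0]:
            best[key] = (var, m)
    return sorted((m for _, m in best.values()),
                  key=lambda m: (m["chrom"], m["pos"]))

# Hypothetical markers: freqs holds one allele frequency per population.
markers = [
    {"id": "rs1", "chrom": "1", "pos": 100_000, "freqs": [0.10, 0.20, 0.15]},
    {"id": "rs2", "chrom": "1", "pos": 120_000, "freqs": [0.05, 0.90, 0.50]},
    {"id": "rs3", "chrom": "1", "pos": 900_000, "freqs": [0.30, 0.60, 0.40]},
    {"id": "rs4", "chrom": "2", "pos": 110_000, "freqs": [0.20, 0.25, 0.20]},
]
kept_ids = [m["id"] for m in thin_markers(markers)]
print(kept_ids)  # ['rs2', 'rs3', 'rs4']
```

Here rs1 and rs2 fall in the same window, and rs2 survives because its frequencies differ far more between populations.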

Stage 2: MLE-EM mixture inference

With a properly thinned panel, the inference itself is a maximum-likelihood expectation-maximisation problem. Each iteration alternates between estimating the expected ancestry assignments per locus and updating the population proportion vector to maximise the joint likelihood. We initialise from multiple random starts to escape local maxima, retaining the run with the highest converged log-likelihood.
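As a rough illustration of the EM loop with multiple random starts, the sketch below assumes a standard binomial admixture model: reference allele frequencies are held fixed, genotypes are coded 0, 1 or 2, and each of the two alleles at a locus is drawn from population k with probability q[k]. The function names, iteration cap, tolerance and number of starts are assumptions, not the pipeline's actual settings, and all reference frequencies are assumed to lie strictly between 0 and 1.

```python
import math
import random

def log_likelihood(q, genotypes, freqs):
    """Binomial log-likelihood (constant terms dropped) of mixture q."""
    ll = 0.0
    for j, g in enumerate(genotypes):
        p = sum(q[k] * freqs[k][j] for k in range(len(q)))
        ll += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return ll

def em_once(genotypes, freqs, q, max_iter=500, tol=1e-9):
    """Run EM from one starting point; returns (q, log-likelihood)."""
    K, J = len(freqs), len(genotypes)
    ll = log_likelihood(q, genotypes, freqs)
    for _ in range(max_iter):
        expected = [0.0] * K
        for j, g in enumerate(genotypes):
            p = sum(q[k] * freqs[k][j] for k in range(K))
            for k in range(K):
                f = freqs[k][j]
                # E-step: expected count of this locus's two alleles
                # that originate from population k.
                expected[k] += (g * q[k] * f / p
                                + (2 - g) * q[k] * (1 - f) / (1 - p))
        # M-step: proportions are expected counts over all 2J alleles.
        q = [e / (2 * J) for e in expected]
        new_ll = log_likelihood(q, genotypes, freqs)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    return q, ll

def fit_mixture(genotypes, freqs, n_starts=5, seed=0):
    """Keep the random start with the highest converged log-likelihood."""
    rng = random.Random(seed)
    best_q, best_ll = None, -math.inf
    for _ in range(n_starts):
        raw = [rng.random() + 1e-6 for _ in freqs]
        q0 = [r / sum(raw) for r in raw]
        q, ll = em_once(genotypes, freqs, q0)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q, best_ll

# Toy data: 20 loci, two populations, genotypes typical of population 0.
freqs = [[0.9] * 20, [0.1] * 20]
genotypes = [2] * 20
q_hat, ll_hat = fit_mixture(genotypes, freqs)
```

On this toy input the estimate converges to almost all weight on the first population, as expected for a sample homozygous for the allele that population carries at 90% frequency.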

Stage 3: bootstrap Wald test for signal versus noise

A maximum-likelihood solver almost always returns a small non-zero contribution for every reference population, even when the true contribution is zero. To distinguish signal from noise, the pipeline runs bootstrap resamples (each warm-started from the base solution) and computes the empirical standard error per population. A component is considered real only when its proportion, divided by that standard error, clears the standard Wald threshold, that is, when it is statistically distinguishable from zero. Components that fail the test are merged into an "Unassigned" bucket on the display vector.
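The test can be sketched as follows. Several details are assumptions made for illustration: that resampling is done over loci, that 200 replicates are used, and that the threshold is z = 1.96 (the article says only "standard Wald threshold"). The fit argument is a hypothetical callable standing in for the EM solver; it takes a resampled list of loci and an initial vector and returns a proportion vector.

```python
import random
import statistics

def bootstrap_wald(loci, base_q, fit, labels,
                   n_boot=200, z_crit=1.96, seed=0):
    """Return a display dict: components whose Wald z-score exceeds
    z_crit keep their base proportion, the rest go to 'Unassigned'.
    n_boot and z_crit are assumed values, not documented settings."""
    rng = random.Random(seed)
    n = len(loci)
    replicates = []
    for _ in range(n_boot):
        # Resample loci with replacement, then refit warm-started
        # from the base solution.
        sample = [loci[rng.randrange(n)] for _ in range(n)]
        replicates.append(fit(sample, list(base_q)))

    display, unassigned = {}, 0.0
    for k, label in enumerate(labels):
        # Empirical standard error: spread of the bootstrap estimates.
        se = statistics.stdev([rep[k] for rep in replicates])
        z = base_q[k] / se if se > 0 else float("inf")
        if base_q[k] > 0 and z > z_crit:
            display[label] = base_q[k]
        else:
            unassigned += base_q[k]
    display["Unassigned"] = unassigned
    return display
```

Warm-starting each replicate from base_q keeps the refits cheap, because a resampled dataset usually has its optimum close to the base solution.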

Stage 4: in-silico AMR deconvolution

The 1000 Genomes AMR (Admixed American) panel is itself ancestrally heterogeneous, carrying substantial European and African fractions in addition to its Native American signal. Without correction, every user with a small Native American component shows correlated ghost contributions from EUR and AFR that come entirely from the composition of the reference panel rather than from the user. The pipeline orthogonalises the AMR centroid against the EUR and AFR centroids and re-inverts to express the user's mixture in the corrected coordinate system. This removes the structural ghost correlation that earlier versions of the analyser produced.
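The article does not say which metric is used or how the re-inversion is carried out, so the following is only one plausible reading of the linear algebra. It assumes each centroid is a plain vector of allele frequencies, uses an ordinary least-squares projection to split the AMR centroid into a part explained by EUR and AFR plus an orthogonal residual, and then rewrites a mixture in the basis (EUR, AFR, residual). All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalise_amr(eur, afr, amr):
    """Split amr = alpha*eur + beta*afr + perp, with perp orthogonal
    to both eur and afr (least-squares projection onto their span)."""
    ee, ea, aa = dot(eur, eur), dot(eur, afr), dot(afr, afr)
    em, am = dot(eur, amr), dot(afr, amr)
    det = ee * aa - ea * ea  # zero only if eur and afr are collinear
    alpha = (em * aa - am * ea) / det
    beta = (am * ee - em * ea) / det
    perp = [m - alpha * e - beta * a for e, a, m in zip(eur, afr, amr)]
    return alpha, beta, perp

def reexpress(q_eur, q_afr, q_amr, alpha, beta):
    """Coefficients of the same point in the (eur, afr, perp) basis."""
    return q_eur + alpha * q_amr, q_afr + beta * q_amr, q_amr

# Toy centroids over three markers (illustrative values only).
eur, afr, amr = [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [2.0, 3.0, 2.0]
alpha, beta, perp = orthogonalise_amr(eur, afr, amr)
coeffs = reexpress(0.5, 0.1, 0.4, alpha, beta)
```

In this basis the residual axis carries only the part of the AMR centroid that EUR and AFR cannot explain, so weight on it no longer drags EUR and AFR weight along with it. The coefficients are coordinates rather than percentages and would still need rescaling before display.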

Two outputs

Each report carries two ancestry vectors. The display vector reports only what the data support: the components that pass the Wald test, plus an Unassigned slice for the components that failed it. The PRS (polygenic risk score) vector applies a hard 2% cutoff and re-projects back to the 1000 Genomes AMR space, because the linkage disequilibrium (LD) reference panels used for PRS expect that coordinate system. Use the display vector to read your ancestry; the PRS vector feeds downstream calibration automatically and needs no interpretation on your part.
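The hard cutoff on the PRS vector might look like the sketch below. Renormalising the surviving components to sum to 1 and treating exactly 2% as passing are both assumptions, and the re-projection into the 1000 Genomes AMR space is left out because the article does not describe it. The name prs_vector is hypothetical.

```python
def prs_vector(q, cutoff=0.02):
    """Drop components below the cutoff and rescale the rest to sum to 1.
    Renormalisation is an assumption; the source states only the cutoff."""
    kept = {pop: v for pop, v in q.items() if v >= cutoff}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no component reaches the cutoff")
    return {pop: v / total for pop, v in kept.items()}

example = prs_vector({"EUR": 0.75, "AFR": 0.24, "EAS": 0.01})
```

Unlike the display vector, nothing is moved to an Unassigned slice here: the 1% EAS component is simply discarded and the remaining weights are scaled up.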

Validation

On the 1000 Genomes validation samples that were held out from reference construction, v4.0 returns the expected primary ancestry at 96 to 100% for most unadmixed individuals (YRI returns AFR 100%, CHB returns EAS 100%), with BEB lower at SAS 88%. It also produces stable admixed profiles for known admixed samples (PUR returns EUR 53%, AFR 24%, NAT 23%; ASW returns AFR 64%, EUR 28%, NAT 8%). The previous version, v3.1, produced spurious 3 to 7% ghost contributions in unadmixed European samples; the v4.0 pipeline eliminates them.
