LD reference panels: how they are computed and why size matters
How Haeckel computes and ships LD reference panels for all five 1000 Genomes superpopulations, and how the mixing parameter caps cross-ancestry amplification.
Every Bayesian polygenic-score method needs a linkage-disequilibrium reference matrix that captures how alleles at nearby SNPs co-occur in the user's ancestry. Without a properly matched LD reference, the inflation of marginal GWAS effect sizes by LD goes uncorrected and the score is biased.
Per-population, block-by-block construction
The reference VCF is partitioned into approximately independent linkage-disequilibrium blocks per chromosome arm. For each major population and each block, the analyser computes the SNP-by-SNP correlation matrix using the population's subset of samples, retains only entries above a sparsity threshold to keep the matrix tractable, and stores the result in a binary format addressable by chromosome and block coordinates.
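The per-block step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Haeckel's implementation: the function name `block_ld`, the dosage encoding (0, 1, 2), the threshold value, and the dictionary layout are all assumptions made for the example, and the binary storage format is not modelled.

```python
import math

def block_ld(genotypes, threshold=0.1):
    """Sparse SNP-by-SNP Pearson correlation for one LD block.

    genotypes: list of per-sample rows, each a list of allele dosages (0, 1, 2).
    Returns {(i, j): r} for i <= j. The diagonal is always kept; off-diagonal
    entries are kept only when |r| >= threshold (hypothetical sparsity rule).
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    centred, norms = [], []
    for j in range(n_snps):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n_samples
        c = [x - mean for x in col]
        centred.append(c)
        norms.append(math.sqrt(sum(x * x for x in c)))

    sparse = {}
    for i in range(n_snps):
        for j in range(i, n_snps):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # monomorphic SNP: correlation is undefined, skip it
            dot = sum(a * b for a, b in zip(centred[i], centred[j]))
            r = dot / (norms[i] * norms[j])
            if i == j or abs(r) >= threshold:
                sparse[(i, j)] = r
    return sparse

# Toy block: 4 samples from one population, 3 SNPs.
# SNP 0 and SNP 1 are in perfect LD; SNP 2 is only weakly correlated with them.
toy = [
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
    [1, 1, 1],
]
ld = block_ld(toy, threshold=0.6)
```

In a real pipeline the same function would be called once per population and per block, with the genotype rows restricted to that population's samples.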
The LD-mixing parameter
When a Bayesian PRS method is applied to a user whose ancestry does not match any single reference panel exactly (admixed or multi-ancestry users), the LD matrix needs to be a mixture of the matched panels. The mixing is parameterised by s, which ranges from 0 to 1: at s = 0 the matrix collapses to the identity, at s = 1 it is entirely the population-specific matrix, and intermediate values blend the two. Higher s therefore trusts the population-specific LD structure more, while lower s shrinks toward the identity, which treats SNPs as uncorrelated.
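One simple way to realise the two endpoints described above is a linear blend, s * R_pop + (1 - s) * I. The text only fixes the behaviour at s = 0 and s = 1, so the linear form in between is an assumption, as is the helper name `mix_ld`.

```python
def mix_ld(r_pop, s):
    """Blend a population-specific LD matrix with the identity.

    Assumed linear form: s * r_pop + (1 - s) * I.
    s = 0 returns the identity; s = 1 returns r_pop unchanged.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    n = len(r_pop)
    return [
        [s * r_pop[i][j] + (1.0 - s) * (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]

# Toy population-specific LD matrix for two SNPs.
r_pop = [[1.0, 0.8],
         [0.8, 1.0]]

identity_like = mix_ld(r_pop, 0.0)   # [[1.0, 0.0], [0.0, 1.0]]
halfway = mix_ld(r_pop, 0.5)         # [[1.0, 0.4], [0.4, 1.0]]
population = mix_ld(r_pop, 1.0)      # [[1.0, 0.8], [0.8, 1.0]]
```

Because a correlation matrix has ones on its diagonal, the blend leaves the diagonal untouched and simply scales every off-diagonal correlation by s.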
Setting s too high amplifies any mismatch between the user's true LD and the reference. Haeckel therefore defaults to a conservative value of s for cross-ancestry users, prioritising calibration safety over peak in-sample accuracy. For users whose ancestry matches a single reference panel exactly, Haeckel raises the cap.
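A two-SNP toy calculation shows why a high s is risky when the reference is mismatched. Everything here is an illustrative assumption: the LD values, the linear shrinkage of the off-diagonal by s, and the use of a plain linear solve as a stand-in for a Bayesian PRS method. It is not how Haeckel estimates effects.

```python
import math

def joint_effects_2snp(marginal, r_ref, s):
    """Solve R_s x = marginal for two SNPs.

    R_s has ones on the diagonal and s * r_ref off the diagonal
    (assumed linear shrinkage toward the identity).
    """
    r = s * r_ref
    det = 1.0 - r * r
    x1 = (marginal[0] - r * marginal[1]) / det
    x2 = (marginal[1] - r * marginal[0]) / det
    return x1, x2

def error_norm(estimate, truth):
    """Euclidean distance between estimated and true joint effects."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))

# True joint effects: only SNP 0 is causal.
true_joint = (1.0, 0.0)

# The user's true LD between the two SNPs is 0.5, so the marginal
# (GWAS-style) effects are R_true @ true_joint = (1.0, 0.5).
r_true = 0.5
marginal = (
    true_joint[0] + r_true * true_joint[1],
    r_true * true_joint[0] + true_joint[1],
)

# The reference panel overstates the LD between the two SNPs.
r_ref = 0.9

errors = {
    s: error_norm(joint_effects_2snp(marginal, r_ref, s), true_joint)
    for s in (0.0, 0.5, 1.0)
}
```

In this toy setting the error is smallest at the intermediate value s = 0.5 and largest at s = 1, where the mismatched reference is trusted fully. At s = 0 the estimate is simply the uncorrected marginal effects.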
- Privé F et al. (2021). High-resolution portability of 245 polygenic scores when derived and applied in the same cohort. Genetic Epidemiology.
- Wang Y et al. (2020). Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nature Communications.