Methods

Data quality control: call rate, heterozygosity, and Ti/Tv

The four metrics Haeckel computes on every uploaded file to flag genotyping errors, sample contamination, and array-platform problems before any downstream analysis runs.

6 min read · updated Apr 19, 2026

Every genome uploaded to Haeckel goes through a four-metric quality-control (QC) pass before any analyser touches it. The QC step does not modify the data; it produces a quality grade (pass / warn / fail) plus a per-metric explanation that the user can inspect. Downstream analysers take the grade as input and may degrade gracefully: polygenic risk scoring (PRS), for example, tolerates moderate missingness, whereas phenotype prediction does not.
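One way to picture the QC output is as a small report object holding a grade and an explanation per metric. The sketch below is illustrative only: the names `Grade`, `MetricResult`, and `QCReport` are hypothetical, and the rule that the overall grade equals the worst per-metric grade is an assumption, not something this article specifies.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class Grade(IntEnum):
    """Quality grades, ordered so that a larger value is worse."""
    PASS = 0
    WARN = 1
    FAIL = 2


@dataclass(frozen=True)
class MetricResult:
    """Outcome of one QC metric (hypothetical structure)."""
    name: str
    grade: Grade
    explanation: str


@dataclass(frozen=True)
class QCReport:
    """Read-only QC report; the genotype data itself is never modified."""
    metrics: Tuple[MetricResult, ...]

    @property
    def overall(self) -> Grade:
        # Assumption: the overall grade is the worst per-metric grade.
        return max((m.grade for m in self.metrics), default=Grade.PASS)


report = QCReport((
    MetricResult("call_rate", Grade.WARN, "call rate 93%"),
    MetricResult("ti_tv", Grade.PASS, "ratio within platform range"),
))
```

A downstream analyser could then branch on `report.overall` and decide whether to run, run with a caveat, or refuse.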

Metric 1: call rate

The call rate is the fraction of array positions for which a confident genotype was reported. Above 95% the file passes. Between 90% and 95% the file is flagged as warn: many analyses can still run, but PRS calibration may be biased, and the user is notified. Below 90% the file fails and the user is asked to re-upload it or to download it again from the source.
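The thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not Haeckel's implementation: the function names and the set of no-call encodings are assumptions. The text does not say which grade applies at exactly 90% or 95%, so this sketch places both boundaries in the warn band.

```python
from typing import Iterable

# Assumed encodings for "no confident genotype"; real files vary by vendor.
NO_CALLS = {"--", "00", "NC", ""}


def call_rate(genotypes: Iterable[str]) -> float:
    """Fraction of positions with a called genotype."""
    total = 0
    called = 0
    for genotype in genotypes:
        total += 1
        if genotype not in NO_CALLS:
            called += 1
    if total == 0:
        raise ValueError("no genotypes supplied")
    return called / total


def grade_call_rate(rate: float) -> str:
    """Map a call rate to pass / warn / fail using the thresholds above."""
    if rate > 0.95:
        return "pass"
    if rate >= 0.90:  # assumption: exact boundaries count as warn
        return "warn"
    return "fail"
```

For example, a file with three called positions out of four has a call rate of 0.75 and would fail.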

Metric 2: heterozygosity rate

The heterozygosity rate is the fraction of called genotypes that are heterozygous. A typical value sits around 0.30-0.34 for outbred individuals. A value well above this range (above 0.40) suggests sample contamination from another donor. A value well below it (below 0.25) suggests recent consanguinity (high F_ROH) or, more rarely, a technical issue such as PCR-amplification failure of one allele class.
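As a rough sketch, the heterozygosity rate can be computed from two-letter genotype strings. The function names and the genotype encoding are assumptions. The text does not say how values between the typical range and the two cut-offs (0.25-0.30 and 0.34-0.40) are treated, so this sketch leaves them unflagged.

```python
from typing import Iterable

BASES = set("ACGT")


def heterozygosity_rate(genotypes: Iterable[str]) -> float:
    """Fraction of called diploid SNP genotypes whose two alleles differ."""
    called = 0
    het = 0
    for genotype in genotypes:
        # Skip no-calls, indel codes, and single-allele (hemizygous) calls.
        if len(genotype) != 2 or not set(genotype) <= BASES:
            continue
        called += 1
        if genotype[0] != genotype[1]:
            het += 1
    if called == 0:
        raise ValueError("no called genotypes")
    return het / called


def interpret_heterozygosity(rate: float) -> str:
    """Apply the two cut-offs quoted above (0.40 and 0.25)."""
    if rate > 0.40:
        return "possible contamination"
    if rate < 0.25:
        return "possible consanguinity or allele dropout"
    return "not flagged"
```

Restricting the input to autosomal markers is a sensible precaution, since hemizygous X and Y calls in males would otherwise pull the rate down.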

Metric 3: Ti/Tv ratio

The transition-to-transversion ratio is the ratio of A↔G or C↔T variants (transitions) to A↔C, A↔T, G↔C, or G↔T variants (transversions). In whole-genome data the Ti/Tv ratio sits around 2.0-2.1; in chip-array data, where the array preferentially samples coding-region variants that are enriched for transitions, the ratio is higher (around 2.5-3.0). A Ti/Tv far from the platform-expected value indicates systematic genotyping error.
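The classification can be sketched by noting that transitions stay within a chemical class (purine A/G or pyrimidine C/T), while transversions cross between classes. The sketch below assumes variants arrive as hypothetical (ref, alt) single-base pairs. It only computes the ratio, because the article gives no numeric tolerance for "far from the platform-expected value".

```python
from typing import Iterable, Tuple

BASES = {"A", "C", "G", "T"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T, i.e. both bases in the same chemical class."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(variants: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over single-base variants."""
    ti = 0
    tv = 0
    for ref, alt in variants:
        # Ignore indels, no-calls, and non-variant sites.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return ti / tv
```

With three transitions and one transversion, for instance, the function returns 3.0.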

Metric 4: per-chromosome variant count

The QC analyser counts called variants per chromosome and flags any chromosome with a count more than three standard deviations below the cohort mean for the same array platform. A chromosome with near-zero coverage points to either a chromosome-specific failure of the array (rare) or a sex-discordant sample where the user's reported sex disagrees with the genotype-inferred sex (more common, often a clerical error during account setup).
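The three-standard-deviation rule can be sketched as follows. The function name and the input shapes (a dict of per-chromosome counts for the sample, and a dict of per-chromosome count lists for the cohort on the same platform) are assumptions. Using the sample standard deviation rather than the population one is also an assumption.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def low_count_chromosomes(
    sample_counts: Dict[str, int],
    cohort_counts: Dict[str, Sequence[int]],
    n_sd: float = 3.0,
) -> List[str]:
    """Chromosomes whose count is more than n_sd SDs below the cohort mean."""
    flagged = []
    for chrom, count in sample_counts.items():
        reference = cohort_counts.get(chrom, ())
        if len(reference) < 2:
            continue  # not enough cohort data to estimate a spread
        threshold = mean(reference) - n_sd * stdev(reference)
        if count < threshold:
            flagged.append(chrom)
    return flagged


cohort = {
    "1": [1000, 1010, 990, 1005, 995],
    "Y": [200, 210, 190, 205, 195],
}
sample = {"1": 1002, "Y": 3}
flagged = low_count_chromosomes(sample, cohort)
```

In this toy input the sample's chromosome 1 count is in line with the cohort, while its near-empty Y chromosome is flagged, the pattern expected when reported and inferred sex disagree.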

References
  • Anderson CA et al. (2010). Data quality control in genetic case-control association studies. Nature Protocols.
  • Marees AT et al. (2018). A tutorial on conducting genome-wide association studies. International Journal of Methods in Psychiatric Research.