Data quality control: call rate, heterozygosity, and Ti/Tv
The four metrics Haeckel computes on every uploaded file to flag genotyping errors, sample contamination, and array-platform problems before any downstream analysis runs.
Every genome uploaded to Haeckel goes through a four-metric quality-control (QC) pass before any analyser touches it. The QC step does not modify the data; it produces a quality grade (pass / warn / fail) plus a per-metric explanation that the user can inspect. Downstream analysers take the grade as input and may degrade gracefully: polygenic risk scoring (PRS), for example, tolerates moderate missingness, whereas phenotype prediction does not.
Metric 1: call rate
The call rate is the fraction of array positions for which a confident genotype was reported. Above 95% the file passes. Between 90% and 95% the file is graded warn: many analyses can still run, but PRS calibration may be biased, and the user is notified of this. Below 90% the file fails, and the user is asked to re-upload the file or to download a fresh copy from the source.
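The thresholds above can be sketched in a few lines of Python. This is an illustration only, not Haeckel's implementation: the function names are hypothetical, the "--" no-call marker is an assumed encoding (common in consumer raw-data exports), and treating exactly 90% and exactly 95% as warn is one reading of the text rather than a documented rule.

```python
from typing import Iterable

# Assumption: a no-call is encoded as "--". The encoding Haeckel uses
# internally is not stated in this article.
NO_CALL = "--"


def call_rate(genotypes: Iterable[str]) -> float:
    """Return the fraction of array positions with a reported genotype."""
    total = 0
    called = 0
    for genotype in genotypes:
        total += 1
        if genotype != NO_CALL:
            called += 1
    if total == 0:
        raise ValueError("no positions to evaluate")
    return called / total


def grade_call_rate(rate: float) -> str:
    """Map a call rate to pass / warn / fail.

    Assumption: the boundary values 0.90 and 0.95 both fall in the warn band.
    """
    if rate > 0.95:
        return "pass"
    if rate >= 0.90:
        return "warn"
    return "fail"
```

For example, a file with one no-call among four positions has a call rate of 0.75 and would be graded fail under these thresholds.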
Metric 2: heterozygosity rate
The heterozygosity rate is the fraction of called genotypes that are heterozygous. A typical value sits around 0.30-0.34 for outbred individuals. A value well above this range (above 0.40) suggests contamination of the sample with DNA from another donor. A value well below it (below 0.25) suggests recent consanguinity (high F_ROH, the fraction of the genome in runs of homozygosity) or, more rarely, a technical issue such as PCR-amplification failure of one allele class.
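A minimal sketch of this metric follows, assuming genotypes arrive as two-character strings such as "AG" with "--" for no-calls. The function names and the wording of the returned labels are hypothetical. The article does not say how values between 0.25 and 0.40 that fall outside the typical 0.30-0.34 band are handled, so the sketch simply raises no flag for them.

```python
from typing import Iterable

NO_CALL = "--"  # assumed no-call encoding


def heterozygosity_rate(genotypes: Iterable[str]) -> float:
    """Return the fraction of called diploid genotypes that are heterozygous."""
    called = 0
    heterozygous = 0
    for genotype in genotypes:
        # Skip no-calls and anything that is not a two-allele call
        # (for example, single-allele calls on male X or Y).
        if genotype == NO_CALL or len(genotype) != 2:
            continue
        called += 1
        if genotype[0] != genotype[1]:
            heterozygous += 1
    if called == 0:
        raise ValueError("no called genotypes to evaluate")
    return heterozygous / called


def interpret_heterozygosity(rate: float) -> str:
    """Apply the 0.40 and 0.25 cut-offs quoted in the text."""
    if rate > 0.40:
        return "high: possible contamination"
    if rate < 0.25:
        return "low: possible consanguinity or allele-specific failure"
    return "no flag"
```

Excluding no-calls from the denominator matters here: the metric is defined over called genotypes, so a low call rate should not by itself drag the heterozygosity rate down.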
Metric 3: Ti/Tv ratio
The transition-to-transversion ratio is the ratio of A↔G or C↔T variants (transitions) to A↔C, A↔T, G↔C, or G↔T variants (transversions). In whole-genome data the Ti/Tv ratio sits around 2.0-2.1; in chip-array data, where the array preferentially samples coding-region variants that are enriched for transitions, the ratio is higher (around 2.5-3.0). A Ti/Tv far from the platform-expected value indicates systematic genotyping error.
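The classification above can be written down directly. The sketch below assumes each variant is available as a (reference, alternate) pair of single bases; the function names are hypothetical. The article does not quantify "far from", so the `tolerance` parameter and its default of 0.3 are placeholders for illustration, not values Haeckel is known to use.

```python
from typing import Iterable, Tuple

BASES = set("ACGT")
# A<->G and C<->T are transitions; every other base change is a transversion.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

# Platform-expected ranges quoted in the text.
EXPECTED_TI_TV = {"wgs": (2.0, 2.1), "array": (2.5, 3.0)}


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Return transitions / transversions over single-nucleotide variants."""
    ti = 0
    tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Ignore indels, ambiguous bases, and non-variant sites.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if frozenset((ref, alt)) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


def ti_tv_is_suspicious(ratio: float, platform: str, tolerance: float = 0.3) -> bool:
    """Flag a ratio outside the expected range widened by `tolerance`.

    Assumption: `tolerance` is an illustrative value, not a documented one.
    """
    low, high = EXPECTED_TI_TV[platform]
    return ratio < low - tolerance or ratio > high + tolerance
```

Keeping the expected range keyed by platform reflects the point made in the text: the same ratio can be normal for array data and a warning sign for whole-genome data.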
Metric 4: per-chromosome variant count
The QC analyser counts called variants per chromosome and flags any chromosome whose count is more than three standard deviations below the cohort mean for the same array platform. A chromosome with a near-zero count points to either a chromosome-specific failure of the array (rare) or a sex-discordant sample, in which the user's reported sex disagrees with the genotype-inferred sex (more common, and often the result of a clerical error during account setup).
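The three-standard-deviation rule can be sketched as follows. The function name and the data layout (one dictionary of counts for the sample, one dictionary of per-chromosome cohort counts for the same platform) are assumptions made for illustration; how Haeckel stores cohort statistics is not described in this article.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def flag_low_chromosomes(
    sample_counts: Dict[str, int],
    cohort_counts: Dict[str, Sequence[int]],
    n_sd: float = 3.0,
) -> List[str]:
    """Return chromosomes whose called-variant count is more than `n_sd`
    sample standard deviations below the cohort mean for that chromosome."""
    flagged = []
    for chrom, count in sample_counts.items():
        reference = cohort_counts.get(chrom, [])
        # A standard deviation needs at least two cohort values.
        if len(reference) < 2:
            continue
        threshold = mean(reference) - n_sd * stdev(reference)
        if count < threshold:
            flagged.append(chrom)
    return flagged


# Usage with made-up numbers: chromosome Y has almost no calls.
cohort = {
    "1": [1000, 1010, 990, 1005, 995],
    "Y": [200, 210, 190, 205, 195],
}
sample = {"1": 1002, "Y": 3}
low = flag_low_chromosomes(sample, cohort)
```

Only the lower tail is tested, matching the text: an unusually high count on one chromosome is not treated as a failure by this metric.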
- Anderson CA et al. (2010). Data quality control in genetic case-control association studies. Nature Protocols.
- Marees AT et al. (2018). A tutorial on conducting genome-wide association studies. International Journal of Methods in Psychiatric Research.