Uploading your DNA
Where to get your raw DNA file from 23andMe, AncestryDNA, FTDNA, or a clinical WGS provider, and what happens after you upload it.
Haeckel reads raw DNA files from every major consumer testing service and from clinical sequencing providers. The pipeline accepts chip-array text exports from 23andMe, AncestryDNA, MyHeritage, and FamilyTreeDNA; VCF files from any clinical whole-genome (WGS) or whole-exome (WES) sequencing provider; and FASTQ files containing raw sequencing reads. Each format is auto-detected from its magic bytes plus a small heuristic on the column structure, so you do not need to declare the format yourself.
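As a rough illustration of how detection from magic bytes plus a column heuristic can work, here is a minimal Python sketch. It is not Haeckel's actual detector: the function name `detect_format` and the returned labels are hypothetical, and the column counts are taken from the header layouts quoted in this article.

```python
GZIP_MAGIC = b"\x1f\x8b"      # standard gzip signature
ZIP_MAGIC = b"PK\x03\x04"     # standard zip local-file signature


def detect_format(head: bytes) -> str:
    """Guess a file's format from its first few kilobytes.

    The returned labels are illustrative, not Haeckel's internal names.
    """
    # Step 1: magic bytes identify compressed containers.
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZIP_MAGIC):
        return "zip"

    # Step 2: plain text, so inspect the line structure.
    text = head.decode("utf-8", errors="replace")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    if lines[0].startswith("##fileformat=VCF"):
        return "vcf"
    # A FASTQ record starts with "@" and has "+" on its third line.
    if len(lines) >= 3 and lines[0].startswith("@") and lines[2].startswith("+"):
        return "fastq"

    # Step 3: chip-array exports. Skip "#" metadata lines, then count columns.
    body = [ln for ln in lines if not ln.startswith("#")]
    if not body:
        return "unknown"
    first = body[0]
    if len(first.split("\t")) == 4:
        return "chip_tab_4col"   # e.g. rsid, chromosome, position, genotype
    if len(first.split(",")) == 5:
        return "chip_csv_5col"   # e.g. rsid, chromosome, position, allele1, allele2
    return "unknown"
```

A compressed container would be opened first and the same function applied to the bytes of the file inside it.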
Where to get your file from each provider
- 23andMe: Account → Browse Raw Data → Download. The file arrives as a tab-separated UTF-8 text file inside a zip archive (~700,000 SNPs, ~20MB compressed). The header carries metadata lines starting with "#"; the data section starts at "rsid\tchromosome\tposition\tgenotype".
- AncestryDNA: Settings → Download Raw DNA Data → confirm with email. The file arrives as a comma-separated text file inside a zip archive (~700,000 SNPs after dedup, ~30MB compressed). The header includes a delivery agreement; the data starts at "rsid,chromosome,position,allele1,allele2".
- MyHeritage: DNA Tools → Manage DNA Kits → Download Raw Data. CSV format, similar to AncestryDNA in structure.
- FamilyTreeDNA: Family Finder → Download Raw Data → Build 37 or Build 38 (we accept both, as long as the build is declared).
- Clinical WGS providers (Nebula, Dante Labs, Veritas, Sano): ask for the VCF (variant calls) or FASTQ (raw reads). Most clinical labs deliver via a secure download link with a one-time password.
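If you want to sanity-check a 23andMe download before uploading, the layout described above (metadata lines starting with "#", then a four-column tab-separated section) can be read with a few lines of Python. This is only a sketch under that description; the `Call` type and `read_23andme` function are hypothetical names, not part of Haeckel.

```python
import io
from typing import Iterator, NamedTuple


class Call(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def read_23andme(text: str) -> Iterator[Call]:
    """Yield one Call per data row of an unzipped 23andMe-style export."""
    for raw in io.StringIO(text):
        line = raw.rstrip("\r\n")
        # Skip blank lines and "#" metadata lines in the header.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip the column-header row if it is present without a "#".
        if fields[0] == "rsid":
            continue
        if len(fields) != 4:
            raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
        rsid, chromosome, position, genotype = fields
        yield Call(rsid, chromosome, int(position), genotype)


# Tiny made-up sample, for illustration only.
sample = (
    "# example metadata line\n"
    "rsid\tchromosome\tposition\tgenotype\n"
    "rs123\t1\t1000\tAG\n"
    "rs456\tX\t2000\tC\n"
)
calls = list(read_23andme(sample))
```

Counting the yielded calls is a quick way to confirm the download is the full file rather than a truncated one.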
Size limits
A single uniform limit applies across every supported format: 4 GB per file. The upload widget uses S3-compatible multipart transfer, so the file is split into parts and uploaded in parallel; a dropped connection resumes from the last completed part rather than starting over. Typical file sizes for context:
- Chip arrays (23andMe / AncestryDNA / MyHeritage / FTDNA): 5–35 MB compressed. Well under the limit; uploads finish in seconds on a normal connection.
- Clinical WGS VCF (Nebula, Dante, Veritas, Sano): typically 1–4 GB depending on whether structural variants and gVCF blocks are included.
- FASTQ paired-end raw reads: each mate can run several GB. If R1 and R2 together exceed 4 GB, upload them as two separate files (the limit applies to each file, not to the pair) and the pipeline will pair them by filename convention.
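The article does not spell out which filename convention is used for pairing. As a sketch, assuming the common `_R1`/`_R2` (or `_1`/`_2`) tokens before the `.fastq`/`.fq` extension, pairing could look like this; `pair_fastq` and the regular expression are illustrative, not the pipeline's actual rule.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed convention: <stem>_R1.fastq.gz / <stem>_R2.fastq.gz, optionally
# with a trailing chunk number such as _001 before the extension.
MATE_RE = re.compile(
    r"^(?P<stem>.+?)_R?(?P<mate>[12])(?P<rest>(?:_\d+)?\.f(?:ast)?q(?:\.gz)?)$"
)


def pair_fastq(filenames: List[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (pairs, unpaired), where each pair is (R1 name, R2 name)."""
    groups: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    unpaired: List[str] = []
    for name in filenames:
        match = MATE_RE.match(name)
        if match is None:
            unpaired.append(name)
            continue
        # Files belong together when everything except the mate number agrees.
        groups[(match["stem"], match["rest"])][match["mate"]] = name

    pairs: List[Tuple[str, str]] = []
    for key in sorted(groups):
        mates = groups[key]
        if "1" in mates and "2" in mates:
            pairs.append((mates["1"], mates["2"]))
        else:
            unpaired.extend(mates.values())
    return pairs, sorted(unpaired)


pairs, unpaired = pair_fastq(
    ["a_R2.fastq.gz", "a_R1.fastq.gz", "b_R1.fastq.gz", "notes.txt"]
)
```

Under this assumed convention, keeping the two mate filenames identical apart from the R1/R2 token is what makes the pairing unambiguous.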
What happens after upload, in order
- Validation (under 1 second): the file is parsed against the expected schema for its detected format. A malformed header or an unrecognised chromosome label aborts the run with an actionable error message before anything is persisted.
- Encryption + persistence (under 5 seconds): the validated file is encrypted with a per-user key and stored to encrypted object storage. The encryption key is held in our key-management service and is never written to the application database.
- Parsing + ref/alt alignment (roughly 1 minute): the format-specific parser emits typed variant objects, and each variant is aligned against a public reference so strand orientations across different consumer arrays are normalised before anything downstream runs.
- Variant enrichment (3 to 5 minutes): the aligned variants are joined against a large reference database that carries population frequencies, ClinVar significance, GWAS effect sizes, and ancestry-informativeness scores.
- Parallel analyser execution (15 to 25 minutes): the seventeen analyser modules run concurrently. Ancestry inference and the Bayesian polygenic-score methods are the two long tails; the other analysers finish early and wait on them. Failures in one analyser are caught and reported per-module rather than aborting the whole run.
- Result persistence + notification (under 1 minute): outputs are upserted so re-processing overwrites cleanly, and you receive a notification through the platform plus an email if you have email notifications enabled.
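To make the strand-normalisation idea in the parsing step concrete, here is a simplified Python sketch. It assumes the reference and alternate alleles for a site are already known, and it handles single-nucleotide calls only; `normalise_strand` is a hypothetical helper, not the pipeline's implementation.

```python
from typing import Optional

# Maps each base to its complement on the opposite strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalise_strand(genotype: str, ref: str, alt: str) -> Optional[str]:
    """Return the genotype expressed on the reference strand, or None.

    None means the call cannot be resolved by allele comparison alone.
    """
    # Palindromic sites (A/T or C/G) look the same on both strands, so a
    # flip cannot be detected from the alleles; treat them as unresolved.
    if ref.translate(COMPLEMENT) == alt:
        return None

    allowed = {ref, alt}
    if set(genotype) <= allowed:
        return genotype              # already on the reference strand
    flipped = genotype.translate(COMPLEMENT)
    if set(flipped) <= allowed:
        return flipped               # reported on the opposite strand
    return None                      # no-call or unexpected allele
```

For example, at a site with reference A and alternate G, a reported genotype of "TC" is the reverse-strand spelling of "AG", so the sketch returns "AG". A real pipeline would need extra information, such as allele frequencies, to resolve the palindromic cases.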
The end-to-end run takes roughly 20 to 30 minutes for a typical chip-array file, almost all of it in the analyser stage. Whole-genome runs are also dominated by the analyser stage and finish in 45 to 90 minutes depending on depth of coverage and whether the file requires decompression and variant re-calling before the analytical stages can start.
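The per-module failure handling in the analyser stage can be sketched with Python's `asyncio.gather` and `return_exceptions=True`, which lets every module finish even when one raises. The analyser functions and the report shape below are hypothetical stand-ins, not Haeckel's modules.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Analyser = Callable[[List[str]], Awaitable[dict]]


async def run_analysers(analysers: Dict[str, Analyser], variants: List[str]) -> dict:
    """Run all analysers concurrently and report success or failure per module."""
    names = list(analysers)
    results = await asyncio.gather(
        *(analysers[name](variants) for name in names),
        return_exceptions=True,  # collect exceptions instead of aborting the run
    )
    report = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            report[name] = {"status": "failed", "error": str(result)}
        else:
            report[name] = {"status": "ok", "output": result}
    return report


# Two made-up analysers: one succeeds, one fails.
async def ancestry(variants: List[str]) -> dict:
    await asyncio.sleep(0)
    return {"n_variants": len(variants)}


async def broken(variants: List[str]) -> dict:
    raise ValueError("missing reference panel")


report = asyncio.run(
    run_analysers({"ancestry": ancestry, "broken": broken}, ["rs1", "rs2"])
)
```

Because `gather` waits for every task, the total wall time is set by the slowest module, which matches the "long tail" behaviour described for the analyser stage.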
Sex chromosome handling
The pipeline infers your biological sex from the X and Y chromosome variant counts independently of any self-reported value. A typical XY user shows around 50,000 Y-linked calls and roughly half the X-linked calls of an XX user. When the inferred sex disagrees with your profile setting, the result is flagged for you to resolve, since the disagreement is sometimes a clerical error and sometimes a true clinical signal (XYY, XXY, mosaic sex chromosomes, intersex variation).
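A deliberately simplified sketch of the count-and-compare logic follows. The threshold `min_y_calls`, the no-call markers, and the function names are all assumptions made for illustration; the sketch only separates XX from XY and leaves anything more complex to the disagreement flag.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed no-call markers ("--" and "0" are used by some consumer exports).
NO_CALL_CHARS = set("-0")


def count_sex_chromosome_calls(calls: Iterable[Tuple[str, str]]) -> Counter:
    """Count called (non-missing) genotypes on X and Y.

    `calls` yields (chromosome, genotype) pairs.
    """
    counts = Counter({"X": 0, "Y": 0})
    for chromosome, genotype in calls:
        if chromosome in ("X", "Y") and genotype and not (set(genotype) & NO_CALL_CHARS):
            counts[chromosome] += 1
    return counts


def infer_sex(y_calls: int, min_y_calls: int = 1000) -> str:
    """Return "XY" when enough Y-linked calls are present, else "XX".

    min_y_calls is a placeholder threshold, not a value used by the pipeline.
    """
    return "XY" if y_calls >= min_y_calls else "XX"


def compare_with_profile(inferred: str, profile: str) -> dict:
    """Flag the result for review when inference and profile disagree."""
    return {"inferred": inferred, "profile": profile, "flagged": inferred != profile}
```

The comparison never overrides the profile value; it only marks the result so that you can decide which side needs correcting.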
Re-uploading or re-processing
If a method improves or a new analyser ships, the platform offers to re-process your existing file rather than asking you to upload again. Re-processing reads the encrypted file from object storage, runs the new pipeline version, and upserts the results so you do not end up with duplicates. The Genome page exposes the pipeline version that produced each result, and an "update available" badge appears when a re-process would refresh anything materially.
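The "upsert so you do not end up with duplicates" behaviour can be illustrated with SQLite's `INSERT ... ON CONFLICT ... DO UPDATE`. The table layout, key columns, and values below are invented for the example and say nothing about Haeckel's actual storage.

```python
import sqlite3


def upsert_result(conn, user_id, analyser, pipeline_version, payload):
    """Insert a result, or overwrite the existing row for the same user and analyser."""
    conn.execute(
        """
        INSERT INTO results (user_id, analyser, pipeline_version, payload)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(user_id, analyser) DO UPDATE SET
            pipeline_version = excluded.pipeline_version,
            payload = excluded.payload
        """,
        (user_id, analyser, pipeline_version, payload),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE results (
        user_id TEXT NOT NULL,
        analyser TEXT NOT NULL,
        pipeline_version TEXT NOT NULL,
        payload TEXT NOT NULL,
        PRIMARY KEY (user_id, analyser)
    )
    """
)

# First run, then a re-process with a newer (made-up) pipeline version.
upsert_result(conn, "user-1", "ancestry", "1.0", '{"placeholder": 1}')
upsert_result(conn, "user-1", "ancestry", "1.1", '{"placeholder": 2}')
rows = conn.execute("SELECT pipeline_version, payload FROM results").fetchall()
```

After the second call the table still holds a single row for that user and analyser, now carrying the newer pipeline version, which is the property that makes re-processing safe to repeat.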
If you upload a new file from a different test (a 23andMe file followed later by a clinical WGS, for example), the new run replaces the old result set as the active one. The previous result is retained internally for comparison but no longer drives the active Genome page.
Common errors and how to resolve them
- "Header not recognised": the file is from a provider whose format we have not catalogued, or the file was edited before upload. Re-download from the provider without modification.
- "Less than 100,000 calls detected": the file is too sparse for meaningful analysis. Confirm the download is the full file rather than a sample.
- "Sex chromosome inference disagrees with profile": review the inferred sex on the result page and update either your profile or the analysis as appropriate.
- "Pipeline timeout": rare; usually a transient infrastructure issue. Re-trigger from the Data page and contact support if it recurs.