Embedding similarity for Networks

How a dense vector representation captures genome plus interests, and how cosine similarity surfaces compatible matches without revealing identifying information.

4 min read · updated Apr 19, 2026

Networks recommendations and AI candidate search both rest on the same primitive: one dense embedding per user that captures their ancestry composition, their haplogroups, their archaic introgression, key health and pharmacogenomic flags, and a bag of interests pulled from their profile bio. The embedding is produced by a commercial provider operating under a no-training contract, so the text we embed is not used to train future models.

Why cosine similarity

Cosine similarity measures the cosine of the angle between two vectors rather than the Euclidean distance between them. Two embeddings that point in the same direction have similarity 1, two that are orthogonal have similarity 0, and two that point in opposite directions have similarity -1. For high-dimensional embeddings of mixed-content data, cosine similarity is more robust than Euclidean distance because it is invariant to vector magnitude, which might otherwise be dominated by how much text the user wrote in their bio.
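To make the magnitude-invariance point concrete, here is a minimal sketch in plain Python. The vectors are toy three-dimensional values chosen for illustration, not real user embeddings, and the function name is our own.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative only).
u = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # u scaled by 2: same direction, larger magnitude
opposite = [-1.0, -2.0, -3.0]      # u negated: opposite direction

sim_same = cosine_similarity(u, same_direction)   # approximately 1
sim_opposite = cosine_similarity(u, opposite)     # approximately -1
euclid_same = math.dist(u, same_direction)        # non-zero despite identical direction

print(sim_same, sim_opposite, euclid_same)
```

Scaling a vector leaves its cosine similarity unchanged, while the Euclidean distance between the original and the scaled copy is clearly non-zero. That is the property the paragraph above relies on.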

What an embedding does and does not encode

The embedding includes ancestry components, haplogroup labels, archaic percentages, headline pharmacogenomic flags, and a bag of interest tokens pulled from your bio. It does not include your name, your email, your raw genotypes, your medical history, your location, or any other identifying field. On its own, the embedding cannot be used to reconstruct your identity; it reflects only your statistical profile, which is all we need to surface compatible matches without revealing anything you have not chosen to share.
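One way to enforce that split is an allowlist applied before any text is sent for embedding. The sketch below assumes a simple dictionary-shaped profile; the field names, the EMBEDDED_FIELDS tuple, and build_embedding_text are all hypothetical and are not taken from the production schema.

```python
# Hypothetical allowlist: only these profile fields are ever turned into
# embedding input. The field names are illustrative assumptions.
EMBEDDED_FIELDS = (
    "ancestry_components",
    "haplogroups",
    "archaic_percentages",
    "pharmacogenomic_flags",
    "interest_tokens",
)

def build_embedding_text(profile: dict) -> str:
    """Serialise only allowlisted fields; everything else is ignored."""
    parts = []
    for field in EMBEDDED_FIELDS:
        if field in profile:
            parts.append(f"{field}: {profile[field]}")
    return "\n".join(parts)

# Made-up example profile containing both embeddable and identifying fields.
profile = {
    "name": "Ada Example",
    "email": "ada@example.com",
    "location": "Lisbon",
    "ancestry_components": "Iberian 62%, Northwestern European 38%",
    "haplogroups": "mtDNA H1, Y-DNA R1b",
    "interest_tokens": "hiking, genealogy, chess",
}

embedding_text = build_embedding_text(profile)
print(embedding_text)
```

An allowlist fails closed: a field added to the profile later stays out of the embedding input until someone explicitly lists it, which is safer than trying to enumerate every identifying field to exclude.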

What an embedding can vs cannot leak

Modern commercial text embeddings are one-way in practice: there is no efficient algorithm that inverts an embedding back to the original text. An attacker who exfiltrated the full table of user embeddings would learn the relative similarity structure of the user cohort but could not recover any individual user's ancestry composition, haplogroup, or bio. They could, however, observe that user A and user B sit close together in embedding space, which is enough to deanonymise users who have already published their identity on another platform alongside identifiable profile content.

The mitigation is the privacy-flag system documented separately: hideFromAISearch removes a user from any embedding-distance query at the API layer, regardless of what an exfiltrated database might contain.
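As a rough illustration of filtering at the API layer, the sketch below drops flagged users before any similarity score is computed. Only the hideFromAISearch flag name comes from this article; the UserRecord type, the search function, and the sample data are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class UserRecord:
    user_id: str
    embedding: list
    hide_from_ai_search: bool = False  # mirrors the hideFromAISearch privacy flag

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query, records, top_k=10):
    """Rank visible users by cosine similarity to the query embedding."""
    # Filter first, so hidden users are never scored or returned.
    visible = [r for r in records if not r.hide_from_ai_search]
    scored = [(cosine(query, r.embedding), r.user_id) for r in visible]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [user_id for _, user_id in scored[:top_k]]

# Toy two-dimensional embeddings (illustrative only).
records = [
    UserRecord("a", [1.0, 0.0]),
    UserRecord("b", [0.8, 0.6]),
    UserRecord("c", [1.0, 0.0], hide_from_ai_search=True),
]

results = search([1.0, 0.0], records)
print(results)
```

User "c" has the same embedding as the query yet never appears in the results, because the flag is applied before ranking rather than as a filter on the output.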
