A compact genotype sketch for cross-platform sample matching

Giulgaz Muradova* and Fedor Konovalov

Independent Clinical Bioinformatics Laboratory, Moscow, Russia

gm [at] clinbio.ru

Abstract

Reliable comparison of sequencing datasets across laboratories has become increasingly important because patients often undergo repeated exome or genome sequencing on different platforms. Existing sample-matching approaches usually require dense genome-wide genotypes, tolerate sparse or uneven exome data poorly, generate impractically large signatures, or preserve enough genotype information to support unintended identity or relatedness analyses. We therefore developed a compact, lossy genotype sketch designed to combine exome compatibility, tolerance of no-calls and limited genotyping errors, a small hash-string footprint, and a deliberate reduction of recoverable variant-level information.

We selected 462 common, high-coverage, population-broad single-nucleotide polymorphisms (SNPs) from exome-accessible regions. Candidate variants were filtered for high allele count and intermediate allele frequency across multiple populations, restricted to protein-coding exons, and pruned by genomic distance to reduce linkage between markers. Genotypes at the selected loci were transformed using a custom many-to-one encoding scheme: the genotype string was shuffled, adjacent SNP pairs were encoded into three ambiguity-preserving classes, and the resulting class string was compressed into a fixed-length representation suitable for rapid comparison and two-dimensional barcode workflows. This design preserved similarity between samples while rendering direct recovery of individual SNP genotypes practically infeasible. In parallel, we evaluated established locality-sensitive hashing (LSH) approaches on the same genotype representations and found them either impractical due to signature size or less effective in discriminating matching from non-matching samples.
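
The shuffle, pair-encode, and compress steps can be illustrated with a minimal Python sketch. The abstract does not specify the pair-to-class mapping, the permutation, the no-call handling, or the compression format, so everything below other than the locus count is an assumption made for illustration: genotypes are coded as alternate-allele counts (0, 1, 2) with None for a no-call, the permutation comes from a fixed hypothetical seed, the nine possible genotype pairs are folded into three classes by a hypothetical (a + b) mod 3 rule, a fourth code marks pairs containing a no-call, and the class string is packed at 2 bits per pair and base64-encoded.

```python
import base64
import random

N_SNPS = 462   # number of loci reported in the abstract
MISSING = 3    # hypothetical code for a pair that contains a no-call


def pair_class(a, b):
    """Hypothetical many-to-one map: 9 genotype pairs -> 3 classes.

    Each class is shared by three different pairs, so a class value
    alone does not determine either genotype.
    """
    if a is None or b is None:
        return MISSING
    return (a + b) % 3


def make_sketch(genotypes, seed=12345):
    """Shuffle, pair-encode, and pack genotypes into a fixed-length string.

    `seed` is a hypothetical fixed value; all parties must use the
    same permutation for sketches to be comparable.
    """
    if len(genotypes) != N_SNPS:
        raise ValueError(f"expected {N_SNPS} genotypes, got {len(genotypes)}")

    # Fixed pseudo-random permutation of locus order.
    order = list(range(N_SNPS))
    random.Random(seed).shuffle(order)
    shuffled = [genotypes[i] for i in order]

    # Encode adjacent, non-overlapping pairs into class codes.
    classes = [pair_class(shuffled[i], shuffled[i + 1])
               for i in range(0, N_SNPS, 2)]

    # Pack 2 bits per class code into one big integer, then into bytes.
    value = 0
    for c in classes:
        value = (value << 2) | c
    n_bytes = (2 * len(classes) + 7) // 8
    packed = value.to_bytes(n_bytes, "big")

    return base64.urlsafe_b64encode(packed).decode("ascii")
```

Under these assumptions, 462 loci yield 231 pair codes, 58 bytes, and an 80-character string, which is small enough for a two-dimensional barcode. The modular rule is only a stand-in: it shows the many-to-one property, but makes no claim about the privacy or discrimination properties of the authors' actual encoding.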

We evaluated the approach using publicly available paired whole-genome and whole-exome data from the 1000 Genomes Project and multiple publicly available sequencing datasets of the reference sample NA12878 generated on different platforms. The resulting sketches consistently separated matching from non-matching samples and remained comparable under substantial missingness, supporting their use with heterogeneous exome data. The compact output enabled practical storage in laboratory reports and allowed straightforward report-to-sample verification without sharing raw genotypes.
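
How two sketches might be compared under missingness can be sketched as follows. The abstract does not state the comparison metric or any decision threshold, so this is an assumed approach rather than the authors' procedure: each sketch is taken to be decoded into a list of per-pair class codes (0, 1, 2, with a hypothetical code 3 for pairs affected by a no-call), concordance is computed only over pairs informative in both samples, and the thresholds `min_rate` and `min_compared` are placeholders.

```python
MISSING = 3  # hypothetical code for a pair that contains a no-call


def class_concordance(classes_a, classes_b):
    """Return (concordance rate, number of pairs compared).

    Pairs marked MISSING in either sample are skipped, so missingness
    reduces the number of comparisons rather than counting as a mismatch.
    """
    if len(classes_a) != len(classes_b):
        raise ValueError("class strings must have equal length")

    compared = 0
    matches = 0
    for a, b in zip(classes_a, classes_b):
        if a == MISSING or b == MISSING:
            continue
        compared += 1
        if a == b:
            matches += 1

    rate = matches / compared if compared else float("nan")
    return rate, compared


def same_sample(classes_a, classes_b, min_rate=0.9, min_compared=50):
    """Placeholder decision rule; both thresholds are hypothetical."""
    rate, compared = class_concordance(classes_a, classes_b)
    return compared >= min_compared and rate >= min_rate
```

Requiring a minimum number of compared pairs guards against calling a match from a handful of informative positions when one dataset has very sparse coverage. Suitable thresholds would need to be calibrated on real matching and non-matching sample pairs.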

This approach provided a practical external quality-control (QC) layer for sample continuity verification, detection of sample swaps, and cross-platform comparison of sequencing datasets. Because the representation is intentionally lossy, it should be interpreted as a QC checksum rather than a forensic identifier. Further evaluation of relatedness leakage and privacy properties is ongoing.

Keywords: genotyping, exome, LSH, privacy