Genetic association studies with population samples hold the promise of uncovering the susceptibility genes underlying the heritability of complex or common disease. Most association studies rely on surrogate markers, with single-nucleotide polymorphisms (SNPs) being the most suitable due to their abundance and ease of scoring. SNP marker selection aims to increase the chance that at least one typed SNP will be in linkage disequilibrium (LD) with the disease-causing variant, while at the same time controlling the cost of the study in terms of the number of markers and samples genotyped. Empirical studies reporting block-like segments of the genome with high LD and low haplotype diversity have motivated a marker selection strategy whereby subsets of SNPs that 'tag' the common haplotypes of a region are picked for genotyping, avoiding the typing of redundant SNPs. Based on these initial observations, a plethora of 'tagging' algorithms for selecting minimum informative subsets of SNPs has recently appeared in the literature. These differ mostly in two major respects: the quality or correlation measure used to define tagging, and the algorithm used to minimize the final number of tagging SNPs. In this review we describe the available tagging algorithms within a three-step unifying framework, point out their methodological and conceptual differences, and assess their assumptions, performance, and scalability.
- Association mapping
- Haplotype tagging
- Linkage disequilibrium
- Minimum informative subset
- Single nucleotide polymorphism
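The pairwise-LD flavour of tagging discussed above can be sketched as a simple greedy set cover: repeatedly pick the SNP that captures the most still-untagged SNPs above an r² threshold. The sketch below is illustrative only; the function name, the 0.8 threshold, and the toy genotype matrix are assumptions for demonstration, not any specific algorithm from the review.

```python
# A minimal sketch of greedy pairwise-LD tagging (illustrative, not any
# particular published method): repeatedly choose the SNP that covers the
# most still-untagged SNPs at r^2 >= r2_min until every SNP is covered.
import numpy as np

def greedy_tags(genotypes, r2_min=0.8):
    """Return indices of tag SNPs such that every SNP has pairwise
    r^2 >= r2_min with at least one selected tag.

    genotypes: (individuals, SNPs) array of minor-allele counts (0/1/2);
    columns are assumed polymorphic (a constant column yields NaN r^2).
    """
    n_snps = genotypes.shape[1]
    # Squared Pearson correlation between allele counts as an LD measure.
    r2 = np.corrcoef(genotypes, rowvar=False) ** 2
    covered = np.zeros(n_snps, dtype=bool)
    tags = []
    while not covered.all():
        # How many still-uncovered SNPs each candidate tag would capture.
        gains = ((r2 >= r2_min) & ~covered).sum(axis=1)
        best = int(np.argmax(gains))
        tags.append(best)
        covered |= r2[best] >= r2_min  # a SNP always covers itself
    return tags

# Toy data: SNPs 0 and 1 are in perfect LD; SNP 2 is only weakly
# correlated with them, so it needs its own tag.
G = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 2],
              [1, 1, 0], [2, 2, 1], [2, 2, 2]])
tags = greedy_tags(G)  # one of {0, 1} plus SNP 2
```

This greedy heuristic illustrates the minimization step only; the reviewed algorithms differ precisely in the correlation measure plugged into the coverage test and in how the subset-minimization itself is carried out.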