Introduction
An individual’s genome comprises a mosaic of chromosomal sections deriving from the (possibly various) ancestral populations from which the individual descends. The global ancestry of an individual is the proportion of the genome that derives from each of these ancestral populations, integrated over the individual’s entire genome. For instance, if an individual’s father is 100% East Asian and the mother is 100% Northern European, the individual’s global ancestry is 50% East Asian and 50% Northern European.
For some project configurations, Gencove delivers global ancestry estimates (GAE).
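The definition of global ancestry can be illustrated with a short calculation. The sketch below assumes that the ancestry of each chromosomal section is already known, which is not the case in practice; the function name, segment lengths, and labels are hypothetical and chosen only to mirror the two-parent example.

```python
def global_ancestry(segments):
    """Return the length-weighted ancestry proportions of a genome.

    segments: list of (length_in_bp, population_label) tuples, one per
    chromosomal section (hypothetical input; real ancestry is not observed).
    """
    totals = {}
    for length, population in segments:
        totals[population] = totals.get(population, 0) + length
    genome_length = sum(totals.values())
    return {pop: total / genome_length for pop, total in totals.items()}


# Two-parent example: the paternal copy of the genome is entirely
# East Asian and the maternal copy is entirely Northern European.
two_parent = [
    (100_000_000, "East Asian"),
    (100_000_000, "Northern European"),
]
proportions = global_ancestry(two_parent)
print(proportions)
```

Because the proportions are weighted by segment length, a genome made of many short sections and one made of a few long sections can have the same global ancestry.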
Methods
Gencove uses a proprietary algorithm for the estimation of global ancestry directly from sequence reads. Briefly, the algorithm is a supervised version of the admixture model introduced by Pritchard et al. (2000), adapted to account for sequence data as input. Given a reference panel of genetic data from individuals from known populations and sequence reads from a given individual, ancestry proportions for the individual are estimated with respect to the reference populations. For more details and a historical perspective on STRUCTURE-like approaches, please see Novembre (2016).
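To make the idea of a supervised admixture model concrete, the following is a minimal sketch and not Gencove's proprietary algorithm. It assumes that alternate-allele frequencies for each reference population are known and held fixed, that each read reports a single allele at a single site, that reads are independent, and that there is no sequencing error. Under those assumptions the ancestry proportions are the mixing weights of a mixture model, which can be estimated by expectation-maximization (EM). All names, panel sizes, and values are hypothetical.

```python
import random


def estimate_ancestry(reads, ref_freqs, n_iter=100):
    """Estimate ancestry proportions by EM with fixed reference frequencies.

    reads: list of (site_index, allele) with allele 0 (ref) or 1 (alt).
    ref_freqs: ref_freqs[k][j] is the alt-allele frequency at site j in
    reference population k (assumed known; this is the supervised part).
    """
    n_pops = len(ref_freqs)
    # Likelihood of each read under each population; constant across iterations.
    lik = [
        [ref_freqs[k][site] if allele == 1 else 1.0 - ref_freqs[k][site]
         for k in range(n_pops)]
        for site, allele in reads
    ]
    q = [1.0 / n_pops] * n_pops  # start from equal proportions
    for _ in range(n_iter):
        totals = [0.0] * n_pops
        for row in lik:
            # E-step: posterior probability that the read derives from each population.
            weights = [q[k] * row[k] for k in range(n_pops)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # read is impossible under every population; skip it
            for k in range(n_pops):
                totals[k] += weights[k] / norm
        # M-step: new proportions are the normalized expected read counts.
        total = sum(totals)
        q = [t / total for t in totals]
    return q


def simulate_reads(true_q, ref_freqs, n_reads, rng):
    """Draw reads from the same simplified model (for illustration only)."""
    n_sites = len(ref_freqs[0])
    populations = list(range(len(ref_freqs)))
    reads = []
    for _ in range(n_reads):
        site = rng.randrange(n_sites)
        pop = rng.choices(populations, weights=true_q)[0]
        allele = 1 if rng.random() < ref_freqs[pop][site] else 0
        reads.append((site, allele))
    return reads


rng = random.Random(42)
# Hypothetical reference panel: 2 populations, 2000 sites.
ref_freqs = [[rng.uniform(0.05, 0.95) for _ in range(2000)] for _ in range(2)]
reads = simulate_reads([0.7, 0.3], ref_freqs, 4000, rng)
q_hat = estimate_ancestry(reads, ref_freqs)
print([round(x, 3) for x in q_hat])
```

Holding the reference frequencies fixed is what distinguishes this supervised setting from the unsupervised model, in which population allele frequencies are estimated jointly with the ancestry proportions.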
Considerations
Interpretation of global ancestry estimates should take into account the following points:
- Estimates are only as good as the datasets used as reference. Although Gencove has assembled an extremely diverse set of individuals for use as a reference panel, there remain populations for which we have no reference individuals.
- In cases where samples belong to a population that is not part of the reference panel, results will typically reflect an assignment to the population in the reference set that is closest to the unrepresented population.
- Global ancestry estimates do not provide any information about local ancestry; i.e., the regional ancestry at a specific genomic segment. GAEs are genome-wide estimates.
- Because low-coverage sequencing is inherently probabilistic, the set of polymorphic sites covered by at least one sequencing read for a given sample differs slightly from run to run. As a result, although ancestry estimates are on the whole extremely stable, estimated ancestry proportions for the same individual will vary by a small amount if the individual is assayed multiple times.
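The last point, run-to-run variation, can be illustrated with a toy simulation. The sketch below is not a model of Gencove's pipeline. It assumes that per-site read depth is Poisson distributed, so a site is covered by at least one read with probability 1 - exp(-coverage), and it treats each site as carrying a directly observable ancestry label, which real data do not provide. The coverage value, site count, and true proportion are hypothetical.

```python
import math
import random


def sequence_run(site_labels, coverage, rng):
    """Return the set of site indices covered by at least one read.

    Assumes Poisson-distributed depth, so each site is covered
    independently with probability 1 - exp(-coverage).
    """
    p_covered = 1.0 - math.exp(-coverage)
    return {j for j in range(len(site_labels)) if rng.random() < p_covered}


def estimate_proportion(site_labels, covered):
    """Genome-wide fraction of covered sites labeled as population 'A'."""
    hits = sum(1 for j in covered if site_labels[j] == "A")
    return hits / len(covered)


rng = random.Random(0)
# Hypothetical genome: 100,000 sites, about 70% labeled 'A' and 30% labeled 'B'.
site_labels = ["A" if rng.random() < 0.7 else "B" for _ in range(100_000)]

# Two independent runs on the same individual at a hypothetical 0.5x coverage.
run_1 = sequence_run(site_labels, 0.5, rng)
run_2 = sequence_run(site_labels, 0.5, rng)
est_1 = estimate_proportion(site_labels, run_1)
est_2 = estimate_proportion(site_labels, run_2)

print(f"sites covered: run 1 = {len(run_1)}, run 2 = {len(run_2)}")
print(f"estimated proportion of A: run 1 = {est_1:.4f}, run 2 = {est_2:.4f}")
```

The two runs cover different sets of sites, yet both estimates are averages over tens of thousands of sites, so they stay close to each other and to the underlying proportion.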