Haplotter: A web-based tool for analyzing HapMap data to detect signatures of positive selection

Voight, B. F., S. Kudaravalli, et al. (2006). “A Map of Recent Positive Selection in the Human Genome.” PLoS Biology 4(3): e72.

Here, I review a web-based tool for human population genetic analysis called Haplotter.  The reference above is the paper that describes the analytical methods used in the tool.  Haplotter is very easy to use and may be useful for teaching upper-level anthropological genetics or human population genetics courses.  It is also useful for generating hypotheses when you suspect that positive selection has acted on genes underlying a particular phenotypic trait but do not have sufficient genomic data to investigate.

Haplotter uses HapMap Phase 1 and 2 data and looks for signatures of positive selection in three continental populations (YRI, CEU, and ASN).  The Chinese and Japanese samples were combined into ASN.  You can query by genomic region, gene, or SNP.

It provides four statistics:

iHS (integrated Haplotype Score) is a statistic for detecting long-range haplotypes.  It is based on EHH (extended haplotype homozygosity), which measures the decay of haplotype identity as a function of distance from a core SNP.  It was designed to detect the signature of selection when the selected variants have not reached fixation and are still at intermediate frequency.
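
To make the EHH idea concrete, here is a minimal Python sketch.  It assumes phased haplotypes given as strings of 0/1 alleles that all carry the same core allele, and it uses the usual definition of EHH as the probability that two randomly chosen chromosomes are identical over the whole interval from the core SNP to a given marker.  The function name `ehh` and the toy haplotypes are my own illustration, not Haplotter's code.  iHS goes further by integrating EHH over distance for the ancestral and derived alleles and standardizing the log ratio, which is not shown here.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, end):
    """EHH from marker index `core` to marker index `end` (inclusive).

    `haplotypes` is a list of equal-length allele strings that all carry
    the same allele at the core SNP (hypothetical toy input format).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    lo, hi = sorted((core, end))
    # Group chromosomes that are identical over the whole interval.
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    # Pairs of identical chromosomes divided by all possible pairs.
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy data: four chromosomes carrying allele "1" at the core SNP (index 0).
toy_haplotypes = ["1100", "1100", "1101", "1011"]
ehh_by_marker = [ehh(toy_haplotypes, 0, j) for j in range(4)]
```

In the toy data EHH starts at 1 at the core SNP and drops as markers farther away are included, which is the decay that the statistic is meant to capture.  Under selection, the favored allele is expected to sit on haplotypes where this decay is unusually slow.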

Fay and Wu’s H and Tajima’s D are used to examine skew in the allele frequency spectrum, and large negative values indicate positive selection.  Fay and Wu’s H is sensitive when the selected alleles are close to fixation in a population, while Tajima’s D detects selection when there is an excess of low-frequency polymorphisms.
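
As a rough illustration of how these two statistics respond to the frequency spectrum, the sketch below computes them from derived-allele counts at segregating sites.  It follows the standard published formulas (Tajima 1989 for D, and the unnormalized H = pi - theta_H of Fay and Wu 2000).  The function names and the input format are my own assumptions, and I do not know exactly how Haplotter implements them.

```python
from math import sqrt


def pairwise_diversity(n, derived_counts):
    """pi: mean number of pairwise differences, summed over sites."""
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in derived_counts)


def tajimas_d(n, derived_counts):
    """Tajima's D for n chromosomes (n >= 4).

    `derived_counts` holds one derived-allele count (1..n-1) per
    segregating site; this input format is a hypothetical choice.
    """
    s = len(derived_counts)
    if n < 4 or s == 0:
        raise ValueError("need n >= 4 and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = pairwise_diversity(n, derived_counts)
    # Difference between pi and Watterson's estimator, scaled by its SD.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def fay_wu_h(n, derived_counts):
    """Unnormalized Fay and Wu's H = pi - theta_H."""
    theta_h = sum(2.0 * k * k / (n * (n - 1)) for k in derived_counts)
    return pairwise_diversity(n, derived_counts) - theta_h
```

With 10 chromosomes, ten singleton sites (`[1] * 10`) give a negative D, ten sites at intermediate frequency (`[5] * 10`) give a positive D, and ten sites with the derived allele near fixation (`[9] * 10`) give a strongly negative H.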

Population pairwise FST (between the three pairs of populations) is used to detect large allele frequency differences between pairs of populations that result from selection acting on loci in one population but not the other.
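
For a single biallelic SNP, the idea behind pairwise FST can be sketched with Wright's definition, FST = (HT - HS) / HT, where HS is the mean expected heterozygosity within the two populations and HT is the expected heterozygosity of the pooled population.  This is a simplified illustration with equal population weights and no sample-size correction.  The estimator that Haplotter actually uses may differ, and the function name is hypothetical.

```python
def pairwise_fst(p1, p2):
    """Wright's FST for one biallelic SNP from two allele frequencies.

    Assumes equal weighting of the two populations and ignores
    sampling error (a simplification, not a corrected estimator).
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    h1 = 2.0 * p1 * (1.0 - p1)
    h2 = 2.0 * p2 * (1.0 - p2)
    hs = (h1 + h2) / 2.0                 # mean within-population heterozygosity
    p_bar = (p1 + p2) / 2.0
    ht = 2.0 * p_bar * (1.0 - p_bar)     # heterozygosity of the pooled population
    if ht == 0.0:
        return 0.0                       # SNP is monomorphic in both populations
    return (ht - hs) / ht
```

For example, frequencies of 0.1 and 0.9 give FST = 0.64, while identical frequencies give FST = 0.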

See also here for brief explanations of the methods.  The P-values for Tajima’s D and Fay and Wu’s H are empirical: each statistic is calculated in 50-SNP windows, and the value in a given window is ranked against the genome-wide distribution.
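
The rank-based P-value can be sketched as follows.  I assume that "more extreme" means "more negative", since large negative D and H values are the ones taken as evidence of selection.  The function name and the list of window values are hypothetical.

```python
def empirical_p_value(observed, genome_wide_values):
    """Fraction of genome-wide windows whose statistic is <= observed.

    `genome_wide_values` is assumed to hold one value of the statistic
    per 50-SNP window across the genome, including the window of interest.
    """
    if not genome_wide_values:
        raise ValueError("need at least one genome-wide window")
    n_as_extreme = sum(1 for v in genome_wide_values if v <= observed)
    return n_as_extreme / len(genome_wide_values)


# Toy genome with ten windows; the window of interest has D = -2.0.
toy_windows = [-2.5, -2.0, -1.0, -0.3, 0.0, 0.2, 0.5, 1.0, 1.5, 2.0]
p_toy = empirical_p_value(-2.0, toy_windows)
```

Note that an empirical P-value of this kind only says that a window is an outlier relative to the rest of the genome.  It is not a test against a formal neutral model.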

But we have to remember that there are several important issues to consider, and I list three here.  First, the ascertainment bias that I mentioned earlier (see here) may affect some of the analyses.  Second, only three continental populations were used, and these three populations may not be representative of worldwide human populations.  Third, many SNPs in the HapMap data set are common variants with a minor allele frequency (MAF) greater than 0.05.


Ascertainment bias in HapMap data: Should we use HapMap data for population genetics studies?

Clark, A. G., M. J. Hubisz, et al. (2005). “Ascertainment bias in studies of human genome-wide polymorphism.” Genome Research 15(11): 1496-1502.

The HapMap project produced dense genotype data on genome-wide polymorphisms.  In phase 3, 11 worldwide populations were included.  The project was carefully designed, but there are several issues (e.g., the representativeness of the sampled populations and ascertainment bias in the SNPs chosen for genotyping).

In their article, Clark et al. analyzed the phase 1 HapMap data to examine how ascertainment bias affects observed within-population heterozygosity (genetic diversity) and FST (population differentiation).  The phase 1 data set includes Yoruba from Nigeria, Chinese from Beijing, Japanese from Tokyo, and Europeans from Utah.

In population genetics, ascertainment bias is a sampling bias that usually arises when SNPs or other genetic markers are selected for analysis.  Traditionally, European individuals have been used for SNP and marker discovery, and the SNPs and markers discovered in the European samples are then used to analyze genetic variation in other populations, such as Asians and Africans.  Statistical and population genetic analyses based on these markers (e.g., analyses of the pattern of population differentiation among three geographical groups) are biased.

To assess the effects of ascertainment bias, Clark et al. compared the HapMap data to the Perlegen data.  The HapMap project was designed to find common SNPs with an allele frequency of > 5%, so that these SNPs can be used for disease association studies.  The Perlegen data, on the other hand, were produced by resequencing individuals from ethnically diverse populations.

They found that observed within-population heterozygosity and FST between each pair of populations are inflated in the HapMap data set.  HapMap data are often used for population genetics studies, but they argue that we have to be careful in interpreting the data.  For example, FST is often used to detect genetic evidence of positive selection (see here).  An increased FST may be observed because of ascertainment bias, not because of localized positive selection.
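
The simplest component of this bias is easy to demonstrate: if only common SNPs are kept, the average heterozygosity of the retained SNPs is higher than that of all SNPs.  The sketch below uses made-up allele frequencies and hypothetical function names.  It is not the analysis of Clark et al., who compared real HapMap and Perlegen data and considered the discovery panel as well.

```python
def heterozygosity(p):
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def mean_heterozygosity(freqs):
    """Average expected heterozygosity over a list of SNP frequencies."""
    if not freqs:
        raise ValueError("need at least one SNP")
    return sum(heterozygosity(p) for p in freqs) / len(freqs)


def ascertain_common(freqs, maf_cutoff=0.05):
    """Keep only SNPs whose minor allele frequency exceeds the cutoff."""
    return [p for p in freqs if min(p, 1.0 - p) > maf_cutoff]


# Made-up frequencies: three rare SNPs and three common SNPs.
toy_freqs = [0.01, 0.02, 0.03, 0.10, 0.30, 0.50]
het_all = mean_heterozygosity(toy_freqs)
het_common = mean_heterozygosity(ascertain_common(toy_freqs))
```

Here `het_common` is larger than `het_all` simply because the rare SNPs, which contribute very little heterozygosity, were filtered out before the average was taken.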

They think that this ascertainment bias does not affect genetic association studies, but I wonder how it affects association studies when you test for association while adjusting for population stratification using STRUCTURE or PCA.

Also, we need to consider whether the CEPH-Human Genome Diversity Project data have ascertainment bias.  If so, the worldwide human population structure that has been observed (see here) could be due in part to ascertainment bias.

Genetic evidence of Indian Ocean slave trade from Indian Siddis

Shah, Anish M., R. Tamang, et al. (2011). “Indian Siddis: African Descendants with Indian Admixture.” American Journal of Human Genetics 89(1):154-161.

Compared to the trans-Atlantic slave trade, the Indian Ocean slave trade is less well known and perhaps less well understood, but it has a longer history.  Indigenous Africans were captured and traded by other Africans, Arabs, and Europeans.  Some of the slaves were sent to the Middle East and South Asia.  Sub-Saharan African mtDNA haplogroups have been found in Middle Eastern populations at frequencies ranging from 9 to 34%.  Sub-Saharan African Y chromosome haplogroups are rare but are also found among Middle Eastern Arab populations.  Richards et al. (2003) and Quintana-Murci et al. (2004) argue that these sub-Saharan African mtDNA haplogroups were brought to the Middle East and reached South Asia through the Arab slave trade.  African females were incorporated into Islamic societies, but African males did not have much chance of reproduction.

In this article, Shah et al. (2011) demonstrate that the Siddis, or Habishis, of India, who are descendants of slaves from Africa, have genetic characteristics of sub-Saharan Africans.  They genotyped 850,000 autosomal SNPs, 32 Y chromosome biallelic markers, and 17 Y chromosome STRs, and sequenced mtDNA hypervariable region I.

Among the Siddis, the sub-Saharan genetic contribution estimated from autosomal SNPs is quite large, ranging from 62.3% to 74.4%, and they plot closer to the HapMap Yoruba than to Indians, Europeans, or Asians on the PC plot.  Contrary to the previous studies, they found a larger male than female sub-Saharan contribution to the Siddis.  You could expect this from the Indian marriage rule of endogamy, but gene flow between the Siddis and neighboring ethnic groups or communities was unidirectional.  They found South Asian and Eurasian genetic contributions to the Siddis from their neighboring communities, but they did not find a sub-Saharan genetic contribution from the Siddis to neighboring ethnic groups.

Genetic studies of the slave trade are usually conducted using uniparental markers (mtDNA and Y chromosome).  The molecular genetic and analytical techniques for tracing origins are relatively simple, but the problem is that you are tracing only two lineages (maternal and paternal) out of the thousands of possible ancestors of a particular individual.  By analyzing autosomal markers, you obtain genetic information from all the ancestors.  Recent advances in molecular genetics allow researchers to genotype many single nucleotide polymorphisms (SNPs) per individual.  Sometime in the future, it will be possible to genotype over 1 million SNPs per individual without huge cost.  The downside is that this requires more sophisticated statistical and analytical techniques.
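
The "two lineages out of thousands" point is simple arithmetic, sketched below.  The count 2**g is the maximum number of distinct ancestors g generations back; it ignores the fact that the same person can appear more than once in a real pedigree.  The function names are my own.

```python
def max_ancestors(generations):
    """Maximum number of distinct ancestors a given number of generations back."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 2 ** generations


def uniparental_fraction(generations):
    """Fraction of ancestral lines covered by mtDNA plus the Y chromosome.

    Together the two markers follow at most two lines (the all-female line
    and, for a male, the all-male line), however far back you go.
    """
    return min(1.0, 2 / max_ancestors(generations))
```

Ten generations back there are up to 1,024 ancestors, so the two uniparental lines cover about 0.2% of them; twelve generations back there are up to 4,096.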


Quintana-Murci L, Chaix R, Wells S, Behar DM, Sayar H, Scozzari R, Rengo C, Al-Zaheri N, Semino O, Santachiara-Benerecetti AS, Coppa A, Ayub Q, Mohyuddin A, Tyler-Smith C, Mehdi SQ, Torroni A, and McElreavey K (2004) Where west meets east: the complex mtDNA landscape of southwest and Central Asian corridor. American Journal of Human Genetics 74:827-845.

Richards M, Rengo C, Cruciani F, Gratrix F, Wilson JF, Scozzari R, Macaulay V, and Torroni A (2003) Extensive female-mediated gene flow from Sub-Saharan Africa into Near Eastern Arab populations. American Journal of Human Genetics 72:1058-1064.

Accelerated genetic drift on chromosome X during the human dispersal out of Africa

Keinan, A., J. C. Mullikin, et al. (2009). “Accelerated genetic drift on chromosome X during the human dispersal out of Africa.” Nat Genet 41(1): 66-70.

Keinan and his colleagues provide data that challenge Hammer’s argument.  Hammer and his colleagues have argued that female effective population size is larger than male effective population size, largely because of polygynous practices.  By comparing X chromosome variation to autosomal variation, Keinan and his colleagues show that female effective population size was reduced outside of Africa (note that females carry two X chromosomes and males carry one, so X chromosome variation reflects female demographic history more than male).

Compared to Hammer et al., Keinan et al. used a larger genomic data set.  They analyzed 130,000 SNPs from a subset of the HapMap data, 1,087 additional SNPs that they discovered in two West African copies of the X chromosome, and sequence data consisting of over a billion base pairs of DNA from five North Europeans, four East Asians, and five Africans.

First, using the SNP data, they calculated the ratio of X chromosome to autosomal allele frequency differentiation between two populations (FST) to estimate the amount of genetic drift.  The ratios obtained between North Europeans and East Asians were not significantly different from the expected ratio (3/4 = 0.75), but the ratios between Africans and non-Africans were reduced.

Second, they compared the X chromosome and autosomal SNP allele frequency distributions within each population.  The shapes of the allele frequency distributions for the X chromosome and the autosomes were significantly different in non-Africans.  Non-Africans have more high-frequency derived alleles on the X chromosome than expected, and their X chromosome allele frequency distribution does not fit the expected distribution.

Third, they obtained the X-to-autosome sequence divergence ratio for each population.  West Africans have a ratio close to the expected value, but non-Africans have significantly smaller ratios than expected (0.635 for North Europeans and 0.690 for East Asians).

They conclude that the X chromosome experienced accelerated genetic drift, and that sex-biased demographic processes, rather than natural selection, are the likely explanation.  However, the data do not support polygyny as one of these processes, because polygyny increases the ratio, whereas they observed decreased ratios.  Alternatively, they suggest that non-Africans received long-range male migration from Africa, or that females have a longer generation time than males.  It is also possible that some females were reproductively more successful than others during the out-of-Africa dispersal.
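
The direction of the polygyny argument can be checked with the standard textbook formulas for effective population size with separate sexes.  Using them here is my own assumption; they are not taken from Keinan et al.  With Nm breeding males and Nf breeding females, the autosomal effective size is 4*Nm*Nf / (Nm + Nf) and the X chromosome effective size is 9*Nm*Nf / (4*Nm + 2*Nf), so their ratio simplifies to 9*(Nm + Nf) / (8*(2*Nm + Nf)).

```python
def x_to_autosome_ne_ratio(n_males, n_females):
    """Expected ratio of X-linked to autosomal effective population size.

    Uses the standard separate-sexes formulas under random mating;
    selection, migration, and changes in population size are ignored.
    """
    if n_males <= 0 or n_females <= 0:
        raise ValueError("both sexes need a positive number of breeders")
    ne_autosome = 4.0 * n_males * n_females / (n_males + n_females)
    ne_x = 9.0 * n_males * n_females / (4.0 * n_males + 2.0 * n_females)
    return ne_x / ne_autosome


equal_sexes = x_to_autosome_ne_ratio(1000, 1000)    # the 3/4 expectation
polygyny_like = x_to_autosome_ne_ratio(1000, 3000)  # fewer breeding males
fewer_females = x_to_autosome_ne_ratio(3000, 1000)  # fewer breeding females
```

Equal numbers of breeding males and females give 0.75.  When fewer males than females breed, as under polygyny, the ratio rises above 0.75 (toward a maximum of 9/8), and when fewer females than males breed it falls below 0.75.  Ratios such as 0.635 and 0.690 therefore point away from polygyny in this simple model.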

The two groups of researchers reached different conclusions, perhaps because of several factors.  First, the samples used by the two groups were different; Hammer et al. included sample populations that Keinan et al. did not use.  Second, although Hammer et al. have sequence data from more individuals than Keinan et al., they used a much smaller amount of genomic data.  Third, the two groups used very different analytical methods.

Race: a social destruction of a biological concept

Sesardic, N. (2010). “Race: a social destruction of a biological concept.” Biology and Philosophy 25(2): 143-162.

Anthropologists, other social scientists, philosophers, and human population geneticists have argued that there is no genetic basis for racial classification, but in this article, Sesardic (2010) argues that the arguments that race has no genetic basis are not supported by recent multilocus genetic data.  His point is not that human biological races exist; rather, he questions the scientific basis of the arguments that they do not.

Like Pigliucci and Kaplan, Sesardic starts with the problem of defining race, but he mainly focuses on how philosophers and others who argue that there are no genetic differences among human groups define race, in order to show that their definitions are not supported by recent genetic data showing genetic differences among human groups.

Sesardic argues that if allele frequencies at a single locus are used for racial classification, individuals cannot be classified into the right racial categories, but if multilocus genetic data are used, as demonstrated by Risch and his colleagues and by Rosenberg et al. (2002), many individuals can be classified correctly into racial or geographical categories.  Similarly, if forensic anthropologists look at many skeletal traits, they can accurately infer the racial identity of individuals.

As Sesardic suggests, the argument that there are no genetic differences is not supported by many genetic and osteological studies.  However, we should avoid a naïve conclusion.  The multilocus genetic data showing genetic differences among human groups should not be used to argue for the existence of human biological races (note that Sesardic is not arguing this).  We have to consider the evolutionary and historical processes, as well as the sampling and statistical effects, that produce the clustering of human groups.

Human Genome Diversity Cell Line Panel samples

Cann, H. M., C. d. Toma, et al. (2002). “A Human Genome Diversity Cell Line Panel.” Science 296(5566): 261-262.

Cavalli-Sforza, L. L. (2005). “The Human Genome Diversity Project: past, present and future.” Nature Reviews: Genetics 6: 333-340.

Human Polymorphism Study Center (CEPH) website   

Cultured cell lines from 1050 individuals in 51 populations are stored at the Center for the Study of Human Polymorphism (CEPH), the Foundation Jean Dausset in Paris, to facilitate anthropological and medical genetic research.  Samples were collected with full consent.  The sampled populations are of anthropological interest and come from five continents.  Cavalli-Sforza says that these sampled populations are potentially non-admixed with Europeans.

However, because the sampled populations are not a random sample of the world's populations, the set could be inadequate for understanding human evolution and population structure.  For example, despite the great genetic variation that exists in sub-Saharan Africa, only six sub-Saharan African populations were included, and three of the six are forager populations.  One of the sub-Saharan African populations sampled, the Bantu, is a collection of samples from six different Bantu ethnic groups and 12 Bantu individuals from Kenya, but the Bantu are linguistically, culturally, and genetically diverse.  Similarly, the Yoruba are a culturally and potentially genetically diverse group.

In contrast to the sparse collection of samples in Africa, samples from Asia are concentrated around Pakistan (8 ethnic groups) and China (Han Chinese and 14 ethnic minorities).  The gaps in the geographical distribution of sampled populations are in areas occupied by more admixed populations, such as North Africa, the Middle East, India, and Central Asia.  These areas have seen a series of prehistoric and historic migrations, expansions of states and empires, and long-distance trade.  There are also many areas of the world besides Africa that require additional sampling.

Cavalli-Sforza addresses potential inadequacy of samples collected for intended research purposes.  Although there are great geographical gaps in the sampled population set and more samples need to be collected in the future, he believes that this initial collection is essential to determine how samples need to be collected later.  

Research projects that used CEPH samples are reviewed here.

Mobile elements reveal small population size in the ancient ancestors of Homo sapiens

Huff, C. D., J. Xing, et al. (2010). “Mobile elements reveal small population size in the ancient ancestors of Homo sapiens.” Proceedings of the National Academy of Sciences 107(5): 2147-2152

Huff et al. (2010) analyzed genome variation in two samples, focusing on the SNPs around mobile element insertions.  The theory behind this project is that mobile element insertions (Alu and LINE1) are much rarer, so they have deep genealogies (ancient coalescence times).

Their research basically supports this theoretical point.  First, the TMRCA estimated from 9,609 SNPs in the 10 kb around insertions was 462,000 years, which is older than the TMRCA estimated from other genomic regions.  Second, and more interestingly, they estimated a significantly larger ancient effective population size than modern effective population size.  They used a coalescent-based maximum likelihood method to estimate three demographic parameters.

Modern effective population = 8,500

Ancient effective population = 18,500 (C.I. 14,500-26,000)

Time of population size change = 1.2 M years

This means that the effective population size before 1.2 million years ago was 18,500.  The small effective population size of modern humans agrees with many previous genetic studies, but it is interesting that modern human genomes carry evidence that the ancestors of modern humans, such as Homo erectus, had a much larger effective population size and were much more genetically diverse than anatomically modern humans.  Because the effective population size of modern humans is much smaller than that of chimpanzees, it has been suggested that our ancestors experienced a series of bottlenecks, but these data show that the significant reduction in population size occurred after 1.2 million years ago.  Jorde actually said in the NIH Genome Center Lecture series that our ancestors almost became extinct.
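
To get a feel for what a two-size model like this implies, the sketch below computes the expected coalescence time of two autosomal lineages under a standard coalescent with one instantaneous size change.  This is textbook theory used for illustration, not the likelihood method of Huff et al., and the generation time of 25 years is purely my assumption.  Going back in time, two lineages fail to coalesce during the modern epoch with probability exp(-t / (2*N_modern)), so the expectation is 2*N_modern*(1 - q) + 2*N_ancient*q, where q is that probability and t is the time of the size change in generations.

```python
from math import exp


def expected_pairwise_tmrca(n_modern, n_ancient, t_change):
    """Expected coalescence time (in generations) of two lineages.

    Two-epoch model: diploid effective size n_modern from the present
    back to t_change generations ago, and n_ancient before that.
    """
    if n_modern <= 0 or n_ancient <= 0 or t_change < 0:
        raise ValueError("sizes must be positive and time non-negative")
    # Probability that the two lineages have not coalesced by t_change.
    q = exp(-t_change / (2.0 * n_modern))
    return 2.0 * n_modern * (1.0 - q) + 2.0 * n_ancient * q


GENERATION_YEARS = 25  # assumed generation time, not a value from the paper
t_generations = 1.2e6 / GENERATION_YEARS
tmrca_years = expected_pairwise_tmrca(8500, 18500, t_generations) * GENERATION_YEARS
```

With a single constant size the function reduces to the familiar 2N generations, and a larger ancient population pushes the expected coalescence time further back.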