Imputing untyped SNPs with a program IMPUTE

January 6, 2012

IMPUTE is a program to estimate the genotypes of untyped SNPs, usually in disease-SNP association studies.  Currently, commercially available whole-genome genotyping arrays provide genotype data for 500K to 2M single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs), but there are more SNPs than these arrays can capture.  Whole genome sequencing is currently very costly and has accuracy problems in determining the alleles of rare variants.  Therefore, in many cases, it is better to impute untyped SNPs.  IMPUTE and a similar program, MACH, allow imputation using HapMap and/or 1000 Genomes data as a reference.  Here, I am reviewing IMPUTE2 for imputation using 1000 Genomes data.

As of today (1/6/2012), the most recent version, IMPUTE v2.2 beta, is available for three platforms (Windows, Mac, and Linux).  To use 1000 Genomes data as a reference panel, you need to use the most recent version.  Previously, imputation was most accurately performed using combined reference data (HapMap and 1000 Genomes data together).  Now, the 1000 Genomes Project has genotype data for enough individuals from various ethnic backgrounds, so it is no longer necessary to use combined data.

A unique feature of IMPUTE2 is its use of multi-population reference panels, so you do not need to choose a population to use as the reference panel.  The program chooses which reference haplotypes to use.  Basically, population labels and information on the relatedness of individuals are not used; instead, the program looks for the haplotype sequences in the reference that best match the study samples.  Regardless of ancestry, the program looks for haplotypes shared between the reference and the study sample, while identifying and ignoring highly diverged haplotypes.  Then, IMPUTE2 uses that information to impute untyped or missing SNPs.  Therefore, this method is not sensitive to the ancestry composition of the reference panel.  According to the authors of the program, this process works well with homogeneous or admixed populations.  They also argue that genotypes of low frequency alleles (MAF<0.05) can be imputed more accurately.
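To make the idea of "copying from the best-matching reference haplotypes" concrete, here is a toy sketch in Python.  It is not IMPUTE2's actual algorithm (which is a statistical model, not a simple nearest-neighbor vote); the function name, the panel, and the choice of k are all my own assumptions for illustration.  It only shows that reference haplotypes can be chosen by sequence similarity at the typed SNPs, without using population labels.

```python
from collections import Counter


def impute_untyped(study, reference, k=2):
    """Fill untyped alleles (None) in one study haplotype.

    study:     list of 0/1 alleles, None where the SNP is untyped
    reference: list of fully typed reference haplotypes of the same length
    k:         number of best-matching reference haplotypes to copy from
    (all names and the voting rule are hypothetical, for illustration only)
    """
    typed = [i for i, allele in enumerate(study) if allele is not None]

    def mismatches(ref_hap):
        # Number of differences, counted only at the typed SNPs
        return sum(ref_hap[i] != study[i] for i in typed)

    # Population labels are never consulted: only sequence similarity matters
    best = sorted(reference, key=mismatches)[:k]

    imputed = list(study)
    for i, allele in enumerate(study):
        if allele is None:
            votes = Counter(hap[i] for hap in best)
            imputed[i] = votes.most_common(1)[0][0]
    return imputed


# Hypothetical reference panel: two similar haplotypes and two diverged ones
reference_panel = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
]
study_haplotype = [0, None, 1, None, 1]

print(impute_untyped(study_haplotype, reference_panel))  # [0, 0, 1, 0, 1]
```

The two diverged haplotypes in the panel are simply ignored because they match the study haplotype poorly at the typed positions.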

Imputation could be a useful method in anthropological genetics and genomics.  First, we can explore the association of untyped SNPs with anthropologically interesting phenotypes, such as skin color, weight, and height, in genome-wide association studies.  Second, the untyped SNPs could be the ones under natural selection, with typed SNPs showing significant association with phenotypes in genome-wide association studies as a result.


TYR and OCA2: two genes associated with skin pigmentation in African Americans

November 17, 2011

Shriver, M. D., E. J. Parra, et al. (2003). “Skin pigmentation, biogeographical ancestry and admixture mapping.” Human Genetics 112(4): 387-399.

Previously, I wrote about the correlation between West African Ancestry (WAA) estimates and skin color among African Americans and African Caribbeans (here).  Shriver et al. used 33 ancestry informative markers (AIMs) that have large frequency differences between African and European populations.  Three of these markers are in candidate genes for skin pigmentation (TYR, OCA2, and MC1R), so they examined whether these candidate genes are associated with skin color (Melanin Index, or M Index, measured using the DermaSpectrometer).

Two pigmentation candidate genes (TYR and OCA2) and many other AIMs were associated with M Index without adjusting for WAA.  When they adjusted for WAA, only TYR remained significant.  Then, they used ADMIXMAP, an admixture mapping program, to find segments of the genome that are associated with skin pigmentation because of differences in genetic ancestry.  In this analysis, TYR and OCA2 were associated with skin color, but MC1R was not.

Their analyses demonstrated that two pigmentation candidate genes (TYR and OCA2) are likely to contribute to differences in skin color between African and European populations.  TYR encodes an enzyme, tyrosinase, which catalyzes the first two reactions in the melanin synthesis pathway.  Mutations in OCA2, or the P gene, cause a common type of albinism.

I hope to review follow-up research projects later to further understand the genes involved in the production of dark skin in African and African American populations.


Evidence of ancient admixture between the Denisova and anatomically modern humans from Southeast Asia, Oceania, and New Guinea

November 3, 2011

Reich, D., R. E. Green, et al. (2010). “Genetic history of an archaic hominin group from Denisova Cave in Siberia.” Nature 468(7327): 1053-1060.

Reich, D., N. Patterson, et al. (2011). “Denisova Admixture and the First Modern Human Dispersals into Southeast Asia and Oceania.” American Journal of Human Genetics 89(4): 516-528.

I went to a session in which David Reich talked about his research on the Denisova during the American Society of Human Genetics annual meeting in Montreal last month, and I had a chance to talk to him briefly after the session.

These articles are the results of a collaboration among leading scientists: David Reich (Harvard University), Svante Pääbo (Max Planck Institute), Mark Stoneking (Max Planck Institute), and Montgomery Slatkin (University of California, Berkeley).  It is really a dream team of scientists.

The most important finding from this project is the evidence of ancient admixture between the Denisova and anatomically modern humans from Southeast Asia, New Guinea, Australia, and Oceania.  They estimated that the Denisova contributed up to 7% of the genetic material of modern people from these areas.

Considering that the Denisova was found in southern Siberia, the mechanism of interaction between the Denisova and modern humans is difficult to understand.  From reading the articles and talking to David Reich, I am guessing that they considered many scenarios of interaction, but based on their available data, they believe the interaction took place in Southeast Asia.

Another important result from this project is that we now have a better understanding of the relationship between the Denisova and Neanderthals and between the Denisova and anatomically modern humans.  The Denisova is more closely related to the Neanderthals than to modern humans, and the two shared an ancestor about 640,000 years ago.  Modern humans shared an ancestor with the Denisova and Neanderthals about 804,000 years ago.  The phylogenetic tree constructed from whole genome data was very different from the tree based on mtDNA genome data (see here).

Ancient genome data from archaic humans are still limited, but current data favor the Multiregional model and suggest that both the Denisova and Neanderthals (go here for the Neanderthal genome) contributed genetic material to the gene pool of anatomically modern humans.  If we had a lot more ancient genome data, we might find evidence of substantial genetic contributions from archaic humans, completely rejecting the simplistic view of the Out-of-Africa model.


HGDP Selection Browser: Web based tool to analyze Human Genome Diversity Project data to detect signature of positive selection

October 25, 2011

Pickrell, J. K., G. Coop, et al. (2009). “Signals of recent positive selection in a worldwide sample of human populations.” Genome Research 19(5): 826-837.

I am reviewing another web based tool, called HGDP Selection Browser, for human population genetic analysis.  The above reference is the paper that describes the analytical methods used in this web tool.  Like Haplotter, HGDP Selection Browser is very easy to use.  It uses Human Genome Diversity Project SNP data generated by Li et al. (2008) on the Illumina 650K platform for 53 populations.

It provides four statistics:

FST is estimated using an AMOVA (Analysis of Molecular Variance) approach and the population groupings identified by Rosenberg et al. (2002).  The -log10 of the empirical P-values is plotted.

Heterozygosity: the genetic diversity of populations is compared.  When a genomic region is under selection in a population, heterozygosity in that region is reduced compared to that of other populations.

iHS (integrated Haplotype Score) is a statistic for detecting long-range haplotypes.  It is based on EHH (extended haplotype homozygosity), which measures the decay of haplotype identity as a function of distance.  It was designed to detect the signature of selection when the variants have not reached fixation and are at intermediate frequency.  However, it is not sensitive to selection when the selected alleles are close to fixation.  It also loses power when the sample size is small.

XP-EHH (Cross Population Extended Haplotype Homozygosity) is another method for detecting long-range haplotypes.  It is sensitive when a selective sweep is near fixation and has more power than iHS when the sample size is small.
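Since both iHS and XP-EHH are built on EHH, a small worked example of EHH itself may help.  This is only a sketch under my own assumptions (0/1 coded haplotypes, a hypothetical function name, and made-up data), not the code used by the browser.  EHH at a given distance is the probability that two randomly chosen chromosomes carrying the core allele are identical at every SNP from the core out to that distance.

```python
from collections import Counter
from math import comb


def ehh(carriers, core, upto):
    """EHH from the core SNP (index `core`) out to index `upto` (upto >= core).

    carriers: haplotypes (lists of 0/1) that all carry the same core allele;
              at least two are required.
    """
    # Group chromosomes by their full sequence between core and upto
    segments = Counter(tuple(hap[core:upto + 1]) for hap in carriers)
    identical_pairs = sum(comb(count, 2) for count in segments.values())
    return identical_pairs / comb(len(carriers), 2)


# Hypothetical chromosomes that all carry allele 1 at the core SNP (index 0)
carriers = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]

# EHH starts at 1 and decays with distance: 1.0, 1.0, 0.5, then 1/6
decay = [ehh(carriers, core=0, upto=j) for j in range(4)]
print(decay)
```

Under strong recent selection, the decay is slow around the selected allele because recombination has not had time to break down the haplotype.  iHS compares the area under this decay curve between the ancestral and derived alleles at a SNP.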

There are several problems with using the HGDP samples.  One problem is the small sample size, so in some of the analyses, closely related populations are pooled together.  Another problem is the low density of genotyped SNPs compared to the HapMap data set.  Also, we have to consider possible effects of ascertainment bias, nonrandom population sampling, etc.


Haplotter: Web based tool to analyze HapMap data to detect signature of positive selection

October 20, 2011

Voight, B. F., S. Kudaravalli, et al. (2006). “A Map of Recent Positive Selection in the Human Genome.” PLoS Biology 4(3): e72.

Here, I am reviewing a web based tool, called Haplotter, for human population genetic analysis.  The above reference is the paper that describes the analytical methods used in this web tool.  Haplotter is very easy to use and may be useful for teaching upper level anthropological genetics or human population genetics courses.  It is also useful for generating hypotheses when you suspect that positive selection has acted on genes for a particular phenotypic trait, but you do not have sufficient genomic data to investigate.

Haplotter uses HapMap Phase 1 and 2 data and looks for signatures of positive selection in three continental populations (YRI, CEU, and ASN).  Chinese and Japanese samples were combined into ASN.  You can query by genomic region, gene, or SNP.

It provides four statistics:

iHS (integrated Haplotype Score) is a statistic for detecting long-range haplotypes.  It is based on EHH (extended haplotype homozygosity), which measures the decay of haplotype identity as a function of distance.  It was designed to detect the signature of selection when the variants have not reached fixation and are at intermediate frequency.

Fay and Wu’s H and Tajima’s D are used to examine skew in the allele frequency spectrum, and large negative values indicate positive selection.  Fay and Wu’s H is sensitive when the selected alleles are close to fixation in a population, while Tajima’s D detects selection when there is an abundance of low frequency polymorphisms.

Population pairwise FST (between three pairs of populations) is used to detect large allele frequency differences between pairs of populations that resulted from selection that acted on loci in one population, but not the other.
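The two frequency-spectrum statistics in the list above can be written down in a few lines.  The sketch below uses the standard textbook formulas for Tajima's D and the unnormalized form of Fay and Wu's H (theta_pi minus theta_H); whether Haplotter uses exactly these forms is my assumption, and the function names and allele counts are made up for illustration.

```python
from math import sqrt


def pairwise_diversity(counts, n):
    """theta_pi: mean number of pairwise differences among n chromosomes.

    counts: allele count k (0 < k < n) at each segregating site.
    """
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in counts)


def tajimas_d(counts, n):
    """Tajima's D from per-site allele counts in a sample of n chromosomes."""
    s = len(counts)  # number of segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = s / a1  # Watterson's estimator
    return (pairwise_diversity(counts, n) - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


def fay_wu_h(derived_counts, n):
    """Unnormalized Fay and Wu's H = theta_pi - theta_H (uses derived allele counts)."""
    theta_h = sum(2.0 * k * k / (n * (n - 1)) for k in derived_counts)
    return pairwise_diversity(derived_counts, n) - theta_h


n = 10
rare_variants = [1] * 10         # hypothetical window: every site is a singleton
high_freq_derived = [9] * 10     # hypothetical window: derived alleles near fixation

print(tajimas_d(rare_variants, n))      # negative: excess of low frequency variants
print(fay_wu_h(high_freq_derived, n))   # negative: excess of high frequency derived alleles
```

Note that H needs the derived (not minor) allele count at each site, so the ancestral state has to be known, usually from an outgroup.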

See also here for brief explanations of the methods.  The P-value was obtained from the empirical distribution of both Tajima’s D and Fay and Wu’s H in 50-SNP windows and the rank of the statistic in the window compared with the overall genome distribution.
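A rank-based empirical P-value of this kind is simple to compute.  The following is a minimal sketch, assuming the P-value is just the fraction of genome-wide windows with a value at least as extreme as the window of interest; the function name and the window values are hypothetical.

```python
import math


def empirical_p(value, genome_values, lower_tail=True):
    """Fraction of genome-wide windows at least as extreme as `value`.

    lower_tail=True treats large negative values as extreme
    (as for Tajima's D and Fay and Wu's H).
    """
    if lower_tail:
        extreme = sum(v <= value for v in genome_values)
    else:
        extreme = sum(v >= value for v in genome_values)
    return extreme / len(genome_values)


# Hypothetical Tajima's D values for ten windows across the genome
window_d = [-2.1, -0.3, 0.4, 0.1, -0.8, 1.2, 0.0, -0.5, 0.7, 0.3]

p = empirical_p(-2.1, window_d)   # only 1 of 10 windows is this low
score = -math.log10(p)            # larger score = more unusual window
print(p, score)
```

Because the P-value is only a rank within the genome, it tells you that a window is unusual, not that selection is the cause.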

But we have to remember that there are several important issues to consider, and I list three here.  First, the ascertainment bias that I mentioned earlier (see here) may affect some of the analyses.  Second, only three continental populations were used, and these three populations may not be representative samples of worldwide human populations.  Third, many SNPs in the HapMap data set are common variants with Minor Allele Frequency (MAF) greater than 0.05.

 


Geneticists tend to overemphasize the importance of genetic factors for resolving the ethnic health disparities

October 17, 2011

Sankar, P., M. K. Cho, et al. (2004). “Genetic Research and Health Disparities.” JAMA: The Journal of the American Medical Association 291(24): 2985-2989.

We (geneticists, media, students, etc.) tend to focus on the genetic aspects of research and overemphasize the importance of genes in human evolution, health, etc.  As an anthropologist, I have tried to be careful about this and to consider socio-cultural aspects as well, but I admit that I often focus on genetic aspects more than socio-cultural aspects.  In this article, Sankar et al. (2004) argue that overemphasizing genetic factors in ethnic health disparities research can have negative impacts.

Although they are well aware of the many factors causing ethnic health disparities, researchers tend to overemphasize the potential benefits of their genetic research for resolving health disparities.  One of the reasons why geneticists overemphasize genetic factors is their funding.  The U.S. National Human Genome Research Institute (NHGRI) took an initiative to address health disparities.  The negative consequence is that attention shifts away from real socio-cultural problems that need to be addressed but are difficult to fix, such as poverty, unequal access to health care, diet, etc.  Also, overemphasizing genetic factors may reinforce racial labeling and stereotyping.

It is true that when we write grant proposals and papers, we have to say that our findings from genetic research can uniquely contribute to resolving the existing problems.  We do not mean to overemphasize, but it is important to note that research findings can be very important.  In addition, we scientists lose objectivity and tend to think the findings from our own research projects are special.


Ascertainment bias in HapMap data: Should we use HapMap data for population genetics studies?

September 26, 2011

Clark, A. G., M. J. Hubisz, et al. (2005). “Ascertainment bias in studies of human genome-wide polymorphism.” Genome Research 15(11): 1496-1502.

The HapMap project produced dense genotype data on genome-wide polymorphisms.  In phase 3, 11 world-wide populations were included.  This project was carefully designed, but there are several issues (e.g., representativeness of the sampled populations and ascertainment bias in the SNPs chosen for genotyping).

In their article, Clark et al. analyzed the phase 1 HapMap data to examine how ascertainment bias affects observed within-population heterozygosity (genetic diversity) and FST (population differentiation).  The phase 1 data set includes Yoruba from Nigeria, Chinese from Beijing, Japanese from Tokyo, and Europeans from Utah.

In population genetics, ascertainment bias is a sampling bias that usually occurs during the selection of SNPs or genetic markers for analysis.  Traditionally, European individuals are used for SNP and marker discovery, and then the SNPs and genetic markers discovered in the European samples are used to analyze genetic variation in other populations, such as Asians and Africans.  Statistical and population genetic analyses using these markers (e.g., analyses examining the pattern of population differentiation among three geographical groups) are then biased.

To assess the effects of ascertainment bias, Clark et al. compared the HapMap data to the Perlegen data.  The HapMap project was designed to find common SNPs with allele frequency > 5%, so that these SNPs can be used for disease association studies.  On the other hand, the Perlegen data were produced by resequencing individuals from ethnically diverse populations.

They found that observed within-population heterozygosity and FST between each pair of populations are inflated in the HapMap data set.  HapMap data are often used for population genetics studies, but they argue that we have to be careful with interpretation of the data.  For example, FST is often used to detect genetic evidence of positive selection (see here).  We can observe increased FST because of ascertainment bias, not because of localized positive selection.
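A tiny numerical example shows why selecting common SNPs inflates heterozygosity.  This is only a sketch with made-up allele frequencies and a simple MAF > 5% filter; the real HapMap SNP selection was more complicated, and the names below are my own.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def maf(p):
    """Minor allele frequency."""
    return min(p, 1.0 - p)


def mean_heterozygosity(freqs):
    return sum(heterozygosity(p) for p in freqs) / len(freqs)


# Hypothetical allele frequencies at all SNPs found by resequencing
all_snps = [0.01, 0.02, 0.03, 0.10, 0.25, 0.40, 0.50, 0.98]

# SNPs that survive a "common variants only" filter (MAF > 5%)
ascertained = [p for p in all_snps if maf(p) > 0.05]

print(mean_heterozygosity(all_snps))      # diversity from complete data
print(mean_heterozygosity(ascertained))   # higher: rare variants were dropped
```

Rare variants have low heterozygosity, so dropping them raises the average even though nothing about the population itself has changed.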

They think that this ascertainment bias does not affect genetic association studies, but I wonder how ascertainment bias affects association studies when you are testing association while adjusting for population stratification using STRUCTURE or PCA.

Also, we need to think about whether the CEPH-Human Genome Diversity Project data have ascertainment bias.  If so, the world-wide human population structure observed (see here) could be in part due to ascertainment bias.


Four methods to detect positive selection from the genetic data of modern human populations

August 23, 2011

Sabeti, P. C., S. F. Schaffner, et al. (2006). “Positive Natural Selection in the Human Lineage.” Science 312(5780): 1614-1620.

As molecular genetics technology advances, it has become much easier to analyze genes or regions of interest to see if they show evidence of positive selection.  Many articles published in the last 10 years or so address positive selection in human populations.  Many different methods are used, and in their article, Sabeti et al. reviewed the major methods to identify genetic signatures of positive selection.  This is a quick note on four methods to detect positive selection from the genetic data of modern human populations.

  1. Differences between populations  Natural selection tends to be localized, so large allele frequency differences and large FST should be observed between geographically distant populations.
  2. High frequency of derived alleles  If a derived allele is advantageous and the effect of selection is large, the derived allele frequency increases quickly.
  3. Reduction in genetic diversity   The frequency of alleles at loci linked to the positively selected allele increases along with the frequency of the positively selected allele.
  4. Long-range haplotypes   Recombination usually breaks the link between these loci, but if the effect of selection is strong, the linkage between these loci extends over very long distances.
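The first signal in the list above, population differentiation, is the easiest to illustrate with numbers.  Here is a small sketch using a simple Wright-style FST for one biallelic SNP and two equally weighted populations; this particular estimator, the population names, and the allele frequencies are my own assumptions for illustration, not the method of any specific paper.

```python
from itertools import combinations


def wright_fst(p1, p2):
    """FST = (HT - HS) / HT for one biallelic SNP in two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # SNP is monomorphic in both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


# Hypothetical frequencies of one allele in three populations
freqs = {"pop_A": 0.05, "pop_B": 0.90, "pop_C": 0.85}

pairwise = {
    (a, b): wright_fst(freqs[a], freqs[b])
    for a, b in combinations(freqs, 2)
}
for pair, fst in pairwise.items():
    print(pair, round(fst, 3))
```

In this made-up example, pop_A stands out from both other populations, which is the pattern you would look for if selection had acted locally in one population but not in the others.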

There are three important things to consider.

  1. Demographic events, such as bottlenecks, expansions, and population subdivision, leave similar genetic signatures, so first, we should consider how demographic events affected the genetic variation.  Also, we should examine whether we can observe a similar pattern in different genes or regions that are not positively selected.
  2. If positive selection had small effects on genetic variation, these four methods will not detect the signature, so there are a lot of positively selected genes that we cannot identify with these methods.
  3. If the genetic data come from publicly available genomic databases, such as HapMap and the CEPH-Human Genome Diversity Project, the effects of ascertainment bias need to be considered (problems associated with ascertainment bias will be discussed in the next post).

The correlation and variability of African genetic ancestry and skin color among African Americans

August 11, 2011

Parra, E. J., R. A. Kittles, et al. (2004). “Implications of correlations between skin color and genetic ancestry for biomedical research.” Nature Genetics 36: S54-S60.

Shriver, M. D., E. J. Parra, et al. (2003). “Skin pigmentation, biogeographical ancestry and admixture mapping.” Human Genetics 112(4): 387-399.

 

These articles are getting a little old, but their findings are interesting and important.  They examined the correlation between skin pigmentation and estimated African ancestry.  Using a DermaSpectrometer, skin color measurements (melanin index) were taken on the inner part of the arm, where UV rarely hits.  African genetic ancestry was estimated using 33 ancestry informative markers (AIMs).

They found that estimated African ancestry was significantly correlated with melanin index, as expected, but more interestingly, both melanin index and estimated African ancestry vary greatly.  This means that functional genes that determine skin color are located somewhere else on the genome and the allele frequencies of these skin pigmentation gene variants differ greatly between ancestral populations (e.g., Africans and Europeans for African Americans).  They explain that because African Americans are recently admixed, we are observing the results of the population structure that existed in their ancestral populations.

Skin color is a very heritable trait.  If skin color is determined largely by genes, one may expect to see small variability in the melanin index of African Americans with 100% African ancestry, but that is not the case.  They observed great variation in the melanin index of African Americans with 100% African ancestry.  Because skin color is a polygenic trait, there are many different genes that determine skin color, so natural variation in skin color should exist in Africa.

The research was conducted when only a few candidate genes linked to skin color had been found, and they confirmed that two candidate genes for skin pigmentation, TYR and OCA2, are significantly associated with melanin index in African Americans (actually, Shriver and colleagues were working on other projects looking for skin color genes when these articles came out).

How variable is the skin color and melanin index in Africa?  There are research projects that demonstrated that skin color varies among sub-Saharan African populations, but has anybody systematically investigated how variable the skin color is within an African population?


Genetic evidence of Indian Ocean slave trade from Indian Siddis

July 23, 2011

Shah, Anish M., R. Tamang, et al. (2011). “Indian Siddis: African Descendants with Indian Admixture.” American Journal of Human Genetics 89(1):154-161.

Compared to the Trans-Atlantic slave trade, the Indian Ocean slave trade is less known and maybe less understood, but it has a longer history.  Indigenous Africans were captured and traded by other Africans, Arabs, and Europeans.  Some of the slaves were sent to the Middle East and South Asia.  Sub-Saharan African mtDNA haplogroups have been found among Middle Eastern populations, at frequencies ranging from 9 to 34%.  Sub-Saharan African Y chromosome haplogroups are rare, but are also found among Middle Eastern Arab populations.  Richards et al. (2003) and Quintana-Murci et al. (2004) argue that these sub-Saharan African mtDNA haplogroups were brought to the Middle East and reached South Asia through the Arab slave trade.  African females were incorporated into Islamic societies, but African males did not have much chance of reproduction.

In this article, Shah et al. (2011) demonstrate that the Siddis, or Habishis, from India, the descendants of slaves from Africa, have genetic characteristics of sub-Saharan Africans.  They genotyped 850,000 autosomal SNPs, 32 Y chromosome biallelic markers, and 17 Y chromosome STRs, and sequenced mtDNA hypervariable region I.

Among the Siddis, the sub-Saharan genetic contribution estimated from autosomal SNPs is quite large, ranging from 62.3 to 74.4%, and they plot more closely to HapMap Yoruba than to Indians, Europeans, or Asians on the PC plot.  Contrary to previous studies, they found more male than female sub-Saharan contribution to the Siddis.  You could expect this from the Indian marriage rule of endogamy, but gene flow between the Siddis and neighboring ethnic groups or communities was unidirectional.  They found South Asian and Eurasian genetic contribution to the Siddis from their neighboring communities, but they did not find sub-Saharan genetic contribution from the Siddis to neighboring ethnic groups.

Genetic studies to understand the slave trade are usually conducted using uniparental markers (mtDNA and Y-chromosome).  The molecular genetic and analytical techniques to trace origins are relatively simple, but the problem is that you are tracing only two lineages (maternal and paternal) out of thousands of possible ancestors for a particular individual.  By analyzing autosomal markers, you are getting genetic information from all the ancestors.  Recent advances in molecular genetics allow researchers to genotype many single nucleotide polymorphisms (SNPs) per individual.  Sometime in the future, it will be possible to genotype over 1 million SNPs per individual without huge cost.  The downside is that this requires more sophisticated statistical and analytical techniques.
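The "two lineages out of thousands" point is just arithmetic, and a few lines of Python make it concrete.  This sketch counts genealogical ancestor slots, which double every generation; the actual number of distinct ancestors can be smaller when the same person appears in several places in a pedigree.  The function name is my own.

```python
def ancestor_slots(generations_back):
    """Number of genealogical ancestor slots a given number of generations back."""
    return 2 ** generations_back


for g in (1, 5, 10, 12):
    slots = ancestor_slots(g)
    # mtDNA follows exactly one of these slots and the Y chromosome another one
    traced = 2 / slots
    print(g, slots, traced)
```

Ten generations back there are already 1,024 ancestor slots, and mtDNA plus the Y chromosome together follow only two of them.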

References:

Quintana-Murci L, Chaix R, Wells S, Behar DM, Sayar H, Scozzari R, Rengo C, Al-Zaheri N, Semino O, Santachiara-Benerecetti AS, Coppa A, Ayub Q, Mohyuddin A, Tyler-Smith C, Mehdi SQ, Torroni A, and McElreavey K (2004) Where west meets east: the complex mtDNA landscape of the southwest and Central Asian corridor. American Journal of Human Genetics 74:827-845.

Richards M, Rengo C, Cruciani F, Gratrix F, Wilson JF, Scozzari R, Macaulay V, and Torroni A (2003) Extensive female-mediated gene flow from Sub-Saharan Africa into Near Eastern Arab Populations. American Journal of Human Genetics 72:1058-1064.

