All you need to know about the 1000 Genomes Project

How different are we genetically?

No two humans have identical DNA sequences. An individual's DNA is organised into 23 chromosome pairs, with one set inherited from the mother and the other from the father. As of 2018, a total of 660 million variants had been characterised. The data were contributed by a wide range of sources and human populations and are available in the latest dbSNP human build 151 (1). These variations, which are known to reside in less than 1% of DNA, drive the genetic and phenotypic diversity of the human species.

What is the 1000 Genomes Project?

The 1000 Genomes Project (1KGP) is one of the major contributors of data to our current knowledge of the geographic diversity and functional subtypes of genetic variants. The project was established in 2007 to characterise the genetic differences within and among ethnic groups worldwide (2). As the name suggests, the initial plan was to characterise common genetic variants in the complete genomes of 1000 individuals using whole genome and exome sequencing (3). As of the latest data release (phase 3) in 2015, the project includes data from 2,504 individuals from 26 populations across Africa, Asia, Europe and the Americas (4). Furthermore, by incorporating data from dense array genotyping, cutting-edge tools and machine learning approaches in the latest phase, the project has characterised over 88 million variants, including 84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels) and 60,000 structural variants. The project also produced a reference panel of 5008 haplotypes.
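The phase 3 figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check of the numbers in the text. It assumes that each diploid individual contributes two haplotypes to the panel; the variable names are illustrative.

```python
# Phase 3 figures as quoted in the text.
individuals = 2504
snps = 84_700_000
indels = 3_600_000
structural_variants = 60_000

# Assumption: each diploid individual contributes two haplotypes.
haplotypes = 2 * individuals

# Sum of the three variant classes listed in the text.
total_variants = snps + indels + structural_variants

print(haplotypes)      # 5008
print(total_variants)  # 88360000
```

The totals agree with the "5008 haplotypes" and "over 88 million variants" reported for the release.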

What are the benefits?

Imputation

The cost of generating a whole genome or whole exome sequence has recently fallen below $1000 (5). Even so, planning a clinical study that involves thousands of cases and controls is still a major economic burden. The work done by the 1KGP provides a worldwide reference for human genetic variation and is already the most extensively used resource for increasing the coverage of the human genome in studies based on low-coverage microarray chips. This increase in coverage relies on estimating the probability of observing an allele of a SNP at a given location in human DNA, given that the allele of a nearby SNP is known, a process popularly known as imputation. Studies that use large-scale microarray chips, also known as genome-wide association studies (GWAS), typically genotype fewer than 1 million SNPs in a few thousand cases and controls combined. By using 1KGP as a reference panel, most GWAS can increase their coverage 5- to 10-fold, thereby improving our ability to detect the true functional genetic variants that underlie a given association.
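The principle behind imputation can be illustrated with a toy example. The sketch below is a deliberate simplification: it estimates the probability of an allele at an untyped SNP given the allele at a single nearby typed SNP by counting haplotypes in a small reference panel. The panel, the alleles and the function names are all made up for illustration and are not 1KGP data; real imputation software models many markers jointly rather than one pair at a time.

```python
from collections import Counter

# Hypothetical reference panel: each haplotype is a pair of alleles at
# (typed SNP, untyped SNP). Values are illustrative only.
REFERENCE_HAPLOTYPES = [
    ("A", "C"), ("A", "C"), ("A", "C"), ("A", "T"),
    ("G", "T"), ("G", "T"), ("G", "T"), ("G", "C"),
]


def conditional_allele_probs(haplotypes, typed_allele):
    """Estimate P(untyped allele | typed allele) by counting panel haplotypes."""
    matching = [untyped for typed, untyped in haplotypes if typed == typed_allele]
    if not matching:
        raise ValueError(f"allele {typed_allele!r} not seen in the reference panel")
    counts = Counter(matching)
    return {allele: n / len(matching) for allele, n in counts.items()}


def impute(haplotypes, typed_allele):
    """Return the most probable untyped allele and its estimated probability."""
    probs = conditional_allele_probs(haplotypes, typed_allele)
    best = max(probs, key=probs.get)
    return best, probs[best]


probs_given_a = conditional_allele_probs(REFERENCE_HAPLOTYPES, "A")
best_allele, best_prob = impute(REFERENCE_HAPLOTYPES, "A")
print(probs_given_a)           # {'C': 0.75, 'T': 0.25}
print(best_allele, best_prob)  # C 0.75
```

In this toy panel, a study sample carrying allele A at the typed SNP would be assigned allele C at the untyped SNP with an estimated probability of 0.75. A larger and more diverse reference panel gives more haplotypes to count, which is why panel size matters for imputation accuracy.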

Detection of novel variants

Several other applications have recently emerged (6). For instance, 1KGP is now widely used to establish the novelty of variants in many exome sequencing and cancer genome sequencing projects.
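A novelty check of this kind amounts to looking up each observed variant in a catalogue of known variants. The following is a minimal sketch under stated assumptions: the `Variant` type, the catalogue contents and the coordinates are hypothetical placeholders rather than real 1KGP records, and a variant is treated as known only if chromosome, position, reference allele and alternate allele all match.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical stand-in for a catalogue loaded from a reference resource.
KNOWN_VARIANTS = {
    Variant("1", 1000, "A", "G"),
    Variant("1", 2500, "C", "T"),
    Variant("2", 4200, "G", "GA"),
}

# Hypothetical variants called in a sequencing study.
OBSERVED_VARIANTS = [
    Variant("1", 1000, "A", "G"),  # exact match in the catalogue
    Variant("1", 2500, "C", "A"),  # same site, different alternate allele
    Variant("3", 7700, "T", "C"),  # site absent from the catalogue
]


def split_by_novelty(observed, known):
    """Partition observed variants into novel and previously catalogued ones."""
    novel = [v for v in observed if v not in known]
    catalogued = [v for v in observed if v in known]
    return novel, catalogued


novel, catalogued = split_by_novelty(OBSERVED_VARIANTS, KNOWN_VARIANTS)
print(len(novel), len(catalogued))  # 2 1
```

Matching on the full tuple means that a new alternate allele at a known site still counts as novel. In practice, variants would also need to be represented in a consistent, normalised form in both sets before such a comparison is meaningful.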

Detection of variants under natural selection

Furthermore, variants or regions from such resequencing projects can then be evaluated for the strength of the purifying selection operating on them, using the rare variant information in the surrounding region from 1KGP. This information may be integrated with information from GWAS, expression studies and multiple-species sequences to better understand the potential pathogenicity of a given variant or a potential regulatory variant (7).

Understanding population history

Another important application of 1KGP has been its utility in determining the history of human migrations (8).

Efficient assessment of transcriptomics data

The availability of a well-characterised reference genome has enabled rapid alignment of reads from RNA-seq studies without the need for prior knowledge of gene and transcript sequences (9). The information generated from RNA-seq data and functional regulatory data from the ENCODE project has been further exploited in recent years to understand the molecular mechanisms behind expression quantitative trait loci (eQTLs).

Major findings

A typical genome:

1. Differs from another genome at up to 5 million sites;
2. Has > 99.9% of its variants as SNPs and short indels, plus up to 2500 structural variants, including around 1000 large deletions and 160 copy number variants;
3. Has only 10% of variants with a frequency > 5%;
4. Has on average 150 protein-truncating variants, 11,000 missense variants and 500,000 variants in regulatory regions;
5. Has approximately 2000 variants known to be associated with complex traits, with around 24-30 variants implicated in rare diseases (this number may have increased multifold in recent years with the growing number of hits in each new GWAS).

Population-specific information:

Most common variants are shared across different ethnic populations.
All humans share a common demographic history beyond 150,000-200,000 years ago.
Less than 10% of variants show large frequency differences among populations.
A common variant has 15-20 tagging variants (r² > 0.8) in non-African populations, compared with 8 in African populations.
African populations have more genetic diversity, with a greater proportion of continental and population-specific variants.
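The r² threshold used to define a tagging variant can be made concrete with a small calculation. The sketch below uses the standard definition of r² for two biallelic SNPs, r² = D² / (pA(1 - pA) pB(1 - pB)) with D = pAB - pA pB, computed from phased haplotypes coded as 0/1. The haplotype counts are invented for illustration and the function names are assumptions, not part of any 1KGP tool.

```python
def ld_r_squared(haplotypes):
    """Compute r^2 between two biallelic SNPs from (a, b) pairs of 0/1 alleles."""
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p_a = sum(a for a, _ in haplotypes) / n  # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n  # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def is_tag(haplotypes, threshold=0.8):
    """Return True if the pair exceeds the r^2 threshold quoted in the text."""
    return ld_r_squared(haplotypes) > threshold


# Illustrative haplotype samples (not real data).
strong_ld = [(1, 1)] * 10 + [(0, 0)] * 9 + [(0, 1)] * 1
weak_ld = [(1, 1)] * 3 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 3

print(round(ld_r_squared(strong_ld), 3), is_tag(strong_ld))
print(round(ld_r_squared(weak_ld), 3), is_tag(weak_ld))
```

With these made-up samples, the first pair has an r² of about 0.82 and would count as a tagging pair, while the second has an r² of about 0.04 and would not. Fewer tagging variants per common variant, as reported for African populations, corresponds to fewer SNP pairs clearing this threshold.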

Other similar ongoing projects

UK10K project (10)

Approximately 4000 healthy individuals from the UK
Low coverage whole genome sequencing

100,000 Genomes Project (11)

100,000 genomes from 85,000 patients from the UK
Individuals suffering from rare disease or cancer

Haplotype Reference Consortium (HRC) (12)

An imputation panel based on 32,470 samples
Thirty-nine million SNPs with 64,976 haplotypes
Combined individuals from several low coverage whole genome studies
Individuals predominantly of European ancestry
Includes individuals from 1KGP and UK10K
Produced accurate imputation even at low minor allele frequencies (< 1%)

The future of human genome reference panels

A GWAS of vertical cup-disc ratio was recently performed using both 1KGP- and HRC-based imputation (13). The study demonstrated that HRC imputation significantly improved the P-values and led to the discovery of a greater number of significant variants. In summary, more and more GWAS are expected to use HRC and other large reference panels in the near future. This shift will increase the statistical power to discover novel common variants as well as low-frequency variants, which could in turn lead to the discovery of new genes.

References

Sandeep Grover

As a geneticist with a statistical background, I have been actively involved in studying the influence of genetics on drug response.

I have now gone from the specific to the general, and my interest in the field is deep, abiding and long term. I hope to be recognised in my field for a strong background in epidemiology, statistics and clinical research.

My current interests include the use of Mendelian randomization to uncover causal associations of biomarkers.