The UK Biobank project is an open-access, large-scale epidemiological project that prospectively collects data on a large number of phenotypes and genotypes across a large number of individuals (1). Specifically, the project aims to collect data on lifestyle, environmental and genomic determinants of important health conditions.
The data are being collected using the latest technological advances, and stringent quality control measures are being applied to clean the data. Open access here means that the collected data are made freely accessible to all researchers (for both academic and commercial purposes), subject to review and approval of each research proposal by the internal research committee of the UK Biobank (2).
Furthermore, the goal is to make all the generated results freely accessible to the scientific community through online databases and other sharing platforms, while respecting participant confidentiality, consent and data protection regulations. This should lead to a better understanding of various biological pathways and the diseases they underlie, thereby promoting the development of new drugs and personalized medicine. In brief, the following inclusion criteria were implemented at the start of data collection in 2006:
- Study design: Prospective cohort with repeat assessments at different intervals for different phenotypes
- Residence: UK resident
- Age group: 40-69 years
- Gender: Both genders
Hence, the UK Biobank is considered to be an epidemiological study of middle-aged and elderly people of predominantly British ancestry. Being a large prospective study, data collection was planned in several stages: a baseline assessment and a follow-up spanning more than 30 years. Depending on the study design, the availability of resources and the development of new collaborations, different types of phenotypic and genotypic data were collected at different stages.
Data collection at baseline
- Sample collection at baseline: Blood, urine and saliva samples
- A comprehensive questionnaire/interview on a wide range of phenotypes (sociodemographic, family history and early life, psychosocial factors, lifestyle, medical history and cognitive function), together with data collection on important clinical characteristics including anthropometric and blood pressure measurements.
- Duration of baseline data collection: five years (2006-2010)
- Follow-up duration: >30 years
More than half a million participants were enrolled during the baseline data collection.
Characteristics of enrolled participants
- Number of individuals enrolled: >500,000
- Ethnicity of enrolled individuals: 94.0% white with self-reported European ancestry (88.0% British; 3.2% other white background; 2.6% Irish); 2.0% South Asian.
In line with the spirit of an open resource and the goal of promoting rapid research, interim data are periodically released by the UK Biobank. However, the timing of each release depends on the stage of data collection for the different phenotypes and genetic data, and on the effort required to clean the data and make them publicly available. According to the latest release of data on disease incidence at the follow-up stages, a considerable proportion of the population showed an alarmingly high incidence of cancer.
Follow-up data collection (completed and ongoing)
- Incidence of diseases at follow-up (as per the data collected up to 2016): Chronic obstructive pulmonary disease: 17,600; Myocardial infarction: 8,000; Stroke: 7,000; Dementia: 4,300; Different types of cancers: >20,000; Deaths: >10,000
- Imaging data collection (in up to 100,000 individuals at follow-up stages): MRI: brain, heart and body; X-ray: bones and joints; Ultrasound: carotid arteries (ongoing)
- Functional tests (in up to 100,000 individuals at follow-up stages): heart (electrocardiogram using a heart monitoring device worn for two weeks), lung, eyes (visual acuity, auto-refraction and intraocular pressure data), ears (hearing tests), cognitive ability, physical activity (tri-axial accelerometer) (ongoing)
- Genetic data:
  - Genome-wide genotyping (completed): 805,426 quality-controlled genotyped variants (single nucleotide polymorphisms and insertion-deletion polymorphisms, including 110,000 rare variants) in 488,377 samples (Affymetrix platform). 147,731 related participants (30.3%; third-degree relatives or closer) were identified through the genetic data. Annotation: GRCh37 assembly of the human genome. Imputation: more than 96 million variants using the Haplotype Reference Consortium (HRC) reference panel.
  - Exome sequencing: all 500,000 participants are being exome sequenced (ongoing)
- All the data have been linked to electronic health records and to death and cancer registries for retrieval of information on additional phenotypes (completed)
- Web-based questionnaires are being administered to collect data on diet, cognitive function, occupational history and mental health (ongoing)
The use of both the UK Biobank’s source data and the results generated from them has grown exponentially in the last few years, as is evident from the published literature and the increasing number of approved research proposals listed on the UK Biobank website (3). As expected, in line with the prevalence of heart disease, cardiovascular outcomes have been the most widely explored.
- >15,000 researchers from more than 1,000 institutes have access to UK Biobank data, with over 1,500 ongoing or completed projects
- >1,000 published articles
- Most commonly explored research areas: Cardiovascular; Metabolic and Endocrine; Cancer; Mental health
- Most commonly requested data: death and cancer (95%), genetics (73%), hospital admissions (63%), imaging (30%)
- Freely accessible research outputs:
  - GWAS results for 118 non-binary traits and 660 binary traits (4)
  - Brain imaging genetics (5)
Important scientific findings
Several important findings have emerged from the interim data releases by the UK Biobank for different phenotypes, as well as from the complete genome-wide genotyping data. For instance, a recent study of 474,129 individuals showed that cognitive performance was poorer in individuals with a greater number of cardio-metabolic diseases (6). Another study in a subset of 78,947 individuals showed that greater physical activity was associated with lower adiposity (7). The same study also showed that accelerometer-based measurements provided a better estimate of physical activity than questionnaire-based self-reports. Another study showed that a later bedtime was associated with increased mortality (8).
Lastly, the availability of large-scale genetic data has enabled the discovery of new underlying biological and causal pathways. For instance, a recent study identified several genetic variants in the excitatory synaptic pathways underlying depression (9), paving the way for the exploration of new drug targets. A recent Mendelian randomisation study combined UK Biobank data with Norway’s large-scale HUNT cohort to show a J-shaped relationship between BMI and all-cause mortality, with the lowest risk in the BMI range of 22-25 kg/m².
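To illustrate the general idea behind Mendelian randomisation mentioned above, here is a minimal sketch of the inverse-variance weighted (IVW) estimator, which pools per-variant ratios of the variant-outcome effect to the variant-exposure effect. It is not the method used in the cited study: the function name `ivw_estimate` and all numbers are hypothetical, and the sketch assumes independent genetic variants and ignores uncertainty in the variant-exposure effects.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW Mendelian randomisation estimate (illustrative only).

    beta_exposure: per-variant effects on the exposure (e.g. a biomarker)
    beta_outcome:  per-variant effects on the outcome
    se_outcome:    standard errors of the variant-outcome effects
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")

    # Weighted regression of outcome effects on exposure effects through
    # the origin, with weights 1 / se_outcome^2.
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error


# Hypothetical summary statistics for three variants (not UK Biobank data).
bx = [0.10, 0.20, 0.05]
by = [0.02, 0.04, 0.01]
se = [0.01, 0.01, 0.01]

est, est_se = ivw_estimate(bx, by, se)
print(f"IVW causal estimate: {est:.3f} (SE {est_se:.3f})")
```

In this toy input every variant has the same outcome-to-exposure ratio, so the pooled estimate equals that ratio. With real data the ratios differ, and sensitivity analyses for pleiotropy would normally accompany the IVW estimate.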
The alarmingly high incidence of cancer among UK Biobank participants compared with the general population may reflect a bias in the enrolment of participants. In such a scenario, all research results must be carefully scrutinised using adequate stratified sensitivity analyses. It has also long been suggested that the research benefits of the UK Biobank data may not reach people from all ethnic backgrounds, as most of the samples were drawn from the white population. One of the main reasons for the ensuing debate is that the project has been funded by public money, with contributions coming from all ethnic groups. Hence, inclusion of ethnic minorities in such large-scale projects is clearly an important priority (10).
As a geneticist with a statistical background, I have been actively involved in studying the influence of genetics on drug response.
I have now moved from the specific to the general, and my interest in the field is deep, abiding and long-term. I hope to be recognised in my field for a strong background in epidemiology, statistics and clinical research.
My current interests include the use of Mendelian randomisation to uncover causal associations of biomarkers.