
Informatics for Epidemiology

A collaborative project between the Cambridge Genetics Knowledge Park/Public Health Genetics Unit (now the PHG Foundation) and the company Linguamatics, funded by the East of England Development Agency, aimed to investigate the feasibility of informatics-based approaches for undertaking secondary research in epidemiology, and in particular for the identification and interrogation of gene-disease association studies.

The challenge of explaining the genetic determinants of complex diseases has already led to a substantial body of literature on associations between genetic variants and disease risk. With some tens of thousands of genes, each with numerous variants, and hundreds of common diseases, compiling and disseminating this information is a daunting task. Despite large-scale investment in bioinformatics for the management of biological and genetic information, and in clinical informatics for the management of patient information in clinical practice, there has been almost no investment in an informatics infrastructure for linking genes with disease.

At present, attempts to organise the literature on gene-disease association studies revolve around systematic reviews. These reviews involve searches of existing literature (for example using PubMed), collation of reports, extraction of information, assessment of potential biases and synthesis of results. All these stages are conducted manually, and are time-consuming and resource-intensive. In future, systematic reviews that rely on conventional, largely keyword-based manual search methods are unlikely to be sufficient to cover the explosion in epidemiological research.

The project had considerable potential to improve the integration of genetic, environmental and clinical data, advancing the understanding of disease and supporting commercial exploitation of research. Ultimately, this could lead to the development of more effective therapies for patients.

Linguamatics, a leading provider of text-mining software to the pharma-biotech industry, had already developed I2E, innovative software based on Natural Language Processing (NLP) for mining information from unstructured text. NLP enables the system to “understand” the grammatical structure and context of information in free text, giving much higher-quality results than conventional keyword search. Relevant facts are extracted, or mined, from documents and presented directly to users, who no longer have to read painstakingly through large quantities of documents to find the information. This project looked at the feasibility of tailoring the I2E platform for epidemiological applications: extracting and organising appropriate evidence from substantial collections of information, pinpointing relevant facts within extracted articles (for example, details of study sample, methods and results) and storing them in a standard format. The epidemiology team at the Cambridge Genetics Knowledge Park (CGKP) contributed expertise in defining epidemiological concepts and their interrelationships and in building taxonomies and thesauri, and investigated how these could best be applied to epidemiological papers and to restricted sources of information such as abstracts.
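The difference between keyword search and fact extraction can be illustrated with a deliberately simplified sketch. It is not I2E and does not reflect how I2E works internally: it uses a single hand-written regular expression in place of real NLP, and the gene list, disease list, example sentences and function names are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical mini-vocabularies; a real system would rely on curated
# taxonomies and thesauri rather than three hard-coded terms each.
GENES = ["APOE", "BRCA1", "MTHFR"]
DISEASES = ["Alzheimer's disease", "breast cancer", "stroke"]

# Invented example sentences, not taken from any publication.
SENTENCES = [
    "APOE was associated with Alzheimer's disease in this cohort.",
    "BRCA1 was not associated with stroke.",
    "Patients with breast cancer were genotyped for MTHFR.",
]


def keyword_search(sentences, gene, disease):
    """Return every sentence mentioning both terms, whatever their relation."""
    return [s for s in sentences if gene in s and disease in s]


# One toy pattern: "<gene> was [not] associated with <disease>".
PATTERN = re.compile(
    r"(?P<gene>{genes}) was (?P<neg>not )?associated with (?P<disease>{diseases})".format(
        genes="|".join(map(re.escape, GENES)),
        diseases="|".join(map(re.escape, DISEASES)),
    )
)


def extract_associations(sentences):
    """Return structured records for sentences that state an association."""
    records = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            records.append({
                "gene": match.group("gene"),
                "disease": match.group("disease"),
                # False when the sentence explicitly negates the association.
                "associated": match.group("neg") is None,
            })
    return records


if __name__ == "__main__":
    print(keyword_search(SENTENCES, "MTHFR", "breast cancer"))
    for record in extract_associations(SENTENCES):
        print(record)
```

In this toy example the keyword search returns the third sentence because both terms occur in it, even though it reports no association, whereas the pattern-based extraction skips that sentence and records the second sentence as a negative finding. Each extracted record has the same fields, which is the sense in which facts can be stored in a standard format.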

The project addressed the following questions:

  1. What resources are available for identifying potentially eligible studies for systematic reviews in this area?
  2. To what extent can identification of studies, classification of their characteristics and extraction of information be undertaken using text-mining techniques, and what other benefits can text-mining offer in this field?
  3. What are the key obstacles (epidemiological, technological and policy-related) to implementing any successful approaches identified under question (2)?

Outputs from the project included:

 

 

  Related websites: Main HuGENet site | Greece Coordinating Centre | Canada Coordinating Centre
