Overview Of PhD Research

Guiding Statement

I propose a system for discovery and evaluation of evidence-supported relationships between genes and diseases. The focus of this research will be on linking human transcription factors to brain disease.

Example Result

A list of brain diseases associated with a gene, with number of supporting articles and p-values. For example:

TP53 is associated with Brain Neoplasms, supported by 27 PubMed references. The p-value for this association is 8.46E-53.

Preliminary Prototype

For the preliminary prototype, I have chosen Gene Ontology (GO) annotated transcription factors in Entrez Gene as a source of genes, PubMed articles as the source of experimental evidence and Medical Subject Headings (MeSH) terms as a vocabulary of brain diseases.

To link the genes to the evidence, I use the NLM Gene Reference into Function (GeneRIF) annotations stored in Entrez Gene. GeneRIFs describe the "function" (liberally including basic biology such as isolation, structure, genetics in addition to biological function) of a gene and link it to supporting PubMed evidence. These links are curated by NLM staff during indexing of PubMed articles, and can also be submitted for review to the NLM by the general public.
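
As a minimal sketch of this linking step, the snippet below builds a gene-to-article map from a tab-separated GeneRIF export. The five-column layout (taxonomy ID, Gene ID, comma-separated PubMed ID list, timestamp, GeneRIF text) is an assumption modelled on NCBI's downloadable GeneRIF file and should be checked against the actual download. The function name `gene_to_pmids` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Set

# Made-up sample rows in the assumed five-column layout. The PubMed IDs,
# timestamps and GeneRIF texts are placeholders, not real records.
SAMPLE_GENERIFS = (
    "#Tax ID\tGene ID\tPubMed ID (PMID) list\tlast update timestamp\tGeneRIF text\n"
    "9606\t7157\t111,222\t2010-01-01 00:00\tplaceholder text A\n"
    "9606\t7157\t222\t2010-01-02 00:00\tplaceholder text B\n"
    "10090\t99999\t333\t2010-01-03 00:00\tplaceholder text C\n"
)


def gene_to_pmids(lines: Iterable[str], tax_id: str = "9606") -> Dict[str, Set[int]]:
    """Map each Gene ID of one species to the PubMed IDs cited by its GeneRIFs.

    tax_id defaults to 9606, the NCBI taxonomy ID for human.
    """
    links: Dict[str, Set[int]] = defaultdict(set)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, the header line and malformed rows.
        if not row or row[0].startswith("#") or len(row) < 3:
            continue
        if row[0] != tax_id:
            continue
        # Assumed: one GeneRIF may cite several articles, separated by commas.
        for pmid in row[2].split(","):
            if pmid.strip().isdigit():
                links[row[1]].add(int(pmid))
    return dict(links)


links = gene_to_pmids(io.StringIO(SAMPLE_GENERIFS))
```

Storing the PubMed IDs as a set per gene means an article cited by several GeneRIFs of the same gene is counted once as supporting evidence.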

NLM indexers index PubMed articles using MeSH terms. MeSH terms are organised into categories, such as Anatomy, Organism, Chemical Compounds, etc. Within each category the terms form a tree, with the most general terms at the top connecting down to the more specific terms. A term can occur in several places in the MeSH tree, as appropriate: for example, the term "Brain Neoplasms" occurs under the more general terms "Central Nervous System Neoplasms" and "Brain Diseases". For the prototype, all PubMed articles labelled with the term "Brain Diseases" or one of its subterms were considered.
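
The "term or subterm" selection can be sketched as a prefix test on tree positions. This assumes each term carries one or more dotted tree numbers in which a descendant's number extends its ancestor's. The tree numbers below are illustrative placeholders, not values taken from a MeSH release, and the helper names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy data: term -> tree numbers (placeholders, not real MeSH values).
# "Brain Neoplasms" appears twice, once under each of its parent terms.
TREE_NUMBERS: Dict[str, List[str]] = {
    "Brain Diseases": ["C10.1"],
    "Brain Neoplasms": ["C10.1.2", "C04.5.1"],
    "Central Nervous System Neoplasms": ["C04.5"],
    "Lung Neoplasms": ["C04.7"],
    "Unrelated Term": ["C10.12"],
}


def is_under(tree_number: str, root: str) -> bool:
    """True if tree_number is root itself or lies below it in the tree.

    The trailing "." avoids treating "C10.12" as a child of "C10.1".
    """
    return tree_number == root or tree_number.startswith(root + ".")


def terms_under(root_term: str, tree_numbers: Dict[str, List[str]]) -> Set[str]:
    """Return root_term together with every term located beneath it."""
    roots = tree_numbers[root_term]
    return {
        term
        for term, numbers in tree_numbers.items()
        if any(is_under(n, r) for n in numbers for r in roots)
    }


def articles_under(
    root_term: str,
    article_terms: Dict[int, List[str]],
    tree_numbers: Dict[str, List[str]],
) -> Set[int]:
    """Select the articles labelled with root_term or any of its subterms."""
    wanted = terms_under(root_term, tree_numbers)
    return {pmid for pmid, terms in article_terms.items() if wanted & set(terms)}


brain_terms = terms_under("Brain Diseases", TREE_NUMBERS)
```

Because a term is included when any one of its positions falls under the root, a term that appears in several places in the tree is still selected only once.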

These relationships allow the extraction of direct links between transcription factor genes and brain diseases. By considering not only whether a link exists but also the number of supporting PubMed articles, we can perform basic statistical tests. These tests yield a p-value that measures the significance of the link (how likely is it to have occurred by chance, given how many PubMed papers are examined?) and a measure of its relevance (how strong is the link to this particular brain disease?).
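
The text does not name the specific test, so the following is only one plausible sketch. It assumes a one-sided hypergeometric (over-representation) test for significance and a fold-enrichment ratio for relevance. All function and parameter names are hypothetical, and the counts in the usage line are toy values.

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_upper_tail(total: int, disease: int, gene: int, overlap: int) -> float:
    """P(X >= overlap) for X ~ Hypergeometric(total, disease, gene).

    total   : all articles considered
    disease : articles labelled with the disease term
    gene    : articles linked to the gene
    overlap : articles that are both linked to the gene and labelled with the disease
    """
    hi = min(disease, gene)
    lo = max(overlap, 0, gene + disease - total)
    if lo > hi:
        return 0.0
    log_denom = log_comb(total, gene)
    log_terms = [
        log_comb(disease, k) + log_comb(total - disease, gene - k) - log_denom
        for k in range(lo, hi + 1)
    ]
    # Sum in log space so that very small tail probabilities do not underflow early.
    peak = max(log_terms)
    return min(1.0, math.exp(peak) * sum(math.exp(t - peak) for t in log_terms))


def fold_enrichment(total: int, disease: int, gene: int, overlap: int) -> float:
    """Observed share of the gene's articles on the disease, relative to the background share."""
    return (overlap / gene) / (disease / total)


# Toy counts: 10 articles in total, 4 on the disease, 3 on the gene, 2 shared.
p_value = hypergeom_upper_tail(total=10, disease=4, gene=3, overlap=2)
```

Under these assumptions, the p-value answers "how surprising is this overlap by chance?", while the fold enrichment answers "how concentrated are the gene's articles on this disease?". With many gene and disease pairs tested, the p-values would also need a multiple-testing correction.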

Abstract

Integrated approaches to the computational analysis of diverse data collections make it possible to predict links between genes and diseases. We focus on the analysis of biomedical literature for the identification of genes encoding DNA-binding transcription factors that play a previously unknown functional role in the pathology of one or more neurological diseases. Existing databases enumerating human transcription factors, online repositories of abstracts from the biomedical literature, and organized ontologies and vocabularies for both gene and disease annotation will be integrated. For example, over one thousand human genes in Entrez Gene are labeled as transcription factors via Gene Ontology (GO) terms. Using Medical Subject Headings (MeSH) terms, over half a million articles are identified as relevant to brain diseases in PubMed. To connect these data sources, we use both manually and automatically annotated linkages, such as the reviewed user-submitted Gene Reference into Function (GeneRIF) annotations in Entrez Gene and the computationally generated Related Articles from PubMed.

By distilling this interconnected network of relationships into an integrated database, we will provide a framework to identify known, direct relationships between transcription factors and brain diseases. This will also allow us to experiment with predicting novel relationships by the study of indirect linkages (e.g. transcription factor-characteristic and disease-characteristic intersections). Statistical scoring methodologies, such as over-representation analysis, will be developed to assess putative links between transcription factors and disease. Predictions will be verified using curated sources on gene-related diseases such as the Online Mendelian Inheritance in Man.

Resource Usage-focused Summary

Our approach integrates diverse data collections to allow computational prediction of links between genes and diseases. We focus on the analysis of biomedical literature for the identification of human genes that play a previously unknown functional role in the pathology of one or more neurological diseases. Human genes in Entrez Gene are connected to PubMed articles via annotated linkages, such as the reviewed user-submitted Gene Reference into Function (GeneRIF) annotations and the Gene2PubMed annotations. PubMed articles are connected to relevant keyword terms, including diseases, via Medical Subject Headings (MeSH) provided by the National Library of Medicine.

The analysis of over seven million disease-related PubMed articles with respect to over thirty thousand known and putative human genes requires efficient computational resources and substantial storage. To integrate the various data sources effectively, we couple an SQL database, for rapid and selective retrieval of relevant information, with the processing capabilities of a high-performance cluster. Together these allow efficient analysis of hundreds of gigabytes of data and reduce the equivalent of hundreds of days of computation to a few days.
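
To illustrate the kind of retrieval such a database supports, here is a minimal sketch using an in-memory SQLite database. The table names, column names, gene labels and rows are hypothetical and chosen for illustration only. The actual schema and database engine are not specified in this overview.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database holding two link tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene_article (gene_id TEXT, pmid INTEGER, PRIMARY KEY (gene_id, pmid));
        CREATE TABLE article_term (pmid INTEGER, mesh_term TEXT, PRIMARY KEY (pmid, mesh_term));
        CREATE INDEX idx_article_term_term ON article_term (mesh_term);
        """
    )
    # Made-up rows: placeholder gene labels and article IDs, not real records.
    conn.executemany(
        "INSERT INTO gene_article VALUES (?, ?)",
        [("GENE_A", 1), ("GENE_A", 2), ("GENE_A", 3), ("GENE_B", 3)],
    )
    conn.executemany(
        "INSERT INTO article_term VALUES (?, ?)",
        [(1, "Brain Neoplasms"), (2, "Brain Neoplasms"), (3, "Alzheimer Disease")],
    )
    conn.commit()
    return conn


def count_links(conn: sqlite3.Connection) -> List[Tuple[str, str, int]]:
    """Count the supporting articles for every (gene, disease term) pair."""
    return conn.execute(
        """
        SELECT g.gene_id, t.mesh_term, COUNT(DISTINCT g.pmid) AS n_articles
        FROM gene_article AS g
        JOIN article_term AS t ON t.pmid = g.pmid
        GROUP BY g.gene_id, t.mesh_term
        ORDER BY n_articles DESC, g.gene_id, t.mesh_term
        """
    ).fetchall()


link_counts = count_links(build_demo_db())
```

Keeping the gene-to-article and article-to-term links in separate tables lets the database perform the join and the counting, so only the per-pair counts need to be passed on to the statistical scoring.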
