Database Search And Evaluation System

Working Title

Integrated Database Evidence Search and Evaluation System for Disease-related Transcription Factors

Lay Summary

The PubMed/Medline database of health-related research has increased by (add some stats here for rate of articles) … This points to a large amount of health research already done, and new information continues to add to our knowledge daily. However, for the information to be useful, it needs to be easy to access. One deficiency is that much of it is scattered across many different databases, requiring health professionals to spend valuable time and resources collecting the information, let alone evaluating it. The system I will describe is designed to search different sources for specific kinds of information, and in this specific research area, it will be able to use information … In particular, we will look to see whether particular genetic traits are related to specific functions, returning evidence of this association as well as predicting the strength of the evidence.

But the information is not easy to get to - it is stored in different formats in different databases and published in many different journals.

Searching for evidence will combine simple text-based search, like that of traditional search engines, with the information already extracted in databases. By restricting ourselves to particular forms of genetic traits and specific functions, we can use

-use ontologies - collections of words with relations between them

Research Summary

I propose a system that will allow a researcher to test a gene for specific properties. The system will collect
evidence and compute a confidence score. This will be accomplished by searching for evidence in annotated
database entries and mining free text, looking for direct evidence as well as similarity-based indirect evidence.
Initially, we shall focus on identifying disease-related transcription factors. The experimental evidence will
initially take the form of PubMed/Medline publications. The results of the system will therefore provide not only a
quantitative confidence that the gene of interest is linked to disease and performs a particular function, but also
the evidence used in the prediction, which the researcher may subsequently verify.
Direct annotations with references will be found in curated sequence databases such as RefSeq and Swiss-Prot.
Text mining for the properties in question (e.g. “transcription factor”, “SCA7”) will be performed on the
abstracts available from PubMed itself, as well as free-text databases annotated with PubMed references, such as
OMIM.
Indirect evidence will be generated by looking at related properties (e.g. “DNA-binding”, “Purkinje cell
degeneration»), using the previously mentioned search strategies as well as other appropriate databases, such as
molecular interaction databases like BIND. Evidence for a gene will be supplemented by evidence in similar genes,
such as homologues in other species, found using resources such as HomoloGene.
The system will unify these disparate sources of data, and use a machine-learning prediction algorithm to
integrate all the evidence found for a gene into a final score. This tool aims to support scientists in their research by
providing not only predictions on whether the gene in question is linked with specific functions and diseases, but also
verifiable supporting evidence for these predictions.

Research Proposal

The examination of cells under a control condition and an experimental condition, involving
extrinsic conditions such as temperature, or intrinsic conditions such as the age of the cells, often
reveals a change in cellular processes. A key determinant of cellular behaviour is the transcription of
genes from template DNA to messenger RNA (mRNA). High-throughput techniques,
such as DNA microarrays and SAGE for mRNA expression, and ChIP-chip experiments for protein
binding to DNA, allow us to investigate not just one or two carefully chosen genes, but numbers on
the order of all the genes of a chromosome at one time.
To filter these results in a meaningful manner, it is no longer feasible for the experimenter to
consider each gene individually. The system I propose will allow the researcher to formulate a
hypothesis about the properties of the genes. It then collects evidence in support of this hypothesis that
can be further investigated and compared with experimental results. Finally, it will generate a
prediction based on the strength of the evidence gathered. The initial version of this tool will deal with
one specific property – whether a gene encodes a transcription factor. The system will search for
evidence by examining annotated database entries as well as mining free text, searching for direct
evidence as well as similarity-based indirect evidence.
The most direct evidence we shall consider consists of annotations in curated sequence databases, such
as RefSeq and SWISS-PROT, indicating that a gene is a transcription factor, supported by
experimental evidence in the form of PubMed/MEDLINE publication database references. Genes with
this evidence will serve as a reference set of known transcription factors, used to validate the
prediction results obtained from all the other forms of evidence.
Searching for co-occurrence of the gene name with the property (“transcription factor”) in free
text, in the manner of traditional search tools, will be another method of generating evidence. Initially,
we shall perform text mining on PubMed abstracts; however, this can also be extended to other free-text
databases such as OMIM, which provides free-text information annotated with
PubMed references.
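As a minimal sketch of the co-occurrence search described above, the following Python fragment finds abstracts that mention both a gene name and a property term. The function name, the gene symbol "GENEX", the abstract texts and their identifiers are all hypothetical and stand in for real PubMed records; whole-word, case-insensitive matching is an assumption, not a settled design decision.

```python
import re

def cooccurring_abstracts(abstracts, gene_name, property_term):
    """Return the ids of abstracts mentioning both the gene and the property."""
    # Whole-word, case-insensitive matching; an optional plural "s" is
    # allowed on the property term (an illustrative simplification).
    gene_re = re.compile(r"\b" + re.escape(gene_name) + r"\b", re.IGNORECASE)
    prop_re = re.compile(r"\b" + re.escape(property_term) + r"s?\b", re.IGNORECASE)
    hits = []
    for abstract_id, text in abstracts.items():
        if gene_re.search(text) and prop_re.search(text):
            hits.append(abstract_id)
    return hits

# Hypothetical abstracts keyed by made-up identifiers, for illustration only.
abstracts = {
    "A1": "GENEX is a transcription factor that binds DNA in neurons.",
    "A2": "Expression of GENEX was reduced in aged cells.",
    "A3": "Several transcription factors regulate this pathway.",
}

hits = cooccurring_abstracts(abstracts, "GENEX", "transcription factor")
print(hits)
```

Returning the matching identifiers rather than a bare count keeps each hit traceable, so the researcher can go back to the abstract and verify the evidence.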
Sub-properties form the first kind of indirect evidence – terms such as “DNA-binding” and
“transactivation” describe aspects of transcription factors. Sub-properties may be annotated in other
databases – for example, molecular interactions will be annotated in databases like BIND – and can
also be searched as before in free text. Gene Ontology (GO) terms applicable to transcription factors
will be chosen, supplemented by synonyms and mappings to other vocabularies such as SWISS-PROT
keywords and PubMed MeSH terms.
In addition to the above forms of direct and indirect evidence for a gene, the same evidence for
similar genes will provide additional, weaker evidence. Similar genes include homologues in other
species, found using resources such as HomoloGene, and genes with low BLAST E-values. Genes
having similar domains, or sharing an overlap of GO terms, are other forms of similarity which can be
considered as needed.
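One way the GO-term overlap and the weakening of evidence from similar genes might be computed is sketched below. The Jaccard index and the multiplicative down-weighting are assumptions chosen for illustration, not methods fixed by this proposal, and the term labels and scores are hypothetical.

```python
def go_term_overlap(terms_a, terms_b):
    """Jaccard index of two GO term sets: shared terms / all distinct terms."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        # A gene with no annotations shares nothing measurable.
        return 0.0
    return len(a & b) / len(a | b)

def transferred_evidence(neighbour_score, similarity):
    """Down-weight a similar gene's evidence score by its similarity (0 to 1)."""
    return neighbour_score * similarity

# Hypothetical term labels standing in for real GO identifiers.
gene_terms = {"T1", "T2", "T3"}
homologue_terms = {"T2", "T3", "T4"}

similarity = go_term_overlap(gene_terms, homologue_terms)
# Assume the homologue's own evidence score is 0.8 (an illustrative value).
weak_evidence = transferred_evidence(0.8, similarity)
print(similarity, weak_evidence)
```

Because the similarity lies between 0 and 1, evidence borrowed from a similar gene can never count for more than the same evidence found directly for the gene of interest.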
The system will access an extended version of the Atlas Data Warehouse to integrate disparate
data sources such as the various sequence and molecular interaction databases. It will also provide a
framework for database and text searching, and for computing similarity scores. Finally, a machine-learning
prediction tool will integrate all the evidence found for a gene into a final score. Linear
weighting of the evidence types will be trained using the reference set of genes, with more complex
machine-learning techniques like neural networks or support vector machines used if needed.
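The linear weighting of evidence types could, for example, be trained as a logistic regression on the reference set. The sketch below assumes three evidence types per gene (direct annotation, text co-occurrence, similar-gene evidence), each already scaled to the range 0 to 1; the feature values, labels, learning rate and number of epochs are hypothetical, and logistic regression is only one of several ways to fit a linear weighting.

```python
import math

def score(weights, bias, features):
    """Combine evidence linearly and squash the sum to a 0-1 confidence."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, labels, epochs=500, learning_rate=0.5):
    """Fit one weight per evidence type by stochastic gradient descent."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            error = score(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical reference set: [direct annotation, co-occurrence, similar genes].
examples = [
    [1.0, 0.9, 0.7],  # known transcription factor
    [0.0, 0.6, 0.8],  # known transcription factor
    [0.0, 0.1, 0.2],  # not a transcription factor
    [0.0, 0.0, 0.1],  # not a transcription factor
]
labels = [1, 1, 0, 0]

weights, bias = train(examples, labels)
strong = score(weights, bias, [1.0, 0.8, 0.6])   # gene with much evidence
weak = score(weights, bias, [0.0, 0.05, 0.1])    # gene with little evidence
print(weights, bias, strong, weak)
```

The learned weights show how much each evidence type contributes to the final score, which keeps the prediction interpretable before moving to more complex models.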
This system is easily expandable by adding new evidence types, and can be made more general
by extending the vocabulary of properties. High-throughput techniques provide a wealth of
information – this tool aims to support scientists in their research by providing not only predictions, but
also verifiable supporting evidence.
