# P-values for significance of terms

For a given gene and disease term: suppose the gene cites *n* of the *N* articles in PubMed, that *k* of the cited articles are marked with the disease term, and that *K* articles in all of PubMed are marked with the disease term. These counts define a 2×2 contingency table:

| | articles marked with disease term | articles not marked with disease term |
|---|---|---|
| articles cited by the gene | $k$ | $n-k$ |
| articles not cited by the gene | $K-k$ | $N-K-n+k$ |

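The score is the probability of seeing *k* or more articles with the disease term among *n* articles drawn at random, without replacement, from the bag of *N* PubMed articles, *K* of which are marked with the disease term. This is the upper tail of the hypergeometric distribution (equivalently, a one-sided Fisher's exact test):

$$P(X \ge k) = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$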
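As a minimal sketch of this score, the tail probability can be computed with exact integer arithmetic from the standard library. The function name `hypergeom_upper_tail` and the toy numbers are illustrative assumptions, not part of the original notes.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n articles are drawn without replacement from N,
    of which K are marked with the disease term."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    lo = max(k, 0, n - (N - K))  # smallest feasible count that is >= k
    hi = min(n, K)               # largest feasible count
    # Sum exact integer counts first and divide once to limit rounding error.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1))
    return favourable / comb(N, n)


if __name__ == "__main__":
    # Toy numbers (hypothetical): 1000 articles, 50 with the term,
    # a gene citing 20 articles of which 5 carry the term.
    print(hypergeom_upper_tail(k=5, n=20, K=50, N=1000))
```

For PubMed-scale counts the binomial coefficients become very large integers; if SciPy is available, `scipy.stats.hypergeom.sf(k - 1, N, K, n)` should give the same tail probability more cheaply.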

# Scoring for gene-MeSH profiles, disease-MeSH profiles

Let $Pr_1$ be the probability (as above) for the MeSH term associating with the gene.

Let $Pr_2$ be the probability (as above) for the MeSH term associating with the disease.

The probability that *either* of these occurred randomly is (assuming independence)

$$1 - (1 - Pr_1)(1 - Pr_2) = Pr_1 + Pr_2 - Pr_1 Pr_2$$

This can then be used to weight the strength of the evidence that the two profiles are equivalent. For that, we could do an unpaired t-test comparing the fraction of articles with the MeSH term among articles cited by the gene with the fraction of articles with the MeSH term among articles cited by the disease.

This would give a "probability that the two terms did not occur randomly and that the ratios are similar".
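The comparison of the two fractions could be sketched as follows. Note that this swaps in a pooled two-proportion z-test, a large-sample stand-in for the unpaired t-test mentioned above, because it needs only the standard library. The function name and the counts are hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for equality of the fractions x1/n1 and x2/n2.

    x1 of the n1 gene articles and x2 of the n2 disease articles
    carry the MeSH term. Returns (z, p_value).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("both article counts must be positive")
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:
        # Both fractions are 0 or both are 1: no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided standard normal tail
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 12 of 80 gene articles vs. 30 of 250 disease articles.
    print(two_proportion_z_test(12, 80, 30, 250))
```

A small p-value here indicates that the two fractions differ, so under this sketch similar ratios correspond to a large p-value; how to turn that into a similarity weight is left open.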

This would then be done for each of the terms, and the resulting probabilities combined using something like Fisher's method for combining p-values.
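A minimal sketch of the combination step, assuming Fisher's method and assuming the per-term tests are independent (which correlated MeSH terms may well violate). The statistic $X = -2\sum_i \ln p_i$ follows a chi-squared distribution with $2m$ degrees of freedom under the null, and for even degrees of freedom its tail has a closed form. The function name is illustrative.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues) -> float:
    """Combine m independent p-values with Fisher's method."""
    ps = list(pvalues)
    if not ps or any(not (0.0 < p <= 1.0) for p in ps):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    m = len(ps)
    half_x = -sum(log(p) for p in ps)  # X / 2
    # Chi-squared survival function with 2m degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


if __name__ == "__main__":
    # Hypothetical per-term p-values.
    print(fisher_combined_pvalue([0.01, 0.20, 0.03]))
```

P-values that have underflowed to exactly zero are rejected here; in practice they would need a floor or a log-space calculation.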

Other methods would include adapting methodologies used for GO term annotation comparison (e.g. Resnik's semantic similarity).

# Desirable Properties of our comparison/scoring metric

- penalise low annotation level/reward high annotation level (papers, MeSH terms)
- penalise low overlap between gene and disease profile
- penalise/do not consider overly general/too many terms
- reward low p-values
- reward similarity in annotation level?
- explainable by statistical/mathematical theory (even if only reverse-engineered)

# Validation

- "ROC"-style curve
  - we don't have known negatives, so replace the false positive rate by the number of predictions
  - true positive rate on the y axis vs. number of predictions on the x axis
  - still want the ROC curve effect: we want known associations to rank highly
  - (name for this kind of curve?)
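The curve described in this list could be computed roughly as follows: walk down the ranked predictions and record the fraction of known positives recovered so far. The function name and the gene identifiers are hypothetical.

```python
def recall_vs_predictions(ranked_predictions, known_positives):
    """Return [(number_of_predictions, true_positive_rate), ...].

    ranked_predictions: items ordered from best to worst score.
    known_positives:    items known to be true associations.
    """
    known = set(known_positives)
    if not known:
        raise ValueError("need at least one known positive")
    curve = []
    seen = set()
    found = 0
    for rank, item in enumerate(ranked_predictions, start=1):
        # Count each known positive once, even if it is listed twice.
        if item in known and item not in seen:
            found += 1
        seen.add(item)
        curve.append((rank, found / len(known)))
    return curve


if __name__ == "__main__":
    ranked = ["geneA", "geneB", "geneC", "geneD"]  # hypothetical ranking
    print(recall_vs_predictions(ranked, {"geneA", "geneC"}))
```

A good scoring function makes this curve rise steeply, i.e. the known associations appear within the first few predictions.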

# Other Scoring Functions

Some of these are inspired by the GenoMeSH ISMB poster.

- Jaccard index $\frac{ | X \cap Y |}{| X \cup Y |}$
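- cosine similarity $\cos\theta = \frac{A \cdot B}{\|A\| \|B\|}$ where A and B are vectors of term freq./inverse document freq. weights. With binary (presence/absence) vectors this is the Ochiai coefficient rather than the Jaccard index; it is the related Tanimoto index $\frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}$ that is equivalent to the Jaccard in the binary case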

- Dice's Coefficient - $\frac{2 | X \cap Y |}{| X | + | Y |}$
- Manhattan distance
- Euclidean distance
- Morisita-Horn (and others used by community ecologists - see the R function `vegdist` in the `vegan` package)
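Several of the measures listed above can be sketched in a few lines. This assumes that a profile is either a set of MeSH terms (Jaccard, Dice) or a dict mapping each term to a weight such as TF-IDF (cosine, Manhattan, Euclidean); those representations and the function names are assumptions made for illustration.

```python
from math import sqrt


def jaccard(x, y) -> float:
    x, y = set(x), set(y)
    union = x | y
    # Convention assumed here: two empty profiles have similarity 0.
    return len(x & y) / len(union) if union else 0.0


def dice(x, y) -> float:
    x, y = set(x), set(y)
    size = len(x) + len(y)
    return 2 * len(x & y) / size if size else 0.0


def cosine_similarity(a: dict, b: dict) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def manhattan(a: dict, b: dict) -> float:
    # Terms missing from a profile are treated as weight 0.
    return sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in set(a) | set(b))


def euclidean(a: dict, b: dict) -> float:
    return sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b)))


if __name__ == "__main__":
    gene = {"Neoplasms": 0.8, "Apoptosis": 0.3}        # hypothetical weights
    disease = {"Neoplasms": 0.6, "Cell Cycle": 0.4}
    print(jaccard(gene, disease), dice(gene, disease))
    print(cosine_similarity(gene, disease), manhattan(gene, disease), euclidean(gene, disease))
```

Jaccard, Dice and cosine are similarities (higher means more alike), whereas Manhattan and Euclidean are distances (lower means more alike), so they would need to be inverted or rescaled before being compared on one scale.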