Validation

Want an ROC/Area Under Curve (AUC) style analysis

To compute AUC for a large dataset:

Assume the data is in the format of n pairs of
(score, truthYN)

Sort the dataset by score

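Iterate once to get the total number of pairs with truth=Y and the number of unique scores, k

Iterate over the k unique scores, treating each one as a threshold

Want the sum, as the threshold moves through the scores, of the fraction of truth=Y pairs correctly predicted, weighted at each step by the fraction of truth=N pairs added at that step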
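The steps above can be sketched in Python as follows. This is a minimal sketch, not the script that produced the results on this page. The function name auc_from_pairs is hypothetical, truth is taken to be a boolean (True for Y), and a higher score is assumed to indicate a positive. Pairs that share a score are handled as one step, so tied scores count as half.

```python
from itertools import groupby
from operator import itemgetter


def auc_from_pairs(pairs):
    """Return the ROC AUC for an iterable of (score, truth) pairs.

    truth is True for a positive (Y) and False for a negative (N).
    Assumption: a higher score should indicate a positive.
    """
    # Sort by score, highest first, so the threshold sweeps from strict to lax.
    ordered = sorted(pairs, key=itemgetter(0), reverse=True)

    # First pass: count the lines and the positives.
    n_lines = len(ordered)
    n_pos = sum(1 for _, truth in ordered if truth)
    n_neg = n_lines - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    # Second pass: one step per unique score (the k groups).
    area = 0.0
    tp = 0  # positives seen at scores above the current one
    for _, group in groupby(ordered, key=itemgetter(0)):
        pos_here = 0
        neg_here = 0
        for _, truth in group:
            if truth:
                pos_here += 1
            else:
                neg_here += 1
        # Trapezoid: the width is the negatives added at this score, the
        # height is the mean true-positive count before and after the step.
        area += neg_here * (tp + pos_here / 2.0)
        tp += pos_here

    # Normalise the counts to rates on both axes.
    return area / (n_pos * n_neg)
```

For example, auc_from_pairs([(0.9, True), (0.8, False), (0.7, True), (0.1, False)]) returns 0.75, because three of the four (positive, negative) pairs are ranked correctly. For a dataset of tens of millions of lines the in-memory sort may be the bottleneck; sorting the file externally first and streaming it through the same two passes is one option.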

Result

Every run below reported the same counts:
Number of Lines: 40994837
Number of Positives: 69292

Score 13: Sum of the log of combined p-values, normalised by number of articles
AUC: 0.790389333205
Score 12: Sum of the differences of log p-values
AUC: 0.109667407254
Score 11: L2 distance of log-p of intersecting terms, normalised by number of articles
AUC: 0.16410224205
Score 10: L2 distance of counts of intersecting terms, normalised by number of articles
AUC: 0.57678677985
Score 9: L2 distance of log p-values
AUC: 0.121019876066
Score 8: L2 distance of p-values
AUC: 0.108249081623
Score 7: L2 distance of counts, normalised by number of articles
AUC: 0.882637542365
Score 6: L2 distance of counts
AUC: 0.103497883709
Score 5: Total number of terms (union)
AUC: 0.0990834916757
Score 4: Number of terms (intersect)
AUC: 0.144670047259
Score 3: Gene ID
AUC: 0.69288065231
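
A note on reading AUC values below 0.5. AUC equals the fraction of (positive, negative) pairs in which the positive has the higher score, with ties counted as half. A value well below 0.5 therefore means the score tends to rank positives below negatives, and reversing the direction of the score gives 1 - AUC. Whether that applies to any score above depends on the direction convention used for each run, which is not recorded on this page. The brute-force sketch below (the name pairwise_auc and the toy data are hypothetical) shows the relationship; it is only practical for small samples.

```python
def pairwise_auc(pairs):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Brute force over all such pairs, so only suitable for small samples.
    Assumption: a higher score should indicate a positive.
    """
    pos = [score for score, truth in pairs if truth]
    neg = [score for score, truth in pairs if not truth]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the results above.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
flipped = [(-score, truth) for score, truth in toy]

auc = pairwise_auc(toy)              # 0.75
auc_flipped = pairwise_auc(flipped)  # 0.25, i.e. 1 - 0.75
```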
