Research Results

Implementation Details

GeneRIF vs Gene2Pubmed

  • watson
    • 302091 GeneRIFs, 12453338 Gene2Pubmed
    • Overlap is 301690 (missing 401 GeneRIFs)
  • chickenwire wcdb
    • 189629 GeneRIFs, 3223895 Gene2Pubmed
    • 189386 Overlap (missing 243 GeneRIFs)
  • chickenwire wcdb2
    • 302092 GeneRIFs, 12453510 Gene2pubmed
    • 301690 Overlap (missing 402 GeneRIFs)

Data Archival

Entrez Utilities Results

  • When implementing , the option usehistory=y causes WebEnv and query_key info to be returned, so results of one search can be passed from one search to the next. Therefore, should be able to chain the search and fetch queries.

Entrez Gene

geneRIF-count.png
redo graph with rescaled axes
  • Results: getcount.txt each of the 1193 human TFs in Entrez Gene with GeneRIF counts and GeneRIF PMID counts
  • Total: 32 357 GeneRIFs
  • includes Interaction Databases

Entrez PubMed

  • a heading might not be a major topic, but with qualifier might be.

Combining GeneRIFs and MeSH Pubmed

(-8 lines for comments)

Entrez Gene

  • TF Genes : 1192
SELECT COUNT(*) FROM gene;
  • 32417 geneRIFs
    • wc tf-generifs.txt = 32425-8
  • 21274 distinct gene-pmid pairs
SELECT COUNT(DISTINCT gene_id, pmid) FROM generif;
  • 15741 PMIDs ref'ed by geneRIF
    • cut -f 2 tf-generifs.txt | sort | uniq | wc = 15749-8
 SELECT COUNT(DISTINCT pmid) FROM generif;

Pubmed

  • Pubmed Articles marked Brain Disease (MeSH) : 660596
    • wc disease-counts.txt = 660604-8
  • Unique Pubmed Articles marked Brain Disease (MeSH): 660538
SELECT COUNT(DISTINCT pmid) FROM pubmed_mesh, braindisease WHERE
 pubmed_mesh.term=braindisease.term;
  • MeSH annotations: 11 899 111
    • wc disease-mesh.txt
  • Unique pmid-term-qual triples: 11 898 933
SELECT COUNT(*) FROM pubmed_mesh;
  • Unique MeSH annotations: 19384
    • cut -f 3 disease-mesh.txt | sort | uniq -c | wc = 19392 - 8
SELECT COUNT(DISTINCT term) FROM pubmed_mesh;
  • Unique MeSH-Qualifier annotations: 152588
    • cut -f 3,4 disease-mesh.txt | sort | uniq -c | wc = 152596 -8
SELECT COUNT(DISTINCT term,qual) FROM pubmed_mesh;
  • Unique Brain Disease MeSH terms: 260
SELECT COUNT(DISTINCT braindisease.term) FROM pubmed_mesh, braindisease where pubmed_mesh.term=braindisease.term;
  • Unique Brain Disease MeSH-Qualifier pairs: 7041
SELECT COUNT(DISTINCT braindisease.term,qual) FROM pubmed_mesh, braindisease where pubmed_mesh.term=braindisease.term;

Intersect

  • 508 GeneRIF PMIDs with brain disease MeSH
    • sort -k 2 tf-generifs.txt | join -1 1 disease-counts-sorted.txt -2 2 - | wc = 508
  • 146 unique genes
    • sort -k 2 tf-generifs.txt | join -1 2 - -2 1 disease-counts-sorted.txt | cut -f 2 -d " " | sort | uniq | wc = 146

MeSH

  • MeSH terms total: 47143
  • MeSH Terms for Brain Disease: 312

GeneRIFs and PubMed MeSH Annotations

  • Genes to Brain Disease associations by number of citations
SELECT locus, braindisease.term, COUNT(DISTINCT generif.pmid) AS pubmed_refs FROM gene,generif, pubmed_mesh, braindisease WHERE generif.pmid=pubmed_mesh.pmid AND braindisease.term=pubmed_mesh.term AND gene.gene_id=generif.gene_id GROUP BY locus,braindisease.term ORDER BY pubmed_refs;
  • Brain Disease terms by number of genes
SELECT braindisease.term, COUNT(DISTINCT generif.gene_id) AS gene_num FROM generif, pubmed_mesh, braindisease WHERE generif.pmid=pubmed_mesh.pmid AND braindisease.term=pubmed_mesh.term GROUP BY braindisease.term ORDER BY gene_num;

Related Articles

  • Actual number of "related articles" that reference brain disease articles
SELECT COUNT(*) FROM related_articles WHERE related_pmid IN (SELECT pmid FROM pubmed);

45609560 (26 mins)

Can't do the following due to MySQL limitations

  • Min Score of top 0.95 articles
SELECT score from related_articles ORDER BY score LIMIT 43329082,1;
  • Min Score of top 0.9 articles
SELECT score from related_articles ORDER BY score LIMIT 41048604,1;
  • Articles related to GeneRIFs
SELECT generif.gene_id, COUNT(DISTINCT related_pmid) FROM generif, related_articles WHERE generif.pmid=related_articles.pmid GROUP BY generif.gene_id;
  • Brain Disease Articles related to GeneRIFs
SELECT generif.gene_id, braindisease.term, COUNT(DISTINCT related_pmid)  FROM generif, related_articles, pubmed_mesh, braindisease WHERE generif.pmid=related_articles.pmid AND related_articles.related_pmid=pubmed_mesh.pmid AND pubmed_mesh.term=braindisease.term GROUP BY generif.gene_id, braindisease.term;

Homologene

  • Gene to Mouse to PubMed - all generif pmid counts
SELECT g1.locus AS hgene, g2.locus AS mgene, COUNT(DISTINCT generif.pmid) AS pubmed_refs FROM gene AS g1, gene AS g2, homologene AS h1, homologene AS h2,generif WHERE g2.gene_id=generif.gene_id AND g1.taxon_id=9606 AND g2.taxon_id=10090 AND h1.gene_id=g1.gene_id AND h2.gene_id=g2.gene_id AND h1.homologene_id=h2.homologene_id GROUP BY g1.locus,g2.locus ORDER BY pubmed_refs;
  • Gene to Mouse to PubMed - brain disease terms
SELECT g1.locus AS hgene, g2.locus AS mgene, braindisease.term, COUNT(DISTINCT generif.pmid) AS pubmed_refs FROM gene AS g1, gene AS g2, homologene AS h1, homologene AS h2,generif, pubmed_mesh, braindisease WHERE generif.pmid=pubmed_mesh.pmid AND braindisease.term=pubmed_mesh.term AND g2.gene_id=generif.gene_id AND g1.taxon_id=9606 AND g2.taxon_id=10090 AND h1.gene_id=g1.gene_id AND h2.gene_id=g2.gene_id AND h1.homologene_id=h2.homologene_id GROUP BY g1.locus,g2.locus,braindisease.term ORDER BY pubmed_refs;
  • Extract Results
SELECT gh.hgene, gh.mgene, ght.term, ght.pubmed_refs AS braindisease_refs, gh.pubmed_refs AS generif_refs,  tc.num_pmid AS term_refs FROM generif_homologene AS gh, generif_homologene_term AS ght, term_citations AS tc WHERE gh.hgene=ght.hgene AND gh.mgene=ght.mgene AND tc.term=ght.term;

Disease Vocabularies

  • all included in UMLS Metathesaurus

MeSH Tree Structures

International Statistical Classification of Diseases and Related Health Problems (ICD)

  • ICD-9-CM: Official variant for US Hospitals
  • ICD-10-CA: All provinces (except maybe Quebec ETA Jul 2006) have (officially) adopted ICD-10-CA

Systematized Nomenclature of Medicine Clinical Terms(SNOMED CT)

  • "emerging" International Standard?
    • formation of a non-profit group (in Denmark) that will take over SNOMED - Snomed Standards Dev. Org.
    • supposed to be formed by end of last year
    • Nine countries are proposed charter members, including Canada
  • considered (?) for Canada Health Infoway - possible canada-wide license?
    • French version not yet completed
  • Licensed by NLM for UMLS Metathesaurus
  • maps to ICD-9-CM, ICD-10 (possibly incomplete)
  • non-U.S. use is limited to Category 3 (among other things, research)
    • From the Q2 of the SNOMED CT FAQ: "Section 12.3 (Category 3) continues to apply to use of SNOMED by non-U.S. UMLS users."
  • Avoid direct use to avoid contravention of terms of agreement potential problems with:
    • publishing papers
    • making data publically available/accessible
    • unforseen landmines/hurdles

Analysis

  • Using Related Links
    • compare "depth" of related links
    • relevancy scores — how to use these (only get scores for related links)

Statistics Results

Gene to Pubmed Table

  • Attempt to speed up related_articles generation by only tracking gene-pubmed relations
  • Then attempt map to mesh terms, then map to mesh-child
CREATE TABLE IF NOT EXISTS gene_pubmed
(
gene_id int,
pmid int,
degree int,
PRIMARY KEY(gene_id, pmid)
);
ALTER TABLE gene_pubmed ADD INDEX (degree);
INSERT IGNORE INTO gene_pubmed SELECT gene_id, pmid, 1 AS degree FROM generif;

And then for each degree

CREATE TABLE gene_pubmed_tmp AS SELECT gene_id, related_pmid AS pmid, 2 AS degree FROM gene_pubmed, related_articles WHERE gene_pubmed.degree=1 AND gene_pubmed.pmid=related_articles.pmid;
INSERT IGNORE INTO gene_pubmed SELECT * FROM gene_pubmed_tmp;

CANNOT select and insert on the same query (will lose results)

Results

related_articles table

PMID: 288434
Related PMID: 7605242

gene_pubmed

Degree Records Notes
0 660 538
1 189 628 (1.88s)
2 2 446 180 (23.15 sec)
3 97 566 040 (1 hour 5 min 48.05 sec) Records: 268445073 Duplicates: 170879033 Warnings: 0
4 1 710 959 317 Query OK, -109447855 rows affected (12 hours 2 min 49.95 sec) 8 480 486 737

Then to get brain disease MeSH terms with children, do

SELECT locus, bdc.term, COUNT(DISTINCT gene_pubmed.pmid) AS num_pmid FROM gene_pubmed, pubmed_mesh, braindisease_child AS bdc, gene WHERE gene_pubmed.pmid=pubmed_mesh.pmid AND pubmed_mesh.term=bdc.child AND degree=1 AND gene.gene_id=gene_pubmed.gene_id GROUP BY locus, bdc.term ORDER BY locus;

To get all pmids referenced by a gene

 SELECT locus,COUNT(DISTINCT pmid) FROM gene_pubmed, gene WHERE degree=1 AND gene.gene_id=gene_pubmed.gene_id GROUP BY locus;

Get all pmids referenced by a term

SELECT bdc.term, COUNT(DISTINCT pmid) FROM pubmed_mesh, braindisease_child AS bdc WHERE bdc.child=pubmed_mesh.term GROUP BY bdc.term;
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-Share Alike 2.5 License.