Implementation Details

MySQL LOAD DATA LOCAL INFILE

Default binaries are built to allow this command, but if mysql was built from source, the client may disable this by default.

Enable by specifying "—local-infile=1" on the command line

Security issues:
http://dev.mysql.com/doc/refman/5.0/en/load-data-local.html

Chemical Terms

SELECT DISTINCT gene_id, pmid
FROM pubmed_chem, warrendb.mesh_refseq, warrendb.gene2refseq
WHERE (mesh_refseq.refseq=gene2refseq.rna OR mesh_refseq.refseq=gene2refseq.protein) AND 
pubmed_chem.chem=mesh_refseq.mesh;

Gene-MeSH and Disease-MeSH profiles

SELECT gene.gene_id, locus, dc.term, COUNT(DISTINCT generif.pmid) AS pubmed_refs 
FROM gene,generif, pubmed_mesh, mesh_child AS dc 
WHERE gene.gene_id=2 AND pubmed_mesh.term=dc.child AND generif.pmid=pubmed_mesh.pmid AND gene.gene_id=generif.gene_id 
GROUP BY gene.gene_id, dc.term;
SELECT mc1.term, mc2.term, COUNT(DISTINCT pm1.pmid) AS pubmed_refs
FROM pubmed_mesh AS pm1, pubmed_mesh AS pm2, mesh_child as mc1, mesh_child as mc2
WHERE pm1.pmid=pm2.pmid AND mc1.term=\"Alzheimer Disease\" AND
mc1.child=pm1.term AND mc2.child=pm2.term
GROUP BY mc1.term, mc2.term;"

UMLS Co-occurrence loading

  • file: MRCOC.RRF
  • Loading:
LOAD DATA LOCAL INFILE 'MRCOC.RRF' 
INTO TABLE umls_coc 
FIELDS TERMINATED BY '|' 
(cui1, @dummy, cui2, @dummy, Source_ID, type, refs);

Entrez EUtils Data Parsing in Unicon

TODO

  • Investigate duplicate lines in disease-counts.txt, disease-mesh.txt
  • Update PubMed miniDB to conform with Minimal Data Element Set
  • Is NLMUniqueID = MEDLINE ID?
  • Investigate GeneSymbol in PubMed
    • GeneSymbol is deprecated (was in place for a couple years but no longer supported)
  • Preliminary Testbed is programmed in Unicon (dialect of Icon).
  • Emacs mode (by Robert Parlett): unicon.el
  • Successfully parsed Entrez Gene
  • PubMed parsing
    • handling non-ascii characters (See XML for PMID 17359734)
      • ignoring things that are not: letters, digit and a subset of punctuation
    • handling interruption of long downloads
      • interrupted around the 70000/80000 record mark
      • might have to write a wrapper for the reading code to extract "bite-sized chunks" by restricting to X entries at a time
      • was trying wget via pipe, will now try direct http
      • http also failed
      • rewrote to only grab 20,000 articles at a time
      • current speed: (6052-5230)/42.83 = 19.19 articles/sec
      • ETA for 660184 articles = 9.5 hours

mesh_child

CREATE TABLE mesh_child AS 
SELECT DISTINCT major.term, child.term AS child 
FROM mesh_tree AS major, mesh_tree AS child 
WHERE major.tree_num=child.tree_num OR child.tree_num LIKE CONCAT(major.tree_num,'.%');
ALTER TABLE mesh_child ADD PRIMARY KEY(term, child);
ALTER TABLE mesh_child ADD INDEX(child);

Local Resources

Database

  • sonoma: MySQL Database version: 5.0.27-standard
SELECT VERSION();
  • Approximately 470 MB of table data, 350 MB index data ( Summer 2007 )
SELECT SUM(Data_length),SUM(Index_length) from information_schema.tables;

Cluster

  • watson: Rocks cluster running SunGridEngine

Disk Usage

  • work files currently on laptop and watson
  • Pubmed files: on sonoma:/space/data/pubmed

UMLS

  • currently loaded using knime workflow
    • an objective second look shows that it could probably be reimplemented as a shell script
  • only extracting mesh and mim entries
CREATE TABLE mesh_mim AS SELECT DISTINCT umls1.Source_Text AS term, umls2.Source_ID AS mim_id FROM umls AS umls1, umls AS umls2 WHERE umls1.Source_Name='MSH' AND umls2.Source_Name='OMIM' AND umls2.Source_ID_type='PT' AND umls1.Concept_ID=umls2.Concept_ID;

Loading Significant Portions of Pubmed

  • Download all of pubmed
nohup nice wget ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/* > pubmed_baseline.log &
  • use XSL transformation to convert to delimited file format, then load into database

MySQL Loading

  • Is picky about line endings (do dos2unix/etc. as needed beforehand)
  • Skipped usually means attempts to insert the same key twice

gene

LOAD DATA LOCAL INFILE 'gene_info' INTO TABLE gene IGNORE 1 lines (taxon_id, gene_id, locus);

pubmed

LOAD DATA LOCAL INFILE 'disease-counts.txt' INTO TABLE pubmed IGNORE 8 lines;

Query OK, 660538 rows affected (7.72 sec)
Records: 660596 Deleted: 0 Skipped: 58 Warnings: 0

generifs

LOAD DATA LOCAL INFILE 'parsed_basic_rif.txt' INTO TABLE generif (gene_id, pmid, description);

Query OK, 189628 rows affected, 65535 warnings (10.72 sec)
Records: 189933 Deleted: 0 Skipped: 305 Warnings: 189933

pubmed_mesh

LOAD DATA LOCAL INFILE 'disease-mesh.txt' INTO TABLE pubmed_mesh 
IGNORE 8 LINES (pmid, major, term, qual);

Query OK, 11898933 rows affected (1 hour 4 min 58.60 sec)
Records: 11899111 Deleted: 0 Skipped: 178 Warnings: 0

mesh

LOAD DATA LOCAL INFILE 'mtrees2007-03-27.bin' INTO TABLE mesh
FIELDS TERMINATED BY ';';

Query OK, 47143 rows affected (1.93 sec)
Records: 47143 Deleted: 0 Skipped: 0 Warnings: 0

  • There are duplicate entries (same term in multiple positions of the tree)
CREATE VIEW braindisease AS SELECT * FROM mesh WHERE tree_num LIKE 'C10.228.140%';

related articles

LOAD DATA LOCAL INFILE 'braindisease-related.txt' INTO TABLE related_articles
IGNORE 8 LINES (pmid, related_pmid, score)

Query OK, 73761430 rows affected (24 min 39.68 sec)
Records: 73761430 Deleted: 0 Skipped: 0 Warnings: 0

gene_go

LOAD DATA LOCAL INFILE 'gene2go' INTO TABLE gene_go (@dummy, gene_id, go_id);
 
CREATE VIEW tf_gene AS SELECT gene.gene_id, locus, taxon_id FROM gene, gene_go WHERE gene.gene_id=gene_go.gene_id AND (
go_id='GO:0017163' OR
go_id='GO:0003701' OR
go_id='GO:0003702' OR
go_id='GO:0003709' OR
go_id='GO:0030401' OR
go_id='GO:0003712' OR
go_id='GO:0003700' OR
go_id='GO:0016986' OR
go_id='GO:0016988' OR
go_id='GO:0003715' OR
go_id='GO:0016563' OR
go_id='GO:0003711' OR
go_id='GO:0016564' OR
go_id='GO:0000156' OR
go_id='GO:0030528');

Using GO Term "transcriptional regulator activity" and children

homologene

LOAD DATA LOCAL INFILE 'homologene.data' INTO TABLE homologene (homologene_id, @dummy, gene_id)

205669 records loaded.

OLD Methods

Gene

LOAD DATA LOCAL INFILE 'tf-counts.txt' INTO TABLE gene 
IGNORE 8 LINES (gene_id, locus, @dummy, @dummy);

Result: Query OK, 1192 rows affected (0.02 sec)
Records: 1192 Deleted: 0 Skipped: 0 Warnings: 0

generif

LOAD DATA LOCAL INFILE 'tf-generifs.txt' INTO TABLE generif 
IGNORE 8 LINES (gene_id, pmid, heading, description);

Query OK, 21274 rows affected (0.36 sec)
Records: 32417 Deleted: 0 Skipped: 11143 Warnings: 0underline text

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-Share Alike 2.5 License.