Welcome to DNAHelix.org, the wiki publishing platform for Warren Cheung. My blog will continue to be available at Twinram.com.
Warren Cheung is a PhD student at the University of British Columbia, in the Bioinformatics Program. I am co-supervised by Wyeth Wasserman (at the CMMT, the Centre for Molecular Medicine and Therapeutics) and Francis Ouellette (now at the OICR, the Ontario Institute for Cancer Research).
For now, this area is primarily a repository for my research and academic materials, past and present. In time, this might be extended to all sorts of other uses…we'll have to see how the experiment fares.
DNAhelix Blog
Gene Properties to Disease
2009-04-15T22:52:21Z
What I want to do now is go from an Entrez gene_id to the relevant “sequence features”: the size of the coding region, length of the gene, chromosome, RNA length, and so on. It doesn’t look like the info is archived locally, so the possible avenues are BioMart or NCBI web services to convert the gene_ids to features. Then it’ll be a matter of merging gene_id/term/ValidYN with the result.
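The merge step could look something like the minimal sketch below. It assumes two tab-separated tables that are not specified anywhere above: one with gene_id, term and ValidYN columns, and one with per-gene features keyed on gene_id. The feature column names (chromosome, gene_length) and the toy rows are placeholders for whatever BioMart or NCBI actually returns.

```python
import csv
import io

def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def merge_features(pred_rows, feature_rows, key="gene_id"):
    """Left-join prediction rows with per-gene feature rows on `key`.

    Genes without a feature record keep their row, with the feature
    columns left empty, so nothing is silently dropped.
    """
    features = {row[key]: row for row in feature_rows}
    # Collect feature column names in first-seen order.
    feature_cols = []
    for row in feature_rows:
        for col in row:
            if col != key and col not in feature_cols:
                feature_cols.append(col)
    merged = []
    for pred in pred_rows:
        extra = features.get(pred[key], {})
        out = dict(pred)
        for col in feature_cols:
            out[col] = extra.get(col, "")
        merged.append(out)
    return merged

# Toy data, made up for illustration only.
predictions = read_tsv("gene_id\tterm\tValidYN\n1\tTermA\tY\n2\tTermB\tN\n")
features = read_tsv("gene_id\tchromosome\tgene_length\n1\t19\t8000\n")
merged = merge_features(predictions, features)
```

A left join keeps every gene_id/term/ValidYN row even when the feature lookup fails, which makes missing features easy to spot afterwards.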

Work Report
2009-02-24T00:10:41Z
Bugfix – propagating the makefile generation code. Makefiles are no longer generated separately from the files that use them.
BUG TO FIX – AUC computation doesn’t stop on error. Perhaps we should make this a submake, so we can compute all the scores simultaneously?
Computing – gene-genevalues for old
TODO – biopython MDMR compute. Use MDMR data in other Machine Learning techniques. Use feature reduction with other ML tech.

WIP
2009-02-20T22:30:31Z
MDMR – Multivariate distance matrix regression (Nicholas Schork). Explains a gene distance matrix using parameters (e.g. micro-array chip, or for us, MeSH annotation=disease). BioPython function exists.
http://www.pnas.org/content/103/51/19430.abstract?ck=nck
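For reference, the pseudo-F statistic at the core of MDMR, in its usual formulation. The symbols are my own labels rather than anything defined above: $d_{ij}$ are the entries of the $n \times n$ distance matrix, and $X$ is the $n \times m$ design matrix of predictors, including the intercept column.

```latex
A = \left(-\tfrac{1}{2}\, d_{ij}^{2}\right), \qquad
G = \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right) A
    \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right), \qquad
H = X \left(X^{\top} X\right)^{-1} X^{\top}

F = \frac{\operatorname{tr}(H G H) \,/\, (m - 1)}
         {\operatorname{tr}\!\big((I - H)\, G\, (I - H)\big) \,/\, (n - m)}
```

Significance is normally assessed by permutation (shuffling the rows and columns of $G$ together) rather than against a standard F distribution.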
Profile comparison used to make the distance matrix? Also possible – look for PubMed co-citation (an independent distance matrix, not involving MeSH?)
Regenerating generif.
BG-profiles computed, appear to be less effective. Some issues computing digenei (old) using BG
Figured out why the Makefile was choking: the generated .mk files weren’t being found by subsequent steps even though they are requirements. Instead, have them generated in situ.

Progress Report
2009-02-10T18:41:37Z
Reorganise
2009-02-08T21:33:25Z
Time to clean up the scientific notebook (this blog, in other words). Categories vs. tags make for several organisational methods.
Categories should definitely separate “blog” (organisational notes like this one) from research. The question is whether to separate by topic (disease-gene, motif finding, etc.) and use tags for whether this is phd-thesis or side project, in addition to tagging topics like directories, wip, meeting minutes, etc. A category “Wasserman Lab” for group meetings, other group members’ research and such. “Events” will cover conferences and other research/science related events I attend. Should think about this some more – there are 80 posts in this blog to (possibly) retag if I do this!
Also related would be to add BCRMTA, cisreg.ca, sourceforge, github links for the relevant projects to the wikisite. Also, better documentation for all the code…

Weekend Work
2009-02-08T10:45:07Z
GeneBG seems to be OK, trying to get the diseaseBG off the ground. Need to investigate why PMID 9753684 is flagged with MeSH term “Chromosome Aberration” and AGAMOUS Protein, Arabidopsis, but isn’t being picked up by the mesh-disease…
Ah, I see the problem: it’s a matter of odd intersections in the tree. Mosaicism is a child of Chromosome Aberration, but only under the Mutation branch, not under the Disease branch. Interesting thing I never knew about… Therefore, to catch this case we’ll have to parse mesh-parent rather than the original mesh files.
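The intersection problem can be reproduced in a few lines. This is only a sketch: it assumes terms carry dotted tree numbers with disease positions starting with "C", and the tree numbers below are invented placeholders, not real MeSH values. It contrasts a name-only parent/child walk with a walk restricted to one branch of the tree.

```python
def descendants_by_name(root, children):
    """Walk a name-only parent -> children map (branch information is lost)."""
    seen, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def descendants_in_branch(root, tree_numbers, branch_prefix):
    """Terms positioned below one of `root`'s tree numbers inside one branch."""
    roots = [tn for tn in tree_numbers[root] if tn.startswith(branch_prefix)]
    return {
        term
        for term, positions in tree_numbers.items()
        if term != root
        and any(tn.startswith(r + ".") for tn in positions for r in roots)
    }

# Placeholder tree numbers, NOT real MeSH values: the parent sits in both a
# disease ("C") and a non-disease ("G") branch, the child only in the latter.
tree_numbers = {
    "Chromosome Aberration": ["C99.100", "G99.200"],
    "Mosaicism": ["G99.200.300"],
}
children = {"Chromosome Aberration": ["Mosaicism"]}

by_name = descendants_by_name("Chromosome Aberration", children)
in_disease = descendants_in_branch("Chromosome Aberration", tree_numbers, "C")
```

With this toy data the name-only walk reaches Mosaicism, while the disease-branch walk returns nothing. That is the same discrepancy as between parent relationships and the branch-specific tree.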

New Results
2009-02-06T21:49:48Z
Seems pretty similar to the cmp-digenei result. Should note that CTD validation and training validation are unchanged, since those only depend on the predictions.
Mouse direct connections are done. Write a simple cmd-line tool to grep results? What other organisms would be useful use cases – yeast perhaps?
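A first pass at such a tool could be as small as the sketch below. It assumes the results are tab-separated with one record per line; the column layout and the argument names are my own guesses, not the actual file format.

```python
import argparse

def lookup(lines, query, column=None):
    """Yield the fields of tab-separated rows where `query` matches exactly.

    column=None searches every field; otherwise only that 0-based column.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        targets = fields if column is None else fields[column:column + 1]
        if query in targets:
            yield fields

def main(argv=None):
    parser = argparse.ArgumentParser(description="Look up rows in a results file")
    parser.add_argument("results", help="tab-separated results file")
    parser.add_argument("query", help="gene or disease identifier to match")
    parser.add_argument("--column", type=int, default=None,
                        help="0-based column to search (default: all columns)")
    args = parser.parse_args(argv)
    with open(args.results) as handle:
        for fields in lookup(handle, args.query, args.column):
            print("\t".join(fields))
```

Calling `main()` from a script entry point would turn this into a command-line tool. Unlike a plain grep, the exact field match avoids hits on identifiers that merely contain the query as a substring.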
I really need a web interface – somewhere people can look up via gene or disease, or upload a list and get a bunch back.
Still need to merge the gene/disease backgrounds into p-values. Likely this will be two more files – another prefix perhaps to indicate the background – maybe something like:
hum-gene2pubmed-gene-mesh-p.txt -> gene2pubmedBG-gene2pubmed-gene-mesh-p.txt
disease-comesh-p.txt -> diseaseBG-disease-comesh-p.txt
In these cases, the filtering (hum or disease) needs to be done first, as computing p-values for the superset makes no sense (done)
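To see why the order matters, here is a toy sketch. Nothing above says which test produces the p-values, so this assumes a hypergeometric tail over co-occurrence counts purely for illustration; the function name and all of the counts are made up.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Same overlap (5 shared references between sets of 50 and 40),
# scored against two different background sizes (made-up numbers).
p_filtered = hypergeom_sf(5, 1000, 50, 40)    # background = filtered set only
p_superset = hypergeom_sf(5, 100000, 50, 40)  # background = everything
```

Against the larger superset background the same overlap looks far more surprising, so p-values computed before filtering would be systematically too small for the filtered question.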
OK, the names are starting to get a bit messy, with hum-gene2pubmed-gene-mesh.txt vs hum-gene-gene2pubmed-mesh-refs.txt
Need better documentation for filter-file.py – the --field parameter was used incorrectly, resulting in errors in the last makefile (blech!)

Update
2009-02-05T23:26:39Z
Currently In Progress
2009-02-05T00:04:08Z
-mus TAXON_NAME test run in digenei1/ (new TAXON_NAME code)
-hum first compute in digenei3/
TO RUN – cmp-digenei2 once digenei3 is complete
TO FIX – reference to taxon_id=9606 (e.g. in direct_gd_predict.mk)
TO REVISE – maybe use comesh counts (comesh for disease) to get the background stats (save computation)

Mouse Organism Prediction
2009-02-03T20:45:46Z
Change organism references from hum to TAXON_FILTER. Hard-code the extraction of mouse genes, but everything else should follow pretty straightforwardly from that.
Extract the organism-specific filtering from the general computation (get_pval.mk) in direct-predict to the main Makefile, to make switching between organisms possible via simply changing the TAXON_FILTER.








