Welcome to DNAHelix.org, the wiki publishing platform for Warren Cheung. My blog will continue to be available at Twinram.com.
Warren Cheung is a PhD student at the University of British Columbia, in the Bioinformatics Program. I am co-supervised by Wyeth Wasserman (at the CMMT - the Centre for Molecular Medicine and Therapeutics), as well as Francis Ouellette (now at the OICR - the Ontario Institute for Cancer Research).
For now, this area is primarily a repository for my research and academic materials, past and present. In time, this might be extended to all sorts of other uses…we'll have to see how the experiment fares.
DNAhelix Blog
Figuring things out
2010-01-21T04:13:58Z
In addition to whacking some ideas around for the title, I’m thinking about a better (set?) of initial figures. Some kind of “cloud of articles -> ZAP -> profile of keywords” effect. Current mockups are kinda clunky…especially when I have to redo it as “cloud of articles with same MeSH term -> ZAP -> profile”. And then there’s the secondary relationships.
I’ve got a less detailed version, with just “stack articles -> profile -> additional related terms”… maybe take that even more abstract, and then have more detailed figures for explanations?
The similarity thing is also tricky – how to diagram the “similarity of profiles implies that upper level terms are related?” … maybe have two tiers? Circular topics linked by the profile boxes?
The name game
2010-01-15T08:43:19Z
Still slogging a bit through the intro. Uncertain which experiments to perform yet – I guess the first round will anyways focus on reorganising (since I think the databases are still a bit borkened). Then again, the web interfaces should still be okay…and making an interactive overrepresentation tester might be even better…
Anyways, thinking about what to name the method. MeSH overrepresentation is a bit the easy one, but maybe too easy. Words to try: Bibliographic, MeSH, Keyword, Overrepresentation, Summary
BiSOM, BibSum == BS
BibOS, BibMOS
MeKO
MS, MOS, OMS
oMeSH, overMeSH, OM
Hmm….not really any standouts there.
VanBUG Day
2010-01-14T07:02:19Z
Paper progress – started hacking at the new intro. Looks like related work will be pretty sparse – maybe pull in CAESAR as “in the style of but completely different”, just like GO overrepresentation as being “kinda the same but we have way more detail”. Maybe call it MeSH Overrepresentation?
VanBUG had a pair of excellent speakers today.
Parisa Shooshtari presented a variation of spectral clustering (spectral == graph cut) for a very large number of points (when polynomial just doesn’t cut it). Idea is that uniform random sampling won’t sample very often from sparsely covered areas – and perhaps you don’t want to lose the really sparse areas. The fix? Faithful sampling, where you grab points at random, but when you take a point, you also take all points “nearby” (within some fixed distance) – this forms a “community” represented by the point. Then you make a graph of the representatives – you lose some info, but since you can save the community of each point, you can make the edges in the graph be some kind of “average weight” of all the points in the community. Might be fancier if you could somehow do some kind of “sub-clustering” to form the communities – some kind of “fractal clustering”.
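A rough sketch of the faithful sampling step as I understood it, not the speaker's actual implementation. It assumes sampling continues until every point belongs to some community, uses plain Euclidean distance, and the names `faithful_sample` and `community_weight` are made up for illustration.

```python
import math
import random


def faithful_sample(points, radius, rng=None):
    """Pick random representatives; each one absorbs all unassigned points
    within `radius`. Returns {representative_index: [member indices]}.

    Assumption: we keep sampling until every point is in a community.
    """
    rng = rng or random.Random()
    unassigned = set(range(len(points)))
    communities = {}
    while unassigned:
        rep = rng.choice(sorted(unassigned))
        members = sorted(
            i for i in unassigned
            if math.dist(points[rep], points[i]) <= radius
        )
        communities[rep] = members
        unassigned -= set(members)
    return communities


def community_weight(points, members_a, members_b, similarity):
    """One possible "average weight" edge: the mean pairwise similarity
    between the members of two communities."""
    total = sum(
        similarity(points[i], points[j])
        for i in members_a
        for j in members_b
    )
    return total / (len(members_a) * len(members_b))
```

A sparse region still gets a representative because an isolated point can only ever be picked as its own community, which is the property uniform sampling lacks. The graph over representatives would then be fed to whatever spectral clustering routine is in use.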
Evan Eichler from UWash talked a lot about copy number variations – his def was really geared towards variation > 1kb. Two main types: Large, rare, bumps lots of genes (but can’t stay all that long in the population?). vs. Multi-copy common susceptibility/risk factor.
Usually CNV are thought to be the product of highly repetitive/similar/duplicated sequence + recombination error (and for some reason humans/ape family have a lot more of this than other species)
Ideal technology will get copy #, content (sequence) and structure (part of a gene/promoter/etc)
Current tech maps sequence to genome looking for things like unusual distance between paired end reads (sequencing/ArrayCGH/microarray) – but different technologies find different variation == no tech can cover it all yet (or lots of error?)
mrFAST – BLAST style searching of genomes (to map the variation)
Variation Hunter – uses mrFAST to map, then set cover to minimise …
Nexgen sequences – different bias, as no biased to shorter cnvs/more difficulty mapping (since there’s more sequence to map) (more false positives? more potential matches?)
Can use read depth to get copy number (e.g. shotgun sequencing reads, mapped to our ref. genome). More reads (corrected for paralogues, homologs etc) == more CNV.
Use multiple seq align on copies to find unique seq, which can then be used to search for the presence of these CNVs.
All in all, really liked this talk. Went longer than usual VanBUG, and speaker was really quiet/mike was busted, but even so didn’t have any trouble following.
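The read-depth idea from the talk can be sketched with some simple arithmetic. This is a toy version under loud assumptions: depths are already corrected per window, the median window is taken to represent the normal diploid state (2 copies), and `estimate_copy_number` is a hypothetical name, not anything from mrFAST or Variation Hunter.

```python
from statistics import median


def estimate_copy_number(window_depths, baseline_copies=2):
    """Scale each window's read depth against the median depth.

    Assumption: the median window reflects `baseline_copies` copies,
    so a window with twice the median depth has twice the copies.
    """
    baseline_depth = median(window_depths)
    if baseline_depth == 0:
        raise ValueError("median depth is zero; cannot normalise")
    return [baseline_copies * depth / baseline_depth for depth in window_depths]


# Windows at 30x are the baseline; 60x suggests a duplication, 15x a deletion.
print(estimate_copy_number([30, 30, 60, 15, 30]))
```

Real pipelines would also need the paralogue/homolog correction mentioned in the notes before the depths mean anything.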
Meeting with the Supervisors
2010-01-13T00:32:34Z
New Plan: Refocus paper more on gene properties, with application to gene-disease as an example of things that can be done – I like the new tack, it feels both more in tune with the later results, and also a more novel subject area than the tired association track.
Need to find some related work. Can probably grab some of the GO stuff, but anything else? Maybe some text mining major topics or such…
Deadline: Fri/Sun for first rewrite, Next Fri/Sun for second edit pass
Aiming to get this out the door at the end of the month (and looks like it’s going to happen!)
Also – need to finish the reformat into submission conformance.
Back in the Saddle
2010-01-11T23:50:51Z
Working on: Top Genes with the most Refs in 2007 over time.
PC saved in
cs/Research/top-refs
On the server:
python filter_file.py digenei1/txt/direct_gene_disease/hum-gene.txt digenei1/txt/gene/all-gene2pubmed-gene-refs.txt | sort -n -t "|" -k 2 -r | less
TODO: work on GCI correlating to # of refs, doing correlation stats (date to # refs), UCE stats
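For the "date to # refs" correlation in the TODO, a plain Pearson coefficient is probably the first thing to try. A minimal sketch, assuming the data ends up as paired lists of years and reference counts; the function name `pearson` and the input layout are my own placeholders, not the format of the files above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-year reference counts for one gene.
years = [2005, 2006, 2007]
refs = [10, 20, 30]
print(pearson(years, refs))
```

If reference counts grow non-linearly over time, a rank-based measure may be a better fit than this linear one.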
Gene Properties to Disease
2009-04-15T22:52:21Z
What I want to do now is go from entrez gene_id and grab relevant “sequence features” – the size of the coding region, length of the gene, chromosome, RNA length, and so on. Doesn’t look like the info is archived locally, so possible avenues to whack are using BioMart or NCBI via web services to convert the gene_ids to features. Then it’ll be a matter of doing some strange merge of gene_id/term/ValidYN with the result.
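The merge step might look something like the sketch below. It deliberately leaves out the BioMart/NCBI lookup, since that part is not settled; it assumes the features have already been fetched into a dict keyed by gene_id. All field names (`cds_length`, `chromosome`, and so on) are hypothetical.

```python
def merge_features(annotations, features):
    """Join (gene_id, term, valid) rows with per-gene sequence features.

    `annotations`: iterable of (gene_id, term, valid) tuples.
    `features`: dict mapping gene_id -> dict of feature values.
    Rows whose gene_id has no features are kept apart so they can be
    inspected rather than silently dropped.
    """
    merged, missing = [], []
    for gene_id, term, valid in annotations:
        gene_features = features.get(gene_id)
        if gene_features is None:
            missing.append(gene_id)
            continue
        row = {"gene_id": gene_id, "term": term, "valid": valid}
        row.update(gene_features)
        merged.append(row)
    return merged, missing


# Hypothetical example data.
annotations = [(672, "Breast Neoplasms", "Y"), (9999999, "Some Term", "N")]
features = {672: {"chromosome": "17", "cds_length": 5592}}
rows, unmatched = merge_features(annotations, features)
print(rows, unmatched)
```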
Work Report
2009-02-24T00:10:41Z
Bugfix – propagating the makefile generation code. Makefiles are no longer generated separate from the files that use them.
BUG TOFIX – auc computation doesn’t stop on error. Perhaps we should make this a submake, so we can compute all the scores simultaneously?
Computing – gene-genevalues for old
TODO – biopython MDMR compute. Use MDMR data in other Machine Learning techniques. Use feature reduction with other ML tech.
WIP
2009-02-20T22:30:31Z
MDMR – Multivariate distance matrix regression (Nicholas Schork). Explains a gene distance matrix using parameters (e.g. micro-array chip, or for us, MeSH annotation = disease). A BioPython function exists.
http://www.pnas.org/content/103/51/19430.abstract?ck=nck
Profile comparison used to make the distance matrix? Also possible – look for pubmed co-citation (an independent distance matrix, not involving MeSH?)
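To get a feel for what MDMR computes, here is a pure-Python sketch of the pseudo-F statistic for the simplest case: one predictor plus an intercept. This is my own reduction, not the BioPython routine or the paper's code. It assumes the standard recipe of Gower-centering the squared distances and comparing the variation explained by the predictor against the residual, with 1 and n - 2 degrees of freedom. Significance would normally come from permutations, which are not included here.

```python
def gower_center(dist):
    """Gower-centre A = -0.5 * D^2 so that rows and columns sum to zero."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    col_mean = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [
        [a[i][j] - row_mean[i] - col_mean[j] + grand for j in range(n)]
        for i in range(n)
    ]


def mdmr_pseudo_f(dist, x):
    """Pseudo-F for a single predictor `x` explaining distance matrix `dist`.

    With G centred, the intercept drops out and tr(HGH) reduces to
    xc' G xc / (xc' xc), where xc is the mean-centred predictor.
    """
    n = len(x)
    g = gower_center(dist)
    mean_x = sum(x) / n
    xc = [v - mean_x for v in x]
    sxx = sum(v * v for v in xc)
    ss_model = sum(
        xc[i] * g[i][j] * xc[j] for i in range(n) for j in range(n)
    ) / sxx
    ss_total = sum(g[i][i] for i in range(n))
    ss_resid = ss_total - ss_model
    return ss_model / (ss_resid / (n - 2))


# Toy check: 1-D points with Euclidean distances, predictor = group label.
y = [0.0, 1.0, 2.0, 3.0]
dist = [[abs(a - b) for b in y] for a in y]
print(mdmr_pseudo_f(dist, [0, 0, 1, 1]))
```

On Euclidean distances from a single variable this reduces to the ordinary regression F statistic, which makes it a handy sanity check before plugging in profile-based or co-citation distances.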
Regenerating generif.
BG-profiles computed, appear to be less effective. Some issues computing digenei (old) using BG
Figured out Makefile choking on generating .mk files which aren’t found by subsequent steps even though they are requirements – instead have them generated in situ.
Progress Report
2009-02-10T18:41:37Z
Mouse general background compute completed. Disease/GeneBG compute in progress.
cmp-digenei2 for mouse being run
TODO: adapt cmp-digenei2 for BG computes, grep mus results
Reorganise
2009-02-08T21:33:25Z
Time to clean up the scientific notebook (this blog, in other words). Categories vs. tags make for several organisational methods.
Categories should definitely separate “blog” (organisational notes like this one) from research. The question is whether to separate via topic – disease-gene, motif finding, etc. – and use tags for whether this is phd-thesis or side project, in addition to tagging topics like directories, wip, meeting minutes, etc. A category “Wasserman Lab” for group meetings, other group members’ research and such. “Events” will cover conferences and other research/science related events I attend. Should think about this some more – there are 80 posts in this blog to (possibly) retag if I do this!
Also related would be to add BCRMTA, cisreg.ca, sourceforge, github links for the relevant projects to the wikisite. Also, better documentation for all the code…






