Locality Sensitive Hashing For Motif Finding

Optimising Locality-Sensitive Hashing on Sequences in the Context of Motif Finding

Abstract

Locality-sensitive hashing has been applied in several problems in
bioinformatics to quickly search large sequences by examining the
n-grams of the sequences. We observe in these application a bias in
the random generation of hashes and define an effective equivalence
for hashes that generate nearly identical results. Methods that
address these two points demonstrate an improvement to the rate at
which unique hashes are generated, especially when many hashes are
generated. These methods can be easily applied integrated with
existing applications of locality-sensitive hashing, resulting in
improved performance by reducing the time algorithms spend revisiting
duplicate hashes as well as eliminating biases potentially hindering
the solving of certain problem instances.

Paper

Submitted to CSCBCE 2010

Source Code

  • Python Implementation of Hashing algorithms with testing function
  • Raw Data used to generate figures
  • Python script to summarise data and summarised raw data
  • DOS batch script for generating and summarising
  • Excel Spreadsheet for generating the figures, statistical tests and timing results

test-nr-lsh.zip (642K)

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-Share Alike 2.5 License.