OSEC

Neohapsis is currently accepting applications for employment. For more information, please visit our website www.neohapsis.com or email hr@neohapsis.com
 
Re: [Dailydave] approximate string matching

From: Martin Roesch (roeschsourcefire.com)
Date: Fri Sep 01 2006 - 09:06:15 CDT


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Have you taken a look at Bloom Filters? I'm not sure what your exact
requirements are but Bloom Filters are fast and do probabilistic
pattern matching. Check out:

http://en.wikipedia.org/wiki/Bloom_filter
http://www.arl.wustl.edu/~sarang/analysis.pdf

They're easy to implement and fast, so maybe that will work for you.
Hope that helps!

      -Marty

On Sep 1, 2006, at 6:52 AM, Mateusz Berezecki wrote:

> Hello list readers,
>
> Stumbling upon some filtering problem I realised exact string matching
> is not the way to go. Thinking a bit I realised doing an approximate
> string matching is a good solution.
>
> What is approximate string matching? It's the way to allow pattern
> matching allowing errors, such as typos, character deletions.
> It's often called a fuzzy string matching or string matching allowing
> errors. (please visit
> http://www.nist.gov/dads/HTML/stringMatchwError.html for more
> information)
>
> The heart of a good fuzzy string matching algorithm is a
> filtering function which unlike in exact string matching doesn't
> tell if
> the strings are exact (match / no match answer) but rather tells us
> if the strings are similar or not similar.
>
> Each filtering (often called a distance function) function f accepts 2
> strings and returns a value which is called differently and the naming
> strictly depends on the method but the general term is a distance.
>
> This distance function has the following properties:
>
> 1) D(X, X) = 0
> 2) D(X, Y) > 0 when (X != Y)
> 3) D(X,Y) = D(Y,X)
>
> The returned match coincidence value is *the* value which tells how
> similar are the strings. The value returned by the distance function
> is also a distance and tells us how far apart are the 2 strings in the
> same metric space.
>
> Having this in mind I started doing a small research (looking it up on
> my favorite search engine) and what I found was really surprising.
>
> The only good resource for such algorithms is
> which gives a handy list of distance functions
>
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
>
> Knowing of Hamming, Levenshtein and Q-grams functions a bit
> I know they are too slow for my needs. I have no idea of other
> functions
> but looking at execution time comparision graph on the page specified
> above I saw that Jaccard Similarity function or Overlap coefficient
> are giving quite good results. Then I need this quite quick so
> implementing an unknown function is going to be quite painful to me.
>
> So in the era of fuzziness everywhere I thought that this might be the
> good place to ask as my favorite search engine is useless in this
> case.
>
> Is anyone aware of a good implementation of any of these algorithms
> in C or perhaps some opensource C library for that purpose?
> Do you have any recommendations?
>
>
>
> thanks,
> Mateusz Berezecki
> _______________________________________________
> Dailydave mailing list
> Dailydavelists.immunitysec.com
> http://lists.immunitysec.com/mailman/listinfo/dailydave
>

- --
Martin Roesch - Founder/CTO, Sourcefire Inc. - +1-410-290-1616
Sourcefire - Security for the Real World - http://www.sourcefire.com
Snort: Open Source IDP - http://www.snort.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)

iD8DBQFE+D5Xqj0FAQQ3KOARAphFAJsG4pxiugksePQ4FLC8tJ0HH6NegACfbee/
ffzWTEf6dPKxfD9Ao2kDCtU=
=dOBm
-----END PGP SIGNATURE-----
_______________________________________________
Dailydave mailing list
Dailydavelists.immunitysec.com
http://lists.immunitysec.com/mailman/listinfo/dailydave