Neohapsis is currently accepting applications for employment. For more information, please visit our website www.neohapsis.com or email firstname.lastname@example.org
Re: [Dailydave] approximate string matching
From: Martin Roesch (roeschsourcefire.com)
Date: Fri Sep 01 2006 - 09:06:15 CDT
-----BEGIN PGP SIGNED MESSAGE-----
Have you taken a look at Bloom Filters? I'm not sure what your exact
requirements are but Bloom Filters are fast and do probabilistic
pattern matching. Check out:
They're easy to implement and fast, so maybe that will work for you.
Hope that helps!
On Sep 1, 2006, at 6:52 AM, Mateusz Berezecki wrote:
> Hello list readers,
> Stumbling upon some filtering problem I realised exact string matching
> is not the way to go. Thinking a bit I realised doing an approximate
> string matching is a good solution.
> What is approximate string matching? It's the way to allow pattern
> matching allowing errors, such as typos, character deletions.
> It's often called a fuzzy string matching or string matching allowing
> errors. (please visit
> http://www.nist.gov/dads/HTML/stringMatchwError.html for more
> The heart of a good fuzzy string matching algorithm is a
> filtering function which unlike in exact string matching doesn't
> tell if
> the strings are exact (match / no match answer) but rather tells us
> if the strings are similar or not similar.
> Each filtering (often called a distance function) function f accepts 2
> strings and returns a value which is called differently and the naming
> strictly depends on the method but the general term is a distance.
> This distance function has the following properties:
> 1) D(X, X) = 0
> 2) D(X, Y) > 0 when (X != Y)
> 3) D(X,Y) = D(Y,X)
> The returned match coincidence value is *the* value which tells how
> similar are the strings. The value returned by the distance function
> is also a distance and tells us how far apart are the 2 strings in the
> same metric space.
> Having this in mind I started doing a small research (looking it up on
> my favorite search engine) and what I found was really surprising.
> The only good resource for such algorithms is
> which gives a handy list of distance functions
> Knowing of Hamming, Levenshtein and Q-grams functions a bit
> I know they are too slow for my needs. I have no idea of other
> but looking at execution time comparision graph on the page specified
> above I saw that Jaccard Similarity function or Overlap coefficient
> are giving quite good results. Then I need this quite quick so
> implementing an unknown function is going to be quite painful to me.
> So in the era of fuzziness everywhere I thought that this might be the
> good place to ask as my favorite search engine is useless in this
> Is anyone aware of a good implementation of any of these algorithms
> in C or perhaps some opensource C library for that purpose?
> Do you have any recommendations?
> Mateusz Berezecki
> Dailydave mailing list
Martin Roesch - Founder/CTO, Sourcefire Inc. - +1-410-290-1616
Sourcefire - Security for the Real World - http://www.sourcefire.com
Snort: Open Source IDP - http://www.snort.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)
-----END PGP SIGNATURE-----
Dailydave mailing list