|
Neohapsis is currently accepting applications for employment. For more information, please visit our website www.neohapsis.com or email hr@neohapsis.com |
[Dailydave] approximate string matching
From: Mateusz Berezecki (mateuszb
gmail.com)
Date: Fri Sep 01 2006 - 05:52:13 CDT
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello list readers,
Stumbling upon some filtering problem I realised exact string matching
is not the way to go. Thinking a bit I realised doing an approximate
string matching is a good solution.
What is approximate string matching? It's the way to allow pattern
matching allowing errors, such as typos, character deletions.
It's often called a fuzzy string matching or string matching allowing
errors. (please visit
http://www.nist.gov/dads/HTML/stringMatchwError.html for more
information)
The heart of a good fuzzy string matching algorithm is a
filtering function which unlike in exact string matching doesn't tell if
the strings are exact (match / no match answer) but rather tells us
if the strings are similar or not similar.
Each filtering (often called a distance function) function f accepts 2
strings and returns a value which is called differently and the naming
strictly depends on the method but the general term is a distance.
This distance function has the following properties:
1) D(X, X) = 0
2) D(X, Y) > 0 when (X != Y)
3) D(X,Y) = D(Y,X)
The returned match coincidence value is *the* value which tells how
similar are the strings. The value returned by the distance function
is also a distance and tells us how far apart are the 2 strings in the
same metric space.
Having this in mind I started doing a small research (looking it up on
my favorite search engine) and what I found was really surprising.
The only good resource for such algorithms is
which gives a handy list of distance functions
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Knowing of Hamming, Levenshtein and Q-grams functions a bit
I know they are too slow for my needs. I have no idea of other functions
but looking at execution time comparision graph on the page specified
above I saw that Jaccard Similarity function or Overlap coefficient
are giving quite good results. Then I need this quite quick so
implementing an unknown function is going to be quite painful to me.
So in the era of fuzziness everywhere I thought that this might be the
good place to ask as my favorite search engine is useless in this
case.
Is anyone aware of a good implementation of any of these algorithms
in C or perhaps some opensource C library for that purpose?
Do you have any recommendations?
thanks,
Mateusz Berezecki
_______________________________________________
Dailydave mailing list
Dailydave
lists.immunitysec.com
http://lists.immunitysec.com/mailman/listinfo/dailydave
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]