From: Clifton Royston (cliftonr_at_lava.net)
Date: Wed Sep 04 2002 - 13:24:24 CDT

    On Tue, Sep 03, 2002 at 04:52:06PM -0400, Greg A. Woods wrote:
    > > This also presumes (like every other anti-spam silver bullet) that
    > > spammers are completely incapable of adapting, which has been
    > > repeatedly shown to be wrong.
    >
    > Do you have a scientific criticism of Graham's assertion about how
    > robust a Bayesian filter should be even when the "attacker" knows
    > exactly what algorithm is being used?

      Sure. And in fact, they are *already* counter-adapting to both the
    regular expression body matches and the Bayesian approach. I have seen
    3 different approaches already!

      Approach 1, outline of proof: the Bayesian algorithm deals with
    tokens. It is trivial for an "attacker" who knows the algorithm for
    the tokenization to randomly break up the text in each spam being sent
    out such that the same tokens do not reappear and so do not get
    "learned" and cannot be matched.

      I just got forwarded a copy of one such spam in which HTML comments
    embedded between words and in the middle of words are being used to
    break them up. This would stop both most regex matches and break the
    Bayesian algorithm as applied to the body, but the bulk of email
    readers using HTML-enabled mail clients will not even see those breaks.
    Once you get to the headers... well, the only thing you can trust is
    what's filled in by your own site, so everything else could pretty well
    be randomized too.
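
To make the HTML-comment trick concrete, here is a minimal sketch. The tokenizer, the sample text and all the names are my own assumptions for illustration, not any real filter's rules. It shows a naive tokenizer never seeing the word the reader sees, and what changes if comments are stripped before tokenizing.

```python
import re

# Hypothetical tokenizer: runs of letters, digits, $ and ' count as tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$']+")
# HTML comments, which an HTML mail client will not display.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def tokenize(text):
    """Split raw text into lowercase tokens, markup and all."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

def tokenize_stripped(text):
    """Same tokenizer, but with HTML comments removed first."""
    return tokenize(COMMENT_RE.sub("", text))

# Made-up sample body with comments planted in the middle of words.
spam = "Get your mort<!-- x7q -->gage ap<!-- k2 -->proved today"

raw_tokens = tokenize(spam)             # what a naive filter learns from
clean_tokens = tokenize_stripped(spam)  # closer to what the reader sees

print(raw_tokens)
print(clean_tokens)
```

The naive pass only ever sees fragments such as "mort" and "gage", plus whatever junk the sender put inside the comments, and that junk can differ in every copy.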

      Once this becomes common, then in your case and probably Paul
    Graham's case, you could tweak your tokenization rules and look for the
    HTML comment tag to show up in your stats as an indication of spam.
    However, that may not work for those who do receive HTML mail, and it
    also would n't stop the s enders from j ust ra ndoml y bre aking up
    wo rds l ike t his which impairs readability remarkably little but
    again defeats the algorithm before it gets started.
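
The random word-breaking variant can be sketched the same way. This is an illustration under my own assumptions (the function name, the break rate and the sample sentence are all invented), not a description of any actual spamware.

```python
import random

def break_words(text, rng, rate=0.5):
    """Insert one space at a random position inside some of the longer words."""
    out = []
    for word in text.split():
        if len(word) > 3 and rng.random() < rate:
            cut = rng.randint(1, len(word) - 1)  # both pieces stay non-empty
            word = word[:cut] + " " + word[cut:]
        out.append(word)
    return " ".join(out)

message = "refinance your mortgage today"
rng = random.Random(2002)  # fixed seed only so the sketch is repeatable

# Each outgoing copy can be broken differently, so tokens rarely repeat.
copies = [break_words(message, rng) for _ in range(3)]
for c in copies:
    print(c)
```

Every copy still reads the same once you ignore the spaces, but a filter that splits on whitespace gets a different bag of tokens each time, with nothing stable to accumulate counts on.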

      Approach 2, outline of proof: take a statistically significant sample
    of known non-spam, build a Bayesian frequency model from it, and
    construct your sample spam according to that frequency model. I've got
    another recent spam in my collection that is written as a very
    plausible chatty personal email in reply to someone else's email (which
    it wasn't). It talks about having moved recently, wanting to get
    together sometime, blah blah blah, and then mentions the sender's home
    page (URL). The latter turns out to be a porn site entry page. In this
    email there were no typical spam phrases or keywords. The Bayesian
    algorithm is going to have a very hard time with something like that; I
    do not believe it can readily be mechanically distinguished from a
    genuine personal email which happens to mention a URL. A really
    diligent attacker would take a broad sample of real personal emails,
    sampled from an ISP, use them to get word frequency counts, and then
    use those as a starting point for randomly permuting a text base with
    synonyms, so as to avoid building high statistical counts for any few
    words. There goes the statistical recognition model - any Bayesian
    metric which excludes this is provably likely to also randomly exclude
    real email which does not fit the profile of what the user has received
    lately.
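
A rough sketch of why the chatty message slips through, using the combining rule as I recall it from Graham's essay: take the most "interesting" tokens, treat unknown tokens as 0.4, and combine as prod(p) / (prod(p) + prod(1 - p)). The per-token probabilities in the table below are invented for illustration, not measured from any corpus.

```python
from math import prod

def combined_spam_probability(probs):
    """Naive Bayes combination of per-token spam probabilities."""
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)

def score(tokens, token_probs, unknown=0.4, top_n=15):
    """Score a message from the top_n tokens farthest from a neutral 0.5."""
    probs = [token_probs.get(t, unknown) for t in sorted(set(tokens))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combined_spam_probability(probs[:top_n])

# Hypothetical learned table: made-up numbers, purely for illustration.
token_probs = {
    "viagra": 0.99, "free": 0.95, "click": 0.93,
    "moved": 0.10, "together": 0.15, "sometime": 0.12, "homepage": 0.30,
}

typical = "click here free viagra".split()
chatty = ("we moved recently want to get together "
          "sometime see my homepage").split()

typical_score = score(typical, token_probs)
chatty_score = score(chatty, token_probs)
print(typical_score, chatty_score)
```

With these assumed numbers the classic spam lands well above a 0.9 cutoff while the chatty one lands near zero, because every token in it looks like ordinary personal mail.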

      Approach 3 (anecdotal): One of my staff has (but unfortunately didn't
    save) another spam in which there are HTML comment blocks embedding
    various chunks of technical discussion. This would have some chance of
    ghosting through Paul Graham's filters, or other techies' filters, on
    the strengths of the positive matches from the irrelevant embedded
    material. This approach seems more sketchy but it might be made to
    work with better analysis of words which are likely to have high
    positive scores in real technical email, and hence in the positive
    metrics of users of the Bayesian filtering method.
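
The padding attack can be sketched with the same combining rule, assuming the filter tokenizes text inside HTML comments along with the visible body. Again the probabilities are invented and the names are hypothetical.

```python
from math import prod

def combined_spam_probability(probs):
    """Combine per-token spam probabilities: prod(p) / (prod(p) + prod(1-p))."""
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)

# Hypothetical table for a techie's filter: made-up numbers for illustration.
token_probs = {
    "click": 0.93, "free": 0.95,           # typical spam words
    "postfix": 0.02, "smtpd": 0.02,        # words from real technical mail
    "queue": 0.02, "config": 0.02,
}

visible = ["click", "free"]                       # what the reader sees
hidden = ["postfix", "smtpd", "queue", "config"]  # tucked inside a comment

visible_score = combined_spam_probability([token_probs[t] for t in visible])
padded_score = combined_spam_probability(
    [token_probs[t] for t in visible + hidden])
print(visible_score, padded_score)
```

Under these assumptions a handful of strongly "good" tokens is enough to drag an obvious spam below the cutoff, which is why the attack depends on guessing which words score well for the target audience.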

      Before Canter and Siegel, the Internet technical community kept
    saying that the kind of person who would mechanically spam Usenet (or
    email) was too dumb to figure out the technical details of how to do
    it. I'm amazed that people still keep taking that viewpoint, this many
    years further on.

      Again, to repeat my earlier statement: I am *not* saying the Bayesian
    technique is useless. I think it will do very well at classifying some
    broad ranges of spam, and will be a very useful addition to the
    arsenal. At the least, if it becomes widely used, it will raise the
    ante on the necessary sophistication for spammer software. However, I
    think that on the strength of one essay it is being hailed as a
    panacea, and "panaceas are poison."

      -- Clifton

    -- 
        Clifton Royston  --  LavaNet Systems Architect --  cliftonr@lava.net
    "What do we need to make our world come alive?  
       What does it take to make us sing?
     While we're waiting for the next one to arrive..." - Sisters of Mercy