From: Wietse Venema (wietse_at_porcupine.org)
Date: Mon Sep 02 2002 - 09:58:05 CDT
Interesting. This certainly pushes the limit of usability for a
solution that does not match the problem well (Solution: match each
message body line as a completely independent piece of data.
Problem: mail content is organized into clumps of similar content).
My fault. Neat hack, though.
Wietse
Bert Driehuis:
> I ran into performance issues with body_checks. My cleanup daemon was
> using huge amounts of CPU time, and profiling showed that (surprise!)
> pcre_exec was the culprit. I've toyed with solutions outside of
> Postfix (like merging as many rules as possible into one regexp), but
> this quickly proved itself to be unworkable (mechanising it requires a
> PhD in traveling salesman planning, PCRE limitations show, and the
> resulting regexp is error prone and hard to debug, and barely faster to
> boot). After a bunch of false starts, this is what I think is the bottom
> line:
>
> The bottleneck of Postfix's current PCRE dictionary is that for every
> line, all regexps have to be tried. With 800 body checks, as I had at
> some stage, that's millions of regexp matches for a single message with
> an attachment of a couple hundred K. Header_checks don't hurt as badly,
> but if you handle large volumes of small messages and have more than a
> few rules, this will still take a significant bite out of rule checking
> (especially if you wish to keep your ruleset legible).
>
> I tried to optimize it by skipping irrelevant checks. I considered
> doing this in a way that's completely hidden to the user, but PCRE is
> too rich to safely automate this, and besides, I didn't want to burden
> the Postfix code with complex (and thus, hard to test) code.
>
> The attached diff allows for an IF .. ENDIF construct, like this:
>
> IF /http/
> /http:\/\/banned_site\// REJECT
> /http:\/\/anothersite\// REJECT
> /http:\/\/yetanothersite\// REJECT
> ENDIF
> IF /</
> /<script.*evil.js/ REJECT
> /<embed src.*evil.swf/ REJECT
> ENDIF
>
> The regexps between the IF and the ENDIF are not evaluated unless the
> regexp after the IF matches. Some trivial reorganisations of my
> body_checks reduced the CPU load to a third of what it was before, when
> run on my spam archive (I don't have long term measurements from a live
> system yet, but I expect my spam mailbox to be richer in the words I key
> off than real-life mail). I haven't even started going over the whole
> set; I'm currently still evaluating about 100 rules that probably
> share more keywords to key off.
>
> This still doesn't turn PCRE checks into a procedural language, which is
> probably a good thing; anyway, this is intended for optimization only.
>
> I'd like to get feedback both on the quality of the solution, and on its
> necessity. I'll be happy to keep this as a private solution and maintain
> it like Jozsef Kadlecsik's per-user UCE patch, but maybe Wietse can be
> swayed into implementing this or something like it in the official
> Postfix release eventually. Please drop me a note in private if you just
> want to say "me too" or "I don't care", so as not to swamp this list
> with the "I just want to be counted" responses. It's busy enough as-is
> :-) Google turned up a few concerns about the regexp overhead, but not
> enough to convince me that my performance issue is typical, so feedback
> is definitely appreciated.
>
> Note that this patch addresses PCRE exclusively, but the classic regex
> implementation is similar and I'll be happy to code this up for regex if
> there is interest.
>
> Cheers,
>
> -- Bert
>
> --
> Bert Driehuis, MIS -- bert_driehuis_at_nl.compuware.com -- +31-20-3116119
> Dihydrogen Monoxide kills! Join the campaign at http://www.dhmo.org/
Content-Description:
[ Attachment, skipping... ]
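The attached diff is not reproduced in this archive. As a rough illustration of the IF .. ENDIF idea described above, here is a minimal Python sketch. It is not Bert's patch and not Postfix code: Python's re module stands in for PCRE, the names split_rule, parse_table and check_line are made up for this example, and allowing nested IF blocks is an assumption (the message does not say whether the patch supports nesting). The counter only exists to show how many regexp executions a line costs with and without the guards.

```python
import re

# The example table from the message, kept as a raw string so the
# backslash-escaped slashes survive unchanged.
GATED = r"""
IF /http/
/http:\/\/banned_site\// REJECT
/http:\/\/anothersite\// REJECT
/http:\/\/yetanothersite\// REJECT
ENDIF
IF /</
/<script.*evil.js/ REJECT
/<embed src.*evil.swf/ REJECT
ENDIF
"""

# The same rules with the IF/ENDIF lines stripped, for comparison.
FLAT = "\n".join(
    l for l in GATED.splitlines() if not l.startswith(("IF ", "ENDIF"))
)


def split_rule(line):
    """Split '/pattern/ ACTION' at the first unescaped closing slash."""
    if not line.startswith("/"):
        raise ValueError("expected /pattern/: " + line)
    i = 1
    while i < len(line):
        if line[i] == "\\":      # skip the escaped character
            i += 2
            continue
        if line[i] == "/":
            return line[1:i], line[i + 1:].strip()
        i += 1
    raise ValueError("unterminated pattern: " + line)


def parse_table(text):
    """Build a tree of ('rule', regex, action) and ('if', regex, children)."""
    top = []
    stack = [top]                # innermost open IF block is stack[-1]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "ENDIF":
            if len(stack) == 1:
                raise ValueError("ENDIF without matching IF")
            stack.pop()
        elif line.startswith("IF "):
            pattern, _ = split_rule(line[3:].strip())
            children = []
            stack[-1].append(("if", re.compile(pattern), children))
            stack.append(children)
        else:
            pattern, action = split_rule(line)
            stack[-1].append(("rule", re.compile(pattern), action))
    if len(stack) != 1:
        raise ValueError("IF without matching ENDIF")
    return top


def check_line(rules, line, counter):
    """Return the first matching action, or None; counter[0] counts regexp runs."""
    for kind, regex, payload in rules:
        counter[0] += 1
        if not regex.search(line):
            continue             # a failed IF guard skips its whole block
        if kind == "rule":
            return payload
        action = check_line(payload, line, counter)
        if action is not None:
            return action
    return None


if __name__ == "__main__":
    for name, table in (("gated", parse_table(GATED)), ("flat", parse_table(FLAT))):
        runs = [0]
        result = check_line(table, "hello world", runs)
        print(name, result, runs[0])
```

For a body line that contains neither "http" nor "<", the gated table runs only the two guard regexps, while the flat table runs all five rules. With hundreds of rules behind a handful of guards, that difference is where the saving described in the message would come from.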