Subject: Re: Content filtering 101
From: Liviu Daia (Liviu.Daia@imar.ro)
Date: Wed Jun 07 2000 - 19:33:30 CDT
On 7 June 2000, Bennett Todd <bet@rahul.net> wrote:
> 2000-06-07-09:16:00 Liviu Daia:
> > (2) Bennett Todd's tailbiter, version 1.1 (with the latest patch).
> >
> > It has the same major drawbacks as the previous versions of
> > macofida:
>
> Yup. I was trying to figure out the filtering interface and get a
> simple tool in place for function and performance testing; I'm going
> to take a look at the latest macofida as a framework for running
> my filter, although I want to lose the factor-of-4 performance hit
> first. Then again, I guess I oughta try macofida and make sure it
> suffers a comparable performance hit; the problem could be in the
> choice of modules I'm using for the SMTP, and the way I'm using them.
As a matter of fact, IMO you do have a problem with
Net::SMTP::Server (see below), but I think Net::SMTP::Server::Client and
Net::SMTP are pretty much all you can hope for (if you forget about the
OOP overhead, that is).
Anyway, as it is now macofida is slower than your tailbiter, simply
because it reads the messages one line at a time, and saves them to
temporary files. I believe this can be fixed though; see below.
> > - it can't send back an error code to the client process, so if
> > something goes wrong the best it can do is to save the message to
> > /var/tmp/tailbiter.<PID> (provided it managed to read it first ---
> > otherwise the message is simply lost);
>
> If it didn't manage to read it first, I'd like to hope that Postfix
> would have noticed that the dialogue failed, and held the message. If
> not there's not much I can do.
Ok, true.
> It's true that it'd be sexier to be able to report other problems ---
> can't relay back in to postfix, out of memory, out of disk, whatever
> --- back to the sending postfix, to get the message held where it
> is. That's the main incentive I see to switch to macofida.
The only thing that really matters here is being able to send back
a '450' in case of error. Conditions like "out of memory" and "out
of disk" should be logged by the filter itself before bailing out.
Additionally sending the explanation back to Postfix along with the
'450' would be nice, but not essential.
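To make the '450' idea concrete, here is a minimal sketch, written in Python rather than Perl purely for illustration. The function name `smtp_reply_for`, the `process_message` callback, and the reply wording are all hypothetical; neither macofida nor tailbiter is known to work this way. It assumes the filter decides its reply once the message body has been received: it logs the failure itself, then answers with a temporary error so the sending Postfix keeps the message queued.

```python
import logging


def smtp_reply_for(process_message, message):
    """Return the SMTP reply line to send once the message has been received.

    process_message is a hypothetical callback that scans and relays the
    message; it is expected to raise on resource problems.
    """
    try:
        process_message(message)
    except (MemoryError, OSError) as exc:
        # Log the real cause locally before bailing out; the text sent
        # back with the 450 is a courtesy, not the primary record.
        logging.error("filter failed: %r", exc)
        return "450 Temporary filter failure (%s)\r\n" % type(exc).__name__
    return "250 Ok\r\n"
```

Only the exception class name goes back on the wire, so a multi-line error message cannot corrupt the single-line reply.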
> > - the message is loaded completely in memory; that happens because
> > of the way Net::SMTP::Server::Client works, so saving it to a
> > temporary file won't change that;
>
> And indeed that's necessary if the scanning algorithm you wish to
> apply is running a regexp (or series of regexps) over the whole
> message in multi-line match mode.
I'm not sure I want to load a 10 MB message in memory, no matter
what. Actually, I'm not even sure running a multi-line match against
a message that size would be faster than a line-by-line match. I'd
also expect the match function alone to eat something like 20 MB all by
itself.
But you do have a point here: the vast majority of messages _are_
small enough to read in memory. Now, the right way to handle that is
probably to get the size of the message from the SMTP negotiation, and
act accordingly:
- if it's smaller than, say, 100 KB load the message in memory;
- if it's smaller than 1 MB read the message one line at a time and save
it to a temporary file;
- otherwise relay the message unfiltered.
In order to do that you'd have to hack Net::SMTP::Server::Client to
- advertise "ESMTP" and "SIZE";
- parse the size from the "MAIL FROM:".
That's easy with the current version of macofida --- although I'm
starting to believe the right thing to do would be to write your own
_simple_ SMTP client and server, and forget about all these contortions.
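The size-based policy above could be sketched roughly as follows, in Python rather than Perl, for illustration only. The names `declared_size` and `choose_strategy` and the returned labels are hypothetical. The thresholds are the ones suggested above. Falling back to the temporary-file path when the client declares no size is an assumption of this sketch, not something settled in this thread.

```python
import re

IN_MEMORY_LIMIT = 100 * 1024   # "say, 100k": small enough to load in memory
TEMP_FILE_LIMIT = 1024 * 1024  # 1 MB: beyond this, relay unfiltered

# ESMTP SIZE parameter as it appears after the reverse-path,
# e.g. "MAIL FROM:<user@example.com> SIZE=12345"
SIZE_PARAM = re.compile(r"\bSIZE=(\d+)\b", re.IGNORECASE)


def declared_size(mail_from_line):
    """Return the SIZE= value from a MAIL FROM command, or None if absent."""
    # Look only at what follows the closing '>' of the reverse-path,
    # so an address that happens to contain "size=" is not misread.
    _, _, params = mail_from_line.rpartition(">")
    match = SIZE_PARAM.search(params)
    return int(match.group(1)) if match else None


def choose_strategy(size):
    """Map a declared size in bytes to one of three handling strategies."""
    if size is None:
        # Assumption: with no declared size, take the safe streaming path.
        return "tempfile"
    if size < IN_MEMORY_LIMIT:
        return "memory"
    if size < TEMP_FILE_LIMIT:
        return "tempfile"
    return "relay_unfiltered"
```

For example, `choose_strategy(declared_size("MAIL FROM:<a@example.com> SIZE=50000"))` selects the in-memory path. The declared size is only the client's estimate, so a real filter would still need to cope with a message that turns out larger than announced.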
> > - it uses Net::SMTP::Server for spawning children, which looks much
> > less robust than Net::Daemon, at least AFAICT.
>
> Sorry? Net::SMTP::Server doesn't spawn children, and if I didn't
> explicitly fork right in the perl I wrote, this would be a
> single-threaded, purely sequential SMTP server. And given the
> benchmark results I got --- no improvement from multi-processing ---
> that'd probably be more efficient. But Linux forks quick, so I'm
> not going to sweat that performance hit right now. The main reason
> I forked was so that when reading the whole message into memory,
> one gigantic bloater of a message wouldn't leave the long-lived,
> persistent daemon all bloated up.
Like I said, Net::Daemon is much more robust. Among _many_ other
things, it can handle interrupted system calls, which is something you
have to do on SysV OSes. Net::SMTP::Server is a toy; the one you want
is Net::Daemon.
> > It also has a few other minor annoyances, like calling
> > "gethostbyaddr" for no real reason (fun when not running a DNS),
>
> Well, it seemed to me the natural way to check and make sure the
> connection was coming from localhost. Easy enough to # out if you
> don't want to do that check. Or recode it to compare the raw addr
> against a packed 127.0.0.1.
Yup, that's the way I'd do it. There's no need to call
"gethostbyaddr".
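A minimal sketch of that comparison, in Python rather than Perl for illustration (`is_from_localhost` is a hypothetical helper name): the peer address is packed into its 4-byte form and compared against a packed 127.0.0.1, so no resolver call is made.

```python
import socket

# Packed 4-byte form of the loopback address, computed once.
LOOPBACK = socket.inet_aton("127.0.0.1")


def is_from_localhost(peer_ip):
    """True if the IPv4 peer address packs to exactly 127.0.0.1."""
    try:
        return socket.inet_aton(peer_ip) == LOOPBACK
    except OSError:
        # Not a valid IPv4 address string (for example an IPv6 peer).
        return False
```

With an accepted connection `conn`, the check would be `is_from_localhost(conn.getpeername()[0])`. It works with no DNS running because nothing is looked up.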
> Whatever.
>
> > [...] and not being able to cope with my (admittedly not the
> > latest-and-greatest) Perl 5.004_04 at home, because:
>
> because 5.004_04 doesn't support pre-compiling regexps, that's why.
>
> If you want to recode the filter to recompile the regexps for every
> message, so you can continue to use an unfortunately old release, that
> shouldn't be too hard; in fact, I think it'll work if you just:
>
> --- tailbiter Tue Jun 6 09:59:57 2000
> +++ tailbiter.oldperl Wed Jun 7 11:31:37 2000
> @@ -64,7 +64,7 @@
>  open(FP, "<$pat") or die "$0: $pat: $!\n";
>  while (<FP>) {
>  chomp;
> - push @re, qr/$_/im;
> + push @re, $_;
>  }
>  close FP;
$ ./tailbiter
: syntax error at ./tailbiter line 106, near ") for "
: Global symbol "smtp" requires explicit package name at ./tailbiter line 107.
: Global symbol "msg" requires explicit package name at ./tailbiter line 110.
: syntax error at ./tailbiter line 111, near "} else"
: Execution of ./tailbiter aborted due to compilation errors.
Sorry for being an asshole and not fixing it myself. :-)
> > (a) Basically, the above setup is useless for testing filter
> > performance. What seems to happen here is that Postfix will
> > happily pump up the messages to the filter at full speed (because
> > of the 1000 limit above I get essentially the same rate as in the
> > unfiltered case), a huge queue is created by the second smtpd, and
> > smtp-source returns without waiting for it to drain. stat-ing the
> > spool would probably interfere with the results, so how do we test
> > this? Comments / corrections / suggestions welcome.
>
> Grab timing from the logfile, rather than the injector. Remember to
> reset the logfile between each run, both to ensure identical testing
> circumstances (probably not significant) and to make it really easy to
> sort out logfile entries by which run they apply to.
Yuck. Ok, I can do that.
> > (b) Running tinydns as above seems to be important. Even with
> > "disable_dns_lookups = yes", Postfix tries to resolve localhost.
> > Without actually looking into it, I'd say $disable_dns_lookups
> > only affects smtp, while smtpd still tries to resolve client's IP.
> > Wietse?
>
> Tries to _resolve_ localhost? Or to gethostbyaddr localhost?
I think I said "tries to resolve client's IP".
> The latter should be happy to refer to /etc/hosts if you really have
> no DNS at all.
That's what I thought too, but for some reason it doesn't. I might
even bother to find out why one day.
> I ran dnscache as my resolver, and it has localhost wired in, so I
> didn't notice that potential problem.
>
> > (c) I didn't try running tailbiter instead of macofida, but I
> > suspect the same thing happens with it: the "backlog" observed
> > earlier by Bennett is actually the queue created by the
> > second smtpd, and the big speed difference is actually due to
> > "gethostbyaddr" and / or other DNS lookup failures. Again, comments
> > / corrections are welcome.
>
> Now _that's_ a fascinating theory, I should be able to test that easy
> enough....
>
> Nope, I do not believe that's the case, at least with my test setup.
> When I page through a logfile from my test run, I get 5-10 messages
> being accepted by smtpd for each smtp sending on into the filter,
> until the injector visibly stops loading messages in, and the backlog
> drains. The filter was continuing to be invoked until very near the
> end of the logfile; the back-end smtp delivering into /dev/null does
> not seem to have collected any backlog.
I posted a correction. Anyway, did you raise the smtpd limit to
1000 like I did?
> Would you like me to email you a log? Only about 50KB compressed with
> bzip2.
I take your word for it. We're not testing the same setup anyway.
Regards,
Liviu Daia
--
Dr. Liviu Daia               e-mail: Liviu.Daia@imar.ro
Institute of Mathematics     web page: http://www.imar.ro/~daia
of the Romanian Academy      PGP key: http://www.imar.ro/~daia/daia.asc