Re: Greylisting: soft fail if grey listing server down

From: Aaron Wolfe (aawolfegmail.com)
Date: Mon May 12 2008 - 18:36:44 CDT


On Sun, May 11, 2008 at 4:29 PM, mouss <moussnetoyen.net> wrote:

> Geert Hendrickx wrote:
>
> >
> > Why do people always insist to link "unavailable" to "crappy software" ?
> >
> >
>
> because it is? now, stop talking theory. give us examples of real failures
> of good policy servers.
>

I can give an example. We wanted to evaluate greylisting here, and after
looking at the available options decided sqlgrey would fit our needs. We
installed it on a non production box, learned how it worked, did some
testing, etc. Things looked good, so we added it to our live environment.
Things worked well for several days, might have been a week.

Then one day, *very* early in the morning, I woke to my cell going crazy
with emails from nagios and voicemails from frantic sales people. No mail
was coming in, oh the humanity, horror, horror, etc etc.

sqlgrey had a bug (since fixed) where it would occasionally die during log
rotation. Restart and all is well. It managed to die on both of my inbound
servers at the same time: jackpot winner.

Later that day I wrote a script to check and restart it as necessary. It
still bothered me that I was deferring a handful of messages every few days
while the script did its business, but soon a patch was available (although
that script still runs, just in case).
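
For illustration only, a check-and-restart script of this kind might look
roughly like the sketch below. It is not the script mentioned above. It
assumes the policy daemon listens on TCP 127.0.0.1:2501, and the restart
command is a hypothetical placeholder that would need to match the local
init system. It would be run periodically, for example from cron.

```python
import socket
import subprocess


def is_listening(host="127.0.0.1", port=2501, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: treat all of these as "down".
        return False


def check_and_restart(restart_cmd, host="127.0.0.1", port=2501):
    """Run restart_cmd if the daemon is not answering.

    restart_cmd is an assumption, e.g. ["/etc/init.d/sqlgrey", "restart"].
    Returns True if a restart was attempted, False if the daemon was up.
    """
    if is_listening(host, port):
        return False
    subprocess.run(restart_cmd, check=False)
    return True
```

A check like this only shortens the outage. Mail is still deferred between
the failure and the next run of the script.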

I did have monitoring in place when the failure occurred, but since the bug
was unknown to me, I had nothing automatic to restart the service. If I
could somehow have told postfix that I really don't care whether that policy
server works or not, that would have been a much better morning for
me and lots of the people I support. It wasn't a huge thing, obviously
everything eventually came in, but it did kind of suck for a while. Finding
some new errors in the logs and sorting out the failure on my own time would
have been vastly preferable to dealing with angry salesholes in the early
morning.

Maybe I am a bad administrator? I am completely open to that idea, but if
you have criticism please express it in a way that will help me learn better
ways to do this.

The idea of writing a proxy/wrapper to ensure a DUNNO result had occurred to
me before, but that feels like a hack. Maybe it is the only way, though.
Would this be a "Bad Idea"?
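
To make the proxy/wrapper idea concrete, here is a minimal sketch, not a
tested production tool. It relies on the postfix policy delegation protocol:
a request is a block of name=value lines ended by an empty line, and the
reply is a single action= line followed by an empty line. The backend
address 127.0.0.1:2501, the listen port 10031, and the names ask_backend,
PolicyProxy and serve are all assumptions made for this example.

```python
import socket
import socketserver

# Reply sent to postfix whenever the real policy server cannot answer.
FALLBACK = b"action=DUNNO\n\n"


def ask_backend(request, backend=("127.0.0.1", 2501), timeout=2.0):
    """Forward one policy request; return FALLBACK on any failure."""
    try:
        with socket.create_connection(backend, timeout=timeout) as sock:
            sock.sendall(request)
            reply = b""
            while not reply.endswith(b"\n\n"):
                chunk = sock.recv(4096)
                if not chunk:  # backend closed the connection early
                    break
                reply += chunk
    except OSError:
        # Connection refused, timeout, reset, and so on.
        return FALLBACK
    if reply.startswith(b"action=") and reply.endswith(b"\n\n"):
        return reply
    return FALLBACK  # truncated or malformed reply


class PolicyProxy(socketserver.StreamRequestHandler):
    """Handle one postfix connection, which may carry several requests."""

    def handle(self):
        request = b""
        for line in self.rfile:
            request += line
            if line in (b"\n", b"\r\n"):  # empty line ends a request
                self.wfile.write(ask_backend(request))
                self.wfile.flush()
                request = b""


def serve(listen=("127.0.0.1", 10031)):
    """Run the proxy until interrupted."""
    with socketserver.ThreadingTCPServer(listen, PolicyProxy) as server:
        server.serve_forever()
```

With something like this running (by calling serve()), postfix would point
at the proxy rather than at the policy server directly, for example with
check_policy_service inet:127.0.0.1:10031. The trade-off is that the proxy
itself becomes the thing postfix depends on, so it has to stay simpler and
more reliable than what it wraps.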

What is the best way to run a service that you don't care to depend upon?
If everything is always critical, how do you test new things? I would like
to be as 'safe' as possible. For some things, I suppose you can just test
it on only one of your servers and be fairly safe, but with spam fighting
stuff like greylisting, that doesn't really work.

-Aaron