From: Alex Zinin (azininCISCO.COM)
Date: Mon Mar 26 2001 - 23:18:29 CST
> Ok, so the question I have now is: Why is the OSPF process crashing?
> a) bad hardware
> b) bad software
> Case A is interesting if we consider the subtle cases, such as ... oh I
> don't know, how about a Process Memory Parity Error, not that it happens
> in a real network or anything. Now essentially we have some scrambled
> state somewhere in the box, we don't know what it is, all we know is that
> it caused our software to crash. Let us gracefully restart the process.
> Never mind that we don't know what really happened, what state any of the
> databases are in, and we can now assume we gracefully rebuild all of them?
Let me ask you a question: when you see a process/router crash
after it has been running for a while, what is the first thing
you do: decode the crashinfo and gdb the coredump, or bring
the router back up?
Or put it this way: would you not like your router to at least try
to bring the process that has just crashed back up (with an
oscillation prevention mechanism, of course)?
> Case B is even more interesting. The software crashed. Ok, so now I'm
> supposed to believe that crashing software can be fixed by adding MORE
> software and protocol extensions, even though the original software
> didn't work well enough not to crash in the first place? Essentially my
> take on this is: I couldn't write robust software, so let me make the
> system more robust by adding more software!
Following your logic here, we would probably have to remove
protected memory support from all CPUs and OSes and have all programmers
fix all their bugs in all their programs... Oh, dear...
> The idea here being that perhaps the assumptions underlying "graceful"
> restart could stand to be examined a bit more carefully. What exactly
> are the cases you are solving for?
Those that include the unplanned restart case.