Well, that was damn annoying. I had assumed that no emails bounced during the outage, since there was no reason for them to bounce. Unfortunately it seems there was a screw-up on the backup MX server (emx1) that did cause some emails to bounce.
On emx1 (the backup MX server) the DNS configuration was wrong. /etc/resolv.conf had been set with the internal DNS server as the primary one, but with an external DNS server as the secondary. That's clearly bogus when the resolver is used to resolve internal domain names: once the internal server stops answering, lookups fall through to an external server that knows nothing about our internal names.
On top of that, it turns out that in Postfix, if a DNS lookup for an *lmtp* delivery host fails, it's treated as a permanent error and the email is bounced. From what I can see, there's actually no configuration setting in Postfix to change this, which is annoying.
Anyway, we've now completely revamped the internal DNS setup. There are now dual slave servers at the main NYI site. This is something we used to have, and we had planned to have again, but we hadn't actually finished the switch-over to slaving our DNS from Opera's servers internally. We've set up the resolv.conf configs to fall back to the other internal DNS server if one is down. emx1 also has its own slave server, with a fallback to the NYI one via the VPN if for some reason its local one is down. The external DNS servers have been removed from resolv.conf, so it'll never query those directly.
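For what it's worth, the "no external servers in resolv.conf" rule is easy to check automatically. Here's a minimal sketch in Python, not something we actually run. It assumes the internal DNS servers all sit on private address ranges, and the function name and the sample addresses are made up for illustration (8.8.8.8 is just a stand-in for "some external resolver").

```python
import ipaddress


def external_nameservers(resolv_conf_text):
    """Return the nameserver addresses that are NOT on private ranges.

    Assumption: internal DNS servers all live on private address space,
    so anything else listed is an external server that should not be there.
    """
    external = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' or ';' at the start of a line)
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] != "nameserver" or len(fields) < 2:
            continue
        try:
            addr = ipaddress.ip_address(fields[1])
        except ValueError:
            # Not a plain IP address; ignore it in this sketch
            continue
        if not addr.is_private:
            external.append(fields[1])
    return external


# Hypothetical resolv.conf contents, for illustration only
sample = """\
# two internal slaves, plus an external server that should not be here
nameserver 10.0.0.2
nameserver 10.0.0.3
nameserver 8.8.8.8
options timeout:1 attempts:2
"""

print(external_nameservers(sample))  # ['8.8.8.8']
```

On a real machine you'd feed it the contents of /etc/resolv.conf and complain loudly if the returned list isn't empty.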
We really should remember Murphy's Law. Things will fail at the worst possible time and in the worst possible way. This was about the only kernel crash we've had in the last 6 months, and of course it would happen on the primary DB server, at the time we had the not-yet-redundant DNS server on there as well... *sigh*
Normally I'd go and work out who was affected by the bounces. Unfortunately the logs kept on emx1 are much smaller than at the main site, so the ones from when the event occurred have already been rotated out, and we can't work out exactly which emails bounced. Grrr. The good news is that emx1 will only be around another 2-3 months, and the replacement will be much more interesting and useful.
Ooooo, actually, I can tell how many emails bounced, because to send out the bounce emails, emx1 would have forwarded them back into our outgoing server at NYI (and it does so via a static IP default transport, so that path was unaffected by the DNS error). So in the main outgoing server log I can look for emails where from=MAILER-DAEMON@messagingengine.com and source=emx1.messagingengine.com. That will tell me how many emails bounced, and where they were bounced to. Unfortunately it won't tell me who the emails were originally sent to; that's been lost because it's not in any of the SMTP information that's logged. I can fix that for the future as well, by logging the Original-Recipient header. That won't help for the past, but it will help if we ever need to track down who bounced emails were for and for some reason only have the bounce email log information.
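The log check described above boils down to something like the following Python sketch. The real log format isn't shown here, so this assumes each log line carries space-separated key=value pairs with `from`, `to` and `source` keys; that layout, the function name and the sample lines are all hypothetical.

```python
import re
from collections import Counter

# Matches key=value pairs, tolerating optional <> around the value
PAIR = re.compile(r"(\w+)=<?([^\s,>]+)>?")


def count_bounces(lines,
                  sender="MAILER-DAEMON@messagingengine.com",
                  source="emx1.messagingengine.com"):
    """Count bounce messages per destination address.

    A line counts as a bounce if it came from the mailer daemon
    and was handed to us by the backup MX server.
    """
    bounced_to = Counter()
    for line in lines:
        fields = dict(PAIR.findall(line))
        if fields.get("from") == sender and fields.get("source") == source:
            bounced_to[fields.get("to", "unknown")] += 1
    return bounced_to


# Made-up log lines in the assumed format
sample = [
    "from=MAILER-DAEMON@messagingengine.com to=alice@example.com source=emx1.messagingengine.com",
    "from=bob@example.org to=carol@example.com source=emx1.messagingengine.com",
    "from=MAILER-DAEMON@messagingengine.com to=alice@example.com source=emx1.messagingengine.com",
    "from=MAILER-DAEMON@messagingengine.com to=dave@example.net source=other.messagingengine.com",
]

bounces = count_bounces(sample)
print(sum(bounces.values()), dict(bounces))  # 2 {'alice@example.com': 2}
```

Note that the `to` address here is where the bounce went (the original sender), not who the original email was for, which is exactly the gap that logging Original-Recipient would close.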
Back to the data: after checking the logs, 354 emails bounced. Damn. Any bounced emails are bad, bad, bad.
All up summary:
1. We were in the process of making internal DNS server changes. We'd started but not fully finished. We got caught by a kernel crash on a machine that had the new DNS service, which hadn't been replicated yet. Our fault, we shouldn't have left things half done.
2. The DNS failure triggered a misconfiguration on emx1 that had never been noticed before. That caused some email to bounce. That configuration has been updated so it doesn't happen again.