EmailDiscussions.com  

5th January 2002, 05:41 AM   #1
Shelded
 Moderator 
 
Join Date: Aug 2001
Location: USA Northwest
Posts: 3,842
FM is down

I don't see a notice that this was scheduled, but FM is down and it's been over half an hour.

-----------------
This screen was last refreshed at: Fri Jan 4 22:36:10 2002

The current time is Fri Jan 4 22:42:06 2002

When we last checked, the server was DOWN.

The server has been down since Fri Jan 4 21:47:31 2002
--------------------

5th January 2002, 05:44 AM   #2
munchkin
 Moderator 
 
Join Date: Nov 2001
Location: Milliways
Posts: 1,165
yea...noticed the same...sitting here and waiting..but i got plenty time

5th January 2002, 05:54 AM   #3
munchkin
 Moderator 
 
Join Date: Nov 2001
Location: Milliways
Posts: 1,165
...seems to be working again...

5th January 2002, 05:57 AM   #4
cornbread
Junior Member
 
Join Date: Jan 2002
Posts: 2
Also...

I had problems yesterday afternoon sending messages from the web interface and IMAP was hanging in Mozilla.

5th January 2002, 05:59 AM   #5
cornbread
Junior Member
 
Join Date: Jan 2002
Posts: 2
It's back!

It looks like it is working again!

5th January 2002, 07:16 AM   #6
Jeremy Howard
Ultimate Contributor
 
Join Date: Sep 2001
Location: Australia
Posts: 11,499
Well, it's been a while since we had an outage of more than a few minutes, because we've got a few backup systems in place now. So for a substantial problem to happen now requires multiple failures, which is what happened this morning.

The web server was down for about an hour and a half. All IMAP and mail services continued to operate normally. The point of failure was a program we have that provides IMAP connections to the web processes. When this failed, a number of logs were created, which we're now looking at to determine exactly what problem occurred.

Normally when there is a problem, our regular testing program that runs every 20 minutes will identify it and attempt to take corrective action, as well as notifying us and logging diagnostic information. The program correctly identified the problem and attempted to restart the relevant services (in this case, the web service). Unfortunately, after we added the front-end a couple of weeks ago that compresses data, we forgot to include something to restart the front-end when corrective action is taken. So the problem with the communication between the web server and IMAP server was not corrected. The monitoring program hadn't previously needed to restart any services since we added the compressing front-end, so this problem hadn't previously occurred.

After corrective action is taken, the monitoring program waits 60 seconds and tries again. Normally if the corrective action failed, the 2nd monitoring attempt would fail, and at this point Rob and I get paged. However, the nature of the problem was such that the partial restart of the backend actually resulted in the server working for a couple of minutes correctly, so this 2nd attempt actually succeeded, and we didn't get paged...
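
For readers following along, the check / restart / wait / re-check cycle described in this post can be sketched roughly as below. This is an illustration only, not FastMail.FM's actual monitoring code: the function name `run_monitor_cycle` and the `check`, `restart`, `warn`, `page` and `sleep` callables are hypothetical stand-ins, and only the 60-second delay, the warning email and the paging step come from the post.

```python
from typing import Callable


def run_monitor_cycle(
    check: Callable[[], bool],       # hypothetical: returns True if the service responds
    restart: Callable[[], None],     # hypothetical: restarts the relevant services
    warn: Callable[[str], None],     # hypothetical: sends the warning email
    page: Callable[[str], None],     # hypothetical: pages the on-call people
    sleep: Callable[[float], None],  # injected so the sketch can run without waiting
    retry_delay: float = 60.0,       # the post says the monitor waits 60 seconds
) -> str:
    """Run one monitoring pass and report what happened."""
    if check():
        return "ok"
    # First check failed: take corrective action and send a warning email.
    restart()
    warn("corrective action taken")
    # Wait, then test again to see whether the restart helped.
    sleep(retry_delay)
    if check():
        return "recovered"
    # Second check failed as well: page a human.
    page("corrective action failed")
    return "paged"


# The outage scenario: the first check fails, the partial restart makes the
# second check pass, so nobody is paged even though the fault will return.
results = iter([False, True])
pages: list = []
outcome = run_monitor_cycle(
    check=lambda: next(results),
    restart=lambda: None,
    warn=lambda msg: None,
    page=pages.append,
    sleep=lambda seconds: None,
)
print(outcome, pages)
```

The sketch shows why a single pass cannot catch this failure mode: the decision to page depends only on the second check within the same pass, so a restart that helps for a couple of minutes looks like a full recovery.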

But of course 20 minutes later when it tried again, it failed again. And again it failed to take corrective action. Each time corrective action is taken Rob and I are sent a warning email, but we didn't see these because it's night-time in Australia. When I got up I saw the warning emails (one every 20 minutes for a 90 minute period) and manually fixed the problem.

Anyway, the bit of good news is that when we have a problem like this it gives us a lot of information on how to avoid it next time. Those of you who have been with FastMail.FM for a while have hopefully noticed how its reliability has been consistently improving. As a result of this latest problem I'm going to make some more changes:
  • I'm going to add a test of recovery from failure to the testing routine that we run before we upload new code. If we had had this, the corrective action would have succeeded the first time.
  • I'm going to change our monitoring program so that if it has to take corrective action on 2 runs in any 1 hour, it will page us
  • Rob and I will study the logs of this latest problem to see why our IMAP socket daemon failed in the first place and will fix the pertinent section of code.
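
The second change in the list above (page if corrective action is needed on 2 runs within any 1 hour) amounts to a sliding-window count. A minimal sketch of that rule follows, assuming timestamps in seconds; the class name `CorrectiveActionTracker` and its parameters are hypothetical and not taken from FastMail.FM's code.

```python
from collections import deque


class CorrectiveActionTracker:
    """Decide whether to page, based on recent corrective actions."""

    def __init__(self, threshold: int = 2, window_seconds: float = 3600.0) -> None:
        self.threshold = threshold            # 2 runs, per the post
        self.window_seconds = window_seconds  # any 1 hour, per the post
        self._times = deque()                 # timestamps of recent corrective actions

    def record(self, now: float) -> bool:
        """Record a corrective action at time `now`; return True if a page is due."""
        self._times.append(now)
        # Forget corrective actions older than the window ending at `now`.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold


# With checks every 20 minutes, a second corrective action 1200 seconds after
# the first falls inside the hour and triggers a page.
tracker = CorrectiveActionTracker()
print(tracker.record(0.0))     # first corrective action
print(tracker.record(1200.0))  # second one, 20 minutes later
```

Under this rule the outage described above would have paged on the second 20-minute run instead of relying on the warning emails being read.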

5th January 2002, 09:35 AM   #7
Shelded
 Moderator 
 
Join Date: Aug 2001
Location: USA Northwest
Posts: 3,842
while the sheriff sleeps, where's the deputy?

First off, I'm a happy 'customer' and I've PAID for far worse service than this. Sorry to give you such a rude awakening this morning.

Weren't you going to have some users with admin access who could restart things when this was a problem? I went to the admin site to diagnose it, and if I'd been an admin it was wanting me to cut loose with the remedy.

At least a few key users should be allowed to email your pager and have it wake you.

5th January 2002, 10:49 AM   #8
Jeremy Howard
Ultimate Contributor
 
Join Date: Sep 2001
Location: Australia
Posts: 11,499
Yes, there are 7 users with exactly this (the capability to page me and Rob). But none of them paged me until I was already up (not surprising at this time of year). I really need to get a few more people on board.

Shelded--shoot me an email if you don't mind doing this and I'll send you an admin username and password.

5th January 2002, 10:59 PM   #9
Ann_jr
Senior Member
 
Join Date: Nov 2001
Location: CT, USA
Posts: 124
This gives me a good feeling

Quote:
Originally posted by Jeremy Howard
[...] Unfortunately, after we added the front-end a couple of weeks ago that compresses data, we forgot to include something to restart the front-end when corrective action is taken. [...]
I am always impressed when folks who know a lot are willing to say where things went wrong, and then explain what's being done to keep them from happening again. Thanks!

6th January 2002, 09:39 AM   #10
moquette
Junior Member
 
Join Date: Nov 2001
Posts: 10
Indeed! I second that! Very professional, Jeremy.

Glad to be an end user,
Anthony.

6th January 2002, 01:23 PM   #11
vorapoap
Junior Member
 
Join Date: Dec 2001
Posts: 3
Happy New Year..

I am your happy customer.

Keep up the good work.

Cheers!

All times are GMT +9.