EmailDiscussions.com  

Unread 20th June 2006, 03:18 PM   #1
robmueller
Intergalactic Postmaster
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098

Representative of:
Fastmail.FM
Status update

I thought I'd pull together some bits bron and I have posted to separate threads.

I've been away from the forum for the last week or two because I've been trying to concentrate on getting all our systems stable. It appears there have been a couple of issues that have caused problems.

1. cyrus sometimes goes into a busy wait loop

There's some good and bad news on this. I've spent the last week working really hard on trying to find the cause of the recent instability in cyrus. It's quite strange since things were pretty much fine for months, then suddenly they've all gone bad in the last month or so with no changes.

I've tracked down what I believe was the main cause: a bug in cyrus that made delivery processes crash on certain messages for certain users whose sieve scripts use body :matches and :regex tests. It seems that these crashes, at the right/wrong time, would cause the DB corruption that then sent cyrus spinning out of control.

The bad news is that there still seems to be some other case because one of the servers had the same problem happen again yesterday.

At least it seems to be greatly reduced (it was happening about once a day, but there's only been 1 event in the last week), but I'm still going to try and see if I can track down the edge cases. I posted on the cyrus mailing list, but didn't find any particular help on BDB, or other people experiencing the same issue.

http://lists.andrew.cmu.edu/piperma...une/022261.html

What this means I'm not sure. Either there's some particular combination of software that's causing the problem (though that seems unlikely given that we have two totally different install environments that show the problem), or it's some weird edge case that others just don't seem to trigger (we've seen that before as well, with linux kernel bugs that only seem to show with reiserfs + cyrus in moderate load situations).

2. database errors

We've had several separate problems with our database server. The first was that it was running out of connections. Now we do watch the connection limit on the DB, and have raised it over time as systems grow and more connections are required. Unfortunately it seems there's a system hard limit that was being hit.

http://bugs.mysql.com/bug.php?id=13335

With some tweaking of systems, separating out "read-only" connections, we've significantly reduced the number of required connections and been able to spread the load as well so these should be resolved.
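For readers wondering what separating out "read-only" connections can look like, here is a minimal Python sketch. It is an illustration only, not a description of the actual FastMail setup: the `ConnectionRouter` class, the server names, and the idea that read-only work goes to separate servers are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionRouter:
    """Pick a target server per statement (all names here are hypothetical)."""
    primary: str = "db-primary"
    replicas: list = field(default_factory=lambda: ["db-replica-1", "db-replica-2"])
    _next: int = 0

    def target_for(self, sql: str) -> str:
        # Only plain SELECTs count as read-only; everything else,
        # including SELECT ... FOR UPDATE, stays on the primary.
        statement = sql.lstrip().upper()
        read_only = statement.startswith("SELECT") and "FOR UPDATE" not in statement
        if not read_only or not self.replicas:
            return self.primary
        # Round-robin across the read-only servers to spread the load.
        target = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return target
```

The point of the split is that the write-capable server only has to hold connections for work that really needs it, so its connection count drops even though total traffic stays the same.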

3. New replicated servers

We have our new replicated servers set up and have started moving users onto them. We're keeping the process gradual partly so we don't overload IO to the machines during the day, but also as a precaution to make sure everything is working well at each stage. We've tested the machines with stress tests, but somehow production data and usage always seem to dig up bugs that nothing else can create.

My absolute priority at the moment is trying to get everything as stable as possible and setting up redundant systems for our database and imap servers. All outages are annoying, and it's my aim to avoid them as much as possible.

Rob
Unread 20th June 2006, 05:09 PM   #2
memac
Senior Member
 
Join Date: May 2002
Posts: 196
Thanks very much for the update Rob. Although I do trust you and the rest of the FM team to keep things running, it's still nice to get updates like this. No other service that I know of ever keeps its users updated with such detailed info on a regular, consistent basis. Being able to stay informed is one of the MANY reasons I will always be a diehard FM user. Y'all rock!!!
Unread 24th June 2006, 11:33 AM   #3
Aimlink
Master of the @
 
Join Date: Oct 2005
Location: Here and Now...
Posts: 1,076
Sounded a little scary, but thanks for the update.
Unread 26th June 2006, 04:35 PM   #4
robmueller
Intergalactic Postmaster
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098

Representative of:
Fastmail.FM
Posting information for people from another thread:

http://www.emaildiscussions.com/...789#post379789

I think our biggest long term problem has been capacity planning. This is partly our fault from being conservative with our spending, and partly not quite seeing the "elbow points" before they arrive. Part of the problem is that FastMail as a system is mostly IO bound rather than CPU bound. With IO, basically things can be pretty much going along fine, and then suddenly you'll outgrow the IO limits of your system. At that point, IO requests are coming in faster than your system can handle, and the outstanding request queue grows very fast and causes performance to drop very suddenly. We've seen cases where a machine has gone from performing fine one week, to 2-3 weeks later having high load due to IO constraint. That's really caught us by surprise.

It's also generally a lot easier to measure CPU usage than IO usage on systems. Most computers easily show you "% of CPU time not idle", and while "% of time in IO wait state" is generally also available, because of the elbow effect you can go from having a low % IO wait state to a high % IO wait state very quickly; it's a lot less linear than CPU usage.
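To put rough numbers on the elbow effect, here is a small Python sketch using the textbook M/M/1 queue, where the average number of outstanding requests is rho / (1 - rho) for a device that is busy a fraction rho of the time. The model is an assumption chosen for illustration; real disk arrays with caches and multiple spindles behave differently in detail, but the shape of the curve is the same.

```python
def mean_requests_in_system(utilisation: float) -> float:
    """Average number of requests queued or in service for an M/M/1 queue."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1) for a stable queue")
    return utilisation / (1 - utilisation)


# Print how the backlog grows as the device gets busier.
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    backlog = mean_requests_in_system(rho)
    print(f"{rho:.0%} busy -> about {backlog:.0f} requests outstanding")
```

Under this model, going from 50% to 80% busy only moves the backlog from about 1 request to about 4, while going from 90% to 99% moves it from about 9 to about 99. That is why a machine can look fine one week and be in trouble a few weeks later after only modest growth.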

Anyway, anticipating and being ready for these situations has been something we've not been very good at, which is why I'm happy with where things are going.

1. We have new cabinets with considerable spare space to add new machines + storage
2. We've probably over-spec'ed our new machines with regard to IO relative to the number of users on them (number of drive spindles, RAID controllers and array sizes, battery backed up write-back caches, software tweaks, RAM for read caching, etc). So far I'm happy with the loads that the new machines have been showing, even at peak times and with one of them already filled to 75% with moved users. That's probably about where we'll leave it to allow for space creep over time...
3. All power usage is now monitored, so we know we won't "trip" circuits as happened once before
4. We're setting up (or already using) redundant replicated servers for frontend proxying, IMAP servers, database servers, web servers, mx servers, etc

On another note, I'm now almost certain that the cyrus/BDB problems are related to "strange" emails that either crash the delivery process or cause it to time out in some way. The good news is that I've now found all the crash cases from what I can see (there are no more core dump files being generated), and the couple of timeout cases I've been able to work around in our proxy delivery system (mostly cases with extremely long lines containing just CRs instead of CRLF pairs as the RFCs require).
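As an illustration of the kind of message that causes trouble, here is a minimal Python sketch that rewrites bare CR or LF characters as CRLF pairs and flags lines over the 998-character limit from RFC 2822. It is not the workaround used in the proxy delivery system, whose details are not given here; the function names are made up for the example.

```python
import re

MAX_LINE_OCTETS = 998  # RFC 2822 limit per line, excluding the CRLF

# A CR not followed by LF, or an LF not preceded by CR.
_BARE_NEWLINE = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def normalise_line_endings(message: bytes) -> bytes:
    """Rewrite bare CR or bare LF characters as CRLF pairs."""
    return _BARE_NEWLINE.sub(b"\r\n", message)


def overlong_lines(message: bytes) -> list[int]:
    """Return 1-based numbers of lines longer than the RFC limit."""
    lines = normalise_line_endings(message).split(b"\r\n")
    return [n for n, line in enumerate(lines, start=1) if len(line) > MAX_LINE_OCTETS]
```

A message whose "lines" are separated only by CRs looks like one enormous line to software that splits on CRLF, which is one plausible way for a delivery process to stall on it.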

Rob
Unread 26th June 2006, 06:07 PM   #5
sjk
Master of the @
 
Join Date: May 2002
Location: Hawaii
Posts: 1,975
Thanks for this status update summary and referring to it on the blog.
Unread 26th June 2006, 08:13 PM   #6
janusz
The "e" in e-mail
 
Join Date: Feb 2006
Location: EU
Posts: 3,050
I'd also like to thank you for the status update.
While I'm unable to suggest anything useful on how to make FM better (whatever this may mean), I'm glad that the management takes time to describe existing problems or past mistakes, and lists plans for improvements. Let's hope your current ideas will lead to a more stable service. Good luck!
Unread 6th July 2006, 03:11 PM   #7
davidmaxwaterma
Essential Contributor
 
Join Date: Apr 2004
Location: BeiJing, PRC
Posts: 276
RAID battery backed memory

Hi,

I had a question concerning a comment in the status blog.

You talk about one of the RAID controllers having battery backed up memory (used as a cache).

I've seen similar RAID controllers.

How come you just didn't move the memory from the 'broken' controller to the new one? The battery is usually part of the memory, so it moves with it....

Max.
Unread 7th July 2006, 11:53 AM   #8
adamc00
Junior Member
 
Join Date: Jul 2006
Posts: 2
Re: RAID battery backed memory

Quote:
Originally posted by davidmaxwaterma
Hi,

How come you just didn't move the memory from the 'broken' controller to the new one? The battery is usually part of the memory, so it moves with it....

Max.
On all of the controllers I am familiar with, the battery is connected to the controller, not directly to the RAM. If you remove the memory from the card, you remove it from its power source: goodbye data.

Cheers
--
adamc00
Unread 7th July 2006, 12:31 PM   #9
davidmaxwaterma
Essential Contributor
 
Join Date: Apr 2004
Location: BeiJing, PRC
Posts: 276
Re: Re: RAID battery backed memory

Quote:
Originally posted by adamc00
On all of the controllers I am familiar with, the battery is connected to the controller, not directly to the RAM. If you remove the memory from the card, you remove it from its power source: goodbye data.

Cheers
--
adamc00
A quick search reveals the 'Transportable battery backed DIMM module' :

(pdf warning!)
<http://www.lsilogic.com/files/docs/t...4X_FBP_qig.pdf>

As you can tell, I've only dealt with LSILogic controllers (on Dell servers, actually). Perhaps this is unique to them.

I wonder if the modules will work on other cards?

Max.
Unread 7th July 2006, 01:12 PM   #10
adamc00
Junior Member
 
Join Date: Jul 2006
Posts: 2
Re: Re: Re: RAID battery backed memory

Quote:
Originally posted by davidmaxwaterma

I wonder if the modules will work on other cards?

Max.
That looks like a great solution for this class of failure. I doubt it would work on any old controller however, only controllers designed specifically to accept this kind of module.

Quote:
From a LSI press release
The MegaRAID SCSI 320-2X ships with the option to field install a battery backup unit for additional fault tolerant data protection. The MegaRAID Battery Backup Unit (part number LSIBBU03) is a mezzanine card that mounts on top of the MegaRAID adapter. This module monitors the voltage level of the DRAM and supporting circuitry installed on the MegaRAID card. If the voltage drops below a predefined level, the Battery Backup Unit switches the memory power source from the MegaRAID card to the battery pack. If the voltage level returns to an acceptable level, the Battery Backup Unit switches the power source back to the MegaRAID adapter card.

A transportable battery backup unit (TBBU) will also be available as a field upgrade option. The TBBU (part number LSITBBU02) adds additional fault tolerance by protecting data in cache memory even in the event of an adapter failure. The TBBU includes battery circuitry integrated onto the DIMM module, which can be removed and mounted onto a replacement RAID controller to prevent data corruption when the system is brought back online.
So it appears that this is an optional extra for at least some LSI cards. The standard battery backup (the first one described above) operates in the manner I am familiar with. The TBBU is what you are familiar with.

I am making it a priority to get familiar with the TBBU

--
adamc00
Unread 7th July 2006, 01:45 PM   #11
sflorack
The "e" in e-mail
 
Join Date: Feb 2002
Posts: 2,644
When will we start seeing compensation for the days we've lost due to FastMail outages? I've sustained the two big ones in the past month, and FastMail has done nothing to offer reconciliation.
Unread 7th July 2006, 01:49 PM   #12
davidmaxwaterma
Essential Contributor
 
Join Date: Apr 2004
Location: BeiJing, PRC
Posts: 276
Re: Re: Re: Re: RAID battery backed memory

Quote:
Originally posted by adamc00
That looks like a great solution for this class of failure. I doubt it would work on any old controller however, only controllers designed specifically to accept this kind of module.

So it appears that this is an optional extra for at least some LSI cards. The standard battery backup (the first one described above) operates in the manner I am familiar with. The TBBU is what you are familiar with.

I am making it a priority to get familiar with the TBBU

--
adamc00
How can we make sure the FastMail guys know about this solution, so they can at least consider it for their servers?

Max.
Unread 9th July 2006, 05:54 AM   #13
DrStrabismus
The "e" in e-mail
 
Join Date: May 2002
Posts: 2,713
Quote:
Originally posted by robmueller
I think our biggest long term problem has been capacity planning. This is partly our fault from being conservative with our spending, and partly not quite seeing the "elbow points" before they arrive. Part of the problem is that FastMail as a system is mostly IO bound rather than CPU bound. With IO, basically things can be pretty much going along fine, and then suddenly you'll outgrow the IO limits of your system.
Then create a script that simulates an n% increase in system load, and run it.
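One way to read this suggestion, sketched in Python: take the timestamps of recorded requests and compress the gaps between them so that the same work arrives n% faster. The `scale_arrivals` helper and the idea of replaying a recorded trace are assumptions made for illustration; actually issuing the requests against a test machine is left out.

```python
def scale_arrivals(timestamps: list[float], percent: float) -> list[float]:
    """Compress the gaps between recorded requests to simulate percent% more load."""
    if percent <= -100:
        raise ValueError("percent must be greater than -100")
    if not timestamps:
        return []
    # A 25% increase means the same requests arrive 1.25 times as fast.
    factor = 1 + percent / 100
    start = timestamps[0]
    return [start + (t - start) / factor for t in timestamps]
```

For example, `scale_arrivals([0.0, 10.0, 20.0], 100)` squeezes twenty seconds of recorded traffic into ten, doubling the request rate.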
Unread 10th July 2006, 01:55 PM   #14
NJSS
Master of the @
 
Join Date: Jul 2002
Location: Hampshire, UK
Posts: 1,054
sflorack wrote:-

Quote:
When will we start seeing compensation for the days we've lost due to FastMail outages? I've sustained the two big ones in the past month, and FastMail has done nothing to offer reconciliation.
I would prefer no compensation, and that all available funds were spent on making FastMail as robust as possible.
Unread 10th July 2006, 04:12 PM   #15
janusz
The "e" in e-mail
 
Join Date: Feb 2006
Location: EU
Posts: 3,050
That's a rather unusual suggestion... would you say something similar in every case of your supplier's failing to deliver?

Disclaimer: I do not use FM for business (at least not for any serious business). While I do get annoyed by FM's downtimes, the annoyance factor has not yet reached the critical level. So for the time being I'm not going to sue FM and demand my money back

All times are GMT +9.

 

Copyright EmailDiscussions.com 1998-2013. All Rights Reserved