| FastMail.FM General Discussions: Everything that does not belong in the help or feature requests forums goes here. This includes discussion about FastMail.FM policies, development (such as stylesheet development), FastMail.FM support sites like the Wiki, and so forth. |
#1 |
|
Intergalactic Postmaster
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098
Representative of:
Fastmail.FM |
Status update
I thought I'd pull together some bits bron and I have posted to separate threads.
I've been away from the forum for the last week or two because I've been concentrating on getting all our systems stable. It appears a couple of issues have been causing problems.

1. cyrus sometimes goes into a busy wait loop

There's some good and bad news on this. I've spent the last week working really hard on finding the cause of the recent instability in cyrus. It's quite strange, since things were pretty much fine for months, then suddenly went bad in the last month or so with no changes. I've tracked down what I believe was the main cause: a bug in cyrus that made delivery processes crash on certain messages for certain users using sieve body tests with :matches and :regex. It seems that these crashes, at the right (or wrong) time, would cause the DB corruption that then sent cyrus spinning out of control.

The bad news is that there still seems to be some other case, because one of the servers had the same problem again yesterday. At least it seems to be greatly reduced (it was happening about once a day, but there's been only 1 event in the last week), but I'm still going to try to track down the edge cases. I posted on the cyrus mailing list, but didn't find any particular help on BDB or other people experiencing the same issue. http://lists.andrew.cmu.edu/piperma...une/022261.html

What this means I'm not sure. Either there's some particular combination of software that's causing the problem (though that seems unlikely, given that we have two totally different install environments that show the problem), or it's some weird edge case that others just don't seem to trigger (we've seen that before as well, with linux kernel bugs that only seem to show with reiserfs + cyrus in moderate load situations).

2. database errors

We've had several separate problems with our database server. The first was that it was running out of connections. Now we do watch the connection limit on the DB, and have raised it over time as systems grow and more connections are required. Unfortunately it seems there's a system hard limit that was being hit. http://bugs.mysql.com/bug.php?id=13335 With some tweaking of systems and separating out "read-only" connections, we've significantly reduced the number of required connections and been able to spread the load as well, so these should be resolved.

3. New replicated servers

We have our new replicated servers set up and have started moving users onto them. We're keeping the process gradual, partly so we don't overload IO to the machines during the day, but also as a precaution to make sure everything is working well at each stage. We've tested the machines with stress tests, but somehow production data and usage always seem to dig up bugs that nothing else can create.

My absolute priority at the moment is getting everything as stable as possible and setting up redundant systems for our database and imap servers. All outages are annoying, and it's my aim to avoid them as much as possible.

Rob |
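A note on item 2 above: separating out "read-only" connections can be pictured roughly as below. This is only a minimal sketch, not FastMail's actual setup. The `Pool` class, `is_read_only`, `pick_pool`, the keyword list and the connection caps are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical rule: statements starting with one of these keywords are
# treated as read-only. Deliberately crude (e.g. "SELECT ... FOR UPDATE"
# would be misclassified).
READ_ONLY_KEYWORDS = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN")


@dataclass
class Pool:
    """Stand-in for a bounded pool of database connections."""
    name: str
    max_connections: int
    in_use: int = 0

    def acquire(self) -> str:
        # Refuse locally instead of pushing the server past its limit.
        if self.in_use >= self.max_connections:
            raise RuntimeError(f"{self.name} pool exhausted")
        self.in_use += 1
        return self.name

    def release(self) -> None:
        if self.in_use > 0:
            self.in_use -= 1


def is_read_only(sql: str) -> bool:
    """Classify a statement by its first keyword."""
    words = sql.lstrip().split(None, 1)
    return bool(words) and words[0].upper() in READ_ONLY_KEYWORDS


def pick_pool(sql: str, read_pool: Pool, write_pool: Pool) -> Pool:
    """Send read-only statements to the read pool, everything else to the write pool."""
    return read_pool if is_read_only(sql) else write_pool


if __name__ == "__main__":
    reads = Pool("read-only", max_connections=50)    # assumed cap
    writes = Pool("read-write", max_connections=20)  # assumed cap
    for stmt in ("SELECT * FROM users", "UPDATE users SET quota = 1"):
        print(stmt, "->", pick_pool(stmt, reads, writes).acquire())
```

The point of the split is that each pool has its own cap, so read-heavy traffic can be pointed at a different server without competing with writers for the same connection slots.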
|
|
|
|
|
#2 |
|
Senior Member
Join Date: May 2002
Posts: 196
|
Thanks very much for the update Rob. Although I do trust you and the rest of the FM team to keep things running, it's still nice to get updates like this. No other service that I know of ever keeps its users updated with such detailed info on a regular, consistent basis. Being able to stay informed is one of the MANY reasons I will always be a diehard FM user. Y'all rock!!!
|
|
|
|
|
#3 |
|
Master of the @
Join Date: Oct 2005
Location: Here and Now...
Posts: 1,076
|
Sounded a little scary, but thanks for the update.
|
|
|
|
|
|
#4 |
|
Intergalactic Postmaster
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098
Representative of:
Fastmail.FM |
Posting information for people from another thread:
http://www.emaildiscussions.com/...789#post379789

I think our biggest long term problem has been capacity planning. This is partly our fault from being conservative with our spending, and partly not quite seeing the "elbow points" before they arrive.

Part of the problem is that FastMail as a system is mostly IO bound rather than CPU bound. With IO, things can be going along pretty much fine, and then suddenly you outgrow the IO limits of your system. At that point, IO requests are coming in faster than your system can handle, the outstanding request queue grows very fast, and performance drops very suddenly. We've seen cases where a machine has gone from performing fine one week to having high load due to IO constraints 2-3 weeks later. That's really caught us by surprise.

It's also generally a lot easier to measure CPU usage than IO usage on systems. Most computers easily show you "% of CPU time not idle", and while "% of time in IO wait state" is generally also available, because of the elbow effect you can go from a low % IO wait state to a high % IO wait state very quickly. It's a lot less linear than CPU usage.

Anyway, anticipating and being ready for these situations is something we've not been very good at, which is why I'm happy with where things are going.

1. We have new cabinets with considerable spare space to add new machines + storage
2. We've probably over spec'ed our new machines with regard to IO relative to the number of users on them (number of drive spindles, RAID controllers and array sizes, battery backed up write-back caches, software tweaks, RAM for read caching, etc). So far I'm happy with the loads that the new machines have been showing, even at peak times, with one of them already 75% full after moving users onto it. That's probably about where we'll leave it to allow for space creep over time...
3. All power usage is now monitored, so we know we won't "trip" circuits as happened once before
4. We're setting up (or already using) redundant replicated servers for frontend proxying, IMAP servers, database servers, web servers, mx servers, etc

On another note, I'm now almost certain that the cyrus/BDB problems are related to "strange" emails that either crash the delivery process or cause it to time out in some way. The good news is that I've now found all the crash cases from what I can see (there are no more core dump files being generated), and the couple of timeout cases I've been able to work around in our proxy delivery system (mostly cases with extremely long lines containing just CR's instead of CRLF pairs as the RFC requires).

Rob |
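The "elbow" described above can be illustrated with a textbook queueing model. The sketch below assumes a single M/M/1 queue (random arrivals, one server handling requests one at a time), which is a simplification and not a measurement of any FastMail machine. The function names and the 100 requests/second service rate are hypothetical.

```python
def mean_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time (seconds) a request spends waiting plus being served
    in an M/M/1 queue: 1 / (service_rate - arrival_rate)."""
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be non-negative, service_rate positive")
    if arrival_rate >= service_rate:
        # Requests arrive at least as fast as they can be served:
        # the queue grows without bound.
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


def response_curve(service_rate: float, utilisations: list[float]) -> list[tuple[float, float]]:
    """Pair each utilisation (arrival_rate / service_rate) with its mean response time."""
    return [(u, mean_response_time(u * service_rate, service_rate)) for u in utilisations]


if __name__ == "__main__":
    # Assumed: a disk subsystem that can serve 100 IO requests per second.
    for u, seconds in response_curve(100.0, [0.5, 0.8, 0.9, 0.95, 0.99]):
        print(f"{u:.0%} busy: about {seconds * 1000:.0f} ms per request")
```

With these assumed numbers, the model gives roughly 20 ms per request at 50% utilisation, 100 ms at 90% and about 1 second at 99%. Going from half-loaded to 90% loaded costs 80 ms, while the last 9% costs another 900 ms, which is the kind of sudden drop that makes IO limits hard to see coming.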
|
|
|
|
|
#5 |
|
Master of the @
Join Date: May 2002
Location: Hawaii
Posts: 1,975
|
Thanks for this status update summary and referring to it on the blog.
|
|
|
|
|
|
#6 |
|
The "e" in e-mail
Join Date: Feb 2006
Location: EU
Posts: 3,050
|
I'd also like to thank you for the status update.
While I'm unable to suggest anything useful on how to make FM better (whatever this may mean), I'm glad that the management takes time to describe existing problems or past mistakes, and lists plans for improvements. Let's hope your current ideas will lead to a more stable service. Good luck! |
|
|
|
|
|
#7 |
|
Essential Contributor
Join Date: Apr 2004
Location: BeiJing, PRC
Posts: 276
|
RAID battery backed memory
Hi,
I had a question concerning a comment in the status blog. You talk about one of the RAID controllers having battery backed up memory (used as a cache). I've seen similar RAID controllers. How come you didn't just move the memory from the 'broken' controller to the new one? The battery is usually part of the memory, so it moves with it... Max. |
|
|
|
|
|
#8 | |
|
Junior Member
Join Date: Jul 2006
Posts: 2
|
Re: RAID battery backed memory
Quote:
Cheers -- adamc00 |
|
|
|
|
|
|
#9 | |
|
Essential Contributor
Join Date: Apr 2004
Location: BeiJing, PRC
Posts: 276
|
Re: Re: RAID battery backed memory
Quote:
(pdf warning!) <http://www.lsilogic.com/files/docs/t...4X_FBP_qig.pdf> As you can tell, I've only dealt with LSILogic controllers (on Dell servers, actually). Perhaps this is unique to them. I wonder if the modules will work on other cards? Max. |
|
|
|
|
|
|
#10 | ||
|
Junior Member
Join Date: Jul 2006
Posts: 2
|
Re: Re: Re: RAID battery backed memory
Quote:
Quote:
I am making it a priority to get familiar with the TBBU -- adamc00 |
||
|
|
|
|
|
#11 |
|
The "e" in e-mail
Join Date: Feb 2002
Posts: 2,644
|
When will we start seeing compensation for the days we've lost due to FastMail outages? I've sustained the two big ones in the past month, and FastMail has done nothing to offer reconciliation.
|
|
|
|
|
|
#12 | |
|
Essential Contributor
Join Date: Apr 2004
Location: BeiJing, PRC
Posts: 276
|
Re: Re: Re: Re: RAID battery backed memory
Quote:
Max. |
|
|
|
|
|
|
#13 | |
|
The "e" in e-mail
Join Date: May 2002
Posts: 2,713
|
Quote:
|
|
|
|
|
|
|
#14 | |
|
Master of the @
Join Date: Jul 2002
Location: Hampshire, UK
Posts: 1,054
|
sflorack wrote:-
Quote:
|
|
|
|
|
|
|
#15 |
|
The "e" in e-mail
Join Date: Feb 2006
Location: EU
Posts: 3,050
|
That's a rather unusual suggestion... would you say something similar in every case of your supplier's failing to deliver?
Disclaimer: I do not use FM for business (at least not for any serious business). While I do get annoyed by FM's downtimes, the annoyance factor has not yet reached the critical level. So for the time being I'm not going to sue FM and demand my money back |
|
|
|