| Fastmail.FM Help and Current Issues This forum is for users to help each other to solve any problems they are having using FastMail.FM. It's also the place to discuss problems such as outages, slowness or other similar issues. |
#1 |
|
Intergalactic Postmaster
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098
Representative of:
Fastmail.FM |
Why replication took time to set up
I posted this already here:
http://www.emaildiscussions.com/...584#post397584

But it was buried deep in a thread, so maybe people didn't read it. It actually contains interesting information, so I thought I should re-post it as a top-level thread and make some edits.

---

The initial issue that made us realise we had to implement some form of replication occurred in November 2005 (http://blog.fastmail.fm/?p=521), when corruption on one of our major volumes caused 3 days of downtime. After that, we started working on how we were going to get replication set up.

On the whole, the process went slower than expected. I'd put this down to a few things:

1. The cyrus replication code wasn't really production ready.

We knew this when we started, and thought about our options, which really were:

a) Use cyrus replication and help bring it up to production readiness
b) Use some other replication method (eg DRBD - http://www.drbd.org/)

We decided to go with cyrus replication because with block-level replication, you're still not protected from kernel filesystem bugs. If the kernel screws up and writes garbage to a filesystem, both the master and replica are corrupted. Protection against filesystem corruption was one of our major goals with replication.

This wasn't really that crazy because we knew the main replication code itself came from David Carter at Cambridge (http://www-uxsup.csx.cam.ac.uk/~dpc22/), so the original code was used in a university environment. The problems were really to do with integrating those changes into the main cyrus branch and accommodating other new cyrus 2.3 features, so we thought it wouldn't be that much work.

Unfortunately it seemed that not many people were actually using cyrus 2.3 replication, so ironing out the bugs took longer than expected. Additional problems included CMU adding largish new features (modsequence support) to cyrus within the 2.3 branch itself that totally broke replication.
Still, we spent quite a bit of time setting up small test environments for replication and ironing out the bugs along with a few others.

Unfortunately, even after rolling out, there were still other bugs present, and the CMU change that broke replication was damn annoying since it wasn't immediately obvious and caused some downtime when we had to switch to the replica (basically replication appeared to work fine, but it turned out that when you actually tried to fetch a message from the replica, it was empty). After that disaster we implemented some code that allows us to do a replication "test" on users, to see that what the master IMAP server presents to the world is exactly the same as what the replica IMAP server presents to the world. We now run that on a regular basis.

A few example postings to the cyrus mailing list with some details:

http://lists.andrew.cmu.edu/pipermai...st/023331.html
http://lists.andrew.cmu.edu/pipermai...ly/022595.html
http://lists.andrew.cmu.edu/pipermai...st/023336.html
http://lists.andrew.cmu.edu/pipermai...ay/021919.html

2. Our original replication setup was flawed

There are a number of ways to do replication. The most obvious is to have one machine as the master and a separate one as the replica. That's a waste, however, because the replica doesn't take as many resources as the master (one writer, no readers). So our plan was to have replica pairs, with half masters on one machine replicating to half replicas on the other, and vice-versa. This would provide better performance in the general case when both machines were up.

The problem with this is that it turned out to be a bit inflexible, and when one machine goes down, the "master" load on the other machine doubles. It also means the second machine then becomes a single point of failure until the other machine is restored. Neither of these is nice.

After a bit of rethinking, we came up with the new slots + stores architecture (see bron's posts elsewhere).
Basically everything is now broken into 300G "slots", and a pair of these slots on 2 different machines makes a replicated "store". The nice thing about this approach is that:

a) Each machine runs multiple cyrus instances. Each instance is smaller, can be stopped & started independently, can be moved more easily, restored more quickly from backup if needed, volume checked more quickly, etc. Smaller units are just easier to deal with
b) By spreading out each store pair to different machines, when one machine dies, the load is spread out to all the other servers evenly
c) Even after one machine dies, a second machine dying would only affect maybe one or two slots, rather than a whole machine's worth of users

The downside to this solution is management. There are now many, many slots/stores to deal with, which means we had to write management tools.

Had we gone with this from the start, it would have saved time. On the other hand, it was only really clear that this was a better solution after we went down the first road and saw the effects. Hindsight is a wonderful thing.

3. The original servers we bought proved to be less reliable than expected.

Because we knew we had replication, and because we knew we had a very specific setup we wanted (2U server, 12 drives, 8 x high capacity SATA, 4 x high speed SATA, RAID controller with battery backup, etc) that IBM couldn't deliver, we went with a third party supplier. (http://blog.fastmail.fm/?p=524)

Suffice to say, this was a mistake. There is a big difference between hardware that runs stable for years vs hardware that runs stable for months. Replication should be more of a "disaster recovery" scenario, or a "controlled failover" scenario; it shouldn't replace very reliable hardware.

We went back to equipment we trusted (IBM servers + external SATA-to-SCSI storage units). It's a pity IBM are now 2.5 months late on delivering the servers they promised us. Trust me, we've already complained to them pretty severely about this.
It's lucky we were able to re-purpose some existing servers for new replicated roles.

---

So all up, how would I summarise?

Had we followed the "perfect" path straight up, things would have gotten to the fully replicated stage faster, though not enormously so. The debugging and software stage still took quite some time; it was more the hardware that slowed us down.

On the other hand, the "perfect" path is often only visible with the benefit of hindsight. Additionally, by following some dud paths now, you learn not to take them again in the future.

Additional: I've mentioned this in other posts now, but I should re-iterate that 85% of users were on replicated stores when this failure occurred. As Bron has mentioned, had it happened 1-2 weeks later, no-one would have noticed because that machine would have been out of service. This is actually part of the reason that as soon as the restore was done, we could say "everyone was replicated".

So it's not like 11 months had passed and nothing had happened.

1. We'd chosen, tested and helped debug a replication system
2. We'd built 2 actual replication setups, scrapping the first after we realised there was a better arrangement
3. We'd bought and organised 2 sets of extra hardware
4. We'd already moved 85% of our user base to completely new servers

Rob

Last edited by robmueller : 6th November 2006 at 05:34 AM. |
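For readers who want to see why spreading store pairs across machines helps, here is a minimal Python sketch of points (b) and (c) from the post above. Everything in it is an assumption made for illustration: the names `Store`, `paired_layout`, `spread_layout`, `extra_master_load` and `stores_lost`, and the round-robin placement rule, are hypothetical and are not taken from FastMail's actual management tools.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """A replicated store: a master slot and a replica slot on two machines."""
    name: str
    master: int   # index of the machine holding the master slot
    replica: int  # index of the machine holding the replica slot


def paired_layout(machines: int, masters_per_machine: int) -> list[Store]:
    """Original design: machines work in fixed pairs (0<->1, 2<->3, ...)."""
    if machines % 2:
        raise ValueError("paired layout needs an even number of machines")
    return [
        Store(f"store-{m}-{j}", m, m ^ 1)  # m ^ 1 is the fixed partner
        for m in range(machines)
        for j in range(masters_per_machine)
    ]


def spread_layout(machines: int, masters_per_machine: int) -> list[Store]:
    """Slots + stores idea: rotate each machine's replicas over all others.

    The round-robin rule is an assumed placement policy for this sketch.
    """
    if machines < 2:
        raise ValueError("replication needs at least two machines")
    return [
        # offset is in 1..machines-1, so a replica never shares its master's machine
        Store(f"store-{m}-{j}", m, (m + 1 + j % (machines - 1)) % machines)
        for m in range(machines)
        for j in range(masters_per_machine)
    ]


def extra_master_load(stores: list[Store], failed: int) -> Counter:
    """Masters each surviving machine takes over when machine `failed` dies."""
    return Counter(s.replica for s in stores if s.master == failed)


def stores_lost(stores: list[Store], first: int, second: int) -> int:
    """Stores with no surviving copy once both `first` and `second` are down."""
    down = {first, second}
    return sum(1 for s in stores if s.master in down and s.replica in down)


if __name__ == "__main__":
    print("paired:", dict(extra_master_load(paired_layout(4, 8), failed=0)))
    print("spread:", dict(extra_master_load(spread_layout(5, 8), failed=0)))
```

With 8 master slots per machine, the paired layout hands all 8 masters of a dead machine to its single partner, while the spread layout over 5 machines gives each of the 4 survivors 2 extra masters. A second failure in the paired layout can take out every store the pair held, whereas in the spread layout only the few stores shared by those two machines lose both copies.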
|
|
|
|
#2 |
|
Moderator
Join Date: Dec 2002
Location: USA
Posts: 8,686
|
Rob, thank you so much for providing one place to link to for this information at the top of its own thread. Those who thought you did nothing after saying you would following the last outage can now see that you continually worked on getting all users replicated and had most already moved onto the new system.
Sherry |
|
|
|
|
#3 |
|
Master of the @
Join Date: Jul 2002
Location: A.U
Posts: 1,980
|
Yes Rob, thank you from all of us. We now have a lovely e-mail system. I just love the Fastmail web interface.
|
|
|
|
|
#4 |
|
Junior Member
Join Date: Jul 2002
Posts: 3
|
Is the replication the reason...
... I am not getting any inbound emails?
SMTP is fine. WWW interface is fine. I'm just not getting any inbound emails today. |
|
|
|
|
#5 |
|
Intergalactic Postmaster
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098
Representative of:
Fastmail.FM |
There's no problem I can see. Emails are coming in fine.
If you email me your account details and which emails you were expecting, I can check the logs. Rob |
|
|
|
|
#6 |
|
Junior Member
Join Date: Feb 2004
Posts: 21
|
Rob, thanks for the explanation. Your team's done enough to ensure I'm keeping my Fastmail account.
My remaining principal complaint is the lack of updates on the status blog. When your service goes down, your users rely on whatever tools you provide them to get an idea of the severity of the problem and an estimated time for service recovery. At two points, the blog went a significant time without an update. There might not have been anything new to report, I understand. But a "we're still working on it" with a timestamp would ease user worries and complaints. Thanks. |
|
|
|
|
#7 | |
|
Moderator
Join Date: Dec 2002
Location: USA
Posts: 8,686
|
Quote:
Communication During Server Outages. Sherry |
|
|
|
|
|
#8 |
|
Junior Member
Join Date: Aug 2004
Posts: 14
|
Concerns about signal points of failure in new setup
I am concerned about single points of failure in the new setup.
I am concerned that the networking between the replicated servers is a single point of failure. I recommend using redundant networking. I would also recommend that e-mails that have been delivered be placed in a delivered messages queue, so that if delivered e-mail is lost it can be redelivered if necessary. |
|
|
|
|
#9 | |
|
The "e" in e-mail
Join Date: Oct 2002
Location: Holon, Israel.
Posts: 4,484
|
Re: Concerns about single points of failure in new setup
Quote:
Last edited by hadaso : 9th November 2006 at 03:30 PM. |
|
|
|
|
|
#10 |
|
Junior Member
Join Date: Aug 2004
Posts: 14
|
Sorry, I was not clear.
I was suggesting that e-mail sent to Fastmail.fm users would be put into a queue after it has been put in the user's mailbox. That way, if any of the messages in the mailbox were lost due to a server failure, they could be redelivered. |
|
|
|
|
#11 |
|
Intergalactic Postmaster
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098
Representative of:
Fastmail.FM |
This is something we have thought about.
Delivered email goes through an internal "proxy" we've written, so it could maintain a queue of the emails delivered in, say, the last 5 minutes, so that we can "replay" those messages if needed. On the other hand, this probably shouldn't be required, because replication will keep everything up to date anyway. Rob |
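To make the "replay" idea concrete, here is a minimal Python sketch of a time-bounded queue of recently delivered messages. It is only an illustration under stated assumptions: the class name `ReplayBuffer`, its methods, and the `deliver` callback are hypothetical, and nothing here describes how FastMail's internal proxy is actually written.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class ReplayBuffer:
    """Remembers messages delivered within the last `window_seconds`."""

    def __init__(
        self,
        window_seconds: float = 300.0,  # "the last 5 minutes, say"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._clock = clock  # injectable so the window can be tested
        self._items: Deque[Tuple[float, str, bytes]] = deque()

    def _expire(self, now: float) -> None:
        # Entries are appended in time order, so the oldest is on the left.
        while self._items and now - self._items[0][0] > self._window:
            self._items.popleft()

    def record(self, mailbox: str, message: bytes) -> None:
        """Call after a message has been delivered to `mailbox`."""
        now = self._clock()
        self._items.append((now, mailbox, message))
        self._expire(now)

    def replay(self, deliver: Callable[[str, bytes], None]) -> int:
        """Re-deliver everything still inside the window; return the count."""
        self._expire(self._clock())
        pending = list(self._items)  # copy, in case deliver() records again
        for _, mailbox, message in pending:
            deliver(mailbox, message)
        return len(pending)
```

A sketch like this keeps messages in memory only, so it would not survive a crash of the proxy itself, and replaying can deliver duplicates of messages that did reach the mailbox. A real implementation would need to deal with both.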
|
|
|
|
#12 | |
|
= Permanently banned =
Join Date: Nov 2005
Posts: 128
|
Quote:
Thanks. I needed that. |
|
|
|
|
|
#13 |
|
Moderator
Join Date: Aug 2001
Location: USA Northwest
Posts: 3,842
|
MOST IMPORTANT: POLITENESS
* Keep the tone and content of your posts respectful and polite (even if you're angry about something). |
|
|
|
|
#14 |
|
Administrator
Join Date: Aug 2001
Location: UK
Posts: 3,083
|
kastaway, this is your only warning from me. One more problem with attitude or anything else that is in clear breach of the Rules, and you will be banned from these forums.
This is not up for debate. |
|
|
|
|
#15 |
|
= Permanently banned =
Join Date: Oct 2006
Posts: 10
|
Mod: Comment removed
Last edited by JeffK : 17th November 2006 at 01:47 PM. |
|
|