EmailDiscussions.com  


Fastmail.FM Help and Current Issues This forum is for users to help each other to solve any problems they are having using FastMail.FM. It's also the place to discuss problems such as outages, slowness or other similar issues.

5th November 2006, 06:47 PM   #1
robmueller
Intergalactic Postmaster
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098

Representative of:
Fastmail.FM
Why replication took time to setup

I posted this already here:
http://www.emaildiscussions.com/...584#post397584

But it was buried in a deep thread, so maybe people didn't read it. It actually contains interesting information, so I thought I should re-post as a top level thread and make some edits.

---

The initial issue that made us realise we had to implement some form of replication occurred in November 2005 (http://blog.fastmail.fm/?p=521), when corruption on one of our major volumes caused 3 days of downtime. After that, we started working on how we were going to get replication set up. On the whole, the process went slower than expected. I'd put this down to a few things:

1. The cyrus replication code wasn't really production ready.

We knew this when we started, and thought about our options which really were:
a) Use cyrus replication and help bring it up to production readiness
b) Use some other replication method (eg DRBD - http://www.drbd.org/)

We decided to go with cyrus replication because with block level replication, you're still not protected from kernel filesystem bugs. If the kernel screws up and writes garbage to a filesystem, both the master and replica are corrupted. Protection against filesystem corruption was one of our major goals with replication.

This wasn't really that crazy because we knew the main replication code itself came from David Carter at Cambridge (http://www-uxsup.csx.cam.ac.uk/~dpc22/), so the original code was used in a university environment. The problems were really to do with integrating those changes into the main cyrus branch and accommodating other new cyrus 2.3 features, so we thought it wouldn't be that much work.

Unfortunately it seemed that not that many people were actually using cyrus 2.3 replication, so ironing out the bugs took longer than expected. Additional problems included CMU adding largish new features (modsequence support) to cyrus within the 2.3 branch itself that totally broke replication.

Still, we spent quite a bit of time setting up small test environments for replication and, along with a few others, ironing out the bugs. Unfortunately, even after rolling out, there were still other bugs present. The CMU change that broke replication was damn annoying: it wasn't immediately obvious, and it caused some downtime when we had to switch to the replica (basically replication appeared to work fine, but when you actually tried to fetch a message from the replica, it was empty). After that disaster we implemented some code that lets us run a replication "test" on users, to check that what the master IMAP server presents to the world is exactly the same as what the replica IMAP server presents to the world. We now run that on a regular basis.
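To illustrate the idea of that replication "test" (this is only a minimal sketch, not the actual FastMail tool, which isn't described here): assume you have already fetched, over IMAP, a snapshot of each server's view of one user as a dictionary of folder name to UID to raw message bytes. The snapshot format and the function name `compare_views` are hypothetical. Comparing content digests rather than just message counts is what would catch the "message exists but is empty" failure.

```python
import hashlib


def compare_views(master, replica):
    """Compare two per-user snapshots of the form {folder: {uid: raw_bytes}}.

    Returns a list of human-readable discrepancies (empty if the views match).
    The snapshot format is an assumption made for this sketch.
    """
    problems = []
    for folder in sorted(set(master) | set(replica)):
        m = master.get(folder)
        r = replica.get(folder)
        if m is None or r is None:
            side = "master" if m is None else "replica"
            problems.append(f"{folder}: folder missing on {side}")
            continue
        for uid in sorted(set(m) | set(r)):
            if uid not in m:
                problems.append(f"{folder}: message {uid} missing on master")
            elif uid not in r:
                problems.append(f"{folder}: message {uid} missing on replica")
            elif hashlib.sha1(m[uid]).digest() != hashlib.sha1(r[uid]).digest():
                # Same UID on both sides, but the bytes served differ.
                problems.append(
                    f"{folder}: message {uid} differs "
                    f"(master {len(m[uid])} bytes, replica {len(r[uid])} bytes)"
                )
    return problems


if __name__ == "__main__":
    master = {"INBOX": {1: b"hello", 2: b"world"}}
    replica = {"INBOX": {1: b"hello", 2: b""}}  # replicated, but empty
    for line in compare_views(master, replica):
        print(line)
```

A check that only counted messages per folder would pass in the example above, which is why the sketch compares what each server actually serves.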

A few example postings to the cyrus mailing list with some details:

http://lists.andrew.cmu.edu/pipermai...st/023331.html
http://lists.andrew.cmu.edu/pipermai...ly/022595.html
http://lists.andrew.cmu.edu/pipermai...st/023336.html
http://lists.andrew.cmu.edu/pipermai...ay/021919.html

2. Our original replication setup was flawed

There are a number of ways to do replication. The most obvious is to have one machine as the master and a separate one as the replica. That's a waste, however, because the replica doesn't take as many resources as the master (one writer, no readers). So our plan was to have replica pairs, each machine holding half masters and half replicas, with the masters on one replicating to the replicas on the other and vice versa. This would provide better performance in the general case when both machines were up.

The problem is that this turned out to be a bit inflexible: when one machine goes down, the "master" load on the other machine doubles. It also means the second machine becomes a single point of failure until the other machine is restored. Neither of these is nice.

After a bit of rethinking, we came up with the new slots + stores architecture (see bron's posts elsewhere). Basically everything is now broken into 300G "slots", and a pair of these slots on 2 different machines makes a replicated "store". The nice thing about this approach is that:

a) Each machine runs multiple cyrus instances. Each instance is smaller, can be stopped & started independently, can be moved more easily, restored more quickly from backup if needed, volume checked more quickly, etc. Smaller units are just easier to deal with.
b) By spreading out each store pair to different machines, when one machine dies, the load is spread out to all the other servers evenly.
c) Even after one machine dies, a second machine dying would only affect maybe one or two slots, rather than a whole machine's worth of users.
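As a rough illustration of points (b) and (c), here is a small sketch comparing the original paired layout with a spread-out one. The machine names, the store counts, and the simple rotation used to pick replica machines are all assumptions made for this example, not the real placement logic. Each store is modelled as a `(master_machine, replica_machine)` tuple.

```python
from collections import Counter


def paired_layout(machines, stores_per_machine):
    """Original design: machines in fixed pairs, each replicating to its partner."""
    stores = []
    for a, b in zip(machines[0::2], machines[1::2]):
        stores += [(a, b)] * stores_per_machine
        stores += [(b, a)] * stores_per_machine
    return stores


def spread_layout(machines, stores_per_machine):
    """Slots + stores: each machine's replicas are rotated over all other machines."""
    n = len(machines)
    stores = []
    for i, master in enumerate(machines):
        for k in range(stores_per_machine):
            # The offset runs over 1..n-1, so a replica never lands on its master.
            replica = machines[(i + 1 + k % (n - 1)) % n]
            stores.append((master, replica))
    return stores


def master_load_after_failure(stores, dead):
    """Count active (master) stores per surviving machine once `dead` fails over."""
    load = Counter()
    for master, replica in stores:
        load[replica if master == dead else master] += 1
    return load


def stores_lost(stores, dead_a, dead_b):
    """Stores with both slots on failed machines, i.e. no surviving copy online."""
    return sum(1 for m, r in stores if {m, r} == {dead_a, dead_b})


if __name__ == "__main__":
    machines = ["m1", "m2", "m3", "m4"]
    for name, layout in (("paired", paired_layout), ("spread", spread_layout)):
        stores = layout(machines, 3)
        print(name, dict(master_load_after_failure(stores, "m1")),
              "lost if m1+m2 die:", stores_lost(stores, "m1", "m2"))
```

With four machines and three stores each, the paired layout leaves one survivor carrying double its normal master load, while the spread layout adds one extra store to each survivor. A second failure takes out a whole machine's worth of stores in the paired case but only the few stores the two dead machines happened to share in the spread case.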

The downside to this solution is management. There's now many, many slots/stores to deal with, which means we had to write management tools.

Had we gone with this from the start, it would have saved time. On the other hand, it was only really clear that this was a better solution after we went down the first road and saw the effects. Hindsight is a wonderful thing.

3. The original servers we bought proved to be less reliable than expected.

Because we knew we had replication, and because we knew we had a very specific setup we wanted (2U server, 12 drives, 8 x high capacity SATA, 4 x high speed SATA, RAID controller with battery backup, etc) that IBM couldn't deliver, we went with a third party supplier. (http://blog.fastmail.fm/?p=524)

Suffice to say, this was a mistake. There is a big difference between hardware that runs stable for years vs hardware that runs stable for months. Replication should be more of a "disaster recovery" scenario, or a "controlled failover" scenario; it shouldn't replace very reliable hardware.

We went back to equipment we trusted (IBM servers + external SATA-to-SCSI storage units). It's a pity IBM are now 2.5 months late on delivering the servers they promised us. Trust me, we've already complained to them pretty severely about this. It's lucky we were able to re-purpose some existing servers for new replicated roles.

---

So all up, how would I summarise?

Had we followed the "perfect" path straight up, things would have gotten to the fully replicated stage faster, though not enormously so. The debugging and software stage still took quite some time, but it was more the hardware that slowed us down. On the other hand, the "perfect" path is often only visible with the benefit of hindsight. Additionally, by following some dud paths now, you learn not to take them again in the future.

Additional:

I've mentioned this in other posts now, but I should re-iterate that 85% of users were on replicated stores when this failure occurred. As Bron has mentioned, had it happened 1-2 weeks later, no-one would have noticed because that machine would have been out of service. This is actually part of the reason that as soon as the restore was done, we could say "everyone was replicated". So it's not like 11 months had passed and nothing had happened.

1. We'd chosen, tested and helped debug a replication system
2. We'd built 2 actual replication setups, scrapping the first after we realised there was a better arrangement
3. We'd bought and organised 2 sets of extra hardware
4. We'd already moved 85% of our user base to completely new servers

Rob

Last edited by robmueller : 6th November 2006 at 05:34 AM.
6th November 2006, 03:48 AM   #2
Sherry
 Moderator 
 
Join Date: Dec 2002
Location: USA
Posts: 8,686
Rob, thank you so much for providing one place to link to for this information, at the top of its own thread. Those who thought you did nothing after saying you would following the last outage can now see that you continually worked on getting all users replicated and had most already moved onto the new system.

Sherry
6th November 2006, 04:57 AM   #3
Terry
Master of the @
 
Join Date: Jul 2002
Location: A.U
Posts: 1,980
Yes Rob, thank you from all of us... we now have a lovely e-mail system. I just love the Fastmail web interface.
6th November 2006, 02:04 PM   #4
dbarbour
Junior Member
 
Join Date: Jul 2002
Posts: 3
Is the replication the reason...

... I am not getting any inbound emails?

SMTP is fine.

WWW interface is fine.

Just am not getting any inbound emails today.
6th November 2006, 06:09 PM   #5
robmueller
Intergalactic Postmaster
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098

Representative of:
Fastmail.FM
There's no problem I can see. Emails are coming in fine.

If you email me your account details and what emails you have expected, I can check the logs.

Rob
7th November 2006, 06:21 AM   #6
wishlish
Junior Member
 
Join Date: Feb 2004
Posts: 21
Rob, thanks for the explanation. Your team's done enough to ensure I'm keeping my Fastmail account.

My remaining principal complaint is the lack of updates on the status blog. When your service goes down, your users rely on whatever tools you provide them to get an idea of the severity of the problem and an estimated time for service recovery. At two points, the blog went significant times without an update. There might not have been anything new to report, I understand. But a "we're still working on it" with a timestamp would ease user worries and complaints.

Thanks.
7th November 2006, 07:20 AM   #7
Sherry
 Moderator 
 
Join Date: Dec 2002
Location: USA
Posts: 8,686
Quote:
Originally posted by wishlish
My remaining principal complaint is the lack of updates on the status blog. When your service goes down, your users rely on whatever tools you provide them to get an idea of the severity of the problem and an estimated time for service recovery.
That discussion is going on in another thread.

Communication During Server Outages.

Sherry
9th November 2006, 02:46 PM   #8
rossmpersonal
Junior Member
 
Join Date: Aug 2004
Posts: 14
Concerns about single points of failure in new setup

I am concerned about single points of failure in the new setup.

I am concerned that the networking between the replicated servers is a single point of failure. I recommend using redundant networking.

I would also recommend that e-mails that have been delivered be placed in a delivered messages queue so that if delivered e-mail is lost it can be redelivered if necessary.
9th November 2006, 03:14 PM   #9
hadaso
The "e" in e-mail
 
Join Date: Oct 2002
Location: Holon, Israel.
Posts: 4,484
Re: Concerns about single points of failure in new setup

Quote:
Originally posted by rossmpersonal
I would also recommend that e-mails that have been delivered be placed in a delivered messages queue so that if delivered e-mail is lost it can be redelivered if necessary.
Isn't this part of what a "Sent Items" folder is for? Email that has not been delivered successfully is already kept in a queue and delivery is retried until a certain time limit is reached. This is standard email behaviour. I don't see any reasonable way that FastMail can implement a second non-standard mailstore that a user will be able to access to redeliver email that was already delivered successfully that is better than the existing system. The capability already exists for email saved in IMAP folders ("Sent Items" or other). A user can select any message and redirect it to any address, including the original recipient of a message in the "Sent Items" folder. Delivery of a message that has been successfully sent by FastMail to the recipient's mailbox is the responsibility of the recipient's software that has already received the message. I don't think the sender should be responsible for misconfiguration of the recipient's email system if it fails to deliver mail it has received and acknowledged receipt of to its final destination. There is no way FastMail can determine what happens to email after it has reached its destination.

Last edited by hadaso : 9th November 2006 at 03:30 PM.
9th November 2006, 03:52 PM   #10
rossmpersonal
Junior Member
 
Join Date: Aug 2004
Posts: 14
Sorry, I was not clear.

I was suggesting that e-mail sent to Fastmail.fm users would be put into a queue after it has been put in the user's mailbox. That way, if any of the messages in the mailbox were lost due to a server failure, they could be redelivered.
14th November 2006, 08:06 AM   #11
robmueller
Intergalactic Postmaster
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,098

Representative of:
Fastmail.FM
This is something we have thought about.

Delivered email goes through an internal "proxy" we've written, so it could maintain a queue of the emails delivered in, say, the last 5 minutes, so that we can "replay" those messages if needed.
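For illustration, the kind of short-lived queue described above could look something like the following. This is a hypothetical sketch, not code from the FastMail proxy: the class name `ReplayBuffer`, its methods, and the in-memory storage are all assumptions. It keeps each delivered message for a fixed window (300 seconds here) and hands back whatever is still inside the window when a replay is requested.

```python
import time
from collections import deque


class ReplayBuffer:
    """Keep recently delivered messages for a fixed time window (sketch only)."""

    def __init__(self, window_seconds=300.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock      # injectable so the window can be tested
        self._items = deque()   # (timestamp, mailbox, message), oldest first

    def _expire(self, now):
        # Drop entries that have aged out of the window.
        while self._items and now - self._items[0][0] > self.window:
            self._items.popleft()

    def record(self, mailbox, message):
        """Call after a message has been delivered to `mailbox`."""
        now = self.clock()
        self._items.append((now, mailbox, message))
        self._expire(now)

    def replay(self, mailbox=None):
        """Return (mailbox, message) pairs still in the window, oldest first."""
        self._expire(self.clock())
        return [(mb, msg) for _, mb, msg in self._items
                if mailbox is None or mb == mailbox]


if __name__ == "__main__":
    buf = ReplayBuffer(window_seconds=300)
    buf.record("alice", b"Subject: hi\r\n\r\nbody")
    print(buf.replay("alice"))
```

An in-memory buffer like this would itself be lost if the proxy process died, so a real version would presumably need to persist the window somewhere.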

On the other hand, this probably shouldn't be required, because replication will keep everything up to date anyway.

Rob
14th November 2006, 03:39 PM   #12
kastaway
= Permanently banned =
 
Join Date: Nov 2005
Posts: 128
Quote:
Originally posted by robmueller
This is something we have thought about. On the other hand, this probably shouldn't be required, because ....
Heh. Heehee. Ahh, hahaha. BWAAAHaaahahaha. <wipes tear from eye>

Thanks. I needed that.
14th November 2006, 04:57 PM   #13
Shelded
 Moderator 
 
Join Date: Aug 2001
Location: USA Northwest
Posts: 3,842
I didn't ignore that.

MOST IMPORTANT: POLITENESS

* Keep the tone and content of your posts respectful and polite (even if you're angry about something).
15th November 2006, 06:41 AM   #14
Edwin
 Administrator 
 
Join Date: Aug 2001
Location: UK
Posts: 3,083
kastaway, this is your only warning from me. One more problem with attitude or anything else that is in clear breach of the Rules, and you will be banned from these forums.

This is not up for debate.
15th November 2006, 04:28 PM   #15
ebuckley002
= Permanently banned =
 
Join Date: Oct 2006
Posts: 10
Mod: Comment removed

Last edited by JeffK : 17th November 2006 at 01:47 PM.

All times are GMT +9.

Copyright EmailDiscussions.com 1998-2013. All Rights Reserved