Insane. In-freakin-sane! We have migrated 590,000 mail accounts from our old qmail-ldap system to our new Zimbra platform. As with all migrations, there are always unforeseen issues and escalations, and for the last week we have been putting out fires and fending off the damn helldesk. I will try to give a breakdown of the sequence of events that transpired over the last week or so.
By the evening of Sunday the 22nd of February, we had migrated 480,000 accounts. I was feeling quite pleased with myself and went to bed that evening a satisfied man. Monday morning was the real eye-opener. As I discussed previously, all inbound SMTP was directed to the qmail-ldap system and, using the cluster capabilities in qmail-ldap, the mail was forwarded via qmqpd to the load-balanced MTA servers. On Monday our MTA servers started to give us issues, as the disk I/O was not up to the task; some fine tuning was also required. That done, we restarted the MTAs and were back in business.
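For anyone unfamiliar with that setup: the qmail QMQP client (qmail-qmqpc) picks its relay hosts from the control/qmqpservers file, one IP per line. A sketch – the addresses here are invented:

```shell
# qmail-qmqpc reads its QMQP relay list from control/qmqpservers
# (one IP per line). These addresses are examples only.
printf '10.0.0.11\n10.0.0.12\n10.0.0.13\n' > /var/qmail/control/qmqpservers
```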
We were still seeing some pressure on our mailstores, which was a bit surprising seeing as they are all 16-core, 14GB RAM Opteron servers – all with SAN-attached storage for index, redo, log, store, secondary_store and sqldb/data – and there are 4 of them! We were seeing I/O traffic to the mysql DB LUN of 30,000+ writes per second and 15,000-20,000 reads per second. Seriously working!! We decided (incorrectly) to try to change the number of LMTP threads on each of the mailstores. This required a restart of the Zimbra services, and from then on we had problems.
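For the record, the knob we touched was along these lines – the attribute is a standard per-server Zimbra setting, but the hostname and value here are illustrative, not a recommendation:

```shell
# Raise the number of LMTP delivery threads on one mailstore
# (server name and value are examples).
zmprov ms store1.example.com zimbraLmtpNumThreads 40
# The catch: the services must be restarted for this to take effect,
# and that restart is what kicked off our problems.
zmcontrol restart
```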
We did our migration by setting up an external data source on every account and popping the mail from the legacy mail system. We did this for a number of reasons – primarily that we did not have the users' passwords to imapsync the accounts, and if we changed the passwords to something we knew, the LDAP slaves would not replicate quickly enough for the script. The external data source was an excellent choice simply because most of our users are POP3 junkies. We created a data source on every account to poll the legacy mail system every 2 hours. This had been working like a charm up until the restart of the service. A bug in 5.0.11 resulted in the HTTP auth service not being available on any of the four mailstores. The bug, which was detected and logged by Zimbra, was triggered by too many scheduled tasks starting up at once. The HTTP auth service being down gave us some serious headaches.
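A data source can be created per account from the CLI. The sketch below only echoes the zmprov commands as a dry run (the account list, legacy host and the 97-second stride are all invented, and the username/password attributes are omitted), and it staggers the polling interval slightly per account – with hindsight, spreading the scheduled tasks out like this is exactly the sort of thing that matters, given the bug above:

```shell
# Dry run: echo one "zmprov createDataSource" per account instead of executing.
# Accounts, legacy host and the 97s stride are invented for illustration.
i=0
for acct in user1@example.com user2@example.com user3@example.com; do
  # Spread poll intervals so the scheduled tasks don't all fire together.
  offset=$(( (i * 97) % 7200 ))
  echo zmprov createDataSource "$acct" pop3 legacy-pop \
      zimbraDataSourceEnabled TRUE \
      zimbraDataSourceHost legacy-mail.example.com \
      zimbraDataSourcePort 110 \
      zimbraDataSourcePollingInterval "$(( 7200 + offset ))s"
  i=$(( i + 1 ))
done
```

Dropping the `echo` would run the commands for real; in practice you would also set the data source credentials per account.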
Added to that was the fact that our helldesk initiated 45,000 password changes in one day alone on the Sunday. This resulted in serious delays in LDAP replication. The decision was made to stop the LDAP service, copy the master DB files to the slaves, and start the LDAP services again. We eventually managed to hose down the fires by Tuesday 24th, and from then on we have suffered only minor service-related issues.
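The reseed itself was nothing fancy – roughly the following, with paths assumed for a Zimbra 5 layout and hostnames invented:

```shell
# On the master: dump the directory to LDIF (run as the zimbra user).
/opt/zimbra/openldap/sbin/slapcat -l /tmp/master.ldif

# On each slave: stop slapd, wipe the stale DB, reload, restart.
scp /tmp/master.ldif ldap-slave1:/tmp/
ssh ldap-slave1 'ldap stop && \
    rm -f /opt/zimbra/openldap-data/*.bdb && \
    /opt/zimbra/openldap/sbin/slapadd -l /tmp/master.ldif && \
    ldap start'
```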
However, the biggest problem we have faced during this migration has been dirty data in our legacy billing and LDAP platforms. We have had instances of userids that matched other users' aliases – and many other weird data integrity issues. It has taken us the best part of 10 days to identify as many of the dirty-data / borked accounts as we could and fix them up. The number of accounts with dirty data must be around 2,000-2,500, of which a large percentage are inactive.
As it stands now, our systems (2 proxy servers, 5 MTAs, 4 mailstores, 3 LDAP servers) are handling the load very well. We are managing to deliver mail at a fast rate (8 seconds from Gmail to our domain) and our only concern is the mysql DB LUN that is under pressure. We believe, though, that changing the usage patterns will actually give us more available I/O.
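We are keeping an eye on that LUN with plain iostat in the meantime (the device name is an example):

```shell
# Extended per-device stats in kB on the DB LUN: 5s intervals, 3 samples.
iostat -xk /dev/sdd 5 3
```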
I will post some stats and graphs as soon as I have them.