Life, Football, Technology and Vespas…

Migration – Tidying up the pieces

It has been 2 months since the start of our migration from qmail-ldap to ZCS. There have been a number of issues that we have had to deal with over this time, and now that things have had a chance to settle, it is a good time to jot it all down.

I mentioned previously that we had the issue of dirty data in our billing/provisioning platform. A lot of man hours went into tidying up this data. Our sync-replication script is still watching for changes made to the old qmail-ldap server (to which our billing platform still provisions) and applying the necessary changes via zmprov.
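For illustration, the shape of it is roughly this; the hostname, base DN, filter and attribute below are placeholders rather than our actual values, and a real script handles renames, deletes and quota changes too:

#!/bin/sh
# Sketch only: poll the old directory for recently modified entries and
# replay them against ZCS via zmprov.
OLD_LDAP="ldap://qmail-ldap.example.com"
BASEDN="o=example"
SINCE=$(date -u -d '1 hour ago' +%Y%m%d%H%M%SZ)

ldapsearch -x -LLL -H "$OLD_LDAP" -b "$BASEDN" \
    "(modifyTimestamp>=$SINCE)" mail \
  | awk '/^mail: /{print $2}' \
  | while read addr; do
        # Create the account on the Zimbra side if it does not exist yet.
        /opt/zimbra/bin/zmprov ga "$addr" >/dev/null 2>&1 \
            || /opt/zimbra/bin/zmprov ca "$addr" ''
    done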

We ran into an issue with connections to the MySQL database on our mailstores. We monitor the number of connections to the database, but because the mailboxd service opens every connection in its configured pool up front, it is hard to tell how many of them are actually in use or to spot any trends or patterns. We would run out of connections to the database, which caused all manner of problems – notably POP3 not being available.
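For what it is worth, these are the sorts of checks involved; the paths and the localconfig key are as I remember them for ZCS 5.x, so verify them against your own install:

# How many threads are actually connected vs. the server-side limit.
/opt/zimbra/mysql/bin/mysql -S /opt/zimbra/db/mysql.sock -u zimbra -p \
    -e "SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"

# The size of the pool that mailboxd opens up front lives in localconfig.
/opt/zimbra/bin/zmlocalconfig zimbra_mysql_connector_maxActive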

We picked up on some issues with the Zimbra cluster checking script. Zimbra installs some cron jobs and other measures to rotate logs and the like. The problem is that if you rotate any MTA-related log files, the postrotate script wants to restart the MTA. Obviously little thought was given to the zmcluctl script, which the cluster uses to determine its status. When the logs rotated at 4am every morning, our cluster would decide it was down because the MTA was not up, and restart the cluster service. There is a bug logged for this, and there is also a workaround that we implemented.
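I will not reproduce the exact workaround here, but the idea is simply to give the MTA a grace period around log rotation rather than failing the cluster service the instant the check sees it down; something along these lines:

#!/bin/sh
# Illustrative only: retry the status check so a short MTA restart
# (e.g. during the 4am log rotation) does not trigger a failover.
for attempt in 1 2 3; do
    /opt/zimbra/bin/zmcluctl status && exit 0
    sleep 30
done
exit 1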

Disk I/O

[Graph: Disk IO – 1]

[Graph: Disk IO – 2]

The graphs show IO operations for one of our 4 mailstores. DM-13 is the LUN for our MySQL data, and DM-8 is our primary store:


# dmsetup ls
sysvg-tmp       (253, 1)
zimbra1_redo-redo       (253, 11)
zimbra1_sqldb-data      (253, 13)
zimbra1_secondary-first (253, 7)
sysvg-usr       (253, 4)
sysvg-var       (253, 3)
zimbra1_log-log (253, 10)
zimbra1_index-index     (253, 12)
sysvg-home      (253, 2)
zimbra1_base-base       (253, 9)
zimbra1_primary-store   (253, 8)
sysvg-opt       (253, 6)
sysvg-swap      (253, 5)
sysvg-root      (253, 0)

It is clear to see that the system is working hard! Our initial thoughts about the requirements did not take into account that POP3 is actually quite IO-intensive. It makes sense now, obviously, but we had not considered it. The long-term plan to encourage default usage to be web-based will help ease the IO on the DB and primary storage LUNs.

Disk Utilization:

[Graph: Disk Utilization]

This is quite a cool graph. It shows HSM at work: every Friday, the HSM job kicks in and moves mail older than a week from primary storage to secondary storage. Remember, this is a graph from only one of our mailstores. Everyone has only a 25MB mailbox at the moment, but that will change once we have decided what we can scale to easily.
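For reference, the policy itself is just a server attribute plus the scheduled HSM session; the attribute and CLI names below are as per the Network Edition documentation of the time, and the hostname is a placeholder:

# Move anything older than a week off the primary volume.
zmprov ms mailstore01.example.com zimbraHsmAge 7d

# Sessions can also be started and watched by hand.
zmhsm --start
zmhsm --status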

There have been some other quirky minor tweaks, but overall it is running well and there are very few complaints from our users. Most do not even know that they have been migrated. Once we have implemented our customized skin and a larger default mailbox, I reckon there will be a drive to start educating customers and moving them to a web-based experience.

The next step is to start analysing all the statistics that are available. We have a large number of used licenses, but we do not believe that even half of our users are active, so that exercise has to start soon. We will probably munge all the audit logs and put the information into a centralised database for interrogation.
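As a very rough first pass (the audit.log format differs between ZCS versions, so treat the field name as an assumption), something like this gives a quick count of accounts that have actually authenticated:

# Distinct accounts seen authenticating in the current audit log.
grep -o 'account=[^;]*' /opt/zimbra/log/audit.log | sort -u | wc -l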

What has to be said in closing is that we are very happy with our RHEL 5.3 + Xen + GFS + RHEL Cluster stack. It has been very stable for us so far (touch wood), and it was always a bit of an unknown for us during this project. The idea of clustering the virtual machines was sound in theory, but we always had a niggle in the back of our minds about the feasibility – especially as we were struggling with some aspects a couple of months prior to going live!

With all that said, it has really come together nicely.


2 responses

  1. Maciej Bogucki

    I have a few questions for you:
    1. How many disks, and of what type (SCSI, FC, SATA), do you have for DM-13 (MySQL data)?
    2. What kind of data (MySQL data, logs, indexes) do you have on the GFS filesystem? Is GFS supported by the Zimbra Support Team?
    3. When you compare the charts https://bonoboslr.files.wordpress.com/2009/04/hobbitgraph-1sh.png?w=300&h=151 and https://bonoboslr.files.wordpress.com/2009/02/hobbitgraphsh.png?w=447&h=159 there is an order-of-magnitude difference between them. Can you tell me what accounts for such a difference in IOPS?

    July 23, 2010 at 12:22 pm

    • bonoboslr

      1) I forget entirely what we set up, but they were 15k FC (RAID 10) disks that we used to create the LUNs for the primary storage.
      2) We only had the virtual machines on the GFS. That way, live migration of the VMs could happen without the need for mounting or unmounting filesystems across hypervisors.
      3) That I am not entirely sure of. The stats were gathered by a collection script from the EMC CLARiiON, and one of the graphs might be from the OS.

      Unfortunately, I do not work there any more, so I cannot go and fish out the exact details.

      July 23, 2010 at 1:01 pm
