Zimbra Cluster – what is the point?
I am in the process of deciding what to do about the Zimbra cluster that was previously installed at my current place of work. Having had quite a bit of experience with Zimbra clusters, and having had numerous discussions with colleagues and friends, I have come to the conclusion that there is little to no point in clustering a Zimbra mail store at all.
To understand how I reach this conclusion, you have to look at how Zimbra clustering was designed and what it is meant to achieve. Let us start with the basic question: what does setting up a Zimbra cluster actually achieve?
In my opinion, Zimbra clustering was not designed to achieve ‘high availability’; it was designed more for physical failover, which is not ideal in service provider environments. Service providers want as little downtime as possible, so that upgrades or device failures have no impact on users. In a corporate environment, where you can afford larger maintenance windows, this is OK.
The key problem is that the Zimbra mail store has too many elements that are configured to reside as a single instance, making high availability difficult. Databases, indexes and even the storage for the data blobs are designed to run on a single physical machine. During a cluster service relocation, the service must be shut down on the one node, the file systems remounted on the secondary node, and the service then restarted. Anyone who runs Zimbra in a cluster environment will tell you that this takes minutes rather than seconds.
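To make the "minutes rather than seconds" point concrete, here is a minimal sketch that treats a relocation as a strictly sequential chain of steps. The step names follow the description above; the durations are placeholder numbers I have made up for illustration, not measurements from any real Zimbra cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    seconds: float  # hypothetical duration, NOT a measurement


# The three steps described in the text; all durations are assumed values.
RELOCATION_STEPS = [
    Step("shut down mail store service on the primary node", 90.0),
    Step("remount file systems on the secondary node", 30.0),
    Step("restart mail store service on the secondary node", 120.0),
]


def relocation_downtime(steps):
    """Each step must finish before the next starts, so the outage is the sum."""
    return sum(step.seconds for step in steps)


if __name__ == "__main__":
    total = relocation_downtime(RELOCATION_STEPS)
    print(f"estimated outage: {total:.0f} s ({total / 60:.1f} min)")
```

Because nothing in the chain can overlap, the only way to shorten the outage in this model is to shorten the individual steps, and the service stop and start are the ones least under the administrator's control.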
A better design, in my opinion, would start by breaking up the elements that are designed to run as a single instance on a single machine. Start with the database. At present, the database is one of the services that cannot be ‘shared’. Each mail store needs its own DB, but with a slight change to make it available over the wire, it becomes possible for multiple machines to use the database.
The next consideration would be the data store for the message blobs. As the blobs are stored as individual files in a hierarchy of folders, making the message store available via NFS becomes viable. I have run NFS as a mail store for million+ user environments for many years and it has never given me any issues in either reliability or performance – and I am prepared to challenge anyone on this! So by clustering an NFS service and offering the storage over the wire, you now have multiple machines that can access the DB AND the store, which really only leaves elements such as the logs, config files and the indexes to consider. My first thought would be to utilise a cluster file system, such as GFS, for the config files and indexes, but I have to be honest, I have not completely thought it through. It seems trivial to me though, and I cannot think of a reason why this set-up would not work. Logs can be delivered over the wire to a centralised log server, so this is another element that can be ‘spread out’.
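The reason a file-per-message layout suits a shared mount can be sketched in a few lines: if the path to a blob is a pure function of its identifiers, every machine that mounts the same export resolves the same file without any coordination. The layout below (`<root>/<mailbox_id>/msg/<bucket>/<item_id>.msg`, with 4096 files per bucket) is a hypothetical scheme for illustration only, not Zimbra's actual on-disk format, and `/mnt/store` is an assumed mount point.

```python
from pathlib import PurePosixPath


def blob_path(root, mailbox_id, item_id, files_per_dir=4096):
    """Map a message to a file path under a shared root.

    Hypothetical layout, not Zimbra's real one: messages are spread across
    numbered bucket directories so no single directory grows without bound.
    """
    if files_per_dir <= 0:
        raise ValueError("files_per_dir must be positive")
    bucket = item_id // files_per_dir
    return PurePosixPath(root) / str(mailbox_id) / "msg" / str(bucket) / f"{item_id}.msg"


if __name__ == "__main__":
    # Two front-end machines with the same NFS export mounted at the same
    # (assumed) path compute the same location independently.
    print(blob_path("/mnt/store", mailbox_id=12, item_id=5000))
```

The function needs no local state, which is the property that matters here: the blob store itself places no requirement on which machine does the lookup.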
The suggestion above amounts to a total redesign of the mail store, splitting it into a front end and a back end. The front end would be the machines that actually accept mail via LMTP, with only the binaries installed locally. The front end machines could be part of a GFS cluster so that configurations (that are not already in LDAP) and indexes can be shared across multiple machines. The back end could be a DB cluster and an NFS cluster, either separate or combined. With this set-up, the failure of any single device need not interrupt the service.
Such a redesign also introduces an increased level of complexity, possibly making the mail store difficult for Zimbra to support. So if you stick with the traditional cluster set-up offered by Zimbra, what do you really get?
It appears that the only reason for clustering the Zimbra mail store would be physical device failure. The service would simply be relocated to the standby node should the primary node fail. When you start looking at large mail store environments that require many CPUs and a fair bit of RAM, you possibly end up with a very expensive machine sitting idle, waiting for the primary machine to fail. A better solution would be to cluster the machine rather than the service.
Using virtualization, such as Xen, it is possible to run a virtual machine as a clustered service. Should the physical machine fail, the virtual machine would simply be relocated to another physical machine in the cluster. This has the added benefit that, should you need to perform physical maintenance, there would be no interruption to the Zimbra service if you utilise live migration. Another benefit is the dynamic allocation of resources. If you notice that too many or too few resources have been allocated to the virtual machine, they can be adjusted dynamically without any impact to the service.
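To put a rough number on the idle standby problem, the sketch below compares a dedicated active/standby pair with a virtualization pool in which one spare physical host covers several active ones. The pool sizes are my own assumptions for illustration; the only point is how the fraction of idle hardware shrinks once the spare is shared.

```python
def idle_fraction(active_hosts, standby_hosts):
    """Fraction of purchased hosts that sit idle waiting for a failure."""
    total = active_hosts + standby_hosts
    if active_hosts < 0 or standby_hosts < 0 or total == 0:
        raise ValueError("host counts must be non-negative and not both zero")
    return standby_hosts / total


if __name__ == "__main__":
    # Traditional service cluster: one standby dedicated to one primary.
    pair = idle_fraction(active_hosts=1, standby_hosts=1)
    # Assumed virtualization pool: one spare host shared by four active hosts.
    pool = idle_fraction(active_hosts=4, standby_hosts=1)
    print(f"active/standby pair: {pair:.0%} of hardware idle")
    print(f"4+1 virtualization pool: {pool:.0%} of hardware idle")
```

This ignores the fact that a shared spare can only absorb one failure at a time, so it is a cost illustration rather than an availability model.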
So all in all, I cannot see the point in the Zimbra cluster setup. If I want physical machine redundancy (as it appears to be designed for) then I would prefer to use virtualization / cloud computing. And if I want high availability, then I have to step away from the existing Zimbra cluster design anyway.