There’s been a lot of talk around the water cooler lately about extremely large VMs and the pain of backing them up, as well as the agony of restoring these monster VMs should the need arise. When I say agony, I am thinking of a root canal without painkillers…
We all love that the price of storage has become so cheap that everyone seems to have at least a terabyte of photos and two or three terabytes of movies these days – my 75-year-old father included. Gone are the days when server admins religiously set user storage quotas and monitored file shares for .jpg and .mp3 files. Why bother? It costs more to monitor, find and then delete the offending files than it costs to simply add more space.
Or is it?
Running monster VMs in your environment can be cool to brag about, but they are certainly not without peril. I recently embarked on an internal testing project to determine how big is too big from a recovery point of view, using traditional backup methods. For the testing use case, we created a 21TB VM with 14 attached VMDKs stuffed full of data.
First, some technical and environmental information for the test setup: The test environment is VMware ESXi 5.5 on Cisco UCS. The backup server is a Cisco B200-M3 blade with dual Intel Xeon E5-2630 CPUs at 2.3GHz and 24 cores, and 128GB of RAM. The server is running Windows Server 2012 R2 and Veeam version 9, with 2 additional physical proxy servers to assist with the backup/restore processes. The storage repository is a Cisco 3160 with 400TB of space on 60 * 8TB disks running RAID60, connected by 2 * 10Gbps fabric connections. The restore will be made to a 25TB volume running on a Compellent SC8000 array with 576 * 800GB 10K disks.
Okay, now that we got that out of the way, onto what we are here for: what does it take to successfully recover a 21TB VM?
Creating a full backup took right at one and a half hours which was about what we expected it to take. Once it completed, we immediately set about creating the restore job. During the restore wizard all of the VM’s data and configuration was checked by Veeam, which took longer than we expected. Once that little delay was over we moved along through the rest of the wizard and with great excitement we clicked START! And nothing happened. For a long time. The nerd excitement waned and we went home.
After some time, Veeam showed that the job was still at 0%, but in vSphere we could see disks were being created, so things were definitely moving along. Sadly, later we were still at 0% and still creating disks in vSphere. No actual data had been copied yet because, by default, Veeam running restores in SAN mode creates all the VMDKs and formats them as Thick Provision Lazy Zeroed first. Only once the disk creation has completed does the actual restoration of data begin. Note that it is possible to change the disk restore method to Thin Provision or Thick Provision Eager Zeroed by creating a new registry entry on the Veeam server, which will effectively make this first process go very quickly.
One could argue that you would not need to do a full restore if you had an OS corruption issue. You could just reload the OS on a new partition and reattach the existing VMDKs, or simply use the file level restore capability to restore the files you need. But in the event of ransomware or large-scale data corruption, you may be left with no choice but to restore from an older backup.
Perhaps the main lesson to be learned from these tests is this: At some point a VM really can be too big—regardless of its disk configuration—IF you ever, ever need to restore it. When that day comes, you and the other business owners need to be aware that the restore is definitely not going to be a fast process. Based on my experience, it’s likely that telling someone that they may have to wait more than an hour is going to be a hard sell to anyone needing access to the data.
So what is the solution? Consider migrating the data from your single giant VM to several smaller VMs, or leveraging snapshots with backup software integration. Splitting the data up has the added benefit of distributing the VMs and their workloads across separate hosts in the cluster, most likely improving performance. Additionally, it removes the single point of failure that a single 21TB file server represents. And finally, if one 5TB server needed to be restored, it would likely take a fraction of the time that restoring a single 21TB VM would take. More importantly, your entire environment would not be offline, so fewer people would be lining up at your desk or burning up your phone asking for a status update.
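To put rough numbers on that last point, here is a minimal back-of-the-envelope sketch in Python. It assumes the data copy runs at a constant sustained rate and ignores the up-front disk creation phase entirely. The 500 MB/s figure is a hypothetical placeholder, not something we measured, and the function name is just for illustration, so plug in whatever rate you actually see in your own environment.

```python
TB = 10**12  # one decimal terabyte, in bytes


def restore_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Estimate hours needed to copy size_tb terabytes at a sustained rate in MB/s."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (size_tb * TB) / (throughput_mb_s * 10**6)
    return seconds / 3600


# Hypothetical sustained restore rate; replace with your own observed value.
ASSUMED_RATE_MB_S = 500

if __name__ == "__main__":
    for size_tb in (21, 5):
        hours = restore_hours(size_tb, ASSUMED_RATE_MB_S)
        print(f"{size_tb} TB at {ASSUMED_RATE_MB_S} MB/s: about {hours:.1f} hours")
```

Under that assumed rate the 21TB VM works out to roughly 11.7 hours of pure data copy against roughly 2.8 hours for a 5TB VM. The ratio is simply 21/5, or about 4.2x, whatever rate you plug in, and the smaller VM also has fewer and smaller disks to create before any data starts moving.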