Editor’s Note: As of January 2022, iland is now 11:11 Systems, a managed infrastructure solutions provider at the forefront of cloud, connectivity, and security. As a legacy iland.com blog post, this article likely contains information that is no longer relevant. For the most up-to-date product information and resources, or if you have further questions, please refer to the 11:11 Systems Success Center or contact us directly.
This is the third blog in our series on continuity terminology. The first covered snapshots, and the second covered backups. As noted, this guide is a simple introduction to each of these concepts and explains the differences in clear, simple terms. We will discuss what each one is, what it isn’t, and how you might want to use it in your environment.
So, let’s get to the last one: replication.
11:11 Systems offers several solutions to help you in your time of need. Snapshots work as a save point that you can use as needed before a large change. Our seven-day rotational backups run once a day in all of our ECS environments and allow you to restore a machine to a previous state. But the most powerful and flexible solution by far is our true Disaster Recovery (DR) offering. If you want the peace of mind that comes with true DR, replication is the way to go.
What is replication?
In the simplest terms, replication is a streaming backup of your entire production environment. Once your data has been pushed to us and the initial seeding is complete, data is streamed continuously to your DR environment, keeping the replicated VMs as close to their production doppelgangers as possible. Typically, this means you can bring up an environment that is minutes, sometimes even seconds, behind your live environment in the event of a true disaster.
11:11 partners with Zerto for replication in our 11:11 Cloud environments. A typical installation consists of two essential parts: a Virtual Replication Appliance (VRA) installed on each host and a Zerto Virtual Manager (ZVM) server that allows centralized communication and management both locally and site-to-site.
After the initial install, your entire environment begins streaming over to your DR site through a pre-configured, secure VPN tunnel. Replication has minimal impact on the performance of your VMs. Since it is performed at the host level and allows for multiple VMs to be streamed at a time, syncing your entire environment is seamless, quick, and easy.
Once the initial sync (seeding) is complete, Zerto begins syncing data in an attempt to catch up to your live environment. Once this sync has completed, it begins to build a journal history (four hours’ worth, by default) of restore points. You can choose one of the hundreds of points in that window to restore your environment to, which allows you to select the most appropriate version of your environment to bring back up after a disaster. Replication will continue streaming in the background from this point on, making sure that your DR site meets your required Recovery Point Objective (RPO), that is, the maximum number of minutes or seconds your DR environment is allowed to be behind your production environment.
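To make the journal idea concrete, here is a minimal sketch of a rolling four-hour window of restore points. It is a toy model, not Zerto's implementation: the `Journal` class, its method names, and the 30-second checkpoint interval are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import deque
from datetime import datetime, timedelta


class Journal:
    """Toy rolling journal of restore points (hypothetical, not a Zerto API)."""

    def __init__(self, history=timedelta(hours=4)):
        self.history = history          # how far back restore points are kept
        self.checkpoints = deque()      # timestamps, oldest first

    def add_checkpoint(self, ts):
        self.checkpoints.append(ts)
        # Drop restore points that have aged out of the journal window.
        while ts - self.checkpoints[0] > self.history:
            self.checkpoints.popleft()

    def restore_point_at_or_before(self, target):
        """Return the newest restore point not later than `target`, or None."""
        points = list(self.checkpoints)
        i = bisect_right(points, target)
        return points[i - 1] if i else None

    def current_rpo(self, now):
        """How far the DR copy is behind production at time `now`."""
        return now - self.checkpoints[-1] if self.checkpoints else None


# Assumed scenario: one checkpoint every 30 seconds for five hours.
start = datetime(2022, 1, 1, 8, 0, 0)
journal = Journal()
for i in range(601):
    journal.add_checkpoint(start + timedelta(seconds=30 * i))

# Pick the restore point just before a (made-up) incident time.
incident = start + timedelta(hours=3, seconds=10)
chosen = journal.restore_point_at_or_before(incident)
```

After five hours of checkpoints, only the most recent four hours remain in the journal, and `chosen` is the last restore point taken before the incident time. The same structure shows why the RPO is small: it is simply the gap between "now" and the newest checkpoint.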
To prepare for a disaster, the technology lets you configure Virtual Protection Groups (VPGs) that organize and plan your failover. When failing over, these VPGs are what you select to bring up in your DR environment. VPGs allow you to restore clusters of the VMs most integral to your infrastructure with a single click instead of restoring one VM at a time. They also allow the machines to be pre-configured to work in the new environment with new network settings that match your DR site. Both of these are HUGE time savers in an actual disaster and allow you to move faster and focus on getting your environment back online as quickly as possible.
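As a rough illustration of why grouping helps, the sketch below models a protection group as a list of VMs, each with DR network settings decided ahead of time. The class names, the `boot_order` field, and the IP addresses are hypothetical and are not taken from Zerto or the 11:11 console; boot ordering in particular is an assumption added to show how a group can encode a plan.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectedVM:
    name: str
    boot_order: int   # assumed: lower numbers start first
    dr_ip: str        # address the VM should use at the DR site


@dataclass
class ProtectionGroup:
    name: str
    vms: list = field(default_factory=list)

    def failover_plan(self):
        """Return the steps for bringing up the whole group in one action."""
        ordered = sorted(self.vms, key=lambda vm: (vm.boot_order, vm.name))
        return [f"power on {vm.name} with DR address {vm.dr_ip}" for vm in ordered]


# Example group: a database that must start before the application server.
group = ProtectionGroup(
    "web-tier",
    [
        ProtectedVM("app01", boot_order=2, dr_ip="10.20.0.11"),
        ProtectedVM("db01", boot_order=1, dr_ip="10.20.0.10"),
    ],
)
plan = group.failover_plan()
```

Because the order and network settings are captured once, ahead of time, selecting the group is the only decision left to make during an actual failover.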
But just having a good DR plan using replication isn’t enough. You need to be able to prove that it actually works, and you and your team need to be familiar with how things will go in case of a disaster.
How does replication work?
Once replication is up and running in your environment, you’ll have the option to fail over in one of two ways:
A live failover is the failover designed for a full-on DR event. A live failover will assume that your production environment is down and that the VMs being failed over are now supposed to be the primary VMs. Once you have failed over, replication stops coming from your production site, and it will need to be reconfigured in order to continue, which will result in a need to re-seed the data. This makes sense in a true DR event since the original environment would be destroyed or inaccessible. Obviously, this is not the best way to test your DR plan, since it breaks the replication. It can also wreak havoc on a functional live environment, depending on the settings you select (powering off live machines, for example).
The second option, a failover test, is the only option you should ever choose outside of an actual disaster. This will bring up the VMs that have been replicated to your DR environment and allow you to do whatever testing you need without impacting anything in your production environment. Once you end the test, the VMs will be removed, and the data will begin syncing again. You can test any time you like, and the process is very straightforward.
Either way, reports are available in the console that show the results of the failover once it has completed.
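The difference between the two failover types can be summarized as a small state model. This is only a sketch of the behavior described above; `FailoverMode`, `GroupState`, and both functions are hypothetical names, not part of any real console or API.

```python
from dataclasses import dataclass
from enum import Enum


class FailoverMode(Enum):
    TEST = "test"
    LIVE = "live"


@dataclass
class GroupState:
    replicating: bool = True      # data is streaming from production to DR
    dr_vms_running: bool = False  # replicated VMs are powered on at the DR site
    needs_reseed: bool = False    # replication must be reconfigured and re-seeded


def start_failover(state: GroupState, mode: FailoverMode) -> GroupState:
    if mode is FailoverMode.LIVE:
        # Live failover: the DR VMs become primary and replication from
        # production stops until it is reconfigured and re-seeded.
        return GroupState(replicating=False, dr_vms_running=True, needs_reseed=True)
    # Test failover: DR VMs come up for testing; production is untouched.
    return GroupState(
        replicating=state.replicating, dr_vms_running=True, needs_reseed=False
    )


def end_test(state: GroupState) -> GroupState:
    # Ending a test removes the test VMs and syncing carries on.
    return GroupState(replicating=True, dr_vms_running=False, needs_reseed=False)


healthy = GroupState()
after_test = end_test(start_failover(healthy, FailoverMode.TEST))
after_live = start_failover(healthy, FailoverMode.LIVE)
```

A test returns the group to its starting state, while a live failover leaves it needing a re-seed, which is why the test is the right choice for anything short of a real disaster.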
To sum up:
Replication is the preferred method for DR for a few reasons.
– First, it constantly streams data from your live environment to your DR site. This allows the data in your DR site to be much closer in time to the data in your live environment.
– Second, since data is replicated at the host level, it allows for syncing of multiple VMs at a time instead of the “one job at a time” limitations found in most backup solutions. Plus, it has a minimal impact on your production environment and the performance of the VMs being replicated.
By default, replication allows for a history of restore points stretching back four hours. You can fail over to any of these points at any time.
VMs that are replicated are only accessible during a failover test or after a live failover. While they are being replicated, they will not be accessible or configurable.
VMs that are involved in a current failover test will not be backed up as part of the daily backups that run in our ECS environments since they are intended to exist temporarily.
Replication at 11:11 can be controlled and managed via the console on the continuity tab. You can perform failover tests, run live failovers, and view any information or reports you need, all in one centralized location. This allows you to initiate a failover, even if your production environment is unavailable, without having to call our support team to kick it off for you (though they are glad to do so).
What isn’t replication?
Replication is not a good choice for file-level backup. While it is possible to recover OS-level files from a failed-over VM, using replication as a file system backup is like using a bazooka to fish. It might technically work, but it’s a mess to deal with, and you are kind of missing the point.
Replication is not something that will help you recover a single lost machine in your production environment. Not only is replication used to cover groups of machines, but it’s also designed to spin them up in another physical location, so it won’t help you if one of your local VMs goes bad during an update.
Replication does not create backup copies of VMs that you can log into. VMs in your DR site are only accessible once they have been failed over, either in a live failover or in a test.