We are in the midst of configuring SRM 4.1.2.
We have a very basic configuration:
Protected Site:
3 x Dell R710
ESXi 4.1u1 / vSphere 4.1u1 (VC Server)
Dual channel QLogic FC HBA multipathed via 2 FC switches
Compellent C30 based SAN (Dual controller)
Recovery Site:
4 x Dell 2950
ESX 4.1u1 / vSphere 4.1u1 (VC Server)
Single Channel QLogic FC HBA (Multipathed to Compellent storage via iSCSI)
Compellent C20 based SAN (Dual Controller)
50MBps EPL between the sites with "Replicate Active Replay" selected on the Compellent. (This is caught up to realtime MOST of the time)
We have a recovery plan set up to recover 5 VMs off of three protection groups. Two of the VMs use RDMs, the rest are all VMDK based. Compellent Enterprise manager is running at the recovery site.
When I go to test this recovery plan, about one out of every 7 or 8 times it completes without issue.
About half the time it will fail on one of the three "Prepare Storage" tasks with the error "Error: Failed to connect to management system address while executing 'testFailover' command.", however the other two storage tasks will complete successfully. (This seems to me to be a problem with the communication between SRM and the SRA, or the SRA not liking to handle multiple requests simultaneously, but I can't prove it.)
The other half of the time I get a failure on some of the low priority VMs that states one of the following:
- "Failed to recover datastore:",
- "Error: Failed to recover RDM device: Failed to failover LUN '16'."
- "Error: Failed to recover RDM device: Failed to failover LUN '100'." (There isn't even a LUN 100)
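Since the failures are intermittent, one way to pin down the pattern (for example, whether the bogus LUN numbers only show up in runs where the SRA connection also failed) would be to tally the errors across many test runs. Below is a minimal Python sketch of that idea. It assumes the error text from each run has been copied by hand into a list of strings; the category names and the functions `classify` and `tally` are made up for illustration and are not part of SRM or the Compellent SRA.

```python
import re
from collections import Counter

# Patterns are built from the error messages quoted above.
# The category names on the left are hypothetical labels for this sketch.
PATTERNS = [
    ("sra_connect", re.compile(r"Failed to connect to management system address")),
    ("rdm_failover", re.compile(r"Failed to recover RDM device: Failed to failover LUN '(\d+)'")),
    ("datastore", re.compile(r"Failed to recover datastore")),
]


def classify(message):
    """Return (category, lun) for one error line; lun is None if not present."""
    for name, pattern in PATTERNS:
        match = pattern.search(message)
        if match:
            lun = match.group(1) if match.groups() else None
            return name, lun
    return "other", None


def tally(messages):
    """Count errors per category and per LUN number across all messages."""
    counts = Counter()
    luns = Counter()
    for message in messages:
        name, lun = classify(message)
        counts[name] += 1
        if lun is not None:
            luns[lun] += 1
    return counts, luns


if __name__ == "__main__":
    # Sample input: the messages from this post, as if pasted from test runs.
    sample = [
        "Error: Failed to connect to management system address while executing 'testFailover' command.",
        "Error: Failed to recover RDM device: Failed to failover LUN '16'.",
        "Error: Failed to recover RDM device: Failed to failover LUN '100'.",
        "Failed to recover datastore:",
    ]
    counts, luns = tally(sample)
    print(dict(counts))
    print(dict(luns))
```

Keeping one list per test run and comparing the tallies would show whether the "Prepare Storage" failures and the LUN failures ever occur in the same run, which is the kind of evidence that might help support narrow this down.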
Compellent states that everything looks good. VMware was troubleshooting the issue and says it might be because we don't have multiple paths to the storage on the DR site. However, when I look at the storage while connected to the VC server, all the LUNs show up. I enabled multipathing at the DR site via FC/iSCSI and we're still getting the same errors at random. The VMware engineer is still looking into this but hasn't come up with much yet. I've upped the HBA rescan count on the recovery site to 2 scans and that seemed to help a little, but it's not a surefire fix.
I don't know where to go from here. The fact that it works SOME of the time is what really gets to me. When it works, it works well and does exactly what we are looking for!! Something has to be configured correctly for it to work at all, but I can't tell what to check next. It's really putting a crimp in our project.
I do always have the option of telling my CIO that we will be ready for one out of 7 disasters, but I don't think that's going to go over very well :-)
Anyone been here before or worked through these types of issues?