SRM works randomly, errors out with 2 different errors most of the time

We are in the midst of configuring SRM 4.1.2.

We have a very basic configuration:

Protected Site:

3 x Dell R710

ESXi 4.1u1 / vSphere 4.1u1 (VC Server)

Dual channel QLogic FC HBA multipathed via 2 FC switches

Compellent C30 based SAN (Dual controller)

Recovery Site:

4 x Dell 2950

ESX 4.1u1 / vSphere 4.1u1 (VC Server)

Single Channel QLogic FC HBA (Multipathed to Compellent storage via iSCSI)

Compellent C20 based SAN (Dual Controller)

50MBps EPL between the sites with "Replicate Active Replay" selected on the Compellent. (This is caught up to realtime MOST of the time)

We have a recovery plan set up to recover 5 VMs from three protection groups. Two of the VMs use RDMs; the rest are all VMDK based. Compellent Enterprise Manager is running at the recovery site.

When I go to test this recovery plan, about one out of every 7 or 8 times it completes without issue.

About half the time it will fail on one of the three "Prepare Storage" tasks with the error "Error: Failed to connect to management system address while executing 'testFailover' command.", however the other two storage tasks will complete successfully. (This seems to me to be a problem with the communication between SRM and the SRA, or the SRA not liking to handle multiple requests simultaneously, but I can't prove it.)
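One rough way to test the "simultaneous requests" theory outside of SRM is to probe the management address directly, once with several connections opened at the same time and once with them opened one after another. The sketch below is only that: `try_connect` and `probe` are names made up for illustration, the host and port are whatever your Enterprise Manager actually listens on (not stated here), and a plain TCP connect only shows reachability, not whether the SRA's own calls would succeed.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def try_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe(host, port, attempts=3, concurrent=True, timeout=5.0):
    """Open `attempts` connections and return how many of them failed.

    With concurrent=True the connections are started in parallel threads;
    otherwise they are made one at a time.
    """
    if concurrent:
        with ThreadPoolExecutor(max_workers=attempts) as pool:
            results = list(
                pool.map(lambda _: try_connect(host, port, timeout), range(attempts))
            )
    else:
        results = [try_connect(host, port, timeout) for _ in range(attempts)]
    return results.count(False)
```

Calling `probe(host, port, attempts=3, concurrent=True)` and then the same with `concurrent=False` a few dozen times each would show whether failures cluster in the parallel runs. Three attempts mirrors the three "Prepare Storage" tasks; if both modes connect cleanly every time, the problem is more likely inside the SRA or SRM than on the network path.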

The other half of the time I get a failure on some of the low priority VMs that states one of the following:

"Failed to recover datastore:",
"Error: Failed to recover RDM device: Failed to failover LUN '16'."
"Error: Failed to recover RDM device: Failed to failover LUN '100'." (There isn't even a LUN 100)

Compellent states that everything looks good. VMware has been troubleshooting the issue and says it might be because we don't have multiple paths to the storage at the DR site. However, when I look at the storage while connected to the VC server, all the LUNs show up. I enabled multipathing at the DR site via FC/iSCSI and we're still getting the same errors randomly. The VMware engineer is still looking into this but hasn't come up with much yet. I've upped the HBA rescan count on the recovery site to 2 scans and that seemed to help a little, but it's not a sure-fire fix.

I don't know where to go from here. The fact that it works SOME of the time is what really gets to me. When it works, it works well and does exactly what we are looking for!! Something has to be configured correctly for it to work at all, so I can't tell what to change next. It's really putting a cramp in our project.

I do always have the option of telling my CIO that we will be ready for one out of 7 disasters, but I don't think that's going to go over very well :-)

Anyone been here before or worked through these types of issues?
