Fix for EMC XtremIO complete failure due to Eaton BBU / UPS issues

Oh yes, you know the drill: EMC (now Dell/EMC) never overlooks a chance to make a multi-hundred-thousand-dollar disk array dependent on a $1000 battery backup, or on a $16 lead-acid battery for that matter.  Seriously, with XtremIO (Gen 2 in this case; not sure about future iterations or the XIO2), if you have the misfortune of both of the included standby UPSes experiencing issues simultaneously, your array will shut down and refuse to restart.  Anything that puts a UPS into degraded mode can cause this.

But wait, there’s more.  As shipped between at least 2015 and 2016, this iteration of XtremIO with Eaton 5P 1550 UPSes is a ticking time bomb, because the UPS has a battery lifetime counter that counts down to zero, and upon reaching zero, the UPS goes into a degraded state.  I had the pleasant experience of both UPSes doing this within four hours of each other, so an XtremIO array running production traffic decided to just shut itself down in the middle of the day to protect me from data loss ROFL.  I suspect both Eaton UPSes were manufactured four hours apart, or perhaps clock drift just left us at the point we arrived at, but seriously, both reached battery lifecycle expiry four hours apart, went to degraded state, and the array then shut both storage controllers in the XBrick down “safely” to protect me from myself.

You really can’t make this fucking shit up…

The best part about this is that the batteries had simply reached a pre-determined time limit on when Eaton believes they should no longer be relied on, they hadn’t failed a self test.  That was all it took to shut the array down.

Well this shit show keeps getting better, so strap yourself in.  In the version of firmware that EMC ships these with, you cannot reset the battery lifecycle counter.  Seriously, EMC’s fix is to just replace the entire unit any time the batteries need replacement.  Now that we know how important the UPS is to XIO even functioning, that seems a step crazier than necessary, because if you’re in the process of replacing one and simply bump the serial cord out of the other, the array will go to fully degraded and shut down.

There’s more.  Eaton designed this pile of crap rack mount UPS to require partial removal from the rack to replace the batteries; ROFL.  You have to pull the thing halfway out of the rack, which means disconnecting all the cables (it requires powering off to replace batteries regardless WTF, no hot swap), and bend a metal piece out of the way to complete the operation.

However, whether your array is down due to battery failure, or battery lifecycle counter reaching zero, you still have extra ridiculous steps to perform.  You’re going to need to upgrade the firmware to version 25 or above to gain the ability to reset the lifecycle counter, and then downgrade back to the version you have, followed by a factory reset.

To start that process, first check what version of firmware is on your BBUs via the XtremIO XMS command line interface.  You’ll want to SSH into it as xmsadmin, and then log in as the ‘tech’ user.  You can find the passwords to both on my page here.  SSH in and run “show-bbus”:

xmcli (tech)> show-bbus
Name   Index Model         Serial-Number Power-Feed State   Connectivity-State Enabled-State Input Battery-Charge BBU-Load Voltage FW-Version Part-Number Brick-Name Index Cluster-Name Index Outlet1-Status Outlet2-Status
X1-BBU 1     Eaton 5P 1550 Gxxxxxxxxx    PWR-A      healthy connected          enabled       on    92             24       204     02.08.0016 078-000-122 X1         1     EMC-XIO-3    1     on             on
X2-BBU 2     Eaton 5P 1550 Gxxxxxxxxx    PWR-B      healthy connected          enabled       on    90             20       204     02.08.0016 078-000-122 X1         1     EMC-XIO-3    1     on             on

You’ll see the firmware listed as version 02.08.0016, which is really Eaton’s OEM version, no EMC special at this point.  However, the reason I say to look for this is because your array will refuse to power on if the Eaton firmware is not the version EMC wants it to be at.  That being the case, you need to ensure you have the version it is currently running before starting this process, as you’ll need to downgrade back to that version afterward.  I discovered this issue by way of the following thread:

https://community.spiceworks.com/topic/2105650-eaton-5p-5px-battery-countdown
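Since you have to downgrade back to exactly the version the array expects, it’s worth saving that version somewhere before touching anything.  Here’s a minimal Python sketch that pulls the firmware version for each BBU out of pasted “show-bbus” output.  It assumes the version always looks like NN.NN.NNNN, as it does in my output above, and that the BBU name is the first column; the function name and the sample text are just for illustration, this is not an EMC tool.

```python
import re

# Assumption: firmware versions look like NN.NN.NNNN, as in the sample output.
FW_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")


def parse_fw_versions(show_bbus_output: str) -> dict[str, str]:
    """Map each BBU name (first column) to the firmware version on its row."""
    versions = {}
    for line in show_bbus_output.splitlines():
        match = FW_PATTERN.search(line)
        if match is None:
            continue  # prompt, header, or blank line: no version on it
        name = line.split()[0]
        versions[name] = match.group(0)
    return versions


# Sample text pasted from an SSH session (serial numbers redacted).
SAMPLE = """\
xmcli (tech)> show-bbus
Name   Index Model         Serial-Number Power-Feed State   Connectivity-State Enabled-State Input Battery-Charge BBU-Load Voltage FW-Version Part-Number
X1-BBU 1     Eaton 5P 1550 Gxxxxxxxxx    PWR-A      healthy connected          enabled       on    92             24       204     02.08.0016 078-000-122
X2-BBU 2     Eaton 5P 1550 Gxxxxxxxxx    PWR-B      healthy connected          enabled       on    90             20       204     02.08.0016 078-000-122
"""

if __name__ == "__main__":
    for name, version in parse_fw_versions(SAMPLE).items():
        print(f"{name}: firmware {version}")
```

Save the result alongside your change notes so you know which firmware file to hunt for.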

So, hunt around for the right firmware, and it may not be easy to find.  If you’re lucky enough to be on the same “02.08.0016” version that I am, here’s the link I found it at:

https://www.touslesdrivers.com/index.php?v_page=23&v_code=49143&v_langue=en

The file name is:  Eaton_5P_LVHV11_E0_V02.08.0016_TL00.zip

Here’s my local copy from the above site but obviously you should try to get it from Eaton direct as I won’t vouch for it:  Eaton_5P_LVHV11_E0_V02.08.0016_TL00
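If you end up with copies of the firmware zip from more than one place, a quick sanity check is to confirm the archive is intact and compare hashes between the copies.  The sketch below uses only the Python standard library; the function names are made up for illustration.  I don’t have an official checksum from Eaton to compare against, so this only tells you whether two downloads are identical and whether the zip is corrupt, not whether the firmware is genuine.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def zip_is_intact(path: Path) -> bool:
    """True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False
```

Call `zip_is_intact(Path("Eaton_5P_LVHV11_E0_V02.08.0016_TL00.zip"))` on each copy, then compare the `sha256_of` values; if they differ, don’t flash either one until you know why.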

Their install tool does perform some level of validation.  Next, you need both the current version (26) and the firmware install tool:

https://powerquality.eaton.com/Support/Software-Drivers/Downloads/5P-UPS-firmware.asp?cx=55

So, upgrade your failed UPS to the current version, then make use of the battery lifecycle reset option to clear that countdown.  Replace the batteries at the same time, regardless of why you’re doing this, given they’ll fail eventually anyway and you can replace them for barely more than $100 as the six packs only cost $15-25 each depending on where you source them.

We’re not done yet.  Now downgrade it back to the prior version.  Perform a factory reset of the UPS via front panel.  Power it off.  Power it back on.  Reconnect it exactly as it was connected before, i.e. serial to the same XtremIO Storage Controller.

You should now have a working UPS that reports as good to the XIO SC.  If it is not charged to at least 70%, the XIO will still refuse to boot, so that’s fun.  Wait for it to charge, and then finally your first storage controller should come back online.

Now do this same process on the second UPS.

If you think you’re going to be slick and just bypass the UPSes, you can’t.  EMC won’t let you power your own array up in the manner you want.  They feel a cheap, poorly designed rackmount UPS is going to be more reliable than modern data center power.  At least one SC must see a UPS that reports a good status, runs the firmware version the array expects, and holds at least 70% charge, or the array will refuse to boot.
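To spell that rule out, here’s a small Python sketch of the boot condition as I observed it.  To be clear, this is my reading of the array’s behavior, not EMC’s actual logic: the class, the function names, and the idea that “good status” means the “healthy” state shown by show-bbus are all my assumptions, and the expected firmware is simply the version my array shipped with.

```python
from dataclasses import dataclass

MIN_CHARGE_PERCENT = 70      # threshold I observed on my array
EXPECTED_FW = "02.08.0016"   # assumption: the version your array shipped with


@dataclass
class BbuStatus:
    """Hypothetical snapshot of one row of show-bbus output."""
    name: str
    state: str
    fw_version: str
    charge_percent: int


def bbu_is_acceptable(bbu: BbuStatus, expected_fw: str = EXPECTED_FW) -> bool:
    """A BBU counts only if it is healthy, on the expected firmware, and charged."""
    return (
        bbu.state == "healthy"
        and bbu.fw_version == expected_fw
        and bbu.charge_percent >= MIN_CHARGE_PERCENT
    )


def array_may_boot(bbus: list[BbuStatus], expected_fw: str = EXPECTED_FW) -> bool:
    """The array needs at least one acceptable BBU before it will come up."""
    return any(bbu_is_acceptable(b, expected_fw) for b in bbus)


if __name__ == "__main__":
    repaired = BbuStatus("X1-BBU", "healthy", "02.08.0016", 92)
    still_on_new_fw = BbuStatus("X2-BBU", "healthy", "02.14.0026", 95)
    print(array_may_boot([repaired, still_on_new_fw]))
```

The point of writing it this way is that all three conditions have to hold on the same UPS; a fully charged unit still on the newer firmware doesn’t count.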

So there you have it: a poorly designed UPS time bomb can take your entire storage array down.  Thanks EMC!

5 Replies to “Fix for EMC XtremIO complete failure due to Eaton BBU / UPS issues”

  1. Another Victim

    You saved us today man. Exact same issue and your solution worked like a charm.

    EMC wake up guys! Seriously they should be sued.

  2. Behroozi

    Hi,
    I upgraded the UPS 5P 1550i to version 02.14.0026 and connected to it through USB, but I could not find any counter for battery lifetime. Would you please explain this parameter more specifically? I used the Eaton SetUPS tool.
    Best Regards
    Behroozi

    • Your Mom Post author

      I’m not sure if the counter is visible; did you try the “Battery lifecycle reset” menu choice? Should clear the counter either way, then you downgrade back to the .0016 version.

      • King

        We’re also working on this same exact project – when you mention the ‘battery lifecycle reset option’ is this something I’d see while in SetUPS or in a different management utility? Or perhaps the CLI?

        • Your Mom Post author

          The battery lifecycle reset becomes an available menu option on the front panel of the unit when (temporarily) running the non-EMC upgraded code.
