Fix for EMC XtremIO complete failure due to Eaton BBU / UPS issues

Oh yes, you know the drill: EMC (now Dell/EMC) never overlooks a chance to make a multi-hundred-thousand dollar disk array depend on a $1000 battery backup, or a $16 lead acid battery for that matter. Seriously, with XtremIO (Gen 2 in this case; not sure about future iterations or the XIO2), if you have the misfortune of both of the included standby UPSes experiencing issues simultaneously, your array will shut down and refuse to restart. Anything that would put the UPS into degraded mode can cause this.

But wait, there’s more: as shipped from at least 2015 through 2016, this iteration of XtremIO with Eaton 5P 1550 UPSes is a ticking time bomb, because the UPS has a battery lifetime counter that counts down to zero and, upon reaching zero, puts the UPS into a degraded state. I had the pleasant experience of both UPSes doing this within four hours of each other, so an XtremIO array running production traffic decided to just shut itself down in the middle of the day to protect me from data loss ROFL. I suspect both Eaton UPSes were manufactured four hours apart, or perhaps clock drift just left us where we ended up, but seriously, both reached battery lifecycle expiry four hours apart, went to degraded state, and the array then shut both storage controllers in the XBrick down “safely” to protect me from myself.

You really can’t make this fucking shit up…

The best part about this is that the batteries had simply reached a pre-determined time limit after which Eaton believes they should no longer be relied on; they hadn’t failed a self test. That was all it took to shut the array down.

Well this shit show keeps getting better, so strap yourself in.  In the version of firmware that EMC ships these with, you cannot reset the battery lifecycle counter.  Seriously, EMC’s fix is to just replace the entire unit any time the batteries need replacement.  Now that we know how important the UPS is to XIO even functioning, that seems a step crazier than necessary, because if you’re in the process of replacing one and simply bump the serial cord out of the other, the array will go to fully degraded and shut down.

There’s more. Eaton designed this pile of crap rack mount UPS to require removal from the rack to replace the batteries; ROFL. You have to pull the thing halfway out of the rack, which means disconnecting all the cables (it has to be powered off to replace batteries regardless, WTF, no hot swap), and bend a metal piece out of the way to complete the operation.

However, whether your array is down due to battery failure, or battery lifecycle counter reaching zero, you still have extra ridiculous steps to perform.  You’re going to need to upgrade the firmware to version 25 or above to gain the ability to reset the lifecycle counter, and then downgrade back to the version you have, followed by a factory reset.
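As a rough aid for that version check, here is a minimal Python sketch that pulls the trailing build number out of an Eaton firmware string. It assumes the "version" quoted in this post is the last dotted field (so 02.08.0016 would be version 16 and 03.18.0035 version 35); that mapping is only a reading of the version strings seen here, and the names fw_build, has_reset_option, and MIN_RESET_BUILD are made up for illustration.

```python
MIN_RESET_BUILD = 25  # "version 25 or above" per this post; assumed to be the last field


def fw_build(version: str) -> int:
    """Return the trailing build field of a string like '02.08.0016' as an int."""
    parts = version.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected firmware version format: {version!r}")
    return int(parts[2])


def has_reset_option(version: str) -> bool:
    """Guess whether this firmware is new enough to offer the lifecycle reset."""
    return fw_build(version) >= MIN_RESET_BUILD
```

Treat the result as a hint only: whether a given build actually exposes the reset menu is something to confirm on the UPS itself.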

To start that process, first check what version of firmware is on your BBUs via the XtremIO XMS command line interface.  You’ll want to SSH into it as xmsadmin, and then log in as the ‘tech’ user.  You can find the passwords to both on my page here.  SSH in and run “show-bbus”:

xmcli (tech)> show-bbus
Name   Index Model         Serial-Number Power-Feed State   Connectivity-State Enabled-State Input Battery-Charge BBU-Load Voltage FW-Version Part-Number Brick-Name Index Cluster-Name Index Outlet1-Status Outlet2-Status
X1-BBU 1     Eaton 5P 1550 Gxxxxxxxxx    PWR-A      healthy connected          enabled       on    92             24       204     02.08.0016 078-000-122 X1         1     EMC-XIO-3    1     on             on
X2-BBU 2     Eaton 5P 1550 Gxxxxxxxxx    PWR-B      healthy connected          enabled       on    90             20       204     02.08.0016 078-000-122 X1         1     EMC-XIO-3    1     on             on
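If you want to record the firmware version and charge level before touching anything, a small Python sketch like the following can turn that output into records. It assumes the table is fixed-width, with every value starting in the same column as its header (true for the sample above, not verified for other XMS releases), and parse_show_bbus is a hypothetical helper name, not an XMS tool. Paste only the header and data rows, not the xmcli prompt line.

```python
import re

# Sample copied from the show-bbus output above.
SAMPLE = """\
Name   Index Model         Serial-Number Power-Feed State   Connectivity-State Enabled-State Input Battery-Charge BBU-Load Voltage FW-Version Part-Number Brick-Name Index Cluster-Name Index Outlet1-Status Outlet2-Status
X1-BBU 1     Eaton 5P 1550 Gxxxxxxxxx    PWR-A      healthy connected          enabled       on    92             24       204     02.08.0016 078-000-122 X1         1     EMC-XIO-3    1     on             on
X2-BBU 2     Eaton 5P 1550 Gxxxxxxxxx    PWR-B      healthy connected          enabled       on    90             20       204     02.08.0016 078-000-122 X1         1     EMC-XIO-3    1     on             on
"""


def parse_show_bbus(text):
    """Parse fixed-width show-bbus output into a list of dicts, one per BBU."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Each header word marks the column where that field starts.
    cols = [(m.group(), m.start()) for m in re.finditer(r"\S+", header)]
    records = []
    for row in rows:
        rec = {}
        for i, (name, start) in enumerate(cols):
            end = cols[i + 1][1] if i + 1 < len(cols) else len(row)
            # "Index" appears three times; suffix the repeats to keep keys unique.
            key = name if name not in rec else f"{name}#{i}"
            rec[key] = row[start:end].strip()
        records.append(rec)
    return records


for bbu in parse_show_bbus(SAMPLE):
    print(bbu["Name"], bbu["FW-Version"], bbu["Battery-Charge"])
```

Slicing by header position rather than splitting on whitespace keeps multi-word values such as "Eaton 5P 1550" in one field.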

You’ll see the firmware listed as version 02.08.0016, which is really Eaton’s OEM version, no EMC special at this point.  However, the reason I say to look for this is because your array will refuse to power on if the Eaton firmware is not the version EMC wants it to be at.  That being the case, you need to ensure you have the version it is currently running before starting this process, as you’ll need to downgrade back to that version afterward.  I discovered this issue by way of the following thread:

https://community.spiceworks.com/topic/2105650-eaton-5p-5px-battery-countdown

So, hunt around for the right firmware, and it may not be easy to find.  If you’re lucky enough to be on the same “02.08.0016” version that I am, here’s the link I found it at:

https://www.touslesdrivers.com/index.php?v_page=23&v_code=49143&v_langue=en

The file name is:  Eaton_5P_LVHV11_E0_V02.08.0016_TL00.zip

Here’s my local copy from the above site but obviously you should try to get it from Eaton direct as I won’t vouch for it:  Eaton_5P_LVHV11_E0_V02.08.0016_TL00

Eaton’s install tool does perform some level of validation. Next, you need both the current version (26) and the firmware install tool:

https://powerquality.eaton.com/Support/Software-Drivers/Downloads/5P-UPS-firmware.asp?cx=55

On the off chance Eaton ultimately removes the tools, or releases a version that can’t be downgraded from, here are my local copies of things:

Version 35 (unfortunately never saved 26, which got the job done):

https://www.ispcolohost.com/publicshare/eaton_5p_lvhv11_e0_v03_18_0035_tl00.zip

Update guide (ignore the part about DO NOT DOWNGRADE lol):

https://www.ispcolohost.com/publicshare/Eaton_setUPS_5P_firmware_upgrade__rev_03-637.pdf

Upgrade Utility for Win 7/10/11:

https://www.ispcolohost.com/publicshare/setups_win_2_1_0822.zip

So, here is the sequence for the failed UPS:

1. Upgrade it to the current version.
2. Make use of the battery lifecycle reset option to clear that countdown.
3. Replace the batteries at the same time, regardless of why you’re doing this, given they’ll fail eventually anyway and you can replace them for barely more than $100 as the six packs only cost $15-25 each depending on where you source them.
4. We’re not done yet. Now downgrade it back to the prior version.
5. Perform a factory reset of the UPS via front panel.
6. Power it off, then power it back on.
7. Reconnect it exactly as it was connected before, i.e. serial to the same XtremIO Storage Controller.

You should now have a working UPS that reports as good to the XIO SC. If it is not charged to at least 70%, the XIO will still refuse to boot, so that’s fun. Wait for it to charge, and then finally your first storage controller should come back online.

Now do this same process on the second UPS.

If you think you’re going to be slick and just bypass the UPSes, you can’t. EMC won’t let you power your own array up in the manner you want. They feel a cheap, poorly designed rackmount UPS is going to be more reliable than modern data center power. At least one SC must see a UPS that reports a good status, runs known firmware, and holds at least 70% charge, or the array will refuse to boot.
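To make those boot conditions concrete, here is a small Python sketch of the checks as described in this post. It is not EMC’s actual logic: the field names match the show-bbus columns, "healthy" is assumed to be the good status string, and bbu_blockers, array_can_boot, and MIN_CHARGE_PERCENT are hypothetical names.

```python
MIN_CHARGE_PERCENT = 70  # threshold as described in this post


def bbu_blockers(bbu: dict, expected_fw: str) -> list:
    """Return the reasons (if any) this BBU would not satisfy the array."""
    problems = []
    if bbu.get("State") != "healthy":
        problems.append(f"state is {bbu.get('State')!r}, not 'healthy'")
    if bbu.get("FW-Version") != expected_fw:
        problems.append(f"firmware {bbu.get('FW-Version')!r} is not {expected_fw!r}")
    try:
        charge = int(bbu.get("Battery-Charge", ""))
    except (TypeError, ValueError):
        charge = None
    if charge is None or charge < MIN_CHARGE_PERCENT:
        problems.append(f"charge {charge} is below {MIN_CHARGE_PERCENT}%")
    return problems


def array_can_boot(bbus: list, expected_fw: str) -> bool:
    """True if at least one BBU passes every check."""
    return any(not bbu_blockers(b, expected_fw) for b in bbus)


good = {"State": "healthy", "FW-Version": "02.08.0016", "Battery-Charge": "92"}
low = {"State": "healthy", "FW-Version": "02.08.0016", "Battery-Charge": "55"}
print(bbu_blockers(low, "02.08.0016"))
print(array_can_boot([low, good], "02.08.0016"))
```

Returning a list of reasons rather than a bare True/False makes it obvious which of the three conditions you are still waiting on, which is usually the charge level.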

So there you have it: a poorly designed UPS time bomb can take your entire storage array down. Thanks EMC!

13 Replies to “Fix for EMC XtremIO complete failure due to Eaton BBU / UPS issues”

  1. Another Victim

    You saved us today man. Exact same issue and your solution worked like a charm.

    EMC wake up guys! Seriously they should be sued.

  2. Behroozi

    Hi,
    I upgraded the UPS 5P 1550i to version 02.14.0026 and connected to it through USB, but I could not find any counter for battery lifetime. Would you please explain this parameter more specifically? I used the Eaton SetUPS tool.
    Best Regards
    Behroozi

    • Your Mom Post author

      I’m not sure if the counter is visible; did you try the “Battery lifecycle reset” menu choice? Should clear the counter either way, then you downgrade back to the .0016 version.

      • Your Mom Post author

        You have to use version 2.1 build 0822 of their upgrade utility; versions 2.5 and later have version validation that does not allow you to downgrade. Version 2.1 just blindly installs whatever you tell it to.

  3. Chuck Norris

    Hi,
    Thank you for the post. A few years ago I was able to restart an XtremIO and retrieve data.
    I saved firmware 26; if you want to add it to your post, just send me a link and I will upload it.
    This time we will change the battery pack inside the BBUs.
    I will reset the counter after the battery change.
    This will save our Lab XtremIO :)

    Regards

  4. T.Minkov

    Hello,
    I had exactly the same experience: both BBUs down. Sadly I didn’t find this post earlier, but that’s life.
    I will summarize the situation:
    1. Every new/replaced/refurbished BBU comes with FW 02.08.0016 (see the * items under 2 for why)
    2. FW 02.08.0016 has these “features”:
    – * it is the only one that works with the XtremIO monitoring system
    – a 4-year timebomb (actually 48 months and 20 days)
    – you can find the remaining days with Eaton SetUPS tool, menu Reports
    – you can reset the timer in any of three ways:
    = * reflash 02.08.0016
    = @02.08.0016 _AND_ Active-Alarm – fully shutdown/disconnect and reconnect/start BBU
    = upgrade to latest FW (3.18 will not reset timer) _AND_ Active-Alarm – just hit [Ok] on the warning on the BBU screen. Then downgrade (SetUPS 2.1) to 02.08.0016

    UltraSummarize:
    So every new/replaced/refurbished BBU comes with (reflashed) 02.08.0016 because it resets timer and it is the only FW working with XtremIO monitoring.

    Bad experience:
    If two BBUs were flashed with close “manufacturing” dates, you end up with both BBUs in alarm and later shut down. And as you know, with the standard procedure you can’t replace one BBU while both are in an alarm state (the “please replace first another BBU” misunderstanding). You should replace one BBU manually (very risky) and then replace the second by the standard procedure.

    Good experience:
    If the XtremIO cluster is shut down, you can freely reflash 02.08.0016 to reset the timer and/or replace only the battery. This should also be possible (very risky) with the cluster running, but only one BBU at a time.

  5. Laura Funk

    Fix worked perfectly. Only difference was that the counter option disappeared in the new firmware. The roll back still worked fine even without changing anything on the new firmware.
