Replacing a failed Thales nShield Connect HSM

Recently had a Thales nShield Connect 6000 fail.  Replacing one is fairly easy, but there are a lot of steps, so I figured I’d write them out here in case anyone else needs to do this and hasn’t yet.

How do you know if your HSM has failed?  Obviously if it’s simply unreachable that’s a good start.  In my case the actual HSM portion of the nShield Connect failed, not the outer ‘box’; I believe the HSM is a PCI card inside the box.  The unit began syslogging info like this:

Jan 3 04:05:34 nethsm hardserver[1111]: nFast server: Serious error, trying to continue: Operating system call failed: write from device #1 nFast PCI device, bus 2, slot 0. failed, No such device or address

Jan 3 04:05:42 nethsm hsglue: Displaying ‘HSM Failed’ message on the front panel.

If you were to run the ‘enquiry’ command, you’d see an unreachable or failed unit.

I received the new unit, so it’s time to replace the failed one.  If you’re running redundant HSMs and you’ve installed in the typical, ‘recommended’ manner, where your Remote File System (RFS) server is also a transaction client, you will probably need to make some network or application changes to get through the replacement without an outage.  The issue is that the new HSM needs to be connected to the RFS and have some work done on it before you actually want transactions hitting it, and if your RFS is also a transaction client, transactions are going to hit the new HSM between when it’s connected and when it’s actually ready.  So if that’s your setup, make sure to isolate the RFS server from transaction traffic until the new unit is ready.  My ‘how to install the nShield Connect 6000’ article documents how I set up, which is having a third system act as the RFS.  So, take care of this part first, if applicable, then continue.
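As a sketch of that isolation: if your application accepts transaction requests over TCP, a temporary iptables rule on the RFS box will keep traffic off of it while you work.  The port here, 8443, is purely illustrative; substitute whatever your application actually listens on:

    # temporarily reject incoming transaction traffic (8443 is just an example port)
    iptables -I INPUT -p tcp --dport 8443 -j REJECT
    # ...do the replacement work, then remove the rule...
    iptables -D INPUT -p tcp --dport 8443 -j REJECT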

  1. Step one is to disconnect the dead HSM from all clients.  My clients are Linux, but I believe the commands are similar on Windows so you can probably figure it out.  To do this, first run enquiry to make sure you know the correct module number for the failed HSM; that also gives you the details you need to run the removal.  It will look something like this; my dead module is #1:
    Module #1:
     enquiry reply flags Failed
     enquiry reply level Six
     serial number unknown
     mode operational
     version 0.0.0
     speed index 0
     rec. queue 0..0
     level one flags none
     version string unknown
     checked in 0000000000000000 Wed Dec 31 19:00:00 1969
     level two flags none
     max. write size 0
     level three flags none
     level four flags none
     module type code 0
     product name unknown
     device name unknown
     EnquirySix version 3
     impath kx groups
     feature ctrl flags none
     features enabled none
     version serial 0
     connection status InProgress
     connection info esn = 1234-ABCD-4567; addr = INET/192.0.2.1/9004; ku hash = asdfasdfasdfasdf, mech = Any; time-limit = 24h; data-limit = 8MB
     image version unknown
     hardware status unknown
    

    From the above, you’ll notice it’s clearly different from a healthy, online unit:

    Module #2:
     enquiry reply flags none
     enquiry reply level Six
     serial number 9876-EFGH-5432
     mode operational
     version 2.61.2
     speed index 4512
     rec. queue 19..152
     level one flags Hardware HasTokens
     version string 2.61.2cam2 built on Sep 3 2015 16:01:11, 3.34.1cam3
     level two flags none
     max. write size 8192
     level three flags KeyStorage
     module type code 7
     device name Rt2
     EnquirySix version 6
     impath kx groups DHPrime1024 DHPrime3072
     feature ctrl flags LongTerm
     features enabled StandardKM
     version serial 26
     connection status OK
     connection info esn = 9876-EFGH-5432; addr = INET/192.0.2.2/9004; ku hash = erqerqwqvwqvewvq5234234234, mech = Any; time-limit = 24h; data-limit = 8MB
     image version 12.23.1cam3
     max exported modules 3
     rec. LongJobs queue 18
     SEE machine type PowerPCSXF
     supported KML types DSAp1024s160 DSAp3072s256
     using impath kx grp DHPrime3072
     hardware status OK

    So we know module #1 is the one we want gone.  From the enquiry output, we need its electronic serial number (ESN), 1234-ABCD-4567; its old IP address, 192.0.2.1; and its key hash, asdfasdfasdfasdf.  Run this command on each client system to remove it:

    nethsmenroll -f -r 192.0.2.1 1234-ABCD-4567 asdfasdfasdfasdf

    The -f is for force, which is necessary when a given module is not reachable.  The -r tells it to remove the module.

    After removal, enquiry should report:

    Module #1:
    Not Present
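    If you have more than a couple of client systems, a quick loop saves some typing.  This sketch assumes SSH access as root and the nFast binaries in the default /opt/nfast location; the hostnames are placeholders for your actual clients:

    for h in client1 client2 client3; do
      ssh root@$h /opt/nfast/bin/nethsmenroll -f -r 192.0.2.1 1234-ABCD-4567 asdfasdfasdfasdf
    done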
  2. Perform the above on the RFS system as well, if it’s independent of the transactional systems.
  3. On the RFS, I’m going to clean out the old data related to the failed HSM because I intend to re-use the same IP address for the new HSM.  This cleanup includes:
    1. Remove references to the old HSM’s ESN (1234-ABCD-4567) and IP address (192.0.2.1) from the /opt/nfast/kmdata/config/config file.  There will probably be around ten blocks of settings related to the failed HSM, and not every block contains both the ESN and the IP, so you need to search for each (a grep example follows this cleanup list).
    2. Remove the /opt/nfast/kmdata/hsm-1234-ABCD-4567 directory.
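    Before editing, a quick grep will show you every block that mentions the failed unit (path per a default Linux install):

    grep -n -e '1234-ABCD-4567' -e '192.0.2.1' /opt/nfast/kmdata/config/config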
  4. Go ahead and install the new hardware.  In my case, I’m re-using the same IP address as the failed unit, so no firewall changes are needed.  Also, specific to my setup, I make use of the auto-push and remote-upgrade features, so I need to enable those via the front panel.  I then add the IP address of the RFS server to the new nShield, also via the front panel.
  5. If the network between the HSM and the RFS is trusted, run
    anonkneti 192.0.2.1

    to get the ESN and KNETI hash back from the new HSM.  If the network is not trusted, read the ESN and hash off the HSM’s front panel instead.

  6. Now you can permit the new HSM to connect to your RFS via:
    rfs-setup --force 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f

    where the arguments are, in order, the new unit’s IP, ESN, and hash.
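    On a trusted network, you can also let anonkneti supply the ESN and hash inline, the same shortcut used for enrollment in step 14:

    rfs-setup --force 192.0.2.1 `anonkneti 192.0.2.1`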

  7. Now enroll the new HSM (include the -p flag!):
    nethsmenroll -p 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f

    again, IP, ESN, hash.  The -p requests privileged mode, which is only needed from your RFS server, whose IP you previously entered on the front panel.  On your regular client nodes you won’t use -p.

  8. Using the HSM’s front panel, pull the new config, or turn on the auto-update config option so it happens on its own.
  9. The new HSM probably doesn’t have the same firmware as your other(s), and enquiry will tell you that via ‘unsupported firmware’, so you may need to do an upgrade.  If that’s the case, use:
    nethsmadmin --list-images 192.0.2.100

    where that IP is NOT the HSM, it’s your RFS.  Pick the image name you want, and BE SURE TO SET THE MODULE NUMBER IN THE FOLLOWING PROPERLY:

    nethsmadmin -m 1 -i nethsm-firmware/12.23.1cam3/nCx3N.nff 

    So in this case, I’m telling it to install the firmware on module #1.  Alternatively, from the new HSM’s front panel, you can have it list the available images and pick the one to upgrade to.  Keep in mind this process takes a very long time, possibly 30 minutes or more, and the unit will appear unresponsive while it works.  It reminds you of this when pushing out the image:

    Image upgrade completed. Please wait for appliance to reboot.
    Please wait for approximately half an hour for the appliance to internally upgrade.
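    Once the appliance is back up, you can confirm the module took the new image; remember to use the right module number:

    enquiry -m 1 | grep 'image version'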
  10. Switch the HSM to pre-initialization mode, either via the front panel, or with this command from the RFS (your privileged client) USING THE CORRECT MODULE NUMBER:
    nopclearfail -I -m 1
  11. Now round up a quorum of cards from your administrative card set, which may mean having a bunch of staff called in to stand around the HSM.  From the front panel, menu 3-2-2, load your security world.  It will ask you to start feeding it cards, and once a quorum has been reached, you’re good.  It will write its config back to the RFS at this point.
  12. On the RFS, copy the new /opt/nfast/kmdata/hsm-NEW-ESN/config/config file to config.new in the same directory, then add your other authorized clients in the [hs_clients] section (a sample entry follows this step).  Alternatively, add them via the HSM’s front panel and it will update the file for you.  If you go the edit-the-file route, after changing your config.new, push the config out to the HSM with cfg-pushnethsm:
    cfg-pushnethsm --address=192.0.2.1 -f -n /opt/nfast/kmdata/hsm-NEW-ESN/config/config.new

    then remove config.new once the config shows as having been updated, since the HSM will push the config back out again after the change has been applied.
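    For reference, an [hs_clients] entry on my units looks roughly like the below.  Field names can vary by firmware version and I’m writing this from memory, so copy an existing entry out of your own config and just change the values; the addr and keyhash here are placeholders:

    [hs_clients]
    addr=192.0.2.50
    clientperm=unpriv
    keyhash=a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
    timelimit=24h
    datalimit=8MB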

  13. On your other HSM clients, now run rfs-sync -U to have them update their /opt/nfast/kmdata/local data to reflect the new module_ESN file.
  14. Use nethsmenroll on the other HSM clients to connect them to the new HSM now that it has the security world loaded:
    nethsmenroll 192.0.2.1 `anonkneti 192.0.2.1`

    or of course the long version will work too:

    nethsmenroll 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f
  15. Check that enquiry looks good.  You should be done.
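    A quick way to eyeball just the new unit is to limit enquiry to its module number and check the status lines; you want to see ‘connection status OK’ and ‘hardware status OK’:

    enquiry -m 1 | grep -E 'connection status|hardware status'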
