Recently had a Thales nShield Connect 6000 fail. Replacing one is fairly easy, but there are a lot of steps, so I figured I’d write them out here in case anyone else needs to do this and hasn’t yet.
How do you know if your HSM has failed? Obviously if it’s simply unreachable that’s a good start. In my case the actual HSM portion of the nShield Connect failed, not the outer ‘box’; I believe the HSM is a PCI card inside the box. The unit began syslogging info like this:
Jan 3 04:05:34 nethsm hardserver[1111]: nFast server: Serious error, trying to continue: Operating system call failed: write from device #1 nFast PCI device, bus 2, slot 0. failed, No such device or address
Jan 3 04:05:42 nethsm hsglue: Displaying 'HSM Failed' message on the front panel.
If you were to run the ‘enquiry’ command, you’d see an unreachable or failed unit.
I received the new unit, so it was time to replace the failed one. If you’re running redundant HSMs and you’ve installed them in the typical, ‘recommended’ manner, you will probably need to make some network or application changes to do the replacement without an outage. The recommended manner has your Remote File System (RFS) server also acting as a transaction client. The issue is that the new HSM needs to be connected to the RFS and have some work done on it before you actually want transactions hitting it, and if your RFS is also a transaction client, transactions will be hitting the new HSM between when it’s connected and when it’s actually ready. So if that’s your setup, make sure to isolate the RFS server from transaction traffic until the new unit is ready. My ‘how to install the nShield Connect 6000’ article documents how I set mine up, with a third system acting as the RFS. Take care of this part first, if applicable, then continue.
- So, we need to disconnect the dead HSM from all clients. My clients are Linux, but I believe the commands are similar on Windows, so you can probably adapt them. To do this, first run enquiry to make sure you know the correct module number for the failed HSM; its output also gives you the details you need to run the removal. It will look something like this; my dead module is #1:
Module #1:
 enquiry reply flags  Failed
 enquiry reply level  Six
 serial number        unknownunknown
 mode                 operational
 version              0.0.0
 speed index          0
 rec. queue           0..0
 level one flags      none
 version string       unknown
 checked in           0000000000000000 Wed Dec 31 19:00:00 1969
 level two flags      none
 max. write size      0
 level three flags    none
 level four flags     none
 module type code     0
 product name         unknown
 device name          unknown
 EnquirySix version   3
 impath kx groups
 feature ctrl flags   none
 features enabled     none
 version serial       0
 connection status    InProgress
 connection info      esn = 1234-ABCD-4567; addr = INET/192.0.2.1/9004; ku hash = asdfasdfasdfasdf, mech = Any; time-limit = 24h; data-limit = 8MB
 image version        unknown
 hardware status      unknown
From the above, you’ll notice it’s clearly different from a healthy online unit: data is missing, the reply flags show Failed, and so on. Compare it with a good module:
Module #2:
 enquiry reply flags  none
 enquiry reply level  Six
 serial number        9876-EFGH-5432
 mode                 operational
 version              2.61.2
 speed index          4512
 rec. queue           19..152
 level one flags      Hardware HasTokens
 version string       2.61.2cam2 built on Sep 3 2015 16:01:11, 3.34.1cam3
 level two flags      none
 max. write size      8192
 level three flags    KeyStorage
 module type code     7
 device name          Rt2
 EnquirySix version   6
 impath kx groups     DHPrime1024 DHPrime3072
 feature ctrl flags   LongTerm
 features enabled     StandardKM
 version serial       26
 connection status    OK
 connection info      esn = 9876-EFGH-5432; addr = INET/192.0.2.2/9004; ku hash = erqerqwqvwqvewvq5234234234, mech = Any; time-limit = 24h; data-limit = 8MB
 image version        12.23.1cam3
 max exported modules 3
 rec. LongJobs queue  18
 SEE machine type     PowerPCSXF
 supported KML types  DSAp1024s160 DSAp3072s256
 using impath kx grp  DHPrime3072
 hardware status      OK
We now know module #1 is the one we want gone. Keep in mind the module number may differ from one client system to the next, depending on how you installed them and whether you specified a module number, so do NOT assume commands involving a module number will be the same on all of your client systems.
From the enquiry output, we need the failed unit’s electronic serial number (ESN), 1234-ABCD-4567, its old IP address, 192.0.2.1, and its key hash (the ‘ku hash’ value), asdfasdfasdfasdf. Run this command on each client system to remove it:
nethsmenroll -f -r 192.0.2.1 1234-ABCD-4567 asdfasdfasdfasdf
The -f forces the removal since the target is currently offline, and -r tells nethsmenroll to remove rather than enroll.
If something goes really wrong, like the enquiry command hanging because the remote system has gone completely offline, you can find this information in the /opt/nfast/kmdata/hsm_SERIAL/config/config file on a client or the RFS.
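If you’d rather script the lookup than eyeball it, here’s a rough sketch (plain shell; the module number, IP, and path are the example values used in this article, so substitute your own):

# Option 1: pull the failed module's 'connection info' line straight out of the enquiry output
# (assumes another 'Module #' section follows; adjust the range if the failed module is listed last)
enquiry | sed -n '/^Module #1:/,/^Module #2:/p' | grep 'connection info'

# Option 2: if enquiry hangs, search the config file named above for the block
# that references the dead unit's IP address
grep -n -B 4 -A 8 '192\.0\.2\.1' /opt/nfast/kmdata/hsm_SERIAL/config/config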
- After removal, the enquiry command should report:
Module #1: Not Present
Perform the above on the RFS system as well, if it’s independent of the transactional systems.
- On the RFS, I’m going to clean out the old data related to the failed HSM because I intend to re-use the same IP address for the new HSM. This cleanup includes:
- Remove references to the old HSM’s ESN (1234-ABCD-4567) and IP address (192.0.2.1) from the /opt/nfast/kmdata/config/config file. There will probably be about ten blocks of settings related to the failed HSM, and not every block contains both the ESN and the IP, so search for each separately (see the sketch after this list).
- Rename the /opt/nfast/kmdata/hsm-1234-ABCD-4567 directory. Don’t remove it since you may want to refer to that config file in the future.
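A minimal sketch of that cleanup, using the example ESN, IP, and paths from this article (the renamed directory name is just my own convention):

# find every block in the RFS config that mentions the failed unit; search for the ESN and the IP separately
grep -n -e '1234-ABCD-4567' -e '192\.0\.2\.1' /opt/nfast/kmdata/config/config

# keep the old per-HSM directory around for future reference instead of deleting it
mv /opt/nfast/kmdata/hsm-1234-ABCD-4567 /opt/nfast/kmdata/hsm-1234-ABCD-4567.failed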
- Install the new hardware. In my case, I’m re-using the failed unit’s IP address, so no firewall changes are needed. Also specific to my setup: I use the auto-push and remote-upgrade features, so I enable those via the front panel, and then add the RFS server’s IP address to the new nShield, also via the front panel.
- If the network between the HSM and the RFS is trusted, run
anonkneti 192.0.2.1
to get back the ESN and KNETI hash from the new HSM. If the network is not trusted, read the ESN and hash off the HSM’s front panel instead.
- Permit the new HSM to connect to your RFS via:
rfs-setup --force 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f

# or:
rfs-setup --force 192.0.2.1 `anonkneti 192.0.2.1`
I’m using the force option above because the RFS will likely still have config left over from the previous HSM (with a different serial number) that used this IP address.
- Enroll the new HSM (include the -p flag since this is your privileged client):
nethsmenroll -p 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f

# or:
nethsmenroll -p 192.0.2.1 `anonkneti 192.0.2.1`
Again: IP, ESN, hash. The -p tells it privileged mode, which is only needed from your RFS server and corresponds to what you previously entered on the front panel. On your regular nodes, don’t use -p.
Pull the new config from the front panel of the HSM, or, using the front panel, turn on the auto-update-config option. If you’ve turned on auto push, the HSM should then write its new config back to the RFS; if you do not have auto push enabled, use the front panel to push the config back.
- I personally like auto push on, because it provides confirmation that an HSM has received and committed a new config I push from my privileged client (via the RFS), since the HSM writes the new config back afterward. So the way I do it is: auto push on, remote push enabled from the privileged client, and a config.new file kept in the same directory as the HSM’s config file on the RFS, which will be /opt/nfast/kmdata/hsm-SERIAL/config/
Before I make any HSM changes, I diff config against config.new to make sure there are no discrepancies. If there are, I’ll typically put config.new back in sync with config, make a non-impactful change to it, and push it out to make sure it comes back the same. If it comes back different, diff the two to figure out what went wrong. The non-impactful change I typically use for this is enabling the HSM power button and then turning it back off.
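For example, before pushing anything (the path uses the example ESN that appears later in this article):

# compare the HSM's last reported config against my working copy; any diff output means
# something changed that I didn't expect, so investigate before pushing
diff /opt/nfast/kmdata/hsm-ABCD-1234-EF56/config/config /opt/nfast/kmdata/hsm-ABCD-1234-EF56/config/config.new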
The way you push your new config to the HSM is via the cfg-pushnethsm command:
cfg-pushnethsm --address=192.0.2.1 -f -n /opt/nfast/kmdata/hsm-ABCD-1234-EF56/config/config.new
- The new HSM probably doesn’t have the same firmware as your other unit(s), and enquiry will tell you so via ‘unsupported firmware’, so you may need to do an upgrade. This is also the ideal time to upgrade, because you haven’t loaded the security world yet and firmware upgrades destroy the security world. If an upgrade is needed, first use the nethsmadmin command to list the images available on your RFS. Note the IP below is your RFS IP, not an HSM IP:
# v12.40 and older Security World client software:
nethsmadmin --list-images 192.0.2.100

# v12.80 and probably some versions in between:
nethsmadmin -s 192.0.2.100 -l
- The result of the above will look like this:
Initiating RFS nethsm image check on 192.0.2.100...
Checking the nethsm-firmware directory on the RFS.
nethsm-firmware/12.23.1cam3-fips/nCx3N.nff
nethsm-firmware/12.23.1cam3/nCx3N.nff
nethsm-firmware/12.40.2cam1-fips/nCx3N.nff
nethsm-firmware/12.40.2cam1/nCx3N.nff
nethsm-firmware/12-80-4-latest/nCx3N.nff
Images were successfully found on the RFS (192.0.2.100).
Pick the image name you want, and BE SURE TO SET THE MODULE NUMBER IN THE FOLLOWING PROPERLY:
I cannot stress enough how important it is to be careful here. Updating the firmware on an nCipher / Thales nShield HSM breaks the Security World installed on it, so if you accidentally flash firmware on a production unit by getting the module number wrong, you’ll take that unit down until you can reinstall the Security World via your administrative card set. That could mean a complete outage, and possibly travel if the HSM is not local and you don’t use remote admin cards.
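Before running the upgrade, it’s worth a quick sanity check that the module number you’re about to use really is the new unit. A rough way to do that from the RFS, assuming the enquiry output layout shown earlier (module numbers here are just the article’s examples):

# map module numbers to ESNs/IPs at a glance so you flash the right unit;
# the ESN in 'connection info' should be the NEW HSM's, not a production unit's
enquiry | grep -E '^Module #|connection info'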
nethsmadmin -m 1 -i nethsm-firmware/12-80-4-latest/nCx3N.nff
- In the above case, I’m telling it to install the firmware on module #1. Alternatively, from the new HSM’s front panel, you can have it list the images and pick the one to upgrade to. Keep in mind this process takes a very long time, possibly 30 minutes or more, and the unit will appear unresponsive while it runs. It will remind you of this when pushing out the image:
Image upgrade completed. Please wait for appliance to reboot.
Please wait for approximately half an hour for the appliance to internally upgrade.
- Switch the HSM to pre-init mode, either via front panel, or command from the RFS privileged client USING THE CORRECT MODULE NUMBER:
nopclearfail -I -m 1
- Now round up a quorum of cards from your administrative card set, which may mean having a bunch of staff called in to stand around the HSM. From the front panel, menu 3-2-2, load your security world. It will ask you to start feeding it cards, and once a quorum has been reached, you’re good. It will write its config back to the RFS at this point.
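To confirm the world actually loaded, a hedged check from the RFS (nfkminfo ships with the Security World client software; its output format varies by version):

# both the security world and the new module should report a usable state
nfkminfo | grep -i 'state'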
- On the RFS, copy the new /opt/nfast/kmdata/hsm-NEW-ESN/config/config file to config.new in the same directory, and add your other authorized clients in the [hs_clients] section. Or add them via the HSM’s front panel and it will update the file for you. If you choose the edit-the-file method, after changing your config.new, push it out to the HSM with cfg-pushnethsm:
cfg-pushnethsm --address=192.0.2.1 -f -n /opt/nfast/kmdata/hsm-ABCD-1234-EF56/config/config.new
then remove config.new once the config shows as having been updated, since the HSM will push its config back to the RFS after the change has been applied. A consolidated sketch of this workflow follows.
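Put together, the working-copy flow on the RFS looks roughly like this (the ESN and IP are the article’s example values; the cfg-pushnethsm flags are the same ones used above):

cd /opt/nfast/kmdata/hsm-ABCD-1234-EF56/config
cp config config.new
vi config.new                                  # add your other clients to the [hs_clients] section
cfg-pushnethsm --address=192.0.2.1 -f -n ./config.new
diff config config.new                         # once the HSM writes its config back, these should match
rm config.new                                  # remove the working copy after the update is confirmed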
- On your other HSM clients, now run rfs-sync -U to have them update their /opt/nfast/kmdata/local data to reflect the new module_ESN file.
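For example, on each client (-U is the short form of rfs-sync --update):

rfs-sync --update
ls /opt/nfast/kmdata/local/ | grep -i module   # the new module_ESN file should now be present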
- Use nethsmenroll on the other HSM clients to connect them to the new HSM now that it has the security world loaded:
nethsmenroll 192.0.2.1 `anonkneti 192.0.2.1`
or of course the long version will work too:
nethsmenroll 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f
- Check that enquiry looks good. You should be done.
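A final spot check I’d run on each client, pulling out the fields that matter most (layout as shown in the earlier enquiry examples):

enquiry | grep -E '^Module #|mode|connection status|hardware status'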
Very good article, it helped me a lot.
Thank you
Thanks for such a detailed article on replacing an HSM. But I would like to know why you suggest replacing the failed HSM with a new one instead of trying to reboot the failed HSM? Does the error message ‘Module has failed’ mean that the system needs to be replaced? I have tried rebooting the device and it worked fine, but it failed again after 2 months. Could you please let me know the reason for such a failure?
I’ve had five or six of these fail over the past few years and have never had any recover with a power cycle; the same error just immediately returns.