I recently had a Thales nShield Connect 6000 fail. Replacing one is fairly easy, but there are a lot of steps, so I figured I’d write them out here in case anyone else needs to do this and hasn’t yet.
How do you know if your HSM has failed? Obviously if it’s simply unreachable that’s a good start. In my case the actual HSM portion of the nShield Connect failed, not the outer ‘box’; I believe the HSM is a PCI card inside the box. The unit began syslogging info like this:
Jan 3 04:05:34 nethsm hardserver[1111]: nFast server: Serious error, trying to continue: Operating system call failed: write from device #1 nFast PCI device, bus 2, slot 0. failed, No such device or address
Jan 3 04:05:42 nethsm hsglue: Displaying ‘HSM Failed’ message on the front panel.
If you were to run the ‘enquiry’ command, you’d see an unreachable or failed unit.
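If you want to catch this condition automatically, the two messages above are easy to search for. Here is a minimal sketch, assuming your syslog lands in a file you can read. The function name and the log path in the usage comment are my own placeholders, and the patterns are only as reliable as the two sample lines above.

```shell
# Print syslog lines matching the two failure signatures quoted above.
# Usage (log path is an assumption): hsm_failure_lines /var/log/messages
# Exit status follows grep: 0 if something matched, 1 if nothing did.
hsm_failure_lines() {
    grep -E 'hardserver\[[0-9]+\]: nFast server: Serious error|hsglue: Displaying .*HSM Failed' "$1"
}
```

Since it returns non-zero when nothing matches, it can be dropped into a cron job or monitoring check as-is.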
I received the new unit, so it’s time to replace the failed one. If you’re running redundant HSMs and you’ve installed them in the typical, ‘recommended’ manner, you will probably need to make some network or application changes to replace the unit without an outage. The recommended manner is where your Remote File System (RFS) server is also a transaction client. The issue is that the new HSM needs to be connected to the RFS and have some work done on it before you actually want transactions hitting it, and if your RFS is also a transaction client, transactions are going to hit the new HSM between when it’s connected and when it’s actually ready. So if that’s your setup, make sure to isolate the RFS server from transaction traffic until the new HSM is ready. My ‘how to install the nShield Connect 6000’ article documents how I set things up, which is to have a third system act as the RFS. So take care of this part first, if applicable, then continue.
- Step one is to disconnect the dead HSM from all clients. My clients are Linux, but I believe the commands are similar on Windows, so you can probably figure it out. First run enquiry to confirm the module number of the failed HSM and to get the details you need for the removal. It will look something like this; my dead module is #1:
Module #1:
 enquiry reply flags  Failed
 enquiry reply level  Six
 serial number        unknownunknown
 mode                 operational
 version              0.0.0
 speed index          0
 rec. queue           0..0
 level one flags      none
 version string       unknown
 checked in           0000000000000000 Wed Dec 31 19:00:00 1969
 level two flags      none
 max. write size      0
 level three flags    none
 level four flags     none
 module type code     0
 product name         unknown
 device name          unknown
 EnquirySix version   3
 impath kx groups
 feature ctrl flags   none
 features enabled     none
 version serial       0
 connection status    InProgress
 connection info      esn = 1234-ABCD-4567; addr = INET/192.0.2.1/9004; ku hash = asdfasdfasdfasdf, mech = Any; time-limit = 24h; data-limit = 8MB
 image version        unknown
 hardware status      unknown
From the above, you’ll notice it’s clearly different than an online good unit:
Module #2:
 enquiry reply flags  none
 enquiry reply level  Six
 serial number        9876-EFGH-5432
 mode                 operational
 version              2.61.2
 speed index          4512
 rec. queue           19..152
 level one flags      Hardware HasTokens
 version string       2.61.2cam2 built on Sep 3 2015 16:01:11, 3.34.1cam3
 level two flags      none
 max. write size      8192
 level three flags    KeyStorage
 module type code     7
 device name          Rt2
 EnquirySix version   6
 impath kx groups     DHPrime1024 DHPrime3072
 feature ctrl flags   LongTerm
 features enabled     StandardKM
 version serial       26
 connection status    OK
 connection info      esn = 9876-EFGH-5432; addr = INET/192.0.2.2/9004; ku hash = erqerqwqvwqvewvq5234234234, mech = Any; time-limit = 24h; data-limit = 8MB
 image version        12.23.1cam3
 max exported modules 3
 rec. LongJobs queue  18
 SEE machine type     PowerPCSXF
 supported KML types  DSAp1024s160 DSAp3072s256
 using impath kx grp  DHPrime3072
 hardware status      OK
So we know module #1 is the one we want gone. From the enquiry command, we need to know its electronic serial number, 1234-ABCD-4567, old IP address, 192.0.2.1, and its hash, asdfasdfasdfasdf. Run this command to remove it from each client system:
nethsmenroll -f -r 192.0.2.1 1234-ABCD-4567 asdfasdfasdfasdf
The -f is for force, which is necessary when a given module is not reachable. The -r tells it to remove the module.
After removal, enquiry should report:
Module #1: Not Present
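Copying the IP, ESN and hash by hand across several clients is where typos creep in, so here is a rough sketch that pulls them out of the ‘connection info’ line and prints the removal command for one ESN. It assumes the ‘connection info’ layout shown above, and the function names are hypothetical. It only prints the command; it does not run it.

```shell
# Read enquiry output on stdin and print "IP ESN HASH" for every
# "connection info" line, in the argument order nethsmenroll takes.
enquiry_conn_info() {
    sed -n 's|.*esn = \([^;]*\); addr = INET/\([^/]*\)/[0-9]*; ku hash = \([^,;]*\)[,;].*|\2 \1 \3|p'
}

# Read enquiry output on stdin and print (not run) the forced-removal
# command for the module whose ESN is given as the first argument.
removal_command_for() {
    enquiry_conn_info | awk -v esn="$1" \
        '$2 == esn { printf "nethsmenroll -f -r %s %s %s\n", $1, $2, $3 }'
}

# Usage: enquiry | removal_command_for 1234-ABCD-4567
```

Filtering on the ESN is deliberate: without it you would get a removal command for your healthy module too.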
- Perform the above on the RFS system as well, if it’s independent of the transactional systems.
- On the RFS, I’m going to clean out the old data related to the failed HSM because I intend to re-use the same IP address for the new HSM. This cleanup includes:
- Remove references to the old HSM’s ESN (1234-ABCD-4567) and IP address (192.0.2.1) from the /opt/nfast/kmdata/config/config file. There will probably be about ten blocks of settings related to the failed HSM, and not all of them contain both the ESN and the IP, so you need to look for each.
- Remove the /opt/nfast/kmdata/hsm-1234-ABCD-4567 directory.
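To double-check the config cleanup, a quick search for leftovers helps. This is a minimal sketch with a hypothetical function name. It finds matching lines, not whole blocks, so you still need to eyeball the surrounding settings yourself.

```shell
# List (with line numbers) the lines in a config file that still mention
# the old ESN or the old IP address.
# Arguments: CONFIG_FILE OLD_ESN OLD_IP
# -F treats both values as fixed strings, so the dots in the IP are literal.
# -w keeps 192.0.2.1 from matching inside 192.0.2.10 or 192.0.2.100.
find_hsm_refs() {
    grep -n -w -F -e "$2" -e "$3" "$1"
}

# Usage: find_hsm_refs /opt/nfast/kmdata/config/config 1234-ABCD-4567 192.0.2.1
```

No output and an exit status of 1 means nothing was found, which is what you want once the cleanup is done.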
- Go ahead and install the new hardware. In my case, I’m re-using the same IP address as the failed unit so no firewall changes are needed. Also, specific to me, I make use of the auto push remote upgrade features, so I need to enable those via the front panel. I then add the IP address of the RFS server to the new nShield via front panel.
- If the network is trusted between HSM and RFS, run
anonkneti 192.0.2.1
to get back the ESN and KNETI hash from the new HSM. If the network is not trusted, get the ESN and hash out of the HSM front panel.
- Now you can permit the new HSM to connect to your RFS via:
rfs-setup --force 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f
where that’s obviously the IP, ESN, hash.
- Now enroll the new HSM (include the -p flag!):
nethsmenroll -p 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f
Again, that’s IP, ESN, hash. The -p enrolls this client in privileged mode, which is only needed on your RFS server (the one whose address you previously entered on the front panel). On your regular nodes, leave the -p off.
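Because rfs-setup and the privileged nethsmenroll take the same three values, I find it safer to build both command lines from one place. This is a sketch only: the function name is made up, the ESN shape check (three dash-separated groups of four characters) is an assumption based on the ESNs in this article, and the commands are printed rather than executed.

```shell
# Print the two RFS-side commands for a new HSM. Does not run them.
# Arguments: HSM_IP ESN HASH
rfs_enroll_commands() {
    ip=$1
    esn=$2
    hash=$3
    if [ -z "$ip" ] || [ -z "$esn" ] || [ -z "$hash" ]; then
        echo "usage: rfs_enroll_commands IP ESN HASH" >&2
        return 1
    fi
    # Assumed ESN shape: XXXX-XXXX-XXXX
    case $esn in
        ????-????-????) ;;
        *) echo "unexpected ESN format: $esn" >&2; return 1 ;;
    esac
    printf 'rfs-setup --force %s %s %s\n' "$ip" "$esn" "$hash"
    printf 'nethsmenroll -p %s %s %s\n' "$ip" "$esn" "$hash"
}

# Usage on a trusted network (anonkneti prints the ESN and hash):
#   rfs_enroll_commands 192.0.2.1 $(anonkneti 192.0.2.1)
```

Review the printed lines against the front panel values before pasting them into a shell on the RFS.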
- Pull the new config from the front panel of the HSM, or, using the front panel, turn on auto update config.
- The new HSM probably doesn’t have the same firmware as your other(s), and enquiry will tell you that via ‘unsupported firmware’, so you may need to do an upgrade. If that’s the case, use:
nethsmadmin --list-images 192.0.2.100
where that IP is NOT the HSM, it’s your RFS. Pick the image name you want, and BE SURE TO SET THE MODULE NUMBER IN THE FOLLOWING PROPERLY:
nethsmadmin -m 1 -i nethsm-firmware/12.23.1cam3/nCx3N.nff
So in this case, I’m telling it to install the firmware on module #1. Alternatively, from the new HSM front panel, you can have it list images and pick the one to upgrade to. Keep in mind this process takes a very long time; could be 30 minutes or more, and it will appear to be unresponsive during that time. It will remind you of this when pushing out the image:
Image upgrade completed. Please wait for appliance to reboot. Please wait for approximately half an hour for the appliance to internally upgrade.
- Switch the HSM to pre-init mode, either via front panel, or command from the RFS privileged client USING THE CORRECT MODULE NUMBER:
nopclearfail -I -m 1
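Since both the firmware push and the mode change act on whatever module number you give them, it’s worth deriving that number from the ESN instead of trusting memory. A rough sketch, assuming each module section of the enquiry output starts with a ‘Module #N:’ line and mentions the ESN somewhere after it; the function name is hypothetical.

```shell
# Read enquiry output on stdin and print the number of the module whose
# section mentions the given ESN. Exit status 1 if no module matches.
module_for_esn() {
    [ -n "$1" ] || return 1
    awk -v esn="$1" '
        # Remember which module section we are in ("#2:" becomes "2").
        /^Module #[0-9]+:/ { mod = $2; sub(/^#/, "", mod); sub(/:.*/, "", mod) }
        # First line inside a module section that contains the ESN wins.
        mod != "" && index($0, esn) { print mod; found = 1; exit }
        END { exit (found ? 0 : 1) }
    '
}

# Usage: m=$(enquiry | module_for_esn ABCD-1234-ABCD) && echo "nopclearfail -I -m $m"
```

The usage line only echoes the command, so you get a chance to confirm the module number before running anything against a live HSM.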
- Now round up a quorum of cards from your administrative card set, which may mean having a bunch of staff called in to stand around the HSM. From the front panel, menu 3-2-2, load your security world. It will ask you to start feeding it cards, and once a quorum has been reached, you’re good. It will write its config back to the RFS at this point.
- On the RFS, copy the new /opt/nfast/kmdata/hsm-NEW-ESN/config/config file to config.new in the same directory and add your other authorized clients in the [hs_clients] section. Alternatively, add them via the front panel of the HSM and it will update the file for you. If you go with editing the file, use cfg-pushnethsm to push your config.new out to the HSM:
cfg-pushnethsm --address=192.0.2.1 -f -n /opt/nfast/kmdata/hsm-NEW-ESN/config/config.new
then remove the config.new once the config shows as having been updated, since the HSM will re-push the config after it has been changed.
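If you take the file-editing route, it helps to stage the copy carefully and look at a diff before pushing. A minimal sketch with hypothetical function names; it assumes the directory you pass contains a file named config, as in the cfg-pushnethsm path above.

```shell
# Copy DIR/config to DIR/config.new, refusing to overwrite an existing
# staged copy. Argument: the directory holding the HSM config file.
stage_config() {
    dir=$1
    if [ ! -f "$dir/config" ]; then
        echo "no config file in $dir" >&2
        return 1
    fi
    if [ -e "$dir/config.new" ]; then
        echo "$dir/config.new already exists" >&2
        return 1
    fi
    cp -p "$dir/config" "$dir/config.new"
}

# Show what the staged file would change. Like diff itself, this returns
# 0 when the files are identical and 1 when they differ.
review_staged_config() {
    diff -u "$1/config" "$1/config.new"
}

# Usage:
#   stage_config /opt/nfast/kmdata/hsm-NEW-ESN/config
#   (edit config.new)
#   review_staged_config /opt/nfast/kmdata/hsm-NEW-ESN/config
```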
- On your other HSM clients, now run rfs-sync -U to have them update their /opt/nfast/kmdata/local data to reflect the new module_ESN file.
- Use nethsmenroll on the other HSM clients to connect them to the new HSM now that it has the security world loaded:
nethsmenroll 192.0.2.1 `anonkneti 192.0.2.1`
or of course the long version will work too:
nethsmenroll 192.0.2.1 ABCD-1234-ABCD v09873h9c0398u098uf30f
- Check that enquiry looks good. You should be done.
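For that last check, here is a sketch that boils the enquiry output down to one line per module. It assumes the output has one field per line, with ‘connection status’ and ‘hardware status’ fields as in the samples earlier in this article. The function name is hypothetical, and a ‘Not Present’ module is reported with ‘?’ values and counted as a failure.

```shell
# Read enquiry output on stdin, print one summary line per module, and
# return non-zero if any module is not "OK" on both status fields.
enquiry_summary() {
    awk '
        function report() {
            if (mod == "") return
            printf "%s connection=%s hardware=%s\n", mod, conn, hw
            if (conn != "OK" || hw != "OK") bad = 1
        }
        # A new module section: flush the previous one and reset state.
        /^Module #[0-9]+:/ {
            report()
            mod = $2; sub(/:.*/, "", mod)
            conn = "?"; hw = "?"
            next
        }
        mod != "" && /connection status/ { conn = $NF }
        mod != "" && /hardware status/   { hw = $NF }
        END { report(); exit (bad ? 1 : 0) }
    '
}

# Usage: enquiry | enquiry_summary && echo "all modules look OK"
```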