This is just a little diary entry for my first install of some NCS5501-SE devices. I continue to edit it as I find more and more that is broken on this platform. Initially I said there are pros and cons to going with it compared to an Arista equivalent, but after a series of probably eight edits, there’s now simply no reason at all, that I can think of, to purchase this mostly broken NCS5501 / NCS5501-SE over Arista equivalents, even more so knowing what’s on the Arista horizon in early 2018. Ugh; feels like waking up with a hangover. First, the summary of my complaints, then explanations of them, and the initial experience diary entry.
- 20171203 – initial
- 20171205 – uRPF update #1
- 20171205 – VRRP on BVI update
- 20171206 – MC-LAG not active-active
- 20171215 – BFD broken
- 20180103 – Version numbering issue
- 20180226 – uRPF update #2
- 20180328 – storage issue w/core dumps, more upgrade failures
- 20180413 – SNMP missing Cisco’s cbgpPeer2RemoteAddr BGP4-MIB OID
- Minimum 2x higher cost than the currently shipping (Jan 2018) Arista 7280R2, while physically having two fewer 100gig ports (4 vs 6), (40) 1/10 SFP+ ports vs Arista’s (48) 1/10/25 SFP+ ports, and license-wise, only having (8) 1/10 ports active in the base price.
- Potentially as much as 4x higher cost than Arista 7280R2 if you license all the ports; 100gig ports are $12k/ea retail, as are 8-count 10gig licenses.
- NO active/active MC-LAG or vPC functionality; so half the participants in your multi-chassis port channels / bundles are going to sit in a standby state costing you money and doing nothing until a failure occurs.
- uRPF is completely broken unless you intentionally disable the TCAM optimizations for route scale. This means if you’re using this thing with internet route tables, the base 5501 can’t do uRPF and a full table, and the ‘scale’ SE model just went from the highly touted 2.7M route scale to less than the 1.4M+ the Arista can do. So if you’re using this thing as an internet router, your choice is to either permit bogons into your network or intentionally downgrade the capacity; and if you’ve just discarded the main selling point of the SE model, what’s the point of buying it? Confirmed with TAC.
- VRRP is the only first hop redundancy protocol supported (which I’m fine with), but it doesn’t function on bridge virtual interfaces (BVI’s). Summary version is if you’re using these for any switching, and have a pair serving the same bridge domain (think vlan from IOS) for redundancy, they can’t also do VRRP, so they can’t be the first hop. You’re going to need to buy switches to handle your layer 2 stuff and attach these via physical interfaces to do VRRP; so your deployment just got more expensive and complex.
- Broken BFD if you use bundle interfaces (aka combining multiple interfaces, e.g. port channels from IOS, LAG from Brocade). If you use BFD for rapid link failure detection to help your routing protocols converge quicker, and you use bundle interfaces, you have to enable ‘bfd multipath’, which on this platform gets you the wonderful error
!!% 'bfd_api' detected the 'fatal' condition 'unsupported request'
You could also just try enabling it at the BGP level; similar problem, it will take the config, but if you look at your BGP neighbor detail, you’ll see crap like this:
BFD disabled (interface type not supported)
So basically you can use BFD if you’re running nothing but physical single links, otherwise you’re screwed by this platform, yet again.
- Broken BFD if you use VRRP – even if you buy switches and work around the no VRRP on bridge interfaces issue, you still can’t do BFD to gain rapid failover for your VRRP sessions.
- No support for any multi-chassis active/active forwarding (like Arista’s VARP), but I guess that’s a given since there’s no support for an active-active bundle interface.
- MST is the only spanning tree flavor supported.
- Broken SFTP software installer if software source is in a VRF; have to revert to insecure FTP.
- Rack rails, not included
- NEBS kit, to seal the front of the rack around the device’s weird sloped-down front face, not included.
- SNMP implementation is missing the cbgpPeer2RemoteAddr OID (from the CISCO-BGP4-MIB mib), so no way to query for the remote peer address of configured IPv6 peers to build dynamic monitoring rules from.
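The BFD symptoms in the summary above are easy to check for in bulk once you have more than a couple of neighbors. Here is a minimal Python sketch, assuming you have already captured the neighbor detail output as text (how you collect it is up to you). The function name `find_bfd_problems` is hypothetical, the "BGP neighbor is ..." header format is an assumption about the output layout, and only the two status strings are taken from the output quoted in this post.

```python
import re
from typing import List, Tuple

# Status strings quoted in this post; other BFD states are ignored.
BFD_PROBLEM_PATTERNS = [
    re.compile(r"BFD disabled \(interface type not supported\)"),
    re.compile(r"Waiting to create BFD session"),
]

# Assumption: each neighbor block starts with a line like
# "BGP neighbor is 192.0.2.1". Adjust for your own captures.
NEIGHBOR_HEADER = re.compile(r"^\s*BGP neighbor is (\S+)")


def find_bfd_problems(output: str) -> List[Tuple[str, str]]:
    """Return (neighbor, status line) pairs for neighbors whose BFD is unusable."""
    problems = []
    current = "unknown"  # used when a status line appears before any header
    for line in output.splitlines():
        header = NEIGHBOR_HEADER.match(line)
        if header:
            current = header.group(1).rstrip(",")
            continue
        if any(p.search(line) for p in BFD_PROBLEM_PATTERNS):
            problems.append((current, line.strip()))
    return problems


if __name__ == "__main__":
    sample = (
        "BGP neighbor is 192.0.2.1\n"
        " BFD disabled (interface type not supported)\n"
    )
    for neighbor, status in find_bfd_problems(sample):
        print(f"{neighbor}: {status}")
```

The same function works on OSPF neighbor captures, except that the neighbor will be reported as "unknown" unless you add a header pattern for that output.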
Anyway, I wanted some nice 1U dual-stack full-table edge routers at a reasonable price, but with two caveats; one, I needed a little bit of layer 2 mixed in there, and two, I wanted route scale that would guarantee me it would be happy for at least three years from end of 2017 when the full internet tables are roughly 670k IPv4 + 45k IPv6 routes. As IPv4 becomes harder to obtain, I anticipate the quantity of advertised /24’s continuing to shoot up as people buy and sell ever-smaller blocks, so I wanted something that I’d have no worries about handling 1.2M+ routes plus whatever IPv6 is at at the time.
With Cisco SE suggestions, I went with this model, in the -SE designation, because of its ~2.7M route FIB thanks to the large external TCAM; it was the only device they have which has that scale in a 1U format and price range I was targeting. The next step(s) up in the Cisco lineup to give you similar route scale and port density would be something like Nexus 7k or ASR9k, and we’re talking couple hundred grand higher list price.
Arista has a better device from a port-count and 1U perspective, with much lower price, however, as of December 2017 it’s only going to get you “1.4M+” routes, with no real expectation if the + will actually turn into greater scale later. If 1.4M routes is fine though, you probably already have a better solution from Arista for reasons I’ll explain further down. Additionally, if you make some NCS5501-SE config changes to get around the broken uRPF, you will actually end up being able to support slightly LESS routes than the Arista can already do, and don’t even mention the standard 5501, you’ve got a useless box there if you need uRPF and internet route tables; see below. Bottom line; if you REQUIRE a functional uRPF, along with high route scale, the NCS5501 (regular and SE) is useless.
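The arithmetic behind that claim is short. A rough Python sketch, using only figures quoted in this post: the ~2.7M SE FIB, Arista’s “1.4M+”, the end-of-2017 full table sizes, and TAC’s statement (covered further down) that disabling the TCAM optimizations halves capacity. Treating IPv4 and IPv6 routes as one combined count is a simplifying assumption; real FIB accounting differs per address family.

```python
SE_ADVERTISED_ROUTES = 2_700_000     # NCS5501-SE figure quoted in this post
ARISTA_ROUTES = 1_400_000            # lower bound of Arista's "1.4M+"
FULL_TABLE_2017 = 670_000 + 45_000   # IPv4 + IPv6 routes, end of 2017


def capacity_with_urpf(advertised: int) -> int:
    """Capacity once the TCAM optimizations are disabled (halved, per TAC)."""
    return advertised // 2


se_with_urpf = capacity_with_urpf(SE_ADVERTISED_ROUTES)

print(f"SE with uRPF enabled: {se_with_urpf:,} routes")
print(f"Below Arista's figure: {se_with_urpf < ARISTA_ROUTES}")
print(f"Headroom over the 2017 full table: {se_with_urpf - FULL_TABLE_2017:,} routes")
```

So the SE ends up at roughly 1.35M routes with a working uRPF, just under the Arista number, and well short of comfortable if IPv4 fragmentation pushes tables past 1.2M.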
Now, what I’ve heard, is that in latter Q1 2018, we’re going to see shipping start for an Arista 7280CR2K, which is a 2M route 30x100Gbe device. Supposedly, the same chipset that allows it to do that is going to trickle its way down to a (48) 1/10/25 + (?) 100gig stackable device. In comparison to the NCS, this means you get 48+ ports that don’t require a license, they do 25gig if that matters, you can do all layer 2 paths active simultaneously via MLAG (vs Cisco NCS5500-series implementation of MC-LAG which puts the second chassis’ piece of the bundle in standby), and you can do all layer 3 forwarding paths active simultaneously via VARP (which Cisco has no counter for in the NCS line, you have to go to NX-OS-based devices that are several hundred thousand higher list price to get the same layer2+layer3 active/active via vPC+VRRP/HSRP). If I had time to wait for those, there’d be no reason to use the NCS unless there are some features Arista hasn’t implemented that I haven’t come across yet. Well, I won’t go quite that far, the Cisco RPL support in IOS-XR is pretty awesome, but, I’d live without RPL to spend a quarter of the price.
First my gripes:
- Nickel & Diming pt 1: The license fees from Cisco on the NCS55xx series are simply stupid, you get nickel and dimed to death even if you negotiated a reasonable discount up front. There are 5+ add-on software licenses depending on what you want to use your hardware for, then, port license. Yes, barf, port licenses. Cisco charges you roughly $12k list to use blocks of eight 10gig ports, or per 100gig port, so you better negotiate what you think you’ll need up front if you don’t have another purchase to piggy back it onto in the future where you have more negotiating room. Arista gives you a switch that, imagine this, actually lets you use all its ports without paying extra; what a concept!!
- For top of rack or leaf-spine, Arista’s MLAG and VARP stuff is so much easier to deal with than Cisco’s equivalents; you can have an active-active redundant layer 2/3 setup going in a few minutes. You don’t need a complex bundle + add interfaces to bundle + iccp + mpls ldp + mlacp (MC-LAG) + neighbor + add mlacp (MC-LAG) & iccp settings back to bundle + bridge group + bridge domain + interfaces to bridge + add BVI + add BVI to bridge. By the time you finish getting it going you’re like wtf have I been doing for the past half hour. And…. next point is a continuation from here.
- Plain and simple, Arista’s MLAG does multi-chassis LAG in an active-active fashion, like Cisco’s virtual port channel, which this platform doesn’t support. If you buy these, and spend a bunch of money on the port licenses, and had planned to use multi-chassis LAGs, well, you just spent a bunch of money for half your ports to go unused until there’s a failure, at which point the other half will be used.
- Speaking of layer 2, the NCS5501 only supports MST, so if you have a simple deployment and were hoping to keep using PVST because MST is more complex than your environment requires, well, too bad.
- (Note: I had an enlightening call with some folks at Cisco outside of TAC and there may be alternatives to get around this issue; will update soon. -20180403) Let’s move on to Layer 3. This platform has NO first hop redundancy options available if you’re trying to use it in a combination mode of layer 2 and 3. Specifically, if you are trying to use this device where you have more than one port in a bridge domain for switching, then you add a BVI interface (think VLAN interface if you’re not used to this model), you can’t do VRRP on it (and HSRP/GLBP not supported at all platform-wide). So, you can do switching, or you can do routing, but you can’t do both if you expected this router to act as a high availability default gateway for downstream devices.
- uRPF is broken. This thing is billed as a ‘scale’ device designed to hold full internet route tables. Well, that being the case, you’d think common security features like uRPF would work as where else would you use that other than on an internet router. I turned it on and my ports began dropping all traffic on the external-facing interfaces even though they had full BGP tables running and all relevant routes installed. Well, to fix that (per TAC), you need to turn off all external TCAM optimizations via:
hw-module fib ipv4 scale host-optimized-disable
hw-module fib ipv6 scale internet-optimized-disable
hw-module tcam fib ipv4 scaledisable
Also confirmed by TAC is that this reduces your route capacity by half. Fantastic; a core feature that has been enabled on nearly any edge router for years turns your scale router into commodity crap.
- BFD – broken. Tried to make use of it. Added to OSPF config, didn’t come up. Added to BGP config, didn’t come up. Added to bundle ethernet, hey, it came up:
RP/0/RP0/CPU0:rtr2#show bfd session
Mon Feb 26 21:42:39.300 UTC
Interface           Dest Addr       Local det time(int*mult)          State
                                    Echo             Async   H/W   NPU
------------------- --------------- ---------------- ---------------- ----------
Te0/0/0/12          192.0.2.1       0s(0s*0)         450ms(150ms*3)   UP
                                                             Yes   0/0/CPU0
Te0/0/0/13          192.0.2.1       0s(0s*0)         450ms(150ms*3)   UP
                                                             Yes   0/0/CPU0
Te0/0/0/15          192.0.2.1       0s(0s*0)         450ms(150ms*3)   UP
                                                             Yes   0/0/CPU0
BE1                 192.0.2.1       n/a              n/a              UP
                                                             No    n/a
Well, except for the fact that the routing protocols can’t actually use it. At the OSPF neighbor level, you’ll see this:
Neighbor BFD status: Waiting to create BFD session create
At the BGP neighbor level, you’ll see this:
BFD disabled (interface type not supported)
Dig a little deeper and you’ll find that IOS XR needs bfd multipath to work on bundle interfaces. If you try to add the relevant config, you’ll get a failure, explained as follows:
bfd multipath include location 0/0/CPU0
!!% 'bfd_api' detected the 'fatal' condition 'unsupported request'
!
So yeah, if you’re a normal person who uses link bundles for redundancy, guess you won’t be using BFD on this platform. WTF; does anything work on here?
- Nickel & Diming pt 2: Like some of the UCS Fabric boxes, these NCS5501’s slope down from top edge to ports. In the data centers where I have these, this creates a problem because it allows an air gap between the NCS and the area above it if there’s no equipment above it to close that gap. Data centers employing heat containment may forbid these gaps, as does NEBS. How to solve? Buy a NEBS kit for it NCS-1RU-NEBS-KIT. Yes, we’ll make a device with a weird shape and then sell you a kit to convert its profile back to normal.
- Nickel & Diming pt 3: Finally, it doesn’t even include the rack mount kit. Is there a contingent of customers installing these things on someone’s desktop?! Add on NCS-1RU-ACC-KIT.
- If Cisco’s developers mess up code version numbering, who cares, roll it out anyway. At the time of this writing, 6.2.3 is newer than 6.2.25, because they had meant for it to be 126.96.36.199, but since it was ready to go and already numbered, they just rolled it out anyway.
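That numbering quirk matters if you script upgrade checks. A small Python sketch, assuming a hand-maintained release order: the `RELEASE_ORDER` list and the `is_newer` helper are hypothetical, and the order encodes only what this post reports. A plain numeric comparison ranks 6.2.25 above 6.2.3, which is the opposite of the actual release order.

```python
def numeric_key(version: str) -> tuple:
    """Split '6.2.25' into (6, 2, 25) for a naive numeric comparison."""
    return tuple(int(part) for part in version.split("."))


# Oldest first. Hand-maintained assumption based on this post: 6.2.3 is newer.
RELEASE_ORDER = ["6.2.25", "6.2.3"]


def is_newer(candidate: str, running: str) -> bool:
    """Compare by known release order; raises ValueError for unknown versions."""
    return RELEASE_ORDER.index(candidate) > RELEASE_ORDER.index(running)


# Naive numeric answer to "is 6.2.3 newer than 6.2.25?"
print(numeric_key("6.2.3") > numeric_key("6.2.25"))
# Release-order answer to the same question
print(is_newer("6.2.3", "6.2.25"))
```

The two answers disagree, so anything that decides whether to upgrade based on version strings needs the explicit list rather than a sort.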
Okay, let’s get these bad boys installed. Two units bought at the same time, of course they don’t come with the same IOS XR or firmware, so check it and don’t assume multiple devices will be matching. This was also my first foray into IOS XR, so flying by seat of the pants with this one. So far, I really like it; blank config without a million dumb things to disable before getting into the config. One thing that initially caught me off guard was a combination of no longer needing to copy running config to startup config (but you still should for backup purposes) AND interfaces resetting to an admin-down state if they have no other configuration present. So, when just testing layer 2, I flipped all the interfaces on, did some global config work, reloaded, and all my ports went back down. I wasted some time thinking the device was not saving my config when I really just needed to add a description, or any other setting, to the ports I had ‘no shut’; after that they stayed up through reboots before real config was put on.
Before upgrading XR, do yourself a favor and “fpd auto-upgrade enable”
Okay, so first issue I ran into is trying to upgrade IOS XR when the source interface for pulling the new image is in a VRF. This Cisco page:
4) File Server in a VRF? This is how an install add is performed when the file server is reachable inside a VRF, in this example the VRF name is “management”.
A9K-PE3(admin)#install add source ftp://user:firstname.lastname@example.org;management/ asr9k-px-5.1.3.CSCef12345.pie asr9k-px-5.1.3.CSCab67890.pie activate
Well, that didn’t work. tcpdump confirmed the attempt never even went out. What I found would work instead is explicitly setting the VRF for the ssh client (using their example VRF name, management):
ssh client vrf management
Then re-running the install command without specifying the VRF succeeded:
install add source sftp://email@example.com:/home/ncsupgrade/ ncs5500-mini-x-6.2.25.iso
where this did NOT work:
install add source sftp://firstname.lastname@example.org;management:/home/ncsupgrade/ ncs5500-mini-x-6.2.25.iso
I could find no solution to getting the install to run from a non-standard (not 22) port, so had to move the sftp server to accommodate IOS XR. And, as luck would have it, I wasn’t out of the woods on this upgrade issue yet either. One of my NCS’s came broken, so I had to replace it before even getting started. The replacement unit, since it was at a depot, did not have the k9 image on it, so no support for SSH. The above trick of specifying the vrf for ssh did NOT work for ftp. I did “ftp client vrf management passive” then re-ran:
install add source ftp://email@example.com:/home/ncsupgrade/ ncs5500-mini-x-6.2.25.iso
and no connection attempt was made to the source server. I tried the copy command too, no bueno. Finally, I figured out that while the install command didn’t honor the VRF designation, the copy command did, so I was able to:
copy ftp://user:firstname.lastname@example.org;management harddisk:
then filled in the answers for paths and files interactively to get the file onto harddisk:. But wait, it’s not that easy, don’t give it a full path as that seems to confuse things. It will prompt you like this:
Address or name of remote host [192.0.2.1]?
Source filename [/ftp:]?ncs5500-mini-x-6.2.25.iso
Destination filename [/harddisk:/ncs5500-mini-x-6.2.25.iso]?
Notice how in the above, it defaulted to /ftp as source and I gave it a filename without a path? The real path is /home/user/ncs5500-mini-x-6.2.25.iso but giving it that didn’t work. Specifying just the raw filename, and having the file in the remote user’s home directory seemed to be the key to solving this problem.
If I tried to put the vrf and paths in the command, it would screw up the copy somehow.
Well, of course both upgrades failed, why not; if you spend thousands on equipment, upgrades shouldn’t be easy.
#Dec 13 02:28:13 Install operation 11 aborted
RP/0/RP0/CPU0:Dec 13 02:28:13.947 : sdr_instmgr: %INSTALL-INSTMGR-3-OPERATION_ABORT : Install operation 11 aborted
This was due to my trying to install the .iso file with the extension included, since the docs are not clear. You really want to do a ‘show install repo’ and then install the mini ISO file without the extension, as will be presented in that ‘show install repo’ list, for example:
install prepare ncs5500-mini-x-6.2.25
Well, that will run a lot longer, minutes, but will still fail. Feeling a trend here?
RP/0/RP0/CPU0:Dec 13 03:00:50.694 : sdr_instmgr: %INSTALL-INSTMGR-3-OPERATION_ABORT : Install operation 13 aborted
‘show install log’:
Dec 13 02:28:12 Error! The following package(s) is/are required to be activated as part of this operation:
ncs5500-mpls
ncs5500-mgbl
ncs5500-mpls-te-rsvp
ncs5500-ospf
ncs5500-isis
ncs5500-k9sec
Okay, so, now we need to copy all the other rpm files that came out of the downloaded tar file up to the router, even though the directions on Cisco’s site beg to differ. So, from first broken router, the one with SSH available:
install prepare ncs5500-mini-x-6.2.25 ncs5500-isis-188.8.131.52-r6225.x86_64 ncs5500-mgbl-184.108.40.206-r6225.x86_64 ncs5500-k9sec-220.127.116.11-r6225.x86_64 ncs5500-ospf-18.104.22.168-r6225.x86_64 ncs5500-mpls-22.214.171.124-r6225.x86_64 ncs5500-mpls-te-rsvp-126.96.36.199-r6225.x86_64
and from second broken router, the one reliant on ftp, do the same copy command to get the six missing rpm’s onto the router’s hard drive. Then you can re-run the install prepare like the above. Finally, you can run install activate.
install prepare ncs5500-mini-x-6.2.25 ncs5500-mpls-188.8.131.52-r6225.x86_64 ncs5500-mgbl-184.108.40.206-r6225.x86_64 ncs5500-ospf-220.127.116.11-r6225.x86_64 ncs5500-mpls-te-rsvp-18.104.22.168-r6225.x86_64 ncs5500-isis-22.214.171.124-r6225.x86_64 ncs5500-k9sec-126.96.36.199-r6225.x86_64
Dec 13 03:47:53 Package list:
Dec 13 03:47:53 ncs5500-mini-x-6.2.25
Dec 13 03:47:53 ncs5500-mpls-188.8.131.52-r6225.x86_64
Dec 13 03:47:53 ncs5500-mgbl-184.108.40.206-r6225.x86_64
Dec 13 03:47:53 ncs5500-ospf-220.127.116.11-r6225.x86_64
Dec 13 03:47:53 ncs5500-mpls-te-rsvp-18.104.22.168-r6225.x86_64
Dec 13 03:47:53 ncs5500-isis-22.214.171.124-r6225.x86_64
Dec 13 03:47:53 ncs5500-k9sec-126.96.36.199-r6225.x86_64
RP/0/RP0/CPU0:router1#install activate
Wed Dec 13 03:59:32.259 UTC
Dec 13 03:59:33 Install operation 18 started by admin:
install activate
This install operation will reload the sdr, continue?
[yes/no]:[yes]
Oh Emm Gee; we have an upgraded device!
RP/0/RP0/CPU0:router1#show hw-module fpd
Wed Dec 13 04:38:42.696 UTC
                                                      FPD Versions
                                                      =================
Location  Card type     HWver  FPD device   ATR  Status    Running  Programd
------------------------------------------------------------------------------
0/RP0     NCS-5501-SE   1.1    Bootloader        CURRENT   1.15     1.15
0/RP0     NCS-5501-SE   1.1    CPU-IOFPGA        CURRENT   1.14     1.14
0/RP0     NCS-5501-SE   1.1    MB-IOFPGA         CURRENT   1.07     1.07
0/RP0     NCS-5501-SE   1.1    MB-MIFPGA         CURRENT   1.02     1.02
RP/0/RP0/CPU0:router1#sh ver
Wed Dec 13 04:50:28.537 UTC
Cisco IOS XR Software, Version 6.2.25
Copyright (c) 2013-2017 by Cisco Systems, Inc.
Build Information:
Built By : ahoang
Built On : Thu Sep 28 20:01:45 PDT 2017
Build Host : iox-lnx-057
Workspace : /auto/srcarchive12/production/6.2.25/ncs5500/workspace
Version : 6.2.25
Location : /opt/cisco/XR/packages/
cisco NCS-5500 () processor
System uptime is 13 minutes
If everything is cool, verify, commit and then remove the older packages from the devices:
install verify packages
install commit
install remove inactive all
If you had to do any updates via copy command, then you need to remove the source rpm and iso files from harddisk: too.
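The naming rule from the upgrade steps above (use the names as ‘show install repo’ lists them, without the .iso or .rpm extension) can be sketched in Python for anyone scripting this. The helper names are hypothetical, the file names in the usage example are placeholders rather than real package versions, and the assumption that stripping the extension always matches the repo listing rests only on the examples in this post.

```python
from pathlib import PurePosixPath

# Extensions that had to be dropped before 'install prepare' accepted the name.
STRIPPED_EXTENSIONS = {".iso", ".rpm"}


def repo_name(filename: str) -> str:
    """Turn a downloaded file name into the name to pass to 'install prepare'."""
    path = PurePosixPath(filename)
    if path.suffix in STRIPPED_EXTENSIONS:
        return path.stem
    return path.name  # already extension-free, leave it alone


def install_prepare_command(filenames) -> str:
    """Build one 'install prepare' line covering the ISO and all add-on packages."""
    return "install prepare " + " ".join(repo_name(f) for f in filenames)


if __name__ == "__main__":
    # Placeholder names; substitute the files from your own tar download.
    files = [
        "ncs5500-mini-x-6.2.25.iso",
        "ncs5500-isis-VERSION-r6225.x86_64.rpm",
        "ncs5500-k9sec-VERSION-r6225.x86_64.rpm",
    ]
    print(install_prepare_command(files))
```

Still compare the result against ‘show install repo’ on the device before pasting it in; that listing is the authority, not the script.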
Update for March 2018; some co-workers have been having issues ssh’ing into these devices, and after much trying, that ultimately turned into this being logged:
0/RP0/ADMIN0:Mar 27 18:52:21.379 : mediasvr: %MEDIASVR-MEDIASVR-4-PARTITION_USAGE_ALERT : High disk usage alert : /misc/disk1 exceeded 93%
Thought that was kind of weird. I needed to update these to 6.2.3 from 6.2.25 (yes, 25 is older than 3, see above) anyway, so figured I’d just do that. Well, upgrades failed too, with:
Mar 27 21:35:30 No space left on device: (No space in /misc/disk1/tmp_staging/2/ to download pkgs from xr (default-sdr) to admin)
ERROR! No enough space to proceed with ADD operation.
1. Please ensure that the free space in 'harddisk:' of Sysadmin and SDR is at least twice the total size of all packages being added
2. Please ensure that the free space in 'rootfs' of Sysadmin and SDR is enough to hold the total size of all the packages being installed
Please consider following steps to proceed :
1. 'install remove' unwanted inactive packages and ISOs.
2. Delete old core files from 'harddisk:' on Sysadmin and SDR .
3. Remove unnecessary data from 'rootfs' and 'harddisk:' of Sysadmin and SDR
Well I’d be inclined to believe the above if the device’s own commands matched:
RP/0/RP0/CPU0:router2#show media
Tue Mar 27 21:37:17.347 UTC
Media Information for local node.
----------------------------------------------
Partition   Size   Used   Percent   Avail
rootfs:     3.9G   1.2G   33%       2.5G
apphost:    3.7G   106M   3%        3.4G
/dev/sde    969M   361M   40%       542M
harddisk:   5.6G   1.3G   24%       4.1G
log:        459M   101M   24%       324M
config:     459M   3.6M   1%        421M
disk0:      2.0G   19M    1%        1.8G
---------------------------------------------------
rootfs: = root file system (read-only)
log: = system log files (read-only)
config: = configuration storage (read-only)
Umm, nothing seems anywhere close to out of space, but especially not harddisk: where it’s trying to extract. Ultimately I figured out that the ‘show media’ output isn’t accurate; once I did ‘admin’ and then ‘run’ to get to a bash prompt, I was able to find several gigs worth of core dump files in /misc/disk1/ named with the format default-sdr–2.date-date.core.0_RP0.lxcdump.tar.lz4. I removed those, then upgrade was able to proceed.
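Since ‘show media’ did not reflect the problem, the syslog alert was the only early warning I got. A minimal Python sketch that pulls the partition and percentage out of such a line, for feeding a log monitor. The regex is built only from the single message quoted above, so treat the message shape as an assumption, and `parse_partition_alert` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the one PARTITION_USAGE_ALERT line seen in this post.
ALERT_RE = re.compile(
    r"%MEDIASVR-MEDIASVR-4-PARTITION_USAGE_ALERT\s*:\s*"
    r"High disk usage alert\s*:\s*(?P<partition>\S+) exceeded (?P<percent>\d+)%"
)


def parse_partition_alert(line: str) -> Optional[Tuple[str, int]]:
    """Return (partition, percent) for a disk usage alert, or None otherwise."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return match.group("partition"), int(match.group("percent"))


if __name__ == "__main__":
    line = (
        "0/RP0/ADMIN0:Mar 27 18:52:21.379 : mediasvr: "
        "%MEDIASVR-MEDIASVR-4-PARTITION_USAGE_ALERT : "
        "High disk usage alert : /misc/disk1 exceeded 93%"
    )
    print(parse_partition_alert(line))
```

Alerting on any match is probably enough; by the time this message shows up, core dumps have likely been piling up for a while.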
Upgrade didn’t solve the SSH issue. However, again, from within admin->run, I was able to determine that every time someone running putty on Windows tried to ssh in, ssh would crash(?!) and leave a core dump:
411 -rw-r--r-- 1 104491 Mar 27 18:51 sshd_child_handler_832.by.11.20180327-185123.xr-vm_node0_RP0_CPU0.02a5b.core.txt
We determined agent forwarding had to be off, then the crashes stopped.
Next article will be creating bundles (e.g. port channels, MC-LAG, MLAG, LACP, whatever your preferred vendor calls them), locally and cross-chassis, and then coming up with a first hop redundancy config. BGP after that.
SNMP would normally be last after everything is happy, but I ran into one issue so I’ll just mention it here. I build dynamic monitoring profiles for routers where they’re queried for their eBGP neighbors and then monitoring rules are configured based on the data returned. For IPv4, you can use the non-Cisco BGP4-MIB and query for the peers’ remote addresses, but for IPv6, you must use the Cisco-provided CISCO-BGP4-MIB with the cbgpPeer2RemoteAddr object. Well, this platform doesn’t implement that object, so the only way to get your IPv6 peers’ remote addresses is to query some other value each peer would have, such as cbgpPeer2State.2.16 (the .2.16 = IPv6-specific peers), chop out the portion following that from the child OIDs returned, since the value itself is not what you’re looking for, then build the rest of your rules off of that. To display the actual address, you’ll need to convert the child OID data back to hex and format it back into IPv6.
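That last conversion step can be sketched in Python. This assumes the index following cbgpPeer2State is the .2.16 marker plus the 16 address bytes as decimal sub-identifiers, as described above; the function name and the sample index (a documentation-range address) are illustrative only.

```python
import ipaddress

# Index prefix for IPv6 peers: address type 2, address length 16.
IPV6_INDEX_PREFIX = (2, 16)


def ipv6_from_oid_index(index: str) -> ipaddress.IPv6Address:
    """Convert an OID index like '2.16.32.1.13.184...' into an IPv6 address."""
    parts = [int(p) for p in index.strip(".").split(".")]
    if tuple(parts[:2]) != IPV6_INDEX_PREFIX or len(parts) != 18:
        raise ValueError(f"not an IPv6 peer index: {index!r}")
    # bytes() rejects sub-identifiers outside 0..255, which guards against garbage.
    return ipaddress.IPv6Address(bytes(parts[2:]))


if __name__ == "__main__":
    # Sample index for 2001:db8::1, built piecewise to keep the byte count obvious.
    sample = ".".join(["2", "16", "32", "1", "13", "184"] + ["0"] * 11 + ["1"])
    print(ipv6_from_oid_index(sample))
```

Feed it whatever remains of each returned child OID after the cbgpPeer2State portion, and use the resulting address to build the per-peer monitoring rules.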