Introduction

The idea of writing this post came after witnessing an unexpected behavior in how SD-WAN routers handle different operations that make the router reboot. In fact this was discovered “by accident” while performing software upgrades on a couple of SD-WAN routers and was not caught during failover testing. In today’s post I am going to compare how reloading a router through the reload command is treated differently than a reload triggered by a software upgrade.

Setup

A traditional dual-router SD-WAN setup, running BGP towards an external router on the service side providing connectivity to other sites. Something simple like this:

The BGP configuration is super simple and enough to exemplify this behavior. The SD-WAN routers have identical configurations.

Gateway#show run | s r b
router bgp 64002
 bgp log-neighbor-changes
 neighbor 172.16.1.1 remote-as 64001
 neighbor 172.16.1.2 remote-as 64001


BR1-1#sh run | s r b
router bgp 64001
 bgp log-neighbor-changes
 !
 address-family ipv4 vrf 1
  network 172.16.2.0 mask 255.255.255.0
  neighbor 172.16.1.15 remote-as 64002
  neighbor 172.16.1.15 activate
  neighbor 172.16.1.15 send-community both
  distance bgp 20 200 20
 exit-address-family

At a routing level, the Gateway router has two available entries to reach the web server, both equal and coming from BGP. With the current configuration, the Gateway picks BR1-1 as primary and BR1-2 will remain as backup.

Gateway# sh ip route | b Gateway
Gateway of last resort is not set

      172.16.0.0/16 is variably subnetted, 3 subnets, 2 masks
C        172.16.1.0/24 is directly connected, Ethernet0/0
L        172.16.1.15/32 is directly connected, Ethernet0/0
B        172.16.2.0/24 [20/1000] via 172.16.1.1, 02:48:02


Gateway# sh ip bgp                 
BGP table version is 3, local router ID is 192.168.1.15
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
              t secondary path, L long-lived-stale,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *    172.16.2.0/24    172.16.1.2            1000             0 64001 i
 *>                    172.16.1.1            1000             0 64001 i

Behavior when using the reload command

Let’s start examining what the SD-WAN routers do when going down for reload. I will reload BR1-1 as it’s the primary router.

*18:08:35.380 UTC Wed Jul 30 2025
BR1-1#reload  
Proceed with reload? [confirm]

I will start a ping from the gateway router to verify is connectivity is affected

Gateway#sh clock                  
*18:08:42.345 UTC Wed Jul 30 2025
Gateway#ping 172.16.2.2 rep 10000 
Type escape sequence to abort.
Sending 10000, 100-byte ICMP Echos to 172.16.2.2, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<...>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
*Jul 30 18:09:01.490: %BGP-5-NBR_RESET: Neighbor 172.16.1.1 reset (Peer closed the session)
*Jul 30 18:09:01.490: %BGP-5-ADJCHANGE: neighbor 172.16.1.1 Down Peer closed the session
*Jul 30 18:09:01.490: %BGP_SESSION-5-ADJCHANGE: neighbor 172.16.1.1 IPv4 Unicast topology base removed from session  Peer closed the session
<...>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (10000/10000), round-trip min/avg/max = 1/1/32 ms

Gateway#sh ip route | b Gateway
Gateway of last resort is not set

      172.16.0.0/16 is variably subnetted, 3 subnets, 2 masks
C        172.16.1.0/24 is directly connected, Ethernet0/0
L        172.16.1.15/32 is directly connected, Ethernet0/0
B        172.16.2.0/24 [20/1000] via 172.16.1.2, 00:02:42


Gateway#sh ip bgp all sum | b Nei
Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
172.16.1.1      4        64001       0       0        1    0    0 00:00:19 Active
172.16.1.2      4        64001     193     197        4    0    0 02:51:15        1

We can see that the BGP session to BR1-1 was closed and the ping continued without issues through BR1-2.

Looking at the packet level, BR1-1 closed the BGP session right before reloading, preventing any connectivity issues.

This is the usual expected behavior and the way most of the failover testing is conducted.

Behavior when router reloads due to upgrade

After that first reload, BR1-2 is now marked as primary path.

BGP#sh ip route | b Gateway
Gateway of last resort is not set

      172.16.0.0/16 is variably subnetted, 3 subnets, 2 masks
C        172.16.1.0/24 is directly connected, Ethernet0/0
L        172.16.1.15/32 is directly connected, Ethernet0/0
B        172.16.2.0/24 [20/1000] via 172.16.1.2, 00:28:48

This time I will use the Manager to upgrade BR1-2 to see what happens.

Ping from the Gateway router fails once BR1-2 goes down for reload.

Gateway# ping 172.16.2.2 rep 1000000
Type escape sequence to abort.
Sending 1000000, 100-byte ICMP Echos to 172.16.2.2, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!............................................

Interestingly, the BGP neighbor is still up and, as a consequence, the route is still pointing to BR1-2 that is no longer available!

Gateway# sh ip route | b Gate
Gateway of last resort is not set

      172.16.0.0/16 is variably subnetted, 3 subnets, 2 masks
C        172.16.1.0/24 is directly connected, Ethernet0/0
L        172.16.1.15/32 is directly connected, Ethernet0/0
B        172.16.2.0/24 [20/1000] via 172.16.1.2, 00:35:09

Traffic is blackholed for the entire duration of the BGP hold time

Gateway# ping 172.16.2.2 rep 5
Type escape sequence to abort.
Sending 10, 100-byte ICMP Echos to 172.16.2.2, timeout is 2 seconds:
.....
*Jul 30 15:17:10.575: %BGP-3-NOTIFICATION: sent to neighbor 172.16.1.2 4/0 (hold time expired) 0 bytes
Success rate is 50 percent (5/10), round-trip min/avg/max = 1/1/2 ms
*Jul 30 15:17:10.575: %BGP-5-NBR_RESET: Neighbor 172.16.1.2 reset (BGP Notification sent)
*Jul 30 15:17:10.575: %BGP-5-ADJCHANGE: neighbor 172.16.1.2 Down BGP Notification sent
*Jul 30 15:17:10.575: %BGP_SESSION-5-ADJCHANGE: neighbor 172.16.1.2 IPv4 Unicast topology base removed from session  BGP Notification sent

Gateway#ping 172.16.2.2 rep 5
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 172.16.2.2, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/3 ms

Common Misconceptions

One common misconceptions in network operations is that all forms of reboot behave the same. It’s easy to assume that whether a router reboots because of a reload command, power cycle, or software upgrade, the result should be consistent: BGP session drops, routes get withdrawn, traffic fails over.

This case study proves otherwise.

In Cisco SD-WAN, a reload initiated through the Manager during an image upgrade does not gracefully close control plane sessions like BGP. Unlike a CLI-based reload where the router sends TCP FIN packets to terminate sessions cleanly, the upgrade-triggered reload skips that step — keeping the BGP session “alive” on the peer side, even when the router is already offline.

This subtle behavior is rarely documented, and even less frequently tested. Many engineers rely solely on reload or interface flap tests to validate failover behavior. As a result, they may believe their setup is resilient, when in fact it’s vulnerable to specific upgrade workflows.

Being aware of this distinction is key to designing a more robust and predictable network.

Analysis

Inspecting the Manager’s log we can see the following netconf command getting applied to BR1-2.

30-Jul-2025 19:34:51,103 UTC INFO  [6a486f9b-e3d8-42f7-982b-89701bfeab58] [Manager01] [ChangePartitionActionProcessor] (device-action-change_partition-1) || Change partition request XML <activate xmlns="http://cisco.com/ns/yang/Cisco-IOS-XE-install-rpc">
  <uuid xmlns="http://cisco.com/ns/yang/Cisco-IOS-XE-install-rpc">change_partition-fc8b8de6-5f34-470d-b8a0-40e0f6513686%C8K-D4CE7174-5261-7E6F-91EA-4926BCF4C2DD%1800000</uuid>
  <version xmlns="http://cisco.com/ns/yang/Cisco-IOS-XE-install-rpc">17.15.03a.0.176</version>
</activate>

The SD-WAN equivalent of the command is:

request platform software sdwan activate <version>

So, in reality, this is not an apple to apple comparison, but it’s natural to expect the BGP session being closed just as it happened with the reload command.

Solution

To overcome this situation, we can simply configure BFD and tie it to the BGP session:

On the SD-WAN devices we should use a CLI add-on.

Config on the Gateway:

bfd-template single-hop t1
 interval min-tx 250 min-rx 250 multiplier 3

interface eth0/0
 bfd template t1

router bgp 64002
 bgp log-neighbor-changes
 neighbor 172.16.1.1 remote-as 64001
 neighbor 172.16.1.1 fall-over bfd
 neighbor 172.16.1.2 remote-as 64001
 neighbor 172.16.1.2 fall-over bfd

By doing so, the BGP session is tracked and issues can be detected in less than a second, preventing blackholing the traffic when upgrading the device.

Testing with BFD configured, only one ping is lost

Gateway#ping 172.16.2.2 rep 10000000
Type escape sequence to abort.
Sending 10000000, 100-byte ICMP Echos to 172.16.2.2, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
*Aug  1 09:50:08.876: %BFDFSM-6-BFD_SESS_DOWN: BFD-SYSLOG: BFD session ld:1 handle:1,is going Down Reason: DETECT TIMER EXPIRED
*Aug  1 09:50:08.876: %BGP-5-NBR_RESET: Neighbor 172.16.1.2 reset (BFD adjacency down)
*Aug  1 09:50:08.876: %BGP-5-ADJCHANGE: neighbor 172.16.1.2 Down BFD adjacency down
*Aug  1 09:50:08.876: %BGP_SESSION-5-ADJCHANGE: neighbor 172.16.1.2 IPv4 Unicast topology base removed from session  BFD adjacency down.!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
*Aug  1 09:50:08.876: %BFD-6-BFD_SESS_DESTROYED: BFD-SYSLOG: bfd_session_destroyed,  ld:1 neigh proc:BGP, idb:Ethernet0/0 handle:1 active!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<...>
Success rate is 99 percent (22877/22878), round-trip min/avg/max = 1/1/70 ms

CLI vs Manager Behavior Summary

Here’s a quick comparison of what happens when a router reloads via CLI vs when it’s upgraded through the SD-WAN Manager:

Behavior CLI reload SD-WAN Manager Upgrade
BGP Session Handling Session closed gracefully (TCP FIN) Session remains open until timeout
Route Withdrawal on Peer Router Immediate Delayed (until BGP hold timer expires
Impact on Traffic Minimal (instant failover) High (traffic blackholed)
Seen During Traditional Testing? Yes Often overlooked
Can Be Mitigated with BFD? Optional Strongly recommended
Time to Convergence (w/o BFD) sub-second Minutes
Time to Convergence (with BFD) sub-second Sub-second

This table highlights why BFD is not just a “nice-to-have” but a critical part of SD-WAN deployments.

Conclusion

his case study reveals an important and often overlooked aspect of failover behavior in Cisco SD-WAN: not all reboots are treated equally. While a manual reload cleanly terminates BGP sessions—prompting immediate rerouting—an upgrade triggered via SD-WAN Manager does not always signal the control protocols before going offline. This subtle distinction can cause traffic being blackholed for the full duration of the BGP hold timer.

This is where BFD becomes a key ally. By reducing failure detection times from minutes to sub-second, BFD ensures that routing protocols can adapt quickly—even in cases where the router doesn’t properly close sessions. Tying BFD to BGP sessions is a simple and effective way to harden your topology against these behaviors.

Moreover, this behavior isn’t limited to software upgrades. You may also encounter similar issues in other “silent failure” scenarios—for example:

  • A fiber cut between peers that doesn’t result in a direct interface down.
  • Indirect failures within the transport or switching infrastructure.
  • A route processor (RP) swap

So, what’s the main takeaway? Don’t assume CLI behavior reflects all scenarios. Always test failover using the actual methods used in production—whether that’s a software upgrade or Manager-initiated task. And as a final recommendation, always deploy BFD. It’s a small investment with a big payoff.