Introduction

In my previous series of posts, I took an in-depth look at Application Aware Routing (AAR), a key SD-WAN technology that steers traffic over the best-performing paths. While AAR has been a fundamental capability for years, the evolution of networking brought new ideas to enhance its effectiveness. This led to the introduction of Enhanced Application Aware Routing (EAAR).

AAR Limitations

Before diving into EAAR, let’s understand why it was created. 🤓

The current AAR implementation measures path quality using BFD, sending probes at a defined interval (1s by default). Loss, latency and jitter are derived from those probes, and the values are placed into rotating buckets to calculate an average tunnel health metric. Reacting to a degradation typically takes between 10 and 60 minutes, and with some configuration tweaks we can bring that down to 2 to 10 minutes.
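
As a reference, here is a minimal sketch of the kind of timer tuning this refers to (these happen to be the same values shown later under the App route section of show sdwan app-route params); with 2-minute buckets and a multiplier of 5, AAR reacts in roughly 2 to 10 minutes:

  bfd app-route poll-interval 120000
  bfd app-route multiplier 5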

For those networks that require faster detection, some challenges arise:

  • Lowering the hello interval to less than 1s affects the device’s tunnel scale
  • Lowering the BFD multiplier and poll intervals can lead to false positives, switching traffic even under transient network conditions.
  • Traffic can switch back and forth, since there is no mechanism to determine whether a transport is stable again after a network degradation event.

Enhanced AAR

So, what is EAAR and how does it improve on its predecessor? 🤔

In a nutshell these are the advantages:

  • Uses inline data rather than BFD. In other words, the data plane packets are used to measure loss, latency and jitter.
  • Steers traffic in seconds rather than minutes
  • Implements dampening for stability, ensuring transports are stable again before traffic is forwarded through them
  • Provides more accurate measurements of loss, latency and jitter

Breaking It Down

When EAAR is enabled, data packets will be used to measure loss, latency and jitter. Let’s understand the key differences:

Loss measurement

The SD-WAN routers will use inline data along with IPsec sequence numbers to measure loss.

There is a built-in mechanism that allows the router to determine whether the loss is local to the router:

  • Local loss - typically due to QoS drops

or external to the router:

  • WAN loss - any packet loss outside the router

To calculate the local loss, the router compares the number of packets it generated with the number of packets that were actually sent out. To get the WAN loss, the peer SD-WAN router reports the number of packets it received and uses BFD (Path Monitor TLVs) to send this information back to the originating router.

Up to this point, there is already an important improvement in how loss is measured; however, it can be refined further by leveraging per-queue loss measurements. To achieve this, we need to associate an SLA class with an App Probe Class. Let’s see an example.
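
The original post shows this step in the Manager UI; here is a minimal CLI sketch of the equivalent Controller (vSmart) policy, reusing the probe-class and SLA names that appear later in this post. The forwarding-class name and the per-color DSCP values are assumptions for illustration:

  policy
   app-probe-class Transactional-Probe_Class
    forwarding-class Transactional
    color mpls dscp 18
    color biz-internet dscp 18
   !
   sla-class SLA_Transactional
    app-probe-class Transactional-Probe_Class
    loss    5
    latency 45
    jitter  150
   !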

With this App Probe Class, the router will use data packets (and generate BFD probes) marked with DSCP 18, mimicking less important traffic that will be subject to different rules and paths on the local and external routers. This provides a more accurate measurement of loss for each type of traffic on the specified transports. If there is no inline data, BFD is used to obtain the measurements.

Note If using GRE encapsulation, per-queue measurement is not available.

Here is a visual to better understand how loss is measured depending on multiple factors:

Encapsulation  App Probe Class  Measurement type  Public tunnels                         Private tunnels
-------------  ---------------  ----------------  -------------------------------------  -----------------------------------------
IPsec          Yes              Per SLA           Total WAN loss + local loss per queue  WAN loss per queue + local loss per queue
IPsec          No               All SLAs          Total WAN loss + total local loss      Total WAN loss + total local loss
GRE            -                All SLAs          Total WAN loss + total local loss      Total WAN loss + total local loss

Latency

To measure latency, the router simply calculates the time taken to send and receive packets between the source and destination devices. Inline data is used, and measurements can be taken down to App Probe Class granularity.

Jitter

Another change worth mentioning is that jitter is computed per direction (receive and transmit). Jitter is computed at the receiver and reported to the sender using BFD TLVs. Inline data is used, and BFD is the fallback mechanism if no data traffic is available.

SLA Dampening

One of the benefits of EAAR is steering traffic in seconds rather than minutes, but what would happen if transient network conditions caused the transports to fall out of SLA every few minutes? Traffic would constantly switch between transports, which is not a desirable scenario, and this is why dampening was introduced.

The general idea is that when a transport link goes out of compliance, traffic is rerouted to an alternate path. Once the transport becomes compliant again, the device does not immediately move traffic back. Instead, it starts a timer to ensure the link remains stable for a specified period before reusing it.

In the end, dampening helps prevent the unnecessary traffic shifts caused by transport instability, which could negatively impact performance.

Configuring EAAR

To enable EAAR, we have three predefined modes:

Mode          Poll Interval  Poll Multiplier  Dampening Multiplier
------------  -------------  ---------------  --------------------
Aggressive    10s            6 (10s-60s)      120 (20 mins)
Moderate      60s            5 (60s-300s)     40 (40 mins)
Conservative  300s           6 (300s-1800s)   12 (60 mins)

Note To use custom timers, configuration needs to be done through CLI templates.

EAAR follows the same fundamental principle as AAR, using rotating buckets to calculate average loss, latency, and jitter. With the Aggressive mode, traffic would take between 10 and 60 seconds to shift, depending on how severe the impairment is.

The dampening window (poll interval x dampening multiplier) is 1200 seconds, meaning that before switching traffic back to a transport, it needs to be stable for 20 minutes.

In my lab, I am using Configuration Groups, however this is available through templates as well.

You can use a variable, instead of a global value, to account for devices that will not be running EAAR. In that case, EAAR-enabled devices will fall back to AAR on the tunnels toward those devices.

The following config is added to the devices:

  bfd enhanced-app-route enable
  bfd enhanced-app-route pfr-poll-interval 10000
  bfd enhanced-app-route pfr-multiplier 6
  bfd sla-dampening enable
  bfd sla-dampening multiplier 120

Let’s see how it works

Demo

In my lab, I use Manager 20.16.1 and my devices are running 17.15.1a

Note The minimum version required is 20.12/17.12

Let’s start with some verifications after pushing the configuration.

To check the configured timers and multipliers

BR10#show sdwan app-route params
Enhanced Application-Aware routing
    Config:                  :Enabled   
    Poll interval:           :10000     
    Poll multiplier:         :6         

App route 
    Poll interval:           :120000    
    Poll multiplier:         :5         

SLA dampening  
    Config:                  :Enabled   
    Multiplier:              :120    

To verify what BFD sessions are using EAAR, look for the FLAGS column

BR10#show sdwan bfd sessions alt

                                  SOURCE TLOC      REMOTE TLOC                     DST PUBLIC    DST PUBLIC                                  
SYSTEM IP   SITE ID   STATE       COLOR            COLOR            SOURCE IP      IP            PORT        ENCAP  BFD-LD    FLAGS   UPTIME    
-------------------------------------------------------------------------------------------------------------------------------------------------
1.1.1.20    200       up          biz-internet     biz-internet     30.1.10.2      30.1.20.2     12406       ipsec  20006     EAAR    0:00:20:31 
1.1.1.20    200       up          mpls             mpls             30.2.10.2      30.2.20.2     12366       ipsec  20002     EAAR    0:00:20:38 
1.1.1.20    200       up          private1         private1         30.3.10.2      30.3.20.2     12366       ipsec  20003     EAAR    0:00:20:37 

To get more details about a specific tunnel.

BR10#show sdwan app-route stats summary 
Generating output, this might take time, please wait ...
app-route statistics 30.1.10.2 30.1.20.2 ipsec 12386 12406
 remote-system-ip         1.1.1.20
 local-color              biz-internet
 remote-color             biz-internet
 sla-class-index          0,1,2
 fallback-sla-class-index None
 enhanced-app-route       Enabled
 sla-dampening-index      None
 app-probe-class-list None
  mean-loss    0.000
  mean-latency 0
  mean-jitter  0
       TOTAL                       AVERAGE  AVERAGE  TX DATA       RX DATA       IPV6 TX       IPV6 RX       
INDEX  PACKETS       LOSS          LATENCY  JITTER   PKTS          PKTS          DATA PKTS     DATA PKTS     
-------------------------------------------------------------------------------------------------------------
0      0             0             0        0        0             0             0             0             
1      64            0             0        0        0             0             0             0             
2      0             0             0        0        0             0             0             0             
3      0             0             0        0        0             0             0             0             
4      0             0             0        0        0             0             0             0             
5      64            0             0        0        0             0             0             0             

Notice that there is no traffic between my devices yet, so the total packet count in each bucket is low.

The same information is available through the Manager’s UI, using the Real Time dashboard and the App Routes Statistics device option.

Scenario 1 - Slight impairment

This is my lab’s topology

For this first test, I will use the following SLA parameters:

SLA_Real-Time 
Loss: 3%
Latency: 150ms
Jitter: 100ms

The AAR policy from the Controller instructs:

  • Use mpls as the primary path.
  • If no color meets the SLA and private1 is available, use it.
  • If private1 is not available, load-balance among all remaining colors.

I am matching traffic between 172.16.10.0/24 and 172.16.20.0/24.

BR10#show sdwan policy from-vsmart
from-vsmart app-route-policy app_route_AAR
 vpn-list vpn_Corporate_Users
  sequence 1
   match
    source-data-prefix-list      BR10
    destination-data-prefix-list BR20
   action
    backup-sla-preferred-color private1
    sla-class       SLA_Real-Time
    no sla-class strict
    sla-class preferred-color mpls
  sequence 11
   match
    source-data-prefix-list      BR20
    destination-data-prefix-list BR10
   action
    backup-sla-preferred-color private1
    sla-class       SLA_Real-Time
    no sla-class strict
    sla-class preferred-color mpls

Initial state without network issues

BR10#show sdwan app-route stats summary | i color|damp|mean         
 local-color              biz-internet
 remote-color             biz-internet
 sla-dampening-index      None
  mean-loss    0.000
  mean-latency 1
  mean-jitter  0
 local-color              mpls
 remote-color             mpls
 sla-dampening-index      None
  mean-loss    1.212
  mean-latency 1
  mean-jitter  0
  mean-loss    1.212
  mean-latency 1
  mean-jitter  0
 local-color              private1
 remote-color             private1
 sla-dampening-index      None
  mean-loss    0.000
  mean-latency 0
  mean-jitter  0
  mean-loss    0.000
  mean-latency 0
  mean-jitter  0

Notice that the number of packets per bucket has increased dramatically.

BR10#show sdwan app-route stats remote-color mpls summary  
Generating output, this might take time, please wait ...
app-route statistics 30.2.10.2 30.2.20.2 ipsec 12366 12366
 remote-system-ip         1.1.1.20
 local-color              mpls
 remote-color             mpls
 sla-class-index          0,1
 fallback-sla-class-index None
 enhanced-app-route       Enabled
 sla-dampening-index      None
 app-probe-class-list None
  mean-loss    1.176
  mean-latency 0
  mean-jitter  0
       TOTAL                       AVERAGE  AVERAGE  TX DATA       RX DATA       IPV6 TX       IPV6 RX       
INDEX  PACKETS       LOSS          LATENCY  JITTER   PKTS          PKTS          DATA PKTS     DATA PKTS     
-------------------------------------------------------------------------------------------------------------
0      131136        1501          0        0        100084        20846         0             0             
1      131072        1438          0        0        95287         19605         0             0             
2      131072        1400          0        0        100985        20937         0             0             
3      131072        1781          0        0        85553         18271         0             0             
4      64            0             0        0        72618         15942         0             0             
5      131072        1589          0        0        65198         14226         0             0             
          

Traffic is using mpls as the primary transport

BR10# show sdwan policy service-path vpn 10 interface gigabitEthernet 4 source-ip 172.16.10.10 dest-ip 172.16.20.10 protocol 6 all       
Number of possible next hops: 1
Next Hop: IPsec
  Source: 30.2.10.2 12366 Destination: 30.2.20.2 12366 Local Color: mpls Remote Color: mpls Remote System IP: 1.1.1.20

I will introduce 3% packet loss on the mpls transport and see how long it takes to switch traffic. Since there is around 1% loss already, 3% should be enough to trigger a change.

The mpls transport has more than 3% loss

BR10#show sdwan app-route stats remote-color mpls summary 
Generating output, this might take time, please wait ...
app-route statistics 30.2.10.2 30.2.20.2 ipsec 12366 12366
 remote-system-ip         1.1.1.20
 local-color              mpls
 remote-color             mpls
 sla-class-index          0
 fallback-sla-class-index 1
 enhanced-app-route       Enabled
 sla-dampening-index      None
 app-probe-class-list None
  mean-loss    3.125          <<<<<<<<<<
  mean-latency 0
  mean-jitter  0

After 56 seconds, the traffic shifted and any of the remaining compliant transports could be used

BR10# show sdwan policy service-path vpn 10 interface gigabitEthernet 4 source-ip 172.16.10.10 dest-ip 172.16.20.10 protocol 6 all
Number of possible next hops: 2
Next Hop: IPsec
  Source: 30.3.10.2 12366 Destination: 30.3.20.2 12366 Local Color: private1 Remote Color: private1 Remote System IP: 1.1.1.20
Next Hop: IPsec
  Source: 30.1.10.2 12386 Destination: 30.1.20.2 12366 Local Color: biz-internet Remote Color: biz-internet Remote System IP: 1.1.1.20

If I take the loss away, we can see the dampening mechanism get activated. If the transport remains stable for 20 minutes, it will be used again as the preferred path.

BR10#show sdwan app-route stats remote-color mpls summary 
Generating output, this might take time, please wait ...
app-route statistics 30.2.10.2 30.2.20.2 ipsec 12366 12366
 remote-system-ip         1.1.1.20
 local-color              mpls
 remote-color             mpls
 sla-class-index          0
 fallback-sla-class-index 1
 enhanced-app-route       Enabled
 sla-dampening-index      1        <<<<<<<<<<
 app-probe-class-list None
  mean-loss    0.000
  mean-latency 0
  mean-jitter  0

Scenario 2 - Greater impairment

In this case, I will introduce 10% packet loss on the biz-internet transport, making private1 the only compliant transport.

After around 45 seconds, the measured loss for biz-internet was 12.5%

BR10#show sdwan app-route stats local-color biz-internet summary         
Generating output, this might take time, please wait ...
app-route statistics 30.1.10.2 30.1.20.2 ipsec 12386 12366
 remote-system-ip         1.1.1.20
 local-color              biz-internet
 remote-color             biz-internet
 sla-class-index          0
 fallback-sla-class-index 1
 enhanced-app-route       Enabled
 sla-dampening-index      None
 app-probe-class-list None
  mean-loss    12.500
  mean-latency 0
  mean-jitter  0

Traffic shifted to private1 only

BR10#show sdwan policy service-path vpn 10 interface gigabitEthernet 4 source-ip 172.16.10.10 dest-ip 172.16.20.10 protocol 6 all
Number of possible next hops: 1
Next Hop: IPsec
  Source: 30.3.10.2 12366 Destination: 30.3.20.2 12366 Local Color: private1 Remote Color: private1 Remote System IP: 1.1.1.20

After removing the packet loss, the dampening mechanism is activated for biz-internet

BR10#show sdwan app-route stats local-color biz-internet summary 
Generating output, this might take time, please wait ...
app-route statistics 30.1.10.2 30.1.20.2 ipsec 12386 12366
 remote-system-ip         1.1.1.20
 local-color              biz-internet
 remote-color             biz-internet
 sla-class-index          0
 fallback-sla-class-index 1
 enhanced-app-route       Enabled
 sla-dampening-index      1         <<<<<<<<<
 app-probe-class-list None
  mean-loss    1.562
  mean-latency 0
  mean-jitter  0

In this case, the time to switch traffic was reduced as a consequence of a greater impairment.

Scenario 3 - Multiple App Probe Classes

For this final scenario, let’s see how to get the most benefit out of EAAR.

The configuration is more complex as it involves QoS, App Probe Classes and AAR Policy.

  • QoS is required to classify and send traffic out on different queues.
  • App Probe Classes to measure loss, latency and jitter on each of those queues, independently.

My QoS configuration has 3 queues, and queue 2 will handle the least important traffic:

  • Queue 0 for control traffic
  • Queue 1 for Real-Time Traffic (marked DSCP 46)
  • Queue 2 for Transactional traffic (marked DSCP 18)

I use a data policy to match traffic on the service side, mark it with the right DSCP, and place it in the right forwarding class. I also created a shaper on my mpls interface.
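
My exact templates are not shown in this post, so here is a minimal, hedged sketch of what this could look like on a cEdge. The forwarding-class names, scheduling percentages, shaper rate, interface, and the port-based matching of the two transfers described just below are all assumptions for illustration:

  ! Map forwarding classes to egress queues; control traffic uses queue 0 by default
  policy
   class-map
    class Realtime      queue 1
    class Transactional queue 2
   !
  !
  ! MQC scheduling per queue, wrapped in a shaper on the mpls-facing interface
  class-map match-any Realtime
   match qos-group 1
  class-map match-any Transactional
   match qos-group 2
  !
  policy-map QOS_MPLS
   class Realtime
    priority percent 30
   class Transactional
    bandwidth remaining percent 40
  !
  policy-map SHAPE_MPLS
   class class-default
    shape average 50000000
    service-policy QOS_MPLS
  !
  interface GigabitEthernet2
   service-policy output SHAPE_MPLS
  !
  ! Centralized data policy (pushed from the Controller) marking service-side traffic
  policy
   data-policy mark_Corporate_Users
    vpn-list vpn_Corporate_Users
     sequence 10
      match
       destination-port 8000
      action accept
       set
        dscp 18
        forwarding-class Transactional
     sequence 20
      match
       destination-port 22
      action accept
       set
        dscp 46
        forwarding-class Realtime
     default-action accept

The key point is that the forwarding class assigned by the data policy determines the egress queue, and that queue is what each App Probe Class measures independently.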

To demo things out, I will have two data transfers:

  • HTTP GET (port 8000)
  • SCP copy (port 22)

My SLAs have the following configurations:

SLA Class Name     Loss  Latency  Jitter
-----------------  ----  -------  ------
SLA_Real-Time      3 %   150 ms   100 ms
SLA_Transactional  5 %   45 ms    150 ms
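
SLA_Transactional and its App Probe Class were sketched earlier; SLA_Real-Time follows the same pattern, tied to the real-time forwarding class. Again a hedged sketch, with the DSCP 46 marking and forwarding-class name carried over from the QoS example above:

  policy
   app-probe-class Real_Time_Probe_Class
    forwarding-class Realtime
    color mpls dscp 46
   !
   sla-class SLA_Real-Time
    app-probe-class Real_Time_Probe_Class
    loss    3
    latency 150
    jitter  100
   !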

The first thing to note is that my two App Probe Classes are measured independently. See how the mean latency for Transactional-Probe_Class is 1 ms, whereas for Real_Time_Probe_Class it is 0.

BR10#show sdwan app-route stats local-color mpls summary 
Generating output, this might take time, please wait ...
app-route statistics 30.2.10.2 30.2.20.2 ipsec 12366 12366
 remote-system-ip         1.1.1.20
 local-color              mpls
 remote-color             mpls
 sla-class-index          0,1,2
 fallback-sla-class-index None
 enhanced-app-route       Enabled
 sla-dampening-index      None
 app-probe-class-list None
  mean-loss    0.000
  mean-latency 1
  mean-jitter  0
       TOTAL                       AVERAGE  AVERAGE  TX DATA       RX DATA       IPV6 TX       IPV6 RX       
INDEX  PACKETS       LOSS          LATENCY  JITTER   PKTS          PKTS          DATA PKTS     DATA PKTS     
-------------------------------------------------------------------------------------------------------------
0      16384         0             0        0        35827         111807        0             0             
1      49152         0             1        0        37788         118236        0             0             
2      49216         0             1        0        36859         115551        0             0             
3      16384         0             1        0        23894         77280         0             0             
4      32768         0             1        1        33179         103759        0             0             
5      32768         0             1        0        21485         71702         0             0             

 app-probe-class-list Real_Time_Probe_Class
  mean-loss    0.000
  mean-latency 0
  mean-jitter  0
       TOTAL                       AVERAGE  AVERAGE  TX DATA       RX DATA       IPV6 TX       IPV6 RX       
INDEX  PACKETS       LOSS          LATENCY  JITTER   PKTS          PKTS          DATA PKTS     DATA PKTS     
-------------------------------------------------------------------------------------------------------------
0      0             0             0        0        -             -             -             -             
1      32768         0             1        0        -             -             -             -             
2      32768         0             0        0        -             -             -             -             
3      0             0             0        0        -             -             -             -             
4      32768         0             1        2        -             -             -             -             
5      0             0             0        0        -             -             -             -             

 app-probe-class-list Transactional-Probe_Class
  mean-loss    0.000
  mean-latency 1
  mean-jitter  0
       TOTAL                       AVERAGE  AVERAGE  TX DATA       RX DATA       IPV6 TX       IPV6 RX       
INDEX  PACKETS       LOSS          LATENCY  JITTER   PKTS          PKTS          DATA PKTS     DATA PKTS     
-------------------------------------------------------------------------------------------------------------
0      16384         0             0        0        -             -             -             -             
1      49152         0             1        0        -             -             -             -             
2      49216         0             1        0        -             -             -             -             
3      16384         0             1        0        -             -             -             -             
4      32768         0             1        1        -             -             -             -             
5      32768         0             1        0        -             -             -             -             

Now, I have lowered my shaper. EAAR was quick to detect a change in latency for the transactional traffic; it is now 53 ms.

BR10# show sdwan app-route stats local mpls summary
Generating output, this might take time, please wait ...
app-route statistics 30.2.10.2 30.2.20.2 ipsec 12386 12366
 remote-system-ip         1.1.1.20
 local-color              mpls
 remote-color             mpls
 sla-class-index          0,1,2
 fallback-sla-class-index None
 enhanced-app-route       Enabled
 sla-dampening-index      None
 app-probe-class-list None
  mean-loss    0.000
  mean-latency 53
  mean-jitter  0
       TOTAL                       AVERAGE  AVERAGE  TX DATA       RX DATA       IPV6 TX       IPV6 RX       
INDEX  PACKETS       LOSS          LATENCY  JITTER   PKTS          PKTS          DATA PKTS     DATA PKTS     
-------------------------------------------------------------------------------------------------------------
0      2048          0             54       0        4396          8513          0             0             
1      1024          0             55       0        4461          8534          0             0             
2      8256          0             49       0        4429          8549          0             0             
3      16384         0             54       0        4411          8549          0             0             
4      0             0             53       0        4440          8549          0             0             
5      0             0             54       0        4443          8549          0             0             

 app-probe-class-list Real_Time_Probe_Class
  mean-loss    0.000
  mean-latency 0
  mean-jitter  0
       TOTAL                       AVERAGE  AVERAGE  TX DATA       RX DATA       IPV6 TX       IPV6 RX       
INDEX  PACKETS       LOSS          LATENCY  JITTER   PKTS          PKTS          DATA PKTS     DATA PKTS     
-------------------------------------------------------------------------------------------------------------
0      0             0             0        0        -             -             -             -             
1      0             0             0        0        -             -             -             -             
2      8192          0             0        0        -             -             -             -             
3      16384         0             0        0        -             -             -             -             
4      0             0             0        0        -             -             -             -             
5      0             0             0        0        -             -             -             -             

 app-probe-class-list Transactional-Probe_Class
  mean-loss    0.000
  mean-latency 53    <<<<<<<
  mean-jitter  0
       TOTAL                       AVERAGE  AVERAGE  TX DATA       RX DATA       IPV6 TX       IPV6 RX       
INDEX  PACKETS       LOSS          LATENCY  JITTER   PKTS          PKTS          DATA PKTS     DATA PKTS     
-------------------------------------------------------------------------------------------------------------
0      2048          0             54       0        -             -             -             -             
1      1024          0             55       0        -             -             -             -             
2      8256          0             49       0        -             -             -             -             
3      16384         0             54       0        -             -             -             -             
4      0             0             53       0        -             -             -             -             
5      0             0             54       0        -             -             -             -             

Now that my Transactional SLA, with a maximum latency of 45 ms, is not met, I will use NWPI to understand how traffic is being sent out. Let’s examine the HTTP traffic on port 8000.

Notice that, in the upstream direction, the local and remote colors are set to private1, indicating that traffic has moved away from mpls and its 53 ms of latency. Just what we expected ✅

Now, let’s see how traffic on port 22 is flowing

Again, take a look at the upstream local and remote colors; notice how mpls is still in use for this traffic, as there are no path issues detected for its queue.

In summary, the traffic with DSCP 46 is working perfectly fine on the mpls transport; however, traffic with DSCP 18 was experiencing more latency than the configured SLA allows, so it was moved to private1, which complies with the SLA.

We can confirm that we are measuring and making routing decisions on a per-queue basis; this is a huge difference 🤯!

Lessons learned

  • Using inline data, the number of samples increases dramatically compared to the BFD sample size. 📈
  • EAAR can steer traffic in seconds, rather than minutes. ⏩
  • EAAR delivers the greatest benefits on transports with QoS, such as MPLS. 🚀
  • Even on transports without QoS, inline data measurements increase sample size and accuracy. ⏳
  • The dampening timer is useful to ensure transports are stable before marking them as valid. ✅
  • Interoperability between devices running EAAR and devices running AAR is possible 🔄

I hope you have learned something useful! See you on the next one 👋