At IETF96 in Berlin, the chairs of the NETMOD working group, and Operations Area Director (Benoit Claise) published a statement to say "Models need not, and SHOULD NOT, be structured to include nodes/leaves to indicate applied configuration". Now, this might seem a pretty innocuous statement, but it actually has a number of implications for the data models for network configuration and state that are being produced in the industry.

What is applied configuration?

The first question to an uninitiated reader might be, what is "applied configuration"? It's not a term that has been in the common network nomenclature - and hence does need some further explanation. To define it, we need to look at the way that configuration is changed on a network element.

In general, when a configuration is changed, the entity (be it human, or a machine) that is interacting with the device interacts with a management daemon that is responsible for running the interface that the change is being made through. This management daemon is then responsible for updating the various elements of the system (either directly, or through an interim configuration database). A basic overview of this separation is shown in the diagram above (which we'll refer to later).

This separation means that there can be a difference between what the operator wanted the system to be running (the intended state of the system), and what the system is actually running (the applied state). The difference might be for a number of reasons, for example:

  • There may be some contention in the CPU of the network element such that the management daemon shown is not able to communicate configuration changes to the BGPd process.
  • Some elements of the system might have bottlenecks in their programming time - for example, the green linecard TCAM (LC TCAM) might have a specific rate at which ACL entries can be installed.
  • A large number of configuration changes may be queued by the management daemon to be applied to other elements in the system, such that a particular change in value is queued behind others in the system.
  • Dependencies for the configuration to become applied are not present (e.g., the linecard that a referenced interface is on is not actually installed).

Operationally, it's useful to be able to determine what the system is doing - for example, if I am expecting that packets are filtered at the edge with an iACL, then it's useful to know that iACL has actually been programmed by the system into the linecard.

When we're thinking about humans changing the configuration of the network, there are some systems that, at first glance, might appear not to suffer from the issues that we're discussing here.

In a system such as the one shown above, a "candidate" configuration is edited - usually by simply opening an edit session on the configuration. This may create some form of lock (rate-limiting the number of changes that the system may apply). This candidate configuration is then applied using some form of "commit" operation - which may also involve communicating the change to the daemons responsible (especially if those daemons are partially responsible for validation of the changes in intended state that are being communicated). In theory, such systems would not have a view of an "applied" configuration, because one would expect that the "commit" operation ensures that the configuration change has been applied.

However, such a simplification isn't robust - since a number of the reasons for the box having an intended configuration that differs from that which is applied still exist - for example, the lack of presence of a particular hardware element, or the limited programming bandwidth of a particular hardware element.

In addition, this heavy-weight commit process means that the rate of change on the device is limited. For humans, this perhaps does not matter (although some vendors have implemented light-weight commit systems to overcome this issue, since commits could take many minutes) - but when a machine might be making changes, then creating locks and heavy-weight commit logic is something that may hugely increase the complexity of the overall NMS and network element system, especially where there are multiple writers to an individual network element.

An Overview of a Network Management Architecture

OpenConfig aims to support a network management architecture whereby there can be rapid changes made to an individual network element, by multiple writing systems. Such a system makes use of the fact that the intended and applied configurations can be differentiated from one another.

To understand this, let's talk through a basic work-flow of how a single writer might make a change to the state of the device.

  1. The NMS comes online and subscribes to the parameters of the system that it is interested in. Let's say that this NMS is specifically an ACL writer, and doesn't care about any of the other configuration. In this case, it might use the openconfig-acl model to subscribe to /acl. Using an RPC that implements the generic OpenConfig RPC specification, it can choose how it wants this subscription to occur - for example, asking for a subscription with a sample-interval of 0, such that it receives updates only when the values within the path it is interested in change. At this point, the network element sends updates to the NMS as requested, informing it of changes in the values that it has indicated interest in.
  2. The NMS then wishes to make some change to the configuration of the device - it stages a number of changes together, and sends them using a Set RPC call. This RPC can specifically be requested to be transactional, such that the changes in the Set message share fate, or they can be individual. No lock is requested by the NMS, since it will not address the network element through multiple messages - rather, it has already constructed a "candidate" configuration itself. As the network element receives the update to the intended configuration from the NMS, it also pushes a telemetry update back to it, to indicate that the intended values have changed.
  3. After some processing, the device updates the actual configured value - e.g., the TCAM on the linecard is programmed with a new ACL entry - and then pushes a telemetry update to the system to indicate that it has actually been applied. At this point, the listening NMS system can validate that the change it has made has actually been applied by the system.
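
To make the ordering above concrete, here's a minimal sketch of this workflow in Python. Note that OCClient, its subscribe()/set() methods, and the ACL path used are all hypothetical stand-ins for whatever library implements the generic OpenConfig RPC specification - the point is purely the sequencing of intended and applied updates.

# A minimal sketch of the single-writer workflow described above.
# 'OCClient', 'subscribe' and 'set' are hypothetical stand-ins for a
# library implementing the generic OpenConfig RPC specification.
from occlient import OCClient  # hypothetical client library

nms = OCClient("pe1.example.net")

# 1. Subscribe to /acl; sample_interval=0 requests on-change updates only.
updates = nms.subscribe(path="/acl", sample_interval=0)

# 2. Build the change client-side and send it as a single transactional
#    Set - no lock is requested on the network element.
nms.set({"/acl/acl-sets/acl-set[name=iACL]/config/description":
         "edge ingress filter"}, transactional=True)

# 3. Telemetry first reflects the new intended value, and then - once the
#    linecard has been programmed - the corresponding applied (state) value.
for u in updates:
    if "/state/" in u.path and u.value == "edge ingress filter":
        print("change verified as applied:", u.path)
        break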

In this case, the presence of applied configuration is fundamental to ensuring that the NMS can actually validate that the system is running the configuration that it pushed to it. The Set operation can remain relatively light-weight, meaning that it is possible for other systems to make changes to the intended state of the system. This has particular advantages when one considers locking granularity - the OpenConfig model only implies a lock that is held during processing of an individual Set operation; alternate models that allow candidate configurations to be edited may have relatively long-lived global locks, or a complex series of locks on different parts of the data tree to ensure consistency of a candidate configuration when it is actually committed.

This architecture also allows for two systems that are writing the same set of configuration to remain in synchronisation about what configuration is actually running on the network.

In this scenario, we have two NMSes, A and B, writing to the same network element. It may be that they are writing the same paths, or different sets, but each within the other's "interest domain". If both A and B perform Set operations towards the network element, then there is a problem of re-synchronising the view of the data tree that A or B used, such that subsequent data instances that are generated can be validated against the expected config of the network element (e.g., to validate leafrefs, or determine whether elements of a service need to be configured). In this case, we care about what the network element is actually running. By subscribing directly to the applied configuration, as soon as the network element has updated a value, a telemetry notification is sent to both NMSes, such that NMS A and NMS B can maintain an eventually consistent view of the configuration of the device, without needing to poll it directly.

The advantage of using the applied configuration rather than the intended in this case is that if configuration has been set by one NMS but for some reason does not become applied, other systems are able to determine that the device isn't actually running that configuration - and can therefore attempt to set it as they see fit, to make their required change.

What does this have to do with model structure?

To support the above use cases, we need to make both the intended and applied configuration of the network element addressable to external clients. Additionally, we need to make it simple for those systems to be able to determine how the intended configuration relates to the applied - since they will be writing to the intended, and potentially then observing the value of the applied.

OpenConfig's solution is to use the structure of the model itself to indicate these relationships. For example, if we have an administrative state of an interface, OpenConfig will create a particular container for the interface, and within it have a config branch which contains the configurable values, and a state branch which contains the state that relates to that entity - including the applied config (which is, after all, state). Alongside the applied config are the values that are derived from how that entity interacts with other elements - in our interface example, the counters that relate to it, the actual operational status of the interface, etc.

The OpenConfig model layout therefore has a structure similar to the following:

interfaces
   interface[name=eth0]
        config
            admin-status:               up/shutdown
        state
            admin-status:               up/shutdown
            operational-status:             up/down
            counters
                pkts-in:                integer
                ...

This means that we have a path of /interfaces/interface[name=eth0]/config/admin-status which can be written to, setting the intended state of that interface. The actual running admin-status (i.e., whether it is shutdown or not) can be found using the /interfaces/interface[name=eth0]/state/admin-status value. Clearly, this is very easy to relate to the intended value, since one simply substitutes the "config" container that surrounds the leaf with the "state" one.
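
Relating the two programmatically is therefore trivial. As a small illustration (the helper name here is mine, not part of any OpenConfig tooling):

def applied_path(intended_path):
    """Map an intended (config) path to its applied (state) counterpart
    by substituting the final 'config' container with 'state'."""
    head, sep, leaf = intended_path.rpartition("/config/")
    if not sep:
        raise ValueError("not a config path: %s" % intended_path)
    return head + "/state/" + leaf

# applied_path("/interfaces/interface[name=eth0]/config/admin-status")
# returns "/interfaces/interface[name=eth0]/state/admin-status"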

Additionally, if an NMS is interested in all the operational state that relates to this interface, it can retrieve the contents of the "state" container under the interface, where all state that relates to the configured interface is located.

This layout is consistent throughout the models - i.e., it's possible to guarantee that there is a "state" leaf for each "config" leaf, and that the state related to that entity can be directly retrieved by using the partner "state" container. OC models are validated using tooling that enforces this rule consistently, such that those interpreting data coming from network elements using this schema can rely on it - and subscriptions for particular paths consistently get the right thing (for example, subscribing to /interfaces/interface[name=eth0]//state yields all state for each entity associated with an interface).

Where does this leave the IETF?

The IETF NETMOD working group has essentially rejected the approach that OpenConfig proposed in December 2014, leaving itself with a number of questions to address:

  1. How will applied configuration actually be represented? OpenConfig's approach works with a protocol that presents a single view of the data, as well as those that want to provide some divisions of the data tree (if leaves were annotated they could be presented in different "views"). NETMOD would like to pursue a solution that does not support single-view implementations - and hence uses the NETCONF "data store" concept for modelling applied configuration. At the current time, whilst there are abstract proposals for how this would look, there is no running code that represents this.
  2. How usable will the models that the IETF produces actually be? Currently, the IETF BGP and MPLS-TE models adopt the convention that OpenConfig uses, but beyond this, there is little consistency as to how state and configuration data should be represented in IETF models. There is a real danger that the IETF produces models with no consistency between them - trading the Cisco/Juniper/ALU config differences for a set of differences between the way the model you configure your IGP with works, and the way the ones you configure MPLS or BGP with work. The IETF's allergy to architecture, and to having a top-level view of what to build and how to build it, means that this consistency is very difficult to achieve.
  3. When will IETF models actually be published? The decision NETMOD has made, along with there being no clear solution for the representation of applied config in IETF models, has further implications. Vendors are already implementing OpenConfig - and for those models that are also IETF models, they now have some duplication of development effort if they want to support the IETF models too. Additionally, new efforts in the IETF are required to refactor those models that adopted the OpenConfig convention (and given that some of these are actually written by OpenConfig authors, there is the question of who does this work). Building a coherent set of models that allows operators to configure real functions on their network is likely to need significant effort, and has already taken some time.

Where does this leave OpenConfig?

OpenConfig's approach to interaction with the IETF after the first 6 months of the discussion (which is just reaching 18 months old) was to suggest that operational experience of the approach that was suggested is crucial. This experience allows us to determine the solution's efficacy, and work through any issues that become evident. This iteration process is invaluable - since it means that both the network elements and NMS implementations can really be scoped out. Since that suggestion was made back at IETF 94, multiple NMS and vendor implementations have emerged - such that we should be able to report back on progress in the relatively near future.

It does mean, however, that OpenConfig is unlikely - for some time at least - to converge with the IETF models. The IETF will need to solve the issues above - and negatively impacting the industry's building of knowledge around model-driven interaction with network devices, along with taking on the complexities of supporting non-native schemas, appears a huge downside to "waiting for alignment".

Tagged in: Tech, IETF, OpenConfig, YANG

Mark Townsley and Jean-Louis Rougier again invited me to come and lecture at École Polytechnique this year. Their course there focuses on analysing the success of network protocols - using the (fantastic) framework laid out in RFC5218. Given that I'd spoken about SR for the last couple of years in my lecture there, and was giving a (slightly) updated version of the SR lecture at Telecom ParisTech for JLR's 'Future Internet' course earlier in the week, I decided to shift the focus of my lecture at X this year to the management plane. In particular, I looked at some of the issues with SNMP, how these have pushed adoption of alternative management approaches, and what this has fundamentally meant for the way that we build network management today. I then shifted to explaining what we are doing in OpenConfig, and how we might address some of those issues - again, using the framework in 5218.

I wanted to post the slides here -- to share them both with Mark and JLR's class, but also the wider community. Questions, comments and other correspondence relating to this are very welcome via e-mail.

(PDF slide deck)

I've talked a little on this site before about what we're trying to achieve with OpenConfig. However, one observation that is easy to make is that YANG models alone don't really achieve anything in terms of making the network more programmable. To make the network more programmable, we need tooling that helps us create instances of those models, manipulate them, and then serialise them into a format that can be used to transmit data conforming to the model to a device.

It might not be immediately clear what I mean here, so let me explain a little more. A YANG model is a definition of the schema for a set of data. It tells a system the rules that let it validate whether particular data is valid or not. If we have a particular 'leaf' value that is specified to have a type of string, and a 'pattern' of a.*, then all this tells the system is that the value must be a string, and it must start with the letter "a". To do anything useful with that schema, we need to be able to create instances of the data - that is to say, documents that contain actual values that are compliant when validated against the schema. We can't express these documents in YANG, so we need an encoding for that data. NETCONF uses XML, but more recently there are approaches using JSON and YAML.
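
As a toy illustration of this schema/instance split, the leaf described above can be checked with a few lines of hand-rolled Python - purely illustrative, since a real YANG toolchain derives this logic from the module itself:

import re

# Toy rendering of: leaf example { type string { pattern "a.*"; } }
LEAF_PATTERN = re.compile(r"a.*")

def valid_instance(value):
    """Check a candidate instance value against the schema's rules."""
    # YANG patterns are anchored, so the whole value must match.
    return isinstance(value, str) and LEAF_PATTERN.fullmatch(value) is not None

print(valid_instance("apple"))   # True  - a string starting with "a"
print(valid_instance("banana"))  # False - does not match the pattern
print(valid_instance(42))        # False - not a string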

We could create these documents by hand, but this would be akin to writing out CLI config, so we want to create them programmatically. This raises one of the fundamentals around YANG:

  • The reader of a YANG module is a machine - which takes the schema it describes - preferably including the human-readable description parts - and converts it into something that a program can manipulate.

This is what pyangbind - and the goyang project that the awesome folks over at Google published recently - do. They allow you to programmatically interact with, or validate, data against a YANG model.

OpenConfig has recently pushed quite a few more models, including one that covers interfaces. I wanted to take a little time to show how this module works, and how current configurations map into this model. I'm going to use some Cisco, Alcatel-Lucent and Juniper configuration snippets to show this off. These are taken from real networks, but they are not copy-and-pastes. I'll focus on what the openconfig-interfaces output looks like in JSON, and then mention a little about how the instance documents are generated.

Cisco IOS XR: A core-facing IP/MPLS port.

If we consider a core-facing port that has the following IOS XR configuration:

interface TenGigE0/4/2/0
 description type=eth:cid=1042:remote=P2#Te0/0/0
 mtu 9188
 ipv4 address 192.0.2.100 255.255.255.254
 service-policy input CORE__IN
 service-policy output CORE_OUT
 carrier-delay up 2000 down 0

The first thing that we should note is that not all of this configuration is actually pure interface configuration. The service-policy statements are really QoS configuration that happens to be instantiated on the interface. This raises an important point about how configuration data is structured. For this kind of configuration, either:

  1. each model maintains its own list of interfaces, or
  2. each model 'augments' (adds configuration options to) the existing interfaces model.

In the openconfig-interfaces approach, currently, for those elements that relate directly to the behaviour of the interface (and not the behaviour of a protocol when it establishes an adjacency over the interface), option 2 is taken. Thus, the base interfaces module only has physical characteristics defined, whereas IP addressing is defined in an extension openconfig-if-ip module.

When building this configuration according to the OpenConfig interfaces modules, a JSON serialised instance of the data would look like this:

{
    "interfaces": {
        "interface": {
            "TenGigE0/4/2/0": {
                "hold-time": {
                    "config": {
                        "up": 2000
                    }
                },
                "config": {
                    "description": "type=eth:cid=1042:remote=P2#Te0/0/0",
                    "name": "TenGigE0/4/2/0",
                    "mtu": 9188
                },
                "name": "TenGigE0/4/2/0",
                "subinterfaces": {
                    "subinterface": {
                        "4": {
                            "index": "4",
                            "config": {
                                "index": 4,
                                "description": "autogen=default-ipv4-subint"
                            },
                            "ipv4": {
                                "address": {
                                    "192.0.2.100": {
                                        "ip": "192.0.2.100",
                                        "config": {
                                            "ip": "192.0.2.100",
                                            "prefix-length": 31
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

One could comment that it is somewhat longer than the XR configuration, but partially that is because of the JSON encoding adding curly-braces. The core layout of the config is familiar. There is a list of interfaces, which subsequently has other configuration options within it, which can be set as per the original configuration. The QoS configuration, which isn't yet modelled within the OpenConfig model set, is omitted.

Juniper JUNOS: An access-facing port with an IPv4 and IPv6 subinterface specified.

On a Juniper device, we might have a port that faces a customer, where we use 802.1q encapsulation. In this case, we might configure two subinterfaces (unit constructs), such that one carries IPv4 and the other carries IPv6 traffic. In this case, the JUNOS config might look like:

ge-0/1/10 {
    description "CustomerA";
    vlan-tagging;
    mtu 4484;
    hold-time up 4000 down 0;
    gigether-options {
        no-auto-negotiation;
    }
    unit 3044 {
        description "CustomerA-IPv4";
        vlan-id 3044;
        family inet {
            mtu 4000;
            address 192.0.2.0/31;
        }
    }
    unit 3046 {
        description "CustomerA-IPv6";
        vlan-id 3046;
        family inet6 {
            mtu 4000;
            address 2001:DB8::1/64;
        }
    }
}

In this case, this config maps to the following instance of openconfig-interfaces:

{
    "interfaces": {
        "interface": {
            "ge-0/1/10": {
                "hold-time": {
                    "config": {
                        "up": 4000
                    }
                },
                "config": {
                    "description": "CustomerA",
                    "name": "ge-0/1/10",
                    "mtu": 4484
                },
                "name": "ge-0/1/10",
                "subinterfaces": {
                    "subinterface": {
                        "3044": {
                            "index": "3044",
                            "vlan": {
                                "config": {
                                    "vlan-id": 3044
                                }
                            },
                            "config": {
                                "index": 3044,
                                "description": "CustomerA-IPv4"
                            },
                            "ipv4": {
                                "config": {
                                    "mtu": 4000
                                },
                                "address": {
                                    "192.0.2.0": {
                                        "ip": "192.0.2.0",
                                        "config": {
                                            "ip": "192.0.2.0",
                                            "prefix-length": 31
                                        }
                                    }
                                }
                            }
                        },
                        "3046": {
                            "index": "3046",
                            "vlan": {
                                "config": {
                                    "vlan-id": 3046
                                }
                            },
                            "config": {
                                "index": 3046,
                                "description": "CustomerA-IPv6"
                            },
                            "ipv6": {
                                "config": {
                                    "mtu": 4000
                                },
                                "address": {
                                    "2001:db8::1": {
                                        "ip": "2001:db8::1",
                                        "config": {
                                            "ip": "2001:db8::1",
                                            "prefix-length": 64
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

A couple of things should be mentioned here:

  • In openconfig-interfaces the default is that 'auto-negotiation' is disabled, so we don't need to specify this (the leaf is an empty leaf, so auto-negotiation is only turned on if this leaf is present in our data instance).
  • Based on the approach that OpenConfig has taken for op-state, we have config containers where we store all read-write configuration. Whilst it might look like there's some duplication of configuration options, the elements that are directly in the list objects (and act as their key) are actually leafref values, so they simply reflect the value of the corresponding leaf within the config container.
  • Since the hold-time for the interface going down is zero in the input (and really, it's just there because JUNOS makes us specify it), the tooling can specify that it does not want to introduce this value into the OpenConfig model instance - and hence it is omitted from this JSON document.
  • Because the OpenConfig models are mapped to a vendor's internal schema, where there is no real operational advantage of including a leaf, it can be omitted. For example, consider vlan-tagging in the above JUNOS configuration. The fact that there are two sub-interfaces that have 802.1q configuration on them means that we should expect that this interface is 802.1q tagged, and hence in the JUNOS mapping for this module, it should be possible to determine from this that the JUNOS config database's vlan-tagging flag should be set to 'true'.
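
The derivation in the last point might look something like the following sketch, operating on the JSON instance shown above (the function is illustrative, not part of any vendor mapping):

def needs_vlan_tagging(interface):
    """True if any subinterface of this interface carries 802.1q
    configuration - in which case a JUNOS mapping would set the config
    database's vlan-tagging flag."""
    subifs = interface.get("subinterfaces", {}).get("subinterface", {})
    return any("vlan" in subif for subif in subifs.values())

# For the ge-0/1/10 instance above, units 3044 and 3046 both contain a
# vlan container, so needs_vlan_tagging(...) returns True.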

Cisco IOS: A switch with a trunk port carrying multiple VLANs, and SVI interfaces.

OpenConfig interfaces isn't just applicable to routed ports; it's also applicable to switched interfaces that carry multiple VLANs (trunk ports) or a single VLAN (access ports). It also covers cases where the same device has a Layer 3 interface within a VLAN that it is switching (a Cisco SVI). If we consider the following IOS configuration:

interface GigabitEthernet3/20
 description core-switch#Gig2/47
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 2,69,243,282-292,388-397,559
 switchport mode trunk
!
interface Vlan2
 ip address 10.0.4.3 255.255.255.0
 standby 2 ip 10.0.4.1
 standby 2 priority 210
 standby 2 preempt
!
interface Vlan388
 ip address 10.0.5.3 255.255.255.240
 standby 15 ip 10.0.5.19
 standby 15 priority 210
 standby 15 preempt

The configuration maps to the following openconfig-interfaces instance:

{
    "interfaces": {
        "interface": {
            "VLAN388": {
                "config": {
                    "name": "VLAN388"
                }, 
                "name": "VLAN388", 
                "routed-vlan": {
                    "config": {
                        "vlan": 388
                    }, 
                    "ipv4": {
                        "address": {
                            "10.0.5.3": {
                                "ip": "10.0.5.3", 
                                "vrrp": {
                                    "vrrp-group": {
                                        "15": {
                                            "config": {
                                                "priority": 210, 
                                                "virtual-address": [
                                                    "10.0.5.19"
                                                ], 
                                                "virtual-router-id": 15
                                            }, 
                                            "virtual-router-id": "15"
                                        }
                                    }
                                }, 
                                "config": {
                                    "ip": "10.0.5.3", 
                                    "prefix-length": 28
                                }
                            }
                        }
                    }
                }
            }, 
            "VLAN2": {
                "config": {
                    "name": "VLAN2"
                }, 
                "name": "VLAN2", 
                "routed-vlan": {
                    "config": {
                        "vlan": 2
                    }, 
                    "ipv4": {
                        "address": {
                            "10.0.4.3": {
                                "ip": "10.0.4.3", 
                                "vrrp": {
                                    "vrrp-group": {
                                        "2": {
                                            "config": {
                                                "priority": 210, 
                                                "virtual-address": [
                                                    "10.0.4.1"
                                                ], 
                                                "virtual-router-id": 2
                                            }, 
                                            "virtual-router-id": "2"
                                        }
                                    }
                                }, 
                                "config": {
                                    "ip": "10.0.4.3", 
                                    "prefix-length": 24
                                }
                            }
                        }
                    }
                }
            }, 
            "GigabitEthernet3/20": {
                "ethernet": {
                    "vlan": {
                        "config": {
                            "trunk-vlans": [
                                2, 
                                69, 
                                243, 
                                "282..292", 
                                "388..397", 
                                559
                            ], 
                            "interface-mode": "TRUNK"
                        }
                    }
                }, 
                "config": {
                    "description": "core-switch#Gig2/47", 
                    "name": "GigabitEthernet3/20"
                }, 
                "name": "GigabitEthernet3/20"
            }
        }
    }
}

Since the OpenConfig model is structured such that an interface with an IP address must always have a subinterface (as can be seen in the Cisco XR example above), our config instance generates a 'default' subinterface that carries the IPv4 address that is specified, along with the associated VRRP groups.

This example also shows how the OpenConfig model handles VLAN trunks. A Layer 2 interface can have a certain interface-mode set, which reflects whether it will carry 802.1q tagged traffic (or indeed double-tagged QinQ traffic) - and subsequently can have a list of VLANs specified. This list supports range values - which are marked, as per common YANG syntax, as lower..upper. It is debatable whether such ranges are really needed if one is configuring the device through programmatic means (does it matter if we have all VLANs in a range specified separately?), but these are supported for cases where this may be advantageous - for example, if VLANs 10-3000 are required. Where ranges are used, it is the job of the application that generates the document to determine how to split them up if that is later required.
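
Splitting such ranges back out is simple enough; a sketch of what a consuming application might do with the trunk-vlans list shown in the instance above:

def expand_trunk_vlans(trunk_vlans):
    """Expand a trunk-vlans list into individual VLAN IDs, splitting
    YANG-style 'lower..upper' range strings."""
    vlans = []
    for entry in trunk_vlans:
        if isinstance(entry, str) and ".." in entry:
            lower, upper = (int(v) for v in entry.split("..", 1))
            vlans.extend(range(lower, upper + 1))
        else:
            vlans.append(int(entry))
    return vlans

# expand_trunk_vlans([2, 69, 243, "282..292", "388..397", 559]) yields
# [2, 69, 243, 282, 283, ..., 292, 388, ..., 397, 559]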

Routed VLANs are also treated as a special interface type, with the routed-vlan container only being configurable when the type of the interface is set to the IANA identity value of l3ipvlan, which indicates a Layer 3 interface within a VLAN. This type may be set by the user based on a certain interface naming convention, or implied by the device.

Alcatel-Lucent SROS: A mixed-mode port carrying a routed subinterface and an L2 VLAN.

Alcatel-Lucent's configuration is perhaps the least natural to map to the OpenConfig interfaces model's structure. This is essentially because of the way that SROS structures itself around services (it is 'service-centric') for interface configuration in general. There has been much debate as to whether such a 'service-centric' approach (referred to as 'VRF-centric' in the IETF), or a 'protocol-centric', or even 'interface-centric' view of the world should be taken for YANG modelling. OpenConfig is in general trying to adopt an approach where configuration that relates directly to how an interface works (including IP on that interface) is specified in the interface structure. Protocols that add other, non-IP functions on top of an interface maintain their own list of the interfaces on which they are enabled.

The use of ALU platforms in L2 and L3VPN networks makes it quite common to have mixed-mode configuration on a port. For example, a routed subinterface terminated into a VPRN (L3VPN) service, with an L2 PWE service that sits alongside it (using the same port).

If we consider the following configuration:

port 2/2/2
    description "cust=CustA:v=X~2"
    ethernet
        mode access
        encap-type qinq
        mtu 9112
        hold-time up 2 down 2
        no autonegotiate
    exit
    no shutdown
exit

epipe 3599 customer 1 create
    sap 2/2/2:15.* create
        description "epipe-svc=3599"
    exit
    ...
exit

vprn 3791 customer 2 create
    interface "custA" create 
        description "t=infra:l3mgmt"
        address "192.0.2.1/30"
        sap 2/2/2:1000.100 create
            description "vprn-svc=3791"
        exit
    exit
exit

In this case, to map the Alcatel-Lucent configuration, we need to mix the L2 functions of the OpenConfig model with the L3 ones. That is to say, we create a "TRUNK" port that supports the Layer 2 switched VLAN - as per SAP 2/2/2:15.*; and a subsequent OpenConfig subinterface that supports the 2/2/2:1000.100 SAP:

{
    "interfaces": {
        "interface": {
            "2/2/2": {
                "ethernet": {
                    "vlan": {
                        "config": {
                            "trunk-vlans": [
                                "15.*"
                            ], 
                            "interface-mode": "TRUNK"
                        }
                    }
                }, 
                "hold-time": {
                    "config": {
                        "down": 2000, 
                        "up": 2000
                    }
                }, 
                "config": {
                    "description": "cust=CustA:v=X~2", 
                    "name": "2/2/2", 
                    "mtu": 9112
                }, 
                "name": "2/2/2", 
                "subinterfaces": {
                    "subinterface": {
                        "3791": {
                            "index": "3791", 
                            "vlan": {
                                "config": {
                                    "vlan-id": "1000.100"
                                }
                            }, 
                            "config": {
                                "index": 3791, 
                                "description": "custA"
                            }, 
                            "ipv4": {
                                "address": {
                                    "192.0.2.1": {
                                        "ip": "192.0.2.1", 
                                        "config": {
                                            "ip": "192.0.2.1", 
                                            "prefix-length": 30
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

In this case, the trunk-vlans configuration lists the VLANs that are to be switched, and the subinterface construct contains a vlan-id statement indicating that frames received with this tag should be routed to the subinterface, and that egress frames should be tagged with this value. This hybrid case allows the same mixed-service approach as the ALU SROS configuration above.

Some conclusions...

It's certainly possible to map a bunch of different types of interface config to the OpenConfig interfaces model, and cover a number of use cases for interfaces whilst doing this. However, it's of course still a pain if one needs to create this by hand. This is where the tooling comes in. The examples above were produced by mapping a JSON-based input to the model through some relatively simple (c.80 lines) Python. Whilst in this case it was a simple example input with a close to 1:1 mapping of data, this input could be a specification of which type of interface 'service' should be created, such that it is possible to request an instance of that service and have it automatically mapped to the relevant interface configuration. The intent of pyangbind, and similar tools, is to give a means by which input instances of a model can be loaded into a hierarchy that can be manipulated, and such transformations to OpenConfig instances performed. I'll add some examples of how this can be done when I get the time to blog again!
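
To give a flavour of what that mapping code looks like, here's a trimmed sketch. It assumes Python classes generated from openconfig-interfaces in the pyangbind style - the binding module name and the shape of the input are illustrative, not real artefacts:

# Illustrative sketch: 'oc_interfaces' stands in for bindings generated
# from openconfig-interfaces; attribute names follow the pyangbind
# convention of replacing hyphens with underscores.
from oc_interfaces import interfaces_root  # hypothetical generated module

def map_interface(name, description, mtu, subifs):
    """Map a simple input description into an openconfig-interfaces instance."""
    oc = interfaces_root()
    oc.interfaces.interface.add(name)
    iface = oc.interfaces.interface[name]
    iface.config.name = name
    iface.config.description = description
    iface.config.mtu = mtu
    for index, ip, plen in subifs:
        iface.subinterfaces.subinterface.add(index)
        subif = iface.subinterfaces.subinterface[index]
        subif.ipv4.address.add(ip)
        subif.ipv4.address[ip].config.prefix_length = plen
    return oc

# The XR example above, expressed as input to the mapping:
oc = map_interface("TenGigE0/4/2/0", "type=eth:cid=1042:remote=P2#Te0/0/0",
                   9188, [(4, "192.0.2.100", 31)])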

However, coming back to the key point of this post, I think we're making good progress with the OpenConfig models, and we are, of course, still iterating on them. I'm particularly keen to hear from more operators about their use cases, to ensure that OpenConfig is usable for as many folks as possible. Questions, comments, queries or other correspondence to the usual address (rjs@rob.sh).

Tagged in: Code, Tech, Work, ISP, IETF, python, OpenConfig, YANG

I noted that at NANOG64 this week in San Francisco, there are talks (both from Juniper) about both SPRING/Segment Routing and RSVP-TE. These are both protocols/technology approaches (since one can't really call SR a protocol) that I've been involved in the evolution of over the last few years. A question that I've been asked more times than I'd like is why we chose to look at a new approach (SR) rather than go with a technology that already exists, RSVP-TE.

The simple facts of the matter are that we aren't backing just one of these technologies - we have networks that run RSVP-TE today; and we have networks where we don't. To understand why one might want to consider either, we need to look at what the use cases for explicit paths are:

To allow bandwidth-aware routing of paths in the network: Where we want to place certain demands in the network according to the available resources on a certain link, it is obvious that someone needs to be aware of where demands are currently placed. To do this, that someone needs to maintain state for LSPs. In cases where the placement is relatively static, or needs global optimisation, then often those paths can be pre-computed, and provisioned onto the network. In other cases, the demand of those paths may have significant temporal variation (think applications that use auto-bandwidth) - and local optimisation of path placement may be OK.

In the former case (especially where we are concerned with global optimisation), one must rely on some element which is external to the ingress PE to calculate the path - and it stands to reason that this device must know about the placement of the existing paths in the network (or the utilisation of the links). At this point, there is very little value in maintaining reservation state on a per-hop basis - since the computing entity (usually an on- or offline PCE) has already done this. At this point, deploying RSVP-TE, and refreshing soft-state, doesn't make any sense -- it's simply work that the network is doing that doesn't help anybody. SR helps you place these demands on the network - and adds very little overhead in doing so.

In the second case, moving the path computation out of the head-end PE doesn't buy anything -- there is no better path computation happening if we are happy with local placement. Consider the case where we have N ECMPs between two different devices, and we simply want to fill them so that the bandwidth is equally shared across them. At this point, least-fill will do a very good job, without needing to have any external entity. In such cases, keeping the state in the network lets one achieve the particular application that is required - without any external machinery. To get the same effect for SR, an external stateful PCE would be required.

Simply - whether you need state in the network depends on your deployment model. If you do, then you probably want RSVP-TE. If that state doesn't add anything, then SR does you a bunch of favours.

Disjoint path placement: This is a use case that I have a lot of interest in. Two services need to be placed on the network where they have no shared fate. Again, it depends on the deployment architecture that you have as to how one might want to consider deploying such a case.

Where there is a need to consider SRLGs across multiple layers (e.g., shared fibre ducts, or the same subsea cable system), then it can quickly become impractical to encode all this information into the IGP extensions available. Equally, where more complex path routing requirements are needed ('in the core, these two services may not be in ducts within 3 kilometres of each other'), then it's simply not possible to encode the right information into the IGP - let alone implement the placement algorithm on the ingress LER (iLER). In other cases, the information needed to make the placement decision isn't available to the iLER, or it might not be possible to place a service with locally optimal routing - these cases particularly occur with path diversity, where two paths originate at different ingress LERs. These cases lend themselves very well to placement with SR -- one already has to maintain an element with global awareness, which must keep state (if A-B and C-D need to be diverse, then the computing entity needs to know where A-B is to place C-D, and needs to react to failures that impact the placement of A-B to ensure that it remains diverse from C-D) - so there's very little value in having state in the network as well as in this entity.

In other cases, where diverse services might start at the same iLER (e.g., path-protected FRR paths), RSVP-TE with XRO objects can suffice, and one can rely on the in-network state to ensure that paths are placed diversely, since the head-end has all the knowledge of the other services that are required. In this case, the state maintained in the network - and the single point of convergence for both paths - means that they can be re-placed where required, and RSVP-TE does a fine job of this.

Service paths rather than infrastructure paths: The work that I've shared previously concentrated on issues observed with RSVP-TE in networks with a full mesh of RSVP-TE LSPs (without diffserv TE). If one is considering such an architecture in an SP network today, then we are discussing 40,000 tunnels for a network of 200 PEs. If we consider DS-TE and an architecture with 3 different core classes, then 120,000. Whilst pinch points due to fibre routing between regions tend to drive up mid-point scale, we are still talking tens of thousands of LSPs for a single device. However, if we consider tunnels that need to be routed according to service demands, then a similar network with 200 PEs in it might support many, many more connections. On the order of thousands of services on an individual node (bear in mind that a 10-slot device can likely support 500+ edge ports) is not unheard of. At this point, having soft-state that might need to be resignalled during failures can be significantly painful -- and can result in message flooding loads that cause significant pain at pinch-point midpoints. At this point, taking the state away from the router CPU - and giving some means to add additional computational resource, or to schedule how LSPs are re-routed - is advantageous. SR lets you do this relatively easily, whereas RSVP-TE requires that we keep path-setup on-board with the network elements themselves. In these cases, RSVP-TE also consumes a label per service at each mid-point element, whereas SR has the nice property that the number of labels consumed per device is the number of devices in the network plus the number of local adjacencies - significantly lower than the number of midpoint LSPs that might traverse an individual device.
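
The back-of-the-envelope arithmetic behind those numbers is worth making explicit (the adjacency count here is an illustrative assumption):

# State comparison using the figures quoted above.
pes = 200
core_classes = 3

full_mesh_lsps = pes * (pes - 1)            # 39,800 - the "40,000 tunnels"
ds_te_lsps = full_mesh_lsps * core_classes  # 119,400 - the "120,000"

# SR label consumption per device: node SIDs plus local adjacency SIDs.
devices = 200
local_adjacencies = 20                      # illustrative assumption
sr_labels_per_device = devices + local_adjacencies  # 220

print(full_mesh_lsps, ds_te_lsps, sr_labels_per_device)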

Explicitly placed multicast: Multicast (from my perspective) is state in the network; we want to exploit the topology of the network to say that between some source and some destination, there is no real need to carry N copies of a certain packet - we can carry many fewer, and replicate only when we need to. To do this, we either need to carry the information about the topology of the network in the packet (it's not there right now...) or we need to let the network know something about the paths that are being carried over it. The latter is essentially what we do with P2MP RSVP-TE. We signal to the network that there are paths going over it, and determine where the right points to branch those are (based on S2L sub-LSPs). In this case, RSVP-TE gives us a way to exploit the topological information about the network that we already have in the IGP - to meet constraints, and efficiently deliver packets. SR has no real way to deliver this at all -- since it simply tries to avoid that state. Unless one has many thousands of multicast groups, keeping this state doesn't seem hugely problematic, so RSVP-TE continues to seem a very sane choice.

As with most new approaches, there is always a bunch of buzz that surrounds the new kid on the block - and there's a need to give it a push, such that solutions make it out into the real world. It seems that segment routing is getting there - and that's great news, and certainly something that we're intending to deploy. However, it's also given the folks deploying RSVP-TE a bit of a kick (that I must admit, we were struggling to motivate before) such that we're starting to see solutions to some of the problems that we saw in the real-world c.5 years ago emerge (see: draft-ravisingh-teas-rsvp-setup-retry and draft-beeram-mpls-rsvp-te-scaling-01).

I'm glad that we're not abandoning RSVP-TE. There are cases, such as those discussed above, where it is a better tool for the job than SR. However, there are cases where it's not well suited, and we need SR. Giving operators the choice to pick the right solution for their problem space is always a good thing to achieve. Some networks will roll SR, some will roll RSVP-TE. Some will roll both. As long as it's fixing problems, and the network is robust and operable, that's all good.

Either way, non-IGP placed paths have significant complexities in terms of operational model, and need consideration when one is used to the IGP-congruent world - one to discuss another day.

Tagged in: Work, MPLS, RSVP, MPLS_TE, IETF, SR

Both the Hawaii IETF meeting (IETF91) and the subsequent meeting we had a few weeks ago in Dallas were somewhat YANG-heavy. Following work to move towards YANG as a standard modelling language for network configuration, and the subsequent IESG statement effectively deprecating SNMP as the way that we present network configuration - the IETF, and especially the routing area, has dived head-first into YANG.

Indeed, I've been occupied somewhat with some really great collaborative work with a number of awesome engineers from Google, Microsoft, AT&T, Level3, Yahoo!, Facebook, Cox, Verizon and others on the OpenConfig initiative. We're trying to take an operator and use-case driven approach to developing YANG modules for both configuration and defining the schema for telemetry. This work has turned up a few times in the press, and I should probably write something separate about it in the near future.

However, one observation that a number of people have made is that there's really limited tooling available to work with YANG modules. We have (the rather excellent) pyang, which provides a validation tool for YANG modules, and the corresponding JNC plugin that creates Java classes -- but after that, options start to run pretty dry for what one might use, other than commercial products such as tail-f NCS. In some cases, the way that these modules work is also a bit esoteric, requiring quite a lot of care around what the YANG types are in the consuming code.

To drive adoption of YANG and NETCONF for making the network more programmable -- we need to make it easy to program the network. To this end, I started some work, with the aim of:
  • Automatically generating a set of Python classes from a YANG module, that hang together exactly as per the configuration hierarchy described in a YANG module - such that a developer can take the YANG modules that they get from a vendor, or ideally standardised like the OpenConfig models - and generate bindings that they can use in their application.
  • Ensuring that the generated bindings act like native types in Python -- it should not be that a developer needs to learn something specific to be able to use these modules - something that looks like a dict should work just like a normal dict - so methods like keys() and iterators just work transparently.
  • Supporting enough of the YANG types that real modules can be worked with. The aim is to provide something that at the very least works with the OpenConfig models -- but ideally more. 
After some hacking over the last week or so, I've got to a stage where I have a reasonably solid prototype of this code -- and I just wanted to show what might be possible with something like this (using the OpenConfig model).

Essentially, to generate your classes, one just uses pyang:


[~/code/openconfig-pyangbind/yang/bgp(master*)]
(22:10 - s002) corretto> pyang -p ../policy --plugindir ~/Code/pyangbind/btplugin -f bt -o oc_bgp.py bgp.yang bgp-multiprotocol.yang bgp-operational.yang bgp-types.yang

Following this, you end up with a module that can be directly consumed within a Python application:


[~/code/openconfig-pyangbind/yang/bgp(master*)]
(22:10 - s002) corretto> python

>>> from oc_bgp import bgp
>>> oc = bgp()

 
Then, referring to the OpenConfig BGP model, you can configure a peer - just as you'd do building any other data structure in Python:


>>> oc.bgp.global_.config.as_ = 2856
>>> oc.bgp.global_.config.router_id = "10.152.0.4"
>>>
>>> oc.bgp.neighbors.neighbor.add("192.168.1.2")
>>> oc.bgp.neighbors.neighbor["192.168.1.2"].config.peer_as = 5400
>>> oc.bgp.neighbors.neighbor["192.168.1.2"].config.description = "a fictional transit session"


Where there are restrictions imposed in the YANG model, these are also implemented in the Python classes, so if you try to deviate from the model, a set of Python errors is used to indicate this:


>>> oc.bgp.neighbors.neighbor["192.168.1.2"].config.peer_type = "An Invalid Value"
...
TypeError: peer_type must be INTERNAL or EXTERNAL
>>> oc.bgp.neighbors.neighbor["192.168.1.2"].config.peer_type = "EXTERNAL"



The tool also tracks what has changed from the initial values (which can be populated from any source) - and has an output that can be serialised in a fashion such that it could be used as input to a NETCONF or RESTCONF library to commit to a router:


>>> pp.pprint(oc.get(filter=True))
{   'bgp': {   'global': {   'config': {   'as': 2856,
                                           'router-id': '10.152.0.4'}},
               'neighbors': {   'neighbor': {   '192.168.1.2': {   'config': {   'description': 'a fictional transit session',
                                                                                 'peer-as': 5400,
                                                                                 'peer-type': 'EXTERNAL'},
                                                                   'neighbor-address': '192.168.1.2'}}}}}


Clearly, there are some baby steps happening here -- as such, this just gives the data structures that one might interact with to be able to build policy - but configuring peers for any platform using loops like the following is definitely something that starts to make programming the network easier from my perspective!


global_config = {"my_as": 2856}
peer_group_list = ["groupA", "groupB"]
peers = [("1.1.1.1", "groupA", 3741), ("1.1.1.2", "groupA", 5400),
         ("1.1.1.3", "groupA", 29636), ("2.2.2.2", "groupB", 12767)]

bgp = openconfig_bgp_juniper()
bgp.juniper_config.bgp.global_.as_ = global_config["my_as"]

for peer_group in peer_group_list:
    bgp.juniper_config.bgp.peer_group.add(peer_group)

for peer in peers:
    bgp.juniper_config.bgp.peer_group[peer[1]].neighbor.add(peer[0])
    bgp.juniper_config.bgp.peer_group[peer[1]].neighbor[peer[0]].peer_as = peer[2]
 
There's some work to go -- and as Dave Freedman and Ignas Bagdonas noted back at RIPE69, it'd be great to have some abstraction away from the base configuration. However, as long as one can express that higher-level abstraction in something that can be written in Python, it should be possible to transform from the abstracted view into the base configuration with a set of fairly simple transformation models (or templates)... more coming on that once I've committed the code.

Hopefully, I'll get to the stage where I can release this code to the wider world -- and encourage its use. The focus on the management plane of the network has been lacking for years - and we finally have a chance to be able to fix it.

I'll leave this post with a link to the talk that Anees Shaikh did at facebook's networking@scale event. Anees did an awesome job of explaining what we're trying to do with OpenConfig, and gives some cool insight into what the guys at Google are working on too:


 As usual -- thoughts/comments are very welcome to rjs@rob.sh :-) 
Tagged in: Code, IETF, SDN, python

March 2013 - December 2014 on TfL

Oyster usage data from TfL for my Oyster card, March 2013 - December 2014. Data visualisation is with D3.js - mouse-over a station to isolate journeys to and from that location. Larger version.

Tagged in: Code, London

After my presentation at UKNOF on SR, Mark Townsley asked me whether I'd be interested in presenting to his class at the École Polytechnique in Paris, around the thinking (from an ops perspective) of delivering the 5218 concept of "net positive value" through the SR technology, and how the existing protocols that are available might measure up against the criteria that 5218 gives us to consider. We managed to co-ordinate logistics, and I presented to INF566 on Wednesday afternoon, which was a really cool experience. It's always nice to see how networking is taught, and hear from students in such a high-ranking uni. I've included the slides below for posterity - Mark filmed the presentation, so perhaps there'll be video at some point in the future!

SPRING Forward(ing)
I recently gave a talk at UKNOF relating to Segment Routing/SPRING and the operational challenges that we are trying to resolve through it. You can see it on YouTube below - or the slides are on this site - SPRING Forward(ing) - UKNOF27

  
Almost Two Years On: Where is SDN?
Almost two years ago I wrote a post on this site entitled Some Initial Thoughts on the SDN. Clearly, since then the SDN concept has gained some more legs (and entered a new stage of the hype cycle) -- so, where are we right now?

Firstly, I think it's fair to say that the concept presented by Scott Shenker of having a single centralised computational element controlling COTS OpenFlow-speaking switches has fallen out of favour somewhat (based on the discussions I have had with other network architects, engineers, and implementors). Somewhat as predicted, there are real challenges with this approach within high-scale, distributed networks:
  • Survivability - by centralising a network controller, we suddenly introduce a single point at which centralised computation needs to be performed - which implies that the network controller has a real-time view of the network's state and infrastructure, and is able to react to changes to keep all paths working. As any operator of a network has observed, failure modes and the communication of failures even within a single node are not necessarily reliable - hence, removing the ability for nodes to act autonomously to calculate paths and observe path liveness seems a clear barrier to providing networks with the availability required of modern IP applications (e.g., linear TV and voice).
  • Scalability - whilst within 'steady state' operation a centralised controller is very likely to be able to keep up with processing requests for new paths and programming elements, this is unfortunately the "easy" part of the controller's job. From an operational perspective, there is a requirement to scale the control-plane such that it can deal with the worst-case failure within acceptable time bounds. When we consider failure modes that result in large numbers of paths needing to be recomputed and programmed, the scalability of the centralised model becomes very questionable. Centralising computation in this case negatively impacts scalability and network performance, rather than enhancing it.
One point that has been raised to me when I've expressed these thoughts is that transport networks have tended to use centralised computation for many years. However, this is not directly analogous to the SDN controller concept. Transport networks that rely on centralised computation tend to perform "set and forget" computation, where an A and a B path are programmed once and in-band OAM chooses which path is used. Should the A path fail, it is not recomputed - avoiding the challenge of needing to scale to large numbers of path computations, but resulting in worse survivability than an IP network.

The other fundamental challenge around the controller concept is the fact that networks of any scale are inherently inter-domain -- even the smallest networks I have worked in have utilised different domains to separate operational elements (e.g., confederations), and the medium and large ones have had multiple platforms, as well as legacy platforms that need to interoperate. 

However, clearly, these approaches might have applicability where one constrains the scope and scale of the network -- particularly, utilising this concept within closed datacentre environments might have some applicability (especially where global optimisation is desired). 

So -- if the centralised control-plane/COTS forwarding-plane looks somewhat shaky as a view of the "SDN", is there any future? My answer, yes, there definitely should be, but perhaps it won't be the revolution that was originally predicted, and in my personal opinion will be centred around two key concepts that we can take from the use cases that are being mooted for "SDN":
  • Network programmability - one of the frustrations being aired through SDN is how hard it is to interact with the network in order to make it more dynamic. Looking at the datacentre use case, how much of this would be a non-issue if the interfaces through which edge devices are programmed weren't somewhat clunky (CLI-based screen-scraping...) or very non-standard (SNMP MIBs tend to be the least "standard" standards)? This is a traditional SP problem too -- what would be called orchestration within the datacentre context is really just provisioning of new services, or sub-elements of services. A movement towards "SDN" concepts giving us better external programmability of the network would be advantageous to network operation, without requiring large amounts of infrastructure to be removed from the network (a business case that never really stacks up). Starting with extending existing services (e.g., provisioning of forwarding paths through technologies like PCE), or adding new ephemeral state to devices (really extending the on-demand provisioning achievable through RADIUS for subscriber management interfaces to be more general, and not just at authentication time) would give these kinds of wins, and start to tease out more use cases where better orchestration/more dynamic provisioning of the network enhances service capabilities (see the sketch after this list).
  • Global optimisation/orchestration - a few years ago (wow, 4 years ago!) I wrote something around Visualising MPLS-TE Networks, reflecting on the means by which TE-LSP placement and management could be achieved through off-line tools. MPLS-TE is one of those cases where it is possible to achieve some level of global optimisation of resource utilisation (such that we consider forwarding paths on a global network view, rather than having each individual network element be greedy when selecting paths). Whilst this behaviour is not always of utility, for a subset of services such overall optimisation is an advantage - yet SPs cannot really use this today. My feeling is that, with the work that we're doing on Segment Routing in the IETF, if we can solve one of the key issues with RSVP-TE (the fact that large amounts of mid-point state are not conducive to simple mid-point devices, and cause scaling issues during large network events), then the idea of having global controllers that are able to select more optimal (non-SPT) forwarding paths, or stitch multiple forwarding paths together, is something that we can exploit. Again, it seems to me that starting by exploiting some of the path calculation tools that we've used before (PCE again!) would give us a way to derive some of the benefits of resource-aware path placement, which may be globally computed, where we require it - exploiting a hybrid centralised and distributed control-plane for the network. If we develop this approach, and it is adopted in SP networks, then the next logical step is to consider non-forwarding resource utilisation within the network, to provide more globally efficient utilisation of these functions, and reduce overall unit cost.
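To make the programmability point concrete, here's a minimal sketch using the ncclient Python library to push a configuration change over NETCONF, rather than screen-scraping a CLI. The device details are placeholders, the config snippet is just an example (using the IETF interfaces model), and it assumes the device supports the candidate datastore:

  from ncclient import manager

  # example configuration snippet -- any model the device supports would do
  CONFIG = """
  <config>
    <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
      <interface>
        <name>ge-0/0/0</name>
        <description>programmatically provisioned</description>
      </interface>
    </interfaces>
  </config>
  """

  # placeholder host/credentials -- substitute a real device
  with manager.connect(host="192.0.2.1", port=830, username="oper",
                       password="secret", hostkey_verify=False) as m:
      m.edit_config(target="candidate", config=CONFIG)
      m.commit()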
Both of these concepts really result in more dynamic networks, which consider overall resource utilisation and efficiency to a greater extent. They're not new ideas -- but if SDN means that they are re-examined, such that the way we instantiate them within the network is thought about again, then perhaps it gives us a good way forward to increase the efficiency of networks and hence realise some economic benefits (the primary motivator for technology change). Better still (for operators), this could be achievable by evolving the current infrastructure, without requiring wholesale changes in infrastructure and operational capabilities (albeit the chosen evolution path may open the door to larger changes in subsequent investment cycles).

I'm sure some will think that I am being overly pragmatic - and possibly even that I'm giving "SDN" too much credit. What I'd like to see is ways that we can use new technologies that are realisable, and that either enhance the quality of services delivered to users of networks, or simplify or reduce the cost of the infrastructure operated by SPs. To get there, we need evolution rather than revolution - whilst a coup d'état might be exciting at the time, such revolutions are often bloody, and result in a degradation of experience which I don't feel operators can afford, or service consumers will tolerate.

 
Tagged in: Work, MPLS_TE, SDN
Speed of Internet Innovation.
A question that came up at an event I was at yesterday: how will the time between the first (commercial) deployment of a telephony service and a regulated universal service obligation (USO) for telephony compare to the time between the first (commercial) Internet services being deployed and a USO for IP connectivity (e.g., broadband)?

Based on this, is the cycle time of the telephony regulatory bodies, and mechanisms through which changes are implemented within these bodies suitable for Internet services?

Answers on a postcard please. 
Tagged in: Blitherings, Tech, Thoughts
