Reimagining Network Devices
OpenConfig · Tech · YANG
Almost 10 years ago, there was a shift in the IP networking industry. The move towards SDN, and its adoption by hyperscalers as a means to break apart traditional network architectures, had set the scene for disruption. The question of “how are we using SDN?” was on the lips of vendor and telco executives, and led to many initiatives in the industry, both those that can be thought of as “SDN” and those that were more incremental.
Looking back, it is easy to look at the litany of failed projects and abandoned open source initiatives, and conclude that “SDN” did not change the industry outside of the private infrastructure of a few large networks. After all, there are still many networks running traditional routers with no control- and data-plane separation. However, taking this narrow view (in this author’s opinion) rather underappreciates the seismic shift that this trend has quietly driven incrementally.
Almost 5 years ago, at Google, I wrote an internal paper that painted a vision (drawing from work we were already doing) of network devices that could be treated, and managed, like other infrastructure. This didn’t just involve having “network automation” that could drive these devices; it fundamentally impacted the devices that make up those networks. The vision we had was to have network devices that could be thought of like other microservices in our ecosystem: deployable units with well-defined, machine-driven APIs, which we could use to describe their functionality, and, critically, with automated test suites that validated that functionality, along with a means to measure test coverage. We envisaged that this approach would allow us to reduce our cycle time for qualifying new devices, ensure repeatable velocity in the evolution of those devices, and open new avenues to improve our operational capabilities. For example, we imagined being able to emulate those devices with decomposable reference implementations, rather than needing to use monolithic NOS images, simplifying the job of developing services that interact with those devices whilst retaining complete fidelity with the real devices in the network.
We knew at the time that this ecosystem was still nascent. We had already spent significant effort developing the first piece of this puzzle: a set of data models that describe the configuration and telemetry for network devices, essentially the core API contents. We had made good strides in driving the adoption of gRPC on network devices, a huge move away from the previous state of the art of XML-over-SSH transports, but this had been limited to “streaming telemetry”. We knew the task of reaching this much broader vision was gargantuan. We also made it more challenging for ourselves by making optionality a key premise of our approach: we wanted to drive an ecosystem that wasn’t solely limited to open source products, but could be procured from third parties, allowing the industry (and ourselves) to make build-vs.-buy decisions on a case-by-case basis (for instance, disaggregating for some parts of the network, e.g., some TOR deployments, and buying integrated solutions, e.g., chassis routers, in others).
Now, in Q1 2023, we can reflect and say, with multiple production deployments across optical, wireless, and L2/L3 switching/routing in multiple networks around the world, that we have made some major steps forward with this vision. We’ve even gone beyond some of our original vision, taking some of the concepts that (for example) the I2RS IETF working group proposed in 2012 and extending our approach of clearly-defined, programmatic APIs beyond the management-plane and into the control-plane.
This work has been an incredible collaboration across the industry, one that I have been engaged in at three separate employers (BT, Jive Communications, and now Google). It has engaged smart, motivated engineers within the operator community, within network equipment vendors, and among independent software developers who were interested in the tooling that came with it. It’s fair to say that the mission isn’t complete; there continue to be parts of the ecosystem that aren’t optimal, or that carry significant complexity. But it seems clear that the “softwarification” of network devices (prompted by the “SDN” revolution) has made major steps forward over the last decade.
The ecosystem: an illustration
Much of what is mentioned above is covered by a previous post I made here, but some elements have emerged since. To illustrate the approach we’ve taken, let’s walk through the lifecycle of a new device in this changed ecosystem.
- We decide that we have a new networking device requirement, with a corresponding set of features (both software and hardware). In order to define our requirements, we choose a set of feature profiles, which describe the API surface area that we want to consume. These feature profiles cover what you might think of as APIs to network devices, namely control- and management-plane protocols and interfaces, but also some that are less obvious, such as Ethernet ports (after all, the public API of a networking device includes the protocols that define communication with its ports). The vendor-neutral APIs (in OpenConfig) give us a lingua franca to describe these features and APIs across different network deployments, and critically, in a way that is abstracted from the underlying implementation. Using descriptions of the configuration and telemetry surface area is also significantly more granular than traditional RFP approaches, which themselves create ambiguity (e.g., which part of RFC 4271 did you mean needed to be supported?).
- We take an implementation of a network device, and start to measure its compliance with those requirements. The feature profile gives us a set of tests (“functional tests”) which ensure not only that the subset of the API that we described is present, but also that it is behaviourally compliant with what we expect. Here, we can leverage a network-centric test framework (ONDATRA) as the means to drive these tests, both handling complexities such as mapping to different implementations (virtual vs. physical, or different vendors), and providing the primitives for us to develop new tests with functionality-centric libraries for accessing configuration and telemetry (powered by code generation from our APIs through ygnmi and ygot); a sketch of such a test follows this list. Where we develop new tests, these become part of a public library of API compliance tests in featureprofiles, which today covers base router functions, configuration and telemetry (through gNMI), operational procedures (through gNOI), traditional routing protocols, and off-device control and forwarding APIs (P4Runtime and gRIBI, for example). These tests can be run both within the operator’s environment and by the vendor, meaning that rather than requiring lengthy responses to RFPs, we can simply demonstrate device suitability for a particular deployment through passing a specific set of test cases.
- As we proceed through our qualification work, we can leverage the open source ecosystem to reduce the cost of our testing. OTG gives us a container- and hardware-based “reference client” and packet generator; KNE gives us a means to easily spin up complex topologies using virtual implementations, in environments ranging from developers’ workstations to k8s in Cloud environments; and if we are developing new functionality, we can quickly prototype it in lemming, allowing us to shift to test-driven development for real implementations, with tests that we know are compliant with the functionality that we expect.
- The suite of integration tests that we build, against both small single-DUT setups and larger topologies, and across physical and virtual environments, becomes part of the nightly CI/CD that we run, along with the functional tests in featureprofiles, to qualify new builds of our software and hardware as the ecosystem progresses. We can use the definitions of the APIs that feature profiles provide to measure what our tests should cover (the configuration, telemetry, and API parameters), and to report test coverage against the external APIs of devices (as a percentage of the paths tested, just as we would report percentage of LOC in traditional test coverage), which itself implies coverage of the internal functionality.
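To make the functional-test step above concrete, here is a minimal sketch of a featureprofiles-style ONDATRA test, written in Go. It assumes a testbed binding (which maps the abstract “dut” and “port1” names to a physical or virtual device, e.g., via KNE) is configured elsewhere; the test name and the specific OpenConfig leaf chosen are illustrative, not taken from a particular featureprofiles test.

```go
package example_test

import (
	"testing"
	"time"

	"github.com/openconfig/ondatra"
	"github.com/openconfig/ondatra/gnmi"
	"github.com/openconfig/ondatra/gnmi/oc"
)

// A real test binary would initialise ONDATRA with a concrete binding in
// TestMain, e.g. ondatra.RunTests(m, newBinding); the binding is
// environment-specific and elided here.

// TestInterfaceEnabled pushes configuration over gNMI and then awaits the
// corresponding telemetry, using strongly-typed paths generated from the
// OpenConfig models (via ygot/ygnmi).
func TestInterfaceEnabled(t *testing.T) {
	dut := ondatra.DUT(t, "dut")  // the device under test, resolved by the binding
	port := dut.Port(t, "port1")  // abstract port name, mapped to a real interface

	// Configuration: set /interfaces/interface/config/enabled = true.
	gnmi.Replace(t, dut, gnmi.OC().Interface(port.Name()).Enabled().Config(), true)

	// Telemetry: await /interfaces/interface/state/oper-status = UP.
	gnmi.Await(t, dut, gnmi.OC().Interface(port.Name()).OperStatus().State(),
		time.Minute, oc.Interface_OperStatus_UP)
}
```

Because the same abstract testbed can be bound to a vendor device, a virtual NOS in KNE, or lemming, the identical test exercises each implementation, which is what makes the compliance results comparable across them.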
This approach is one that we’ve been adopting as it has been built, and one around which we have been driving industry collaboration. Multiple vendors and operators have contributed to almost all of the projects above. It has been used for internally-developed services, as well as for those that are open source or solely third-party developed. The fact that we’re in a position where each part of this story can be backed by a link to a real project that realises the functionality is testament to the progress we’ve made.
The progress is also testament to an amazing set of engineers, whose usernames, email addresses, and GitHub handles you can find all over the projects that are linked to in this post. It has been an amazing experience watching this collaboration quietly come together and change the industry through incremental progress, and it’s been a privilege to be so close to so much of it.
Next Steps
Traditionally, control- and management-plane interfaces have been developed tightly coupled to their applications: think of the BGP SR-TE SAFI, or BMP. These technologies are tightly coupled to their respective data models and use cases: a specific TE approach in the case of SR-TE, and transporting BGP messages in the case of BMP. The work discussed above makes a significant change to this approach. It decouples the interface to the device from the exact application that is being implemented.
For example, gRIBI uses the OpenConfig abstract forwarding table (AFT) model as its data model, a generic description of the fundamental contents of a RIB. This allows different applications (all of which, at the end of the day, need to be expressed as RIB entries) to be implemented on top of these features. (Note, a similar approach was true of OpenFlow, but it has not otherwise been pursued in the protocol space.) These re-usable concepts allow for much more flexibility in the implementation of applications: for example, new telemetry (including data models that are not OpenConfig) can be transported over gNMI, and gRIBI can be used to build different routing applications that require an API to the RIB. This leaves a significant amount of room for innovation, and for flexible development; a sketch of such a gRIBI interaction follows.
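As an illustration of that decoupling, here is a minimal sketch of injecting a route through gRIBI using the gribigo fluent client (which is test-oriented, hence the testing.T). The target address, network-instance name, and prefixes are assumptions made for this example, not values from the projects above.

```go
package gribi_example_test

import (
	"context"
	"testing"

	"github.com/openconfig/gribigo/fluent"
)

// TestInjectIPv4Route programs a next-hop, a next-hop group, and an IPv4
// prefix into the device's RIB via gRIBI. Any application whose output can
// be expressed as AFT entries can reuse this same generic API.
func TestInjectIPv4Route(t *testing.T) {
	ctx := context.Background()

	c := fluent.NewClient()
	c.Connection().
		WithTarget("dut.example.com:9340"). // assumed gRIBI endpoint
		WithRedundancyMode(fluent.ElectedPrimaryClient).
		WithInitialElectionID(1, 0).
		WithPersistence()

	c.Start(ctx, t)
	defer c.Stop(t)
	c.StartSending(ctx, t)

	// Express the application's intent as generic AFT (RIB) entries.
	c.Modify().AddEntry(t,
		fluent.NextHopEntry().
			WithNetworkInstance("DEFAULT").
			WithIndex(1).
			WithIPAddress("203.0.113.1"),
		fluent.NextHopGroupEntry().
			WithNetworkInstance("DEFAULT").
			WithID(1).
			AddNextHop(1, 1),
		fluent.IPv4Entry().
			WithNetworkInstance("DEFAULT").
			WithPrefix("198.51.100.0/24").
			WithNextHopGroup(1),
	)

	// Block until the server has acknowledged the pending operations.
	if err := c.Await(ctx, t); err != nil {
		t.Fatalf("gRIBI entries were not acknowledged: %v", err)
	}
}
```

The design point here is that the entries above are the AFT model’s generic next-hop, next-hop-group, and prefix abstractions: a TE controller, a load-balancing service, and a failover application can all speak this same vocabulary, rather than each defining a bespoke protocol.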
Going forward, as network architectures continue to evolve, my belief is that we will see new applications adopt these APIs and build novel infrastructure on top of them, in a way that allows new architectures to be realised more flexibly, and more quickly.