Reimagining Network Devices

OpenConfig · Tech · YANG
Almost 10 years ago, there was a shift in the IP networking industry. The move towards SDN, and its adoption by hyperscalers as a means to break apart traditional network architectures, had set the scene for disruption. The question of “how are we using SDN?” was on the lips of vendor and telco executives — and led to many initiatives in the industry, both those that can be thought of as “SDN” and those that were more incremental.
Looking back, it is easy to survey the litany of failed projects and abandoned open source initiatives and conclude that “SDN” did not change the industry outside of the private infrastructure of a few large networks. After all, there are still many networks running traditional routers with no control- and data-plane separation. However, taking this narrow view (in this author’s opinion) rather underappreciates the seismic shift that this trend has quietly driven incrementally.
Almost 5 years ago, at Google, I wrote an internal paper that painted a vision (drawing from work we were already doing) of network devices that could be treated like, and managed like, other infrastructure. This didn’t just involve having “network automation” able to drive these devices, but fundamentally impacted the devices that make up those networks. The vision we had was to be able to have network devices that could be thought of like other microservices in our ecosystem: consider them deployable units that had well-defined, machine-driven APIs, which we could use to describe their functionality, and — critically — have automated test suites that validated their functionality, with a means to measure that test coverage. We envisaged that this approach would allow us to reduce our cycle time for qualification of new devices, ensure that we had repeatable velocity of evolution of those devices, and open new avenues to improve our operational capabilities. For example, we imagined being able to emulate those devices with decomposable reference implementations, rather than needing to use monolithic NOS images, allowing us to simplify the job of developing services that interact with those devices — whilst having complete fidelity with the real devices in the network.
We knew at the time that this ecosystem was still nascent. We had already spent significant effort developing the first piece of this puzzle — a set of data models that describe the configuration and telemetry for network devices: essentially, the core API contents. We had made good strides forward in driving the adoption of gRPC on network devices — a huge move away from XML-over-SSH being the state-of-the-art transport — but this had been limited to “streaming telemetry”. We knew the task of reaching this much broader vision was gargantuan. We also made it more challenging for ourselves by driving optionality as a key premise of our approach — we wanted to drive an ecosystem that wasn’t solely limited to open source products, but could be procured from third parties — allowing the industry (and ourselves) to make build-vs.-buy decisions on a case-by-case basis (for instance, disaggregating for some parts of the network — e.g., some TOR deployments — and buying integrated solutions — e.g., chassis routers — in others).
Now, in Q1 2023, we can reflect and say — with multiple production deployments across optical, wireless, and L2/L3 switching/routing in multiple networks around the world — that we have made some major steps forward with this vision. We’ve even gone beyond some of our original vision — taking some of the concepts that (for example) the i2rs IETF working group proposed in 2012, and extending our approach of clearly-defined, programmatic APIs beyond the management-plane and into the control-plane.
This work has been an incredible collaboration across the industry — one that I have been engaged in at three separate employers (BT, Jive Communications, and now Google). It has engaged smart, motivated engineers within the operator community and network equipment vendors, as well as independent software developers who were interested in the tooling that came with it. It’s fair to say that the mission isn’t complete — there continue to be parts of the ecosystem that aren’t optimal, or that carry significant complexity — but it seems clear that the softwarification of network devices (prompted by the “SDN” revolution) has made major steps forward over the last decade.
The ecosystem: an illustration
Much of what is mentioned above is covered by a previous post I made here — but some of it has emerged since. To illustrate the approach we’ve taken, let’s walk through the lifecycle of a new device in this changed ecosystem.
- We decide that we have a new networking device requirement, which has a corresponding set of features (both software and hardware). In order to define our requirements, we choose a set of feature profiles — which describe the API surface area that we want to consume. These feature profiles cover what you might think of as APIs to network devices — namely, control- and management-plane protocols and interfaces — but also some that are less obvious, such as Ethernet ports (after all, the public API of a networking device includes the protocols that define communication with its ports). The vendor-neutral APIs (in OpenConfig) give us a lingua franca to describe these features and APIs across different network deployments, and critically in a way that is abstracted from the underlying implementation. Describing the configuration and telemetry surface area is also significantly more granular than traditional RFP approaches, which themselves create ambiguity (e.g., exactly which parts of RFC 4271 were you referring to needing to be supported?).
- We take an implementation of a network device, and start to measure its compliance with those requirements. The feature profile gives us a set of tests (“functional tests”) which ensure not only that the subset of the API we described is present, but also that it is behaviourally compliant with what we expect. Here, we can leverage a network-centric test framework (ONDATRA) as the means to drive these tests — both handling complexities such as mapping to different implementations (virtual vs. physical, or different vendors), and providing the primitives for us to develop new tests with functionality-centric libraries for accessing configuration and telemetry (powered by code generation from our APIs through ygnmi and ygot). Where we develop new tests, these become part of a public library of API compliance tests in featureprofiles — which today covers base router functions, configuration and telemetry (through [gNMI](https://github.com/openconfig/gnmi)), operational procedures (through gNOI), traditional routing protocols, and off-device control and forwarding APIs (P4Runtime and gRIBI, for example). These tests can be run both within the operator’s environment and by the vendor — meaning that rather than requiring lengthy responses to RFPs, we can simply demonstrate device suitability for a particular deployment by passing a specific set of test cases.
- As we proceed through our qualification work, we can leverage the open source ecosystem to reduce the cost of our testing. OTG gives us container- and hardware-based “reference client” and packet generator implementations; KNE gives us a means to easily spin up complex topologies using virtual implementations, in environments ranging from developers’ workstations to Kubernetes in cloud environments; and if we are developing new functionality we can quickly prototype it in lemming — allowing us to shift to test-driven development for real implementations, with tests that we know are compliant with the functionality that we expect.
- The suite of integration tests we build — against both small single-DUT and larger topologies, and across physical and virtual environments — becomes part of the nightly CI/CD that we run, along with the functional tests in featureprofiles, to qualify new builds of our software and hardware as the ecosystem progresses. We can use the definitions of the APIs that feature profiles provide to measure what our tests should cover (the configuration, telemetry and API parameters), and to report test coverage (as a percentage of the paths tested, just as we would report a percentage of lines of code in traditional test coverage) against the external APIs of devices — which themselves imply coverage of the internal functionality.
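To make the first step concrete: a feature profile is, in essence, a named list of the configuration and telemetry paths a deployment consumes. The fragment below is a sketch in the style of the feature.textproto files in the featureprofiles repository — the field names and paths here are illustrative assumptions, not a definitive profile.

```
# Hypothetical feature profile fragment; field names follow the
# feature.textproto convention, paths are illustrative.
id {
  name: "interface_base"
  version: 1
}
config_path {
  path: "/interfaces/interface/config/description"
}
config_path {
  path: "/interfaces/interface/config/enabled"
}
telemetry_path {
  path: "/interfaces/interface/state/oper-status"
}
telemetry_path {
  path: "/interfaces/interface/state/counters/in-octets"
}
```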
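For the virtual-topology step, KNE consumes a textproto description of nodes and links. The fragment below sketches a two-node topology in that spirit — the field names, vendor labels, and interface names are illustrative assumptions rather than a validated KNE input.

```
# Hypothetical KNE-style topology; field names and vendor labels are
# placeholders for illustration.
name: "two-node"
nodes {
  name: "r1"
  vendor: VENDOR_A
}
nodes {
  name: "r2"
  vendor: VENDOR_B
}
links {
  a_node: "r1"
  a_int: "eth1"
  z_node: "r2"
  z_int: "eth1"
}
```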
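The behavioural aspect of the functional tests in the second step can be sketched without the real harness: push configuration, then assert the telemetry the model promises. In the Go sketch below a fake device stands in for what ONDATRA would bind to a real or virtual DUT over gNMI — the `Device` interface, paths, and modelled behaviour are assumptions for illustration, not the ONDATRA API.

```go
package main

import (
	"errors"
	"fmt"
)

// Device abstracts the config/telemetry access a functional test needs.
// In practice a test framework binds this to a real or virtual DUT; here a
// fake implementation stands in so the sketch is self-contained.
type Device interface {
	Set(path, val string) error
	Get(path string) (string, error)
}

type fakeDUT struct{ store map[string]string }

func (d *fakeDUT) Set(path, val string) error {
	d.store[path] = val
	// Model the device behaviour under test: enabling the port brings it up.
	if path == "/interfaces/interface[name=eth0]/config/enabled" && val == "true" {
		d.store["/interfaces/interface[name=eth0]/state/oper-status"] = "UP"
	}
	return nil
}

func (d *fakeDUT) Get(path string) (string, error) {
	v, ok := d.store[path]
	if !ok {
		return "", errors.New("path not present: " + path)
	}
	return v, nil
}

// testInterfaceEnable has the shape of a behavioural compliance check: it
// pushes config, then asserts the telemetry the model says must follow.
func testInterfaceEnable(d Device) error {
	if err := d.Set("/interfaces/interface[name=eth0]/config/enabled", "true"); err != nil {
		return err
	}
	got, err := d.Get("/interfaces/interface[name=eth0]/state/oper-status")
	if err != nil {
		return err
	}
	if got != "UP" {
		return fmt.Errorf("oper-status: got %q, want UP", got)
	}
	return nil
}

func main() {
	dut := &fakeDUT{store: map[string]string{}}
	fmt.Println("testInterfaceEnable:", testInterfaceEnable(dut))
}
```

The point of the shape is that a device merely accepting the leaf is not enough — the test fails unless the telemetry side of the contract is also honoured.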
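The coverage measurement in the final step reduces to set arithmetic over leaf paths: the share of modelled paths exercised by at least one test, standing in for lines-of-code coverage. A minimal Go sketch, with illustrative path sets:

```go
package main

import "fmt"

// coverage reports the share of modelled leaf paths exercised by at least
// one test — "percentage of paths tested" in place of LOC coverage.
func coverage(modelled []string, exercised map[string]bool) float64 {
	if len(modelled) == 0 {
		return 0
	}
	hit := 0
	for _, p := range modelled {
		if exercised[p] {
			hit++
		}
	}
	return 100 * float64(hit) / float64(len(modelled))
}

func main() {
	// Paths the feature profile models (illustrative).
	modelled := []string{
		"/interfaces/interface/config/enabled",
		"/interfaces/interface/config/description",
		"/interfaces/interface/state/oper-status",
		"/interfaces/interface/state/counters/in-octets",
	}
	// Union of paths touched across the test suite (illustrative).
	exercised := map[string]bool{
		"/interfaces/interface/config/enabled":           true,
		"/interfaces/interface/state/oper-status":        true,
		"/interfaces/interface/state/counters/in-octets": true,
	}
	fmt.Printf("path coverage: %.0f%%\n", coverage(modelled, exercised))
}
```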
This approach is one that we’ve been adopting as it has been built — and driving industry collaboration on. Multiple vendors and operators have contributed to almost all of the projects above. It has been used both for internally-developed services, as well as those that are open source, or solely third-party developed. The fact that we’re in a position where each of the parts of this story can be backed by a link to a real project that realises the functionality is testament to the progress we’ve made.
The progress is also testament to an amazing set of engineers, whose usernames, email addresses, and GitHub handles you can find all over the projects that are linked to in this post. It has been an amazing experience watching this collaboration quietly come together, and change the industry with the incremental progress that has been made, and it’s been a privilege to have been able to be so close to so much of it.
Decoupling interfaces from applications
Traditionally, control- and management-plane interfaces have been developed tightly coupled to their applications — think of the BGP SR-TE SAFI, or BMP. These technologies are tightly coupled to their respective data models and use cases: a specific TE approach for SR-TE, and transporting BGP datagrams for BMP. The work discussed above makes a significant change to this approach: it decouples the interface to the device from the exact application being implemented.
For example, gRIBI uses the OpenConfig abstract forwarding table (AFT) model as its data model — a generic description of the fundamentals of RIBs. This allows different applications (all of which need to be expressed as RIB entries at the end of the day) to be implemented on top of the same features. (Note that a similar approach is true of OpenFlow, but it has not been pursued in the protocol space.) These re-usable concepts allow much more flexibility in implementing applications — for example, new telemetry (including data models that are not OpenConfig) can be transported over gNMI, and gRIBI can be used to build different routing applications that require an API to the RIB. This leaves significant room for innovation — and flexible development.
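The AFT-style layering that gRIBI programs can be sketched as three small maps: prefixes point at next-hop groups, which reference weighted next-hops. The Go sketch below is a toy model of that structure — the IDs, prefixes, and addresses are illustrative, and it is not the gRIBI API itself.

```go
package main

import "fmt"

// NextHop is a resolvable forwarding target.
type NextHop struct{ Address string }

// NextHopGroup weights traffic across a set of next-hops (next-hop ID -> weight).
type NextHopGroup struct {
	Hops map[uint64]uint64
}

// RIB holds the three AFT-style layers: next-hops, groups, and prefix entries.
type RIB struct {
	NextHops      map[uint64]NextHop
	NextHopGroups map[uint64]NextHopGroup
	IPv4          map[string]uint64 // prefix -> next-hop-group ID
}

// Resolve walks prefix -> group -> next-hop addresses, as a lookup would;
// it returns nil for an unknown prefix.
func (r *RIB) Resolve(prefix string) []string {
	nhgID, ok := r.IPv4[prefix]
	if !ok {
		return nil
	}
	var out []string
	for nhID := range r.NextHopGroups[nhgID].Hops {
		out = append(out, r.NextHops[nhID].Address)
	}
	return out
}

func main() {
	r := &RIB{
		NextHops:      map[uint64]NextHop{1: {Address: "203.0.113.1"}},
		NextHopGroups: map[uint64]NextHopGroup{10: {Hops: map[uint64]uint64{1: 1}}},
		IPv4:          map[string]uint64{"192.0.2.0/24": 10},
	}
	fmt.Println(r.Resolve("192.0.2.0/24"))
}
```

Because the layering is generic rather than tied to one use case, very different applications — traffic engineering, fast failover, service routing — can all be expressed as writes into the same three layers.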
Going forward, as network architectures continue to evolve, my belief is that we will see new applications adopt these APIs and build novel infrastructure on top of them — allowing new architectures to be realised more flexibly, and more quickly.