Almost Two Years On: Where is SDN?

2013-08-10 imported Work · MPLS_TE · SDN

Almost two years ago I wrote a post on this site entitled Some Initial Thoughts on the SDN. Clearly, since then the SDN concept has gained some more legs (and entered a new stage of the hype cycle) - so, where are we right now?

Firstly, I think it’s fair to say that the concept presented by Scott Shenker of having a single centralised computational element controlling COTS OpenFlow-speaking switches has fallen out of favour somewhat (based on the discussions with other network architects, engineers, and implementors that I have had). Somewhat as predicted, there are real challenges with this approach within high-scale, distributed networks:

Survivability - through centralising a network controller, suddenly we introduce a single point at which centralised computation needs to be performed - which implies that the network controller has a real-time view of the network’s state and infrastructure, and is able to react to changes to keep all paths working. As any operator of a network has observed, the detection and communication of failures, even within a single node, are not necessarily reliable - hence, removing the ability for nodes to act autonomously to calculate paths and observe path liveness seems a clear barrier to providing networks with the availability required of modern IP applications (e.g., linear TV and voice).
Scalability - whilst within ‘steady state’ operation, a centralised controller is very likely to be able to keep up with processing requests for new paths, and programming elements, this is unfortunately the "easy" part of the controller’s job. From an operational perspective, there is a requirement to scale the control-plane such that it can deal with the worst case failure within acceptable time bounds. When we consider failure modes that result in large numbers of paths needing to be recomputed and programmed, then the scalability of the centralised model becomes very questionable. Centralising computation in this case negatively impacts scalability and network performance, rather than enhancing it.

One point that has been raised to me when I’ve expressed these thoughts is that transport networks have tended to use centralised computation for many years. However, this is not directly analogous to the SDN controller concept. Transport networks that rely on centralised computation tend to perform "set and forget" computation, where an A and a B path are programmed once and in-band OAM chooses which path is used. Should the A path fail, it is not recomputed; this avoids the challenge of needing to scale to large numbers of path computations, but results in worse survivability than an IP network.
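To make the "set and forget" behaviour concrete, here is a minimal Python sketch. The `ProtectedCircuit` class, the node names and the `active_path` helper are all hypothetical, invented purely for illustration; they do not model any particular transport platform or OAM protocol. The point is only that the two paths are fixed at provisioning time and nothing is ever recomputed.

```python
from dataclasses import dataclass

# Hypothetical model: an undirected link is a frozenset of its two end nodes.
Link = frozenset


def links_of(path):
    """Return the set of links traversed by a path given as a tuple of nodes."""
    return {Link(pair) for pair in zip(path, path[1:])}


@dataclass(frozen=True)
class ProtectedCircuit:
    """A circuit whose A and B paths are computed once, centrally, then left alone."""
    path_a: tuple
    path_b: tuple

    def active_path(self, failed_links):
        """Stand-in for in-band OAM: use A if intact, else B if intact, else nothing.

        No path is recomputed here, which is the "set and forget" property.
        """
        for path in (self.path_a, self.path_b):
            if not (links_of(path) & failed_links):
                return path
        return None


# Illustrative node names only.
circuit = ProtectedCircuit(path_a=("PE1", "P1", "PE2"),
                           path_b=("PE1", "P2", "PE2"))
```

With no failures, `circuit.active_path(set())` returns the A path; with the PE1-P1 link failed it returns the B path; and once both paths are broken it returns `None`, even if some third route through the network still exists. That last case is the survivability trade-off: an IP network would typically reconverge onto whatever path remains.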

The other fundamental challenge around the controller concept is the fact that networks of any scale are inherently inter-domain – even the smallest networks I have worked in have utilised different domains to separate operational elements (e.g., confederations), and the medium and large ones have had multiple platforms, as well as legacy platforms that need to interoperate.

However, clearly, these approaches might have applicability where one constrains the scope and scale of the network – particularly, utilising this concept within closed datacentre environments might have some applicability (especially where global optimisation is desired).

So – if the centralised control-plane/COTS forwarding-plane looks somewhat shaky as a view of the "SDN", is there any future? My answer is yes, there definitely should be, but perhaps it won’t be the revolution that was originally predicted. In my personal opinion, it will be centred around two key concepts that we can take from the use cases that are being mooted for "SDN":

Network programmability - one of the frustrations that is being aired through SDN is the way that it is hard to interact with the network in order for it to be more dynamic. Looking at the datacentre use case, how much of this would be a non-issue if the interfaces through which configuration of edge devices is programmed weren’t somewhat clunky (CLI-based screen-scraping…) or very non-standard (SNMP MIBs tend to be the least "standard" standards)? This is a traditional SP problem too – what would be called orchestration within the datacentre context is really just provisioning of new services, or sub-elements of services. A movement towards "SDN" concepts giving us better external programmability of the network would be advantageous to network operation, without requiring large amounts of infrastructure to be removed from the network (a business case that never really stacks up). Starting with extending existing services (e.g., provisioning of forwarding paths through technologies like PCE), or adding new ephemeral state to devices (really extending the on-demand provisioning achievable through RADIUS for subscriber management interfaces to be more general, and not just at authentication time) would give these kinds of wins, and start to tease out more use cases where better orchestration/more dynamic provisioning of the network enhances service capabilities.
Global optimisation/orchestration - a few years ago (wow, 4 years ago!) I wrote something around Visualising MPLS-TE Networks, reflecting on the means by which TE-LSP placement and management could be achieved through off-line tools. MPLS-TE is one of those cases where it is possible to achieve some level of global optimisation of resource utilisation (such that we consider forwarding paths on a global network view, rather than having each individual network element be greedy when selecting paths). Whilst this behaviour is not always of utility, for a subset of services such overall optimisation is of advantage - yet SPs cannot really use this today. My feeling is that, with the work that we’re doing on Segment Routing in the IETF, if we can solve one of the key issues with RSVP-TE (the fact that large amounts of mid-point state are not conducive to simple mid-point devices, and cause scaling issues during large network events), then the idea of having global controllers that are able to select more optimal (non-SPT) forwarding paths, or stitch multiple forwarding paths together, is something that we can exploit. Again, it seems to me that starting this by exploiting some of the path calculation tools that we’ve used before (PCE again!) would give us a way to derive some of the benefits of resource-aware path placement, which may be globally computed where we require it - exploiting a hybrid centralised and distributed control-plane for the network. If we develop this approach, and it is adopted in SP networks, then it seems to me that the next logical step is to consider non-forwarding resource utilisation within the network, to provide more globally efficient utilisation of these functions, and reduce overall unit cost.
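The difference between greedy per-element path selection and a global view can be shown with a toy Python sketch. Everything here is an assumption made for illustration: the three-link topology, the capacities, the LSP names and bandwidths, and the brute-force search. A real PCE or off-line TE tool would not enumerate every combination like this; the exhaustive search just stands in for "some computation with a global view".

```python
from itertools import product

# Hypothetical topology: directed links with capacity in arbitrary bandwidth units.
CAPACITY = {("A", "B"): 10, ("A", "C"): 10, ("C", "B"): 10}

# Hypothetical demands: (name, bandwidth, candidate paths in order of preference).
DEMANDS = [
    ("lsp1", 4, [("A", "B"), ("A", "C", "B")]),
    ("lsp2", 8, [("A", "B"), ("A", "C", "B")]),
    ("lsp3", 4, [("C", "B")]),
]


def hops(path):
    """Links traversed by a path given as a tuple of nodes."""
    return list(zip(path, path[1:]))


def fits(path, bw, residual):
    """True if every link on the path has at least bw units left."""
    return all(residual[link] >= bw for link in hops(path))


def reserve(path, bw, residual):
    """Subtract bw from the residual capacity of every link on the path."""
    for link in hops(path):
        residual[link] -= bw


def greedy_placement(demands, capacity):
    """Each demand in turn takes its most preferred path that still fits."""
    residual, placed = dict(capacity), {}
    for name, bw, candidates in demands:
        for path in candidates:
            if fits(path, bw, residual):
                reserve(path, bw, residual)
                placed[name] = path
                break
    return placed


def global_placement(demands, capacity):
    """Try every combination of candidates; keep the feasible one with fewest hops."""
    best = None
    for choice in product(*(cands for _, _, cands in demands)):
        residual, feasible = dict(capacity), True
        for (_, bw, _), path in zip(demands, choice):
            if not fits(path, bw, residual):
                feasible = False
                break
            reserve(path, bw, residual)
        if feasible:
            cost = sum(len(hops(p)) for p in choice)
            if best is None or cost < best[0]:
                best = (cost, {d[0]: p for d, p in zip(demands, choice)})
    return best[1] if best else None
```

In this toy case `greedy_placement(DEMANDS, CAPACITY)` puts lsp1 on the direct A-B link, which pushes lsp2 onto A-C-B and leaves no room for lsp3. `global_placement(DEMANDS, CAPACITY)` instead moves lsp1 onto the longer A-C-B path so that all three demands fit. No individual head-end would pick that longer path on its own, which is the case for having a globally computed placement available where it is needed.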

Both of these concepts really result in more dynamic networks, which consider overall resource utilisation and efficiency to a greater extent. They’re not new ideas – but if SDN means that they are re-examined such that the way that we instantiate them within the network is thought about again, then perhaps it gives us a good way forward to increase the efficiency of networks and hence realise some economic benefits (the primary motivator for technology change). Better still (for operators), this could be achievable through evolving the current infrastructure and not require wholesale changes in infrastructure, and operational capabilities (albeit, the chosen evolution path may open the door to larger changes in subsequent investment cycles).

I’m sure there are some thoughts that I am being overly pragmatic - and possibly even thoughts that I’m giving "SDN" too much credit. What I’d like to see is ways that we can use new technologies in ways that are realisable, and enhance either the quality of services delivered to users of networks, or simplify or reduce the cost of the infrastructure operated by SPs. To get there, we need evolution, rather than revolution - whilst a coup d’état might be exciting at the time, such revolutions are often bloody, and result in a degradation of experience which I don’t feel operators can afford, or service consumers will tolerate.