Monday, June 13th, 2022
Can we build a solution that “feels like cloud but runs at the edge?” Or, more specifically, can we bring the cloud native development and operations experience to the edge? Yes, we can – and we’re in the process of building a solution that we call “Great Bear”. But before jumping all the way to the solution, let’s first take a glimpse at the state of the industry right now.
The pendulum swings back
On-prem to cloud. Cloud to Edge. The pendulum swings back: you can’t talk about the edge without the cloud. Historically, we ran almost everything on-prem. By 2010, we started to move our workloads to the cloud. This trend saw an all-time high around 2015. And since then, we have seen more workloads returning to the edge. There are many reasons for that: security, regulatory requirements, and policy as one big bucket. Performance, latency, cost - another big bucket. But what we see in more and more conversations these days is data gravity and the ability to go and process workloads right at the edge using AI methods. AI is becoming the key driver for edge. Two or three years ago, the combination of AI and edge did not garner much interest in customer conversations. These days, about 50% of all edge conversations include AI. We expect it to be 100% in the near-term future. AI and Edge, that’s a match made in heaven. There will be more and more workloads at the edge. The pendulum swings back.
At the same time, we love the cloud. We love it so much that we want to keep the cloud development and operations experience and bring it to the edge: the operational simplicity of the cloud, the ease of development, and the ability to roll out workloads in seconds rather than manually installing and operating them at potentially thousands of locations – something that easily takes months. We also want to leverage everything we’re already familiar with and integrate with common tool chains. Let’s not build a specific “edge curriculum.”
We started to talk about “edge” – but which edge, given that there are so many edges? We can leverage the Linux Foundation Edge taxonomy, with centralized data centers all the way on the right-hand side and very constrained small devices, like sensors, on the very left-hand side. There are two distinct approaches to addressing the edge: Gartner calls those “Cloud-out” and “Edge-in.”
If we start from the right-hand side, “Cloud-out,” we treat the edge as an extension of the cloud: we use a cloud-like homogeneous deployment model, our operating model is DevOps-style, and resources do not constrain us. “Cloud-out” only works up to a certain point. Beyond that point, things are so distributed, location-aware, and heterogeneous that we can't treat the edge as an extension of the cloud anymore.
In addition, the users of these edge systems are domain experts but not IT experts. Knowing how to manage a retail store or a restaurant does not necessarily make you an IT admin. Those deployments consist of a set of small compute nodes that might well sit on some shelf and not in a rack. Those compute nodes often have limited CPU and memory, and their connectivity to the cloud and internet is also limited. We are likely able to reach the cloud from these locations. But is the cloud able to reach these locations? Don’t take this as a given.
We have just identified the tipping point where we can't treat the edge as an extension of the cloud anymore. We need a different architectural and operational principle: “Edge-in.” From our experience, we find this tipping point at what the Linux Foundation Edge taxonomy calls the “smart device edge” and the “on-prem data center edge”: compact, user-facing compute. Where does it apply? Think of your favorite coffee shop and the chain that operates it. Think of a retailer with many shops. Think of a restaurant chain. Think of a car maker that has many factories worldwide.
This is the type of edge where the approach of “treating the edge as an extension of the cloud” does not apply. It is the type of edge that this blog is focused on.
From cloud-native to edge-native
From cloud-native to edge-native, what does that mean? It means rethinking or evolving the 12 Factors we're familiar with. You may remember that in 2011 the Heroku team gave us these 12 Factors. That was more than a decade ago, but the foundational principles for cloud native development are still standing. The 12 Factors are a set of principles that describe a way of making software that, when followed, enables companies to create code that can be released reliably, scaled quickly, and maintained in a consistent and predictable manner. Check out 12factor.net if you feel like refreshing your memory.
Evolving the 12 Factors
The 12 Factors were formulated with the cloud in mind – but they still hold true if we swing to the edge. Key concepts like the separation of code and data, stateless processes, or the use of fully declarative formats remain unchanged. We need to evolve a few factors a little to include the edge specifics we discussed earlier. With location being a key quality of edge, location is a prominent aspect when evolving several of the 12 Factors. Those factors that are location-dependent naturally evolve: we now need to define them across edge and cloud. Take backing services, for example. Services can potentially be remote, i.e., running in various locations in the cloud, but you’re still consuming them at the edge. And the scale model? We not only scale by the process model, which we do within a location, but we also scale by location. Similarly, we need to evolve the robustness model into one where locations are considered disposable entities. Another key attribute of an edge location is its connectivity. Connectivity becomes another defining aspect when evolving the 12 Factors. It is easy to picture that logging at an edge has to take the available bandwidth into account: we don’t want logs to saturate the link that connects the edge with the cloud. We do want to leave some bandwidth for our applications.
Let’s highlight a few of the 12 Factors that need to evolve:
Concurrency / Scale: Scale by the site model and the process model.
The classic 12 Factor app scales out via the process model. For edge, the set of locations, which we refer to as “sites,” becomes another scaling dimension. Going back to our earlier examples, we can have thousands of retail shops or restaurants. Each of those would represent a “site.” A site groups a set of compute nodes. We deploy applications to sites instead of individual compute nodes because a compute node is still considered disposable: it could fail at any point in time. While one would typically associate a “site” with a physical grouping of nodes, one might also use the concept for logical groupings of compute nodes.
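To make the two scaling dimensions concrete, here is a minimal Python sketch. It is an illustration only and not how Great Bear is implemented: the names Node, Site, select_sites, place, and deploy, as well as the label-based selection of sites, are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Site:
    name: str
    labels: dict
    nodes: list = field(default_factory=list)


def select_sites(sites, selector):
    """Return the sites whose labels match every key/value pair in the selector."""
    return [s for s in sites if all(s.labels.get(k) == v for k, v in selector.items())]


def place(site, replicas):
    """Spread replicas round-robin over the healthy nodes of one site."""
    healthy = [n.name for n in site.nodes if n.healthy]
    if not healthy:
        return {}
    placement = {name: 0 for name in healthy}
    for i in range(replicas):
        placement[healthy[i % len(healthy)]] += 1
    return placement


def deploy(sites, selector, replicas):
    """Scale by site (one placement per matching site) and by process (replicas per site)."""
    return {s.name: place(s, replicas) for s in select_sites(sites, selector)}


# Hypothetical fleet: two coffee shops and one factory.
sites = [
    Site("store-001", {"chain": "coffee"}, [Node("a"), Node("b", healthy=False)]),
    Site("store-002", {"chain": "coffee"}, [Node("a"), Node("b")]),
    Site("plant-001", {"chain": "cars"}, [Node("a")]),
]

# The app targets sites, never individual nodes.
plan = deploy(sites, {"chain": "coffee"}, replicas=3)
```

The deployment request never names a node. In store-001 the failed node "b" is simply skipped and all three replicas land on node "a", which is what treating nodes as disposable means in practice.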
Disposability: Robustness by site: Fast startup, graceful shutdown, declarative desired state.
Much like we scale by site, we also apply robustness principles to sites. Sites are cattle, as are the nodes that they contain. They’re disposable, can fail at any point in time, and can be started and stopped. Given that we can’t assume an IT expert at a site, we will never debug a site or an individual node within a site. In case of issues, we would ask someone to power-cycle or factory-reset one or all of the nodes of a site. Anyone can perform that operation. We assume a fully declarative model for sites. Given that the state of a site and of the nodes that make up the site is 100% declarative, a site or a node boots into its declared state after a power cycle. This makes for a straightforward operational principle of an edge – and does not require IT-savvy personnel at the edge.
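The "boot into the declared state" idea can be sketched as a small reconciliation function. This is a simplified illustration under stated assumptions: the state is modeled as a plain mapping from app name to version, and reconcile, apply, and the app names are hypothetical.

```python
def reconcile(desired, actual):
    """Return the actions that move `actual` to `desired`.

    Both arguments map an app name to its version string.
    """
    actions = []
    for app, version in sorted(desired.items()):
        if app not in actual:
            actions.append(("start", app, version))
        elif actual[app] != version:
            actions.append(("replace", app, version))
    for app in sorted(actual):
        if app not in desired:
            actions.append(("stop", app, actual[app]))
    return actions


def apply(actions, actual):
    """Apply actions to a copy of `actual` and return the new running state."""
    state = dict(actual)
    for verb, app, version in actions:
        if verb == "stop":
            state.pop(app)
        else:
            state[app] = version
    return state


desired = {"checkout-vision": "1.4", "signage": "2.0"}

# After a power cycle or factory reset nothing is running, so the node
# simply starts everything that is declared.
after_reset = apply(reconcile(desired, {}), {})

# A node that has drifted converges to the same declared state.
drifted = {"checkout-vision": "1.3", "old-app": "0.9"}
converged = apply(reconcile(desired, drifted), drifted)
```

Because the outcome depends only on the declared state, a factory reset and a drifted node end up in the same place, and nobody at the site has to understand what went wrong.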
Backing services: Treat any backing service as an attached, potentially remote resource
Backing services are services an app consumes over the network as part of its normal operation - no real change from the classic factor here. That said, we need to consider that the services an app depends on could be far away, i.e., they could be in the cloud. A site can fetch data and code from multiple locations, but we should not assume that resources in other locations, or other sites, can push information to a site. Sites can reach the cloud, but we cannot assume that the cloud can reach a site. This is because edge sites are commonly connected through one or several gateways that perform N(P)AT operations. Rather than rely on complex NAT traversal techniques, which may or may not work in a specific setup, it is much easier to assume that any communication must be initiated from a site: pull from the cloud, push to the cloud. That’s it. This also rules out the use of any tunnel technology. Not using tunnels means we don’t have to worry about maintaining the tunnels.
In addition, sites might occasionally be offline and not have connectivity at all – think of a connected car or a mom-and-pop shop around the corner of your home that relies on a flaky DSL connection to connect to the internet.
Consequently, let us think of a site as an autonomous entity with explicitly declared external dependencies. Sites can reach the cloud to fetch declared state and instantiate this state at their leisure. And sites push state, like metrics, back to the cloud.
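The "pull from the cloud, push to the cloud" rule, combined with occasional loss of connectivity, can be sketched as a site-side agent. Everything here is an assumption made for illustration: FakeCloud stands in for a real control plane, and the method names get_desired_state and post_metrics are hypothetical, not an actual Great Bear API.

```python
from collections import deque


class Offline(Exception):
    """Raised by the hypothetical cloud client when the uplink is down."""


class FakeCloud:
    """Stand-in for a cloud control plane; it never calls into the site."""

    def __init__(self):
        self.online = True
        self.desired = {"version": 1, "apps": {"signage": "2.0"}}
        self.received = []

    def get_desired_state(self, site_id):
        if not self.online:
            raise Offline()
        return self.desired

    def post_metrics(self, site_id, batch):
        if not self.online:
            raise Offline()
        self.received.extend(batch)


class SiteAgent:
    def __init__(self, site_id, cloud, max_buffered=1000):
        self.site_id = site_id
        self.cloud = cloud
        self.state = None  # last desired state fetched from the cloud
        self.buffer = deque(maxlen=max_buffered)  # oldest samples drop first

    def record(self, metric):
        self.buffer.append(metric)

    def tick(self):
        """One site-initiated cycle: pull desired state, then push metrics."""
        try:
            self.state = self.cloud.get_desired_state(self.site_id)
            self.cloud.post_metrics(self.site_id, list(self.buffer))
            self.buffer.clear()
            return True
        except Offline:
            # Keep running on the last known state and retry on the next tick.
            return False


cloud = FakeCloud()
agent = SiteAgent("store-001", cloud)
agent.tick()  # online: fetches version 1

cloud.online = False
cloud.desired = {"version": 2, "apps": {"signage": "2.1"}}
agent.record({"cpu": 0.4})
went_through = agent.tick()  # offline: keeps version 1, metric stays buffered
version_while_offline = agent.state["version"]

cloud.online = True
agent.tick()  # back online: fetches version 2 and flushes the buffer
```

Both directions of traffic are started by the site, so the sketch needs no inbound connection and no tunnel. The bounded buffer is one possible design choice: it caps memory on a constrained node at the cost of dropping the oldest samples during a long outage.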
Logs: Treat logs and metrics as on-demand streams, and use metrics whenever feasible.
The original factor that covers the retrieval of system state information highlighted logs – and the fact that we handle logs as event streams. Put differently, we push operational information from the edge rather than poll the edge for information. The logic of “push to the cloud rather than pull from the edge” applies well to edge deployments. What has changed is the focus on logs. Since 2011, we’ve shifted to using metrics even for cloud applications. We only leverage logs if there are no meaningful metrics or we need to debug a system (in which case, we use logs on demand). Of course, we must ensure that log retrieval is feasible from a bandwidth and system load perspective. Metrics are typically more compact than logs and easier to consume by IT systems, thus the preference for using metrics wherever possible at an edge location.
I remember when a colleague, who was debugging a compute device at the smart device edge, said to me, "I had an issue. I turned on logging at the remote site – and lost contact with the remote site because the logs saturated the access link." Even with metrics, we must be smart about which metrics we stream. Systems can offer a lot of metrics. Consider an example: A modern Cisco router easily offers more than a million different counters. A while ago, we counted the number of different types of numeric counters that a router running IOS XR 7.1.1 offers. We counted 126,407 different types of counters. Depending on the environment and the systems in use, you might not want to stream all available metrics. You might need to filter them right at the edge and only stream meaningful or actionable metrics to the user or management systems. Edges should send “information” and not just “data.” Cisco’s solution to intelligently filter metrics in a router is called “AI Driven Telemetry.” It is available with IOS XR 7.3.1. If you would like to dive a little deeper into the topic of filtering metrics at the edge and automatically extracting those metrics most interesting for a particular situation or event, have a look at “Detecting State Transitions in Network elements” and “Semantic feature selection for network telemetry event description.”
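As a very rough illustration of filtering at the edge, the sketch below keeps only the metrics that changed noticeably since the last report and caps how many are sent. It is a deliberately naive heuristic and not Cisco’s AI Driven Telemetry; the function name select_metrics, the thresholds, and the metric names are assumptions made for this example.

```python
def select_metrics(previous, current, max_metrics, min_rel_change=0.1):
    """Pick the metrics that changed the most since the last report.

    previous/current map a metric name to a numeric value. A metric is a
    candidate if it is new or if its relative change is at least
    min_rel_change. At most max_metrics candidates are returned, largest
    change first.
    """
    scored = []
    for name, value in current.items():
        if name not in previous:
            change = float("inf")  # always report a metric seen for the first time
        else:
            old = previous[name]
            change = abs(value - old) / max(abs(old), 1e-9)
        if change >= min_rel_change:
            scored.append((change, name))
    # Sort by descending change, then by name so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return {name: current[name] for _, name in scored[:max_metrics]}


previous = {"if0.rx_errors": 10, "if0.rx_bytes": 1000, "cpu.load": 0.50, "fan.rpm": 3000}
current = {
    "if0.rx_errors": 40,
    "if0.rx_bytes": 1050,
    "cpu.load": 0.80,
    "fan.rpm": 3010,
    "if1.rx_errors": 5,
}

# Only two metrics fit into this (hypothetical) reporting budget.
to_send = select_metrics(previous, current, max_metrics=2)
```

With these inputs, the newly seen if1.rx_errors and the quadrupled if0.rx_errors are sent, while the slowly moving byte and fan counters stay at the edge. A production system would need a smarter notion of "interesting" than relative change, which is what the referenced papers explore.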
Dependencies and Policies: Explicitly declare and isolate dependencies and policies – edge-local and cloud.
Similar to the “logs” factor, where we expanded the scope to include metrics, we also need to evolve the “dependency” factor. It is not enough to explicitly declare and isolate dependencies on libraries and system-wide packages; we need to do the same with policy and regulatory dependencies. Policies – which might come in different shapes, like location-dependent regulatory policies or security policies – are among the key reasons workloads are deployed at the edge. An app never relies on the implicit existence of system-wide packages or services. It also never relies on implicit policies. Today, regulatory or policy constraints are often an afterthought when apps are developed, making policy enforcement a tedious effort. Considering the need to comply with different policy types at development time can ease the job of enforcing policy. If we properly instrument our code at build time with the appropriate hooks, domain-specific policy control can be implemented more easily. Instrument your code at build time to allow for granular location-dependent enforcement at run time.
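One way to picture explicit policy dependencies is to declare them next to the software dependencies and to check both before an app is placed at a site. The sketch below is an assumption-laden illustration: AppManifest, SiteProfile, PolicyHooks, the policy name "video-stays-on-site", and the package name "inference-runtime" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppManifest:
    name: str
    packages: frozenset  # explicit software dependencies
    policies: frozenset  # explicit policy dependencies


@dataclass(frozen=True)
class SiteProfile:
    name: str
    region: str
    packages: frozenset  # packages the site can provide
    policies: frozenset  # policies the site is able to enforce


def unmet_dependencies(app, site):
    """Return the declared dependencies that the site cannot satisfy."""
    return {
        "packages": sorted(app.packages - site.packages),
        "policies": sorted(app.policies - site.policies),
    }


def can_deploy(app, site):
    unmet = unmet_dependencies(app, site)
    return not unmet["packages"] and not unmet["policies"]


class PolicyHooks:
    """Hook points the app calls at run time; each site plugs in its own rules."""

    def __init__(self):
        self._checks = {}

    def register(self, policy, check):
        self._checks[policy] = check

    def allowed(self, policy, **context):
        # A policy without a registered check is denied (fail closed).
        check = self._checks.get(policy)
        return bool(check and check(**context))


app = AppManifest(
    "loss-prevention",
    frozenset({"inference-runtime"}),
    frozenset({"video-stays-on-site"}),
)
berlin = SiteProfile(
    "store-berlin", "eu",
    frozenset({"inference-runtime"}),
    frozenset({"video-stays-on-site"}),
)
austin = SiteProfile("store-austin", "us", frozenset({"inference-runtime"}), frozenset())

hooks = PolicyHooks()
hooks.register("video-stays-on-site", lambda destination: destination == "local")
```

Here can_deploy(app, berlin) is true, while the deployment to the Austin site is rejected because that site does not declare the policy the app depends on. At run time, the app would call hooks.allowed("video-stays-on-site", destination="cloud") before exporting video, and the site-local rule decides the answer.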
Applying the 12 Factors
When people hear “cloud-native,” many think “Kubernetes.” Kubernetes and the associated ecosystems are the development and deployment experience we’d like to preserve. How about we apply our evolved 12 Factors to a scenario that leverages Kubernetes? Kubernetes was built for the cloud. Why would we consider it for edge? There are already quite a few edge solutions based on Kubernetes out there. Why? The common reason is “we’re very familiar with Kubernetes and the associated tools.” Remember our objective, “Feels like cloud, runs at the edge.”
Allow me to start with an oversimplified view of Kubernetes. At a very high level, a cluster consists of a set of compute nodes (leaders and workers), an API server to input intent, and a database (etcd) to store intent and state. Workers reach out to the API server, which fronts the database, to retrieve the intent and instantiate it. A cluster is something you manage yourself, whereas workers can fail at any point in time; Kubernetes handles such failures by rescheduling the affected workloads. Regarding the frequently used pets and cattle analogy, clusters are pets, and workers are cattle.
Let’s apply our evolved 12 Factors to this setup while keeping the main design and operational principles intact. Have a look at the diagram on the right in the picture above. We scale by location and process, and we factor in disposability, backing services, logs, and dependencies: A site comprises one or several clusters. Sites are managed by a control plane, which offers an API to insert intent for sites and stores that intent in a database. We suddenly have the sites as cattle, and as such, the clusters that make up a site are cattle. No more clusters as “pets.” We can keep the operational principle of an API that is used to feed intent across the entire system, with that intent stored in a database. The sites reach out to the cloud-hosted control plane to retrieve their desired state. They fetch the intent and realize that intent locally. In addition, sites push metrics to the cloud-hosted control plane. It is a cluster-of-clusters scheme. This approach allows us to horizontally scale across the globe from a location perspective. We’re doing process-level scaling within all these edge clusters. And beyond clusters, we’re scaling clusters horizontally with the very same principle that we know from Kubernetes. All of this is theory, of course.
The reality is that we’ve built a solution that builds upon the theory we discussed. The project is called Great Bear. It allows you to develop and deploy apps at the edge as a service. Or put differently, Great Bear brings the cloud experience to the edge – for those who develop apps, as well as for those who deploy and manage applications across many edge locations. If you are interested in learning more about Great Bear and being invited to participate in our early access program, check out eti.cisco.com/great-bear.
Enabling a transition
In summary, edge-native means that we enable a cloud-like development and operations experience at the edge. At the same time, we enable a transition. If you look at the smart device edge today, most of the solutions, like smart mirrors, loss prevention systems, digital signage, and predictive maintenance, are turnkey solutions focused on point problems: specific hardware, software, operating system, and management solution – all tightly integrated to solve one and only one problem. Solutions are siloed. If you deploy solutions at the smart device edge, you might be okay with one or two silos. But if you need multiple solutions, you are unlikely to enjoy the fact that you need to deal with a slew of different management systems, spare parts, support models, etc.
Customers often start with solving a single problem. That problem could, for example, be loss prevention at an automated checkout, a COVID safety problem, or automating a particular process in the factory. Every single time, after the first use case is deployed, there are five more. You start with solving the automatic checkout problem, and soon after that, you explore behavioral analytics, inventory control of your shelves, etc. The deployment follows a classic “land and expand” pattern, which requires a holistic approach rather than a lined-up sequence of silos. A holistic approach means moving from the silos to a platform model.
That is why we are building Great Bear as an edge-native platform that offers a simplified operational layer to scale out to thousands of locations, along with a set of services that take care of the edge specifics. These services include tools and libraries for AI/inference at the edge, edge data management, data IO and rendering, etc. Solution providers can leverage these services. They no longer need to deal with all the edge complexities. The edge-specific qualities are hidden and dealt with.
All a developer or operator sees is a cloud-like development and operations environment – just that it runs at the edge. Edge Native.
If you want to give Edge Native a test drive, check out Great Bear at eti.cisco.com/great-bear.