When you make a network change, where do you get your information from?
Often, network engineers rely solely on memory and previous hands-on understanding of the network when making changes. Here’s a scenario I heard from Michael Wynston, Director of Network Automation & Architecture at Fiserv, on our recent Packet Pushers episode:
“When an engineer needs to do something, add a new VLAN to a switch, you SSH to the 10 switches that you know of. And you find the VLAN number that’s not being used. Excellent.
“Well, there are 12 switches. And that’s just in that one pod. And, oh, by the way, the VLAN you picked happens to be a transit VLAN that you didn’t see on those 10 switches, because it’s only on these two.
“And now that you’ve added it to those other 10, those other 10 become transit, you created a spanning tree loop, and happy day, the wires are warm.
“That’s the kind of mistake that engineers make. And it’s called an incident caused by change. And there’s no way, with over 20,000 devices, any engineer can say, without knowing, that they actually know anything.”
How to Avoid Incidents Caused by Change
As networks scale up, these kinds of incidents can become more common unless teams adopt new strategies and tools. The example above came up as part of a conversation around automation and how engineers source the information they need. In Fiserv’s case, with over 20,000 devices in their network, no engineer can hold in their head all the information required to avoid incidents. The same is true for the majority of organizations, especially as networks continue to expand.
How do you get around this?
It’s all about creating processes you can trust, where systems of record are authoritative and are automatically updated whenever a change is made, so they stay trustworthy. Engineers need access to information they know is accurate. That doesn’t mean a single source of truth. As we’ve covered previously, attempting to build a single source of truth is not only difficult and time-consuming, but also creates more opportunities for data inaccuracy because you’re duplicating data from one source to another.
Instead, the focus must be on ensuring different systems of record are always accurate, always up-to-date, and always easily accessible to any network engineer, automated process, or network or IT system that must source authoritative data to execute a change.
Introducing Automation Means Standardizing How Processes Interact With Systems of Record
The push to introduce more automation in networking is driving teams to rethink how they interact with systems of record to make network changes. Automated network changes and self-serve delivery aren’t really automated or self-serve if a network engineer still needs to manually source information. The challenge is that information gathering for manual changes is often ad hoc and can vary from one network engineer to the next.
Inconsistent information gathering already creates incidents when teams do things manually. But it’s a problem that has been, in some ways, tolerated for a while now. However, as we move further toward an automation-driven model for networking, we need to ensure we’re addressing how processes can source authoritative data about all network devices, domains, cloud environments, etc.
Coming back to the example above, a trustworthy system of record for current VLAN allocation could have prevented a nasty spanning tree outage. If the network engineer had been able to reference a system of record they knew contained accurate information, they could have taken either of two routes: query it for the next unused VLAN, or query it for the VLAN they thought was unused and find that it was already allocated as a transit VLAN. Either way, they avoid the incident caused by human error.
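To make those two queries concrete, here is a minimal sketch in Python. It assumes a hypothetical system of record modeled as a simple mapping from VLAN ID to purpose; the names `VLAN_RECORD`, `lookup_vlan`, and `next_unused_vlan` are illustrative only and do not come from any real product or API.

```python
from typing import Dict, Optional

# Hypothetical system of record covering every switch in the pod.
# In practice this would sit behind an API; a dict stands in for it here.
VLAN_RECORD: Dict[int, str] = {
    10: "users",
    20: "voice",
    30: "transit",  # configured on only two switches, so easy to miss by hand
}


def lookup_vlan(record: Dict[int, str], vlan_id: int) -> Optional[str]:
    """Return the recorded purpose of a VLAN, or None if it is unallocated."""
    return record.get(vlan_id)


def next_unused_vlan(record: Dict[int, str], start: int = 2, end: int = 4094) -> int:
    """Return the lowest VLAN ID in [start, end] that has no allocation."""
    for vlan_id in range(start, end + 1):
        if vlan_id not in record:
            return vlan_id
    raise LookupError(f"no unused VLAN between {start} and {end}")


# Query 1: check the VLAN the engineer believed was free.
purpose = lookup_vlan(VLAN_RECORD, 30)
if purpose is not None:
    print(f"VLAN 30 is already allocated as: {purpose}")

# Query 2: ask the record for the next free VLAN instead of guessing.
print("Next unused VLAN:", next_unused_vlan(VLAN_RECORD, start=10))
```

Either query surfaces the transit VLAN before any switch is touched, which is the point of trusting the record over a partial view of the devices.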
And even when authoritative systems of record do exist, network engineers are often working under intense pressure and time constraints, which brings me back to the automation point. In a team that’s trying to keep up with manual activity, an engineer might decide, “I’ll just verify what I can use from the router I’m already changing, because I’m already accessing it.” That approach can lead to outages too. But if processes are automated, and every process has access to authoritative systems of record, queries happen quickly as part of an automated workflow.
This is why standardizing how processes interact with systems of record is so critical to success as networking continues to evolve.
Your Orchestration Platform Must Integrate With Every System of Record
All of this is to say, if you’re investing in automation and orchestration with the goal of orchestrating workflows that can drive network changes end-to-end, you need the orchestration solution to be able to interact with every system of record in your environment.
Itential offers a few key advantages over other platforms on the market that make our platform the ideal solution here. Itential’s integration model is based on consuming APIs, and the platform can auto-generate integrations with other systems based on API documents. It’s a vendor-agnostic approach that allows teams to incorporate any system that exposes an API into an Itential orchestration workflow. This allows network engineers to build workflows that always pull information from systems of record and then update those systems when changes are made, ensuring no “rogue” changes happen and keeping the systems of record accurate and up to date.
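The pull-then-update pattern can be sketched generically in Python. This is not Itential’s API or workflow format; `VlanRecord`, `add_vlan_workflow`, and the `push_config` callback are hypothetical names, and an in-memory object stands in for a system of record that would normally be reached over an API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class VlanRecord:
    """Hypothetical in-memory stand-in for a VLAN system of record."""
    allocations: Dict[int, str] = field(default_factory=dict)

    def reserve_next(self, purpose: str, start: int = 2, end: int = 4094) -> int:
        """Allocate the lowest free VLAN ID in [start, end] and record its purpose."""
        for vlan_id in range(start, end + 1):
            if vlan_id not in self.allocations:
                self.allocations[vlan_id] = purpose
                return vlan_id
        raise LookupError(f"no unused VLAN between {start} and {end}")

    def release(self, vlan_id: int) -> None:
        """Remove an allocation, for example after a failed change."""
        self.allocations.pop(vlan_id, None)


def add_vlan_workflow(
    record: VlanRecord,
    switches: List[str],
    purpose: str,
    push_config: Callable[[str, int], None],
    start: int = 2,
) -> int:
    """Pull the VLAN from the record, apply the change, keep the record in sync."""
    vlan_id = record.reserve_next(purpose, start=start)  # record is updated first
    try:
        for switch in switches:
            push_config(switch, vlan_id)  # assumed device-change callback
    except Exception:
        # Keep the record accurate if the change fails. A real workflow would
        # also roll back any switches that were already configured.
        record.release(vlan_id)
        raise
    return vlan_id


record = VlanRecord({10: "users", 11: "transit"})
applied = []
vlan = add_vlan_workflow(
    record, ["sw1", "sw2"], "app", lambda sw, v: applied.append((sw, v)), start=10
)
print(vlan, record.allocations, applied)
```

Because the workflow is the only code path that reserves a VLAN and pushes it to devices, the record and the network cannot drift apart through a forgotten manual update.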
In addition, Itential allows for multi-domain orchestration across all of your network infrastructure. Therefore, the central platform that interacts with every system of record is also the only platform network and infrastructure teams need to use to orchestrate change processes.
The third major point I want to highlight is Itential’s ability to expose its API for consumption by other platforms and users, so end users, pipelines, and platforms can self-serve every orchestrated network change.
This is crucial. In a manual, ticket-driven world, network changes aren’t standard. Sometimes a network engineer will query the right system of record; other times they won’t. Someone might forget to update a system of record when a change is made. Someone might skip steps when making a change if it’s late in the day and it’s urgent. All of this adds risk.
If instead the only way a network change can be made is by requesting the same orchestrated service, wherever that service is exposed, then it’s not possible for “rogue” changes to be made, it’s not possible for human error to create issues with systems of record, and it’s not possible for anyone to skip steps.
Let’s revisit the story from the beginning. In Fiserv’s case, they were able to use Itential to reduce the number of different pathways engineers take to make changes while simultaneously delivering network services across the entirety of their organization in a standardized manner. This has reduced the risk of human error and lowered the number of incidents they face, even as their network continues to expand and they evolve toward a fully self-serve model. You can listen to this Packet Pushers episode to hear the full story.