What 'Network Stability' Implies

August 17, 2019

My thoughts on what to do and not do if you have a goal of operating a stable network.

Human Infrastructure Magazine 71 – What Network Stability Implies by Ethan Banks

Several weeks ago, I tweeted...

As a network engineer, perhaps your most important responsibility is network stability. Much is implied by the word "stability" @ecbanks

What was I getting at when I said that, “Much is implied by the word stability”?

No science experiments

I’ve worked on several networks where it seemed like every feature some previous engineer read about in a certification book was enabled. A production network is not your lab. Don’t turn on (or off) a feature unless you have a specific business reason to do so.

Fix bad designs

Most networks have problems. Those problems will, if ignored, lead to unscheduled downtime. One of your jobs as an engineer is to anticipate the weaknesses in a network and, making the most of your allotted budget and available equipment, engineer those weaknesses out.

Standard, simple, replicable design

Networks that are built the same from pod to pod, closet to closet, and site to site tend to be more stable. Standardized designs have been standardized because they work. They tend to be as simple as possible, but no simpler. Standardized, simple designs lend themselves to being copied. The documenting of standard designs means that other engineers have the opportunity to enforce consistency across the network landscape.

No cowboys

As an engineer working on a production network, you are not a special snowflake. Fall in line. Build the network according to the standard. Stop making it up as you go like a gunslinging anti-hero. Networks don’t stay up if you make it up. Stable networks are the result of thoughtful design pondered over time and carefully considered in the context of the business and the rest of the IT stack.

A minimum of changes

Stop changing things whenever you’re in the mood. Every time you’re about to commit a change, ask yourself if it’s necessary. If yes, is this the appropriate time? Change windows exist for a reason. Do the right thing at the right time to mitigate the risk inherent in change.

Capacity monitoring and planning

A stable network is one that carries all of its traffic with consistent end-to-end latency and very little packet loss. Congested links are a form of network instability, because they result in undelivered traffic. Keeping ahead of bottlenecks is key to long-term network stability.
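
To make "keeping ahead of bottlenecks" concrete, here is a minimal sketch of the arithmetic behind link utilization trending. It assumes you are already polling interface octet counters at a fixed interval (for example via SNMP ifHCInOctets/ifHCOutOctets); the counter values, the 300-second poll interval, and the 80% alert threshold are illustrative assumptions, not figures from this article.

```python
# Minimal sketch: turn two octet-counter samples into link utilization.
# Counter values, poll interval, and the 80% threshold are illustrative.

def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, link_bps: float) -> float:
    """Percent utilization of a link between two counter samples."""
    delta_octets = curr_octets - prev_octets   # assumes no counter wrap
    bits = delta_octets * 8
    return (bits / interval_s) / link_bps * 100

# Example: a 1 Gb/s link polled every 300 seconds.
prev, curr = 50_000_000_000, 59_000_000_000   # ifHCOutOctets samples
util = utilization_pct(prev, curr, interval_s=300, link_bps=1_000_000_000)
print(f"Link utilization: {util:.1f}%")        # -> 24.0%
if util > 80:                                  # illustrative alert threshold
    print("Approaching congestion: plan capacity before it becomes an outage")
```

Plotting those percentages over weeks is what tells you a link is trending toward congestion before users ever feel it.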

Wizard-like knowledge of traffic impact

While companies like Veriflow and Forward Networks get up to speed verifying changes against network models, the best network modeler is often the engineer making the change. You should understand the protocols you’re shooting from your keyboard laser pistol so thoroughly that you can predict what’s going to happen when you start making “pew pew” noises. If every new command results in a surprise, you’re going to cause downtime without meaning to.

Obvious security holes filled

Read a hardening guide and apply the bits you can. Change default credentials. Turn off unused services. Avoid silly things like “public” and “private” SNMP community strings, remnants from a bygone era. If you leave doors open, you’re going to get owned. Once the doors are shut, make sure the windows are latched. Stable networks make an attacker work to compromise the infrastructure.
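
To show what "apply the bits you can" might look like as an automated check, here is a minimal sketch that scans a device configuration dump for a few obviously risky lines, such as default SNMP community strings. The sample config and the pattern list are illustrative assumptions written in an IOS-like syntax; a real audit should follow the vendor's hardening guide for your platform.

```python
# Minimal sketch: flag obvious weak spots in a device config dump.
# The sample config and patterns are illustrative, not an exhaustive audit.

RISKY_PATTERNS = {
    "snmp-server community public": "default 'public' SNMP community",
    "snmp-server community private": "default 'private' SNMP community",
    "ip http server": "unencrypted management web UI enabled",
}

def audit_config(config_text: str) -> list[str]:
    """Return a finding for every config line matching a known-risky pattern."""
    findings = []
    for line in config_text.splitlines():
        for pattern, why in RISKY_PATTERNS.items():
            if line.strip().startswith(pattern):
                findings.append(f"{why}: {line.strip()}")
    return findings

sample_config = """\
hostname core-sw-01
snmp-server community public RO
ip http server
"""
for finding in audit_config(sample_config):
    print("FIX:", finding)
```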

Bug-free code

Don’t run the latest release just because it’s out. New code can, and often will, introduce bugs. If the code you run now does what you need and has no known serious security vulnerabilities, stick with it. When possible, prefer maintenance trains and vendor recommended releases over code still dripping with developer sweat.

If I Had To Pick One Thing...

I’d pick “a minimum of changes” as the single greatest factor in network stability. Leave the network alone if it isn’t broken. When the network design does need to change, plan the change with excruciating caution, and schedule the change for a time that the business agrees to.

Never make changes on the sly, hoping no one notices. That can be hard when you’re an introverted type like I am, especially when you just want to get something done.

However, I’ve learned that no matter how capable you are or how unlikely the change is to cause a disruption, accidents happen. A mid-day outage is an introvert’s worst nightmare, especially if you’re the cause.


Characteristics of a Well Run Network

January 9, 2018

https://thenetworkcollective.com/2017/11/ep15-well-run-network/

Design

  • Keep it simple. Just because you can, doesn’t mean you should
  • Complexity is not just a networking attribute, it’s an overall system attribute
  • Proper design leads to simplicity (most of the time)
  • What technologies are simple? How do you recognize complexity? Is vendor lock-in one indicator? Network management software / API lock-in another growing one?
  • There are some pretty simple campus/user, datacenter, Internet Edge, and WAN approaches
  • Big organizations can handle, and may need, a bit more complexity, or not
  • Modularity: too many moving parts that have to work together = complex

Operations

  • Transparent Network. It just works
  • Able to easily implement changes
  • Meets today’s needs and is flexible enough to absorb tomorrow’s
  • Up-to-date diagrams and documentation matter
    • Organized around OSI layers
    • Documented naming conventions with fixed fields
    • MTTR e.g. from NetMRI and some other tools
  • Up-to-date code
  • Common code across any given router or switch model

Measurement

  • There is no way to tell how the network is performing now compared to its past performance unless data is collected at regular intervals
  • Ideally this is done via more sophisticated tooling that not only gathers the data but also benchmarks it and reports on anomalies (see the sketch after this list)
  • Network awareness among engineers
  • Interpersonal communication and team collaboration
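
As one concrete illustration of the benchmarking and anomaly-reporting point above, here is a minimal sketch that compares the latest latency sample against a baseline built from earlier samples. The sample values and the three-standard-deviation threshold are assumptions for illustration; real tooling such as NetMRI does far more than this.

```python
# Minimal sketch: flag a latency sample that deviates from its baseline.
# Sample data and the 3-standard-deviation threshold are illustrative.
from statistics import mean, stdev

def is_anomalous(history_ms: list[float], latest_ms: float,
                 n_sigma: float = 3.0) -> bool:
    """True if the latest sample sits more than n_sigma from the baseline."""
    baseline = mean(history_ms)
    spread = stdev(history_ms) or 0.001   # guard against a perfectly flat history
    return abs(latest_ms - baseline) > n_sigma * spread

# Hourly round-trip-time samples to a key site (hypothetical numbers).
history = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3, 12.1]
print(is_anomalous(history, 12.5))   # False: within normal variation
print(is_anomalous(history, 19.0))   # True: worth investigating
```

The point is not the statistics; it is that without regularly collected data there is no baseline to compare against in the first place.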

Lab Testing

  • Lab testing used to be sacrificed as a cost-prohibitive CAPEX or OPEX item because of the vast resources needed to provide test points outside, at the edges of, the network
  • All network designers and operators should expect/select OEMs that provide a virtual edition of their hardware
  • These software-only solutions should be used to re-create at least a subset if not all of the infrastructure. Then this virtual environment can be used to rehearse upcoming changes and their possible effects/outcomes.
  • This should also be used to verify that backups are not only occurring but are actually useful. Much like an application or database backup should be regularly tested, so should the network backups (see the sketch after this list).
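
As one simple way to act on the backup-verification point above, here is a minimal sketch that diffs a saved configuration backup against the current running configuration to confirm the backup is neither stale nor truncated. The inline sample configs are placeholders; the stronger test, as described above, is restoring the backup into the virtual lab environment.

```python
# Minimal sketch: sanity-check a config backup against the running config.
# The sample configs below are inline placeholders; in practice you would read
# the saved backup file and pull the running config from the device.
import difflib

def backup_drift(backup_text: str, running_text: str) -> list[str]:
    """Return unified-diff lines between the backup and the running config."""
    return list(difflib.unified_diff(backup_text.splitlines(),
                                     running_text.splitlines(),
                                     fromfile="backup", tofile="running",
                                     lineterm=""))

saved_backup = "hostname core-sw-01\nvlan 10\nvlan 20\n"
running_config = "hostname core-sw-01\nvlan 10\nvlan 20\nvlan 30\n"

drift = backup_drift(saved_backup, running_config)
if drift:
    print("Backup is stale or incomplete:")
    print("\n".join(drift))
else:
    print("Backup matches running config.")
```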

Change Management

  • Wear a black concert T-shirt during all Maintenance Windows
  • Listen to Grunge music as IP truly came of money-making age during that era and the music soothes the network.
    • If in a datacenter, noise-cancelling headphones and a conference call help. Run a separate call/WebEx with 30-minute updates for execs (prevents constant second-guessing and distraction), with a liaison on both the tech and exec calls jotting notes for the periodic exec update
  • Have an Implementation Plan
    • Tabs in an XLS, including contact info, pre-change configs, configlets to blow in, the test plan, and backout configlets. All SSH connections opened prior to the change window.
  • Peer Review and Sign Off of Implementation Plan
  • Take Pre-Change Snapshots/Health Checks
  • Take Post-Change Snapshots/Health Checks and compare them against the pre-change state (see the sketch after this list)
  • Insist that user/app owners perform User Acceptance Testing both prior to and after completion of the change.
  • Use lab network
  • When selecting between OEM solutions, you absolutely must conduct a Proof-of-Concept; reading data sheets and white papers does not a sound decision make.
  • Regularly train the Ops team
  • User Acceptance Testing after changes are made
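
To illustrate the pre-change and post-change health checks called out above, here is a minimal sketch that compares two snapshots of operational state and reports any deviation introduced during the change window. The snapshot keys and values (BGP neighbor count, route count, err-disabled ports) are hypothetical placeholders for whatever state you actually collect from your devices.

```python
# Minimal sketch: compare pre- and post-change health-check snapshots.
# Snapshot contents and key names are hypothetical placeholders.

def diff_snapshots(pre: dict, post: dict) -> list[str]:
    """List every health-check item whose value changed across the window."""
    changes = []
    for key in sorted(set(pre) | set(post)):
        before, after = pre.get(key, "missing"), post.get(key, "missing")
        if before != after:
            changes.append(f"{key}: {before} -> {after}")
    return changes

pre_change = {"bgp_neighbors_up": 8, "ospf_routes": 1450, "err_disabled_ports": 0}
post_change = {"bgp_neighbors_up": 7, "ospf_routes": 1450, "err_disabled_ports": 0}

deltas = diff_snapshots(pre_change, post_change)
if deltas:
    print("Post-change deviations; investigate before closing the window:")
    for d in deltas:
        print(" -", d)
else:
    print("Post-change state matches the pre-change baseline.")
```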