By Anil Kollipara, vice president of Product Management, Spirent Communications
How many large enterprises have suffered a significant outage due to a bad software update or security issue? You might be better off asking how many haven't. In the past year alone, outages in major financial networks have shut down trading platforms, knocked out payment services, and even locked customers out of their bank accounts. In the most severe cases, like the 2024 CrowdStrike outage, thousands of businesses were affected and direct financial losses ran into the billions of dollars.
The truth is, as the networks that the global economy depends on grow more dynamic and complex, software conflicts and security vulnerabilities become more likely. Indeed, modern networks have so many complicated software interdependencies that even seemingly minor problems can cause outages that impact millions of users. These issues have become so serious that they've spurred new regulations like the E.U.'s Digital Operational Resilience Act (DORA), which mandates rigorous resilience standards and proactive testing for key industries like banking and insurance.
Fortunately, modern testing methodologies can help organizations fix most potential issues before they impact customers. By automating security and resilience testing, you can get ahead of the risk, meet evolving compliance requirements, and minimize disruptions for your customers and your business.
Evolving Threats Demand New Strategies
Growing regulatory focus on the resilience of digital infrastructure shouldn't come as a surprise; it was likely inevitable given long-running trends affecting modern businesses. Start with the fact that as organizations rely on digital tools for more of their day-to-day operations, they become far more vulnerable to outages. Meanwhile, as networks evolve from physical to virtual to cloud-native, the infrastructure enabling those tools keeps growing more complex.
Modern networks encompass multiple vendors, APIs, and third-party components, with constant updates released for every part of the stack. Different stacks are also increasingly nested, creating scenarios where problems in one vendor's software can quickly spread. (The faulty CrowdStrike update, for example, didn't just disrupt CrowdStrike software; it brought down the Windows OS on 8.5 million devices.) From a cybersecurity perspective, this complexity creates potential security gaps that adversaries could exploit, necessitating even more ongoing updates and patches. Each update represents a change in the network, and a new opportunity for something to break.
Now, governments and regulators are taking action to create a more stable and resilient foundation for critical digital services. DORA, which went into effect this January, requires financial institutions, cloud providers, and others to perform ongoing operational resilience testing—including proactively testing security mechanisms—or face fines up to 2 percent of total revenues.
In this environment, organizations can't assume that any change is safe to promote into the network, even when it comes from a trusted vendor. Businesses already carry ultimate responsibility for security breaches that affect their customers; they must now treat operational resilience the same way. It doesn't matter which vendor in the stack issued a faulty update. If you're the party providing digital services to customers, it's now your responsibility to keep that infrastructure online.
Rethinking Testing
For enterprises moving to address these requirements, the only option is to thoroughly test and validate everything. Yet making this shift is not always easy, especially for organizations that still use manual testing processes designed for yesterday's vertically integrated infrastructures. Those legacy approaches can't keep up with modern software-driven, cloud-native environments, where millions of test cases may be needed to fully cover the network. Too often, current testing approaches are also:
- Arbitrary, with human beings deciding when and what to test, potentially missing important issues
- Siloed, with different teams (Security, Engineering, Operations) focusing only on their specific part of the stack
- Incomplete, with validation often limited to basic functional testing (“Is this new node live?”), without investigating impact under peak loads, non-optimal conditions, and “rainy day” scenarios
In some cases, the more automated an organization’s DevOps processes are—the more advanced their continuous integration/continuous delivery (CI/CD) implementation—the more challenging these issues become. If organizations aren’t careful, they can over-optimize for speed of deployment at the expense of thorough testing and service resilience.
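One way to rebalance speed and safety is to make resilience testing a blocking stage in the pipeline itself. What follows is a minimal sketch of such a gate, assuming a pytest-based suite; the directory names are hypothetical placeholders for your own test layout, and a real pipeline would wire this into its own stage definitions:

```python
# Minimal sketch of a CI/CD "resilience gate": a build is promoted only if
# every resilience suite passes, not just the functional unit tests.
# The suite paths below are hypothetical stand-ins for your own layout.
import subprocess
import sys

RESILIENCE_SUITES = [
    "tests/functional",   # basic "is this node live?" checks
    "tests/load",         # behavior under peak traffic
    "tests/failure",      # rainy-day and failover scenarios
    "tests/security",     # validation of security mechanisms
]

def run_suite(path: str) -> bool:
    """Run one suite via pytest; True only on a clean pass."""
    result = subprocess.run([sys.executable, "-m", "pytest", path])
    return result.returncode == 0

def main() -> int:
    failed = [s for s in RESILIENCE_SUITES if not run_suite(s)]
    if failed:
        print(f"Promotion blocked; failing suites: {failed}")
        return 1   # a nonzero exit fails the pipeline stage
    print("All resilience suites passed; build may be promoted.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The point is structural: deployment speed is preserved because the gate itself is automated, but a change can no longer skip resilience validation on its way to production.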
Proactive Assurance Drives Efficiency and Compliance
The only way to comply with new operational resilience mandates—and avoid the risks and costs of failures—is to perform more exhaustive, proactive testing. This testing should be:
- Comprehensive, encompassing all network elements, software upgrades, and the attack surface
- In-depth, extending beyond basic functional testing to investigate how changes and security threats could impact availability and quality
- Proactive, using synthetic traffic and emulation of real-world scenarios to test under peak load and other stress conditions, and continually validating security and resilience mechanisms
- Flexible, using virtualized, federated lab and testing equipment that can be accessed from anywhere, improving capital efficiency as testing expands
- Automated, as legacy manual approaches simply can't execute the millions of test cases needed to protect modern business infrastructure and customers; the sketch after this list shows how quickly the case count explodes
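To see why automation is non-negotiable, consider a back-of-the-envelope view of the test matrix. The element counts, change types, and scenarios below are illustrative assumptions, not measurements from any real network:

```python
# The test matrix is the cross-product of network elements, change types,
# load profiles, and failure scenarios. All counts here are illustrative
# assumptions, not measurements.
network_elements  = 2000   # routers, switches, servers, security appliances
change_types      = 40     # patches, upgrades, config changes, ...
load_profiles     = ["idle", "typical", "peak", "overload"]
failure_scenarios = ["none", "link-down", "node-crash",
                     "latency-spike", "cert-expiry", "ddos"]

total = (network_elements * change_types
         * len(load_profiles) * len(failure_scenarios))
print(f"Distinct test cases to cover: {total:,}")        # 1,920,000

# At even 5 minutes of manual effort per case, a single release cycle
# would demand roughly 160,000 person-hours of testing.
print(f"Manual effort at 5 min/case: {total * 5 / 60:,.0f} hours")
```

Even with generous pruning, numbers like these put manual execution out of reach; the matrix has to be generated, scheduled, and executed by machines.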
The following figure illustrates what a more modern and effective approach to continuous testing, automation, and operational resilience looks like.
It starts with an Infrastructure Access abstraction layer, which allows continuous testing tools to reach every server, router, switch, and security appliance in the environment. Next, Operational Resilience Testing Methodologies provide diverse test case libraries to measure resilience, including under peak loads and failure conditions. With Lab Automation, virtualized lab equipment functions as a flexible, automated resource pool: if a new security patch comes in, for example, you can spin up a testbed for it, with the right topologies and configurations for all test cases, in minutes. Finally, with Test Automation, testbeds can execute testing across all categories of resilience and automatically route artifacts to the proper team if any test fails.
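To make the division of labor between these layers concrete, here is a hedged sketch in code. Every class and function name below (TestbedPool, Testbed, route_artifacts, and so on) is an illustrative stand-in, not a real product API:

```python
# Sketch of the four layers described above, under assumed interfaces.
from dataclasses import dataclass, field

@dataclass
class TestResult:
    case: str
    passed: bool
    artifacts: dict = field(default_factory=dict)  # logs, captures, configs

class Testbed:
    """Infrastructure Access layer: one handle to every device under test."""
    def __init__(self, topology: str):
        self.topology = topology

    def run(self, case: str) -> TestResult:
        # Real execution would drive synthetic traffic and inject faults;
        # here every case simply passes so the sketch runs end to end.
        return TestResult(case=case, passed=True)

class TestbedPool:
    """Lab Automation layer: virtual lab gear as an on-demand resource pool."""
    def spin_up(self, topology: str) -> Testbed:
        print(f"Provisioning testbed with topology '{topology}'...")
        return Testbed(topology)

def route_artifacts(result: TestResult, owners: dict) -> None:
    """Test Automation layer: send failure artifacts to the owning team."""
    team = owners.get(result.case.split(":")[0], "operations")
    print(f"FAIL {result.case}: artifacts routed to {team}")

# Resilience Testing Methodologies layer: a case library keyed to the change.
SECURITY_PATCH_CASES = ["security:patch-regression", "load:peak-traffic",
                        "failure:node-crash-during-update"]
OWNERS = {"security": "security-team", "load": "perf-team",
          "failure": "sre-team"}

pool = TestbedPool()
testbed = pool.spin_up(topology="core-edge-with-firewalls")
for case in SECURITY_PATCH_CASES:
    result = testbed.run(case)
    if not result.passed:
        route_artifacts(result, OWNERS)
```

In a real deployment, the pool would provision virtual lab equipment on demand, the cases would come from the methodology libraries, and a failed result's artifacts would be attached automatically to a ticket for the owning team.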
The result is a more comprehensive and automated testing environment. Now, you can avoid most customer-impacting issues because you’re exhaustively testing every change and proactively validating security defenses. You can innovate more quickly, with greater peace of mind, because you’ve automated end-to-end testing and verification within your CI/CD/CT framework. You’ve shortened the time spent testing new patches and software releases from months to days, achieving significant operational efficiencies. You’ve simplified compliance, with documentation at every stage of testing. Best of all, you can consistently deliver more reliable, higher-quality services to customers.