Converge Digest

Cloudflare Traces November 18 Outage to Bot Management Feature File Error

Cloudflare experienced a major global outage on November 18 that disrupted core CDN and security functions after a malformed Bot Management feature file propagated across its network. The failure began at 11:20 UTC when a routine database permissions update caused a ClickHouse query to output duplicate metadata rows, doubling the size of a configuration file used across Cloudflare’s Frontline (FL and FL2) proxy engines. The oversized file exceeded a hard-coded feature limit, triggering crashes in traffic-handling processes.
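The crash mechanism described above can be sketched in a few lines: a proxy worker that assumes its feature file stays under a hard-coded limit, and treats an oversized file as a fatal error. This is a hypothetical illustration with invented names, not Cloudflare's actual implementation (its proxies are not written in Python):

```python
# Hypothetical sketch of the crash path: the worker preallocates room for
# a fixed number of bot-detection features, so an oversized feature file
# raises and takes down the traffic-handling process. Illustrative only.

MAX_FEATURES = 200  # hard-coded capacity assumed by the proxy

def load_feature_file(rows):
    """Parse feature metadata rows into the proxy's feature table.
    Exceeding the hard-coded limit raises, crashing the worker."""
    if len(rows) > MAX_FEATURES:
        raise RuntimeError(
            f"feature file has {len(rows)} entries, limit is {MAX_FEATURES}"
        )
    return {row["name"]: row for row in rows}
```

Under normal conditions a file well under the limit loads cleanly; once every metadata row appears twice, the same file trips the limit and the loader aborts instead of degrading gracefully.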

End users saw widespread HTTP 5xx errors as Cloudflare’s core proxy failed intermittently in roughly five-minute cycles, depending on whether a good or bad version of the configuration file had propagated. The oscillation initially led engineers to suspect a hyper-scale DDoS attack, a theory reinforced by an unrelated outage affecting Cloudflare’s external status page. The root cause was ultimately traced to the malformed Bot Management file generated by a partially upgraded ClickHouse cluster.

The impact extended across Cloudflare’s stack. Core CDN traffic returned 5xx errors, Turnstile CAPTCHA failed to load, Workers KV experienced elevated error rates, and Access authentication systems recorded widespread login failures. The Cloudflare dashboard itself remained up but was effectively inaccessible, since Turnstile failures blocked most login attempts. Downstream systems also saw latency spikes as debugging processes consumed CPU while services entered repeated fail states.

In a company blog post, Cloudflare said it halted the propagation of new feature files at 14:24 UTC and pushed a last-known-good version into distribution. By 14:30 UTC, traffic had recovered across most services, with full restoration completed at 17:06 UTC after restarts across affected systems. The company plans to harden configuration ingestion, expand global kill switches, and prevent diagnostic subsystems from overwhelming system resources during failures.
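The recovery steps Cloudflare describes amount to a kill switch in the configuration pipeline: stop accepting newly generated files, then redistribute the last-known-good version. A minimal sketch of that pattern, with a hypothetical interface rather than Cloudflare's actual tooling:

```python
# Hypothetical sketch of a config-propagation kill switch: when engaged,
# newly generated candidate files are ignored and the last-known-good
# version is redistributed instead. Illustrative only.

class FeatureFilePipeline:
    def __init__(self, last_known_good):
        self.last_known_good = last_known_good
        self.halted = False

    def halt(self):
        """Global kill switch: stop accepting newly generated files."""
        self.halted = True

    def select(self, candidate):
        """Return the file to distribute to the edge."""
        if self.halted:
            return self.last_known_good
        self.last_known_good = candidate
        return candidate
```

The design choice worth noting is that the pipeline fails toward a stale-but-valid config rather than toward no config at all, which keeps traffic flowing while the bad generator is repaired.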

• A database permissions change caused a ClickHouse query to return duplicate metadata

• Bot Management feature file doubled in size beyond a 200-feature system limit

• FL and FL2 proxy engines crashed, returning widespread HTTP 5xx errors

• Turnstile, Workers KV, Access, and dashboard login flows were significantly impacted

• Global rollback of the configuration file resolved core issues by 14:30 UTC
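The first two bullets describe the generation side of the failure: a query returning each metadata row twice doubles the emitted file. Deduplicating on a stable key before writing the file would have capped it at the expected size. A hypothetical guard (the real schema and keys are not detailed in this article):

```python
def dedupe_metadata(rows):
    """Drop duplicate metadata rows, keyed here on (table, column name).
    Hypothetical guard for the feature-file generator; the actual key
    fields used by Cloudflare are an assumption."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["table"], row["name"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

With a guard like this in place, a query that unexpectedly returns each row twice still produces a feature file of the normal size, and the downstream limit is never hit.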

“We know we let you down today,” Cloudflare CEO Matthew Prince said. “Given Cloudflare’s importance in the Internet ecosystem, any outage of any of our systems is unacceptable.”

🌐  Analysis

This outage highlights the fragility introduced when frequently propagated machine-learning configuration files interact with strict memory-allocation limits in distributed real-time systems. Cloudflare’s multi-year migration from FL to FL2 shows that both engines remained vulnerable to malformed configs, underscoring the need for network-wide schema validation. Other major providers, including Google and Fastly, have recently emphasized guardrails around configuration rollouts as AI-driven traffic classification grows. Cloudflare’s planned kill switches and validation layers follow industry trends toward safer, staged propagation of dynamic ML signals.
