Welcome to The Long View—where we peruse the news of the week and strip it to the essentials. Let’s work out what really matters.
This week: Cloudflare suffers another huge outage while the FAA and FCC still disagree over 5G/NR near airports.
1. Another Week—Another Cloudflare Outage
First up this week: It was “only” an hour, but when so many services rely on Cloudflare’s infrastructure, you know there’s going to be wailing and gnashing of teeth.
Analysis: If it’s not DNS, it’s always BGP
The postmortem is in. And it makes for embarrassing reading from a service that aims to improve its customers’ availability. What can your Ops team learn?
Manish Singh: Cloudflare fixes outage that knocked popular services offline
Cloudflare said on Tuesday it resolved a “wide-spread” outage earlier in the day that affected a large number of services including FTX, Discord, Omegle, DoorDash, Crunchyroll, NordVPN and Feedly. The internet infrastructure firm resolved the issue roughly an hour after users began facing issues accessing some popular sites including Zerodha, Medium, [The] Register, Groww, Buffer, iSpirt, Upstox and Social Blade.
…
Users had also indicated that they were struggling to use Coinbase, Shopify and League of Legends, according to DownDetector. [Cloudflare] faced a similar outage in some parts of the world last week.
Cloudflare’s Tom Strickx and Jeremy Hartman ’fess up—Cloudflare outage:
Cloudflare suffered an outage that affected traffic in 19 of our data centers. Unfortunately, these 19 locations handle a significant proportion of our global traffic. This outage was caused by a change that was part of a long-running project … to convert all of our busiest locations to a more flexible and resilient architecture … called Multi-Colo PoP (MCP).
…
While deploying a change to our [BGP] prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes. Due to this withdrawal, [we] experienced added difficulty in reaching the affected locations to revert the problematic change. … Even though these locations are only 4% of our total network, the outage impacted 50% of total requests. … [It] also caused our internal load balancing system … to stop working. … This meant that our smaller compute clusters … received the same amount of traffic as our largest clusters, causing the smaller ones to overload.
…
We clearly fell short of our customer expectations with this very painful incident. … We have identified several areas of improvement and will continue to work on uncovering any other gaps … to ensure this cannot happen again.
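To see how merely re-ordering terms can withdraw prefixes, here is a toy model of first-match policy evaluation. It is a sketch under stated assumptions, not Cloudflare's actual configuration: the term names, the role tags and the prefixes (taken from the documentation address ranges) are all hypothetical, and real routing policy languages are considerably richer.

```python
from typing import Callable, Dict, List, NamedTuple


class Term(NamedTuple):
    name: str
    matches: Callable[[str], bool]  # predicate over a prefix's role tag
    action: str                     # "accept" or "reject"


def evaluate(policy: List[Term], tag: str) -> str:
    """Return the action of the first matching term (default: reject)."""
    for term in policy:
        if term.matches(tag):
            return term.action
    return "reject"


def advertised(policy: List[Term], prefixes: Dict[str, str]) -> List[str]:
    """List the prefixes this policy would advertise, in sorted order."""
    return sorted(p for p, tag in prefixes.items()
                  if evaluate(policy, tag) == "accept")


# Hypothetical prefixes, tagged by role (assumption, for illustration only).
PREFIXES = {
    "192.0.2.0/24": "site-local",
    "198.51.100.0/24": "anycast",
}

allow_site_local = Term("allow-site-local", lambda tag: tag == "site-local", "accept")
allow_anycast = Term("allow-anycast", lambda tag: tag == "anycast", "accept")
reject_rest = Term("reject-the-rest", lambda tag: True, "reject")

# Same three terms; only the order differs.
intended = [allow_site_local, allow_anycast, reject_rest]
reordered = [allow_anycast, reject_rest, allow_site_local]

# Prefixes that were advertised before the change but not after it.
withdrawn = sorted(set(advertised(intended, PREFIXES))
                   - set(advertised(reordered, PREFIXES)))
print("withdrawn after re-ordering:", withdrawn)
```

In this model the catch-all reject now sits above the site-local term, so the site-local prefix never reaches its accept and silently drops out of the advertisement. A pre-deployment check that diffs the advertised set before and after a change would flag this kind of mistake.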
It’s a systemic problem across the industry, thinks jiggawatts:
The default way that most networking devices are managed is crazy in this day and age. [It] is something every network admin has to implement bespoke after learning the hard way.
…
I’ve personally watched admins make routing changes where any error would cut them off from the device they are managing and prevent them from rolling it back — pretty much what happened here. … Many devices still rely on “not saving” the configuration, with a power cycle as the rollback to the previous saved state. This is a great way to turn a small outage into a big one.
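The safer pattern the commenter is alluding to can be sketched as a confirm-or-rollback commit: the device applies the change, then reverts on its own unless the admin confirms within a deadline. Some network operating systems offer a feature along these lines. The class below is only a toy model; the `Device` class, its method names and the `mgmt_route` setting are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from typing import Optional


class Device:
    """Toy router with a confirm-or-rollback commit (illustrative only)."""

    def __init__(self, config: dict):
        self.running = dict(config)
        self._previous: Optional[dict] = None
        self._deadline: Optional[float] = None

    def commit_confirmed(self, new_config: dict, now: float, timeout: float) -> None:
        """Apply new_config, keeping the old config until confirm() is called."""
        self._previous = dict(self.running)
        self.running = dict(new_config)
        self._deadline = now + timeout

    def confirm(self, now: float) -> bool:
        """Make the change permanent, if the admin reaches the device in time."""
        if self._deadline is not None and now <= self._deadline:
            self._previous = None
            self._deadline = None
            return True
        return False

    def tick(self, now: float) -> None:
        """Run periodically on the device; reverts an unconfirmed change."""
        if self._deadline is not None and now > self._deadline:
            self.running = self._previous
            self._previous = None
            self._deadline = None


# A change that cuts off management access can never be confirmed...
router = Device({"mgmt_route": "up"})
router.commit_confirmed({"mgmt_route": "down"}, now=0.0, timeout=600.0)

# ...so once the deadline passes, the device rolls itself back.
router.tick(now=601.0)
print(router.running)
```

The design point is that the rollback runs on the device itself, so it still works when the change has severed the admin's only path in.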
People in glass houses, etc. Here’s Claptrap314:
Getting this stuff right is **** hard. … This incident is another reminder that resilience happens at every level of the stack.
2. 5G: New Radio, Old Rules
Back in January, I told you about the last-minute hiccup in the truce between the FCC and the FAA over 5G/NR towers near airports. As you might recall, the FAA and airlines were worried about interference between 5G band 77 and radar altimeters installed in older aircraft.
Analysis: FCC loses patience
There’s no overlap between band 77 and the altimeters’ allocated frequencies—indeed, there’s a meaty “guard band” between the two. The FCC’s argument is that altimeters that suffer interference weren’t properly designed in the first place, so it’s unreasonable to delay the 5G rollout any longer. So the FAA is now demanding airlines install filters to mitigate the interference risk. The FAA wants this done by the end of the year.
Jon Gold: Telecom companies, FAA strike deal on 5G interference
New equipment for older airplanes is the latest step forward in the ongoing dispute between the major telecom companies and the [FAA], as regulators agree to further measures aimed at reducing perceived safety risks caused by 5G. … The frequencies used by radio altimeter systems, which are an important safety feature for landing aircraft, are close to those used by some kinds of 5G.
…
A deal … outlines new requirements for operators … to add radio frequency filters to their aircraft, and sets a deadline of the end of 2022. … The telecom companies have been mitigating the potential interference by lowering transmission power at 5G access points. … The new deal will see the mitigations continue through the end of 2022.
…
The airline industry, however, was unimpressed by the terms of the deal … citing a lack of technical detail and regulatory approval of replacement altimeter devices.
International Air Transport Association SVP of operations, safety and security, Nick Careen tells Robert Silk he isn’t impressed:
IATA’s Careen said that it is unclear that aircraft manufacturers can even supply all the required altimeter filters by those deadlines. “The FAA claims there is consensus on this date, which there isn’t. … I’m willing to place money on it right now that … we will see massive disruptions.”
…
He called attention to the rules in France. Buffer zones there, he said, are [4.8x bigger than] the temporary concessions that AT&T and Verizon have agreed to in the U.S. In addition, France requires antennas to be tilted downward, while limiting the power of transmissions near airports to less than half of the power being broadcast by the U.S. telecom companies.
But Jon Brodkin can’t see what the fuss is about—Altimeter fixes will let AT&T and Verizon fully deploy 5G on C-Band spectrum:
The Federal Communications Commission in February 2020 approved mobile use in the C-Band, specifically from 3.7 to 3.98 GHz. As airplane altimeters rely on a spectrum from 4.2 GHz to 4.4 GHz, this left a 220 MHz guard band to protect altimeters. [It] is really 400 MHz in practice this year because AT&T and Verizon are not yet deploying above 3.8 GHz.
…
The FCC found that harmful interference to altimeters was unlikely to occur “under reasonable scenarios” given the size of the guard band and power limits the FCC required. … The FCC also urged the aviation industry to conduct more research … “given that well-designed equipment should not ordinarily receive any significant interference.” … The aviation industry’s slowness … could result in new receiver regulations similar to the rules that already require wireless devices to transmit only in their licensed frequencies.
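The guard-band figures quoted above are easy to check. This minimal sketch uses only the band edges given in the quote, converted to whole MHz; the constant and function names are my own.

```python
# Band edges from the quoted figures, in MHz.
C_BAND_TOP_MHZ = 3980        # top of the approved 3.7 to 3.98 GHz range
DEPLOYED_TOP_MHZ = 3800      # carriers not yet deploying above 3.8 GHz
ALTIMETER_BOTTOM_MHZ = 4200  # altimeters use 4.2 to 4.4 GHz


def guard_band_mhz(top_of_5g_mhz: int,
                   bottom_of_altimeter_mhz: int = ALTIMETER_BOTTOM_MHZ) -> int:
    """Unused spectrum between the top of the 5G band and the altimeter band."""
    return bottom_of_altimeter_mhz - top_of_5g_mhz


print(guard_band_mhz(C_BAND_TOP_MHZ))    # 220
print(guard_band_mhz(DEPLOYED_TOP_MHZ))  # 400
```

Both results match the quote: 220 MHz on paper, and 400 MHz while deployments stay at or below 3.8 GHz.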
How did we get into this mess? drinkypoo adds perspective:
We have rules about frequency allocations specifically to avoid problems like these. The engineers who designed these altimeters and the managers who signed off on the designs are the ones at fault here. They collaborated to create a future problem, for profit.
It was part of the legal landscape at the time that frequency allocations could be resold for other purposes. Good engineering takes the regulatory landscape into account. … These devices [are not] so old that they predate frequency allocations.
Butbutbutbut … it’s safety critical! That argument doesn’t wash with dooferorg:
Airlines and aircraft manufacturers are lazy. … Get over yourselves, seriously. Should have designed things better from the start if they’re so ‘safety critical.’
The Moral of the Story:
Look like the innocent flower, but be the serpent under it
You have been reading The Long View by Richi Jennings. You can contact him at @RiCHi or tlv@richi.uk.