On 4 Jun 2024, at 8:53, Ron B via zanog-discuss wrote:
There were path failures to Amazon and IBM Cloud within the transit provider's network. Traces in both directions failed inside the transit provider. The policy preferred the routes via Transit B for Amazon and IBM Cloud, but not necessarily for the rest of the Internet.
If the path (e.g. the peering) between the transit provider and, for example, Amazon failed, then Amazon would not receive your prefixes and would not serve traffic via that path.
That is what I was noticing: the path was broken, but the routes were still being advertised over that path.
I’m still not seeing how this is possible. If the transit provider’s link to another network is down, the BGP session would be down and the prefixes withdrawn.
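A minimal sketch of the control-plane behaviour being discussed, in Python. It models only what a router learns over each BGP session; the names Router, Peer, learn and session_down are hypothetical, and 198.51.100.0/24 is a documentation prefix. Routes disappear when the session carrying them drops, while a fault that breaks forwarding inside the transit network but leaves its sessions up changes nothing in this state.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical model of one BGP neighbour and the prefixes learned from it."""
    name: str
    session_up: bool = True
    adj_rib_in: set[str] = field(default_factory=set)


class Router:
    """Control-plane state only: no forwarding, no probing of the data path."""

    def __init__(self) -> None:
        self.peers: dict[str, Peer] = {}

    def learn(self, peer: str, prefix: str) -> None:
        # Receiving an UPDATE implies the session is established.
        p = self.peers.setdefault(peer, Peer(peer))
        p.session_up = True
        p.adj_rib_in.add(prefix)

    def session_down(self, peer: str) -> None:
        # When the session drops, every route learned over it is withdrawn.
        p = self.peers[peer]
        p.session_up = False
        p.adj_rib_in.clear()

    def reachable_via(self, prefix: str) -> list[str]:
        # Which peers currently offer a route to this prefix.
        return sorted(
            p.name for p in self.peers.values()
            if p.session_up and prefix in p.adj_rib_in
        )


cloud = Router()
cloud.learn("transit-b", "198.51.100.0/24")

# A forwarding fault *inside* the transit network, with the eBGP session to the
# cloud still up, does not touch this state, so the route stays advertised.
print(cloud.reachable_via("198.51.100.0/24"))  # ['transit-b']

cloud.session_down("transit-b")
print(cloud.reachable_via("198.51.100.0/24"))  # []
```

In this simplified model, reachability only tracks whether a session is up and the route was learned over it, not whether packets actually get through.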
We agree on that. My opinion is that it's someone playing silly buggers with peering. It's intermittent, but it happened for 20 minutes last night. I saw it happen twice last year.
There are times when the cloud provider’s network is broken: traffic is still being sent and received over an IX, but then it dies within the cloud’s network.
This does not sound like the same thing. If you would like to share some of those traces, either on- or off-list, I can look into it further.
Most tickets are dealt with as an immediate, visible fault, but asking about the root cause meets a brick wall or some BS template response that makes no sense in the context of the fault.
It might be worth escalating it with your transit provider if you feel you are not getting a good enough response.
Like the owners of the 5 submarine cables, who haven't explained how there is a single point of failure for the whole continent of Africa 250 km off Cote d'Ivoire.
Different thread, but I will bite: we can only believe what the cable operators share with us in a public forum, namely that 3 cables broke due to rock slides.
If a network doesn't have diversity, then yes, its African operations will be isolated when a couple of cables break. It does seem that, even with “local” presence and availability zones in ZA, there is still a huge reliance on international connectivity. We did discuss this at the ZANOG-GP Workshop on network resilience.
I assume the transit provider advertises prefixes to the cloud providers regardless of an internal fault, because the route metric reflects the path between the two networks and not the full downstream path within the transit provider?
If there is a BGP session, prefixes will be exchanged. BGP does not carry metrics like latency and packet loss, and it does not withdraw or weight prefixes based on them.
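A simplified sketch of BGP best-path selection in Python, to illustrate why latency and loss do not influence the choice. Only a subset of the real decision process is modelled (highest LOCAL_PREF, shortest AS_PATH, lowest MED, then a hypothetical tie-break on the peer name standing in for router ID). The Route class, best_path function, AS numbers (private/documentation range) and prefix are illustrative assumptions, and the loss figures stand in for out-of-band probe results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    peer: str                  # which transit the route was learned from
    local_pref: int            # set by local policy (100 is a common default)
    as_path: tuple[int, ...]
    med: int = 0


def best_path(candidates: list[Route]) -> Route:
    """Simplified subset of the BGP decision process. Real implementations have
    more steps, and MED is normally compared only between routes from the same
    neighbouring AS. The peer-name tie-break is a hypothetical stand-in."""
    return min(
        candidates,
        key=lambda r: (-r.local_pref, len(r.as_path), r.med, r.peer),
    )


# Policy in the spirit of the thread: prefer Transit B for a cloud provider's prefix.
routes = [
    Route("203.0.113.0/24", "transit-a", local_pref=100, as_path=(64500, 64510)),
    Route("203.0.113.0/24", "transit-b", local_pref=200, as_path=(64501, 64510)),
]

# Loss measured out of band (e.g. by probes). BGP has no attribute for this,
# so it is never passed to best_path().
observed_loss = {"transit-a": 0.0, "transit-b": 1.0}

chosen = best_path(routes)
print(chosen.peer, observed_loss[chosen.peer])  # transit-b 1.0
```

Even with 100% observed loss via Transit B, the policy-driven LOCAL_PREF keeps it as the best path, because nothing in the decision process sees the measurement.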
Maybe it is time for SD-BGP and BaaS (BGP as a Service) :) Remember to sprinkle some AI in there too.