At approximately 5:00AM AEST, Mammoth started receiving a high rate of errors from our monitoring probes for traffic destined for our Xen VPS servers (the "Xen network" for the rest of this postmortem).
After investigating the fault, we determined that while the Xen network was accessible from all Australian locations we tested, international traffic was being dropped at TPG router 202.7.173.230. Here is an example traceroute:
1 199.87.228.65 (199.87.228.65) 0.802 ms 1.353 ms 1.712 ms
2 pdx-edge-rtr01.forked.net (199.87.231.25) 2.214 ms 2.571 ms 2.830 ms
3 v323.core1.pdx1.he.net (216.218.244.225) 3.591 ms 4.012 ms 4.222 ms
4 * * *
5 10ge10-20.core1.sjc2.he.net (72.52.92.157) 18.788 ms 19.364 ms 19.588 ms
6 tpg-internet-pty-ltd.10gigabitethernet12-1.core1.sjc2.he.net (64.62.194.114) 197.041 ms 197.205 ms 197.328 ms
7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 222.355 ms 222.117 ms 222.190 ms
8 202.7.173.230 (202.7.173.230) 194.314 ms 194.251 ms 194.317 ms
9 * * *
By comparison, a working traceroute previously ended like this:
6 tpg-internet-pty-ltd.10gigabitethernet3-1.core1.sjc1.he.net (72.52.66.22) 148.588 ms 148.621 ms 148.616 ms
7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 186.208 ms 186.483 ms 186.590 ms
8 202.7.173.230 (202.7.173.230) 186.779 ms 187.769 ms 187.336 ms
9 203.220.0.231.mammoth.net.au (203.220.0.231) 189.997 ms 190.253 ms 190.375 ms
(where hop 9 is the Xen network router)
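As a rough illustration of how traces like the ones above can be checked automatically, the following Python sketch reports the last hop that answered. The names parse_hops and last_responding_hop are hypothetical and are not part of Mammoth's monitoring. The sketch assumes the common "hop host (ip) rtt ..." traceroute layout shown in this postmortem.

```python
import re

# A hop line starts with the hop number; the address appears in parentheses.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
IP_RE = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")


def parse_hops(trace_text):
    """Return a list of (hop_number, ip) tuples; ip is None for '* * *' hops."""
    hops = []
    for line in trace_text.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # skip headers and blank lines
        ip = IP_RE.search(match.group(2))
        hops.append((int(match.group(1)), ip.group(1) if ip else None))
    return hops


def last_responding_hop(trace_text):
    """Return the (hop_number, ip) of the last hop that replied, or None."""
    responding = [hop for hop in parse_hops(trace_text) if hop[1] is not None]
    return responding[-1] if responding else None


# Tail of the failing trace from this postmortem.
FAILING_TRACE = """\
7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 222.355 ms 222.117 ms 222.190 ms
8 202.7.173.230 (202.7.173.230) 194.314 ms 194.251 ms 194.317 ms
9 * * *
"""

print(last_responding_hop(FAILING_TRACE))  # (8, '202.7.173.230')
```

If the last responding hop is not the expected destination router (203.220.0.231 here), the trace stopped short, which is what the failing example shows.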
A fault was raised with TPG via email at 5:20AM AEST. With no resolution, we escalated the issue by phone at 6:40AM, at which point TPG confirmed to Mammoth that it was a router fault and was being worked on.
At approximately 7:20AM, TPG stopped announcing Mammoth IP space via BGP and the Xen network became inaccessible from all locations. At approximately 7:50AM, the BGP announcement resumed and service was restored for both Australian and international traffic.
TPG has not confirmed the specifics of the root cause or resolution, but the traceroute now shows:
6 tpg-internet-pty-ltd.10gigabitethernet3-1.core1.sjc1.he.net (72.52.66.22) 148.588 ms 148.621 ms 148.616 ms
7 203-219-35-147.static.tpgi.com.au (203.219.35.147) 149.843 ms 148.347 ms 148.442 ms
8 203.220.0.231.mammoth.net.au (203.220.0.231) 147.237 ms 147.209 ms 147.414 ms
The trace is now one hop shorter. We therefore conclude that the total outage between 7:20AM and 7:50AM corresponds to TPG migrating the Xen network from router 202.7.173.230 to a direct connection with their upstream router.
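The path comparison behind this conclusion can be sketched in a few lines of Python. This is only an illustration: diff_paths is a hypothetical helper, and the address lists are copied by hand from hops 6 onward in the traces above.

```python
def diff_paths(before, after):
    """Return (removed, added) hop addresses between two traced paths."""
    removed = [ip for ip in before if ip not in after]
    added = [ip for ip in after if ip not in before]
    return removed, added


# Hops 6 onward, before the incident and after service was restored.
before = ["72.52.66.22", "203.219.35.129", "202.7.173.230", "203.220.0.231"]
after = ["72.52.66.22", "203.219.35.147", "203.220.0.231"]

removed, added = diff_paths(before, after)
print(len(before) - len(after))  # 1
print(removed)  # ['203.219.35.129', '202.7.173.230']
print(added)    # ['203.219.35.147']
```

The output shows 202.7.173.230 no longer in the path. It also shows that the TPG hop in front of the Xen network router answers from a different address (203.219.35.147 rather than 203.219.35.129), which TPG has not explained.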