Network outage affecting Xen VPS
Incident Report for Mammoth Cloud
Postmortem

At approximately 5:00AM AEST, Mammoth started receiving a high rate of errors from our monitoring probes for traffic destined to our Xen VPS servers (the "Xen network" for the rest of this postmortem).

After investigating the fault, we determined that while the Xen network was accessible from all Australian locations we tested, international traffic was being dropped at TPG router 202.7.173.230. Here is an example traceroute:

1 199.87.228.65 (199.87.228.65) 0.802 ms 1.353 ms 1.712 ms
2 pdx-edge-rtr01.forked.net (199.87.231.25) 2.214 ms 2.571 ms 2.830 ms
3 v323.core1.pdx1.he.net (216.218.244.225) 3.591 ms 4.012 ms 4.222 ms
4 * * *
5 10ge10-20.core1.sjc2.he.net (72.52.92.157) 18.788 ms 19.364 ms 19.588 ms
6 tpg-internet-pty-ltd.10gigabitethernet12-1.core1.sjc2.he.net (64.62.194.114) 197.041 ms 197.205 ms 197.328 ms
7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 222.355 ms 222.117 ms 222.190 ms
8 202.7.173.230 (202.7.173.230) 194.314 ms 194.251 ms 194.317 ms
9 * * *

By comparison, a working traceroute previously ended like this:

 6  tpg-internet-pty-ltd.10gigabitethernet3-1.core1.sjc1.he.net (72.52.66.22)  148.588 ms  148.621 ms  148.616 ms
7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 186.208 ms 186.483 ms 186.590 ms
8 202.7.173.230 (202.7.173.230) 186.779 ms 187.769 ms 187.336 ms
9 203.220.0.231.mammoth.net.au (203.220.0.231) 189.997 ms 190.253 ms 190.375 ms

(where hop 9 is the Xen network router)
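The comparison between the two traces can also be done programmatically. The following is a minimal sketch, not the tooling Mammoth used: it assumes standard traceroute text output, and the function name `last_responding_hop` is hypothetical. It reports the last hop that answered, which is hop 8 in the failing trace and hop 9 in the working one.

```python
import re

# Matches the "(a.b.c.d)" address that traceroute prints after the hostname.
HOP_IP = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")

def last_responding_hop(trace_text):
    """Return (hop_number, ip) of the last hop that answered, or None."""
    last = None
    for line in trace_text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with a hop number.
        if not fields or not fields[0].isdigit():
            continue
        match = HOP_IP.search(line)
        if match:  # lines such as "9 * * *" carry no address and are skipped
            last = (int(fields[0]), match.group(1))
    return last

# Tail of the failing trace from this report.
failing = """\
7 203-219-35-129.static.tpgi.com.au (203.219.35.129) 222.355 ms 222.117 ms 222.190 ms
8 202.7.173.230 (202.7.173.230) 194.314 ms 194.251 ms 194.317 ms
9 * * *
"""

# Tail of the earlier, working trace from this report.
working = """\
8 202.7.173.230 (202.7.173.230) 186.779 ms 187.769 ms 187.336 ms
9 203.220.0.231.mammoth.net.au (203.220.0.231) 189.997 ms 190.253 ms 190.375 ms
"""

print(last_responding_hop(failing))  # (8, '202.7.173.230')
print(last_responding_hop(working))  # (9, '203.220.0.231')
```

A hop that prints only asterisks is not proof of a fault on its own, since some routers simply do not answer probes (hop 4 above is an example). What matters here is that every hop after 202.7.173.230 was silent and the destination was never reached.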

We raised a fault with TPG via email at 5:20AM AEST. With no resolution, we escalated the issue by phone at 6:40AM, and TPG confirmed to Mammoth that it was a router fault and was being worked on.

At approximately 7:20AM, TPG stopped announcing Mammoth IP space via BGP and the Xen network became inaccessible from all locations. At approximately 7:50AM, the BGP announcements resumed and service was restored for both Australian and international traffic.

TPG has not confirmed the specifics of the root cause or the resolution, but a traceroute now shows:

 6  tpg-internet-pty-ltd.10gigabitethernet3-1.core1.sjc1.he.net (72.52.66.22)  148.588 ms  148.621 ms  148.616 ms
 7  203-219-35-147.static.tpgi.com.au (203.219.35.147)  149.843 ms  148.347 ms  148.442 ms
 8  203.220.0.231.mammoth.net.au (203.220.0.231)  147.237 ms  147.209 ms  147.414 ms

The trace is now one hop shorter, from which we conclude:

  • router 202.7.173.230 is no longer in use; and
  • the fault was resolved by connecting Mammoth directly to the upstream router on 203.219.35.0/24
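The reasoning behind these two conclusions can be sketched as a simple path comparison. This is an illustrative sketch only: the hop lists are copied from the traces in this report, the helper name `path_changes` is hypothetical, and TPG has not confirmed the topology.

```python
import ipaddress

# Hop addresses from hop 6 onward, taken from the traces in this report.
before = ["72.52.66.22", "203.219.35.129", "202.7.173.230", "203.220.0.231"]
after = ["72.52.66.22", "203.219.35.147", "203.220.0.231"]

def path_changes(old, new):
    """Return (removed, added) hop addresses, preserving path order."""
    removed = [ip for ip in old if ip not in new]
    added = [ip for ip in new if ip not in old]
    return removed, added

removed, added = path_changes(before, after)
print(removed)  # ['203.219.35.129', '202.7.173.230']
print(added)    # ['203.219.35.147']

# The new hop before the Xen network router sits in 203.219.35.0/24.
upstream = ipaddress.ip_network("203.219.35.0/24")
print(ipaddress.ip_address(added[0]) in upstream)  # True
```

The comparison shows that 202.7.173.230 has dropped out of the path and that the hop immediately before the Xen network router is now an address in 203.219.35.0/24. A traceroute only reveals the hops that answer probes, so this remains an inference rather than a confirmed root cause.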

Thus, the total outage between 7:20AM and 7:50AM corresponds to TPG migrating the Xen network from router 202.7.173.230 to a direct connection with their upstream router.

Posted Aug 09, 2016 - 14:48 AEST

Resolved
This incident has been resolved.
Posted Aug 09, 2016 - 08:28 AEST
Monitoring
Connectivity from both Australia and overseas has been restored.
Posted Aug 09, 2016 - 07:55 AEST
Identified
TPG has confirmed a fault with an upstream router and is working to resolve the issue.
Posted Aug 09, 2016 - 07:52 AEST
Update
The issue is now also affecting Australian traffic. We are investigating.
Posted Aug 09, 2016 - 07:33 AEST
Investigating
We are aware of a network issue affecting our Xen VPS customers. The issue primarily seems to be impacting network access from international locations.
Posted Aug 09, 2016 - 05:11 AEST