<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hmm, that's not the whole story though. If it were just the one router
failure (in reality a hypervisor failure), we'd be in a much better
position, but it's combined with two modem failures. The
ETiger-&gt;SnoDEM modem died over the winter and needs
replacement. That link has been down for a month or more now. And
most recently the Tukwila-&gt;Baldi modem has been losing
connectivity frequently. We've implemented an automatic mitigation
for that, but it still produces sporadic short downtime windows of a
few minutes. I'd just like to move that modem to a NetMetal 5. Our
servers are also affected by instability in the Quagga routing
software, which we need to replace with a more stable alternative
like BIRD. Lastly, the Baldi emergency uplink is configured to
go only to Westin and Corvallis, not Tukwila.<br>
<br>
We could have avoided the DNS outages too, if the anycast groups were
populated with more of the available servers. I believe the lack of
good automation for server build-outs is causing the deployment lag
here.<br>
<br>
The network is designed to withstand failures, even multiple
failures, but we've got many broken things right now that need
fixing. Once those are fixed, I would really love to see some folks
get behind improving our monitoring, deployment and diagnostic
automation. Networks like this won't scale unless they're nearly
completely automated and simple to manage. I would not mind at all
if we even rolled back some features until we can get them
re-implemented in a fully automated way.<br>
<br>
As important as all this is, I still think the deep penetration
project takes precedence, so I can't drop that work in favor of
this. Aside from helping out on the simple break-fix stuff, I mean.<br>
<br>
--Bart<br>
<br>
<br>
<div class="moz-cite-prefix">On 3/9/2016 8:23 PM, Ryan Elliott
Turner wrote:<br>
</div>
<blockquote
cite="mid:CACHX6g99N99c4_+t4YacKOXz5=xmuZuNrZQF8kgUWQsiFK_L6Q@mail.gmail.com"
type="cite">
<div dir="ltr">Thanks for the update, Nigel.</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Mar 9, 2016 at 10:17 PM, Nigel
          Vander Houwen <span dir="ltr">&lt;<a moz-do-not-send="true"
              href="mailto:nigel@nigelvh.com" target="_blank">nigel@nigelvh.com</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hello All,<br>
<br>
Just wanted to send out a quick notice here. We’ve had a
failure at our Seattle edge router, which we’re still
investigating. In the meantime, our Tukwila edge router is
still providing connectivity, but you may notice higher
latencies or issues reaching things. If you find things you
can’t reach, please let me know, as we’d like to make sure
the redundancy is working, while we’re working to resolve
the issues we’re investigating with the Seattle edge router.<br>
<br>
Nigel<br>
_______________________________________________<br>
PSDR mailing list<br>
<a moz-do-not-send="true" href="mailto:PSDR@hamwan.org">PSDR@hamwan.org</a><br>
<a moz-do-not-send="true"
href="http://mail.hamwan.net/mailman/listinfo/psdr"
rel="noreferrer" target="_blank">http://mail.hamwan.net/mailman/listinfo/psdr</a><br>
</blockquote>
</div>
<br>
<br clear="all">
<div><br>
</div>
-- <br>
<div class="gmail_signature">
<p><font face="arial, helvetica, sans-serif">Ryan Turner</font></p>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>