[HamWAN PSDR] Service Impact Notice

Fri Mar 11 09:29:12 PST 2016

Bart, Rob,

The biggest problem I see here is time resources. I brought this up to Bart off list, but there’s a continuing struggle to either have time to do the work yourself, or get other people to do the work.

I deployed all of our monitoring and logging infrastructure, and I can say for a fact it’s been a struggle to get anyone to even do the basic work of adding new devices to the existing monitoring system, even after I provided tutorials. This has gotten a bit better recently, but it remains an issue.

Automation is absolutely something we need to put more work into. Ryan and I have already put a bunch of work into this, which again, we have struggled to get folks to pick up, use, and contribute to.

Modems break, and site access can be a significant problem. The East Tiger-SnoDEM link that Bart called out has been known to be down, but we can’t feasibly get the modem replaced in the middle of winter. Hopefully that can be taken care of soon.

We can try to treat this like a production network all we want, but the reality is that we have effectively one part time staff trying to do, as Rob put it, both the Operations and Development work.

This is a network with VERY limited admin resources. Those resources get split across various important things, the 900MHz work included, which leaves even less for day-to-day work. This isn’t our full-time job, we’re not paid, and we all have lives and families. VERY few people actually volunteer to do any of the work, so there’s a lot we have a hard time getting to. Reality puts us much closer to “best effort” than “production”, and until we get more time and resources to do the work, it’s going to continue to be a struggle.

If folks want to volunteer, I’d be happy to put them on improvements in monitoring, automation, and fixing things in the existing production network. 

Nigel

> On Mar 11, 2016, at 09:11, Rob Salsgiver <rob at nr3o.com> wrote:
> 
> Bart,
>  
> You touch on a few things that have been “niggling” at the back of my mind for quite a while now – most of them come down in one way or another to overall reliability (of HamWAN) for EMCOMM, which most know has been my main driver for supporting the effort.  
>  
> There’s been a TON of great work done and quite frankly, I’ve been amazed that HamWAN has gone as far and fast as it has, particularly for a “ham” effort.  
>  
> At the same time we’ve slowly been adding, and attracting the attention of, various EMCOMM organizations with the promise and potential of redundant, reliable, resilient communications when “the big one” hits.  Obviously not everything in HamWAN is expected to survive a major quake or other event, but even pockets of reliable, high-speed communication are more than what can be accomplished via voice relays.
>  
> All of which brings us back to the current outage and discussion.  There have been several outages in key places since we began.  Last year SnoDEM was all but stranded due to a Haystack modem failure and other events at the same time.  Now we have a similar situation in a different place brought on by multiple failures or weaknesses.  In other instances I’ve been told we’ve had outages via misconfigured devices or other reasons.  Even in a perfect world, human error happens.
>  
> I believe HamWAN would benefit from somewhat of a shift in operating philosophy that would create two separate departments or divisions – operations and development.  
>  
> Operations responsibilities
> 1)       Provide day to day monitoring of network resources and conditions
> 2)       Manage (admin) those portions of the network that are designated as “in production”.  This should be the majority of the network.
> 3)       Provide communications and coordination of network maintenance
> 4)       Maintain an active inventory of all operational (production) sites, site hardware, and site access information.
> 5)       Maintain and manage all production site device configurations and config change management.
> 6)       Coordinate implementation of new functionality introduced by the Development department with appropriate monitoring, end-user communication, etc
> 7)       Recommend topics and technologies to be explored by the Development team to enhance operational stability and delivery of new features to the network.
> 8)       Document technologies, methods, and tools selected for use (and why) from an operational standpoint.
> 9)       Maintain an active inventory of spare hardware to support all sites.
> 10)   Establish a plan to correct ALL key site failures within XXXX days.
> 11)   Coordinate with Development to actively inject and test network failures and redundancy capabilities.
> 12)   Coordinate with Development to enhance HamWAN’s ability to operate in “pockets” when portions of the network fail in an earthquake – i.e. – each “island” stays operational with as many services as possible
>  
> Development responsibilities
> 1)       Continued exploration of new hardware, software, and network management tools (Quagga vs BIRD, Metals vs QRTs, etc)
> 2)       Conduct experimentation with new hardware and software on separate network resources where possible, or in coordination with Operations on the larger network (more on this below).
> 3)       Document technologies, methods, and tools explored and indicate pros/cons of each where possible.
> 4)       Continued exploration, analysis, and documentation of available antenna and shielding designs
> 5)       Exploration of new antenna designs and/or other hardware?
> 6)       Exploration of new frequencies and how they are affected by terrain, vegetation, weather, etc
> 7)       This particular list can go on FOREVER
>  
> The distinction here is largely mental, but it’s important.  It is entirely likely that the same people will be in both groups, yet the separation matters if HamWAN wishes to be taken seriously as a services provider to the EMCOMM community.  Any benefits from that would also improve service for ALL HamWAN users.
>  
> Having EMCOMM onboard is important.  Not only does it provide a needed service to them, but if critical mass can be achieved it gives HamWAN access to multiple sites in every city and county.  In turn though, HamWAN as a network needs to be reliable in the “customer’s” eyes.  This means that infrastructure is managed with uptime as the highest priority, experimentation is managed to minimize adverse production impacts, and equipment failures are identified and corrected quickly.
>  
> This is admittedly a fair amount of work.  Much of it I suspect is already underway – maybe just not quite in this format.  Additional help will definitely be useful.  Everyone involved only has so much time available, and they should be able to focus on those items that are important to them.  I believe the above framework (or something similar) begins to put some useful structure in place that continues to shape HamWAN from being the “wild west” of amateur and network “geek” exploration into the reliable, commercial grade, disaster resistant, amateur platform it aims to be - while still allowing amateurs to push the limits of technology like they are meant to.
>  
> If the above (or something similar) is of interest to the current directors and group as a whole, we can easily create a similar worklist from which individuals on the sidelines can start picking things they can help bring about.
>  
> Just ideas.  Not saying they’re perfect, but it’s a start.  Any other thoughts?
>  
> Cheers,
> Rob Salsgiver – NR3O
>  
> From: PSDR [mailto:psdr-bounces at hamwan.org] On Behalf Of Bart Kus
> Sent: Friday, March 11, 2016 12:56 AM
> To: psdr at hamwan.org
> Subject: Re: [HamWAN PSDR] Service Impact Notice
>  
> Hmm that's not the whole story though.  If it were just the 1 router failure (in reality a hypervisor failure), we'd be in a much better position, but it's combined with 2 other modem failures.  We had the ETiger->SnoDEM modem die over the winter, and it needs replacement.  That link has been down for a month or more now.  And most recently we're having the Tukwila->Baldi modem lose connectivity frequently.  We've implemented an automatic mitigation for that, but it still produces sporadic short downtime windows of a few minutes.  I'd just like to move that modem to a NetMetal 5.  Our servers are also being affected by instability in the Quagga routing software.  We need to replace this with a more stable alternative, like BIRD.  Lastly, the Baldi emergency uplink is only configured to go to Westin and Corvallis, but not Tukwila.
> 
> We could have avoided DNS outages too, if the anycast groups were populated with more of the available servers.  I believe lack of good automation for server build-outs is causing the deployment lag here.
> 
> The network is designed to withstand failures, even multiple failures, but we've got many broken things right now that need fixing.  After that fixing, I would really love to see some folks get behind improving our monitoring, deployment and diagnostic automation.  Networks like this won't scale unless they're nearly completely automated and simple to manage.  I would not mind at all if we even rolled back some features until we can get them re-implemented in 100% automated ways.
> 
> As important as all this is, I still think the deep penetration project takes precedence, so I can't drop that work in favor of this.  Aside from helping out on the simple break-fix stuff, I mean.
> 
> --Bart
> 
> 
> On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
>> Thanks for the update, Nigel.
>>  
>> On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen <nigel at nigelvh.com> wrote:
>>> Hello All,
>>> 
>>> Just wanted to send out a quick notice here. We’ve had a failure at our Seattle edge router, which we’re still investigating. In the meantime, our Tukwila edge router is still providing connectivity, but you may notice higher latencies or issues reaching things. If you find things you can’t reach, please let me know, as we’d like to make sure the redundancy is working, while we’re working to resolve the issues we’re investigating with the Seattle edge router.
>>> 
>>> Nigel
>>> _______________________________________________
>>> PSDR mailing list
>>> PSDR at hamwan.org
>>> http://mail.hamwan.net/mailman/listinfo/psdr
>> 
>> 
>>  
>> -- 
>> Ryan Turner
>> 
>> 
>> 
>> 
