[HamWAN PSDR] Service Impact Notice

Fri Mar 11 13:12:19 PST 2016

Replying to the latest fully-quoted message instead of Ed's, but Ed, 
your observations are spot on.

Rob, I think the concept of network ops is finished both in the industry 
and for HamWAN.  In the industry, we're working at such enormous scales 
that you cannot possibly staff enough people to do any of the ops tasks 
manually.  Even if you did, the unavoidable human failure rate would 
cripple your resulting system.  In HamWAN, we have the same problems as 
industry (albeit at a microscopic scale), but additionally, requiring 
staff to operate things is an adoption hurdle.  We don't have the 
incentive of wages to staff these required job functions.  Combine that 
with a general lack of computer/network knowledge in the ham community 
and you're doomed, even if you did manage to gather enough well-meaning 
people to support you.

This problem isn't unique to the Puget Sound Data Ring.  Everyone else 
trying to implement a HamWAN will face the same challenges, as Ed 
correctly points out.  We need to make the leap from phase 1 to phase 2 
(see Ed's email), because we've been successful enough (yay!) to grow to 
such a scale that we're starting to fail at phase 1.

HamWAN has so far delivered interfacing standards, and a bunch of docs 
that educate people on suggestions (not standards) for how to configure 
the non-standardized parts of your network.  That's a good starting 
point, but now that we know our standard ideas work reasonably well, 
it's time to take on the additional task of making them 
self-implementing in new HamWAN instances.  This means a lot of software 
development.

And therein lies the problem.  In this project we have maybe 2 people 
who can help write the software required.  For us to successfully make 
the leap from phase 1 to phase 2, we've got to become attractive to 
people who write software.  A team of 6-10 folks would give us a good 
chance at making the leap.

I'm not sure how to do recruiting for this, but don't let that be the 
central question of this email.  I'd like to hear from people if they 
agree with the direction shift I've proposed here.

--Bart

On 3/11/2016 10:25 AM, Sam Kuonen wrote:
>
> I'll echo the time constraints. We're looking at core infrastructure 
> deployment for Georgia, USA and have a lot of generalized interest in 
> the project.
>
> We're experiencing similar volunteer constraints and have yet to begin 
> full operations. I can only picture how physical network operations 
> are going to proceed and suffer once those deployments start.
>
> Regards,
>
> Sam Kuonen, KK4UVL
>
>
> On Fri, Mar 11, 2016, 12:29 PM Nigel Vander Houwen <nigel at nigelvh.com> 
> wrote:
>
>     Bart, Rob,
>
>     The biggest problem I see here is time resources. I brought this
>     up to Bart off list, but there’s a continuing struggle to either
>     have time to do the work yourself, or get other people to do the work.
>
>     I deployed all of our monitoring and logging infrastructure, and I
>     can say as a fact it’s been a struggle to get anyone to even do
>     the basic work of adding new devices to the existing monitoring
>     system, even after providing tutorials. This has gotten a bit
>     better in very recent history, but it remains an issue.
>
>     Automation is absolutely something we need to put more work into.
>     Ryan and I have already put a bunch of work into this, which
>     again, we have struggled to get folks to pick up, use, and
>     contribute to.
>
>     Modems breaking happens, and site access can be a significant
>     problem. The East Tiger-SnoDEM link that Bart called out has been
>     known to be down, but we can’t feasibly get that replaced in the middle
>     of winter. Hopefully soon that can be taken care of.
>
>     We can try to treat this like a production network all we want,
>     but the reality is that we have effectively one part time staff
>     trying to do, as Rob put it, both the Operations and Development work.
>
>     The reality is that this is a network with VERY limited admin
>     resources, which get split up to do various important things, the
>     900MHz work included, but that leaves even less available to do
>     any day to day work. This isn’t our full time job, we’re not paid,
>     we all have lives and families, we have VERY few people that
>     actually volunteer to do any of the work, so the reality is
>     there’s a lot we have a hard time getting to. Reality puts us much
>     closer to “best effort” than “production”, and until we get more
>     time/resources to do the work, it’s going to continue to be a
>     struggle.
>
>     If folks want to volunteer, I’d be happy to put them on
>     improvements in monitoring, automation, and fixing things in the
>     existing production network.
>
>     Nigel
>
>>     On Mar 11, 2016, at 09:11, Rob Salsgiver <rob at nr3o.com> wrote:
>>
>>     Bart,
>>     You touch on a few things that have been “niggling” at the back
>>     of my mind for quite a while now – most of them come down in one
>>     way or another to overall reliability (of HamWAN) for EMCOMM,
>>     which most know has been my main driver for supporting the effort.
>>     There’s been a TON of great work done and quite frankly, I’ve
>>     been amazed that HamWAN has gone as far and fast as it has,
>>     particularly for a “ham” effort.
>>     At the same time we’ve slowly been adding and attracting the
>>     attention of various EMCOMM organizations with the promise and
>>     potential of redundant, reliable, resilient communications when
>>     “the big one” hits.  Obviously not everything in HamWAN is expected
>>     to survive a major quake or other event, but even pockets of
>>     reliable, high-speed communication are more than what can be
>>     accomplished via voice relays.
>>     All of which brings us back to the current outage and discussion.
>>     There have been several outages in key places since we began. 
>>     Last year SnoDEM was all but stranded due to a Haystack modem
>>     failure and other events at the same time.  Now we have a similar
>>     situation in a different place brought on by multiple failures or
>>     weaknesses.  In other instances I’ve been told we’ve had outages
>>     via misconfigured devices or other reasons.  Even in a perfect
>>     world, human error happens.
>>     I believe HamWAN would benefit from somewhat of a shift in
>>     operating philosophy that would create two separate departments
>>     or divisions – operations and development.
>>
>>     Operations responsibilities
>>     1) Provide day-to-day monitoring of network resources and conditions
>>     2) Manage (admin) those portions of the network that are
>>     designated as “in production”.  This should be the majority of
>>     the network.
>>     3) Provide communications and coordination of network maintenance
>>     4) Maintain an active inventory of all operational (production)
>>     sites, site hardware, and site access information.
>>     5) Maintain and manage all production site device configurations
>>     and config change management.
>>     6) Coordinate implementation of new functionality introduced by
>>     the Development department with appropriate monitoring, end-user
>>     communication, etc.
>>     7) Recommend topics and technologies to be explored by the
>>     Development team to enhance operational stability and delivery of
>>     new features to the network.
>>     8) Document technologies, methods, and tools selected for use (and
>>     why) from an operational standpoint.
>>     9) Maintain an active inventory of spare hardware to support all
>>     sites.
>>     10) Establish a plan to correct ALL key site failures within XXXX
>>     days.
>>     11) Coordinate with Development to actively inject and test
>>     network failures and redundancy capabilities.
>>     12) Coordinate with Development to enhance HamWAN’s ability to
>>     operate in “pockets” when portions of the network fail in an
>>     earthquake, i.e. each “island” stays operational with as many
>>     services as possible
>>
>>     Development responsibilities
>>     1) Continued exploration of new hardware, software, and network
>>     management tools (Quagga vs BIRD, Metals vs QRTs, etc)
>>     2) Conduct experimentation with new hardware and software on
>>     separate network resources where possible, or in coordination
>>     with Operations on the larger network (more on this below).
>>     3) Document technologies, methods, and tools explored and indicate
>>     pros/cons of each where possible.
>>     4) Continued exploration, analysis, and documentation of available
>>     antenna and shielding designs
>>     5) Exploration of new antenna designs and/or other hardware?
>>     6) Exploration of new frequencies and how they are affected by
>>     terrain, vegetation, weather, etc
>>     7) This particular list can go on FOREVER
>>
>>     The distinction here is largely mental, but it’s important.  It
>>     is entirely probable to have the same people in both groups, yet
>>     having the separation is important if HamWAN wishes to be taken
>>     seriously as a services provider to the EMCOMM community.  Any
>>     benefits from that would also improve service for ALL HamWAN users.
>>     Having EMCOMM onboard is important.  Not only does it provide a
>>     needed service to them, but if critical mass can be achieved it
>>     gives HamWAN access to multiple sites in every city and county. 
>>     In turn though, HamWAN as a network needs to be reliable in the
>>     “customer’s” eyes. This means that infrastructure is managed with
>>     uptime as the highest priority, experimentation is managed to
>>     minimize adverse production impacts, and equipment failures are
>>     identified and corrected quickly.
>>     This is admittedly a fair amount of work.  Much of it I suspect
>>     is already underway – maybe just not quite in this format.
>>     Additional help will definitely be useful.  Everyone involved
>>     only has so much time available, and they should be able to focus
>>     on those items that are important to them.  I believe the above
>>     framework (or something similar) begins to put some useful
>>     structure in place that continues to shape HamWAN from being the
>>     “wild west” of amateur and network “geek” exploration into the
>>     reliable, commercial grade, disaster resistant, amateur platform
>>     it envisions being - while still allowing amateurs to push the
>>     limits of technology like they are meant to.
>>     If the above (or something similar) is of interest to the current
>>     directors and group as a whole, we can easily create a similar
>>     worklist from which individuals on the sidelines can pick
>>     things they can help bring about.
>>     Just ideas.  Not saying they’re perfect, but it’s a start.  Any
>>     other thoughts?
>>     Cheers,
>>     Rob Salsgiver – NR3O
>>     From: PSDR [mailto:psdr-bounces at hamwan.org] On Behalf Of Bart Kus
>>     Sent: Friday, March 11, 2016 12:56 AM
>>     To: psdr at hamwan.org
>>     Subject: Re: [HamWAN PSDR] Service Impact Notice
>>
>>     Hmm that's not the whole story though.  If it were just the 1
>>     router failure (in reality a hypervisor failure), we'd be in a
>>     much better position, but it's combined with 2 other modem
>>     failures.  We had the ETiger->SnoDEM modem die over the winter,
>>     and it needs replacement.  That link has been down for a month or
>>     more now.  And most recently we're having the Tukwila->Baldi
>>     modem lose connectivity frequently.  We've implemented an
>>     automatic mitigation for that, but it still produces sporadic
>>     short downtime windows of a few minutes. I'd just like to move
>>     that modem to a NetMetal 5. Our servers are also being affected
>>     by instability in the Quagga routing software.  We need to
>>     replace this with a more stable alternative, like BIRD.  Lastly,
>>     the Baldi emergency uplink is only configured to go to Westin and
>>     Corvallis, but not Tukwila.
>>
>>     We could have avoided DNS outages too, if the anycast groups were
>>     populated with more of the available servers.  I believe lack of
>>     good automation for server build-outs is causing the deployment
>>     lag here.
>>
>>     The network is designed to withstand failures, even multiple
>>     failures, but we've got many broken things right now that need
>>     fixing.  After that fixing, I would really love to see some folks
>>     get behind improving our monitoring, deployment and diagnostic
>>     automation.  Networks like this won't scale unless they're nearly
>>     completely automated and simple to manage.  I would not mind at
>>     all if we even rolled back some features until we can get them
>>     re-implemented in 100% automated ways.
>>
>>     As important as all this is, I still think the deep penetration
>>     project takes precedence, so I can't drop that work in favor of
>>     this.  Aside from helping out on the simple break-fix stuff, I mean.
>>
>>     --Bart
>>
>>     On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
>>>     Thanks for the update, Nigel.
>>>     On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen
>>>     <nigel at nigelvh.com> wrote:
>>>>     Hello All,
>>>>
>>>>     Just wanted to send out a quick notice here. We’ve had a
>>>>     failure at our Seattle edge router, which we’re still
>>>>     investigating. In the meantime, our Tukwila edge router is
>>>>     still providing connectivity, but you may notice higher
>>>>     latencies or issues reaching things. If you find things you
>>>>     can’t reach, please let me know, as we’d like to make sure the
>>>>     redundancy is working, while we’re working to resolve the
>>>>     issues we’re investigating with the Seattle edge router.
>>>>
>>>>     Nigel
>>>>     _______________________________________________
>>>>     PSDR mailing list
>>>>     PSDR at hamwan.org
>>>>     http://mail.hamwan.net/mailman/listinfo/psdr
>>>
>>>
>>>     --
>>>
>>>     Ryan Turner
>>>
