[HamWAN PSDR] Service Impact Notice

Fri Mar 11 16:27:28 PST 2016

HamWAN isn't some magical network that never has failures.  If that's
the impression emcomm organizations are being sold, then that needs to
stop.

HamWAN is just as susceptible to component failure as any commercial
network out there.  The main difference is that we get to design and
scale the network to have an emphasis on reliability instead of
maximizing subscribers for profit.  We also have the huge advantage of
maintaining our own infrastructure.  That way, when parts fail they
can be fixed by someone from our own community instead of waiting for
a corporation to prioritize our issue in relation to their other
customers.  Can you imagine how long it would take a commercial
provider to fix your issue after the big one?  On the other hand, hams
are always prepared for the big one and can likely be deployed much
more quickly.

As the PSDR gets larger, it gets more resilient.  That's why it took
several things falling over simultaneously to cause an impact.  While
I agree with some of the ideas in this thread, I think we already meet
the bar for a lot of them.  We've always treated the network as
"production" with the potential for customer impact, which is why this
thread was started in the first place.  I don't believe we've ever had
an impacting event because someone was "experimenting" with something.
In the end, this really is an experimental network and needs to remain
so in order to recruit and train more hams into it.  The emcomm
organizations shouldn't be excited about HamWAN because it's more
reliable than their commercial networks.  They should instead be
excited by the fact that it's another community that can support them
in a disaster.
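
To make the resilience point concrete, here is a minimal sketch that
brute-forces the smallest number of simultaneous link failures needed to
cut some site off from the rest.  The link list is a made-up example that
borrows site names from this thread; it is not the real PSDR link map, and
the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical topology for illustration only; NOT the real PSDR link map.
LINKS = [
    ("Seattle", "Tukwila"),
    ("Tukwila", "Baldi"),
    ("Baldi", "ETiger"),
    ("ETiger", "SnoDEM"),
    ("SnoDEM", "Haystack"),
    ("Haystack", "Seattle"),
    ("Baldi", "Haystack"),
]


def reachable(links, start):
    """Return the set of sites reachable from start over the given links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    stack = [start]
    while stack:
        site = stack.pop()
        for nxt in neighbors.get(site, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def min_failures_to_partition(links):
    """Smallest number of simultaneous link failures that splits the network."""
    sites = {site for link in links for site in link}
    start = sorted(sites)[0]
    for k in range(1, len(links) + 1):
        for failed in combinations(links, k):
            remaining = [link for link in links if link not in failed]
            if reachable(remaining, start) != sites:
                return k
    return None


if __name__ == "__main__":
    chain = [("A", "B"), ("B", "C"), ("C", "D")]
    print("chain:", min_failures_to_partition(chain))
    print("ring with one cross link:", min_failures_to_partition(LINKS))
```

On the made-up ring with one cross link it takes two simultaneous link
failures to strand a site, while a simple chain is split by a single
failure.  The brute-force search is exponential in the number of links, so
it only suits small topologies like this one.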

-Cory
NQ1E

On Fri, Mar 11, 2016 at 1:12 PM, Bart Kus <me at bartk.us> wrote:
> Replying to the latest fully-quoted message instead of Ed's, but Ed, your
> observations are spot on.
>
> Rob, I think the concept of network ops is finished both in the industry and
> for HamWAN.  In the industry, we're working at such enormous scales that you
> cannot possibly staff enough people to do any of the ops tasks manually.
> Even if you did, the unavoidable human failure rate would cripple your
> resulting system.  In HamWAN, we have the same problems as industry (albeit
> at a microscopic scale), but additionally requiring staff to operate things
> is an adoption hurdle.  We don't have the incentive of wages to staff these
> required job functions.  Combine that with a general lack of
> computer/network knowledge in the ham community and you're doomed, even if
> you did manage to gather enough well-meaning people to support you.
>
> This problem isn't unique to the Puget Sound Data Ring.  Everyone else
> trying to implement a HamWAN will face the same challenges, as Ed correctly
> points out.  We need to make the leap from phase 1 to phase 2 (see Ed's
> email), because we've been successful enough (yay!) to grow to such a scale
> that we're starting to fail at phase 1.
>
> HamWAN has so far delivered interfacing standards, and a bunch of docs that
> educate people on suggestions (not standards) for how to configure the
> non-standardized parts of your network.  That's a good starting point, but
> now that we know our standard ideas work reasonably well, it's time to take
> on the additional task of making them self-implementing in new HamWAN
> instances.  This means a lot of software development.
>
> And therein lies the problem.  In this project we have maybe 2 people who
> can help write the software required.  For us to successfully make the leap
> from phase 1 to phase 2, we've got to become attractive to people who write
> software.  A team of 6-10 folks would give us a good chance at making the
> leap.
>
> I'm not sure how to do recruiting for this, but don't let that be the
> central question of this email.  I'd like to hear from people if they agree
> with the direction shift I've proposed here.
>
> --Bart
>
>
>
> On 3/11/2016 10:25 AM, Sam Kuonen wrote:
>
> I'll echo the time constraints. We're looking at core infrastructure
> deployment for Georgia, USA and have a lot of generalized interest in the
> project.
>
> We're experiencing similar volunteer constraints and have yet to begin full
> operations. I can only picture how physical network operations are going to
> proceed and suffer once those deployments start.
>
> Regards,
>
> Sam Kuonen, KK4UVL
>
>
> On Fri, Mar 11, 2016, 12:29 PM Nigel Vander Houwen <nigel at nigelvh.com>
> wrote:
>>
>> Bart, Rob,
>>
>> The biggest problem I see here is time resources. I brought this up to
>> Bart off list, but there’s a continuing struggle to either have time to do
>> the work yourself, or get other people to do the work.
>>
>> I deployed all of our monitoring and logging infrastructure, and I can say
>> for a fact it’s been a struggle to get anyone to even do the basic work of
>> adding new devices to the existing monitoring system, even after providing
>> tutorials. This has gotten a bit better in very recent history, but it
>> remains an issue.
>>
>> Automation is absolutely something we need to put more work into. Ryan and
>> I have already put a bunch of work into this, which again, we have struggled
>> to get folks to pick up, use, and contribute to.
>>
>> Modems breaking happens, and site access can be a significant problem. The
>> East Tiger-SnoDEM link that Bart called out is known to be down, but we
>> can’t feasibly get that replaced in the middle of winter. Hopefully soon
>> that can be taken care of.
>>
>> We can try to treat this like a production network all we want, but the
>> reality is that we have effectively one part time staff trying to do, as Rob
>> put it, both the Operations and Development work.
>>
>> The reality is that this is a network with VERY limited admin resources,
>> which get split up to do various important things, the 900MHz work included,
>> but that leaves even less available to do any day to day work. This isn’t
>> our full time job, we’re not paid, we all have lives and families, we have
>> VERY few people that actually volunteer to do any of the work, so the
>> reality is there’s a lot we have a hard time getting to. Reality puts us
>> much closer to “best effort” than “production”, and until we get more
>> time/resources to do the work, it’s going to continue to be a struggle.
>>
>> If folks want to volunteer, I’d be happy to put them on improvements in
>> monitoring, automation, and fixing things in the existing production
>> network.
>>
>> Nigel
>>
>> On Mar 11, 2016, at 09:11, Rob Salsgiver <rob at nr3o.com> wrote:
>>
>> Bart,
>>
>> You touch on a few things that have been “niggling” at the back of my mind
>> for quite a while now – most of them come down in one way or another to
>> overall reliability (of HamWAN) for EMCOMM, which most know has been my main
>> driver for supporting the effort.
>>
>> There’s been a TON of great work done and quite frankly, I’ve been amazed
>> that HamWAN has gone as far and fast as it has, particularly for a “ham”
>> effort.
>>
>> At the same time we’ve slowly been adding and attracting the attention of
>> various EMCOMM organizations with the promise and potential of redundant,
>> reliable, resilient communications when “the big one” hits.  Obviously not
>> everything in HamWAN is expected to survive a major quake or other event, but
>> even pockets of reliable, high-speed communication are more than what can be
>> accomplished via voice relays.
>>
>> All of which brings us back to the current outage and discussion.  There have
>> been several outages in key places since we began.  Last year SnoDEM was all
>> but stranded due to a Haystack modem failure and other events at the same
>> time.  Now we have a similar situation in a different place brought on by
>> multiple failures or weaknesses.  In other instances I’ve been told we’ve
>> had outages via misconfigured devices or other reasons.  Even in a perfect
>> world, human error happens.
>>
>> I believe HamWAN would benefit from somewhat of a shift in operating
>> philosophy that would create two separate departments or divisions –
>> operations and development.
>>
>> Operations responsibilities
>> 1)       Provide day to day monitoring of network resources and conditions
>> 2)       Manage (admin) those portions of the network that are
>> designated as “in production”.  This should be the majority of the network.
>> 3)       Provide communications and coordination of network maintenance
>> 4)       Maintain an active inventory of all operational (production)
>> sites, site hardware, and site access information.
>> 5)       Maintain and manage all production site device configurations and
>> config change management.
>> 6)       Coordinate implementation of new functionality introduced by the
>> Development department with appropriate monitoring, end-user communication,
>> etc
>> 7)       Recommend topics and technologies to be explored by the
>> Development team to enhance operational stability and delivery of new
>> features to the network.
>> 8)       Document technologies, methods, and tools selected for use (and
>> why) from an operational standpoint.
>> 9)       Maintain an active inventory of spare hardware to support all
>> sites.
>> 10)   Establish a plan to correct ALL key site failures within XXXX days.
>> 11)   Coordinate with Development to actively inject and test network
>> failures and redundancy capabilities.
>> 12)   Coordinate with Development to enhance HamWAN’s ability to operate
>> in “pockets” when portions of the network fail in an earthquake – i.e. –
>> each “island” stays operational with as many services as possible
>>
>> Development responsibilities
>> 1)       Continued exploration of new hardware, software, and network
>> management tools (Quagga vs BIRD, Metals vs QRTs, etc)
>> 2)       Conduct experimentation with new hardware and software on
>> separate network resources where possible, or in coordination with
>> Operations on the larger network (more on this below).
>> 3)       Document technologies, methods, and tools explored and indicate
>> pros/cons of each where possible.
>> 4)       Continued exploration, analysis, and documentation of available
>> antenna and shielding designs
>> 5)       Exploration of new antenna designs and/or other hardware?
>> 6)       Exploration of new frequencies and how they are affected by
>> terrain, vegetation, weather, etc
>> 7)       This particular list can go on FOREVER
>>
>> The distinction here is largely mental, but it’s important.  It is
>> entirely possible to have the same people in both groups, yet having the
>> separation is important if HamWAN wishes to be taken seriously as a services
>> provider to the EMCOMM community.  Any benefits from that would also improve
>> service for ALL HamWAN users.
>>
>> Having EMCOMM onboard is important.  Not only does it provide a needed
>> service to them, but if critical mass can be achieved it gives HamWAN access
>> to multiple sites in every city and county.  In turn though, HamWAN as a
>> network needs to be reliable in the “customer’s” eyes.  This means that
>> infrastructure is managed with uptime as the highest priority,
>> experimentation is managed to minimize adverse production impacts, and
>> equipment failures are identified and corrected quickly.
>>
>> This is admittedly a fair amount of work.  Much of it I suspect is already
>> underway – maybe just not quite in this format.  Additional help will
>> definitely be useful.  Everyone involved only has so much time available,
>> and they should be able to focus on those items that are important to them.
>> I believe the above framework (or something similar) begins to put some
>> useful structure in place that continues to shape HamWAN from being the
>> “wild west” of amateur and network “geek” exploration into the reliable,
>> commercial grade, disaster resistant, amateur platform it aims to be -
>> while still allowing amateurs to push the limits of technology like they are
>> meant to.
>>
>> If the above (or something similar) is of interest to the current
>> directors and group as a whole, we can easily create a similar worklist
>> from which individuals on the sidelines can start picking things they can
>> help bring about.
>>
>> Just ideas.  Not saying they’re perfect, but it’s a start.  Any other
>> thoughts?
>>
>> Cheers,
>> Rob Salsgiver – NR3O
>>
>> From: PSDR [mailto:psdr-bounces at hamwan.org] On Behalf Of Bart Kus
>> Sent: Friday, March 11, 2016 12:56 AM
>> To: psdr at hamwan.org
>> Subject: Re: [HamWAN PSDR] Service Impact Notice
>>
>>
>> Hmm that's not the whole story though.  If it were just the 1 router
>> failure (in reality a hypervisor failure), we'd be in a much better
>> position, but it's combined with 2 other modem failures.  We had the
>> ETiger->SnoDEM modem die over the winter, and it needs replacement.  That
>> link has been down for a month or more now.  And most recently we're having
>> the Tukwila->Baldi modem lose connectivity frequently.  We've implemented an
>> automatic mitigation for that, but it still produces sporadic short downtime
>> windows of a few minutes.  I'd just like to move that modem to a NetMetal 5.
>> Our servers are also being affected by instability in the Quagga routing
>> software.  We need to replace this with a more stable alternative, like
>> BIRD.  Lastly, the Baldi emergency uplink is only configured to go to Westin
>> and Corvallis, but not Tukwila.
>>
>> We could have avoided DNS outages too, if the anycast groups were
>> populated with more of the available servers.  I believe lack of good
>> automation for server build-outs is causing the deployment lag here.
>>
>> The network is designed to withstand failures, even multiple failures, but
>> we've got many broken things right now that need fixing.  After that fixing,
>> I would really love to see some folks get behind improving our monitoring,
>> deployment and diagnostic automation.  Networks like this won't scale unless
>> they're nearly completely automated and simple to manage.  I would not mind
>> at all if we even rolled back some features until we can get them
>> re-implemented in 100% automated ways.
>>
>> As important as all this is, I still think the deep penetration project
>> takes precedence, so I can't drop that work in favor of this.  Aside from
>> helping out on the simple break-fix stuff, I mean.
>>
>> --Bart
>>
>> On 3/9/2016 8:23 PM, Ryan Elliott Turner wrote:
>>
>> Thanks for the update, Nigel.
>>
>> On Wed, Mar 9, 2016 at 10:17 PM, Nigel Vander Houwen <nigel at nigelvh.com>
>> wrote:
>>
>> Hello All,
>>
>> Just wanted to send out a quick notice here. We’ve had a failure at our
>> Seattle edge router, which we’re still investigating. In the meantime, our
>> Tukwila edge router is still providing connectivity, but you may notice
>> higher latencies or issues reaching things. If you find things you can’t
>> reach, please let me know, as we’d like to make sure the redundancy is
>> working, while we’re working to resolve the issues we’re investigating with
>> the Seattle edge router.
>>
>> Nigel
>> _______________________________________________
>> PSDR mailing list
>> PSDR at hamwan.org
>> http://mail.hamwan.net/mailman/listinfo/psdr
>>
>>
>>
>>
>> --
>>
>> Ryan Turner