Monday, December 17, 2012

20121217 Why Weialgo? Essay #1


I have two articles that I've already typed up but I haven't put them on the blog yet, mainly because they're so rambling that even I have a hard time understanding them.  One is on the graphing and reporting of weialgo data, the other on "Why Weialgo", which is more of me complaining about why this concept is so hard for some people to grasp /facepalm type of thing.

But something happened over the weekend that caused me to write an email, and that email did a better job of being an intro to "Why Weialgo" than my War and Peace blog entry on the same subject.  (Thank you, Lord: the outage has been perfect to study, and it's the reason I wrote the email in the first place.)

My Internet connection, for all intents and purposes, went down.  It's a wireless connection to a line of sight tower several miles away.  The "old" wireless connection was notoriously unreliable, but its failure mode was acceptable in that I was able to put a TCP tunnel between the house and the ISP, using a Linux VM in their DC, and compensate for the loss that way.  When they installed the "new" wireless connection a year or so ago, network paradise was created.  It was so good that I had a hard time coming up with reasons why a copper/fiber connection would be better.

That was until last Friday.

Now, when you read this, yes, there is a weialgo graph.  But there's a more important undercurrent of philosophy/process/knowledge running through this.  It's about TCP, and how it operates over a network.  And since TCP is how 95%+ of everything in data communications is done nowadays, a deep knowledge of TCP is key to making all, ALL, *-!ALL!-*, IT systems work.  If you don't understand how your systems communicate, you'll never understand why they don't work (or what they'll need to be made to work).

Admittedly, some of my explanations of TCP behavior are a bit simplistic, but the essence is true.  Wireless is hard on TCP; that's just the fact of it.  Weialgo can provide a view of this relationship between wireless and TCP.


The email, lightly edited:


subject: failure mode of the *isp*.net internet link


Rob Luce <blah@blah.com>

6:09 AM (13 minutes ago)

to Todd, Ben 

I know you think I'm nuts, but seeing this graph makes me profoundly...   sad.




This is what the wireless link has been doing about 15 minutes out of every hour or so since last Friday.  This is worse than the old link.  It's also exceptionally educational at the same time.


The old wireless line would go down often, but would normally stay down for a matter of seconds (for the sake of discussion, say 5 to 10 seconds on average), then come back up for a matter of seconds (5 to 10 seconds on average) and "flap" like that while it was having issues.  When it came back up, though, it would stay up long enough that the TCP tunnel between us and the ISP could clear its entire queue of packets (which I set to 512 kByte), then requeue while the link was down.  BUT!!!  The old line would only hit TCP once before coming back up for a statistically significant amount of time (for an android, 5 seconds is nearly an eternity).  This would cause only one halving of the TCP window, as there was only "one" outage to compensate for from TCP's point of view, after which TCP would "quickly" ramp back up to the full window so that the queue could clear out.


This new wireless failure mode is the death of TCP congestion control, and it details everything that's wrong with the IETF's insistence that dropped packets are to be treated as congestion, instead of what they are more likely to be nowadays: indications of a wireless network inline with the flow of communications.


Since both sides of the TCP tunnel, in this case, are Linux boxes, there are some parameters available that otherwise wouldn't be, and I've tried to adjust as many as possible to compensate for the frequent drops in the network (BIC mainly; most of the other options aren't of much use in this case).  But there's no getting around the "birdshot" pattern of packet loss that this link is having.


Every time TCP sees a dropped packet, it goes into "congestion recovery", but here, there is no congestion.  Normally, this means halving the cwnd, and eventually falling back to slow start.  Every one of those red strikes means the cwnd getting halved, and eventually falling back to slow start (mss * 3) (http://tools.ietf.org/html/rfc3782 http://tools.ietf.org/html/rfc2581).
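To put rough numbers on the difference between one outage and a steady sprinkle of drops, here is a toy model of the halving described above.  It is a sketch under loud assumptions, not real TCP: the window is counted in whole segments, it grows by one segment per round trip, any round trip with a loss halves it, and slow start and timeouts are ignored entirely.  The function name, the 64 segment full window, and the loss schedules are all made up for illustration.

```python
def simulate_cwnd(loss_rounds, rounds=60, max_window=64, floor=2):
    """Return the toy congestion window (in segments) after each round trip."""
    cwnd = max_window
    history = []
    for rtt in range(rounds):
        if rtt in loss_rounds:
            # A loss is treated as congestion: halve the window.
            cwnd = max(cwnd // 2, floor)
        else:
            # No loss: grow by one segment, up to the full window.
            cwnd = min(cwnd + 1, max_window)
        history.append(cwnd)
    return history


# "Flap": one loss event, then a long clean stretch (assumed schedule).
flap = simulate_cwnd({10})

# "Birdshot": a loss every third round trip (assumed schedule).
flutter = simulate_cwnd(set(range(10, 60, 3)))

print("flap    average window:", sum(flap) / len(flap))
print("flutter average window:", sum(flutter) / len(flutter))
```

In this toy model the single outage costs one halving and the window climbs back to full, while the scattered losses pin the window near the floor for as long as they continue.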


"Down" and then "Up" (or "flap"), I can try to compensate for using a TCP tunnel with custom tweaking.   With "Shot full of holes like birdshot" (or "flutter"), there isn't really anything that can be done using TCP.  I'd have to write my own tunneling software using UDP as the carrier, with a set transmit window, and add something like a full SACK map to cover all in-flight packets.  Of course, there would be some benefits to this, but that level of coding isn't part of my normal skill set.





Every one of those red strikes on the weialgo graph that I've inspected has the exact same pattern.  It's the "TCP Killer Birdshot" or "flutter" pattern.   Which means, not much can be done about it using off the shelf technology.
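The difference between "flap" and "flutter" can be read straight off a list of ping results.  The sketch below assumes one probe every 100 ms and a list of booleans where False means the reply was lost; the function names and the 2 second cutoff between the two patterns are arbitrary choices for illustration, not anything weialgo or pingplotter actually does.

```python
from itertools import groupby


def loss_runs(replies):
    """Lengths of each run of consecutive lost pings (False = lost)."""
    return [len(list(group)) for ok, group in groupby(replies) if not ok]


def classify(replies, interval_s=0.1, flap_seconds=2.0):
    """Label a ping trace as 'clean', 'flap', or 'flutter'.

    A trace whose longest outage lasts at least flap_seconds is a flap;
    anything with only shorter holes is flutter.  The cutoff is an assumption.
    """
    runs = loss_runs(replies)
    if not runs:
        return "clean"
    longest_s = max(runs) * interval_s
    return "flap" if longest_s >= flap_seconds else "flutter"


# Up 5 s, down 5 s, up 5 s: the old link's behavior.
old_link = [True] * 50 + [False] * 50 + [True] * 50

# One lost ping out of every five: shot full of holes.
new_link = ([True] * 4 + [False]) * 30

print(classify(old_link))
print(classify(new_link))
```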


Hopefully, this is something that *isp*.net finds this week of their own accord.  But, if you would, please call and let them know that the link has been performing strangely since last Friday.  


The ping plotter is using 28 byte packets at 100 ms.  The weialgo graphs are of the first hop router http://192.168.0.20/1.1.1.1.html after I determined that the issue wasn't being caused by the vm.  The vm graphs go back farther in time.  http://192.168.0.20/1.1.1.146.html


Moo

Rob



Sitting here typing/cutting-and-pasting this post, I'm having to watch the pingplotter between me and the ISP so that I know when to click "Save".  (By the way, pingplotter should be mandatory for every network person, and I wish they had an app for Linux.)



Wow, that is ugly.  Strangely enough, I tend to try to find ways to make things work.  If all you have is binder twine, grey tape, and determination, could you make a phone?  I have a tendency to say "you should really just go buy a couple phones and some phone wire", then go right to work making two cups on a string.

But, looking at these graphs, I can't imagine how I'd make something work over that.  Change the tunnel to UDP, make a SACK map out of everything to make the tunnel reliable, and retransmit the individual TCP-in-UDP packets like a machine gun until the ISP gives up and fixes the link.  There would be a lot of benefit to this in wireless environments, but I'm a hack scripter, not a real coder.
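For what it's worth, the bookkeeping side of that idea is small.  The sketch below covers only the sender's SACK map with a fixed transmit window: no sockets, no timers, no receiver.  The class and method names are hypothetical, and a real tunnel would need a lot more (sequence wraparound, retransmit pacing, keepalives) than this shows.

```python
class SackSender:
    """Track in-flight sequence numbers for a fixed transmit window."""

    def __init__(self, window=8):
        self.window = window   # fixed number of packets allowed in flight
        self.next_seq = 0      # next new sequence number to hand out
        self.in_flight = {}    # seq -> payload, sent but not yet acknowledged

    def can_send(self):
        return len(self.in_flight) < self.window

    def send(self, payload):
        """Record a new packet as in flight and return its sequence number."""
        if not self.can_send():
            raise RuntimeError("transmit window is full")
        seq = self.next_seq
        self.in_flight[seq] = payload
        self.next_seq += 1
        return seq

    def on_sack(self, received):
        """Forget everything the receiver reports; return what needs resending."""
        for seq in received:
            self.in_flight.pop(seq, None)
        return sorted(self.in_flight)


sender = SackSender(window=4)
for chunk in (b"a", b"b", b"c", b"d"):
    sender.send(chunk)

# The receiver reports packets 0 and 2; 1 and 3 fell through the holes.
missing = sender.on_sack({0, 2})
print(missing)  # [1, 3]
```

The window never shrinks on loss, which is the whole point: a hole in the map means "send that one again", not "the network is congested".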

What do you think?
