Random Routing: 20121128 Ping: I laughed, I cried! It is better than SNMP! I’m going to use it again and again!

If I could hypnotize you to get this idea through to you, I would. As it is, I have to resort to a wall of text.

Ok, let's use our imagination for a moment.

Let's say you've been brought in to a new network. There are no monitoring tools working. All you have is people complaining about application and Internet performance, and everyone is pointing at the network as the cause.

Nothing is set up on the network to allow monitoring. SNMP isn't configured properly. Netflow isn't enabled. You have limited server capability; even if you did get one of the monitoring protocols working on all the network equipment, you wouldn't have the server horsepower to properly use it.

What could you do that would give you data to hold back the tide of angry end users? What tool should be your go-to in tough situations like this? What tool gives you the best information you could ask for in any network?

I'm here to talk about the much maligned, much ignored, but dependable workhorse, Ping. Yes, that utility you go to first every time someone says "The network is down". No, you've never said thank you to Ping. It's just there, ready to do its job at a moment's notice. It's the best friend that a network admin could ever have, and it never asks for recognition from you, never takes a day off.

Why do we depend on Ping when we're in trouble, but never hear of it any other time? What makes people think that ping is reliable and has a job to do when everything is coming apart, but any other time it's not good enough to use for any "serious" work?

Ping gets no respect. That's just plain wrong.

Ping is the best tool to provide a network admin with the most important bits of information you'll ever want to know about a network.

Ping tells you two things. #1: is the device responding? #2: how long did the device take to respond?

Now, every network admin/tech/designer/engineer knows this. Ping is everywhere for a reason. But here's the reminder that everyone needs: THESE ARE THE TWO MOST IMPORTANT THINGS YOU NEED TO KNOW ABOUT THE NETWORK.

Sure, it's nice to know the types of information SNMP and Netflow can give you. But both of those require network equipment to be configured properly, servers to be available with enough capability to handle the processing loads, and software to be purchased, loaded, configured and maintained. To do a decent installation of your favorite HP SNMP monitoring suite for a mid-sized company could take months to do, thousands of man-hours to complete, and doesn't come for pocket change. Proper Netflow collection and reporting isn't any better.

And neither of them will tell you the two most basic things about the network with the level of accuracy of Ping. If you want to know whether it's up and how quickly it responded over the network, Ping is the champ, bar none.

Now, I'm not trying to detract from SNMP and Netflow. They, and systems like them, are essential tools to keep a developed network running properly. But most new Network Admins treat them as the first thing you should do to get a handle on a network that has gotten out of control. I won't call them the last thing (after all, how hard is it to set up MRTG, really?), but setting up an organization-wide Ping statistical data collection tool is simple, fast, and can be done from a spare laptop.

"Organization-wide Ping statistical data collection tool"? Ok, I'm not sure if that is the best phrase for it, but it's what I call it, and it kind of describes where I'm going with this.

Let's say you collect a list of devices that are on the network that are considered important. The IP address of every router, the important switches and other important network devices. Throw in a list of the important server or system IP addresses. You'll have to do this anyways, so, just type them all up in a list.

Now, let's say we ICMP Ping each one of them, once a second or so, continuously. Yep, 24x7, 365.
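Here is a minimal sketch of that loop, assuming a Linux or Windows box with the system ping command on the PATH. The function names, the log format, and the one-second timeout are illustrative choices, not the tool described in this post. It shells out to ping once per device, records the round-trip time (or a blank for a miss), and repeats every interval.

```python
import platform
import re
import subprocess
import time

# Matches "time=12.3 ms" (Linux) and "time=12ms" / "time<1ms" (Windows).
RTT_PATTERN = re.compile(r"time[=<]\s*([\d.]+)\s*ms", re.IGNORECASE)


def build_command(host, system=None):
    """Return a one-shot ping command with a one-second reply timeout."""
    system = system or platform.system()
    if system == "Windows":
        return ["ping", "-n", "1", "-w", "1000", host]  # -w is in milliseconds
    return ["ping", "-c", "1", "-W", "1", host]         # -W is in seconds (Linux)


def parse_rtt(output):
    """Return the round-trip time in ms, or None if no reply line was found."""
    match = RTT_PATTERN.search(output)
    return float(match.group(1)) if match else None


def ping_once(host):
    """Ping a host once; return the RTT in ms, or None on loss or error."""
    try:
        result = subprocess.run(
            build_command(host), capture_output=True, text=True, timeout=3
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return parse_rtt(result.stdout)


def collect(hosts, logfile, interval=1.0, rounds=None):
    """Append 'epoch,host,rtt_ms' lines; an empty rtt field marks a lost ping."""
    done = 0
    with open(logfile, "a") as log:
        while rounds is None or done < rounds:
            started = time.time()
            for host in hosts:
                rtt = ping_once(host)
                log.write(f"{int(started)},{host},{'' if rtt is None else rtt}\n")
            log.flush()
            done += 1
            # Sleep off whatever is left of this interval.
            time.sleep(max(0.0, interval - (time.time() - started)))


# Example (hypothetical addresses and file name), runs until interrupted:
# collect(["10.0.0.1", "10.0.0.2"], "ping_log.csv", interval=1.0)
```

This sketch pings the list one device at a time, so a long list will not fit inside a one-second interval; a real collector would need to send the pings in parallel.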

It's ok, really. Ping won't break your switches, or any other device for that matter.

See, here's the thing about Ping: the individual packets are as small as they get. If you configure Ping properly (take a look at the -l 0 option in Windows ping, or -s in Linux), the total packet size is 28 bytes: a 20-byte IP header plus an 8-byte ICMP header, with no payload. So, for every device you ping once a second, you use 224 bits per second. Even a 300 baud modem from 30 years ago can handle that just fine, and if you upgrade to a 1200 baud modem, you have bits to spare. At today's network speeds, 224 bits each second for each device you want to ping doesn't come up to the level of background noise. It's background noise to the background noise.
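The arithmetic is easy to check. This small sketch counts only the IP and ICMP headers of the echo request, one direction, and ignores link-layer framing; the function name is just for illustration.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def echo_bits_per_second(devices, interval_seconds=1.0, payload_bytes=0):
    """Bits per second of echo requests for a number of devices.

    Counts one direction only and excludes link-layer framing.
    """
    packet_bits = (IP_HEADER_BYTES + ICMP_HEADER_BYTES + payload_bytes) * 8
    return devices * packet_bits / interval_seconds


print(echo_bits_per_second(1))          # one device, once a second: 224.0
print(echo_bits_per_second(800, 2.0))   # 800 devices, every other second: 89600.0
```

Even 800 devices pinged every other second comes to roughly 90 kbit/s of requests spread across the whole network, with the replies adding about the same again in the other direction.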

On top of the fact that Ping is extremely efficient with network bandwidth, responding to a ping requires next to zero CPU processing capability. The most basic of network devices that are in use today have the ability to respond to ping with sub-millisecond variability.

Really, it's ok. Pinging devices on your network once a second won't break anything. This is the part that gets me when I start discussing this subject with other IT people. For some reason, there seems to be an aversion to pinging something repeatedly on a continuous basis. But no one thinks SNMP-walking devices once every 5 minutes is a big deal. SNMP and Netflow require orders of magnitude (yes, I'm using that term properly, and I think I'm safe saying it) more processing power than Ping ever could, period. And Ping gives you the two things you need most.

Now, let's collect all of that Ping data and turn it into something useful. Something like graphs. Non-network people love graphs. Let's say we take all of that data and have our laptop turn it into a graph once a day for each device. On top of that, let's create a composite metric for each device, letting us know where we need to start working first (rather than purely being slaves to end user complaints).
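To make the "composite metric" idea concrete, here is a rough sketch. The scoring formula below is invented for illustration: it weights packet loss heavily and adds the spread between the worst and the typical response time. The function names and the weight are assumptions, not the model the graphs in this post are built on.

```python
from statistics import median


def summarize(samples):
    """Summarize one device's samples: RTTs in ms, with None for a lost ping."""
    replies = [s for s in samples if s is not None]
    loss_pct = 100.0 * (len(samples) - len(replies)) / len(samples) if samples else 0.0
    return {
        "loss_pct": loss_pct,
        "median_ms": median(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }


def composite_score(summary, loss_weight=10.0):
    """Hypothetical score: weighted loss plus the max-to-median RTT spread."""
    if summary["median_ms"] is None:
        spread = 0.0  # nothing replied, so loss alone drives the score
    else:
        spread = summary["max_ms"] - summary["median_ms"]
    return summary["loss_pct"] * loss_weight + spread


def rank_devices(samples_by_device):
    """Return device names sorted worst score first."""
    scores = {
        name: composite_score(summarize(samples))
        for name, samples in samples_by_device.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Example with made-up device names and samples:
data = {
    "core-sw": [1.0, 1.0, 1.0, 1.0],
    "site-rtr": [1.0, None, 5.0, None],
}
print(rank_devices(data))  # the device with loss and jitter sorts first
```

Whatever formula you settle on, the point is the sorted list: it tells you which device to look at first.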

Here's a graph as an example.

The graph is based on ping data of a random site router. No specific reason for pulling this graph in particular, it was the first graph I clicked on. What's the graph mean? Easy. Yellow is bad. Red is really bad. If there's lots of red, it's really really bad. I'll explain more later.

This graph was generated on a 6 to 8 year old box (2 GHz P4? No one seems to remember how old the box actually is) that someone was throwing away. That throwaway box is pinging 800 devices every other second, collecting all of the data, and then generating a graph once a day for each device based on that data. For 20+ hours of every day, CPU utilization on the box is less than 5% (remember, ping takes next to nothing in CPU). The only time that box "works" is when it's reporting on the data and generating graphs, which it statically stores in an HTML directory to be looked at via a web server on the same box.

It took next to nothing to set up. It takes next to nothing to run. And, all of a sudden, poof, you have very useful data that you won't see from any other tool. I dare say, I don't know of any other tool that will do anything to this level.

Here's a monthly graph for a different site router.

How hard is it to look at that graph and see there is a problem? The graph covers a whole month, but it's very easy to see the site circuit outage in the middle of the month.

I wrote the tool (hack) back in 2004, and have kept modifying it ever since. It's much too useful to not use, it's cheap (free?), and takes up next to nothing in system resources to run.

It would be the first tool I deploy in any network I become involved with, and it makes an excellent supplement to every other monitoring tool out there.

And, it's just Ping. Hundreds of Pings. Thousands of Pings. Once every second, or every other second. And then graph against a model. All using open source software and a couple of scripts. Takes next to nothing to set up.

At one time, I had this system sending an email whenever a device failed to respond for 10 pings in a row. Yes, Red/Green, with a hair trigger. None of this, miss two SNMP polls and we send an alert. If it was down for 10 seconds, email to a pager (boot to the head).
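The hair trigger can be sketched as a small per-device counter. This is an assumed reconstruction from the description above, not the actual Weialgo logic: the class name is made up, the "up" event on recovery is my addition, and the email step is left out.

```python
class RedGreen:
    """Track consecutive missed pings per device and report state changes."""

    def __init__(self, threshold=10):
        self.threshold = threshold  # misses in a row before a device goes red
        self.misses = {}            # host -> current run of missed pings
        self.down = set()           # hosts currently considered red

    def record(self, host, replied):
        """Record one ping result; return 'down', 'up', or None."""
        if replied:
            self.misses[host] = 0
            if host in self.down:
                self.down.discard(host)
                return "up"
            return None
        self.misses[host] = self.misses.get(host, 0) + 1
        if self.misses[host] >= self.threshold and host not in self.down:
            self.down.add(host)
            return "down"  # fires once, on the threshold-th miss in a row
        return None


# Example: a hypothetical router misses 12 pings in a row, then answers.
tracker = RedGreen()
events = [tracker.record("r1", False) for _ in range(12)]
events.append(tracker.record("r1", True))
print([e for e in events if e])  # ['down', 'up']
```

Because the alert fires only on the change of state, a device that stays down produces one message rather than one per missed ping.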

I named the system Weialgo, for Weippert Algorithm. Someone I worked with gave me the idea for it as a snide comment on how poor a job another SNMP monitoring system did at alerting when a site/system was down. In less than 5 minutes, he laid out how Red/Green should be done, including the underlying logic. I'm sure he doesn't/won't remember the conversation, but I remember it as one of those moments where my eyes must have been wide open and I was thinking "That's Brilliant!".

Where I currently work, we call the graphs the "Grass on Fire" report, for obvious reasons.

Over the next couple weeks or so, I'll document the different versions, the setup, and the different modifications I've made with this system over the years.


Tuesday, November 27, 2012
