Friday, November 30, 2012

20121201 I never claimed my hacks were pretty.... Weialgo version 2 and 3.

I looked through my "stuff" and haven't found a version of weialgo older than 2008 here at home.  I probably have older versions squirreled away at work, but this seems to be the oldest version I have at home.


#!/usr/bin/perl

#Weialgo version 2

use Net::Ping;
use Time::HiRes qw (usleep gettimeofday);
use strict;
#use warnings;

my $host = $ARGV[0];
my $hostname = $ARGV[1];
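# Use a string comparison: with ==, any non-numeric argument (e.g. a hostname) would compare equal to "".
if ( !defined $host || $host eq "" ) { print "\nno IP to ping $ARGV[0] $ARGV[1] $ARGV[2]\n\n"; exit;}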
open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
select(LOG); $| = 1;
close(LOG);
select(STDOUT); $| = 1;
my ($seconds, $microseconds) = gettimeofday();
my $prevseconds = $seconds;
my $starttime = $seconds;
srand($microseconds);
my $offsetms = int(rand(1000000));
usleep(1000000-$microseconds+$offsetms);
my $down = 0;
my $totaldown = 0;
my $transitions = 0;
my $totaltime = 1;
my $i = 0;
my $j = 0;
my $ret = 0;
my $duration = 0;
my $ip = 0;
my $runtime = 0;
my $sentpackets = 0;
my $meetsla = 0;
my $minsla = 100000;
my $sec = 0;
my $min = 0;
my $hour = 0;
my $mday = 0;
my $mon = 0;
my $year = 0;
my $wday = 0;
my $yday = 0;
my $isdst = 0;

my $p = Net::Ping->new("icmp");
$p->hires();

while ( $i==0 ) {

  ($seconds, $microseconds) = gettimeofday();
  ($ret, $duration, $ip) = $p->ping($host, 0.6);
  $runtime = $seconds - $starttime;
  $sentpackets++;
  if ( $ret == 0 ) {
    if ( $down == 0 ) {
      open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
      printf LOG ("$seconds,$host,$hostname,$runtime,$totaltime,$transitions,$totaldown,1,%.2f\n", 1000 * 10);
      close(LOG);
    }
    $duration = 10000;
  }
  if ( $ret == 1 ) {
    if ( $down > 1 ) {
      $j = $seconds - $down + 1;
      $totaldown = $totaldown + $j;
      $seconds--;
      open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
      printf LOG ("$seconds,$host,$hostname,$runtime,$totaltime,$transitions,$totaldown,$j,%.2f\n", 1000 * 10);
      close(LOG);
      $seconds++;
    }
    open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
    printf LOG ("$seconds,$host,$hostname,$runtime,$totaltime,$transitions,$totaldown,0,%.2f\n", 1000 * $duration);
    close(LOG);
    $duration = $duration * 1000;
  }
  if ( $duration < $minsla ) { $minsla = $duration; }
  if ( $duration < 10000 ) { if ( $duration < ( $minsla + $minsla + 50 ) ) { $meetsla++; } }
  ($seconds, $microseconds) = gettimeofday();
  if ( $microseconds < $offsetms ) {
    $j = $microseconds + 1000000;
    $microseconds = $j;
  }
  $j = 1000000+$offsetms-$microseconds;
  if ( $ret == 1 ) {
    $j = $j + 5000000;
    $totaltime = $totaltime + 6;
    if ( $down > 0 ) { $transitions++; }
    $down = 0;
  }
  if ( $ret == 0 ) {
    $j = $j + 1000000;
    $totaltime = $totaltime + 2;
    if ( $down == 0 ) { $transitions++; $down = $seconds}
    if ( -e "/mnt/ramdisk/pingslow.txt" ) {
      $j = $j + 4000000;
      $totaltime = $totaltime + 4;
    }
    #if ( $seconds - $prevseconds > 3 ) {
    #  $j = $j + (( $seconds - $prevseconds - 2 ) * 1000000 );
    #}
  }
  $prevseconds = $seconds;
#  print "$j,$down\n";
  usleep($j);
  if ( -e "/mnt/ramdisk/pingflag.txt" ) {
    $i = 1;
    $j = 0;
    if ( $ret == 0 ) {
      $j = $seconds - $down + 1;
      $totaldown = $totaldown + $j;
      open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
      printf LOG ("$seconds,$host,$hostname,$runtime,$totaltime,$transitions,$totaldown,$j,%.2f\n", 10000);
      close(LOG);
    }

    ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime(time);
    open(LOG, '>>/storage/weialgo/rollups/v2sla.csv');
    printf LOG ("%4d-%02d-%02d %02d:%02d:%02d,$seconds,$host,$hostname,$minsla,$sentpackets,$meetsla,%.2f\n",$year+1900,$mon+1,$mday,$hour,$min,$sec, $meetsla / $sentpackets * 100);
    close(LOG);
    sleep(5);
  }

}

$p->close();


This seems to be the last version of version 2 weialgo that I put together.  Yes, it was written in Perl, mainly because I didn't see much improvement from writing it as a native binary in C.  Perl was much easier to update, and seemed to have similar performance.

I'm not going to bother including or describing the reporting modules.  Most of that was done in shell scripts or perl. You can use perl/sed/awk/grep/sort to go through the information that this script provides and get to some very useful information if so inclined.
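As a rough idea of what that kind of reporting can look like, here is a minimal sketch in Perl.  It is not one of the original reporting modules.  The column order is taken from the printf calls in the version 2 script above; the sub name summarize, the field names in the hash, and the sample lines are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Summarize version 2 log lines per host.
# Column order, as written by the printf calls in the script above:
#   seconds,host,hostname,runtime,totaltime,transitions,totaldown,status,rtt_ms
# status 0 = reply received; anything else = a missed poll or outage record.
sub summarize {
    my (@lines) = @_;
    my %stats;
    for my $line (@lines) {
        chomp $line;
        my @f = split /,/, $line;
        next unless @f == 9;    # skip mangled or partial lines
        my ($host, $status, $rtt) = @f[1, 7, 8];
        my $s = $stats{$host} //= { replies => 0, down_records => 0, rtt_sum => 0, rtt_max => 0 };
        if ( $status != 0 ) { $s->{down_records}++; next; }
        $s->{replies}++;
        $s->{rtt_sum} += $rtt;
        $s->{rtt_max} = $rtt if $rtt > $s->{rtt_max};
    }
    return \%stats;
}

# Hypothetical sample lines, only here to show the call.
my @sample = (
    "1354300000,10.1.1.1,myserver.com,0,1,0,0,0,12.50",
    "1354300006,10.1.1.1,myserver.com,6,7,0,0,0,13.50",
    "1354300012,10.1.1.1,myserver.com,12,13,0,0,1,10000.00",
);
my $stats = summarize(@sample);
for my $host (sort keys %$stats) {
    my $s   = $stats->{$host};
    my $avg = $s->{replies} ? $s->{rtt_sum} / $s->{replies} : 0;
    printf "%s replies=%d down_records=%d avg=%.2fms max=%.2fms\n",
        $host, $s->{replies}, $s->{down_records}, $avg, $s->{rtt_max};
}
```

To run it against a real log, you would feed the file's lines to summarize() instead of the sample array.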

Net::Ping and Time::HiRes are both CPAN Perl modules that do most of the magic.  Reading through the perl script, you can get an idea how those modules work, or you can go out to the Perl website and read up on the modules directly.

The thought process for this version was a carry-over of the original version, which was more for Red/Green alerting, with the side benefit of being able to report on data with a number of different statistical models.

At the time I was running upwards of 200 or so pings to individual systems/routers, which was about the limit of what this process was capable of.  It was a simple enough "/usr/bin/perl ./weialgo2.pl 10.1.1.1 myserver.com &" to get it started, and let it run continuously.

As I remember, I would let the program ping the device once every 6 seconds, and record the round trip time and the other data I thought relevant at the time.  If the ping dropped, or didn't return in time, I changed the ping time to once a second, until the device responded again.

This last part was the "Weippert Algorithm".  It basically goes like this.  Decide how fast you want to know that a device is down (say, 60 seconds).  Divide that number in half, subtract 1, round down (29).  Using a fast, low bandwidth, low cpu status protocol (like ICMP ping, or a TCP SYN/ACK port open check), check the device every int(X/2-1) seconds (29).  The moment the device misses a poll, check the device every second until it comes back up.  If the device misses/drops int(X/2) polls, declare the device down and send an alert.
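The arithmetic in that description can be sketched as a tiny state machine.  This is only an illustration of the rule as described in the paragraph above, not code lifted from any version of weialgo; the sub names poll_interval, miss_threshold and next_step are made up, and the "ping results" are a hard-coded list instead of real pings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# X = how fast (in seconds) you want to know that a device is down.
sub poll_interval  { my ($x) = @_; return int($x / 2 - 1); }   # normal polling interval
sub miss_threshold { my ($x) = @_; return int($x / 2); }       # missed polls before alerting

# One step of the algorithm.  Takes X, the current count of consecutive
# misses, and whether the latest poll got a reply.
# Returns (new miss count, seconds until the next poll, alert flag).
sub next_step {
    my ($x, $misses, $ping_ok) = @_;
    return (0, poll_interval($x), 0) if $ping_ok;    # healthy: back to slow polling
    $misses++;
    my $alert = ($misses == miss_threshold($x)) ? 1 : 0;   # alert once, at the threshold
    return ($misses, 1, $alert);                     # down: poll every second
}

# Simulated run with X = 60: one good poll, then 30 misses, then a reply.
my $x      = 60;
my $misses = 0;
for my $ok (1, (0) x 30, 1) {
    my ($sleep, $alert);
    ($misses, $sleep, $alert) = next_step($x, $misses, $ok);
    print "ALERT: device missed $misses polls in a row\n" if $alert;
}
```

With X = 60 the worst case is a device that dies right after a good poll: 29 seconds until the first miss, then one poll per second until the 30th miss, which keeps the detection time just under the 60 second target (ignoring ping timeouts).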

(If you know Scott Weippert, you can tell him how brilliant he is.  :-) )

The Algorithm is simple, and very effective for Red/Green alerting.  Unfortunately, I couldn't convince anyone I worked with that it was better than SNMP for Red/Green (it's MUCH better, SNMP is a crappy system for basic availability alerting, from a network point of view, but I haven't won that fight yet).

As I worked through the different versions of Weialgo, I found that I was using Weialgo more for statistical reporting, and not for Red/Green.  From a statistical viewpoint, the int(X/2-1) aspect of weialgo complicates statistical reports quite a bit, as it means that all data has to be time indexed as part of the reporting process.  It's much easier to just ping every second or two, and report on the data using that assumption.

The main drawback to this version (and any other process-per-device based polling) is that it drives up the number of concurrent processes on the polling server.  On the P4 I was running this on, at around 200 instances of weialgo, the amount of incurred latency based on just process switching within Linux began to throw off the results.  CPU utilization would normally bounce off 100% continuously, and the box was useless for anything else.  So, I normally tried to keep the number of polled devices much less than 200, usually around 100, so that I could do some simple reporting on the same box against the data.

At one point in time, earlier versions of Weialgo would email out to a pager every time a device went down.  That lasted for a couple weeks (a router would lose its E1 for 10 seconds at 2 in the morning, PAGE THE NETWORK TEAM!!!!  I wasn't very popular for a couple weeks.).  Adding that functionality back into the perl script would be easy enough to do.

Obviously, I'm over-reporting information in the log files, but the total amount of data is minor in my opinion even with the extra data.  gzip/bzip2 the log files after reporting on them, and you can keep decades of data in a few gigabytes.

Also, since I put all of the data into the same log file, it's possible to have concurrency problems, depending on the version of *nix you want to run this on.  I never had a problem on Linux, but Solaris tended to throw a mangled line in the log file every once in a while.  If you have problems with concurrent processes writing to the same logfile creating mangled entries, split the logfiles up by renaming the logfile with the device name.
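Here is a minimal sketch of both ways around the mangled-line problem.  The first sub builds a per-device log file name, which is the fix suggested above.  The second takes an exclusive flock around each append, which is a different technique that the original scripts never used.  The sub names, the v2logfile_<hostname>.csv naming pattern, and the temporary directory are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Temp qw(tempdir);

# Option 1: one log file per device, so processes never share a file.
# Characters outside a conservative set are replaced with underscores.
sub log_path_for {
    my ($dir, $hostname) = @_;
    (my $safe = $hostname) =~ s/[^A-Za-z0-9._-]/_/g;
    return "$dir/v2logfile_$safe.csv";
}

# Option 2: keep a shared file, but hold an exclusive lock while appending.
sub append_line {
    my ($path, $line) = @_;
    open(my $fh, '>>', $path) or die "cannot open $path: $!";
    flock($fh, LOCK_EX)       or die "cannot lock $path: $!";
    print {$fh} $line, "\n";
    close($fh) or die "cannot close $path: $!";   # close flushes and releases the lock
}

# Demo in a throwaway directory (removed when the script exits).
my $dir  = tempdir(CLEANUP => 1);
my $path = log_path_for($dir, "myserver.com");
append_line($path, "1354300000,10.1.1.1,myserver.com,0,1,0,0,0,12.50");
print "wrote to $path\n";
```

The lock only helps if every writer uses it, and the open/lock/close per line costs a little extra, which is one reason the per-device file is the simpler fix.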

$offsetms was put in because of the number of concurrent processes I was running.  I didn't want all of the pings to go out the same exact microsecond.  So, using $offsetms, I randomized the start time to different microseconds for each process.  This spread out the pings, and the processing.
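Despite the name, $offsetms holds microseconds, since it goes straight into usleep().  The sleep arithmetic from the version 2 main loop can be pulled out into a small function like the sketch below.  The sub name usec_until_slot is made up, and the demo only prints the computed value instead of actually sleeping.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday);

# Microseconds to wait until the clock next reads "<whole second> + $offset_us".
# Mirrors the version 2 loop: if the slot in the current second has not passed
# yet, wait for it; otherwise wait for the slot in the following second.
sub usec_until_slot {
    my ($now_us, $offset_us) = @_;
    $now_us += 1_000_000 if $now_us < $offset_us;
    return 1_000_000 + $offset_us - $now_us;
}

# Each process picks its own random slot once at startup.
my ($seconds, $microseconds) = gettimeofday();
srand($microseconds);
my $offset_us = int(rand(1_000_000));
my $wait      = usec_until_slot($microseconds, $offset_us);
print "slot offset $offset_us us, next slot in $wait us\n";
```

With a couple hundred processes each landing on its own slot, the pings spread out across the second instead of piling up on the boundary.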

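Trying this script out should be fairly simple.  You'd need Linux (or your favorite version of *nix), Perl, and the Net::Ping and Time::HiRes modules (install them from CPAN if your Perl doesn't already have them).  Copy and paste the script into a .pl file on the box, and run it (as root, since Net::Ping's icmp mode normally needs raw-socket privileges) with the IP address and hostname you want to ping.  Sit back and watch it ping the device until the end of time, until the server is rebooted, or until /mnt/ramdisk/pingflag.txt shows up, which is the script's stop flag.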

Oh, as you can probably see from the script, I'm a fan of RAM disks for transient data like this.  Since the log file isn't held open by the processes, you can mv the log file at any time and the processes will keep running.  So, rather than doing hundreds of individual writes to a single file every second on a hard drive, I do the "spam" individual writes to a file on a RAM disk, then mv/report the logfile data via a cron process every hour or so.  Saves wear and tear on the hard drive, and speeds up everything overall.  And, frankly, if I lose an hour or so of pings, no big deal.  RAM disks are another basic IT staple (like ping) that has been lost to an unearned bad reputation.

My next post will be on the "current" version of Weialgo, version 5.  Version 4 and 5 were both based on two issues with the previous versions.  #1  Process per device polling limited the scalability of weialgo to a few hundred devices per polling server.  #2  Red/Green alerting wasn't needed, I was only using the data to run reports and create graphs.

Version 4 was an attempt to create a single process that polled multiple devices (Version 4 never worked properly).  Version 5 was a complete re-write (i.e., I threw out all of my previous work) when I discovered that there was a much easier way to ping thousands of devices.  It was a bit of a /facepalm moment.

Here's a copy of one of my Version 3 scripts.  Obviously, I had yanked out all of the int(X/2-1) logic, and I'm just logging straight pings to simplify statistical reporting.


#!/usr/bin/perl

# Weialgo version 3

use Net::Ping;
use Time::HiRes qw (usleep gettimeofday);
use Time::Local;
use strict;
#use warnings;

my $host = $ARGV[0];
my $hostname = $ARGV[1];
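# Use a string comparison: with ==, any non-numeric argument (e.g. a hostname) would compare equal to "".
if ( !defined $host || $host eq "" ) { print "\nno IP to ping $ARGV[0] $ARGV[1] $ARGV[2]\n\n"; exit;}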
my $sec = 0;
my $min = 0;
my $hour = 0;
my $mday = 0;
my $mon = 0;
my $year = 0;
my $wday = 0;
my $yday = 0;
my $isdst = 0;
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime(time);
$year += 1900;
$mon += 1;
$mon = sprintf("%02d", $mon);
$mday = sprintf("%02d", $mday);
open(LOG, ">>/tmp/v3logfile_$hostname\_$year$mon$mday.csv");
select(LOG); $| = 1;
close(LOG);
select(STDOUT); $| = 1;
my ($seconds, $microseconds) = gettimeofday();
my $starttime = $seconds;
srand($microseconds);
#my $offsetms = int(rand(1000000));
#my $offsetms = 100;
my $offsetms = 1;
usleep(1000000-$microseconds+$offsetms);
my $totaldown = 0;
#my $totaltime = 1;
my $i = 0;
my $j = 0;
my $ret = 0;
my $duration = 0;
my $previousduration = 2;
my $ip = 0;
my $runtime = 0;
my $sentpackets = 0;

my $p = Net::Ping->new("icmp");
$p->hires();

while ( $i==0 ) {

  ($seconds, $microseconds) = gettimeofday();
  $j = 1000000+$offsetms-$microseconds;
  usleep($j);
  ($seconds, $microseconds) = gettimeofday();
  ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime(time);
  $year += 1900;
  $mon += 1;
  $mon = sprintf("%02d", $mon);
  $mday = sprintf("%02d", $mday);
  ($ret, $duration, $ip) = $p->ping($host, 0.5);
  $runtime = $seconds - $starttime;
  $sentpackets++;
  if ( $ret == 0 ) {
    $totaldown++;
    $previousduration = $previousduration + 0.2;
    if ( $previousduration > 2 ) {
      $previousduration = 2;
    }
    open(LOG, ">>/tmp/v3logfile_$hostname\_$year$mon$mday.csv");
    printf LOG ("$seconds,$host,$hostname,$runtime,$microseconds,$sentpackets,$totaldown,1,%.2f\n", 1000 * $previousduration);
    close(LOG);
  }
  if ( $ret == 1 && $duration > 0 ) {
    open(LOG, ">>/tmp/v3logfile_$hostname\_$year$mon$mday.csv");
    printf LOG ("$seconds,$host,$hostname,$runtime,$microseconds,$sentpackets,$totaldown,0,%.2f\n", 1000 * $duration);
    close(LOG);
    $previousduration = $duration;
  }
  if ( -e "/tmp/pingflag.txt" ) {
    $i = 1;
  }
  if ( $microseconds > 50000 ) {
    sleep (1);
  }
  if ( $microseconds > 100000 ) {
    sleep (1);
  }
  if ( $microseconds > 150000 ) {
    sleep (1);
  }
  if ( $microseconds > 200000 ) {
    sleep (1);
  }
  if ( $microseconds > 250000 ) {
    sleep (1);
  }
  if ( $microseconds > 300000 ) {
    sleep (1);
  }
  if ( $microseconds > 400000 ) {
    sleep (1);
  }
  if ( $microseconds > 500000 ) {
    sleep (1);
  }

}

$p->close();


