Friday, November 30, 2012

20121201 I never claimed my hacks were pretty.... Weialgo versions 2 and 3.

I looked through my "stuff" and haven't found a version of weialgo older than 2008 here at home.  I probably have older versions squirreled away at work, but this seems to be the oldest one I have.


#!/usr/bin/perl

#Weialgo version 2

use Net::Ping;
use Time::HiRes qw (usleep gettimeofday);
use strict;
#use warnings;

my $host = $ARGV[0];
my $hostname = $ARGV[1];
if ( !defined $host || $host eq "" ) { print "\nno IP to ping $ARGV[0] $ARGV[1] $ARGV[2]\n\n"; exit;}   # string compare, not numeric
open(LOG, '>>/mnt/ramdisk/v2logfile.csv');   # touch the log file up front
select(LOG); $| = 1;
close(LOG);
select(STDOUT); $| = 1;                      # unbuffer STDOUT
my ($seconds, $microseconds) = gettimeofday();
my $prevseconds = $seconds;
my $starttime = $seconds;
srand($microseconds);
my $offsetms = int(rand(1000000));         # random per-process offset (in microseconds) to stagger the pings
usleep(1000000-$microseconds+$offsetms);   # sleep to the next second boundary plus the offset
my $down = 0;
my $totaldown = 0;
my $transitions = 0;
my $totaltime = 1;
my $i = 0;
my $j = 0;
my $ret = 0;
my $duration = 0;
my $ip = 0;
my $runtime = 0;
my $sentpackets = 0;
my $meetsla = 0;
my $minsla = 100000;
my $sec = 0;
my $min = 0;
my $hour = 0;
my $mday = 0;
my $mon = 0;
my $year = 0;
my $wday = 0;
my $yday = 0;
my $isdst = 0;

my $p = Net::Ping->new("icmp");   # raw ICMP pings require root
$p->hires();                      # report round trip times with sub-second resolution

while ( $i==0 ) {

  ($seconds, $microseconds) = gettimeofday();
  ($ret, $duration, $ip) = $p->ping($host, 0.6);   # 600ms timeout
  $runtime = $seconds - $starttime;
  $sentpackets++;
  if ( $ret == 0 ) {      # no response
    if ( $down == 0 ) {   # first miss: log the transition to down
      open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
      printf LOG ("$seconds,$host,$hostname,$runtime,$totaltime,$transitions,$totaldown,1,%.2f\n", 1000 * 10);
      close(LOG);
    }
    $duration = 10000;    # record the miss as a synthetic 10000ms round trip
  }
  if ( $ret == 1 ) {      # got a response
    if ( $down > 1 ) {    # just recovered: log how long the device was down
      $j = $seconds - $down + 1;
      $totaldown = $totaldown + $j;
      $seconds--;
      open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
      printf LOG ("$seconds,$host,$hostname,$runtime,$totaltime,$transitions,$totaldown,$j,%.2f\n", 1000 * 10);
      close(LOG);
      $seconds++;
    }
    open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
    printf LOG ("$seconds,$host,$hostname,$runtime,$totaltime,$transitions,$totaldown,0,%.2f\n", 1000 * $duration);
    close(LOG);
    $duration = $duration * 1000;   # convert seconds to milliseconds
  }
  if ( $duration < $minsla ) { $minsla = $duration; }   # track the best RTT seen so far (ms)
  if ( $duration < 10000 ) { if ( $duration < ( $minsla + $minsla + 50 ) ) { $meetsla++; } }   # "meets SLA" = within twice the best RTT plus 50ms
  ($seconds, $microseconds) = gettimeofday();
  if ( $microseconds < $offsetms ) {
    $j = $microseconds + 1000000;
    $microseconds = $j;
  }
  $j = 1000000+$offsetms-$microseconds;   # microseconds until the next second boundary plus the offset
  if ( $ret == 1 ) {   # up: next poll in ~6 seconds
    $j = $j + 5000000;
    $totaltime = $totaltime + 6;
    if ( $down > 0 ) { $transitions++; }
    $down = 0;
  }
  if ( $ret == 0 ) {   # down: next poll in ~2 seconds (the 0.6s timeout eats part of that)
    $j = $j + 1000000;
    $totaltime = $totaltime + 2;
    if ( $down == 0 ) { $transitions++; $down = $seconds}
    if ( -e "/mnt/ramdisk/pingslow.txt" ) {   # flag file: slow the down-polling back to the ~6 second rate
      $j = $j + 4000000;
      $totaltime = $totaltime + 4;
    }
    #if ( $seconds - $prevseconds > 3 ) {
    #  $j = $j + (( $seconds - $prevseconds - 2 ) * 1000000 );
    #}
  }
  $prevseconds = $seconds;
#  print "$j,$down\n";
  usleep($j);
  if ( -e "/mnt/ramdisk/pingflag.txt" ) {
    $i = 1;
    $j = 0;
    if ( $ret == 0 ) {
      $j = $seconds - $down + 1;
      $totaldown = $totaldown + $j;
      open(LOG, '>>/mnt/ramdisk/v2logfile.csv');
      printf LOG ("$seconds,$host,$hostname,$runtime,$totaltime,$transitions,$totaldown,$j,%.2f\n", 10000);
      close(LOG);
    }

    ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime(time);
    open(LOG, '>>/storage/weialgo/rollups/v2sla.csv');   # final SLA rollup line for this run
    printf LOG ("%4d-%02d-%02d %02d:%02d:%02d,$seconds,$host,$hostname,$minsla,$sentpackets,$meetsla,%.2f\n",$year+1900,$mon+1,$mday,$hour,$min,$sec, $meetsla / $sentpackets * 100);
    close(LOG);
    sleep(5);
  }

}

$p->close();


This seems to be the last iteration of version 2 weialgo that I put together.  Yes, it was written in Perl, mainly because I didn't see much improvement in doing it as a native binary in C.  Perl was much easier to update, and seemed to have similar performance.

I'm not going to bother including or describing the reporting modules.  Most of that was done in shell scripts or Perl.  You can use perl/sed/awk/grep/sort to go through the information this script provides and get to some very useful information if so inclined.
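As a quick, hypothetical example (assuming the v2 CSV layout above, where the eighth field is the down indicator and the ninth is the round trip time in milliseconds), a one-liner along these lines would give you down counts and average RTT for a log file:

perl -F, -lane '$n++; $F[7] > 0 ? $down++ : ($rtt += $F[8]);
  END { printf("%d samples, %d down lines, avg rtt %.2f ms\n", $n, $down || 0, $rtt / (($n - $down) || 1)) }' v2logfile.csv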

Net::Ping and Time::HiRes are the two Perl modules that do most of the magic (both have shipped with Perl itself for years, so a CPAN install may not even be necessary).  Reading through the perl script, you can get an idea how those modules work, or you can go out to the Perl website and read up on the modules directly.

The thought process for this version was a carry-over of the original version, which was more for Red/Green alerting, with the side benefit of being able to report on data with a number of different statistical models.

At the time I was running upwards of 200 or so pings to individual systems/routers, which was about the limit of what this process was capable of.  Starting one was a simple "/usr/bin/perl ./weialgo2.pl 10.1.1.1 myserver.com &", and then I'd just let it run continuously.

As I remember, I would let the program ping the device once every 6 seconds, and record the round trip time and the other data I thought relevant at the time.  If the ping dropped, or didn't return in time, I changed the ping interval to once a second, until the device responded again.

This last part was the "Weippert Algorithm".  It basically goes like this.  Decide how fast you want to know that a device is down (say, X = 60 seconds).  Divide that number in half, subtract 1, and round down (29).  Using a fast, low bandwidth, low CPU status protocol (like ICMP ping, or a TCP SYN/ACK port open check), check the device every int(X/2-1) seconds (29).  The moment the device misses a poll, check the device every second until it comes back up.  If the device misses/drops int(X/2) polls (30), declare the device down and send an alert.
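Here's a minimal sketch of that logic in Perl, stripped of all of weialgo's logging (the target IP and the 60 second detection window are just placeholders):

#!/usr/bin/perl
use strict;
use warnings;
use Net::Ping;

my $target = 60;                      # how fast you want to know a device is down
my $slowpoll = int($target / 2) - 1;  # 29: polling interval while the device is up
my $maxmiss = int($target / 2);       # 30: consecutive misses before declaring it down

my $p = Net::Ping->new("icmp");       # raw ICMP requires root
my $misses = 0;
while (1) {
  if ($p->ping("10.1.1.1", 1)) {      # 1 second timeout
    $misses = 0;
    sleep($slowpoll);                 # up: poll slowly
  } else {
    $misses++;
    print "DEVICE DOWN\n" if $misses == $maxmiss;   # declare down, send the alert
    sleep(1);                         # missed a poll: drop to one-second polling
  }
}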

(If you know Scott Weippert, you can tell him how brilliant he is.  :-) )

The Algorithm is simple, and very effective for Red/Green alerting.  Unfortunately, I couldn't convince anyone I worked with that it was better than SNMP for Red/Green (it's MUCH better; SNMP is a crappy system for basic availability alerting from a network point of view, but I haven't won that fight yet).

As I worked through the different versions of Weialgo, I found that I was using Weialgo more for statistical reporting, and not for Red/Green.  From a statistical viewpoint, the int(X/2-1) aspect of weialgo complicates statistical reports quite a bit, as it means that all data has to be time indexed as part of the reporting process.  It's much easier to just ping every second or two, and report on the data using that assumption.

The main drawback to this version (and any other process-per-device based polling) is that it drives up the number of concurrent processes on the polling server.  On the P4 I was running this on, at around 200 instances of weialgo, the amount of incurred latency based on just process switching within Linux began to throw off the results.  CPU utilization would normally bounce off 100% continuously, and the box was useless for anything else.  So, I normally tried to keep the number of polled devices much less than 200, usually around 100, so that I could do some simple reporting on the same box against the data.

At one point in time, earlier versions of Weialgo would email a pager every time a device went down.  That lasted for a couple weeks (a router would lose its E1 for 10 seconds at 2am, PAGE THE NETWORK TEAM!!!!  I wasn't very popular for a couple weeks.).  Adding that functionality back into the perl script would be easy enough to do.
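If you wanted it back, a hypothetical version could be as simple as piping to sendmail from the $ret == 0 branch (the pager address and the sendmail path here are placeholders):

sub page_network_team {
  my ($host, $hostname) = @_;
  open(my $mail, '|-', '/usr/sbin/sendmail -t') or return;   # paging is best-effort
  print $mail "To: pager\@example.com\n";
  print $mail "Subject: DOWN: $hostname ($host)\n\n";
  print $mail scalar(localtime) . ": $hostname stopped responding to ping\n";
  close($mail);
}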

Obviously, I'm over-reporting information in the log files, but even with the extra data, the total amount is minor in my opinion.  gzip/bzip2 the log files after reporting on them, and you can keep decades of data in a few gigabytes.

Also, since I put all of the data into the same log file, it's possible to have concurrency problems, depending on the version of *nix you want to run this on.  I never had a problem on Linux, but Solaris tended to throw a mangled line into the log file every once in a while.  If concurrent processes writing to the same logfile give you mangled entries, split the logfiles up by putting the device name in the logfile name (the version 3 script below does exactly that).

$offsetms was put in because of the number of concurrent processes I was running.  I didn't want all of the pings to go out in the same exact microsecond.  So, using $offsetms (which, despite the name, is in microseconds), I randomized the start time to a different microsecond for each process.  This spread out the pings, and the processing.

To try this script out, it should be fairly simple.  You'd need Linux (or your favorite version of *nix), Perl, and the Net::Ping and Time::HiRes modules (install them from CPAN if your Perl doesn't already ship with them).  Copy and paste the script into a .pl file on the box, and run the script with the IP address and hostname you want to ping.  Sit back and watch it ping the device until the end of time, or until the server is rebooted.

Oh, as you can probably see from the script, I'm a fan of RAM disks for transient data like this.  Since the log file isn't held open by the processes, you can mv the log file at any time and the processes will keep running.  So, rather than doing hundreds of individual writes every second to a single file on a hard drive, I do the "spam" of individual writes to a file on a RAM disk, then mv/report the logfile data via a cron process every hour or so.  Saves wear and tear on the hard drive, and speeds up everything overall.  And, frankly, if I lose an hour or so of pings, no big deal.  RAM disks are another basic IT staple (like ping) that has been lost to an unearned bad reputation.
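A hypothetical hourly cron entry for that rotation might look like this (using the ramdisk path from the script; note that % has to be escaped in a crontab):

0 * * * * mv /mnt/ramdisk/v2logfile.csv /storage/weialgo/v2logfile.$(date +\%Y\%m\%d\%H).csv

Because every write re-opens the log with '>>', the pollers just recreate the file on their next write after the mv.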

My next post will be on the "current" version of Weialgo, version 5.  Versions 4 and 5 were both driven by two issues with the previous versions.  #1: Process-per-device polling limited the scalability of weialgo to a few hundred devices per polling server.  #2: Red/Green alerting wasn't needed; I was only using the data to run reports and create graphs.

Version 4 was an attempt to create a single process that polled multiple devices (Version 4 never worked properly).  Version 5 was a complete re-write (i.e., I threw out all of my previous work) when I discovered that there was a much easier way to ping thousands of devices.  It was a bit of a /facepalm moment.

Here's a copy of one of my Version 3 scripts.  Obviously, I had yanked out all of the int(X/2-1) logic, and I'm just logging straight pings to simplify statistical reporting.


#!/usr/bin/perl

# Weialgo version 3

use Net::Ping;
use Time::HiRes qw (usleep gettimeofday);
use Time::Local;
use strict;
#use warnings;

my $host = $ARGV[0];
my $hostname = $ARGV[1];
if ( !defined $host || $host eq "" ) { print "\nno IP to ping $ARGV[0] $ARGV[1] $ARGV[2]\n\n"; exit;}   # string compare, not numeric
my $sec = 0;
my $min = 0;
my $hour = 0;
my $mday = 0;
my $mon = 0;
my $year = 0;
my $wday = 0;
my $yday = 0;
my $isdst = 0;
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime(time);
$year += 1900;
$mon += 1;
$mon = sprintf("%02d", $mon);
$mday = sprintf("%02d", $mday);
open(LOG, ">>/tmp/v3logfile_$hostname\_$year$mon$mday.csv");   # one log file per device per day
select(LOG); $| = 1;
close(LOG);
select(STDOUT); $| = 1;
my ($seconds, $microseconds) = gettimeofday();
my $starttime = $seconds;
srand($microseconds);
#my $offsetms = int(rand(1000000));
#my $offsetms = 100;
my $offsetms = 1;                          # near-zero fixed offset; the random stagger is commented out
usleep(1000000-$microseconds+$offsetms);   # sleep to the next second boundary plus the offset
my $totaldown = 0;
#my $totaltime = 1;
my $i = 0;
my $j = 0;
my $ret = 0;
my $duration = 0;
my $previousduration = 2;
my $ip = 0;
my $runtime = 0;
my $sentpackets = 0;

my $p = Net::Ping->new("icmp");
$p->hires();

while ( $i==0 ) {

  ($seconds, $microseconds) = gettimeofday();
  $j = 1000000+$offsetms-$microseconds;   # microseconds until the next second boundary plus the offset
  usleep($j);
  ($seconds, $microseconds) = gettimeofday();
  ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime(time);
  $year += 1900;
  $mon += 1;
  $mon = sprintf("%02d", $mon);
  $mday = sprintf("%02d", $mday);
  ($ret, $duration, $ip) = $p->ping($host, 0.5);   # 500ms timeout
  $runtime = $seconds - $starttime;
  $sentpackets++;
  if ( $ret == 0 ) {   # ping failed: log a synthetic RTT that creeps up from the last good reading
    $totaldown++;
    $previousduration = $previousduration + 0.2;
    if ( $previousduration > 2 ) {
      $previousduration = 2;   # cap the synthetic RTT at 2 seconds
    }
    open(LOG, ">>/tmp/v3logfile_$hostname\_$year$mon$mday.csv");
    printf LOG ("$seconds,$host,$hostname,$runtime,$microseconds,$sentpackets,$totaldown,1,%.2f\n", 1000 * $previousduration);
    close(LOG);
  }
  if ( $ret == 1 && $duration > 0 ) {   # ping answered: log the real RTT
    open(LOG, ">>/tmp/v3logfile_$hostname\_$year$mon$mday.csv");
    printf LOG ("$seconds,$host,$hostname,$runtime,$microseconds,$sentpackets,$totaldown,0,%.2f\n", 1000 * $duration);
    close(LOG);
    $previousduration = $duration;
  }
  if ( -e "/tmp/pingflag.txt" ) {
    $i = 1;
  }
  # if the wakeup drifted past the second boundary, back off one extra
  # second per threshold exceeded
  for my $threshold (50000, 100000, 150000, 200000, 250000, 300000, 400000, 500000) {
    sleep(1) if $microseconds > $threshold;
  }

}

$p->close();



Tuesday, November 27, 2012

20121128 Ping: I laughed, I cried! It is better than SNMP! I’m going to use it again and again!

If I could hypnotize you to get this idea through to you, I would.  As it is, I have to resort to a wall of text.

Ok, let's use our imagination for a moment.

Let's say you've been brought in to a new network.  There are no monitoring tools working.  All you have is people complaining about application and Internet performance, and everyone is pointing at the network as the cause.

Nothing is set up on the network to allow monitoring.  SNMP isn't configured properly.  Netflow isn't enabled.  You have limited server capability; even if you did get one of the monitoring protocols working on all the network equipment, you wouldn't have the server horsepower to properly use it.

What could you do that would give you data to hold back the tide of angry end users?  What tool should be your go-to in tough situations like this?  What tool gives you the best information you could ask for in any network?

I'm here to talk about the much maligned, much ignored, but dependable workhorse, Ping.  Yes, that utility you go to first every time someone says "The network is down".  No, you've never said thank you to Ping.  It's just there, ready to do its job at a moment's notice.  It's the best friend that a network admin could ever have, and it never asks for recognition, never takes a day off.

Why do we depend on Ping when we're in trouble, but never hear of it any other time?  What makes people think that ping is reliable and has a job to do when everything is coming apart, but any other time it's not good enough to use for any "serious" work?

Ping gets no respect.  That's just plain wrong.

Ping is the best tool to provide a network admin with the most important bits of information you'll ever want to know about a network.

Ping tells you two things.  #1: Is the device responding?  #2: How long did the device take to respond?

Now, every network admin/tech/designer/engineer knows this.  Ping is everywhere for a reason.  But here's the reminder that everyone needs to remember.  THESE ARE THE TWO MOST IMPORTANT THINGS YOU NEED TO KNOW ABOUT THE NETWORK.

Sure, it's nice to know the types of information SNMP and Netflow can give you.  But both of those require network equipment to be configured properly, servers to be available with enough capability to handle the processing loads, and software to be purchased, loaded, configured, and maintained.  A decent installation of your favorite HP SNMP monitoring suite for a mid-sized company could take months, burn thousands of man-hours, and doesn't come for pocket change.  Proper Netflow collection and reporting isn't any better.

And neither of them will tell you the two most basic things about the network with the level of accuracy of Ping.  If you want to know whether it's up and how quickly it responded over the network, Ping is the champ, bar none.

Now, I'm not trying to detract from SNMP and Netflow.  They, and systems like them, are essential tools to keep a developed network running properly.  But, most new Network Admins treat them as the first thing you should do to get a handle on a network that has gotten out of control.  I won't call them the last thing (after all, how hard is it to set up MRTG, really?), but setting up an organization-wide Ping statistical data collection tool is simple, fast, and can be done from a spare laptop.

"Organization-wide Ping statistical data collection tool"?  Ok, I'm not sure if that is the best phrase for it, but it's what I call it, and it kind of describes where I'm going with this.

Let's say you collect a list of devices that are on the network that are considered important.  The IP address of every router, the important switches and other important network devices.  Throw in a list of the important server or system IP addresses.  You'll have to do this anyways, so, just type them all up in a list.

Now, let's say we ICMP Ping each one of them, once a second or so, continuously.  Yep, 24x7, 365.

It's ok, really.  Ping won't break your switches, or any other device for that matter.

See, here's the thing about Ping: the individual packets are as small as they get.  If you configure Ping properly (take a look at the -l 0 option in Windows ping, or -s 0 in Linux), the total packet size is 28 bytes: 20 bytes of IP header plus 8 bytes of ICMP header, with no payload.  So, for every device you ping once a second, you use 224 bits per second.  Even a 300 baud modem from 30 years ago can handle that just fine, and if you upgrade to a 1200 baud modem, you have bits to spare.  At today's network speeds, 224 bits each second for each device you want to ping doesn't come up to the level of background noise.  It's background noise to the background noise.
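For reference, here's what those minimal pings look like (the address is just a placeholder):

ping -l 0 10.1.1.1     (Windows: zero data bytes, 28 bytes on the wire)
ping -s 0 10.1.1.1     (Linux: same thing)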

On top of the fact that Ping is extremely efficient with network bandwidth, responding to a ping requires next to zero CPU processing capability.  The most basic of network devices that are in use today have the ability to respond to ping with sub-millisecond variability.

Really, it's ok.  Pinging devices on your network once a second won't break anything.  This is the part that gets me when I start discussing this subject with other IT people.  For some reason, there seems to be an aversion to pinging something repeatedly on a continuous basis.  But, no one thinks SNMP-walking devices once every 5 minutes is a big deal.  SNMP and Netflow require orders of magnitude (yes, I'm using that term properly, and I think I'm safe saying it) more processing power than Ping ever could, period.  And, Ping gives you the two things you need most.

Now, let's collect all of that Ping data and turn it into something useful.  Something like graphs.  Non-network people love graphs.  Let's say we take all of that data and have our laptop turn it into a graph once a day for each device.  On top of that, let's create a composite metric for each device, letting us know where we need to start working first (rather than purely being slaves to end user complaints).

Here's a graph as an example.


The graph is based on ping data of a random site router.  No specific reason for pulling this graph in particular, it was the first graph I clicked on.  What's the graph mean?  Easy.  Yellow is bad.  Red is really bad.  If there's lots of red, it's really really bad.  I'll explain more later.

This graph was generated on a 6 to 8 year old box (2 GHz P4?  No one seems to remember how old the box actually is) that someone was throwing away.  That throwaway box is pinging 800 devices every other second, collecting all of the data, and then generating a graph once a day for each device based on that data.  For 20+ hours of every day, CPU utilization on the box is less than 5% (remember, ping takes next to nothing in CPU).  The only time that box "works" is when it's reporting on the data and generating graphs, which it statically stores in an HTML directory to be looked at via a web server on the same box.

It took next to nothing to set up.  It takes next to nothing to run.  And, all of a sudden, poof, you have very useful data that you won't see from any other tool.  I dare say, I don't know of any other tool that will do anything to this level.

Here's a monthly graph for a different site router.


How hard is it to look at that graph and see there is a problem?  The graph covers a whole month, but it's very easy to see the site circuit outage in the middle of the month.

I wrote the tool (hack) back in 2004, and have kept modifying it ever since.  It's much too useful not to use, it's cheap (free?), and it takes up next to nothing in system resources to run.

It would be the first tool I deploy in any network I become involved with, and it makes an excellent supplement to every other monitoring tool out there.

And, it's just Ping.  Hundreds of Pings.  Thousands of Pings.  Once every second, or every other second.  And then graph against a model.  All using open source software and a couple scripts.  Takes next to nothing to set up.

At one time, I had this system sending an email whenever a device failed to respond for 10 pings in a row.  Yes, Red/Green, with a hair trigger.  None of this "miss two SNMP polls and we send an alert" business.  If it was down for 10 seconds, email to a pager (boot to the head).

I named the system Weialgo, for Weippert Algorithm.  Someone I worked with gave me the idea for it as a snide comment about how poor a job another SNMP monitoring system did at alerting when a site/system was down.  In less than 5 minutes, he laid out how Red/Green should be done, including the underlying logic.  I'm sure he doesn't/won't remember the conversation, but I remember it as one of those moments where my eyes must have been wide open and I was thinking "That's Brilliant!".

Where I currently work, we call the graphs the "Grass on Fire" report, for obvious reasons.

Over the next couple weeks or so, I'll document the different versions, the setup, and the different modifications I've made with this system over the years.


Monday, November 26, 2012

20121127 Everything Packet: Where do you start? Build a Chair.


I'm sure, if you take a look at the last post, and this post, it probably looks like I forgot about the blog for a long period of time.  Honestly, I didn't.  It's just one of those things where you sit and look at it and ask yourself "Why?".  I don't expect a lot of people to be interested in the finer details of packet handling on a communications network, so I'm not sure why I would invest a lot of time typing up all of my thoughts on networking to put on the Internet so that very few people would read it.

But, then I take a look at the embarrassingly huge amount of time I've spent on other pursuits that had even less productive potential (WoW anyone?).  So, it doesn't matter if this is a waste of time.  "Why" isn't a problem.

The other problem is a serious one.  Where to start.  Really.  I mean it.

I've been doing computers as long as I can remember.  It's not something I do for work.  It's who I am.  Even when I'm having fun, it usually has something to do with computers.  Nowadays, that's easy to say; 30 years ago, not so much.

I've built up more than my fair share of opinions on what is wrong with Computing (big C) over that time.  They cover the entire gamut.  Anything from the "Big Idea" things, all the way down to the "Nit Picky" details.

Where do you start?  When you take a look at everything wrong with the world, where you could literally start in on any section and just rip it to shreds, how do you decide what to tackle first?

I guess we'll start with how to describe the problems with a particular chair.

The problem with this analogy is that everyone knows what a chair is.  Or they think they do.  With technology (or any suitably advanced subject, like Nuclear Science or such), most people don't know what a "chair" is.  And those that should know what a "chair" is, probably don't.  Those people who actually know what a "chair" is, had to find out the hard way, and most are so filled with self doubt (because they've trialed-and-erred their way to this knowledge) that they don't volunteer the fact that they know what a "chair" is, just in case they haven't learned everything there is to know about "chairs".

So, when the subject about "chairs" comes up, you find that the information about "chairs" is primarily dominated by people who, by personality, are the ones that are most willing to stand up and say what a "chair" is, not by the ones that actually know "chairs".  This leads to suffering on a wide-scale basis.

So, what I'm saying is, "chair" design is primarily documented and promoted by people that are type A personalities, not by the people who have the best overall understanding on how to design a good all-around "chair".

So, what does it take to understand a "chair"?  You have to understand the Universe, Human Society, Human Government, Human History, Biochemistry, Biomechanics, Human Physiology, Mechanical Engineering, and Chemistry.... at the very least.

Do you really need to know about the Universe (which is a very big subject, to be sure) to understand how to build a chair?  Not really.  But, I will guarantee that you'll build a better chair, the more you understand the Universe.

And, the "Universe" here is really my term for a point of view.  I could have just said "you'll need to understand several points of view to properly build a chair", but phrases like that cause the brain to shut off.  No, one of the things you'll need to understand to properly build a chair is the Universe.  I didn't say "Build the BEST chair", I just said build a chair.

The Chair exists as part of a room, which exists in a house or building, which is owned by someone, which sits on the ground in a country, controlled by a government, which is part of a group of nations that exist on a planet, which orbits a star, which is one of billions in a galaxy, which is one of the billions of galaxies that make up the known universe.

Is it really that important to know about the Universal context of something when building a "chair"?  I don't know, what are you going to build the chair out of?  Let's say you decide to make the frame of the chair out of osmiridium.  Osmiridium is a fantastic material for building chairs out of as it's incredibly strong, very durable, and very corrosion resistant.

The problem with osmiridium is that it's very rare, and likely to stay that way.  But, you'd only know that if you understood how osmiridium was created in the first place, and that takes a much bigger view than you'll ever see by just looking at a "chair".  (On a side note:  I think, if osmiridium ever becomes commonplace, you'll find all chair frames being made of it.)

The same situation exists for every context of "chair" building.  To build a proper chair, you have to know a great deal about "everything".  Heck, all I need to do is say OSHA, and that should be enough.

So... if building a proper chair, building a chair properly, not even a good chair, just a proper one, is so hard, what makes everyone think that everything in IT is so damn easy?

Yes.  Why does everyone think that IT is somehow magic, and that people who have NO interest in it can, all of a sudden, sit down and make unbelievably broad statements on how IT systems are to work, and enforce standards that only make sense in the land of physics make believe?

-IT Systems are orders of magnitude more complex than a chair.
Documenting something as "simple" as network routing protocols has taken volumes of books and entire lifetimes of people to get right.  Ask John Moy to explain OSPF in detail in less than 5 minutes.  Now, try to explain a chair in detail, in less than 5 minutes.  The chair could probably be done in less than 30 seconds, and most people would probably understand it.  I don't know Mr. Moy, but I would guess that he could talk for 5 hours and 90% of people still wouldn't understand.

-IT Systems are conceptual in nature, chairs you can see.
Computer systems "work" in your head.  When you ask UPS how they track your package, there's no machine to take apart to point at the little stick that turns the gear that makes the package tracker point to your package in downtown Duluth.   It's all concepts, imagination put to productive use.  Public school beats imagination OUT of kids, it's not a productive skill to have according to most bureaucrats.  I hate to tell you this, but you CAN'T do IT without having a strong imagination.  Troubleshooting, design, support.  Imagination is king.  Imagination is lacking in the human race as a whole, and I believe it exists at an even smaller ratio for people with a business major.  I'm not saying people with business majors aren't smart.  Some are wickedly smart.  But, being technically minded, and having strong imaginations are not traits that are sought after in people with business degrees.

-IT Systems are unbelievably interwoven, chairs interact with the other furniture in your room.
A typical IT system will normally involve itself with every aspect of that business, and with people and companies outside of that business.  All I have to do is say "Financial systems, Logistics, Human Resources, Voice and Video communications, Ecommerce, Manufacturing, Facilities..." and that's just a start.  Here's the fact: because of IT, every human being that works for an organization is directly tied to every other human being that works with that organization.  And the complexity of those interconnects is directly reflected in the IT Systems design.

-IT Systems require a diverse group of people to design, build, and maintain.  A chair takes one Amish carpenter.
I generally get the feeling that management feels that as time goes by, fewer people are required to maintain IT systems.  History shows that the exact opposite is true.  Remember the mainframe?  Mainframes died, not because the mainframe concept was backwards or out of date.  Mainframes died because they weren't supported, upgraded, and the knowledge of the people who built and supported those mainframes wasn't actively promoted through training of the next generations of IT people.  Everything you are doing today in IT is doomed if you don't continuously update the technology, and update the people.  And, let's face it, you can't pick just anyone to learn and love IT.  Most people would rather play fantasy football than go home and build a mainframe in their basement.

So, if designing a chair is complex, and IT Systems are orders of magnitude more complex than building a chair, why does everyone think it's so easy to design IT?

Here's my checklist for designing and building IT systems.

1.  You must love IT.  You must do IT for fun in your spare time.
2.  You must have at least 10 years of blood, sweat, and tears in IT before you can design the "smallest" IT production system which affects human lives.
3.  You must own it after you build it.  If it doesn't work right, it's your fault, don't avoid it.
4.  You must find and teach anyone you can who loves IT as well.  If you have a good design, you must share the idea with others.

Number 4 is why I'm doing this blog.  Let's see how far I get.