Saturday, April 9, 2011

Reliability: Uptime and Downtime

Downtime
Downtime (computers or networks being unavailable) is one of the most frustrating experiences for a user, and something IT staff work tirelessly to avoid.
Downtime can be caused by hardware or software failures, security breaches, or as a planned exercise so we can change hardware or update software.
We try to avoid downed systems by using stable software versions and more expensive, redundant hardware (multiple network cards, multiple fibres, RAID disks, hot-swap hardware, etc.).
Most of our systems are rebooted only to install operating system security updates, and we do that in the middle of the night.
Some downtime for hardware upgrades is unavoidable, but we take advantage of scheduled power outages when we can.
Our schedule is complicated by the need to stay up while students work well into the night, and during class time for the Dubai campus as well.
In the computing industry (and engineering in general), uptime/downtime is expressed in nines.

Nines        Availability  Percentage  Downtime per year
one nine     0.9           90%         36.5 days
two nines    0.99          99%         3.65 days
three nines  0.999         99.9%       8.76 hours
four nines   0.9999        99.99%      52.56 minutes
five nines   0.99999       99.999%     5.26 minutes
six nines    0.999999      99.9999%    31.5 seconds
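
The arithmetic behind the table is straightforward: downtime per year is just (1 - availability) times one year. As a quick illustrative sketch (nothing we actually run, just the arithmetic), a few lines of Python reproduce the table:

    # Downtime per year implied by "N nines" of availability.
    # Illustrative arithmetic only; ignores leap years.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60

    def downtime_seconds(nines):
        """Seconds of downtime per year allowed at the given number of nines."""
        availability = 1 - 10 ** -nines   # e.g. 3 nines -> 0.999
        return (1 - availability) * SECONDS_PER_YEAR

    for n in range(1, 7):
        s = downtime_seconds(n)
        if s >= 86400:
            print(n, "nines:", round(s / 86400, 2), "days")
        elif s >= 3600:
            print(n, "nines:", round(s / 3600, 2), "hours")
        elif s >= 60:
            print(n, "nines:", round(s / 60, 2), "minutes")
        else:
            print(n, "nines:", round(s, 1), "seconds")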


So how do we fare?  Many of the background servers are rebooted for security patches only once or twice a year; they offer five nines.  Since rebooting takes between one and three minutes, five nines is the best that can be achieved without resorting to fully redundant servers.
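Putting numbers on that: the five-nines budget from the table is 5.26 minutes per year, so a single reboot of one to three minutes fits with room to spare, while two three-minute reboots (six minutes) would use the whole budget and slightly more.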
The Engineering file server was in that category for four years, but it started having problems last fall.  Its replacement in December 2010 has been completely stable, and we expect it to be back at five nines.
The engineering network backbone (for CPH, E1, E2, E3) has also been at five nines for a while.  The login web browser on the Nexus login screen depends on a web server that was at four nines last year; we've since moved it.
The campus homepage web server was down for a while on the snow day.  It doesn’t take much downtime to degrade your nines.
The two network resources with bad uptimes this term were wireless and the campus external Internet connection.  Both are managed by IST.
Wireless is always less reliable because it is subject to interference, but there was more to it than that this term.  IST uses a Sandvine installation to shape wireless traffic and reduce peer-to-peer traffic, and that had problems.  There were also unexplained reboots of wireless access points and other oddness.  Between the scheduled and unscheduled outages and the configuration problems, wireless was definitely problematic this term.  I don't have enough information to state the nines, but I wouldn't be surprised if it was around two for some areas of campus, which pro-rates to days of downtime per year rather than hours.
The other problem this term was our off-campus or external network links.  There were a number of hardware and fibre-related problems between us and Toronto.  Several times we had partial outages where only a percentage of the network traffic was lost, sometimes only between certain computers and not others.  We have a third network provider, and our systems are supposed to fail over to that redundant link; failover takes some time to kick in, so it isn't immediate.  The redundant link was also not as fast, which led to congestion.
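As a rough sketch of why that failover delay matters (the outage count and failover time below are made-up illustrative numbers, not measurements of our links):

    # Hypothetical figures for illustration only: suppose the primary
    # external link fails four times a year, and failover detection
    # plus rerouting takes two minutes each time.
    outages_per_year = 4
    failover_minutes = 2

    minutes_per_year = 365 * 24 * 60
    downtime = outages_per_year * failover_minutes      # 8 minutes/year
    availability = 1 - downtime / minutes_per_year

    print(f"availability: {availability:.5%}")          # ~99.99848%

    # Eight minutes a year already blows the 5.26-minute five-nines
    # budget, so slow failover caps even a redundant setup at four nines.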
IST is planning to improve the external links.  It's a good plan; people are depending on our external connections.
It's often said that you only notice your IT (computing) staff when things go wrong.  We're trying our hardest to make sure that doesn't happen.