May 06 2003
Drive Don't Fail Me Now
Written by Paul Winkeler   
Tuesday, 06 May 2003
Whenever I walk into a meeting with a prospective customer and I hear that their main problem consists of a never-ending string of backup failures, you can bet the air will be filled with additional remarks such as "it didn't used to be this bad" and "the other guy thought we had some bad tapes, but I think it's the drives", and so on.
As the all-knowing consultant, I have at least one trick up my sleeve to try to settle this debate: I propose to graph for them every media- or drive-related backup failure over the life of their current installation. Many of you will know that I am going to rely on the media manager files stored at
/usr/openv/netbackup/db/media/errors
Unless the media managers were wiped and re-installed, these error log files contain a record of every media failure ever recorded on that machine. And yet this falls far short of what we should expect an enterprise-class product such as Veritas NetBackup to provide for managing the media and hardware under its control. At first glance these files appear to contain all the information you need:
  • Date
  • Volume label
  • Error Type
  • Drive Index
Yet you quickly discover this list's glaring problem, namely the identification of the drive.
A media manager assigns each drive a drive index that is unique only within that media manager. In a static environment that is fine. But the first time a drive's power supply blows and the drive gets replaced, you can no longer rely on the data in that text file to tell you how reliable the drive at index 3 has been: it was replaced! What is really needed is a way of logging errors based not only on the drive's index, but also on its serial number and robot status at the time of the failure. Only then can we seriously analyze media and drive problems over time.
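To make the drive index problem concrete, here is a minimal Python sketch that parses one record and looks up which physical drive sat at that index on the day of the failure. Two things here are assumptions, not facts from the product: the whitespace-separated line layout (date, time, volume label, drive index, error type) and the hand-maintained DRIVE_HISTORY table with its made-up serial numbers. NetBackup does not keep such a table in this file, which is exactly the gap described above.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class MediaError(NamedTuple):
    when: datetime
    volume: str
    drive_index: int
    error_type: str


def parse_line(line: str) -> Optional[MediaError]:
    """Parse one record, or return None if it does not fit.

    ASSUMED layout: MM/DD/YY HH:MM:SS volume drive_index error_type
    Check this against your own errors file before trusting it.
    """
    parts = line.split()
    if len(parts) < 5:
        return None
    try:
        when = datetime.strptime(parts[0] + " " + parts[1], "%m/%d/%y %H:%M:%S")
        drive_index = int(parts[3])
    except ValueError:
        return None
    return MediaError(when, parts[2], drive_index, parts[4])


# HYPOTHETICAL swap history kept by hand: drive index -> [(installed_on, serial)].
DRIVE_HISTORY = {
    3: [(datetime(2002, 1, 1), "SN-OLD"), (datetime(2003, 3, 1), "SN-NEW")],
}


def serial_at(drive_index: int, when: datetime, history=DRIVE_HISTORY) -> Optional[str]:
    """Return the serial of the drive installed at this index at time `when`."""
    serial = None
    for installed_on, candidate in sorted(history.get(drive_index, [])):
        if installed_on <= when:
            serial = candidate
    return serial
```

With a table like this, two errors logged against index 3 on either side of a swap resolve to different serial numbers instead of being blamed on a single drive.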

A smaller problem is that each media manager keeps its media error logs to itself. Consolidating a statistical view means going out to each media manager and retrieving its file. And that is when you realize that the errors for drive index 0 on media manager alpha should not be confused with those for drive index 0 on media manager beta. (We won't even mention the horrors SSO brings into this picture!)
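One way to keep alpha's drive 0 apart from beta's drive 0 when consolidating is to key every count on the pair (host, drive index). The sketch below assumes the files have already been copied from each media manager and read into lists of lines, and it assumes the drive index is the fourth whitespace-separated field; both are illustrative assumptions, and the function name is my own.

```python
from collections import Counter


def consolidate(logs_by_host):
    """Count errors per (host, drive index) across several media managers.

    logs_by_host maps a host name to the lines of that host's errors file.
    ASSUMED layout: the drive index is the 4th whitespace-separated field.
    """
    counts = Counter()
    for host, lines in logs_by_host.items():
        for line in lines:
            parts = line.split()
            if len(parts) < 5 or not parts[3].isdigit():
                continue  # skip blank or unrecognized lines
            counts[(host, int(parts[3]))] += 1
    return counts
```

Calling consolidate({"alpha": alpha_lines, "beta": beta_lines}) then yields separate tallies for ("alpha", 0) and ("beta", 0). It does nothing for shared drives under SSO, where the same physical drive can show up under different indexes on different hosts.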

That said (are you listening, Veritas?), the data that is collected can safely be used to trend failure frequency, so you can settle the debate about whether the failure rate is increasing, even if you cannot authoritatively speak to its causes without a lot more research. Until then you will just have to close your eyes when you submit that job and whisper "Drive don't fail me now!"

If you're interested in ways to quickly build media/drive failure rate graphs from the raw data, take a look at the script "volstats.pl" in the NBU Perl module distribution for a way to massage the raw data into a CSV (Comma Separated Values) file. From there it is easy enough to load the data into Excel and graph to your heart's content.
The author himself will be more than happy to assist you with the generation and/or interpretation of the graphs.
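If you just want a quick trend line without the Perl modules, the same idea can be roughed out in a few lines of Python. To be clear, this is not volstats.pl and makes no claim to match its output; it is a sketch that assumes each record starts with a date in MM/DD/YY form and simply counts failures per month.

```python
import csv
from collections import Counter
from datetime import datetime


def monthly_counts(lines):
    """Count failure records per calendar month.

    ASSUMED layout: each record begins with a date formatted MM/DD/YY.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        try:
            day = datetime.strptime(parts[0], "%m/%d/%y")
        except ValueError:
            continue  # not a record line
        counts[day.strftime("%Y-%m")] += 1
    return counts


def write_csv(counts, out):
    """Write month,failures rows in chronological order to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["month", "failures"])
    for month in sorted(counts):
        writer.writerow([month, counts[month]])
```

Open the errors file, pass its lines to monthly_counts, and hand the result to write_csv along with an output file opened with newline="". The resulting two-column file loads straight into Excel for graphing.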

Last Updated ( Thursday, 26 April 2007 )
 