smartctl on your hard drive, you often get a plethora of information that can be hard to interpret for inexperienced users. This post tries to help you understand the technical reasons behind the error messages. If you’re looking for advice on whether to replace your hard drive, the only guidance I can give is this: it might fail at any time, so back up your data, but it might also run for many years to come. Furthermore, this article does not describe basic SMART WHEN_FAILED checking but rather the interpretation of more subtle signs of possibly impending HDD failure.
One example that is particularly hard to interpret is the device error log, which stores the last few errors, for example:
Error 8910 occurred at disk power-on lifetime: 7257 hours (302 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 18 00 33 96 40 00      03:09:52.125  READ FPDMA QUEUED
  60 88 10 50 06 11 40 00      03:09:52.125  READ FPDMA QUEUED
  60 08 08 60 ac 5e 40 00      03:09:52.113  READ FPDMA QUEUED
  60 08 00 48 cf 6d 40 00      03:09:52.099  READ FPDMA QUEUED
  60 90 f0 b0 ef e5 40 00      03:09:52.065  READ FPDMA QUEUED
Obviously, the first line shows when this error occurred. The other lines, however, are not as obvious. Let’s examine the next section:
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672
While this section also shows the contents of some registers at the time the error occurred, the interesting part is the error description Error: UNC at LBA = 0x01963300 = 26620672.
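The register bytes and the LBA in the error description are related. The following minimal sketch assumes the classic 28-bit ATA register layout (SN, CL and CH carry LBA bits 0 to 23, the low four bits of DH carry bits 24 to 27) and assumes that bit 6 of the error register ER signals UNC. The function names are made up for illustration; this is not how smartctl itself is implemented.

```python
# Register values copied from the error log entry above (hex strings).
REGS = {"ER": "40", "ST": "41", "SC": "1a", "SN": "00",
        "CL": "33", "CH": "96", "DH": "61"}


def lba28_from_registers(regs):
    """Assemble a 28-bit LBA from the SN/CL/CH/DH register bytes."""
    sn, cl, ch, dh = (int(regs[name], 16) for name in ("SN", "CL", "CH", "DH"))
    # Assumption: only the low 4 bits of DH hold LBA bits 24..27.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def is_unc(regs):
    """Assumption: bit 6 (0x40) of the error register means UNC."""
    return bool(int(regs["ER"], 16) & 0x40)


lba = lba28_from_registers(REGS)
print(f"0x{lba:08x} = {lba}, UNC: {is_unc(REGS)}")
```

With the values from the log entry, this reproduces the LBA 0x01963300 = 26620672 that smartctl prints next to the registers.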
An LBA is a logical block address, i.e. the number of a logical sector on the hard drive. It is shown in both hexadecimal form (0x01963300) and decimal form (26620672). To convert it to a byte address, multiply it by the logical sector size listed at the head of the smartctl output:
Sector Size: 512 bytes logical/physical
In almost all cases, this value is 512 bytes, so in this example the byte offset is 26620672 * 512 = 13629784064 bytes, or about 12.69 GiB. It can be helpful to look up this address in a tool like GParted to see which partition the error occurred in. Also see this smartmontools HOWTO describing this process in detail.
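The conversion can be sketched in a few lines of Python. The partition table below is hypothetical and only shows how an LBA could be matched against start and end sectors; take the real values from your own partitioning tool.

```python
def lba_to_byte_offset(lba, logical_sector_size=512):
    """Convert a logical block address to a byte offset on the drive."""
    return lba * logical_sector_size


# Hypothetical layout: (name, first sector, last sector), both inclusive.
HYPOTHETICAL_PARTITIONS = [
    ("sda1", 2048, 1050623),
    ("sda2", 1050624, 976773119),
]


def find_partition(lba, partitions):
    """Return the name of the partition containing the LBA, or None."""
    for name, first, last in partitions:
        if first <= lba <= last:
            return name
    return None


offset = lba_to_byte_offset(26620672)
print(offset, f"{offset / 2**30:.2f} GiB")  # 13629784064 12.69 GiB
print(find_partition(26620672, HYPOTHETICAL_PARTITIONS))
```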
The error message now tells us that an error called UNC occurred at this LBA. UNC is shorthand for UNCorrectable, which means the data read from the hard drive at this LBA was damaged and could not be corrected.
Hard drives do not store only your data itself: they also automatically compute a so-called error-correction code (ECC). While there are many subtypes of those mathematical codes, they have one aspect in common: given a set of bytes (e.g. the ones stored on the hard drive) which might be slightly damaged (i.e. some 0-bits are now 1-bits or vice versa) and the matching ECC (consisting of a few extra bytes), a suitable decoder can recover a limited number of bit errors. In most cases, an ECC can also detect more errors than it can correct. For example, one specific ECC might be able to correct one bit flip in two bytes but detect up to three bit flips in two bytes.
If there are more bit flips than the ECC can recover (but not more than it can detect), this results in an unrecoverable error: the UNC. If there are more bit flips than the ECC can detect, anything might happen: usually the decoder returns damaged data as if it had been corrected, or no error is detected at all.
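To make the three cases concrete, here is a toy example using an extended Hamming(8,4) code, which can correct one flipped bit and detect two. This code is chosen purely for illustration; real hard drives use much stronger codes over whole sectors, and the function names here are made up.

```python
def encode(data_bits):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d1, d2, d3, d4
    word[1] = d1 ^ d2 ^ d4      # parity over positions 3, 5, 7
    word[2] = d1 ^ d3 ^ d4      # parity over positions 3, 6, 7
    word[4] = d2 ^ d3 ^ d4      # parity over positions 5, 6, 7
    word[0] = sum(word[1:]) % 2  # overall parity bit
    return word


def decode(word):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'UNC'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        status = "UNC"
    return status, [word[3], word[5], word[6], word[7]]


def flip(word, *positions):
    """Return a copy of the codeword with the given bit positions flipped."""
    damaged = list(word)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


data = [1, 0, 1, 1]
codeword = encode(data)
print(decode(flip(codeword, 5)))        # one flip: corrected, data intact
print(decode(flip(codeword, 3, 5)))     # two flips: reported as UNC
print(decode(flip(codeword, 3, 5, 6)))  # three flips: "corrected" to wrong data
```

The last call shows the dangerous case: with more flips than the code can detect, the decoder reports success but returns data that differs from what was written.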
Note that this explanation is highly simplified. For example, an ECC is not necessarily stored as bytes separate from the data. Instead, a mathematical function can be computed on the data, resulting in a set of bytes that is larger than the original dataset and contains both the data itself and the extra error-recovery data. In other words, the ECC data and the data itself are mixed together.
This has multiple consequences for the interpretation. Firstly, this means that physically the data could be read, yet it does not seem to be correct. This means
Other error messages
While UNC errors occur reasonably often, there are other, rarer errors for which you won’t find much documentation.
There is one definitive source for all smartctl error messages: the smartmontools source code.
We can find the error descriptions in ataprint.cpp (also see the GPL license information in the source tarball):
const char *abrt  = "ABRT";  // ABORTED
const char *amnf  = "AMNF";  // ADDRESS MARK NOT FOUND
const char *ccto  = "CCTO";  // COMMAND COMPLETION TIMED OUT
const char *eom   = "EOM";   // END OF MEDIA
const char *icrc  = "ICRC";  // INTERFACE CRC ERROR
const char *idnf  = "IDNF";  // ID NOT FOUND
const char *ili   = "ILI";   // MEANING OF THIS BIT IS COMMAND-SET SPECIFIC
const char *mc    = "MC";    // MEDIA CHANGED
const char *mcr   = "MCR";   // MEDIA CHANGE REQUEST
const char *nm    = "NM";    // NO MEDIA
const char *obs   = "obs";   // OBSOLETE
const char *tk0nf = "TK0NF"; // TRACK 0 NOT FOUND
const char *unc   = "UNC";   // UNCORRECTABLE
const char *wp    = "WP";    // WRITE PROTECTED
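If you want to annotate error log lines automatically, the abbreviations above can be put into a lookup table. The following is a minimal sketch: the descriptions are taken from the source comments above, while the regular expression is an assumption about the line format (one or more comma-separated abbreviations after "Error:", optionally followed by "at LBA = ..."), based on the example in this post.

```python
import re

# Abbreviation -> description, from the comments in ataprint.cpp quoted above.
ERROR_FLAGS = {
    "ABRT": "aborted",
    "AMNF": "address mark not found",
    "CCTO": "command completion timed out",
    "EOM": "end of media",
    "ICRC": "interface CRC error",
    "IDNF": "ID not found",
    "ILI": "meaning of this bit is command-set specific",
    "MC": "media changed",
    "MCR": "media change request",
    "NM": "no media",
    "obs": "obsolete",
    "TK0NF": "track 0 not found",
    "UNC": "uncorrectable",
    "WP": "write protected",
}

# Assumed line format; adjust if your smartctl output looks different.
ERROR_LINE = re.compile(
    r"Error: (?P<flags>[A-Za-z0-9, ]+?)"
    r"(?: at LBA = 0x[0-9a-fA-F]+ = (?P<lba>\d+))?\s*$"
)


def explain_error_line(line):
    """Return ([(abbreviation, description), ...], lba or None), or None."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None
    flags = [flag.strip() for flag in match.group("flags").split(",")]
    lba = int(match.group("lba")) if match.group("lba") else None
    return [(flag, ERROR_FLAGS.get(flag, "unknown")) for flag in flags], lba


line = "40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672"
print(explain_error_line(line))
```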
Realistically, you’ll only encounter a few of these errors even if you work with hard disks professionally. Some of these errors, like NM, are also related to hot-swapping of hard drives and do not necessarily represent errors related to the health of the hard drive itself.
One important error is ICRC, the interface CRC error. This means that errors are being detected on the IDE/SATA or PCIe bus the hard drive is connected to. Although this is rare and might be caused by the HDD itself, it might also mean that your chipset (the hardware controlling e.g. SATA) is damaged. In that case, replacing the hard drive would not fix the issue. An intermittent cable connection is another possible cause.
How severe are those errors?
Over the life of most hard drives, especially consumer models, errors will occur, more often so in portable devices, where high acceleration forces are more likely to be encountered.
What separates a good hard drive from one at the end of its life (excluding those that fail without warning) is often the frequency of new errors. If you look at the total lifetime of the HDD, i.e. Power_On_Hours or similar:
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 8586
and compare the value (in this case 8586) with the lifetime at the last error,
Error 8911 occurred at disk power-on lifetime: 7257 hours
in this case 7257, you can see that over a thousand HDD operational hours have passed since the last error. This suggests that there is no mechanical defect which could result in destruction of the hard drive, but rather a couple of defective or damaged sectors. UNC errors do not even necessarily mean that the sectors are physically damaged.
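This comparison is easy to script. The sketch below parses the two lines quoted above; it assumes the raw value of Power_On_Hours is a plain number of hours in the last column, which may not hold for every drive. The function names are made up for illustration.

```python
import re


def power_on_hours(attribute_row):
    """Extract the raw value (last column) from a Power_On_Hours row."""
    # Assumption: the raw value is a plain integer number of hours.
    return int(attribute_row.split()[-1])


def last_error_hours(error_header):
    """Extract the power-on lifetime from an error log header line."""
    match = re.search(r"power-on lifetime: (\d+) hours", error_header)
    if match is None:
        raise ValueError("not an error log header line")
    return int(match.group(1))


attribute_row = "9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 8586"
error_header = "Error 8911 occurred at disk power-on lifetime: 7257 hours"

hours_since_last_error = power_on_hours(attribute_row) - last_error_hours(error_header)
print(hours_since_last_error)  # 1329
```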
Hard drive errors are often triggered when files that are accessed very rarely (such as archived video files that are only opened every few years) are read. When there are enough bit flips in such files for any reason, this can result in a larger number of HDD errors appearing at once.
Another indicator is the total number of errors the hard drive has encountered, i.e.
Error 8911 occurred at disk power-on lifetime: 7257 hours
ATA Error Count: 8911 (device log contains only the most recent five errors)
While this number is not shown for all hard drives, a very high number or a number which is growing rapidly indicates there is some physical issue with the drive. Issues relating to only a few bad sectors induce a sudden jump in the error counter, but after that the counter stops growing. Note, however, that there can be other reasons for a high error counter, for example a bad or intermittent physical connection to the hard drive.
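To judge whether the counter is growing rapidly, you can record the error count together with Power_On_Hours from time to time and compare two readings. In this minimal sketch, the later reading uses the values from this post, while the earlier reading is hypothetical and only there to make the example runnable.

```python
import re


def ata_error_count(line):
    """Extract the number from an 'ATA Error Count: N' line."""
    match = re.search(r"ATA Error Count: (\d+)", line)
    if match is None:
        raise ValueError("not an ATA Error Count line")
    return int(match.group(1))


def errors_per_hour(earlier, later):
    """Each reading is (power_on_hours, error_count)."""
    hours = later[0] - earlier[0]
    if hours <= 0:
        raise ValueError("readings must be taken at different power-on hours")
    return (later[1] - earlier[1]) / hours


count_line = "ATA Error Count: 8911 (device log contains only the most recent five errors)"
later_reading = (8586, ata_error_count(count_line))
earlier_reading = (8000, 8900)  # hypothetical earlier reading

rate = errors_per_hour(earlier_reading, later_reading)
print(f"{rate:.4f} new errors per power-on hour")
```

A rate that stays near zero after an initial jump fits the "few bad sectors" pattern described above, while a steadily high rate points to an ongoing problem.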
Also see this previous post on how to fix bad HDD sectors.