How to interpret smartctl messages like ‘Error: UNC at LBA’?

When running smartctl on your hard drive, you often get a plethora of information that can be hard to interpret for inexperienced users. This post tries to help you understand the technical reasons behind the error messages. If you’re looking for advice on whether to replace your hard drive, the only guidance I can give you is that it might fail at any time, so you had better back up your data, but it might also run for many years to come. Furthermore, this article does not describe basic SMART WHEN_FAILED checking, but rather the interpretation of more subtle signs of possibly impending HDD failure.

One example that is particularly hard to interpret is the device error log, which stores the last few errors, for example:

Error 8910 occurred at disk power-on lifetime: 7257 hours (302 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 18 00 33 96 40 00      03:09:52.125  READ FPDMA QUEUED
  60 88 10 50 06 11 40 00      03:09:52.125  READ FPDMA QUEUED
  60 08 08 60 ac 5e 40 00      03:09:52.113  READ FPDMA QUEUED
  60 08 00 48 cf 6d 40 00      03:09:52.099  READ FPDMA QUEUED
  60 90 f0 b0 ef e5 40 00      03:09:52.065  READ FPDMA QUEUED

Obviously, the first line shows when this error occurred. The other lines, however, are not as obvious. Let’s examine the next section:

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672

 

While this section also shows the content of some registers at the time the error occurred, the interesting part is the error description Error: UNC at LBA = 0x01963300 = 26620672.
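If you are curious how the register columns relate to the error description, the following minimal sketch decodes the row shown above. It assumes the classic 28-bit ATA register layout (SN, CL and CH hold the lower 24 address bits, the low nibble of DH the upper 4), that bit 6 of the error register ER is the UNC flag, and that bit 0 of the status register ST is the general error flag. The function names are my own and are not part of smartctl.

```python
def parse_register_row(row: str) -> dict:
    """Parse the hex columns 'ER ST SC SN CL CH DH' of one error log row."""
    names = ["ER", "ST", "SC", "SN", "CL", "CH", "DH"]
    values = [int(token, 16) for token in row.split()[:len(names)]]
    return dict(zip(names, values))


def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine the address registers into an LBA (assumed 28-bit layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


row = "40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672"
regs = parse_register_row(row)

lba = lba28_from_registers(regs["SN"], regs["CL"], regs["CH"], regs["DH"])
unc_bit_set = bool(regs["ER"] & 0x40)     # bit 6 of ER: UNC
error_flag_set = bool(regs["ST"] & 0x01)  # bit 0 of ST: ERR

print(hex(lba), lba, unc_bit_set, error_flag_set)
```

For this row, the decoded address matches the LBA that smartctl prints, so in practice you can rely on the error description and skip the registers.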

An LBA is a logical block address, i.e. the index of a logical sector on the hard drive. It is shown both in hexadecimal form (0x01963300) and in decimal form (26620672). To convert it to a byte address, you need to multiply it by the logical sector size listed at the head of the smartctl output:

Sector Size:      512 bytes logical/physical

In almost all cases, this value is 512 bytes, so in this example the byte offset is 26620672 * 512 = 13629784064 bytes, or about 12.69 GiB. In some cases it might be helpful to look up this address in a tool like GParted to see which partition the error occurred in. Also see this smartmontools HOWTO describing this process in detail.
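The conversion can be sketched in a few lines of Python. The partition table in this example is purely hypothetical (invented names and sector ranges) and only illustrates the kind of lookup you would otherwise do by eye in GParted; substitute the values of your own disk.

```python
def lba_to_byte_offset(lba: int, sector_size: int = 512) -> int:
    """Convert a logical block address to a byte offset on the disk."""
    return lba * sector_size


def find_partition(lba: int, partitions):
    """Return the name of the partition whose sector range contains lba."""
    for name, first_sector, last_sector in partitions:
        if first_sector <= lba <= last_sector:
            return name
    return None


# HYPOTHETICAL layout (name, first sector, last sector), not from a real disk.
HYPOTHETICAL_PARTITIONS = [
    ("sda1", 2048, 1050623),
    ("sda2", 1050624, 976773134),
]

lba = 26620672
offset = lba_to_byte_offset(lba)
print(offset, f"{offset / 2**30:.2f} GiB")
print(find_partition(lba, HYPOTHETICAL_PARTITIONS))
```

If your drive reports a logical sector size other than 512 bytes, pass it as the second argument to lba_to_byte_offset.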

UNC errors

The error message now tells us that an error called UNC occurred at this LBA. UNC is shorthand for UNCorrectable, which means that the data read from the hard drive at this LBA was damaged and could not be corrected.

Hard drives not only store your data itself, but also automatically compute a so-called error-correction code (ECC). While there are many types of these mathematical codes, they have one aspect in common: given a set of bytes (e.g. the ones stored on the hard drive) which might be slightly damaged (i.e. some 0-bits are now 1-bits or vice versa) and the matching ECC (consisting of a few extra bytes), a suitable decoder can correct a limited number of bit errors. In most cases, an ECC can also detect more errors than it can correct. For example, one specific ECC might be able to correct one bit flip in two bytes, but detect up to three bit flips in two bytes.

If there are more bit flips than the ECC can correct (but not more than it can detect), the result is an unrecoverable error: the UNC. If there are more bit flips than the ECC can detect, anything might happen. Usually, the data reconstructed by the decoder will be damaged, or the error might not be detected at all.
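To make the three cases (corrected, detected but uncorrectable, not detected) more tangible, here is a toy example using an extended Hamming code that protects 4 data bits with 4 check bits. This is only an illustration of the principle. Real hard drives use far stronger codes that work on whole sectors, and I am not claiming any drive uses this particular code.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]  # positions 1..7
    p0 = sum(word) % 2  # overall parity bit, stored at index 0
    return [p0] + word


def decode(code):
    """Return (data, status); data is None if the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all 1-bits
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: the decoder assumes exactly one and fixes it.
        code[syndrome] ^= 1  # syndrome 0 means the parity bit p0 itself
        status = "corrected"
    else:
        # Non-zero syndrome but even parity: two flips, cannot be fixed.
        return None, "uncorrectable"
    return [code[3], code[5], code[6], code[7]], status


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    damaged = list(code)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


original = [1, 0, 1, 1]
stored = encode(original)

print(decode(stored))                 # no damage
print(decode(flip(stored, 5)))        # one flip: corrected
print(decode(flip(stored, 5, 6)))     # two flips: detected, not correctable
print(decode(flip(stored, 3, 5, 6)))  # three flips: wrong data, not noticed
```

The two-flip case is the toy equivalent of a UNC error: the decoder knows that something is wrong but cannot repair it. In the three-flip case the decoder believes it has fixed a single error and hands back wrong data.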

Note that this explanation is highly simplified. For example, depending on the code, the ECC is not necessarily stored as bytes separate from the data. Instead, a mathematical function may be computed on the data, resulting in a set of bytes that is larger than the original dataset and contains both the data itself and the extra error-recovery information. In other words, the ECC data and the data itself are mixed together.

This has consequences for the interpretation: a UNC error means that the data could physically be read from the disk, yet it does not appear to be correct.

Other error messages

While UNC errors occur reasonably often, there are other, rarer errors for which you won’t find much documentation.

There is one definitive source for all smartctl error messages: The smartmontools source code.

We can find the error descriptions in ataprint.cpp (also see the GPL license information in the source tarball):

const char *abrt  = "ABRT";  // ABORTED
const char *amnf  = "AMNF";  // ADDRESS MARK NOT FOUND
const char *ccto  = "CCTO";  // COMMAND COMPLETION TIMED OUT
const char *eom   = "EOM";   // END OF MEDIA
const char *icrc  = "ICRC";  // INTERFACE CRC ERROR
const char *idnf  = "IDNF";  // ID NOT FOUND
const char *ili   = "ILI";   // MEANING OF THIS BIT IS COMMAND-SET SPECIFIC
const char *mc    = "MC";    // MEDIA CHANGED
const char *mcr   = "MCR";   // MEDIA CHANGE REQUEST
const char *nm    = "NM";    // NO MEDIA
const char *obs   = "obs";   // OBSOLETE
const char *tk0nf = "TK0NF"; // TRACK 0 NOT FOUND
const char *unc   = "UNC";   // UNCORRECTABLE
const char *wp    = "WP";    // WRITE PROTECTED

Realistically, you’ll only encounter a few of these errors even if you are working with hard disks professionally. Some of these errors, like MC, MCR or NM, are also related to hot-swapping of hard drives and do not necessarily represent errors related to hard drive health itself.
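If you want to annotate error lines automatically, a small lookup table built from the abbreviations listed above is enough. The following is a minimal sketch; the function name and the parsing approach are my own, and I only assume the line format shown in this post (the abbreviations appear between "Error:" and "at LBA").

```python
import re

# Abbreviations and meanings as listed in the smartmontools source excerpt.
ATA_ERROR_NAMES = {
    "ABRT": "ABORTED",
    "AMNF": "ADDRESS MARK NOT FOUND",
    "CCTO": "COMMAND COMPLETION TIMED OUT",
    "EOM": "END OF MEDIA",
    "ICRC": "INTERFACE CRC ERROR",
    "IDNF": "ID NOT FOUND",
    "ILI": "MEANING OF THIS BIT IS COMMAND-SET SPECIFIC",
    "MC": "MEDIA CHANGED",
    "MCR": "MEDIA CHANGE REQUEST",
    "NM": "NO MEDIA",
    "obs": "OBSOLETE",
    "TK0NF": "TRACK 0 NOT FOUND",
    "UNC": "UNCORRECTABLE",
    "WP": "WRITE PROTECTED",
}


def explain_error_line(line: str):
    """Return (abbreviation, meaning) pairs found in a smartctl error line."""
    _, found, tail = line.partition("Error:")
    if not found:
        return []
    tail = tail.split(" at LBA")[0]  # ignore the address part of the line
    tokens = re.findall(r"[A-Za-z0-9]+", tail)
    return [(tok, ATA_ERROR_NAMES[tok]) for tok in tokens if tok in ATA_ERROR_NAMES]


line = "40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672"
print(explain_error_line(line))
```

Tokens that are not in the table are silently skipped, so the function degrades gracefully on lines it does not understand.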

One important error is ICRC, the interface CRC error. It means that errors are being detected on the IDE/SATA link the hard drive is connected to, i.e. during the transfer between the drive and the controller. Although this is rare and might be caused by the HDD itself, it might also mean that your chipset (the hardware controlling e.g. SATA) is damaged, in which case replacing the hard drive would not fix the issue. Another possible cause is a faulty or intermittent cable connection.

How severe are those errors?

Over the life of most hard drives, especially consumer models, errors will occur, and more often so in portable devices where high acceleration forces are more likely to be encountered.

What separates a good hard drive from one at the end of its life (excluding those that fail without warning) is often the frequency of new errors. If you look at the total lifetime of the HDD, i.e. Power_On_Hours or similar:

9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       8586

and compare the value (in this case 8586) with the lifetime at the last error,

Error 8911 occurred at disk power-on lifetime: 7257 hours

in this case 7257, you can see that over a thousand operational hours have passed since the last error. This suggests that there is no mechanical defect which could result in destruction of the hard drive, but rather a couple of defective or damaged sectors. UNC errors do not even necessarily mean that the sectors are physically damaged.
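This comparison is easy to script. The following minimal sketch extracts both values from the two lines shown above and computes the difference. It assumes that the raw value of Power_On_Hours is a plain integer in the last column, which is the case in this example but may differ between drives; the function names are my own.

```python
import re


def power_on_hours(attribute_line: str) -> int:
    """Return the raw value (last column) of the Power_On_Hours line."""
    return int(attribute_line.split()[-1])


def error_lifetime_hours(error_line: str):
    """Return the power-on lifetime (in hours) recorded for an error."""
    match = re.search(r"power-on lifetime:\s*(\d+)\s*hours", error_line)
    return int(match.group(1)) if match else None


attribute_line = "9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       8586"
error_line = "Error 8911 occurred at disk power-on lifetime: 7257 hours"

hours_since_last_error = power_on_hours(attribute_line) - error_lifetime_hours(error_line)
print(hours_since_last_error)
```

For the values in this post, the result is 1329 hours.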

Often, hard drive errors are triggered when files that are accessed very rarely (such as archived video files that are only opened every few years) are read again. When enough bit flips have accumulated in such files for any reason, this can result in a larger number of HDD errors appearing at once.

Another indicator is the total number of errors the hard drive has encountered, i.e. 8911 in

Error 8911 occurred at disk power-on lifetime: 7257 hours

or in

ATA Error Count: 8911 (device log contains only the most recent five errors)

While this number is not shown for all hard drives, a very high number or a rapidly growing number indicates that there is some physical issue with the drive. Issues relating to only a few bad sectors induce a sudden jump in the error counter, but after that the counter stays more or less constant. Note, however, that there can be other reasons for a high error counter, for example a bad or intermittent physical connection to the hard drive.
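To judge whether the counter is growing, you can compare two observations of the error count and the power-on hours. The sketch below uses only the numbers from this post (error 8911 at 7257 hours, still 8911 errors at 8586 hours); the helper names are my own, and the idea of an errors-per-hour rate is just one simple way to quantify "growing rapidly".

```python
import re


def ata_error_count(smartctl_output: str):
    """Extract the total error count from smartctl output, if present."""
    match = re.search(r"ATA Error Count:\s*(\d+)", smartctl_output)
    return int(match.group(1)) if match else None


def errors_per_hour(count_then, hours_then, count_now, hours_now):
    """Average number of new errors per power-on hour between two observations."""
    elapsed = hours_now - hours_then
    if elapsed <= 0:
        return None  # observations are not in chronological order
    return (count_now - count_then) / elapsed


summary = "ATA Error Count: 8911 (device log contains only the most recent five errors)"
count_now = ata_error_count(summary)

# Error 8911 was logged at 7257 hours; the drive is now at 8586 hours.
rate = errors_per_hour(8911, 7257, count_now, 8586)
print(count_now, rate)
```

A rate of zero over more than a thousand hours matches the pattern described above: a jump in the counter, followed by a long quiet period.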

Also see this previous post on how to fix bad HDD sectors.