[GTALUG] nvme SSD: critical medium error, dev nvme0n1

Tue Aug 6 12:32:01 EDT 2019

On Mon, Jul 29, 2019 at 01:28:32PM -0400, Stewart C. Russell via talk wrote:
> I'm guessing this is bad, right?
> 
>     [Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089600 flags 80700
>     [Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089744 flags 0
> 
> Is it an oh-shit-get-yerself-a-new-drive-NOW thing, or …?
> 
> Drive is a 2+ year old Intel 512 GB SSD. Not entirely sure what the
> right diagnostics are for SSDs. Filesystem is showing clean but touching
> certain known-bad files triggers the error in the system log.
> 
> Dunno if these nvme stats are useful:
> 
>     Smart Log for NVME device:nvme0 namespace-id:ffffffff
>     critical_warning                    : 0
>     temperature                         : 25 C
>     available_spare                     : 85%
>     available_spare_threshold           : 10%
>     percentage_used                     : 1%
>     data_units_read                     : 10,349,479
>     data_units_written                  : 10,098,299
>     host_read_commands                  : 183,018,841
>     host_write_commands                 : 136,702,227
>     controller_busy_time                : 1,342
>     power_cycles                        : 201
>     power_on_hours                      : 15,722
>     unsafe_shutdowns                    : 10
>     media_errors                        : 803
>     num_err_log_entries                 : 844
>     Warning Temperature Time            : 0
>     Critical Composite Temperature Time : 0
>     Thermal Management T1 Trans Count   : 0
>     Thermal Management T2 Trans Count   : 0
>     Thermal Management T1 Total Time    : 0
>     Thermal Management T2 Total Time    : 0
> 
> Any suggestions, please, for:
> 
> * what I should be looking for in stats (nvme smart-log-add doesn't give
> me anything at all, so no wear-levelling stats)
> 
> * a decent brand to replace it with. I'm likely okay with a SATA SSD.

So according to Intel's datasheet:

Media Errors: Contains the number of occurrences where the controller
detected an unrecovered data integrity error. Errors such as uncorrectable
ECC, CRC checksum failure, or LBA tag mismatch are included in this field.
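
If you want to keep an eye on those counters over time, here is a rough
Python sketch that picks the numeric fields out of the text that
"nvme smart-log" prints. The sample text is trimmed from your output above;
the function names and the three checks are just my own picks for
illustration, not anything Intel or nvme-cli defines. The 512,000 bytes per
data unit is how I read the NVMe spec (units are thousands of 512-byte
blocks), so treat that as an assumption too.

```python
import re

# Matches lines like "media_errors : 803", "available_spare : 85%" or
# "temperature : 25 C". The header line (it contains "device:nvme0") is
# skipped because what follows its first colon is not a number.
FIELD_RE = re.compile(r"^\s*(\w[\w ]*?)\s*:\s*(\d[\d,]*)\s*(?:%|C)?\s*$")

def parse_smart_log(text):
    """Turn `nvme smart-log` text output into a {field name: int} dict."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            fields[m.group(1)] = int(m.group(2).replace(",", ""))
    return fields

def worrying_fields(fields):
    """Hypothetical checks: the thresholds here are my choice, not a standard."""
    notes = []
    if fields.get("critical_warning", 0) != 0:
        notes.append("critical_warning is set")
    if fields.get("available_spare", 100) <= fields.get("available_spare_threshold", 0):
        notes.append("spare blocks at or below threshold")
    if fields.get("media_errors", 0) > 0:
        notes.append("media_errors = %d" % fields["media_errors"])
    return notes

def bytes_written(fields):
    """Assumes one data unit is 1000 blocks of 512 bytes."""
    return fields["data_units_written"] * 512_000

# Trimmed copy of the smart log quoted in this thread.
SAMPLE = """\
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 25 C
available_spare                     : 85%
available_spare_threshold           : 10%
percentage_used                     : 1%
data_units_written                  : 10,098,299
media_errors                        : 803
num_err_log_entries                 : 844
"""

fields = parse_smart_log(SAMPLE)
for note in worrying_fields(fields):
    print(note)
```

Running it now and again and comparing media_errors between runs would at
least tell you whether the count is still climbing.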

Now, could that mean the drive simply has a bad block that fails every
time it is read, and that rewriting that block would get it remapped and
fix the problem?  Could be.

Is it the same sector numbers each time you see a log message?
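
One way to check would be something like the following Python sketch, which
counts how often each sector shows up in the kernel log. It assumes you feed
it dmesg-style text with each message on one line, and that the sector
numbers the kernel prints are in 512-byte units (I believe that is the case
for the block layer, but do verify before poking at the raw device). The
function names are made up for this example.

```python
import re
from collections import Counter

# Pulls the device name and sector number out of lines such as
# "print_req_error: critical medium error, dev nvme0n1, sector 296089600 flags 0".
ERR_RE = re.compile(r"critical medium error,\s+dev (\S+), sector (\d+)")

def failing_sectors(log_text):
    """Count occurrences of each (device, sector) pair in the log text."""
    return Counter((dev, int(sec)) for dev, sec in ERR_RE.findall(log_text))

def byte_offset(sector, sector_size=512):
    """Byte offset on the device, assuming 512-byte kernel sectors."""
    return sector * sector_size

# The two messages quoted in this thread, unwrapped to one line each.
SAMPLE_LOG = """\
[Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089600 flags 80700
[Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089744 flags 0
"""

counts = failing_sectors(SAMPLE_LOG)
for (dev, sector), n in sorted(counts.items()):
    print(dev, sector, "seen", n, "time(s), byte offset", byte_offset(sector))
```

If the same handful of sectors keeps coming back, that points at a few bad
blocks; if new sector numbers keep appearing, that looks more like a drive
on its way out.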

-- 
Len Sorensen