Sunday, December 21, 2008

Linux Hard Disk Issue - Excessive Load_Cycle_Count

Prologue: My two-year-old laptop's hard disk died a few weeks ago, and I replaced it with a Seagate Momentus 5400.3 160GB SATA drive. I have been running only Linux on my laptop for a long time (Ubuntu Intrepid at present).

While trying to ascertain the cause of this premature death, I noticed an abnormally high Load_Cycle_Count. This can be checked with smartmontools by issuing the command

sudo smartctl -n standby -a /dev/sda
where /dev/sda has to be replaced with the appropriate disk name. The option -n standby ensures that smartctl doesn't wake the disk up if it is already in standby.

A little bit of Googling turned up quite a lot about this issue. Laptop hard disks, in order to improve power efficiency on battery, have quite aggressive power management enabled by default. That in itself is not bad: when the disk has not been accessed for some time, it spins itself down and stops drawing power unnecessarily. The trouble is that no sooner has the disk spun down than something causes it to spin up again. This not only defeats the whole purpose of the spin-down, but also causes unnecessary wear and tear on the disk components.

Most modern HDDs park the heads on a ramp beside the platter when the disk spins down (an unload) and move them back over the platter once it spins up again (a load). Each load/unload cycle wears the mechanism a little. Laptop drives such as this Seagate are typically rated for around 600,000 load/unload cycles, which is quite high. But in my case the Load_Cycle_Count was increasing at a rate of about 5-6 per minute, which meant the heads were parking and unparking every 10 seconds or so on average. At that rate the rated cycles would be used up in roughly 70 to 85 days of continuous running. This was quite alarming.
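To put a number on the rate, one option is to read the raw Load_Cycle_Count twice, a minute apart, and take the difference. The sketch below assumes the usual ten-column attribute table printed by smartctl -A, with the raw value in the last column. The function names are made up for this example, and the rated cycle limit has to come from your own drive's datasheet.

```shell
#!/bin/sh
# Hypothetical helpers (the names are mine, not part of smartmontools).

# Print the raw Load_Cycle_Count from "smartctl -A" output read on stdin.
# Assumes the standard attribute table, where column 2 is the attribute
# name and column 10 is RAW_VALUE.
load_cycle_count() {
    awk '$2 == "Load_Cycle_Count" { print $10 }'
}

# Estimate whole days of continuous running left before the rated limit.
#   $1 = current count, $2 = cycles per minute (non-zero), $3 = rated limit
days_until_limit() {
    echo $(( ($3 - $1) / ($2 * 60 * 24) ))
}

# Usage (needs root and a real disk, so it is left commented out):
#   before=$(sudo smartctl -n standby -A /dev/sda | load_cycle_count)
#   sleep 60
#   after=$(sudo smartctl -n standby -A /dev/sda | load_cycle_count)
#   echo "cycles in the last minute: $((after - before))"
```

If the disk happens to be in standby, -n standby makes smartctl skip the check, so the helper prints nothing and the reading should simply be retried later.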
To stop this insane behavior, I set the Advanced Power Management level to 254 using hdparm. A value of 254 means the least aggressive power management; by default Ubuntu sets it to 128. This did stop the Load_Cycle_Count from climbing, but the disk now stopped spinning down, and its temperature shot up. Within an hour it rose above 60 °C (room temperature was around 20 °C). That is even more alarming than the increasing load cycle count: the rated maximum operating temperature for my drive is 60 °C, and running hot severely shortens the life of a disk. At a power management value of 180 the temperature settled at around 55 °C. This was better but still not good, as the disk was 35 °C above ambient. In peak summer the ambient temperature in Kolkata hovers around 40 °C, so my disk would get fried if I used my laptop in a room without air conditioning.
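For reference, the APM level is read and set with hdparm -B. According to the hdparm man page, levels 1 to 127 permit spin-down, 128 to 254 do not, and 255 turns APM off on drives that allow it. The helper names below are invented for illustration, and the temperature parser assumes the drive reports attribute 194 as Temperature_Celsius, which not every drive does.

```shell
#!/bin/sh
# Commands that touch the drive (need root; shown as comments):
#   sudo hdparm -B /dev/sda        # show the current APM level
#   sudo hdparm -B 254 /dev/sda    # least aggressive power management
#   sudo smartctl -A /dev/sda | drive_temp_celsius

# Describe what an APM level means, following the ranges in hdparm(8).
apm_describe() {
    if [ "$1" -ge 1 ] && [ "$1" -le 127 ]; then
        echo "spin-down permitted"
    elif [ "$1" -ge 128 ] && [ "$1" -le 254 ]; then
        echo "spin-down not permitted"
    elif [ "$1" -eq 255 ]; then
        echo "APM disabled"
    else
        echo "not a valid APM level"
    fi
}

# Print the drive temperature from "smartctl -A" output read on stdin.
# Assumes the attribute is named Temperature_Celsius and that the reading
# is the first token of the raw value (column 10).
drive_temp_celsius() {
    awk '$2 == "Temperature_Celsius" { print $10 }'
}
```

A level set this way generally does not survive a power cycle, so it has to be reapplied at boot or left to a tool that manages it.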
So preventing the disk from spinning down is not a solution. What has to be ensured is that once the disk spins down, it stays that way for as long as possible.
I needed to find out what was accessing the disk so frequently. iotop is a nice utility for this. wpa-supplicant was at the top of the list: I am using a wireless connection, and wpa-supplicant logs something frequently. Next was gconf-d, followed by gnome-do and console-kit-daemon. As soon as the disk spins down, one of these tries to do a read or write, causing the disk to spin up again. On top of that, every time the disk is accessed, kjournald writes the filesystem journal, and the atime, ctime and mtime of the file inodes get updated. Together these keep the disk busy all the time and wake it up as soon as it tries to catch a nap.
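One way to see which processes keep touching the disk is to run iotop in batch mode for a minute or two and tally how often each command shows up. This is only a sketch: it assumes an iotop version whose batch output ends each process line with two percentage columns (SWAPIN and IO) followed by the command, and count_io_commands is a name invented here.

```shell
#!/bin/sh
# Tally how many times each command appears in iotop batch output (stdin).
# The command name is taken to be the field right after the second "%"
# token on a line; header lines contain no "%" token and are skipped.
count_io_commands() {
    awk '{
        seen = 0
        for (i = 1; i < NF; i++) {
            if ($i == "%" && ++seen == 2) {
                count[$(i + 1)]++
                break
            }
        }
    }
    END { for (c in count) print count[c], c }' | sort -rn
}

# Usage (needs root; -o shows only processes doing I/O, -b is batch mode,
# -d 10 samples every 10 seconds, -n 12 stops after 12 samples):
#   sudo iotop -o -b -d 10 -n 12 | count_io_commands
```

Commands that appear in most samples are the ones most likely to be waking the disk.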

However there is a utility called laptop-mode-tools which performs some tweaks and tries to keep the hard disk in standby mode as long as possible.
To enable it, first install laptop-mode-tools.

sudo apt-get install laptop-mode-tools
Then it has to be enabled in /etc/default/acpi-support by changing the line
ENABLE_LAPTOP_MODE=false
to
ENABLE_LAPTOP_MODE=true
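The same change can be made non-interactively with sed. This is a minimal sketch assuming GNU sed (for -i) and that the setting appears exactly as ENABLE_LAPTOP_MODE=false on a line of its own. The function name is made up, and it takes the file path as an argument so it can be tried on a copy first.

```shell
#!/bin/sh
# Flip ENABLE_LAPTOP_MODE from false to true in the given file, in place.
#   $1 = path to the file (for example a copy of /etc/default/acpi-support)
enable_laptop_mode() {
    sed -i 's/^ENABLE_LAPTOP_MODE=false$/ENABLE_LAPTOP_MODE=true/' "$1"
}

# Usage on a scratch copy:
#   cp /etc/default/acpi-support /tmp/acpi-support
#   enable_laptop_mode /tmp/acpi-support
#
# Or directly on the real file, keeping a .bak backup (needs root):
#   sudo sed -i.bak 's/^ENABLE_LAPTOP_MODE=false$/ENABLE_LAPTOP_MODE=true/' \
#       /etc/default/acpi-support
```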
I also tweaked the configuration file, /etc/laptop-mode/laptop-mode.conf, to optimize things as far as possible:
###### Config file for laptop-mode-tools
## Verbose output on
VERBOSE_OUTPUT=1
## Laptop mode enabled always
ENABLE_LAPTOP_MODE_ON_BATTERY=1
ENABLE_LAPTOP_MODE_ON_AC=1
ENABLE_LAPTOP_MODE_WHEN_LID_CLOSED=1

# When to enable data loss sensitive features
# -------------------------------------------
#
# When data loss sensitive features are disabled, laptop mode tools acts as if
# laptop mode were disabled, for those features only.
#
# Data loss sensitive features include:
# - laptop_mode (i.e., delayed writes)
# - hard drive write cache
#
# All of the options that follow can be set to 0 in order to prevent laptop
# mode tools from using them to stop data loss sensitive features. Use this
# when you have a battery that reports the wrong information, that confuses
# laptop mode tools.
#
# Disabling data loss sensitive features is ACPI-ONLY.
# Disable all data loss sensitive features when the battery level (in % of the
# battery capacity) reaches this value.
#
MINIMUM_BATTERY_CHARGE_PERCENT=3
# Disable data loss sensitive features when the battery reports its state
# as "critical".
#
DISABLE_LAPTOP_MODE_ON_CRITICAL_BATTERY_LEVEL=1

# The drives that laptop mode controls.
# Separate them by a space, e.g. HD="/dev/hda /dev/hdb". The default is a
# wildcard, which will get you all your IDE and SCSI/SATA drives.
#
HD="/dev/[hs]d[abcdefgh]"
# The partitions (or mount points) that laptop mode controls.
# Separate the values by spaces. Use "auto" to indicate all partitions on drives
# listed in HD. You can add things to "auto", e.g. "auto /dev/hdc3". You can
# also specify mount points, e.g. "/mnt/data".
#
PARTITIONS="auto /dev/mapper/*"
ASSUME_SCSI_IS_SATA=1

# Maximum time, in seconds, of work that you are prepared to lose when your
# system crashes or power runs out. This is the maximum time that Laptop Mode
# will keep unsaved data waiting in memory before spinning up your hard drive.
#
LM_BATT_MAX_LOST_WORK_SECONDS=900
LM_AC_MAX_LOST_WORK_SECONDS=600


#
# Should laptop mode tools control readahead?
#
CONTROL_READAHEAD=1
# 10MB readahead in laptop mode
LM_READAHEAD=10240
NOLM_READAHEAD=128

# Disks will be mounted with noatime in laptop mode, so atime updates to file
# inodes will be stopped.
CONTROL_NOATIME=1
# Don't use relatime instead of noatime
USE_RELATIME=0

# set hdd timeout
CONTROL_HD_IDLE_TIMEOUT=1
LM_AC_HD_IDLE_TIMEOUT_SECONDS=60
LM_BATT_HD_IDLE_TIMEOUT_SECONDS=30
NOLM_HD_IDLE_TIMEOUT_SECONDS=7200

# set HDD power management
CONTROL_HD_POWERMGMT=1
BATT_HD_POWERMGMT=1
LM_AC_HD_POWERMGMT=127
NOLM_AC_HD_POWERMGMT=254


# enable write cache
CONTROL_HD_WRITECACHE=1

NOLM_AC_HD_WRITECACHE=1
NOLM_BATT_HD_WRITECACHE=0
LM_HD_WRITECACHE=1

CONTROL_MOUNT_OPTIONS=1


#
# Dirty synchronous ratio.  At this percentage of dirty pages the process
# which calls write() does its own writeback.
# With this set to 80, a disk write is forced only once 80 percent of pages are
# dirty. This holds data in memory and prevents frequent disk writes.
LM_DIRTY_RATIO=80
NOLM_DIRTY_RATIO=40


#
# Allowed dirty background ratio, in percent.  Once DIRTY_RATIO has been
# exceeded, the kernel will wake pdflush which will then reduce the amount
# of dirty memory to dirty_background_ratio.
# Once writeout has commenced, write as much as possible to disk without
# holding anything back, so this has been set to 1 percent.
LM_DIRTY_BACKGROUND_RATIO=1
NOLM_DIRTY_BACKGROUND_RATIO=10


#
# kernel default settings -- don't touch these unless you know what you're 
# doing.
#
DEF_UPDATE=5
DEF_XFS_AGE_BUFFER=15
DEF_XFS_SYNC_INTERVAL=30
DEF_XFS_BUFD_INTERVAL=1
DEF_MAX_AGE=30


#
# This must be adjusted manually to the value of HZ in the running kernel
# on 2.4, until the XFS people change their 2.4 external interfaces to work in
# centisecs. This can be automated, but it's a work in progress that still
# needs some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for
# external interfaces, and that is currently always set to 100. So you don't
# need to change this on 2.6.
#
XFS_HZ=100


#
# Seconds laptop mode has to wait after the disk goes idle before doing
# a sync.
#
LM_SECONDS_BEFORE_SYNC=2
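
Once laptop mode is active, it is worth checking that the kernel actually picked up the settings. The sketch below simply prints a few files under /proc/sys/vm: laptop_mode should be non-zero while laptop mode is in effect, and the dirty ratios should match the LM_ or NOLM_ values from the configuration depending on the current state. The function name and its optional directory argument are my own additions for the example.

```shell
#!/bin/sh
# Print selected VM writeback settings as name=value lines.
#   $1 = directory to read from (defaults to /proc/sys/vm)
show_vm_settings() {
    dir=${1:-/proc/sys/vm}
    for name in laptop_mode dirty_ratio dirty_background_ratio \
                dirty_expire_centisecs dirty_writeback_centisecs; do
        # Skip files that this kernel does not provide.
        if [ -r "$dir/$name" ]; then
            echo "$name=$(cat "$dir/$name")"
        fi
    done
}

# Usage:
#   show_vm_settings
```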

After enabling laptop-mode, the HDD is able to sleep peacefully for quite some time between spin-ups. The operating temperature now rarely exceeds 50 °C. The load cycle count is still increasing, but at a much slower rate. Hopefully this HDD is going to last longer than the previous one.

6 comments:

deep-endolith said...

My hard drive has 1.4 million load cycles after only 2 years of use. In Windows, there is no problem. I am not very happy with Ubuntu. Interesting that a bug I reported as an annoyance might actually be part of the hardware-destroying problem:

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/294190

botcyb.org said...

Well, I think there is a slight misconception here. The excessive load cycle count is not because Ubuntu sets aggressive power management settings on the hard disk. The settings have been put there by the manufacturer, who perhaps thought a few extra minutes of battery life preferable to the longevity of the disk. By default, Ubuntu doesn't override the manufacturer's settings. Ubuntu can't be blamed entirely for this - an OS is not expected to have to override the hardware manufacturer's power management settings in order to save the hardware.
However, once the hard disk spins down, it is indeed the OS's responsibility to hold back non-essential disk I/O requests as long as possible to prevent unnecessary spin-ups. Perhaps Windows does this, which would help it maintain a lower load cycle count. laptop-mode-tools, if configured properly, does the same job. Hence I feel laptop-mode-tools should be enabled by default in Ubuntu.

deep-endolith said...

Doesn't the hard drive killing bug only exist in laptop mode, though?

How often does wpa supplicant log compared to the other tools? Surely there's a way to limit the amount that logs are written to the disk.

sambit said...

»Doesn't the hard drive killing bug only exist in laptop mode, though?
On the contrary, the issue is more prominent if laptop mode is disabled. Laptop mode, when enabled and properly configured, tries to minimize disk usage to some extent.

»How often does wpa supplicant log compared to the other tools?
It is quite random, but at least one of these three - wpa-supplicant, gnome-do and gconf-d - will come alive every 10 seconds. All three come up with almost equal frequency.

Widita said...

Hi,

I was using Windows XP and the load/unload cycle count also increased a lot...

The drive is a Seagate Momentus 5400.3,

Is there any way to change the aggressive power saving mode to less aggressive mode in Windows environment?


I tried Windows Vista; the load/unload cycle growth slowed down, but sometimes there is still a sudden increase, something like 14 cycles in 16 minutes...


Thanks

Daniel Rooney said...

This happened to me once. It had come to worst such as access.error.sx and I had to change hard disks to Samsung for document storage.