Hardware Diagnostics

This is an attempt to document symptoms, open-source diagnostic software, and solutions to hardware failure.

DISCLAIMER: We are not liable for any damage caused to your system by reading this howto, running diagnostics programs, or implementing fixes. This howto is for informational purposes only.

Before working inside the case you should:

  1. Power off your computer, turn off the PSU, and unplug it.
  2. Wear an antistatic wrist strap.
  3. Ground yourself by first touching the PSU.
  4. Be careful not to damage motherboard components with sharp, hard tools.
  5. Don't force components out; check the manual for how to remove them.
  6. Don't do anything you are not comfortable with.
  7. If you don't know what you are doing, get an expert to do it.
  8. Don't have water or other conductive liquids near the computer or work area.
  9. Don't leave any metallic or conducting objects inside the case as they may short-circuit components.

Common Symptoms

These are just common symptoms of each component failure, and are rarely clear enough to diagnose hardware error right away. Run the diagnostics software to confirm suspicions. Sometimes more than one thing can fail at once, which will make it harder to diagnose accurately, and you may need to take it to a shop.
The meanings of beep codes that you may hear can be found on the wiki or at BIOS Central or in your motherboard manual.

RAM

PSU

Power Button

CMOS battery

HDD

GPU

CPU

Motherboard

DVD+-RW

Mouse

A high-pitched squealing or screeching noise caused by failing capacitors is almost always a sign of a hardware problem, and pinpointing where it comes from can help you find the bad component.

Diagnostics

Check the cables, connectors, and pins for damage or corrosion. Oxidation may build up on the DIMM pins and may need to be removed. Do this carefully, using only 99% isopropyl alcohol and a soft cotton swab. Avoid hard objects inside the case, as they may break off pins or damage components. If you don't feel comfortable, get a professional to do it.

RAM

These RAM testing programs also test the CPU, so if the DIMMs are known good on other systems, maybe the CPU is the cause.

PSU

You should ALWAYS use a surge protector on ALL electronic devices ALL the time. This prevents damage to the PSU, to the motherboard, and it saves you lots of time and money wasted on replacing electronics damaged by power surges. Many surge protectors come with a warranty that will refund a certain amount of money if your equipment is damaged while using the surge protector properly. The surge protector is cheap, the equipment is expensive, and the refund usually more than covers it.

Power Button

CMOS battery

HDD

ALWAYS back up your data regularly, and run smartctl regularly or use smartd. If you feel that the HDD is dying, don't bother running the utilities first; back up your important stuff immediately. If the entire disk is full of your important stuff, consider imaging the entire drive to another HDD in case it fails before you can get your data off it. You can then run data carving utilities on the image to recover your data. Again, there is no substitute for backing up your data: your HDD can FAIL at ANY time WITHOUT WARNING from SMART or any diagnostics program.
You can run a SMART long test by running
smartctl -t long /dev/sd?

You then have to wait the time it estimates, plus a few more minutes for the result which you can check with

smartctl -a /dev/sd?

The attributes are listed, but you can check them separately with

smartctl -A /dev/sd?

Here is an important note on attributes from man smartctl

              Each  Attribute  also has a Threshold value (whose range is 0 to
              255) which is printed under the heading "THRESH".  If  the  Nor-
              malized value is less than or equal to the Threshold value, then
              the Attribute is said to have failed.  If  the  Attribute  is  a
              pre-failure Attribute, then disk failure is imminent.

              The Attribute table printed  out  by  smartctl  also  shows  the
              "TYPE"  of  the  Attribute.  Attributes  are one of two possible
              types: Pre-failure or Old age.  Pre-failure Attributes are  ones
              which, if less than or equal to their threshold values, indicate
              pending disk failure.  Old age, or usage  Attributes,  are  ones
              which  indicate end-of-product life from old-age or normal aging
              and wearout, if the Attribute value is less than or equal to the
              threshold.   Please  note: the fact that an Attribute is of type
              'Pre-fail' does not mean that your disk is about  to  fail!   It
              only  has  this  meaning  if  the Attribute's current Normalized
              value is less than or equal to the threshold value.
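
The rule quoted above can be sketched as a small filter over the attribute table. This is only a minimal sketch and not part of smartmontools: the function name failed_attrs is made up for illustration, and it assumes the usual ten-column layout printed by smartctl -A (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, ...), which may differ on some drives.

```shell
# failed_attrs: read `smartctl -A` output on stdin and print every attribute
# whose normalized VALUE is less than or equal to its THRESH.
# Hypothetical helper; assumes the standard ten-column attribute table.
failed_attrs() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields,
        # so the header line and any surrounding text are skipped.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            value  = $4 + 0    # VALUE column (normalized)
            thresh = $6 + 0    # THRESH column
            if (value <= thresh)
                printf "%s %s %s value=%s thresh=%s\n", $1, $2, $7, $4, $6
        }'
}
```

Typical use would be something like smartctl -A /dev/sda | failed_attrs. No output means no attribute has reached its threshold, which is still no guarantee that the drive is healthy.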
If you have a laptop/netbook and you hear clicks from the HDD once in a while, it may be that the power saving feature is spinning down the drive. This saves power, but can quickly wear down the drive. You can stop the drive from spinning down by running the following on every boot; simply add it to /etc/rc.d/rc.local:
hdparm -B 254 /dev/sd?

GPU

Sadly, none of the GPU tests I have tried are reliable in detecting hardware errors. However, a high-pitched squealing/screeching noise coming from the GPU is a reliable sign of trouble.

CPU

To check the CPU temperature make sure you have lm-sensors installed. Configure it by running
sensors-detect

Then copy the modprobe lines it suggests into /etc/rc.d/rc.local, make that file executable, and run it. To check temperatures run

sensors

or you can use a monitor of your choice. sensors will also list critical temperatures, but the most accurate temperatures are shown in the BIOS, so do a comparison.
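
To spot readings that are getting close to the critical limit, the output of sensors can be filtered. The following is a rough sketch under assumptions: the name hot_sensors and the default 10-degree margin are invented for illustration, and it only understands lines of the common form "Label: +45.0°C (high = ..., crit = +100.0°C)". Sensor chips that print a different layout will simply be skipped.

```shell
# hot_sensors [MARGIN]: read `sensors` output on stdin and print every reading
# that is within MARGIN degrees (default 10) of its critical temperature.
# Hypothetical helper; only lines containing "crit = " are considered.
hot_sensors() {
    hs_margin="${1:-10}"
    awk -v margin="$hs_margin" '
        /crit = / {
            # Label: everything before the first colon.
            label = $0; sub(/:.*/, "", label)
            # Current reading: the number right after the colon.
            rest = $0; sub(/^[^:]*:[ \t]*/, "", rest)
            temp = rest + 0
            # Critical limit: the number following "crit = ".
            crit = $0; sub(/.*crit = /, "", crit)
            crit = crit + 0
            if (temp >= crit - margin)
                printf "%s: %.1f (crit %.1f)\n", label, temp, crit
        }'
}
```

It would be used as sensors | hot_sensors, or sensors | hot_sensors 20 for a wider margin.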

Motherboard

DVD+-RW

Burn an ISO image to a disc and run:

cmp input.iso /dev/sr0

It should say:

cmp: EOF on input.iso

If it does not say that, then the DVD/CD was not burned properly. However, it may also be because of bad media or high burn speed. If the drive keeps spinning up and down while reading a disc, it could be failing, or you could be trying to play a commercial DVD whose region code is not supported by the drive, which limits it to 1x read speed.
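
A variation on the cmp check is to compare only as many bytes as the image contains, so that any padding after the end of the image on the disc is ignored and the result is a plain yes or no. This is a sketch, not a tested procedure: verify_burn is a made-up name, and the -n (byte limit) option is assumed to be available, as it is in GNU cmp.

```shell
# verify_burn IMAGE [DEVICE]: compare the first <size of IMAGE> bytes of DEVICE
# (default /dev/sr0) against IMAGE. Hypothetical helper; assumes cmp supports -n.
verify_burn() {
    vb_iso="$1"
    vb_dev="${2:-/dev/sr0}"
    # Size of the image in bytes; the arithmetic expansion strips any
    # whitespace that wc may print around the number.
    vb_size=$(( $(wc -c < "$vb_iso") ))
    # -s keeps cmp quiet; only its exit status is used.
    if cmp -s -n "$vb_size" "$vb_iso" "$vb_dev"; then
        echo "burn verified"
    else
        echo "burn differs from image"
        return 1
    fi
}
```

For example, verify_burn input.iso /dev/sr0. A mismatch here has the same possible causes as above: a bad burn, bad media, or a failing drive.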

Mouse

The first step in diagnosing mouse problems is to rule out driver and software issues. Try restarting Xorg, reloading mouse drivers, and restarting the system. If these have no significant effect, then it is likely a hardware issue.

Solutions

An overheating system can cause instability and may mimic failing hardware. Before doing anything else it may be worthwhile to get a compressed air canister and clean the dust out from all the fans, heat sinks, and all the hard to reach places where dust accumulates inside the case. Using a brush is not as effective and may damage components, so it should only be used outside the case. It is also important to make sure there is proper airflow inside the case:
  1. At the bare minimum you should have a large 120mm exhaust fan in the rear of the case for an ATX motherboard. For smaller systems it varies, and only a smaller fan may fit.
  2. Make sure that cables inside the case do not obstruct airflow, and if they do then use plastic cable ties to fix it if possible. Do not leave any stray metal inside the case that could short-circuit components.
  3. Make sure fans are placed so that they cause air movement across hot components, or at the very least evacuate hot air from the case.

RAM

Remove the DIMMs/RAM sticks one at a time, and check again with memtest86+ after each removal. If the test still fails, put that stick back, remove a different one, and run the test again. When the errors disappear, the stick you last removed is the likely culprit and should be replaced. Rarely, more than one DIMM may be failing, so take that into account. It could also be the CPU if all your DIMMs fail the test.

If you cannot replace the RAM soon, for example because you have an ancient computer and can't find RAM for it, you can use the mem= kernel option to force the kernel to use only good RAM. If, say, you have an error at 129.0 MB, as I did recently, you could use this kernel boot option:
mem=128M

and it will only use the first 128 MB of RAM, omitting the bad part. There is also an option to exclude a section of RAM, but it is less tested and may not work.
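
The arithmetic behind picking the value can be sketched as follows. This is only an illustration: mem_limit is a made-up helper, and rounding down to a multiple of 16 MB is an arbitrary choice made here to leave a margin below the bad address, not a kernel requirement.

```shell
# mem_limit LOWEST_BAD_MB [ALIGN_MB]: print a mem= option that stays strictly
# below the lowest failing address (given in whole MB), rounded down to a
# multiple of ALIGN_MB (default 16). Hypothetical helper for illustration.
mem_limit() {
    ml_bad="$1"
    ml_align="${2:-16}"
    # Integer division drops the remainder, so the result is the largest
    # multiple of ml_align that is below the bad address.
    ml_safe=$(( (ml_bad - 1) / ml_align * ml_align ))
    echo "mem=${ml_safe}M"
}
```

With the error at 129 MB from the example, mem_limit 129 prints mem=128M.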

PSU

Replace the PSU with a new one. If the symptoms disappear, you can be sure it was the PSU. However, note that a bad PSU may have damaged the motherboard or other components, or a power surge may have damaged everything.

Power Button

Try to clean the power button using compressed air. If that doesn't work then replace the power button if possible, otherwise replace the case along with the button or take it to a repair shop and see what they can do.

CMOS battery

Replace the battery carefully using the special tab. Do NOT pry it off using a screwdriver because then it will break and it won't go back in.

Replacing the CMOS battery will reset your BIOS/UEFI settings, so be prepared for this.

HDD

You should first get your important data off it. If you feel it is failing fast, use ddrescue to image the drive to another drive. Once you have the image, be it complete or incomplete, you can use TestDisk and/or foremost to carve data off the image. The data will not have the same file names it used to, but at least you will get the data. You can find all these utilities and more on the SystemRescueCD. Now just replace the drive.
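
A common way to run ddrescue is in two passes that share a map file, so the copy can be interrupted and resumed. The sketch below only prints the two commands instead of running them, so they can be reviewed first. The function name rescue_plan and the file names in the usage line are placeholders; check the device name very carefully before running anything for real.

```shell
# rescue_plan SOURCE IMAGE MAPFILE: print a two-pass ddrescue plan.
# Hypothetical helper; it only prints the commands, it does not run them.
rescue_plan() {
    rp_src="$1"
    rp_img="$2"
    rp_map="$3"
    # Pass 1: copy everything that reads easily and skip the slow
    # scraping phase (-n).
    printf 'ddrescue -n %s %s %s\n' "$rp_src" "$rp_img" "$rp_map"
    # Pass 2: return to the bad areas using direct disc access (-d)
    # and up to 3 retries (-r3).
    printf 'ddrescue -d -r3 %s %s %s\n' "$rp_src" "$rp_img" "$rp_map"
}
```

For example, rescue_plan /dev/sdX disk.img disk.map prints the two commands for imaging /dev/sdX into disk.img. The image must be written to a different, healthy drive with enough free space.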

There exist programs that claim to correct bad blocks. What they do is mark the bad blocks so the drive doesn't use them. The problem is that bad blocks are an indicator of imminent drive failure. So you may think that they are fixing the problem and put off backing up your data, and then the drive fails and you lose your data.

GPU

Replace the video card. If it is too expensive to replace on a hunch and you aren't sure it is the culprit, which is very likely given how unreliable the tests are, try testing it in another machine, or swap in a spare known good video card and see if the symptoms persist.

CPU

If you suspect the CPU is overheating, you can try to remove the heatsink-fan block according to your CPU or heatsink manual, clean off the old thermal paste, and apply new thermal paste, or let an expert do it. Otherwise, replace the CPU. Make sure it is actually the CPU and not the RAM that is the problem, as a CPU is very expensive compared to RAM.

If you cannot replace the CPU and it is a multi-core CPU with only one core that has failed, for example the L1 cache of CPU core #2, then you can try to disable the failed core. One way to do this is to use the maxcpus= kernel boot option. In the example you would use:
maxcpus=2

This works because CPU cores are numbered starting with 0, so 4 cores would be numbered 0, 1, 2, 3, and maxcpus=2 brings up only cores 0 and 1 at boot. Note that if CPU #0 is failing, you may be out of luck. A complementary step is to make sure CONFIG_HOTPLUG_CPU is built into the kernel, and then run:

echo 1 > /sys/devices/system/cpu/cpu3/online

on every boot, as root. In this example it turns CPU #3 back ON, so you end up with the 3 good cores running, i.e. 0, 1, 3.
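
To confirm which cores ended up running, the same sysfs files can be read back. The following is a small sketch: cpu_status is a made-up name, the optional directory argument exists only so the function can be tried against a test directory, and a core without an online file (often the case for CPU #0) is assumed to be always on.

```shell
# cpu_status [DIR]: print "cpuN online" or "cpuN offline" for every core
# found under DIR (default /sys/devices/system/cpu).
# Hypothetical helper; a core with no "online" file is reported as online.
cpu_status() {
    cs_base="${1:-/sys/devices/system/cpu}"
    for cs_dir in "$cs_base"/cpu[0-9]*; do
        [ -d "$cs_dir" ] || continue
        cs_name=${cs_dir##*/}
        if [ -r "$cs_dir/online" ]; then
            cs_state=$(cat "$cs_dir/online")
        else
            cs_state=1
        fi
        case "$cs_state" in
            1) echo "$cs_name online" ;;
            *) echo "$cs_name offline" ;;
        esac
    done
}
```

In the example above, running cpu_status should report cpu2 as offline and the other three cores as online.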

Motherboard

Replace the motherboard, and make sure the PSU is not damaged, as a damaged PSU can damage your new motherboard.

DVD+-RW

Replace the drive.

Mouse

Replacing the mouse is the definitive solution. However, some fixes may help temporarily.
