Welcome to the Slackware Documentation Project

Hardware Diagnostics

This is an attempt to document symptoms, open-source diagnostic software, and solutions to hardware failure.

DISCLAIMER: We are not liable for any damage caused to your system by reading this howto, running diagnostics programs, or implementing fixes. This howto is for informational purposes only.

Before working inside the case you should:

  1. Power off your computer, turn off the PSU, and unplug it.
  2. Wear an antistatic wrist strap.
  3. Ground yourself by first touching the PSU.
  4. Be careful not to damage motherboard components with sharp, hard tools.
  5. Don't use force to remove components and check the manual for how to remove them.
  6. Don't do anything you are not comfortable with.
  7. If you don't know what you are doing, get an expert to do it.
  8. Don't have water or other conductive liquids near the computer or work area.
  9. Don't leave any metallic or conducting objects inside the case as they may short-circuit components.

Common Symptoms

These are just common symptoms of each component failure, and they are rarely clear enough to diagnose a hardware error right away. Run the diagnostic software to confirm suspicions. Sometimes more than one component fails at once, which makes an accurate diagnosis harder, and you may need to take the machine to a shop.
The meanings of beep codes that you may hear can be found on the wiki or at BIOS Central or in your motherboard manual.

RAM

  • The computer typically does not POST or boot properly, but it will try.
  • Each boot typically causes different symptoms, i.e. the boot will halt in a different place each time.
  • The fans are typically running, often at 100%.
  • There may be beep codes.
  • Randomly occurring kernel panics and segmentation faults in running programs.

PSU

  • Sudden shutdown or reboot without warning or logs of what went wrong.
  • Sudden shutdown or reboot may occur during high system load or power usage or even when idle.
  • You may notice a strange smell coming directly from the PSU as it overheats.
  • You may notice a high-pitched (squealing/screeching) noise coming from it, due to failing capacitors.
  • The system may not boot after pressing the power button, and you may need to press it more than once.
  • Rarely it dies completely, like after a power surge, and the fans will not be running and it will not POST or boot and there may be motherboard damage.

Power Button

  • System powers on without wake-on-LAN or wake-on-timer or other such wake methods.
  • Reboot loop where the system shuts down and powers up on its own continuously.
  • The system may not boot after pressing the power button, and you may need to press it more than once.
  • You may get a “boot failure” message during system POST.
  • You may hear beep codes.

CMOS battery

  • The time is reset on each boot.
  • The BIOS settings reset on boot.
  • There may be beep codes.
  • The screen may be black without any beep codes.

HDD

  • I/O errors in the logs.
  • Filesystem corruption and bad blocks detected.
  • May POST but may not finish booting properly.
  • Disk access slows down right before failure.
  • Strange noises such as clicks, grinding, and spinning up noises.
  • The BIOS will not detect a completely dead disk.

GPU

  • Graphical glitches on screen.
  • The screen may be black.
  • May cause system hangs, and audio may be looping during the hang.
  • You may notice a high-pitched (squealing/screeching) noise coming from it, due to failing capacitors.
  • There may be messages in the logs relating to the video drivers or an Xorg crash or nothing at all.

CPU

  • May or may not POST.
  • There may be beep codes.
  • The fans are typically running at 100%.
  • On multi-core machines where only one core is affected, kernel panics occur and are clearly logged as a Hardware Error MCE (machine-check exception).
  • If it is overheating it may also trigger an MCE, which will cause a kernel panic and forced shutdown, or it may throttle itself down and the system will seem slower.

Motherboard

  • Check for swollen capacitors.
  • You may notice a high-pitched (squealing/screeching) noise coming from it, due to failing capacitors.
  • A dead motherboard does not POST.
  • The fans connected to the motherboard will not be running; the ones connected directly to the PSU may be running at 100%.

DVD+-RW

  • Won't read or write disks properly.
  • May keep spinning up and down forever while trying to read a disk.
  • May not open when you press the button, but instead make some clunking sounds and open only after many button presses.

Mouse

  • Problems with drag-and-drop, drag-and-select, double-click.
  • Mouse pointer jumps.
  • Acceleration issues.
Pinpointing the source of a high-pitched squealing/screeching noise caused by failing capacitors can be important in finding the bad hardware component. Such a noise is almost always a sign of hardware issues.

Diagnostics

Check the cables, connectors, and pins to make sure they are not damaged or corroded. Oxidation may build up on the DIMM pins and may need to be removed. This should be done carefully, using only 99% isopropyl alcohol and a soft cotton q-tip. Hard objects should be avoided inside the case as they may break off pins or damage components. If you don't feel comfortable, get a professional to do it.

RAM

  • Run a memory testing program such as memtest86+. Errors are typically found during the first run, but do more runs in proportion to your suspicion.
These RAM testing programs also test the CPU, so if the DIMMs are known good on other systems, maybe the CPU is the cause.

PSU

  • There is no specific test for the PSU. However, you should monitor voltages. You can usually do this in the BIOS, and these readings are the most accurate. As a rule of thumb, each rail should read at or slightly above its nominal voltage: the +3.3V line above 3.3V, the +5V line above 5V, and most importantly the +12V rail above 12V. (The ATX specification tolerates ±5% on these rails, so a reading slightly below nominal is not proof of failure by itself.) Note that this does NOT mean that you should increase the voltage if it is low; this is usually handled automatically. A healthy PSU holds each rail steady at some value above the critical value, while a failing PSU gives readings that vary quickly and sit close to the critical value. For example, a good PSU might show a stable 3.35V on the +3.3V line (just an example), whereas a bad PSU might show a reading that quickly varies between 3.30V and 3.32V. However, the voltage may also be fluctuating not because of a bad PSU, but because another bad component (say a graphics card) is lowering the voltage on the line due to internal shorting or malfunction. Planning ahead, when you first get a new PSU, write down the voltages, save them for reference, and monitor them over time. If the BIOS has no option to monitor the voltages, you can use a voltmeter or multimeter to measure them directly on the connectors. The pinout for the connectors can be found here: ATX 20-pin, ATX 24-pin. Alternatively, special meters for PSUs are available in electronics shops.
  • In theory, a failing PSU shuts down or reboots under increased power usage because it cannot provide the needed current and overheats as a result. As such, perhaps the best way to diagnose it is to monitor voltages under load: install lm_sensors, configure it using sensors-detect, and then use a monitoring program to watch the voltages and warn when they go below the given limit. Then load the system with some CPU- or GPU-intensive application and wait for the warnings or a sudden shutdown/reboot. Be aware that the shutdown/reboot can occur even when idle, and the voltage drop may be too fast for the alarm to catch.
  • Being able to pinpoint the PSU as the source of a high-pitched squealing/screeching noise is reliable for diagnosing bad capacitors.
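
The load test described above can be sketched with two small shell helpers. This is only a sketch under assumptions, not a finished monitoring tool: the function names log_sensors and rail_spread are made up for illustration, and it assumes lm_sensors is already configured so that the sensors command prints your voltages. rail_spread expects one reading per line (copied by hand from the log, the BIOS, or a multimeter) and prints the minimum, maximum, and spread, which makes a rail that sags or jumps around easier to spot.

```shell
#!/bin/sh
# log_sensors INTERVAL COUNT LOGFILE
# Append timestamped `sensors` output to LOGFILE, COUNT times,
# waiting INTERVAL seconds between samples.
log_sensors() {
    interval=$1 count=$2 logfile=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        { date '+%Y-%m-%d %H:%M:%S'; sensors; } >> "$logfile"
        sleep "$interval"
        i=$((i + 1))
    done
}

# rail_spread: read one voltage reading per line on stdin and
# print "min max spread" with two decimals. Blank lines are ignored.
rail_spread() {
    awk 'NF {
             v = $1 + 0
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             n++
         }
         END { if (n) printf "%.2f %.2f %.2f\n", min, max, max - min }'
}
```

For example, run log_sensors 5 120 psu.log in one terminal while a CPU- or GPU-intensive program runs in another, then pass the +12V readings from the log to rail_spread and compare the spread with the values you wrote down when the PSU was new.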
You should ALWAYS use a surge protector on ALL electronic devices ALL the time. This prevents damage to the PSU, to the motherboard, and it saves you lots of time and money wasted on replacing electronics damaged by power surges. Many surge protectors come with a warranty that will refund a certain amount of money if your equipment is damaged while using the surge protector properly. The surge protector is cheap, the equipment is expensive, and the refund usually more than covers it.

Power Button

  • A bad power button is surprisingly difficult to diagnose, and there are no software tests that can help. Looking for characteristic symptoms such as a reboot loop and swapping out hardware may be the only way to diagnose it.

CMOS battery

  • You could take the battery out, making sure to use the special tab to remove it rather than trying to pry it off, and measure its voltage. Or you could simply replace it to be sure and dispose of the old one in a proper battery disposal container. This is often the easier route, because you may have had to remove a graphics card or PCI/PCIe card to reach the battery, and you may not have a battery tester for this type of battery.

HDD

  • You can either run a smartctl long test, which tests the entire disk surface for errors, and updates the offline attributes, or you can run the specific proprietary manufacturer's utility. smartctl will also show the HDD temperature and airflow temperature. Make sure the temperatures are below 60 C.
ALWAYS back up your data regularly, and run smartctl regularly or use smartd. If you feel that the HDD is dying, don't bother running the utilities first; back up your important stuff immediately. If the entire disk is full of your important stuff, consider imaging the entire drive to another HDD in case it fails before you can get your data off it. You can then run data carving utilities to carve your data off the image. Again, there is no substitute for backing up your data: your HDD can FAIL at ANY time WITHOUT WARNING from SMART or any diagnostics program.
You can run a SMART long test by running
smartctl -t long /dev/sd?

You then have to wait the time it estimates, plus a few more minutes for the result which you can check with

smartctl -a /dev/sd?

The attributes are listed, but you can check them separately with

smartctl -A /dev/sd?

Here is an important note on attributes from man smartctl

              Each  Attribute  also has a Threshold value (whose range is 0 to
              255) which is printed under the heading "THRESH".  If  the  Nor-
              malized value is less than or equal to the Threshold value, then
              the Attribute is said to have failed.  If  the  Attribute  is  a
              pre-failure Attribute, then disk failure is imminent.

              The Attribute table printed  out  by  smartctl  also  shows  the
              "TYPE"  of  the  Attribute.  Attributes  are one of two possible
              types: Pre-failure or Old age.  Pre-failure Attributes are  ones
              which, if less than or equal to their threshold values, indicate
              pending disk failure.  Old age, or usage  Attributes,  are  ones
              which  indicate end-of-product life from old-age or normal aging
              and wearout, if the Attribute value is less than or equal to the
              threshold.   Please  note: the fact that an Attribute is of type
              'Pre-fail' does not mean that your disk is about  to  fail!   It
              only  has  this  meaning  if  the Attribute´s current Normalized
              value is less than or equal to the threshold value.
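
The rule quoted above (an attribute has failed when its normalized value is less than or equal to its threshold) can be checked mechanically. The following is a minimal sketch: failed_attributes is a made-up helper name, and it assumes the usual ten-column attribute table that smartctl -A prints for ATA drives (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). Other drive types may print a different layout, in which case it will simply print nothing.

```shell
#!/bin/sh
# failed_attributes: read `smartctl -A` output on stdin and print every
# attribute whose normalized VALUE is less than or equal to its THRESH.
# Assumed columns:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) {
             # id, name, type (Pre-fail or Old_age), value, threshold
             printf "%s %s %s value=%d thresh=%d\n", $1, $2, $7, $4, $6
         }'
}
```

Typical use would be smartctl -A /dev/sdX | failed_attributes, with sdX replaced by your drive. Any line it prints with type Pre-fail means, according to the man page text above, that disk failure is imminent. No output does not mean the disk is healthy.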
If you have a laptop/netbook and you hear clicks from the HDD once in a while, it may be that the power saving feature is spinning down the drive. This saves power, but can quickly wear down the drive. You can effectively turn it off by running this on every boot; simply add it to /etc/rc.d/rc.local:
hdparm -B 254 /dev/sd?

GPU

  • Video Memory Stress Test is also available on SourceForge and UBCD. It has the limitation that it cannot always recognize the amount of video RAM properly, and the DOS version cannot recognize more than 512 MB of video RAM. From experience, it works well with integrated Intel cards, and not very well with Nvidia or ATI cards.
  • CUDA GPU memtest requires either a video card that supports CUDA, such as an Nvidia card with the development Nvidia drivers and CUDA installed, or a video card that supports OpenCL, which can be an Nvidia or ATI card with OpenCL installed and supported by the drivers. The test is comprehensive and the authors claim it can detect soft and hard errors. From experience, it may not detect hard errors.
  • If the above don't work, then you could just run a video game in benchmark mode and watch for graphical glitches on the screen or system crashes. The problem with this method is that there is no way to know if the drivers are at fault or if the card is at fault, unless you have prior experience with the game and the glitches or crashes are new.
  • Being able to pinpoint the GPU as the source of a high-pitched squealing/screeching noise is reliable for diagnosing bad capacitors.
Sadly, none of the GPU tests I have tried are reliable in detecting hardware errors. However, a high-pitched squealing/screeching noise coming from the GPU is reliable.

CPU

  • Although memtest86+ tests the CPU as well as RAM, there is a more specific test:
    • Great Internet Mersenne Prime Search (its Linux client is mprime) is very sensitive to CPU errors in mode 1 (Small FFTs) and a bit less so in mode 2 (Large FFTs), and it will report any errors that occur. It is a great way to differentiate between a RAM and a CPU error. Let it run until the CPU temperature is stable and then for as long as you like, in proportion to your suspicion. Errors are typically found rather quickly, so you don't have to wait too long.
To check the CPU temperature make sure you have lm_sensors installed. Configure it by running
sensors-detect

and then add the modprobe lines it suggests to /etc/rc.d/rc.local, make that file executable, and run it. To check temperatures run

sensors

or you can use a monitor of your choice. sensors will also list critical temperatures, but the most accurate temperatures are shown in the BIOS, so do a comparison.
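
If you would rather have a script watch the temperatures, the sketch below filters the raw output of sensors -u. The helper name hot_temps and the 85 degree limit in the usage note are assumptions for illustration only; use the critical temperature that sensors or your CPU documentation lists. The parsing also assumes the raw layout in which each sensor label sits on its own unindented line ending in a colon, followed by indented tempN_input: lines.

```shell
#!/bin/sh
# hot_temps LIMIT
# Read `sensors -u` output on stdin and print "label temperature" for every
# temperature input that is at or above LIMIT (degrees Celsius).
hot_temps() {
    awk -v limit="$1" '
        # Unindented line ending in ":" is a sensor label, e.g. "Core 0:"
        /^[^ ].*:$/ { label = $0; sub(/:$/, "", label) }
        # Indented "tempN_input: value" holds the current reading
        /^ +temp[0-9]+_input:/ {
            if ($2 + 0 >= limit + 0) printf "%s %.1f\n", label, $2
        }'
}
```

For example, sensors -u | hot_temps 85 prints nothing while everything is below 85 degrees, so it can be run in a loop during a stress test and will only produce output when something gets hot.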

Motherboard

  • Sadly, there is no reliable software test for motherboard errors. The diagnosis is mostly a process of elimination. You can also take it to a shop and have them test the motherboard.
  • Being able to pinpoint the motherboard as the source of a high-pitched squealing/screeching noise is reliable for diagnosing bad capacitors, a common problem with motherboards.

DVD+-RW

Burn an ISO image to a disc and run:

cmp input.iso /dev/sr0

It should say:

cmp: EOF on input.iso

If it does not say that, then the DVD/CD was not burned properly. However, it may also be because of bad media or high burn speed. If the drive keeps spinning up and down while reading a disk it could be failing or it could be that you are trying to play a commercial DVD whose region code is not supported by the drive, which limits it to 1x read speed.
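
The same comparison can be done without relying on the exact wording of the cmp message, by reading back only as many bytes as the image contains. This is a minimal sketch; verify_burn is a made-up helper name, and it assumes the disc was burned from exactly the image file you pass in.

```shell
#!/bin/sh
# verify_burn ISO DEVICE
# Compare ISO with the same number of bytes read back from DEVICE.
# Returns 0 if they match, 1 if they differ or DEVICE is shorter,
# 2 if ISO cannot be read.
verify_burn() {
    iso=$1 dev=$2
    [ -r "$iso" ] || return 2
    # Size of the image in bytes; the arithmetic strips any padding from wc.
    size=$(( $(wc -c < "$iso") ))
    # Trailing padding on the disc is ignored because head stops at $size.
    head -c "$size" "$dev" | cmp -s - "$iso"
}
```

For example: verify_burn input.iso /dev/sr0 && echo "burn verified". A non-zero result has the same possible causes as listed above: a bad burn, bad media, or too high a burn speed.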

Mouse

Trying to eliminate driver and software issues would be a first step to diagnosing mouse problems. Try restarting Xorg, reloading mouse drivers, and restarting the system. If these have no significant effect, then it is likely a hardware issue.

Solutions

An overheating system can cause instability and may mimic failing hardware. Before doing anything else it may be worthwhile to get a compressed air canister and clean the dust out from all the fans, heat sinks, and all the hard to reach places where dust accumulates inside the case. Using a brush is not as effective and may damage components, so it should only be used outside the case. It is also important to make sure there is proper airflow inside the case:
  1. At the bare minimum you should have a large 120mm output fan in the rear of the case for an ATX motherboard. For smaller systems it varies, and only a smaller fan may fit.
  2. Make sure that cables inside the case do not obstruct airflow, and if they do then use plastic cable ties to fix it if possible. Do not leave any stray metal inside the case that could short-circuit components.
  3. Make sure fans are placed so that they cause air movement across hot components, or at the very least evacuate hot air from the case.

RAM

Remove one DIMM/RAM stick at a time, and then check again with memtest86+. If the test still fails, put that stick back, remove a different one, and run the test again. The stick whose removal makes the errors disappear is the one to replace. Rarely, more than one DIMM may be failing, so take that into account. It could also be the CPU if all your DIMMs fail the test.

If you cannot replace the RAM soon, like if you have an ancient computer and can't find RAM for it, you can use the mem= kernel option to force the kernel to use only good RAM. Say you have an error at 129.0 MB, like I did recently, you could use this kernel boot option:
mem=128M

and it will only use the first 128 MB of RAM, omitting the bad part. There is also an option to exclude a section of RAM, but it is less tested and may not work.
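
After rebooting it is worth confirming that the option was really passed to the kernel. The helper below is a small sketch for that; boot_param is a made-up name, and the optional second argument exists only so the function can be tried against a saved copy of the command line instead of /proc/cmdline.

```shell
#!/bin/sh
# boot_param NAME [CMDLINE_FILE]
# Print the value of NAME=value from the kernel command line.
# Returns 1 if the option is absent. CMDLINE_FILE defaults to /proc/cmdline.
boot_param() {
    name=$1 file=${2:-/proc/cmdline}
    # One option per line, then pick the one whose key matches NAME.
    tr ' ' '\n' < "$file" | awk -F= -v n="$name" '
        $1 == n { print substr($0, length(n) + 2); found = 1 }
        END { exit !found }'
}
```

With the example above, boot_param mem should print 128M, and grep MemTotal /proc/meminfo should then report somewhat less than 128 MB.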

PSU

Replace the PSU with a new one. If the symptoms disappear, you can be sure it was the PSU. However, note that a bad PSU may have damaged the motherboard or other components, or a power surge may have damaged everything at once.

Power Button

Try to clean the power button using compressed air. If that doesn't work then replace the power button if possible, otherwise replace the case along with the button or take it to a repair shop and see what they can do.

CMOS battery

Replace the battery carefully using the special tab. Do NOT pry it off using a screwdriver because then it will break and it won't go back in.

HDD

You should first get your important data off of it. If you feel it is failing fast, use ddrescue to image the drive to another drive. Once you have the image, be it complete or incomplete, you can use TestDisk and/or foremost to carve data off the image. The recovered files will not have the file names they used to, but at least you will get the data. You can find all these utilities and more on the SystemRescueCD. Now just replace the drive.
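
A common way to run GNU ddrescue is in two passes that share a mapfile: a quick pass that grabs everything that reads cleanly, then a slower pass that retries the bad areas. The wrapper below is only a sketch: image_disk and the DRY_RUN variable are made up for illustration, /dev/sdX stands for your failing drive, and the retry count of 3 is an arbitrary choice. Make sure the image is written to a different, healthy disk with enough free space.

```shell
#!/bin/sh
# image_disk SOURCE IMAGE MAPFILE
# Image SOURCE into IMAGE with GNU ddrescue, keeping progress in MAPFILE.
# Set DRY_RUN=1 to print the commands instead of running them.
image_disk() {
    src=$1 img=$2 map=$3
    if [ "$src" = "$img" ]; then
        echo "source and image must differ" >&2
        return 1
    fi
    run=${DRY_RUN:+echo}
    # Pass 1: copy the readable areas quickly, skipping the scraping phase.
    $run ddrescue -n "$src" "$img" "$map" || return 1
    # Pass 2: retry the bad areas up to 3 times using direct disc access.
    $run ddrescue -d -r3 "$src" "$img" "$map"
}
```

Running DRY_RUN=1 image_disk /dev/sdX disk.img disk.map first lets you check the commands before touching the drive. Because the mapfile records what has already been copied, an interrupted run can be restarted with the same arguments.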

There exist programs that claim to correct bad blocks. What they do is mark the bad blocks so the drive doesn't use them. The problem is that bad blocks are an indicator of imminent drive failure. So you may think that they are fixing the problem, put off backing up your data, and then the drive fails and you lose your data.

GPU

Replace the video card. If it is too expensive and you aren't sure it is at fault, which is quite likely given how unreliable the GPU tests are, try testing it in another machine, or swap in a spare known-good video card and see if the symptoms persist.

CPU

If you suspect the CPU is overheating, then you can try to remove the heatsink-fan block according to your CPU or heatsink manual, clean off the old thermal paste, apply new thermal paste, or let an expert do it. Otherwise, replace the CPU. Make sure it is actually the CPU and not the RAM that is the problem, as a CPU is very expensive compared to RAM.

If you cannot replace the CPU and it is a multi-core CPU with only one core that has failed, for example the L1 cache of CPU core #2, then you can try to disable the failed core. One way to do this is to use the maxcpus= kernel boot option. In the example you would use:
maxcpus=2

Because CPU cores are numbered starting with 0 (so 4 cores are numbered 0, 1, 2, 3), this boots with only cores 0 and 1 and leaves the failed core #2 offline. Note that if CPU #0 is failing, you may be out of luck. A complementary step is to first make sure you have CONFIG_HOTPLUG_CPU built into the kernel, and then run:

echo 1 > /sys/devices/system/cpu/cpu3/online

on every boot. In this example this turns ON CPU #3, so you end up with the 3 good cores running, i.e. 0, 1, 3.
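
On machines with more cores, the same idea can be written as a loop that brings every core except the bad one online. This is a sketch under assumptions: online_all_but is a made-up helper name, it must run as root, and the optional second argument is only there so the function can be tried on a scratch directory instead of the real sysfs tree.

```shell
#!/bin/sh
# online_all_but BAD_CPU [CPU_DIR]
# Bring every hot-pluggable CPU online except number BAD_CPU, which is
# left untouched. CPU_DIR defaults to /sys/devices/system/cpu.
online_all_but() {
    bad=$1
    dir=${2:-/sys/devices/system/cpu}
    for cpu in "$dir"/cpu[0-9]*; do
        n=${cpu##*/cpu}
        # Skip the bad core, and any CPU without a writable "online" file
        # (the boot CPU often has none).
        [ "$n" != "$bad" ] || continue
        [ -w "$cpu/online" ] || continue
        echo 1 > "$cpu/online"
    done
}
```

For the example above, online_all_but 2 in /etc/rc.d/rc.local would turn on CPU #3 and leave CPU #2 offline. Afterwards, cat /sys/devices/system/cpu/online shows which cores are running.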

Motherboard

Replace the motherboard, and make sure the PSU is not damaged, as it can damage your new motherboard.

DVD+-RW

Replace the drive.

Mouse

Although replacing the mouse is the definitive solution, some fixes may help.

  • For click-related problems, it may be that the small plastic switch inside the mouse has worn down with repetitive use. Correcting the issue may be as simple as adding something to raise the height of the switch. One possibility is adding a small piece of duct tape over the switch to raise its height and improve its connection to the mouse button.
  • For wandering / jumping pointer problems, cleaning the optical / laser window on the underside of the mouse with alcohol and a q-tip may help. If it is a wireless mouse consider infrared or microwave interference depending on the wireless mouse type. Removing or isolating sources of interference or distancing oneself from the sources may help. Wireless mouse manuals recommend placing the base station away from electronic devices or other sources of electromagnetic interference.
  • For acceleration issues one can try adjusting acceleration settings. However, in some cases the acceleration issues are intermittent, and may be driver issues that are much harder to solve. Trying different driver and software versions may help, or one can try to contact the developers of the drivers / software.
