Hardware Diagnostics
This is an attempt to document symptoms, open-source diagnostic software, and solutions to hardware failure.
Before working inside the case you should:
- Power off your computer, turn off the PSU, and unplug it.
- Wear an antistatic wrist strap.
- Ground yourself by first touching the PSU.
- Be careful not to damage motherboard components with sharp, hard tools.
- Don't use force to remove components; check the manual for how to remove them.
- Don't do anything you are not comfortable with.
- If you don't know what you are doing, get an expert to do it.
- Don't have water or other conductive liquids near the computer or work area.
- Don't leave any metallic or conducting objects inside the case as they may short-circuit components.
Common Symptoms
RAM
- The computer typically does not POST or boot properly, but it will try.
- Each boot typically causes different symptoms, i.e. the boot will halt in a different place each time.
- The fans are typically running, often at 100%.
- There may be beep codes.
- Randomly occurring kernel panics and segmentation faults in running programs.
PSU
- Sudden shutdown or reboot without warning or logs of what went wrong.
- Sudden shutdown or reboot may occur during high system load or power usage or even when idle.
- You may notice a strange smell coming directly from the PSU as it overheats.
- You may notice a high-pitched (squealing/screeching) noise coming from it, due to failing capacitors.
- The system may not boot after pressing the power button, and you may need to press it more than once.
- Rarely it dies completely, for example after a power surge. In that case the fans will not run, the system will not POST or boot, and there may be motherboard damage.
Power Button
- System powers on without wake-on-LAN or wake-on-timer or other such wake methods.
- Reboot loop where the system shuts down and powers up on its own continuously.
- The system may not boot after pressing the power button, and you may need to press it more than once.
- You may get a “boot failure” message during system POST.
- You may hear beep codes.
CMOS battery
- The time is reset on each boot.
- The BIOS settings reset on boot.
- There may be beep codes.
- The screen may be black without any beep codes.
HDD
- I/O errors in the logs.
- Filesystem corruption and bad blocks detected.
- May POST but may not finish booting properly.
- Disk access slows down right before failure.
- Strange noises such as clicks, grinding, and spinning up noises.
- The BIOS will not detect a completely dead disk.
GPU
- Graphical glitches on screen.
- The screen may be black.
- May cause system hangs, and audio may be looping during the hang.
- You may notice a high-pitched (squealing/screeching) noise coming from it, due to failing capacitors.
- There may be messages in the logs relating to the video drivers or an Xorg crash or nothing at all.
CPU
- May or may not POST.
- There may be beep codes.
- The fans are typically running at 100%.
- Kernel panics are seen on multi-core machines when only one core is affected. These are clearly listed as Hardware Error MCE (machine-check exception).
- If it is overheating it may also trigger an MCE, which will cause a kernel panic and forced shutdown, or it may throttle itself down and the system will seem slower.
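A minimal sketch of how one might search the logs for the machine-check reports described above. The helper name find_mce and the search patterns are assumptions made for illustration; the exact wording of the kernel messages varies between kernel versions, so treat a match as a hint rather than a diagnosis.

```shell
#!/bin/sh
# find_mce (hypothetical helper): print log lines that look like
# machine-check reports. It reads log text on standard input, so it
# works with dmesg output or with a saved log file.
# The patterns are assumptions; adjust them to what your kernel prints.
find_mce() {
    grep -i -E 'hardware error|machine[ -]check'
}

# Example usage (reading the kernel log may require root):
#   dmesg | find_mce
#   find_mce < /var/log/messages
```

The function exits with a non-zero status when nothing matches, which makes it easy to use in an if statement.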
Motherboard
- Check for swollen capacitors.
- You may notice a high-pitched (squealing/screeching) noise coming from it, due to failing capacitors.
- A dead motherboard does not POST.
- The fans connected to the motherboard will not be running; the ones connected to the PSU may be running at 100%.
DVD+-RW
- Won't read or write disks properly.
- May keep spinning up and down forever while trying to read a disk.
- May not open when you press the button, but instead make some clunking sounds and open only after many button presses.
Mouse
- Problems with click-and-drag, drag-and-drop, drag-to-select, double-click.
- Mouse pointer jumps.
- Acceleration issues.
Diagnostics
RAM
- Run a memory tester such as memtest86+. Errors are typically found during the first run, but do more runs in proportion to your suspicion.
PSU
- There is no specific test for the PSU. However, you should monitor voltages. You can usually do this in the BIOS, whose readings are the most accurate. Make sure all the voltages are above their stated value: the +3.3V line should be greater than 3.3V, the +5V line greater than 5V, and most importantly the +12V rail greater than 12V. This does NOT mean that you should increase a voltage if it is low; this is usually done automatically by the BIOS.
- Normally, each voltage should be stable around a certain value, above the critical value. If the PSU is failing, the voltages vary quickly and sit close to the critical value. For example, a good PSU might hold the +3.3V line at a stable 3.35V (just an example), while a bad PSU will show a +3.3V reading that quickly varies between 3.30V and 3.32V. However, the voltage may be fluctuating not because of a bad PSU but because of another bad component (say a graphics card) that is lowering the voltage on the line due to internal shorting or malfunction.
- Planning ahead, when you first get a new PSU, write down the voltages, save them for reference, and monitor them over time.
- If there is no option in the BIOS to monitor the voltages, you can use a voltmeter or multimeter to measure them directly on the connectors, using the pinout for the ATX 20-pin or ATX 24-pin connector. Alternatively, special meters for PSUs are available in electronics shops.
- In theory, the PSU shuts down or reboots under increased power usage because it cannot provide the needed current and overheats as a result. As such, perhaps the best way to diagnose it is by monitoring voltages under load. Install lm_sensors, configure it using sensors-detect, and then use a monitoring program to monitor voltages and warn when they go below the given limit. Then load the system with some CPU- or GPU-intensive application and wait for the warnings or a sudden shutdown/reboot. Technically, the shutdown/reboot can occur even when idle, and the voltage drop may be too fast for the alarm to catch.
- Being able to pinpoint the PSU as the source of a high-pitched squealing/screeching noise is reliable for diagnosing bad capacitors.
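The "warn when they go below the given limit" step can be sketched as a small check on one reading. This is only an illustration: the helper name rail_status is hypothetical, and the default limit of ±5% is an assumption brought in from the tolerance the ATX specification gives for the +3.3V, +5V and +12V rails, not something stated in the text above. Pass a smaller tolerance if you want a stricter warning.

```shell
#!/bin/sh
# rail_status NOMINAL MEASURED [TOLERANCE_PERCENT]
# Hypothetical helper: prints "ok", "low" or "high" for one voltage
# reading. The default tolerance of 5 percent is an assumption (ATX
# rail tolerance), not a value taken from this document.
rail_status() {
    awk -v nom="$1" -v v="$2" -v tol="${3:-5}" 'BEGIN {
        lo = nom * (1 - tol / 100)   # lowest acceptable reading
        hi = nom * (1 + tol / 100)   # highest acceptable reading
        if (v < lo)      print "low"
        else if (v > hi) print "high"
        else             print "ok"
    }'
}

# Example usage with a reading copied from the BIOS or from sensors:
#   rail_status 12 11.9
#   rail_status 3.3 3.31 2    # stricter 2 percent window
```

Which lm_sensors input corresponds to which rail depends on the motherboard's sensor chip, so the reading has to be picked out by hand or by a chip-specific label.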
Power Button
- A bad power button is surprisingly difficult to diagnose, and there are no software tests that can help. Looking for characteristic symptoms such as a reboot loop and swapping out hardware may be the only way to diagnose it.
CMOS battery
- You could take the battery out, making sure to use the special tab to remove it rather than trying to pry it off, and measure its voltage. Or you could simply replace it anyway to be sure and put the old one in a proper battery disposal container. This is often the easier route, because you may have had to remove a graphics card or PCI/PCIe card to reach the battery, and you may not have a battery tester for this type of battery.
HDD
- You can either run a smartctl long test, which tests the entire disk surface for errors and updates the offline attributes, or you can run the manufacturer's own proprietary utility. smartctl will also show the HDD temperature and airflow temperature. Make sure the temperatures are below 60 °C. These tools are available on the following live CDs:
- SystemRescueCD, which includes smartctl, ddrescue, TestDisk and foremost.
- Ultimate Boot CD, which includes lots of FLOSS diagnostic and recovery utilities as well as the manufacturers' proprietary utilities.
To start the long test, run:
smartctl -t long /dev/sd?
You then have to wait the time it estimates, plus a few more minutes, for the result, which you can check with:
smartctl -a /dev/sd?
The attributes are included in that output, but you can list them separately with:
smartctl -A /dev/sd?
Here is an important note on attributes from man smartctl:
Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH". If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed. If the Attribute is a pre-failure Attribute, then disk failure is imminent.
The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.
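The rule quoted above (an attribute has failed when its normalized value is less than or equal to its threshold) can be applied mechanically to the attribute table. The following is a sketch only: the helper name failed_attrs is hypothetical, and it assumes the usual ten-column layout of the smartctl -A table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE).

```shell
#!/bin/sh
# failed_attrs (hypothetical helper): read `smartctl -A` output on
# standard input and print every attribute whose normalized VALUE
# (column 4) is less than or equal to its THRESH (column 6).
# Assumes the standard ten-column attribute table layout.
failed_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) {
        # $7 is the TYPE column (Pre-fail or Old_age)
        printf "%s %s value=%d thresh=%d type=%s\n", $1, $2, $4, $6, $7
    }'
}

# Example usage (replace sdX with your disk; needs root):
#   smartctl -A /dev/sdX | failed_attrs
```

No output means no attribute is at or below its threshold. A Pre-fail line in the output is the case the man page describes as imminent failure.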
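To set the drive's Advanced Power Management level on every boot, add the following line to /etc/rc.d/rc.local:
hdparm -B 254 /dev/sd?
A value of 254 is the highest-performance APM setting and does not permit spin-down.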
GPU
- Video Memory Stress Test is also available on SourceForge and UBCD. It has the limitation that it cannot always recognize the amount of video RAM properly, and the DOS version cannot recognize more than 512 MB of video RAM. From experience, it works well with integrated Intel cards, and not very well with Nvidia or ATI cards.
- CUDA GPU memtest requires either a video card that supports CUDA, such as an Nvidia card with the development Nvidia drivers and CUDA installed, or a video card that supports OpenCL, which can be an Nvidia or ATI card with OpenCL installed and supported by the drivers. The test is comprehensive and the authors claim it can detect soft and hard errors. From experience, it may not detect hard errors.
- If the above don't work, then you could just run a video game in benchmark mode and watch for graphical glitches on the screen or system crashes. The problem with this method is that there is no way to know if the drivers are at fault or if the card is at fault, unless you have prior experience with the game and the glitches or crashes are new.
- Being able to pinpoint the GPU as the source of a high-pitched squealing/screeching noise is reliable for diagnosing bad capacitors.
CPU
- Although memtest86+ tests the CPU as well as RAM, there is a more specific test:
- The torture test of the Great Internet Mersenne Prime Search (GIMPS) client is very sensitive to CPU errors in mode 1 (Small FFTs) and a bit less so in mode 2 (Large FFTs), and it will report any errors that occur. It is a great way to differentiate between a RAM error and a CPU error. Let it run until the CPU temperature is stable and then for as long as you like, in proportion to your suspicion. Errors are typically found rather quickly, so you don't have to wait too long.
To watch the CPU temperature while the test runs, set up lm_sensors. Run
sensors-detect
and then copy the modprobe lines it suggests to /etc/rc.d/rc.local, make it executable, and run it. To check temperatures run
sensors
or use a monitor of your choice. sensors will also list critical temperatures, but the most accurate temperatures are shown in the BIOS, so do a comparison.
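To tell when the CPU temperature has become stable during a stress test, it can help to log the hottest reading at intervals. The sketch below is an illustration under assumptions: the helper name max_temp is hypothetical, and it relies on the raw output of sensors -u, where each temperature appears on a line of the form tempN_input: value.

```shell
#!/bin/sh
# max_temp (hypothetical helper): read `sensors -u` output on standard
# input and print the highest tempN_input reading, in degrees Celsius.
# Lines such as temp1_max or temp1_crit are limits, not readings, and
# are ignored.
max_temp() {
    awk '$1 ~ /^temp[0-9]+_input:$/ {
        if (!seen || $2 > max) { max = $2 + 0; seen = 1 }
    }
    END { if (seen) printf "%.1f\n", max }'
}

# Example usage: log the hottest sensor every 5 seconds while the
# stress test runs in another terminal.
#   while sleep 5; do
#       printf '%s %s\n' "$(date +%T)" "$(sensors -u | max_temp)"
#   done
```

The highest reading may come from a chip other than the CPU, so compare the log against the plain sensors output at least once.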
Motherboard
- Sadly, there is no reliable software test for motherboard errors. The diagnosis is mostly a process of elimination. You can also take it to a shop and have them test the motherboard.
- Being able to pinpoint the motherboard as the source of a high-pitched squealing/screeching noise is reliable for diagnosing bad capacitors, a common problem with motherboards.
DVD+-RW
Burn a disk iso and run:
cmp input.iso /dev/sr0
It should say:
cmp: EOF on input.iso
If it does not say that, then the DVD/CD was not burned properly. However, this may also be caused by bad media or a high burn speed rather than by the drive. If the drive keeps spinning up and down while reading a disk, it could be failing, or you may be trying to play a commercial DVD whose region code is not supported by the drive, which limits it to 1x read speed.
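The "EOF on input.iso" message appears because the disc is read back with extra data after the end of the image, so cmp runs out of image before it runs out of disc. A possible variation, sketched here under the assumption that GNU cmp is installed (its -n option limits the number of bytes compared), is to compare only as many bytes as the image contains and rely on the exit status. The helper name verify_burn is hypothetical.

```shell
#!/bin/sh
# verify_burn ISO DEVICE
# Hypothetical helper: compare only the first size-of-ISO bytes, so
# padding after the end of the image is not reported as a difference.
# Assumes GNU cmp, whose -n option limits the comparison length.
# Exit status: 0 = identical, 1 = differs, 2 = cannot read an input.
verify_burn() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    size=$(wc -c < "$1" | tr -d ' ')
    cmp -n "$size" "$1" "$2"
}

# Example usage:
#   verify_burn input.iso /dev/sr0 && echo "burn verified"
```

With this form, no output and an exit status of 0 mean the disc matches the image.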
Mouse
Trying to eliminate driver and software issues would be a first step to diagnosing mouse problems. Try restarting Xorg, reloading mouse drivers, and restarting the system. If these have no significant effect, then it is likely a hardware issue.
Solutions
- At the bare minimum you should have a large 120mm output fan in the rear of the case for an ATX motherboard. For smaller systems it varies, and only a smaller fan may fit.
- Make sure that cables inside the case do not obstruct airflow, and if they do then use plastic cable ties to fix it if possible. Do not leave any stray metal inside the case that could short-circuit components.
- Make sure fans are placed so that they cause air movement across hot components, or at the very least evacuate hot air from the case.
RAM
Remove the DIMMs/RAM sticks one at a time, and check again with memtest86+ after each removal. If the test passes, the stick you just removed is the bad one. If it still fails, put that stick back, remove another, and run the test again. Rarely, more than one DIMM may be failing, so take that into account. It could also be the CPU if all your DIMMs fail the test.
A workaround is the
mem=
kernel option, which forces the kernel to use only good RAM. Say you have an error at 129.0 MB, like I did recently; you could use this kernel boot option:
mem=128M
and it will only use the first 128 MB of RAM, omitting the bad part. There is also an option to exclude a section of RAM (memmap=), but it is less tested and may not work.
PSU
Replace the PSU with a new one. If the symptoms disappear, you can be sure it was the PSU. However, note that a bad PSU may have damaged the motherboard or other components, or a power surge may have damaged everything.
Power Button
Try to clean the power button using compressed air. If that doesn't work then replace the power button if possible, otherwise replace the case along with the button or take it to a repair shop and see what they can do.
CMOS battery
Replace the battery carefully using the special tab. Do NOT pry it off using a screwdriver because then it will break and it won't go back in.
HDD
You should first get your important data off it. If you feel it is failing fast, use ddrescue to image the drive to another drive. Once you have the image, be it complete or incomplete, you can use TestDisk and/or foremost to carve data off the image. The data will not have the same file names it used to, but at least you will get the data. You can find all these utilities and more on the SystemRescueCD. Then replace the drive.
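As a hedged illustration of the imaging step, the sketch below only prints a two-pass GNU ddrescue sequence rather than running it, so that the device names can be checked first. The helper name rescue_cmds and the file names are hypothetical. The options are those of GNU ddrescue: -n skips the slow scraping phase on the first pass, -d uses direct disc access, and -r3 retries bad sectors three times.

```shell
#!/bin/sh
# rescue_cmds SOURCE IMAGE MAPFILE
# Hypothetical helper: print (do not run) a two-pass GNU ddrescue
# sequence. Double-check SOURCE before running the printed commands.
rescue_cmds() {
    src=$1 img=$2 map=$3
    # Pass 1: copy the easily readable areas first, no scraping (-n).
    printf 'ddrescue -n %s %s %s\n' "$src" "$img" "$map"
    # Pass 2: return to the bad areas with direct access (-d) and
    # up to three retry passes (-r3), reusing the same mapfile.
    printf 'ddrescue -d -r3 %s %s %s\n' "$src" "$img" "$map"
}

# Example usage (sdX, disk.img and disk.map are placeholders):
#   rescue_cmds /dev/sdX disk.img disk.map
```

Reusing the same mapfile on the second pass is what lets ddrescue skip everything it has already recovered. The image must be written to a different, healthy drive.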
GPU
Replace the video card. If it is too expensive to replace while you aren't sure it is at fault, which is very likely given how hard GPU faults are to confirm, try testing it in another machine, or test with a spare known-good video card and see if the symptoms persist.
CPU
If you suspect the CPU is overheating, then you can try to remove the heatsink-fan block according to your CPU or heatsink manual, clean off the old thermal paste, apply new thermal paste, or let an expert do it. Otherwise, replace the CPU. Make sure it is actually the CPU and not the RAM that is the problem, as a CPU is very expensive compared to RAM.
If only one core of a multi-core CPU is failing, for example CPU #2 on a 4-core machine, you can keep the kernel from using it with the
maxcpus=
kernel boot option. In the example you would use:
maxcpus=2
This is because CPU cores are numbered starting with 0, so 4 cores would be numbered 0, 1, 2, 3, and maxcpus=2 brings up only CPUs 0 and 1. Note that if CPU #0 is failing, you may be out of luck. Another complementary way is to first make sure you have CONFIG_HOTPLUG_CPU built into the kernel, and then run:
echo 1 > /sys/devices/system/cpu/cpu3/online
on every boot. In this example this turns ON CPU #3. So, basically you would have the 3 good cores running, i.e. 0, 1, 3.
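To confirm which cores ended up running, the online files under /sys/devices/system/cpu can be read back. This is a minimal sketch: the helper name cpu_states is hypothetical, and the optional directory argument exists only so the function can be tried against a copy of the tree. CPUs without an online file (often cpu0) cannot be switched and are not listed.

```shell
#!/bin/sh
# cpu_states [SYSFS_CPU_DIR]
# Hypothetical helper: print "cpuN online" or "cpuN offline" for each
# CPU that has an `online` control file. The directory defaults to
# /sys/devices/system/cpu.
cpu_states() {
    dir=${1:-/sys/devices/system/cpu}
    for f in "$dir"/cpu[0-9]*/online; do
        [ -r "$f" ] || continue          # no hot-pluggable CPUs found
        cpu=${f%/online}                 # strip the file name
        cpu=${cpu##*/}                   # keep only "cpuN"
        if [ "$(cat "$f")" = "1" ]; then
            echo "$cpu online"
        else
            echo "$cpu offline"
        fi
    done
}

# Example usage:
#   cpu_states
```

Writing 0 instead of 1 to a CPU's online file takes that core offline, which is another way to retire a bad core on a running system.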
Motherboard
Replace the motherboard, and make sure the PSU is not damaged, as it can damage your new motherboard.
DVD+-RW
Replace the drive.
Mouse
Replacing the mouse is the definitive solution. However, some fixes may help temporarily.
- For click-related problems, opening the mouse and investigating the cause of the problem is the first step. On most mice, there are 4 screws that hold the upper and lower mouse parts together and are located underneath stick-on slippery pads. Note that after removing the stick-on pads they may be difficult or impossible to replace, so only open the mouse if the problem is truly bothersome or if you have extra stick-on pads provided with some mice.
- It may be that plastic parts inside the mouse have worn down with repetitive use. A possible fix for this is to first identify areas that have been worn down and then carefully add plastic in the form of plastic cement or superglue to replace the lost plastic. This is best done with a toothpick and must be very carefully controlled. Remove excess glue immediately and allow the rest to fully dry overnight before using the mouse. Do NOT apply glue to any area that might interfere with mouse function. This is a last resort solution, so do NOT try it until you are ready to throw the mouse out. Duct tape may be used instead for a temporary or trial solution.
- It may also be that the plastic button arch has fatigued over time and doesn't provide enough resistance to prevent inadvertent activation of the switch. A possible fix for this is adding a plastic piece under the plastic button arch to improve its resistance. Depending on the mouse, either gluing a piece of old cut credit card under the plastic button arch or inserting the credit card piece with a piece of foam sandwiched underneath into the pocket under the plastic button arch may work.
- For wandering / jumping pointer problems, cleaning the optical / laser window on the underside of the mouse with alcohol and a q-tip may help. If it is a wireless mouse consider infrared or microwave interference depending on the wireless mouse type. Removing or isolating sources of interference or distancing oneself from the sources may help. Wireless mouse manuals recommend placing the base station away from electronic devices or other sources of electromagnetic interference.
- For acceleration issues one can try adjusting acceleration settings. However, in some cases the acceleration issues are intermittent, and may be driver issues that are much harder to solve. Trying different driver and software versions may help, or one can try to contact the developers of the drivers / software.
Sources
- Written by H_TeXMeX_H
- Contributions from: tobisgd, onebuck, metaschima
- I cite the man page of smartctl.