<!-- Reviewed 20130110 by hazel -->
This is an attempt to document symptoms, open-source diagnostic software, and solutions to hardware failure.
Before working inside the case you should:
- Power off your computer, turn off the PSU, and unplug it.
- Wear an antistatic wrist strap.
- Ground yourself by first touching the PSU.
- Be careful not to damage motherboard components with sharp, hard tools.
- Don't force components out; check the manual for how to remove them.
- Don't do anything you are not comfortable with.
- If you don't know what you are doing, get an expert to do it.
- Don't have water or other conductive liquids near the computer or work area.
- Don't leave any metallic or conducting objects inside the case as they may short-circuit components.
- The computer typically does not POST or boot properly, but it will try.
- Each boot typically causes different symptoms, i.e. the boot will halt in a different place each time.
- The fans are typically running, often at 100%.
- There may be beep codes, so check your manual or wiki to see what they mean.
- Randomly occurring kernel panics and segmentation faults in running programs.
- Sudden shutdown or reboot or hang without warning or logs of what went wrong.
- Sudden shutdown or reboot or hang may occur during high system load or power usage or even when idle.
- Sudden system hangs that cannot be recovered with the SysRq REISUB key sequence, during which you cannot SSH into the system. Audio that is playing may loop continuously during the hang.
- You may notice a strange smell coming directly from the PSU as it overheats.
- The system may not boot after pressing the power button, and you may need to press it more than once.
- Rarely it dies completely, for example after a power surge: the fans will not run, it will not POST or boot, and there may be motherboard damage.
- The time is reset on each boot.
- The BIOS settings reset on boot.
- I/O errors in the logs.
- Filesystem corruption.
- May POST but may not finish booting properly.
- Disk access slows down right before failure.
- Strange noises such as clicks, grinding, and spinning up noises.
- The BIOS will not detect a completely dead disk.
- Graphical glitches on screen.
- If it is really dead the screen will be black.
- May cause system hangs, and audio may loop during the hang.
- There may be messages in the logs relating to the video drivers or an Xorg crash or nothing at all.
- May or may not POST.
- Kernel panics are possible with multi-core machines, when only one core is affected.
- If beep codes are heard, check the wiki.
- The fans are typically running at 100%.
- If it is overheating it may trigger an MCE (machine-check exception), which causes a kernel panic and forced shutdown, or it may throttle itself down so that the system seems slower.
- Check for swollen capacitors.
- A dead motherboard does not POST.
- The fans connected to the motherboard will not be running; the ones connected to the PSU may be running at 100%.
- Won't read or write disks properly.
- May keep spinning up and down forever while trying to read a disk.
- May not open when you press the button, but instead make some clunking sounds and open only after many button presses.
- Run one of the following programs. Errors are typically found during the first run, but do more runs proportional to your suspicion.
- There is no specific test for the PSU, but you should monitor its voltages. You can usually do this in the BIOS, which gives the most accurate readings.
- As a rule of thumb, each rail should read at or slightly above its nominal voltage: the +3.3V line above 3.3V, the +5V line above 5V, and most importantly the +12V rail above 12V. (The ATX specification allows ±5% on these rails, so a steady reading slightly below nominal is not by itself proof of a fault.) Note that this does NOT mean that you should increase the voltage if it is low; this is usually done automatically by the BIOS.
- Normally, voltages should be stable around a certain value, above the critical value. If the PSU is failing, the voltages vary quickly and sit close to the critical value. For example, a good PSU might show a stable 3.35V on the +3.3V line (just an example), while a bad PSU might show a +3.3V line that quickly varies between 3.30V and 3.32V.
- Planning ahead, write down the voltages when you first get a new PSU, keep them for reference, and monitor them over time.
- If there is no option in the BIOS to monitor the voltages, you can use a voltmeter or multimeter to measure them directly on the connectors. The pinout for the connectors can be found here: ATX 20-pin, ATX 24-pin. Alternatively, special meters for PSUs are available in electronics shops.
- In theory, it shuts down, reboots, or hangs under increased power usage because it cannot provide the needed current and overheats as a result. As such, perhaps the best way to diagnose it is by monitoring voltages under load, so you should install lm_sensors and configure it using sensors-detect, and then use a monitoring program to monitor voltages and warn when they go below the given limit. Then you load the system with some CPU or GPU intensive application and wait for the warnings or sudden shutdown/reboot/hang. Technically, the shutdown/reboot/hang can occur even when idle, and the voltage drop may be too fast for any alarm to catch.
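The stability check described above can be sketched in Python. This is only an illustration: `assess_rail`, `min_headroom`, and `max_spread` are hypothetical names, the thresholds are assumptions rather than values from any specification, and the samples are expected to come from whatever monitoring program you use.

```python
def assess_rail(nominal, samples, min_headroom=0.03, max_spread=0.015):
    """Summarise voltage samples (in volts) for one PSU rail.

    min_headroom and max_spread are illustrative assumptions, not
    values taken from a specification; tune them to your own baseline.
    """
    if not samples:
        raise ValueError("need at least one voltage sample")
    lowest, highest = min(samples), max(samples)
    spread = highest - lowest      # how much the rail wanders
    headroom = lowest - nominal    # distance of the worst sample above nominal
    return {
        "min": lowest,
        "max": highest,
        "below_nominal": headroom < 0,
        "low_headroom": headroom < min_headroom,
        "unstable": spread > max_spread,
    }


# The two +3.3V examples from the text: a steady 3.35V, and a rail
# that wanders between 3.30V and 3.32V.
good = assess_rail(3.3, [3.35, 3.35, 3.35])
bad = assess_rail(3.3, [3.30, 3.32, 3.31, 3.30])
print("good:", good["unstable"], good["low_headroom"])
print("bad: ", bad["unstable"], bad["low_headroom"])
```

Comparing each new set of samples against the baseline you wrote down for a fresh PSU is likely to be more informative than any fixed threshold.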
- You could take the battery out, using the special tab to release it rather than trying to pry it off, and measure its voltage. Or you could simply dispose of it in a proper battery disposal container and replace it anyway, just to be sure: you may have had to remove a graphics card or PCI/PCIe card to reach the battery, and you may not have a tester for this type of battery.
- You can either run a smartctl long test, which tests the entire disk surface for errors and updates the offline attributes, or you can run the manufacturer's own proprietary utility. smartctl will also show the HDD temperature and airflow temperature. Make sure the temperatures are below 60 °C.
smartctl -t long /dev/sd?
You then have to wait the time it estimates, plus a few more minutes for the result which you can check with
smartctl -a /dev/sd?
The attributes are listed, but you can check them separately with
smartctl -A /dev/sd?
Here is an important note on attributes from the smartctl man page:
Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH". If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed. If the Attribute is a pre-failure Attribute, then disk failure is imminent. The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.
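The rule quoted above can be applied mechanically to the table that `smartctl -A` prints. The following Python sketch assumes the usual ten-column attribute table; `failed_attributes` and `SAMPLE` are hypothetical names, and the sample rows are made up for illustration rather than taken from a real disk.

```python
# Made-up sample in the layout of the `smartctl -A` attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   140    Pre-fail  Always   FAILING_NOW 1200
194 Temperature_Celsius     0x0022   112   099   000    Old_age   Always       -       31
"""


def failed_attributes(smartctl_output):
    """Return (name, type, value, thresh) for every attribute whose
    normalized VALUE is less than or equal to its THRESH."""
    failed = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, kind = fields[1], fields[6]
        value, thresh = int(fields[3]), int(fields[5])
        if value <= thresh:
            failed.append((name, kind, value, thresh))
    return failed


for name, kind, value, thresh in failed_attributes(SAMPLE):
    print(f"{name} ({kind}): value {value} <= threshold {thresh}")
```

A failed attribute of type Pre-fail is the case the note calls imminent disk failure; a failed Old_age attribute points to wear-out instead.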
hdparm -B 254 /dev/sd?
- Video Memory Stress Test is also available on SourceForge and UBCD. It has the limitation that it cannot always recognize the amount of video RAM properly, and the DOS version cannot recognize more than 512 MB of video RAM. From experience, it works well with integrated Intel cards, and not very well with NVIDIA or ATI cards.
- CUDA GPU memtest requires either a video card that supports CUDA, such as an NVIDIA card with the development NVIDIA drivers and CUDA installed, or a video card that supports OpenCL, which can be an NVIDIA or ATI card with OpenCL installed and supported by the drivers. The test is comprehensive, and the authors claim it can detect soft and hard errors. From experience, it may not detect hard errors.
- If the above don't work, then you could just run a video game in benchmark mode and watch for graphical glitches on the screen or system crashes. The problem with this method is that there is no way to know if the drivers are at fault or if the card is at fault, unless you have prior experience with the game and the glitches or crashes are new.
- Although memtest86+ tests the CPU as well as RAM, there is a more specific test:
- Great Internet Mersenne Prime Search is very sensitive to CPU errors in mode 1 (Small FFTs) and a bit less so in mode 2 (Large FFTs), and it will report any errors that occur. It is a great way to differentiate between a RAM error and a CPU error. Let it run until the CPU temperature is stable, and then for as long as you like, in proportion to your suspicion. Errors are typically found rather quickly, so you don't have to wait too long.
and then copy the modules it needs modprobe'd to /etc/rc.d/rc.local, make it executable, and run it. To check temperatures run
sensors
or you can use a monitor of your choice. sensors will also list critical temperatures, but the most accurate temperatures are shown in the BIOS, so do a comparison.
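Comparing the reported temperatures with the critical ones can be sketched in Python as follows. The line format matched here is an assumption based on common `sensors` output and varies between chips and drivers; `near_critical`, `TEMP_LINE`, `SAMPLE_SENSORS`, and the 10-degree margin are hypothetical choices, and the sample text is made up.

```python
import re

# Matches lines such as:
#   Core 0:  +45.0°C  (high = +80.0°C, crit = +100.0°C)
TEMP_LINE = re.compile(
    r"^\s*(?P<label>[^:]+):\s+(?P<temp>[+-]?\d+(?:\.\d+)?)\s*°C"
    r".*?crit\s*=\s*(?P<crit>[+-]?\d+(?:\.\d+)?)\s*°C"
)

# Made-up sample in the style of `sensors` output.
SAMPLE_SENSORS = """\
coretemp-isa-0000
Adapter: ISA adapter
Core 0:        +45.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:        +93.0°C  (high = +80.0°C, crit = +100.0°C)
"""


def near_critical(sensors_output, margin=10.0):
    """Return (label, temp, crit) for readings within `margin` degrees
    of their critical temperature. Lines without a crit value are skipped."""
    hot = []
    for line in sensors_output.splitlines():
        match = TEMP_LINE.match(line)
        if match is None:
            continue
        temp = float(match.group("temp"))
        crit = float(match.group("crit"))
        if temp >= crit - margin:
            hot.append((match.group("label").strip(), temp, crit))
    return hot


for label, temp, crit in near_critical(SAMPLE_SENSORS):
    print(f"{label}: {temp} C is within 10 C of critical ({crit} C)")
```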
- Sadly, there is no reliable software test for motherboard errors. The diagnosis is mostly a process of elimination. You can also take it to a shop and have them test the motherboard.
Burn a disc from an ISO image and run:
cmp input.iso /dev/sr0
It should say:
cmp: EOF on input.iso
If it does not say that, then the DVD/CD was not burned properly. However, the cause may also be bad media or too high a burn speed. If the drive keeps spinning up and down while reading a disc, it could be failing, or you may be trying to play a commercial DVD whose region code is not supported by the drive, which limits it to 1x read speed.
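The same check can be sketched in Python, which makes the logic behind the `cmp` message explicit: the disc may hold extra data beyond the end of the image, so only the first bytes, up to the size of the image, need to match. `burned_copy_matches` is a hypothetical helper name, and the paths in the usage comment are placeholders.

```python
import os


def burned_copy_matches(image_path, device_path, chunk_size=1024 * 1024):
    """Return True if the device starts with exactly the bytes of the image.

    Extra data on the device after the end of the image is ignored,
    which mirrors cmp reporting EOF on the image file.
    """
    remaining = os.path.getsize(image_path)
    with open(image_path, "rb") as image, open(device_path, "rb") as device:
        while remaining > 0:
            want = min(chunk_size, remaining)
            expected = image.read(want)
            actual = device.read(want)
            # A short or different read from the device means a bad copy.
            if not expected or expected != actual:
                return False
            remaining -= len(expected)
    return True


# Usage (placeholder paths; reading the device usually requires suitable permissions):
#   burned_copy_matches("input.iso", "/dev/sr0")
```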
- At the bare minimum you should have a large 120mm exhaust fan in the rear of the case for an ATX motherboard. For smaller systems it varies, and only a smaller fan may fit.
- Make sure that cables inside the case do not obstruct airflow, and if they do then use plastic cable ties to fix it if possible. Do not leave any stray metal inside the case that could short-circuit components.
- Make sure fans are placed so that they cause air movement across hot components, or at the very least evacuate hot air from the case.
Remove the DIMMs/RAM sticks one at a time, checking again with memtest86+ after each removal. If the test still fails, put the stick back, remove another, and run the test again; the stick whose removal makes the errors disappear is the faulty one. Rarely, more than one DIMM may be failing, so take that into account. If all your DIMMs fail the test, it could also be the CPU.
You can also use the mem= kernel option to force the kernel to use only good RAM. Say you have an error at 129.0 MB, like I did recently; you could use this kernel boot option:
mem=128M
and it will only use the first 128 MB of RAM, omitting the bad part. There is also an option to exclude a section of RAM, but it is less tested and may not work.
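The arithmetic behind that choice can be sketched in Python. The text goes from an error at 129.0 MB to a 128 MB limit without stating a rule, so the one-megabyte safety margin used here is an assumption, and `mem_option` is a hypothetical helper name.

```python
import math


def mem_option(first_bad_mb, margin_mb=1):
    """Build a mem= boot option that stops short of the first bad address.

    first_bad_mb is the lowest failing address reported by the memory
    test, in megabytes. margin_mb is an assumed safety margin.
    """
    usable_mb = math.floor(first_bad_mb - margin_mb)
    if usable_mb < 1:
        raise ValueError("the first bad address is too low to leave usable RAM")
    return f"mem={usable_mb}M"


print(mem_option(129.0))  # mem=128M
```

Everything above the limit is discarded, so this only makes sense when the first error sits near the top of the installed RAM.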
Replace the PSU with a new one. If the symptoms disappear, you can be fairly sure it was the PSU. However, note that a bad PSU may damage the motherboard or other components, or it may have been a power surge that damaged everything.
Replace the battery carefully using the special tab. Do NOT pry it off using a screwdriver because then it will break and it won't go back in.
You should first get your important data off the disk. If you feel it is failing fast, use ddrescue to image the drive to another drive. Once you have the image, be it complete or incomplete, you can use TestDisk and/or foremost to carve data off the image. The recovered files will not have their original names, but at least you will get the data. You can find all these utilities and more on the SystemRescueCD. Now just replace the drive.
Replace the video card. If it is too expensive and you aren't sure it is at fault, which is quite likely given how hard a video card is to diagnose, try testing it in another machine, or swap in a spare known-good video card and see if the symptoms persist.
If you suspect the CPU is overheating, then you can try to remove the heatsink-fan block according to your CPU or heatsink manual, clean off the old thermal paste, apply new thermal paste, or let an expert do it. Otherwise, replace the CPU. Make sure it is actually the CPU and not the RAM that is the problem, as a CPU is very expensive compared to RAM.
Replace the motherboard, and make sure the PSU is not damaged, as it can damage your new motherboard.
Replace the drive.