hardware-failure.md

title: Diagnose Hardware Failures
description: If the computer won't start, boot, or otherwise operate normally, there may be a hardware issue. Follow these steps to diagnose hardware failures.
keywords: memory, hard drive, won't boot, won't post, hardware
facebookImage: /_social/article
twitterImage: /_social/article
hidden: false
section: hardware-troubleshooting
tableOfContents: true

NOTE: If the system will not power on, skip to the end of this article.

If the system boots but takes a long time to boot, crashes, or reports other random, hard-to-track-down errors, the individual hardware components can be checked for failure.

Memory

We can test memory in your running OS with the 'memtester' package. You want to put most of your memory under test but still leave enough space for your normal workload and the OS to continue running. On an 8 GB system, 6 GB could be tested like this (the final argument, 5, is the number of passes):

sudo apt install memtester
sudo memtester 6G 5

A memory test can take several hours. This approach does not put all of memory under test, but a faulty region will still tend to reveal itself: either it falls inside the tested range and shows up clearly as errors in the memtester run, or it falls outside the tested range and is likely to cause instability while the test is running.
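The sizing rule above (test most of memory, leave headroom for the OS) can be sketched as a small shell helper. This is only an illustration: `memtester_size_mb` is a hypothetical function name, and the 2048 MB reserve in the example is an assumption chosen to match the 8 GB / 6 GB case, not a recommendation from memtester itself.

```shell
#!/bin/sh
# Sketch: pick a memtester size that leaves headroom for the OS.
# memtester_size_mb is a hypothetical helper, not part of memtester.
# Arguments: total memory in kB (as reported by MemTotal in /proc/meminfo)
#            and the amount to leave free, in MB.
memtester_size_mb() {
    total_kb=$1
    reserve_mb=$2
    size_mb=$(( total_kb / 1024 - reserve_mb ))
    # Refuse to suggest a zero or negative test size.
    if [ "$size_mb" -le 0 ]; then
        echo "not enough memory to reserve ${reserve_mb} MB" >&2
        return 1
    fi
    echo "$size_mb"
}

# Example (not run automatically): leave 2048 MB free and run 5 passes.
# total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# sudo memtester "$(memtester_size_mb "$total_kb" 2048)M" 5
```

If the system becomes unresponsive during the run, a larger reserve may be needed for the normal workload.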

Memtest86+ also has ISO downloads for personal use. You would boot from a USB drive made with the ISO. Right as memtest loads (blue screen), press F2 to enable multi-core mode. Wait at least 20 minutes for the tests to run, or until any errors are shown in red. If any errors are found, run it again in single-core mode and let it run overnight to confirm them. At least 6 to 8 passes are recommended. If memory errors show up, the memory stick should be replaced.

Hard Drive

To check the hard drive for disk failures, start the program Disks and select the hard drive on the left. Click the icon in the top right and choose SMART Data and Self-Tests, then click Start Self-test and choose the Extended test. This test takes a few hours to run and will give you a large amount of information about the health of the drive.

Most normalized values start at 100 and work their way down toward 0. The types "old-age" and "pre-fail" are normal labels and do not by themselves indicate a problem. Pay attention to the overall assessment and to how close each value is to its failure threshold, which is often 0.
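The same attribute table can also be read from a terminal with `smartctl -A` from the smartmontools package. The sketch below shows one way to see how far each normalized value is from its threshold. `smart_margins` is a hypothetical helper name, `/dev/sda` is an assumed device name, and the helper assumes the usual column layout of the `smartctl -A` table.

```shell
# Sketch: show how far each SMART attribute is from its failure threshold.
# smart_margins is a hypothetical helper; it reads the attribute table that
# `smartctl -A` prints (columns: ID# NAME FLAG VALUE WORST THRESH ...).
smart_margins() {
    # Only rows whose first field is a numeric attribute ID are considered.
    awk '$1 ~ /^[0-9]+$/ && NF >= 6 { printf "%s %d\n", $2, $4 - $6 }'
}

# Example (not run automatically; /dev/sda is an assumed device name):
# sudo smartctl -A /dev/sda | smart_margins
```

A margin at or near 0 means the attribute has reached its threshold and deserves a closer look.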

NVMe Drive

NVMe drives can't be checked with a SMART test through the Disks application, but the smartmontools package can be used instead. The nvme command used below to list drives comes from the separate nvme-cli package. Both can be installed with this command:

sudo apt install smartmontools nvme-cli

First, let's list the NVMe drives that are installed:

sudo nvme list

Under 'Node' you will see a device path for each drive, something like '/dev/nvme0n1'. To view the SMART data for that drive, run the following:

sudo smartctl -a /dev/nvme0n1
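The `smartctl -a` output for an NVMe drive is long, so it can help to pull out a few health fields. The sketch below filters for four of them. `nvme_health` is a hypothetical helper name, and the field labels are the ones smartctl normally prints for NVMe drives; if a drive reports different labels, the filter will simply print fewer lines.

```shell
# Sketch: pull the most telling health fields out of `smartctl -a` output
# for an NVMe drive. nvme_health is a hypothetical helper.
nvme_health() {
    # The trailing colon keeps "Available Spare Threshold" out of the match.
    grep -E '^(Critical Warning|Available Spare|Percentage Used|Media and Data Integrity Errors):'
}

# Example (not run automatically):
# sudo smartctl -a /dev/nvme0n1 | nvme_health
```

A non-zero Critical Warning or a growing count of Media and Data Integrity Errors is worth reporting to support.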

Testing the CPU

Using the stress-ng program

Run this command to install stress-ng and s-tui:

sudo apt -y install stress-ng s-tui

Using the s-tui program

Now run this command:

s-tui

From here, use the Down arrow key to move from Monitor to Stress, then press Enter to select it. Now watch the CPU temperatures rise as the system's CPU is tested.
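stress-ng can also be run directly from a terminal, without the s-tui interface. The sketch below is one possible invocation, not the only one. `run_cpu_stress` and the `DRY_RUN` variable are hypothetical names introduced for illustration, and the 10-minute default is an assumption.

```shell
# Sketch: run stress-ng directly instead of through s-tui.
# run_cpu_stress is a hypothetical wrapper; set DRY_RUN=1 to print the
# command instead of running it.
run_cpu_stress() {
    minutes=${1:-10}
    # --cpu 0 starts one worker per online CPU; --timeout ends the run;
    # --metrics-brief prints a short summary when the run finishes.
    set -- stress-ng --cpu 0 --timeout "${minutes}m" --metrics-brief
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (not run automatically): stress all CPU cores for 10 minutes.
# run_cpu_stress 10
```

Crashes, freezes, or a sudden shutdown during the run point toward a CPU, cooling, or power problem.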

Testing the GPU

Benchmarking

We can confirm whether there is an issue with the GPU in your system by using a benchmarking tool called Unigine Heaven.

Click the 'Free Download' button and choose the Linux option in the dropdown. Once the download is complete, there should be a Unigine_Heaven-4.0.run file in the Downloads directory.

From a terminal, navigate to the folder with the Unigine Heaven download:

cd Downloads

Make the downloaded file executable:

chmod +x Unigine_Heaven-4.0.run

Then, the application can be extracted:

./Unigine_Heaven-4.0.run

Next, let's move to the new directory that was created:

cd Unigine_Heaven-4.0/

Now, the application can be started:

./heaven

Click the 'Run' button to begin the program.

GPU Burn (for NVIDIA GPUs only)

We can also test the GPU by using GPU Burn; first, if we're on Ubuntu, we'll need to install git and CUDA with this command:

sudo apt install git system76-cuda-latest 

Then, we will create the symlink that gpu-burn expects (adjust cuda-11.2 to match the CUDA version that was actually installed):

sudo ln -s /usr/lib/cuda-11.2 /usr/local/cuda
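The path above names one specific CUDA version. If a different version is installed, the directory name will differ. As a rough sketch, the newest versioned directory could be located like this; `latest_cuda_dir` is a hypothetical helper, and it assumes the CUDA directories live under /usr/lib and are named cuda-VERSION, as in the command above.

```shell
# Sketch: find the newest versioned CUDA directory rather than hard-coding 11.2.
# latest_cuda_dir is a hypothetical helper; /usr/lib is the location used above.
latest_cuda_dir() {
    base=${1:-/usr/lib}
    # sort -V (GNU coreutils) orders version numbers naturally: 9.0 < 11.2 < 12.1.
    for d in "$base"/cuda-*; do
        [ -d "$d" ] && printf '%s\n' "$d"
    done | sort -V | tail -n 1
}

# Example (not run automatically):
# sudo ln -s "$(latest_cuda_dir)" /usr/local/cuda
```

If the helper prints nothing, no cuda-VERSION directory was found under the given location.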

Next, we can clone the repository with this command:

git clone https://github.com/wilicc/gpu-burn.git

Now that we have cloned it, we can move into that directory like so:

cd gpu-burn

Now we'll compile it:

make

And now we can run it like so (this example runs it for 3600 seconds, or 1 hour; the -d flag selects double-precision math):

./gpu_burn -d 3600
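The run time is given to gpu_burn in seconds, so other durations need converting. The sketch below does that arithmetic and prints the resulting command. `gpu_burn_cmd` is a hypothetical helper, and the nvidia-smi line in the comments is one assumed way to watch temperatures from a second terminal.

```shell
# Sketch: build a gpu_burn command line for a duration given in minutes.
# gpu_burn_cmd is a hypothetical helper; it only prints the command.
gpu_burn_cmd() {
    minutes=$1
    # gpu_burn takes its run time in seconds.
    echo "./gpu_burn -d $(( minutes * 60 ))"
}

# Example (not run automatically): print the command for a 30-minute run.
# gpu_burn_cmd 30
#
# In a second terminal, temperatures can be watched while the burn runs:
# nvidia-smi --query-gpu=temperature.gpu,utilization.gpu --format=csv -l 5
```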

Machine Check Exceptions

Machine Check Exceptions (MCEs) are hardware error events reported by the CPU. The rasdaemon service can log them to the system journal, where they can be read with journalctl. On Ubuntu-based systems (including Pop!_OS), it can be installed with:

sudo apt install rasdaemon

Verify that rasdaemon is active:

systemctl status rasdaemon

Then, after the system has crashed or been used for a period of time, take a look at the log:

journalctl -f -u rasdaemon

If there is no log or the log is empty, no MCE was recorded, so the crash is less likely to be caused by a hardware failure that the CPU can report. The log will stay empty until an MCE happens. Look for "uncorrected" errors, as most "corrected" errors can be ignored. If "uncorrected" errors appear consistently, the hardware should be examined.
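Counting the "uncorrected" entries can make a pattern easier to spot than reading the log by eye. The sketch below is a minimal filter. `count_uncorrected` is a hypothetical helper, and it assumes the word "uncorrected" appears in the relevant log lines, as described above.

```shell
# Sketch: count journal lines that mention uncorrected errors.
# count_uncorrected is a hypothetical helper that reads log text on stdin.
count_uncorrected() {
    # -c counts matching lines, -i ignores case. Lines that only say
    # "corrected" do not match. "|| true" keeps a zero count from being
    # treated as a failure, since grep exits non-zero when nothing matches.
    grep -ci 'uncorrected' || true
}

# Example (not run automatically): check the current boot.
# journalctl -u rasdaemon --no-pager | count_uncorrected
#
# After a crash and reboot, the previous boot can be checked with -b -1
# (this requires the journal to be stored persistently):
# journalctl -b -1 -u rasdaemon --no-pager | count_uncorrected
```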

Won't Power On

NOTE: If the system fails to power on, please use the article for your system type to troubleshoot: Desktops or Laptops.

Support

Please contact support by opening a ticket to get the system repaired or to have failed components replaced.