GPU upgrade is causing all sorts of problems that look possibly CPU related



I currently have two PCs I am using: an 11700K-based system with an 850-watt PSU that I built myself, and a Dell Precision T3610 with a 675-watt PSU. The Dell was supposed to be a backup system, but the 11700K system, while functional, is only half-built and it will be a while until I can finish it, so I am mostly using the "backup" Dell as my main for now and remoting into the 11700K when I need it.
 
The Dell originally had a Xeon E5-1620v2 and 16GB of RAM, but has been upgraded to a Xeon E5-2667v2 and an 8x16GB RAM configuration. When I upgraded the RAM I very rarely got an error, and only in OCCT. It only happened when I tested all 128GB, never within the amount I generally use, and no other RAM test ever showed an error. It eventually seemed to go away.
 
Since I will be stuck on this Dell for longer than I expected, I decided to also get a GPU upgrade so I can at least game on it a little as the Quadro K4000 it came with was rather useless for that. I got an HP OEM 2060 Super, these are the specs:  
 
https://i.imgur.com/WrCo4JJ.jpg  
 
https://i.imgur.com/K8xN64m.gif  
 
https://www.techpowerup.com/gpu-specs/hp-rtx-2060-super-oem.b10625/  
 
Since I wanted to make sure this card worked on its own, I installed it in my 11700K system first and ran every benchmark, stress test, and any other test I could think of on it. I ran a neural network for a short time too and played several games over the course of a week or two. No issues whatsoever; according to the benchmarks it even matched the expected performance of this OEM model exactly.
 
I then installed it in the Dell (yes, I used DDU in safe mode first). I had to use a 2x6-pin to 1x8-pin cable to power it, but many others have done something similar with this exact model of desktop. It looked like this when installed:
 
https://i.imgur.com/IRwaHXR.jpg  
 
https://i.imgur.com/mWEvDnX.jpg  
 
https://i.imgur.com/n64XbRu.jpg  
 
The side cover was left off so I could monitor the wiring to make sure nothing is getting too warm or showing obvious faults, the cover is still off right now, not sure if this makes a difference.  
 
I wanted to make sure the PSU could handle it, so I ran Furmark's stress test and saw no issues. Then I ran Prime95 and saw no issues. Then I ran both at the same time and walked away for a minute; when I came back the system had rebooted.
 
The event log showed nothing other than an unexpected reboot, and Prime95+Furmark then ran together for 20 minutes with no issues, so I figured it was a one-off and went on to run the same tests on the Dell that I had run on the 11700K system. All of them passed again (although OCCT at first claimed the CPU benchmark was crashing, with nothing in the event logs). Some things I thought were issues, such as mentions of nvlddmkm crashing in the event log, turned out to be red herrings caused by a demo I was using for testing, which is a known issue for many people. One issue that did crop up, however, is that the GPU ran much hotter, which is no surprise as the Dell's case is much smaller and has much lower airflow. The card was around 70C-80C in the 11700K system, but it sat constantly at 80C and often hit 85C in the Dell.
 
I thought I was done, but two days later, while I was using the PC normally and not even doing anything demanding on it, my monitor went blank and claimed there was no signal. It looked like it was constantly about to get the signal back and then losing it again. I tried soft-rebooting/shutting down by pressing the power button, Win+X then U U, and mashing Ctrl+Alt+Del, but nothing happened, so I forced a hard power off.
 
Event Viewer said nvlddmkm had crashed three times, and then it was flooded with warnings about the display driver having "successfully" recovered.  
 
I tried OCCT's benchmark again, as that was the only thing that had claimed to crash more than once in the same test. I ran every CPU benchmark separately several times, and then the whole suite several times... no problems.
 
I then tried the CPU stress test.... I had never heard my fans go that hard before, but it seemed to be working, at least from what I could tell when I could focus on my screen through the fan noise. After about 20 minutes, however, the same no-signal issue happened.
 
Now I am just confused what the issue could possibly be. First thing that instantly comes to mind is the GPU, and I am still barely within the time window to claim it's defective on eBay if it is the GPU, but it never gave me any problems no matter how hard I pushed it in the 11700K system. I also considered maybe it's the power, but 675 watts should be enough unless this cabling is wrong (I was a little confused why the 8-pin power that connects to the PSU side only had 6 wires populated) or the PSU is just too old. Also wondered if it's heat (and if having the side panel off contributed to this) since the GPU is running much hotter and when I stressed the CPU I never heard my fans run that hard before. Or it could be a faulty CPU or even RAM after all, but if it's the CPU or RAM that's faulty and crashing the system why am I getting an error that it's the GPU driver that crashed? 
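To put rough numbers on the power question, here is a minimal back-of-the-envelope sketch. The connector limits are the nominal PCI Express figures (75W from the slot, 75W per 6-pin, 150W per 8-pin). The 175W GPU and 130W CPU figures are assumed reference ratings, not measurements from this machine, and the HP OEM card may differ. The function names are made up for illustration.

```python
# Nominal per-source limits from the PCI Express power spec (watts).
PCIE_SLOT_W = 75
SIX_PIN_W = 75
EIGHT_PIN_W = 150

# Assumed figures (NOT measured on this system):
GPU_BOARD_POWER_W = 175  # reference RTX 2060 Super rating; the OEM card may differ
CPU_TDP_W = 130          # Intel's rated TDP for the Xeon E5-2667 v2
PSU_RATED_W = 675        # Dell T3610 PSU rating, from the post


def gpu_supply_limit(six_pin_plugs: int) -> int:
    """Nominal power available to a card fed by the slot plus N 6-pin plugs."""
    return PCIE_SLOT_W + six_pin_plugs * SIX_PIN_W


def headroom(rated_w: int, load_w: int) -> int:
    """Watts left between a steady-state load and the PSU's rated total."""
    return rated_w - load_w


if __name__ == "__main__":
    limit = gpu_supply_limit(six_pin_plugs=2)
    print(f"slot + 2x6-pin nominal limit: {limit} W")
    print(f"margin over assumed GPU board power: {limit - GPU_BOARD_POWER_W} W")
    steady = CPU_TDP_W + GPU_BOARD_POWER_W
    print(f"PSU headroom over CPU+GPU alone: {headroom(PSU_RATED_W, steady)} W")
```

On paper the adapter and the 675W rating leave a comfortable margin. This arithmetic only covers steady-state ratings, though. It says nothing about an aging PSU, ripple, per-rail limits, or short transient spikes, so it cannot rule the power supply out.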
 
I have no idea where to even begin considering what could be the fault, and how to try to fix it. Especially in a way that won't leave my system down for days or even weeks while I try to hunt down replacement parts. At the same time since I use this system 24/7 I don't want it just randomly crashing on me and forcing a shutdown, especially since that can cause data corruption. 
 
Does anyone have any advice?


Just something quick to try off the top of my head...

Try the NEW PC's 850-watt PSU in the old Dell system (assuming you can) and see if that acts up. If it's stable with that PSU, I would lean towards the 675-watt PSU in the old Dell being at fault (or not supplying enough juice, etc.).

P.S. You could even try the 675-watt PSU on the newer board to see if it reacts similarly.


Try running the stress tests and games with the framerate locked using something like the driver itself or rivatuner, etc. If the Dell system no longer crashes then it probably is the PSU. (Or maybe the motherboard.)


I tried updating my drivers to the latest and testing again.

Literally spent all day running every stress test OCCT has. Both CPU tests, RAM test, all GPU tests, even the Power test that rocketed my power usage to 450W+

No problems whatsoever.

I figured those must have been a random occurrence so I assumed it was working fine now and started playing a game.... 10 minutes into the game the system crashed in the exact same way again. Image freezes for about 2 seconds, then I get the "no signal" message on my monitor. I could still hear the game's music in the background but it didn't appear to actually still be running since I could not hear anything else actually happening in the game.

I am starting to lose my sanity. Every stress test I throw at it shows no issues, but then when I just normally use my system I get that crash.


On 02/03/2023 at 20:38, Cyber Akuma said:

Every stress test I throw at it shows no issues, but then when I just normally use my system I get that crash.

This is why stress tests often don't help. Also, it looks like you have tested everything except memory? So why not do that too, I suppose. HCI MemTest and yCruncher are two of the best free tools for stressing RAM.


Do you mean my system RAM or the VRAM? OCCT had tests for both. If you mean system RAM I also ran hours of Memtest86, Memtest86+ and Prime95 when I first upgraded the RAM. Originally there was a WHEA error (but not a crash) months ago but that never appeared again since.


System RAM. I have had better luck with HCI and yCruncher than with Memtest86.

Also for vram, I recently ran some ray tracing tests using Bright Memory benchmark tool, and at 8K, it almost completely maxed out the 16GB vram. https://www.neowin.net/news/amds-2321-windows-10--11-driver-is-no-threat-to-nvidias-ray-tracing-crown-anytime-soon/ so maybe you can try this.


That was actually one of the benchmarks I ran, had no issue running it on either system with this card. It would not run past 1080p though since my monitors are only 1080p. Ran several other raytracing benchmarks too.


On 02/03/2023 at 23:27, Cyber Akuma said:

It would not run past 1080p

I believe you'd need to enable DSR on your Nvidia card. After that, a 4K RT test should be enough for the 8GB on your 2060 Super. And I still recommend HCI and yCruncher; they are really good for testing unstable system memory.

Also, I think it is a CPU (or motherboard) issue you are having. Sometimes faulty CPUs can be fine when stress testing but give up during games and real-world workloads, but it is really hard to be sure. And of course, it could be a PSU issue as well: maybe it is not giving enough power or not handling the power spikes (which is why I said to try a locked framerate and see if it no longer crashes).


Yeah, that's a bit of the problem. It could be ANYTHING, I was hoping to narrow it down instead of just tossing money at parts to see what, if anything, fixes it.

The CPU is only like $20 on eBay, so that's not an issue. The PSU is a bit more. But I already blew $175 on this card, and I had to save up for that, so I am hoping not to just toss money around everywhere trying to fix it if I can't at least narrow it down.

By the way, what do you think the chances are it's the GPU? The seller does not accept returns but would have to if it's defective. The fact that I had no problems in the other system leads me to think it's not the GPU though.


What's missing from your new PC that's preventing you from switching over fully? We're getting into wild goose chase territory on the Dell, it sounds like.

Since you suspect the CPU, have you tried reseating it and applying fresh thermal paste yet? Those are two easy enough things to do that will at least rule out iffy pin connections and possible dry/uneven spots, if anything.


Cabling for some of the drives, replacement fans, some drives, and I need to do some manual modifications. After that it's a lot of trying to restore software. It would be a lot faster to try to get the Dell working for now. I wanted this to become a viable backup system if my main one goes down again anyway.

I mean, I installed that CPU last October with fresh paste, and all the tests back then, plus using the system 24/7 since then, showed no issues. The day I installed the GPU, I started getting issues. I could try reseating it again over the weekend if you think that really could be it, but I don't want to stress the pins of the CPU socket too much with constant CPU reinstalls if it's unnecessary, since these LGA sockets are pretty delicate.


It's almost always PSU issues when things start randomly going wrong like this. Especially after adding something power hungry like a GPU. 

The key word is almost, so do the basics and try thermal paste and reseating as @Brandon H suggested first; it's cheap and somewhat easy to do.

If you're concerned about reseating the CPU, then just check the paste and all the cables.

Edited by Joe User

Narrowing it down using stress tests or canned benchmarks almost never works unless it's a memory issue or a lack-of-power issue, because real-world workloads, including gaming, dynamically hit different components depending on user input. Often that can point towards a CPU issue.

As for the GPU, you did say you started having issues with your PC right after installing it, so who knows. You said you lost signal as soon as you started gaming, so it could very well be the GPU too. I wish you had a cheap spare GPU lying around, but sadly options may be limited since this is an OEM PC.

 

Also curious to know if you have flashed the vBIOS? https://www.techpowerup.com/vgabios/220009/220009  


On 03/03/2023 at 01:37, Cyber Akuma said:

My GPU's BIOS seems to be a newer version than the one listed there.

That's really interesting, but newer may not mean more stable. So perhaps you can back up your current BIOS and try the older one? Who knows, it could work out!


On 02/03/2023 at 14:04, hellowalkman said:

Narrowing it down using stress tests or canned benchmarks almost never works unless it's a memory issue or a lack-of-power issue, because real-world workloads, including gaming, dynamically hit different components depending on user input. Often that can point towards a CPU issue.

As for the GPU, you did say you started having issues with your PC right after installing it, so who knows. You said you lost signal as soon as you started gaming, so it could very well be the GPU too. I wish you had a cheap spare GPU lying around, but sadly options may be limited since this is an OEM PC.

 

Also curious to know if you have flashed the vBIOS? https://www.techpowerup.com/vgabios/220009/220009  

I mean, I can get a new CPU for $15-20, so would it be better to just outright replace the CPU in case it's that? Although I still strongly suspect it's not the CPU, since it would have caused issues before, unless it very specifically has something wrong with a PCIe lane that the older card never used.

On 02/03/2023 at 14:15, hellowalkman said:

That's really interesting, but newer may not mean more stable. So perhaps you can back up your current BIOS and try the older one? Who knows, it could work out!

I have to be honest, I don't see the point of downgrading my BIOS to some random older one when the card worked fine in another system. That sounds like one of those "throw everything at the wall and see what sticks" attempts that could wind up causing damage. For example, I upgraded the BIOS of one of my older cards a year ago to FIX a defect that could result in physical damage.


Well great, it JUST happened again, and this time the RTX2060 isn't even in my system! About a week ago I swapped it out temporarily for a GT720 until I can figure this out. I thought for sure it was the RTX2060, since from the day I installed it my system would crash within roughly 24 hours; at most I got it running for 48, and many times it wasn't even up for 24. Ever since I swapped to the GT720 it had been running fine, so I was trying to look into whether it was the card, cooling, or power. But just now the same thing happened with the GT720 after being fine for a week. I was going between Chrome windows when suddenly the window I swapped to was completely black, then my entire display became garbled and the video signal was lost. Trying to reset the video driver with Win+Ctrl+Shift+B did nothing, and mashing Ctrl+Alt+Del did nothing. The only difference was that the second I pressed the power button it hard shut down instead of making me hold it down for 10 seconds.

I heard someone mention that I might have to disable SERR messages or VT-x in my BIOS? Has anyone ever heard of having to do that to fix something like this? I use VirtualBox on my system as well; would disabling VT-x impact that?

The event log didn't show much, just made a mention of "The computer has rebooted from a bugcheck" and generated a minidump.

The minidump seems to imply that it's somehow STILL the Nvidia driver.... despite having completely wiped my drivers and re-installed a much older driver that was the last one that supported the GT720:

 

Quote

Microsoft (R) Windows Debugger Version 10.0.22621.755 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Windows\Minidump\031023-11156-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*
Executable search path is: 
Windows 10 Kernel Version 19041 MP (16 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Machine Name:
Kernel base = 0xfffff805`50200000 PsLoadedModuleList = 0xfffff805`50e2a210
Debug session time: Fri Mar 10 13:40:58.599 2023 (UTC - 6:00)
System Uptime: 4 days 20:46:45.393
Loading Kernel Symbols
...............................................................
................................................................
................................................................
....
Loading User Symbols
Loading unloaded module list
.................................
For analysis of this file, run !analyze -v
15: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffffc206961ea010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80571da5838, The pointer into responsible device driver module (e.g. owner tag).
Arg3: 0000000000000000, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 000000000000000d, Optional internal context dependent data.

Debugging Details:
------------------

Unable to load image nvlddmkm.sys, Win32 error 0n2
*** WARNING: Unable to verify timestamp for nvlddmkm.sys

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 2890

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 5844

    Key  : Analysis.Init.CPU.mSec
    Value: 3858

    Key  : Analysis.Init.Elapsed.mSec
    Value: 75133

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 103


FILE_IN_CAB:  031023-11156-01.dmp

DUMP_FILE_ATTRIBUTES: 0x8
  Kernel Generated Triage Dump

BUGCHECK_CODE:  116

BUGCHECK_P1: ffffc206961ea010

BUGCHECK_P2: fffff80571da5838

BUGCHECK_P3: 0

BUGCHECK_P4: d

VIDEO_TDR_CONTEXT: dt dxgkrnl!_TDR_RECOVERY_CONTEXT ffffc206961ea010
Symbol dxgkrnl!_TDR_RECOVERY_CONTEXT not found.

PROCESS_OBJECT: 000000000000000d

BLACKBOXBSD: 1 (!blackboxbsd)


BLACKBOXNTFS: 1 (!blackboxntfs)


BLACKBOXPNP: 1 (!blackboxpnp)


BLACKBOXWINLOGON: 1

CUSTOMER_CRASH_COUNT:  1

PROCESS_NAME:  System

STACK_TEXT:  
ffffae89`095aa808 fffff805`6769555e     : 00000000`00000116 ffffc206`961ea010 fffff805`71da5838 00000000`00000000 : nt!KeBugCheckEx
ffffae89`095aa810 fffff805`67694bc1     : fffff805`71da5838 ffffc206`961ea010 ffffae89`095aa919 00000000`00000000 : dxgkrnl!TdrBugcheckOnTimeout+0xfe
ffffae89`095aa850 fffff805`67abd483     : ffffc206`961ea010 00000000`00989680 ffffae89`095aaa30 00000000`019a8d58 : dxgkrnl!TdrIsRecoveryRequired+0x1b1
ffffae89`095aa880 fffff805`67b1b2fb     : ffffc206`8b491000 00000000`00000001 ffffc206`8b491000 00000000`00000000 : dxgmms2!VidSchiReportHwHang+0x62f
ffffae89`095aa980 fffff805`67ae8142     : ffffae89`095aaa01 00000000`019a8cd7 00000000`00989680 00000000`00000040 : dxgmms2!VidSchiCheckHwProgress+0x3318b
ffffae89`095aa9f0 fffff805`67a8a11a     : 00000000`00000000 ffffc206`8b491000 ffffae89`095aab19 ffffc206`8b491000 : dxgmms2!VidSchiWaitForSchedulerEvents+0x372
ffffae89`095aaac0 fffff805`67b0d405     : ffffc206`8f3f8000 ffffc206`8b491000 ffffc206`8f3f8010 ffffc206`8b541620 : dxgmms2!VidSchiScheduleCommandToRun+0x2ca
ffffae89`095aab80 fffff805`67b0d3ba     : ffffc206`8b491400 fffff805`67b0d2f0 ffffc206`8b491000 fffff805`4d64b100 : dxgmms2!VidSchiRun_PriorityTable+0x35
ffffae89`095aabd0 fffff805`50455485     : ffffc206`89929080 fffff805`00000001 ffffc206`8b491000 00078425`bd9bbfff : dxgmms2!VidSchiWorkerThread+0xca
ffffae89`095aac10 fffff805`50602cc8     : fffff805`4d64b180 ffffc206`89929080 fffff805`50455430 00000000`00000000 : nt!PspSystemThreadStartup+0x55
ffffae89`095aac60 00000000`00000000     : ffffae89`095ab000 ffffae89`095a5000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x28


SYMBOL_NAME:  nvlddmkm+dd5838

MODULE_NAME: nvlddmkm

IMAGE_NAME:  nvlddmkm.sys

STACK_COMMAND:  .cxr; .ecxr ; kb

FAILURE_BUCKET_ID:  0x116_IMAGE_nvlddmkm.sys

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {c89bfe8c-ed39-f658-ef27-f2898997fdbd}

Followup:     MachineOwner
---------
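For comparing several minidumps at a glance, a small script can pull the headline fields out of saved `!analyze -v` text. This is only a sketch under the assumption that the output has been copied into a string or text file in the `KEY:  value` layout shown above; `parse_analyze_fields` and `summarize` are hypothetical helper names, not part of WinDbg.

```python
import re

# Matches lines such as "BUGCHECK_CODE:  116" or "IMAGE_NAME:  nvlddmkm.sys".
# Only all-caps keys are matched, so "Arg1: ..." and "Key  : ..." are skipped.
FIELD_RE = re.compile(r"^([A-Z][A-Z0-9_]*):\s*(.*)$")

# Fields that usually identify the crash; this selection is my own choice.
HEADLINE_KEYS = ("BUGCHECK_CODE", "PROCESS_NAME", "IMAGE_NAME", "FAILURE_BUCKET_ID")


def parse_analyze_fields(text: str) -> dict:
    """Return the first value seen for each KEY: value line in the text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields.setdefault(match.group(1), match.group(2).strip())
    return fields


def summarize(text: str) -> str:
    """One-line summary of the headline fields, '?' where a field is absent."""
    fields = parse_analyze_fields(text)
    return " | ".join(f"{key}={fields.get(key, '?')}" for key in HEADLINE_KEYS)


# Excerpt copied from the dump output above.
SAMPLE = """\
VIDEO_TDR_FAILURE (116)
Arg4: 000000000000000d, Optional internal context dependent data.
BUGCHECK_CODE:  116
PROCESS_NAME:  System
MODULE_NAME: nvlddmkm
IMAGE_NAME:  nvlddmkm.sys
FAILURE_BUCKET_ID:  0x116_IMAGE_nvlddmkm.sys
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

If every dump reports the same bucket across two different cards and driver versions, that would suggest the driver is where the timeout surfaces, not necessarily where the fault starts.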

 


This might be a power supply issue, caused by ripple. Have you tried with another power supply?


Not yet, I am still working out what my options are for that due to the proprietary nature of the system. I don't want to risk a PSU that has a different pinout. Although I am starting to doubt it's a power supply issue, given that I went from a GPU that required a single 6-pin plug, to a GPU with a single 8-pin plug, to a significantly weaker GPU that is powered from just the PCIe slot and requires no additional plugs from the PSU at all. Unless the PSU is so shot that it randomly can't deliver stable power to even the basic parts of the system, when for the last two years it was working perfectly fine with a 6-pin GPU.


As in an entirely different model of CPU or just another of the same one? I can get another of the same CPU I have for around $20 used on eBay.

