Thursday, July 5, 2012

Another One Bites the Dust

Today I decided I was going to hop onto a game for a bit, or at least so I thought. I stopped my current Xmonad instance and ran my startx-sli script (thank you Nvidia, for making TwinView still hate SLI in Linux nearly a decade after you introduced the feature), only to have my X server hard crash. It still responded to some basic cursor movements, but no new windows would draw. Being the tinkerer that I am, I decided to see what updates I'd run recently, assuming that I had of course broken it myself.

Well, I did have a few updates in the pipeline that needed a few recompilations, one of them being an updated udev (xorg needed to be rebuilt). However, as I soon discovered, rebuilding did not fix things. I could have sworn I had been running in SLI on the version 302.xx drivers (the module loaded in the now-defunct config). I scoured dmesg and Xorg.0.log and found only some cryptic NVRM error codes. Thinking this was strange, I googled around a bit and found nothing relevant to the problem.

So I decided to run the bug report utility Nvidia ships with its drivers. That's when I saw this dreadful message, which it had grepped out of /var/log/messages and gzipped to be sent to Nvidia:

Jun 27 23:29:09 eggsbenedict kernel: [291206.148151] NVRM: GPU at 0000:02:00.0 has fallen off the bus.
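If you want to check your own logs for the same message, here is a minimal sketch in Python. The regular expression is modeled only on the single line quoted above, so treat it as an assumption: other driver versions may word the message differently. The function name `find_fallen_gpus` and the second sample line are made up for illustration.

```python
import re

# Pattern modeled on the one log line quoted above (assumption: other
# driver versions may phrase this message differently).
FALLEN_OFF_BUS = re.compile(
    r"NVRM: GPU at ([0-9a-fA-F:.]+) has fallen off the bus"
)


def find_fallen_gpus(lines):
    """Return the PCI addresses of GPUs reported as fallen off the bus."""
    addresses = []
    for line in lines:
        match = FALLEN_OFF_BUS.search(line)
        if match:
            addresses.append(match.group(1))
    return addresses


sample = [
    # The line from my own log, quoted above.
    "Jun 27 23:29:09 eggsbenedict kernel: [291206.148151] "
    "NVRM: GPU at 0000:02:00.0 has fallen off the bus.",
    # A made-up unrelated line, only here to show that it is skipped.
    "Jun 27 23:29:10 eggsbenedict kernel: unrelated message",
]

print(find_fallen_gpus(sample))
```

To scan a real log, pass an open file instead of the sample list, for example `find_fallen_gpus(open("/var/log/messages"))`, which usually needs root to read.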

And that's when I realized I hadn't actually used the other card in well over a week (I had been doing work with both monitors instead). I'm pretty sure that Linux watched my card die and didn't even crash the X server; it just kept going without me noticing. Hoping against all odds that my suspicions were wrong, I rebooted into Windows 7, only to get stuck after the splash screen, at which point it became apparent that the card was finally dead.

Dead 8800 GTX

Rig with a newly vacant PCIe slot