The World's Weirdest Mac Issue


Depending on how things go, the title for this entry might more appropriately be “the self-healing Mac”.  Only time will tell!  So what is this all about?  Well, recently my trusty companion of 2 years, the “mid 2012 MacBook Pro Retina 15”, decided to have a near-death experience (as near as I can tell).

It all started with a single kernel panic while doing some boring daily tasks in Chrome.  Within a 24-hour period the problem accelerated to a continuous kernel panic loop.  My first thought was “recent update”, but searching high and low for clues didn’t yield much.  Basic diagnostics (read: the highest of high level) seemed to imply the hardware was OK, but it really felt like a hardware issue.  Or if not hardware, possibly drivers.  But of course neither of those made much sense.  This was OS X running on a nearly new Mac, after all!  It’s like suggesting that your brand new Toyota Corolla would up and completely die 3 miles off the lot (heavy sarcasm here).

Searching around, I discovered that there were possibly some issues with Mavericks and the Retina that I had maybe been dodging.  It had also been 2 years of accumulating crap (dev tools, strange drivers, virtualization utilities, deep utilities, games), any of which could be suspect.  So I decided I would try a Time Machine rollback to before the first kernel panic, and if that failed, take a time machine back to 1995 and do the classic Windows “fix”: wipe and re-install (ugh).

The Time Machine restore took literally ages thanks to the bizarrely slow read rates of my backup NAS (detailed here), but eventually completed (400GB, 24 hours).  Unfortunately, the system wasn’t back for more than 10 minutes before the first kernel panic!  That meant that either the condition actually pre-existed the first known occurrence and had just been lurking, or the issue was in fact hardware.  I moved forward with the clean install.

First I deployed a new version of Mavericks.  Boot up holding Command-R, follow the linked guide, and you’re off to the races.  The reinstall was pretty smooth (erase disk, groan, quit back to the recovery menu, install new OS) and first boot just felt better.  Of course you know what they say about placebos!  After an hour of installing my usual suite of apps, upgrading to Mavericks and grabbing the latest updates, the dreaded kernel panic struck!  Things were looking grim.

With little to lose, I decided to try rolling back to Mountain Lion on the outside chance that the latest Mavericks update was causing issues.  One more reinstall, followed by an app-only install, and I was feeling good.  Until terror struck!  Yes, another kernel panic.  Incidentally, these kernel panics were all over the place (really suggesting RAM).

At this point I became bitter.  Suddenly the “it looks amazing and is all sealed and covered in fairy magic!” Apple approach didn’t seem so great.  Changing out a DIMM on a PC laptop is a cheap and very easy fix.  Hell, in these pages I’ve covered complete teardowns of PC laptops (down to motherboard replacements).  Compounding the issue was that I never opted for AppleCare (yes yes, I know that failing to spend more money on top of a premium $2500 laptop means I deserve what I get if said premium hardware somehow completely dies within 3 years).  Apple’s decision to solder the memory to the motherboard meant I’d be looking at an extremely expensive motherboard swap and a good-sized chunk of downtime (the latter being a really big issue for me).  Starting to feel truly grumpy, I decided to run a few tests.

First, memtest in the OS.  Lots of failures.  Instant failures too.  As a matter of fact, I’ve never seen such a horrific memtest result!  It was honestly a bit of a wonder the thing could even boot!  Thinking that maybe the software result was anomalous (memtest for OS X is a bit old at this point and in theory doesn’t support anything newer than 10.5.x), I decided to do the old faithful Apple Hardware Test.  If nothing else, that utility is always a cool walk down GUI memory lane!
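For the curious, here is roughly what a memory test boils down to: write known bit patterns into a block of memory, read them back, and count anything that comes back different.  The Python below is only a toy sketch of that idea.  The function name, the patterns and the buffer size are all made up for illustration, and it is not how memtest or AHT are actually implemented (a real tester works far closer to the hardware and covers far more of the RAM than a script running on top of the OS ever could).

```python
# Toy illustration only: this shows the write-pattern / read-back idea,
# not the internals of memtest or Apple Hardware Test.

# Hypothetical pattern set: all zeros, all ones, and two alternating-bit bytes.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def pattern_test(size_bytes, patterns=PATTERNS):
    """Fill a buffer with each pattern, read it back, and count mismatches."""
    buf = bytearray(size_bytes)
    errors = 0
    for pattern in patterns:
        # Write pass: set every byte in the buffer to the pattern.
        buf[:] = bytes([pattern]) * size_bytes
        # Read pass: count the bytes that did not come back as written.
        errors += size_bytes - buf.count(pattern)
    return errors


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # 16 MB, an arbitrary size chosen for this sketch
    print("mismatched bytes:", pattern_test(size))
```

On healthy memory the mismatch count should be zero.  A clean run of something like this proves very little though, since the OS decides which physical RAM the buffer lands on.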

Well, depressingly enough, AHT wouldn’t run.  I didn’t think to snap a pic at that point (I wasn’t planning on this entry), but this gives you an idea (stolen from the Apple support forums):

Image
Disclaimer: Not my pic. Error numbers have been changed to protect the guilty!

The actual error code I was faced with was -6002D.  Yep.  That’s generally memory.  So it looked like a total bust.  My Apple honeymoon appeared to be officially over.  I decided to do one final wipe in preparation for the now seemingly inevitable hospital visit, and this time lay down only the bare minimum footprint needed to keep doing a bit of work in the meantime.  One positive outcome of all of this wiping was that the kernel panics had gone from a continuous loop to fairly rare.

After turning in for the night, and struggling through a restless sleep fraught with nightmares of Genius Bar lines stretching to the horizon, I crept downstairs to discover that I didn’t see this:

Again, not mine… But you get the idea.  Seen one, seen em all!

The Mac had made it through the night!  Now this was interesting.  Could it possibly be that something in this lineup had become toxic?  With the cloud and “evergreen” software, it was possible.  After all, since our software library is now real time and online, it can be hard to avoid the newest version, right?

  • Chrome
  • Lync
  • Skype
  • Office 2011
  • Camtasia
  • OmniGraffle
  • iMovie
  • GarageBand
  • Unarchiver
  • Evernote
  • Dropbox

That is literally the “slim” list that was in place every time the problem would happen post system wipe.  The new list that seemed stable (against all odds) was solely Office, Lync and Skype.  It was time to do some testing!  Well, the results were interesting to say the least!  I decided to beat the Mac up a bit.  First, Unigine Heaven 4 in Extreme mode left running overnight (I was always curious how it would do anyhow):

Screen Shot 2014-06-19 at 7.28.35 PM

A great score it’s not, but banging through maxed out Unigine and left running overnight without a hitch kind of implies that the GPU (and drivers) are not an issue.  Well we did suspect that after all, right?  How about taking a closer look at memory?

Screen Shot 2014-06-19 at 7.29.10 PM

Hmmm… OK so far so good…

Screen Shot 2014-06-19 at 7.29.24 PM

Well, it’s not ECC anyhow, and who knows what this code is actually doing, right?  For all we know this is just a register dump.  Time for the big guns.  How about some Prime95 max memory torture testing?  This is another thing I’ve always wanted to subject the Mac to.  No way it survives…

Screen Shot 2014-06-19 at 7.28.19 PM

 

Uh, OK.  It just got serious.  How the hell could this heap, which was unable to even start AHT one clean install ago, somehow now be banging through hours of Prime95 torture?  There was only one thing left to do (well OK, two).  First up, memtest.  Keeping in mind that it might be incompatible of course!

WTF?!

What… the…. heck!?  This time the test ran like a charm; exactly as expected.  So not only does it appear that memtest does in fact work fine on Mountain Lion, but the MacBook passed.  With a cautious glimmer of hope starting to form, and more than a bit of fear, it was time for…. AHT!

2014-06-19 20.27.58

You have got to be kidding me!  This time not only did AHT run, but it passed the damn test!  At this point I started checking for hidden cameras, aliens and paranormal activity.  It just didn’t make any sense!

So where does this leave us?  Well, at this point I have added everything back in except CHROME and have successfully repeated all of these tests!  Is it somehow possible that CHROME caused this?  But how?  Chrome certainly can’t survive reboots.  Or can it?  With modern laptops and the way they manage power, it’s hard to know if the machine is ever really off.  Is it possible that some software anomaly was leaving the Mac in a state that survived reboots and prevented it from being able to enter AHT?  It really does seem impossible and it doesn’t make sense, yet none of this makes sense.  How could the Mac have gone from being so unstable it couldn’t even enter AHT, to passing it over and over with flying colors and surviving brutal overnight torture tests, with only a software change?  I’ve been doing this a long time (hint… Atari 400, Timex Sinclair 1000, etc.) and have never seen something like this.  Is it a self-healing Mac?  Is it software so insidious it can survive reboots?  I almost don’t want to know.  One thing is for sure though, and that’s that I will be keeping a close eye on this and providing any updates on these pages.  And if I should suddenly vanish?  Tell them to burn the MacBook!