Lots of noise lately about OpenStack, and while I still feel the broader value proposition is limited (a subject for a different entry), as a service-level architect on the provider side of the world it certainly makes sense to get intimate with the technology at this stage.  As Android is to mobile handset manufacturers, so OpenStack has become to “we need to build a cloud!” service providers.  With HP, Dell, IBM and Rackspace on board (to name a few), it has become the collective answer for providers looking to compete against proprietary technology based offerings from AWS, Microsoft, VMware and Google.

So with this in mind, where do we begin?  Well first, OpenStack of course is a “Cloud Computing OS”.  What the heck does that mean though?  Well, in simple terms, OpenStack provides a centralized management system for advertising infrastructure services and resources to end users.   These types of software solutions came about as an answer to the inherent multi-tenancy problem that IT providers run into as they try to rationalize the ever expanding supply-and-demand balancing act for resources into a service delivery model.  Consider traditional IT approaches and it becomes clear why a new solution was needed.

Way back when in the early 2000s, all servers were physical computers that ran one instance of an operating system.  Whether it be Linux, or Windows, or UNIX, the operating system provided developers and end users access to the processing and storage resources of the server.  As an IT admin, if your developers needed to deploy a new application to support a business requirement, you generally found out somewhere in their development cycle that they would be needing equipment and, most likely too late, you would find out specifically how much equipment they would need.  The equipment would then have to be procured through the normal (and extremely long) IT procurement process and eventually it would arrive and get deployed.  By this point the developers would likely be dissatisfied that getting their equipment took “too long”.

Fast forward to the mid-2000s and along comes VMware and the breakthrough of production-ready x86 server virtualization.  At first this seemed miraculous and, although it took some time for most organizations to actually trust the technology for production (even today we are not 100% virtualized in the most mature markets), it ultimately did transform IT on some levels.  After all, in the era of physical servers there was lots of wasted capacity.  Moore’s law has been cranking along at full steam for a good long while now, but many business problems are still fairly simple.  So each new application would require a new server from an implementation standpoint, but in reality that server sat largely idle.   This capacity management issue is where virtualization has had the most tangible benefit to date.  Where complex resource management schemes put the burden of complexity on the developer (who often was not ready for it), virtualization presented exactly what they were expecting: a single server instance they could wholly consume.  This ability to fully utilize big hardware by instantiating multiple copies of standard operating systems (Windows and Linux) really just scratches the surface, but was a big part of the initial value proposition.

Over time the “OS as code” approach of virtualization led to more advanced management tools and the potential for more agile processes.  VMware in particular delivered next generation IT in the form of vCenter and near “magical” tools like vMotion.  The logical presence of an OS was starting to decouple from its physical location, giving high-availability-dependent designs a new lease on life.  Unfortunately, the broader potential inherent in this new model was never fully realized.  The process aspects continued (and still continue) to lag.  Lots of old physical process was dragged forward into the virtual world.  Procuring a virtual server had similar approval gates as a physical one.  In addition, most IT shops didn’t leverage virtualization as a starting point in turning themselves into a “broker of services”, charging for the resources they provide (despite the usage metrics needed to do this all being centralized in the hypervisor management tools).  Lastly, the capacity management challenge still loomed; it had just been deferred.  The “stair step” model of IT capacity boom/bust cycles remained, with shortfalls in host capacity killing business agility and excess capacity on the shelf leading to deferred ROI.

In the past 6 years or so the situation for IT has gone critical, with the gap between infrastructure teams and the business (represented by the developer community) growing wider.  Analysts and developers have moved closer and closer to the core business in order to provide the agility required to compete in modern global markets, but core IT generally hasn’t followed along.  Taken in this context, the rapid grass-roots adoption of Amazon and its S3 and EC2 offerings becomes no real surprise.  Much like RIM in the ’90s provided a solution to a problem IT didn’t realize it had (mobile worker productivity), Amazon found a ready, willing and indeed eager audience in developers frustrated by core IT’s inability to provide resources in time.  These developers are under ever increasing pressure from their business partners to deliver solutions more and more quickly.  Suddenly, with a credit card and a few clicks, they could have Windows and Linux servers online in 15 minutes.  In addition, the entire service could be operated through an API (something core IT often can’t even spell).

Fast forward to present day and we find organizations in transformation.  “Cloud” isn’t an “if”; it has become a “when” and a “how”.  Every CIO worth the title has, at the very least, an articulated strategy for integrating cloud into their playbook (if for no other reason than pressure from the business).  Naysayers and cynics persist (particularly within core IT), but my view is that we are seeing the same creative destruction that we saw as centralized mainframe computing models shifted to the microcomputer distributed systems model.  The naysayers will shift to a legacy and maintenance position and the sphere of control will shift.  This is happening now and each day brings new developments.   Developers, for their part, are also starting to shift focus to cloud design patterns.  What this means is that developing against “infrastructure as code” is following the approach we see at agile startups (think Netflix and Instagram), even within the enterprise.  Rather than assuming highly available infrastructure, you assume that your infrastructure will fail and you build your applications to not care.

So where does this leave core IT?  Well the buzzwords today are “private cloud”, “hybrid cloud” and a host of others.  What it really means though is that IT must evolve from being a builder of infrastructure to a broker of services.  Classic virtualization management tools like vCenter and System Center Virtual Machine Manager remain focused on abstracting infrastructure (compute, network and storage) rather than on service creation.  At the other end of the spectrum, complex management and orchestration platforms like those provided by BMC, CA, IBM or Cisco tend to provide a “lego set” approach where a wide range of legacy tools (configuration management, workflow orchestration, automation, etc) are rationalized together into something looking like a service controller, but never quite getting there.  In addition, they are difficult to implement and operate, often requiring lots of customization (and consulting spend).

For its part, VMware took the curiosity that was Lab Manager (an attempt to build a multi-tenancy layer on top of the largely monolithic world-view of vCenter) and evolved it into vCloud Director.  vCD was, in many ways, the first “Cloud OS” for builders.  Of course vCD focuses exclusively on virtualization in general, and VMware-flavored virtualization in particular.  It also remains somewhat incomplete if you consider building out a true infrastructure-as-a-service offering.  To remediate this gap, VMware is shifting focus as it evolves the entire vCloud Suite product line and is building a solid end-to-end story.  Microsoft is doing similar work with System Center by way of the Azure Pack and recently released IaaS extensions.

So what is it that all of these “Cloud OS” solutions (OpenStack, vCloud Suite, System Center IaaS, etc) are attempting to deliver?  At a high level:

  • Provide a Service Catalog facility – essentially the capability of building “recipes” (to steal a Chef term) or “blueprints” (to steal VMware’s term) for advertising services.  A “service” might be “Web Server” and come in “Linux/Apache” and “Windows/IIS” flavors.  A “service” might also be “N-Tier App” and provide options on the web, app and data tiers.  Behind these service catalog entries would be provisioning logic to deploy and configure the required components once the service has been requested by an authorized user, which brings us to…
  • Granular Role Based Access Control – RBAC is at the heart of all service delivery models.  Most legacy management tools (System Center, vCenter) provide administrative access controls, but weren’t really designed to facilitate a self-service experience whereby there is an expectation that end-users will be able to directly request resources from the system.  Because these platforms are mainly focused on infrastructure abstraction on the back-end, and were designed to be the foundation of a building rather than the entire 40 floors, this makes sense.  RBAC is core to what the CloudOS brings.
  • Configuration Management and Service Control – this is a tricky one.  In some cases this is directly core to the mission of System Center and vCenter, but in many ways, for cloud scale, those tools don’t quite hit the mark.  Capacity planning was a very rough exercise in traditional models with lots of “spreadsheet based” modeling.  In a true dynamic infrastructure model, a centralized intelligence should monitor the total host footprint, proactively watchdog capacity consumption rates, and provide alerts on impending shortfalls.  In addition, configuration of all base infrastructure (physical and virtual) should be centrally tracked (vCenter and System Center do this pretty well).
  • Usage Tracking and Monetization – again an area where traditional management tools are too coarse.  VMware offers (the now service-provider-only) vCenter Chargeback Manager and ITBM as great tools to help view virtual infrastructure through a service provider lens, but to provide real utility the Service Control layer must do granular tracking of resource consumption (metering of usage).
  • Multi-Tenancy – this is the big one.  For all of its power, vCenter still provides a “flat earth” view.  Yes virtual machines can be grouped into “vApps”, and at the network and storage layers administrators can implement isolation, but ultimately a vCenter assumes that it is wholly owned by a single organization.  There is no underlying facility to allow two different “virtual datacenters” (a logical collection of vCenter resources) to be isolated as if they belonged to entirely two different companies.  In short, serving the infrastructure needs of Coke and Pepsi with one vCenter isn’t so feasible without a ton of abstraction.  With vCloud Director (or any CloudOS) that higher level of abstraction, and the logical constructs required to enable it, are “in the box”.
  • Resource Tiering and Dynamic Resource Management – an extension of the base infrastructure abstraction, a CloudOS brings another layer of orchestration providing increased agility.  The need here is easiest to see from the network and storage perspective.  Consider networking.  In the multi-tenancy model, it is quite feasible to have 100 customers that all want “192.168.1.0/24”.  Delivering this to them should not only be possible, but it should be automated and easy (see the sketch just after this list).  In addition, the typical low-level network virtualization constructs like VLANs and VRFs are extremely limited in terms of scale and difficult to orchestrate (requiring scripted control of hardware devices).  As a result, a modern CloudOS will leverage the more advanced virtual networking options inherent in modern hypervisors as a strong foundation. An example of this is VXLAN in vCenter/vCloud Director which allows for the dynamic provisioning of “virtual wires”, isolated layer 2 broadcast domains, which can be spanned across hosts over layer 3.  Overlay networks unlocking previously impossible flexibility from layer 2 and layer 3 are a key part of the CloudOS value proposition.  On the storage side, storage tiers become critical in terms of differentiating class of service.  Whereas placement of virtual machines on volumes is proactive in a legacy model and determined by an administrator, in a cloud model it should be reactive and determined by end-user request and entitlement.
  • Consumption Portal – ultimately the goal of a service broker IT model is to deliver a self-service consumption model to authorized end users.  The synergy of all of the capabilities detailed above is realized in the customer-facing portal through which an end user can log in and, based on their approval level (or willingness to pay on demand), start to provision and manage resources (servers, storage, etc).
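To make the overlapping-address point concrete, here is a minimal sketch of what it looks like in an OpenStack/Neutron-flavored CloudOS (the tenant IDs and network names are hypothetical placeholders, and the CLI syntax reflects the Havana/Icehouse-era clients; vCD achieves the same thing with org networks backed by VXLAN virtual wires):

# Hypothetical tenant IDs -- in a real deployment these come from Keystone
COKE_TENANT=1111aaaa-placeholder
PEPSI_TENANT=2222bbbb-placeholder

# Each tenant gets its own isolated layer 2 segment (e.g. a VXLAN/GRE backed overlay)
neutron net-create --tenant-id $COKE_TENANT coke-web-net
neutron net-create --tenant-id $PEPSI_TENANT pepsi-web-net

# Both tenants can claim the exact same CIDR because the segments never touch each other
neutron subnet-create --tenant-id $COKE_TENANT coke-web-net 192.168.1.0/24
neutron subnet-create --tenant-id $PEPSI_TENANT pepsi-web-net 192.168.1.0/24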

If the above set provides a basic overview of the capabilities a CloudOS should provide, how many boxes does OpenStack check?  Not surprisingly, it (roughly) checks them all.  Unfortunately, in true open standards fashion, it puts multiple checks in each box.  The flexibility and options are great, but it definitely steepens the learning curve, adds to implementation complexity, and fragments the ecosystem.  Of course as with Linux distros there are a ton of OpenStack implementations.  For the purposes of this experiment I chose to implement Mirantis. So let’s answer the above question from the perspective of the Mirantis solution:

  • Service Catalog: check.  Glance is the OpenStack component which provides the capability to discover and catalog images (virtual machine templates and their metadata).  Glance provides the foundation for the OpenStack catalog.  As with all things OpenStack there are multiple options for the image format, the database which holds the catalog and the store which acts as the image repository.  It’s worth stressing that the base unit for the catalog is literally a VM image (similar to the vCD catalog).  If you are used to EC2, and the rich capabilities of CloudFormation or OpsWorks, the basic kit will come up short.  The answer with vCD is to front-end it with vCAC blueprints that offer far more complexity and actual workflow.  With OpenStack, I suspect the answer is the same – front-end it with a smarter tool.  We’ll see how this goes later in the entry.  Note that the actual service catalog (vs the management of the entries in it) is provided by Keystone (detailed below under RBAC).
  • Consumption Portal: check.  The Horizon dashboard serves as the consumption portal.  For folks familiar with vCD, think of the vCD org admin view.  As indicated above, it doesn’t provide the type of experience that you would get from a full consumption portal like vCloud Automation Center.  This is a gap that can potentially be filled with third party integration (more complexity).
  • Resource Tiering and Management: check.  Neutron (formerly Quantum) handles networking, while Cinder (block store) and Swift (object store) provide storage management (also worth mentioning is that the compute side is controlled by the Nova virtual machine management facility – more on this below).  The network side is interesting.  At its most basic, Neutron provides management and allocation of IP ranges to tenants, but without a virtual switch plugin it does not provide any controls for layer 2 (it is basically smart managed DHCP and iptables).  In vCenter integration scenarios the virtual switch layer is effectively only compatible with NSX, and the plugin enabling that integration can only be acquired directly from VMware.  Without it, customers can still create and allocate networks and IP ranges, but traffic only flows if VLAN isolation is pre-configured accordingly in the vDS (yikes – these are the legacy “Nova networks” and are directly defined in the environment config).  On the storage side, Cinder provides the block storage facility for Nova and provides storage tiering.  Swift is the object store.  For those who do not know, an object store is a storage system which is designed to overlay a highly scalable cluster of underlying storage nodes.  The object store is accessed via a REST API and interaction with it is at the object level.  So an API call is made, an “object” is passed (a stream of data), and this object is then stored, distributed across the entire cluster.  These storage facilities are designed for massive scale-out and high durability (think S3).
  • Role Based Access Control: check.  Keystone is the identity and access management facility in OpenStack and also holds the actual service catalog.  Keystone provides very good granularity for both administrative and end-user roles, as well as SAML federation (all goodness).
  • Configuration Management and Service Control: check.  Nova, once again, is the facility which provides governance of virtual machine deployment and configuration as well as host configuration and capacity management.  RabbitMQ and Puppet are under the covers, and the deployment workflow is front-ended by quite a good GUI called “Fuel” which we will be seeing in just a bit.  A rough command-line sketch of how these pieces hang together follows below.
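To tie the components together, here is a minimal sketch of a request flowing through the stack using the project CLIs of that era.  Treat it as illustrative only: the endpoint, credentials, image file, flavor and volume ID are all placeholders, and a Horizon user would be clicking through the same operations rather than typing them.

# Keystone: authenticate (normally you would source an openrc file from your cloud admin)
export OS_AUTH_URL=http://controller:5000/v2.0   # placeholder endpoint
export OS_TENANT_NAME=demo
export OS_USERNAME=demo
export OS_PASSWORD=secret

# Glance: publish an image into the catalog
glance image-create --name ubuntu-12.04 --disk-format qcow2 \
  --container-format bare --is-public True --file ubuntu-12.04.qcow2

# Nova: boot an instance from that image (flavor name is a placeholder)
nova boot --image ubuntu-12.04 --flavor m1.small web01

# Cinder: carve out a 10GB block volume and attach it to the instance
cinder create --display-name web01-data 10
nova volume-attach web01 VOLUME_ID /dev/vdb   # substitute the ID printed by cinder create

# Swift: push an object into the object store over its REST-style interface
swift upload backups etc-backup.tar.gz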

OK so that pretty much covers the key components of OpenStack!  This is a good place to wrap this entry up.  Next entry I will walk through actually setting it up.  Before we go, a couple of pictures.  First, a snapshot of OpenStack architecture component integration:

openstack-conceptual-arch-folsom

And one more (this one an original).  We’ve talked a lot this entry about all sorts of components that make up this nebulous thing we call cloud.  I’m a big believer in showing things in a picture, so technology implementation details aside, what does a “cloud service” really look like conceptually?  Meaning, what are the capabilities that should underpin an efficient Infrastructure as a Service implementation?  Here is my view:

cloudarch

Most of what we see in the diagram has been detailed above, so the available frameworks on which one can build a service are getting closer to providing the right foundation for an ideal service.  None are quite there yet, however, which is why the providers who can invest in proprietary glue code continue to enjoy a big advantage.


Well it’s been literally months since I last weighed my travel kit, but with my travel ramping up, up and away lately and the kit having undergone a pretty extensive transformation, I decided it was time for an update.  Last round the weight was an amazing 8.75lbs, but still significantly up from the incredible 6.3lbs from “back in the day” (my 2007 kit).  One change is that this round I also opted to weigh the actual laptop case.  In the past I was using primarily ultralight messenger bags, but these days I’m using a backpack.  On to the gory details!

First the numbers:

  1. Galaxy Note 10 LTE 2014 with Samsung cover – 1lb 8oz (replacing the iPad Air)
  2. Nokia 1520 LTE with Nokia cover – 12oz (replacing the Nexus 7 which actually died)
  3. Razer Blade 14 2014 – 4lb 5oz (replacing the Retina 15)
  4. Razer Blade power supply – 1lb 2oz (the Achilles heel of even high end PC laptops)
  5. iPhone 5s with Mophie Juice Pack – 8oz (the only survivor!)
  6. 2TB portable Toshiba USB 3 drive – 8oz
  7. Accessories stack – 4lb:
    1. retractable ethernet, HDMI, USB (x5), audio
    2. HDMI to VGA
    3. USB 3 hub
    4. USB multi-cable/charger (VMware swag)
    5. mini power strip with USB charging
    6. high AMP USB charger
    7. multi port USB charger
    8. Verizon LTE hotspot
    9. AT&T LTE hotspot
    10. ZyXEL travel router
    11. Plantronics USB collapsible headset
    12. Razer Orochi Mouse
    13. Logitech Presentation remote
    14. Lenmar Powerwave 6600 battery/USB charger
    15. Audio Technica headset
    16. 256GB USB thumb drive
    17. 64GB USB thumb drive
  8. VMware solar panel equipped backpack – 2lb 3oz

Total weight: 14lb 14oz

WOW!  Now that is some serious weight, right?  Basically the current kit is as heavy as the past two kits combined! Certainly not a good development with travel becoming more intense and more difficult, but there are a few clarifying points here.  First, in the past I hadn’t been weighing the actual bag, so that adds a pound or two onto the old totals (call them 10 and 7.5).  That’s still a sizable weight increase, but there are lots of extra accessories these days.  The more time you spend on the road, the more you want to really be ready for a wide range of contingencies.  So the multiple hotspots and audio devices, the portable drives and the battery backups are all new.

So netting this out, it’s a lot more weight, but a ton more capability.  The more shocking change, perhaps, is the turnover in gear composition.  Last entry I pointed out that with cloud services and cloud-based data, mature file formats, more multi-platform apps, and mobile devices driving commoditization, moving between platforms is pretty low friction, so a move to OSX isn’t as jarring as it once was.  Well, this cuts both ways and a move back is just as easy.  I documented recently the strange issues I had with the Mac which seemed to sort themselves out, but I do see an occasional odd system halt.  With the Mac coming up on two years old, and being something less than fully stable, I decided it was time to relegate it to home duty and get a new kit for the road.  The mid-life refresh (Haswell/GT 750M) isn’t super exciting, but the new Razer really is!  With a 3200×1800 screen and a GTX 870M, paired with the expected quad Haswell 2.2/3.2, 8GB RAM and a fast 256GB SSD, it packs a solid punch for work or gaming.  The screen quality is phenomenal and is touch based, the build quality is excellent (near Mac level), and the entire thing is almost a slightly smaller, tiny bit lighter, black version of the Macbook.  Except of course for the Windows part.  Which admittedly, is a mixed bag.  I can probably say that I have something of a love/hate with Windows 8.1.  I’ll save that for another entry though.

On the tablet front my Nexus 7 just up and died (not uncommon) so I decided to replace it with Windows Phone in order to have exposure and access to all three platforms (useful for my job).  Because I really do like handwritten notes (one thing I missed about both my old Windows tablets and my recent Galaxy Note 2 phone) I decided to sell the iPad Air in favor of the Galaxy Note 10 2014.  The new Note 10 is also a screen triumph (2560×1600), and has plenty of horsepower, but as always I have a similar love/hate with Android (even KitKat) as I do with Windows 8.1.  Google has joined Microsoft as having a lot to learn from Apple about consistent and fluid UX/UI design.

So while Redmond had no representation last round, this time it has come roaring back with two significant devices in the mix!  Maybe next time we’ll have a Chromebook!  Honestly though, I have to say that we still have a long way to go in terms of ecosystem integration maturity.  I feel there isn’t nearly enough “payoff” for fully committing to a single vendor story (even Apple).  There is of course some settings and data sync tied to a universal ID from all three (Google/Apple/MSFT), and there is the consistent tablet/phone app story if you opt for mobile device vendor redundancy, but I’d like to see a lot more.  Maybe next generation of service/OS, things will improve.

That’s it for now, but here are some parting shots of the big weigh in!:

2014-08-17 23.07.49 2014-08-17 23.12.09


I recently decided to upgrade the ReadyNAS Ultra to make room for some new storage requirements.  The ReadyNAS remains a surprisingly powerful and flexible device so things went well overall, but there was some weirdness that is worth documenting in case others run into it as well and are wondering if it is normal or will cause issues.  To review, my current ReadyNAS Ultra is configured as follows:

 

  • Ultra 6
  • RAIDiator-x86 4.2.26
  • All bays full 6 x 2TB (mix of WD and Seagate “green” series drives)
  • X-RAID2, single redundancy mode

For anyone not aware, X-RAID is a clever protection scheme which brings an added layer of flexibility while still providing standard RAID protection.  As a quick primer, the benefits are:

  • allows a mixture of disk sizes
  • allows for dynamic expansion of an array
  • provides single disk redundancy (RAID 5 analog) or dual disk redundancy (RAID 6 analog) while maintaining the above benefits

Some caveats to X-RAID are:

  • the volume can only grow by 8TB from its original size (for all of these caveats it is usable space being measured, not raw… so measure post-protection volume size)
  • volume cannot be larger than 16TB without initiating a factory reset.  So for example, if you started with a 10TB volume, even though technically you could go to 18TB without violating the “no larger than +8TB from inception” rule, you would be stopped because the final volume would be larger than 16TB.  Factory reset is data destructive so, while not a show stopper, this restriction is definitely one to watch as it can turn a simple expansion project into a very complex one requiring full backup/restore and a companion device that can absorb a potentially massive amount of data
  • drives are grouped into layers by size.  What this means is that if you add 4TB drives in as replacements for 2TB drives, you create a 4TB disk layer and a 2TB disk layer that are, transparently to you, contributing to a single virtual volume.  The minimum required disk count per spindle size is 2 in order to retain protection.  Disks are replaced one at a time.  So in the case of 6 2TB drives, 1 is replaced by a 4TB.  Once sync’d, a second has to be replaced before the array is protected again.  Once 2 4TB drives are part of the volume, protection will be in sync again.  At this point the space sacrificed to volume protection will jump from a single 2TB spindle to one of the 4TB spindles (space inefficient, but at least it allows mixing).  In dual disk redundancy modes, the spindle counts are doubled.
  • drive sizes can only go up, not down.  So if you jump from 2TB disks to a mix of 2TB and 4TB, you can no longer add a 3TB disk

With all of the above in mind, I decided to move forward with the 2TB to 4TB scenario, replacing 2 of my Seagate disks with new HGST (Hitachi) efficiency series 5400 RPM 4TB disks.  Following the process as prescribed by Netgear went well.  I pulled one of the 2TB disks, swapped the 4TB into the carrier (this took longer than the required 10 seconds you need to wait before swapping back in) and then installed the 4TB disk into the array.  The Netgear immediately flipped to “unprotected” and “disk fault” upon removal of the 2TB disk and switched over to “resyncing array” about 5 minutes following installation of the 4TB disk.  The first resync took 26 hours.  This is on a 9.25TB usable array which was about 33% full.  After resync, I did a reboot just for good measure.

After reboot I repeated the procedure with the second disk.  This time the process took about 18 hours.  So things improved which is good!  Upon completion of this resync, the array status flipped back to “protected”, but each of the 4TB disks was only utilized at the 2TB level (1875GB usable).  This is because the 4TB disks were added into the 2TB layer as 2TB disks.   At this point, a reboot is required in order to get the array to actually resize.  Following this reboot, the status of the ReadyNAS switched to “array expansion” and the GUI started updating progress against the eventual target size.  This is where things got weird:

Screenshot 2014-08-06 16.27.40

As you can see, the GUI was reporting that the new size would be 10TB usable – a mere 750GB up from the old array size of 9.25TB.  Some quick math shows that this is either incorrect, or something is wrong:

ORIGINAL ARRAY:

  • 6 x 2TB = 12TB RAW
  • 6 x 1875GB usable = 11,250GB usable
  • 1 disk sacrificed to protection = 9375GB usable
  • 100GB for snapshot storage = 9275GB usable as expected

Now let’s consider the new array:

  • 4 x 2TB = 8TB RAW
  • 2 x 4TB = 8TB RAW
  • 4 x 1875GB usable = 7,500GB usable
  • 2 x 3750 usable = 7,500 usable
  • Total RAW = 16TB, Total usable = 15TB
  • 1 4TB disk sacrificed to protection = 11,250GB
  • 100GB for snapshot storage = 11,150GB usable

This is pretty far off from the flat 10TB being reported.  Binary/decimal translation aside (10TB vs 10TiB), we’re looking at over 1TB “missing”.  So what gives?  Well, before panicking, I decided to have a look at the console.  Check out what a quick df in Linux reported:

Screenshot 2014-08-06 16.30.08

 

Ah ha!  11,645,691,704 1K blocks so, in other words, 11.6TB!  Much better.  The good news is that as I copy about 5TB up to the array, df is, as expected, reporting spot on accurate usage whereas the GUI is staying very fuzzy and very wrong.  The conclusion?  Something is up with the GUI post expansion (and post reboot as I rebooted twice to attempt to remediate).
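For anyone who wants to repeat the sanity check, here is a rough one-liner for converting the 1K block count df reports into more familiar units (the /c mount point is an assumption for a RAIDiator-based ReadyNAS data volume; substitute whatever mount point df shows on your unit):

# Convert the data volume's 1K (1024 byte) block count into GiB and TiB
df -k /c | awk 'NR==2 {printf "%.0f GiB (%.2f TiB)\n", $2/1024/1024, $2/1024/1024/1024}'
# 11,645,691,704 blocks works out to roughly 11,106 GiB, which lines up nicely
# with the ~11,150GB hand math above rather than the GUI's flat 10TB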

So some final notes:

  • be mindful when expanding of the 8TB and 16TB limits
  • note that the minimum spindle count per disk size needed to maintain protection is 2, and that you will sacrifice one full spindle of the new, larger size to meet the new protection requirement
  • reboot as many times as you want during resync; it won’t cause any issue
  • do not reboot during expansion as it might cause an issue
  • expect that the GUI might not report size correctly

This is where things stand so far.  As the situation develops or changes, I will update!

Sync My Clouds!

Posted: July 29, 2014 in Computers and Internet

As cloud services mature, one of the trickiest problems is definitely data sprawl. Issues of rationalization and migration of data become a challenge as information spreads across multiple services. If you consider music as an example, it is definitely possible to end up with a collection that spans Amazon Music, Google Music and iTunes. One of the only real ways to keep those particular services synchronized is to source them from a common distribution point, preferably living on a pure storage service. Of course depending on the size of your collection, this can require a fairly significant investment in cloud storage. In recent months, though, there has been an incredible land grab for consumer business that has seen rates for storage drop dramatically. Currently, this is how my personal spend/GB looks:

 

| Service | Base Storage for Subscription Tier | Extra Storage (bonus, referral, etc) | Monthly Cost | Note |
| --- | --- | --- | --- | --- |
| DropBox | 100GB | 7GB | $10 | |
| OneDrive | 1,020GB | 10GB | $11 | Office365 Home Sub – lots more than just storage in here – plus 20GB base storage |
| Google Drive | 100GB | 16GB | $2 | Includes Gmail and Google+ |

Pretty impressive! Tallying things up, we’re looking at a total spend of $23 which provides:

  • 1253GB storage across 3 providers
  • Office 365 access (mail, SharePoint, Office Web Applications)
  • Office local install for Mac, PC, Android, iOS (multiple machines)
  • Live Mail, GMail, Google Plus
  • Desktop/device integration for all providers

To me this seemed like a fantastic deal: for less than $25 a month, over 1.2TB in the cloud is a ton of storage.  As a result, over the past few months, I have been shifting to a cloud-only model for data storage.  The way I decided to run things was to make DropBox my primary storage service.  Despite having by far the worst economics (ironically, DropBox has become ridiculously expensive compared to the competition), it has (IMO) the best client integration experience as a result of the service’s maturity.

So with DropBox in the prime slot, the next challenge was figuring out a plan for the secondary services.  At first I tried a model where I would assign use cases to each service.  So music in Google only, pictures on OneDrive only, documents across all 3.  This quickly fell apart as you wind up in a model where you need to selectively sync the secondary services, and you lose redundancy for some key use cases.  In analyzing my total usage pattern though, I found that as a high watermark I consume 75GB of space in the cloud (including documents, photos and music).  With the current $/GB rates, this data volume can easily fit in all 3 providers.  Realizing this, I quickly moved to a hub/spoke sync model where I utilize OneDrive and Google Drive for backup/redundancy and DropBox becomes the master.  Of course the logistics of this proved very challenging, having to utilize a middle-man client to funnel the data around.  There had to be a better way. Wasn’t this a great idea for a startup? Well… Enter CloudHQ!

CloudHQ aims to provide a solution to the monumental task of cloud data sync.  As a premise it sounds amazing!  Just register with these guys, add your services, create some pairings, and let their workflow (and pipes) do the rest.  I’ve been tracking these guys for a while and it appears they are delivering. Of course the challenge is that to do meaningful work (more than one pairing) you need to pony up to the commercial level.  I held off a while to see how their service would mature.  Recently, though, they had a price drop that I feel represents a fantastic deal.  I was able to get on board with the Premium-level subscription for $119 by committing to 1 year. $10 a month is just a terrific price for a service like this, so hopefully this price will lock in moving forward.  Of course the service does have to work or it’s not such a great price, right?  Well, let’s see how things went!

First off… the sign-up and setup process was fantastic.  I actually went through the entire setup on an iPhone over lunch using my Google ID as a login.  Once signed up you can jump right in and get started.  Here is a shot of the basic mobile UI:

2014-07-29 17.05.00

 

I love how clean this is. Very clear how you can get started creating sync pairs using the supported named services.  Clicking one of those options will trigger a guided workflow.  In addition, you can setup your own sync pairs manually.  Either option brings you to service registration:

2014-07-29 17.06.22

CloudHQ currently supports a very nice set of services.  Supported services view from the desktop UI:

Screenshot 2014-07-29 21.38.15

 

Once services are registered and sync pairs created, the service will start to run in a lights-out fashion.  Updates are emailed daily and a final update message goes out once the initial sync is completed.  The stages break down as follows:

  • Initial indexing and metadata population
  • Service sync (bidirectional)
  • Initial seeding complete
  • Incremental sync process runs indefinitely

In my case, there was about 75GB of data or so in play.  The biggest share was on DropBox and there was a stale copy of some of the DropBox data already sitting on both OneDrive and Google Drive.  In addition, there was a batch of data on both OneDrive and Google Drive that did not exist on DropBox.  The breakdown was roughly as follows:

  • DropBox – 56GB or so of pictures, documents and video
  • OneDrive – subset of DropBox content, roughly 5GB of picture data and 3GB of eBooks
  • Google Drive – subset of DropBox content, roughly 12GB of music and 5GB of picture data

The picture data was largely duplicated.  In approximate numbers, about 40GB had to flow into OneDrive and Google Drive and about 15GB had to flow into DropBox.  The UI makes keeping an eye on sync status terrific:

2014-07-29 17.06.35

 

In the desktop UI, there is great detail:

Screenshot 2014-07-29 22.20.02

 

The email updates are great.  Here is a sample of the initial email:

Screenshot 2014-07-29 22.25.24

These updates are very straightforward and will come daily.  The pair, and transfer activity for the pair, is represented.  In addition, there is a weekly report which provides a rollup summary:

Screenshot 2014-07-29 22.26.07

So how did the service do?  Quite well actually.  Here is my experience in terms of performance:

  • Account created, services registered, pairs added: 7/26 – 12:30PM
  • Indexing and initial metadata population complete, Evernote backup complete: 7/26 – 9:52PM
  • DropBox to GMail complete, DropBox to OneDrive partial (63GB copied): 7/29 – 10:30PM

No conflicts occurred and there have been no problems with any of the attached volumes.  I have to say I am extremely impressed with CloudHQ so far and pushing 63GB of bits around in a matter of 3 days is a fantastic “time to sync state”.

As my experience with the service increases I will continue to post updates, so stay tuned!

Upgrades!

Posted: July 12, 2014 in Computers and Internet

Well there is truly no rest for the weary. Or is it the wicked? Let’s compromise and say in this case it’s both! It’s no surprise that even a really sweet piece of kit like the Dell T620 isn’t going to stay stock for long at ComplaintsHQ where “live to mod” is a life motto. Luckily the recent generosity of family members wise enough to provide MicroCenter gift cards as presents provided just the excuse required to get some new parts.

It was hot on the heels of the initial install of the Dell that we added an SSD for VSAN testing and two ATI cards for vDGA View testing. Honestly though, vDGA isn’t cool. You know what’s cool? vSGA! For those saying “uh, what?”, both of these are technologies which allow a hardware GPU installed in the host to be surfaced in the guest OS (View desktops generally). With vDGA, a single GPU is dedicated to a single guest OS via Intel VT-d or AMD-Vi (IOMMU remap/directed IO technologies which allow a guest OS to directly access host hardware). This does work, but obviously isn’t very scalable, nor is it a particularly elegant virtualization solution. vSGA, on the other hand, allows for a GPU installed in the host to be virtualized and shared. The downside is that there is a (very) short list of supported boards, none of which I had on the shelf. The last item on the “to do” list from the initial setup was to get some sort of automated UPS-driven shutdown of the guests and host in the (likely around here) event of power failure.

The current status to date (prior to the new upgrades) was that I had an old Intel X25 80GB SSD successfully installed and shared to the nested ESXi hosts (and successfully recognized as SSD) and vSAN installed and running. I also had a View config set up with a small amount of SSD allocated for temporary storage. With aspirations of testing both vSAN and running View, 80GB of SSD really is tight, so beyond saying “OK, it works!” not much could actually be done with this setup. Since SSDs are cheap and getting cheaper, I decided to grab this guy on super sale at MicroCenter for $99:

2014-07-12 15.52.02

While there I also picked up a small carrier to mount both SSDs in. I decided to also utilize some rails and mount the SSDs properly in one of the available 5.25 bays:

2014-07-12 16.00.03

The vSGA situation is certainly trickier than simply adding a budget SSD, but perusing eBay the other day, I happened upon a great find so, since I was upgrading anyhow, I jumped on it. Not only one of the few supported cards, but an actual Dell OEM variant for $225:

quadro4000

 

Another refinement I’ve been wanting to do to the server is to add power supply redundancy (mainly because I can leave no bay unfilled!).  I’ve committed to definitely resolving my UPS driven auto-shutdown challenge this round, so while not necessary, the redundant supply fits the theme well.  Luckily eBay yielded some more good results.  Dell OEM at $145:

2014-07-12 14.32.23

On the UPS side, you may remember that during the initial install of the server I had added in a BackUPS 1500 to run the ReadyNAS and the T620.  Unfortunately,  APC is a pain in the ass and VMware doesn’t make it any better.  Getting the ReadyNAS on managed UPS backup is as easy as plugging the USB cable in and clicking a checkbox using any APC unit.  In VMware, this is pretty much impossible.  Unless you buy not only the highest end of the SmartUPS line, but also buy the optional UPS network card (hundreds more), there is really no native support to be found.  I had explored some options using USB passthrough from the host to a Windows guest, combined with some great open source tools like apcupsd and Network UPS Tools.  I never quite got things working the way I wanted though.  More on that later…

OK, so that is the part list!  Total damage for all of the above was $900.  Steep, but almost half of it was actually the UPS.  As always, there is no better way to start healing from the emotional trauma of spending money than to start installing!  Let’s begin with the super easy stuff: the PSU.  I can honestly say that installing a new hot-swap supply in a T620 actually couldn’t be any easier.  First step is to access the back of the case and pop off the PSU bay cover (it pops right out):

2014-07-12 16.02.19

With the bay open, you literally just slide the new supply in and push gently (you will feel the connector catch and seat):

2014-07-12 16.03.06

Once installed, head into iDRAC to complete the power supply reconfiguration.  The options are very basic.  You can either enable or disable PSU hot sparing once the new one is in (and set which one is primary) and you can enable input power redundancy:

Screenshot 2014-07-12 18.28.55

OK, back to the UPS quandary! The general idea of VM based UPS control is as follows:

  • plug in UPS, plug server into UPS
  • attach UPS USB cable to server
  • enable passthrough for the USB channel (requires AMD-Vi or Intel VT-d, under Advanced Options in the Server Configuration in the VIM client)
  • add the USB device to a Windows (or Linux) guest VM
  • install the open source APC driver
  • install NUT
  • develop a script that fires off scripts on the ESX host prior to executing a VM shutdown (the host scripts will ultimately pull the rug out from under the UPS host VM which is fine)
  • make sure that VMware Tools is installed in all VMs so they can be gracefully shut down by the host
  • utilize either WOL or an awesome ILO board (like the iDRAC) to ensure that the server can be remotely brought back

Since I was in a spending mood, I decided to add a companion to my BackUPS 1500 just for the server.  Here she is:

2014-07-12 19.49.55

That is the SmartUPS 1000 2RU rack mount version.  So problem solved, right?  Yeah, no.  But before we get into that, let’s get this beast set up.  First the batteries have to be installed.  The front bezel pops off (it actually comes off and I popped it in for this photo) revealing a removable panel:

2014-07-12 19.49.36

A single thumb screw holds the panel in place.  Removing it allows the panel to be slid left and pulled forward revealing the battery compartment.  As always, the battery is pulled out by the plastic tabs, flipped over, and put back in where it will now snap into place (its own weight is enough to really seat it well if the unit is a bit angled).  The final product will look like this:

2014-07-12 19.49.02

In terms of connectivity, here is what you get (not joking):

2014-07-12 19.50.15

Yes, this is *one* USB cable and that’s *it* for $450!

Now, let’s take a look at what APC requires for VMware host support:

  • a SmartUPS unit – check, we have this one
  • the optional network card – bzzzt… nope
  • serial only connection to the host – bzzzt… nope! (THIS one really pissed me off)

So somehow APC can’t figure out how to get a USB connected UPS working on ESXi, and the latest SmartUPS somehow has no included serial cable.  Really fantastic!  I considered a few options including attempting to do a DB9 to USB conversion using the RJ45 to USB cable from my lesser BackUPS 750, but I shot all of the options down.  USB to serial requires driver support and there is zero chance of getting that working on the host.   Some of the other options I considered were publishing serial over network, but this seemed like a poor approach also.  At this point, I was stumped and seriously considering returning the seemingly useless SmartUPS to MicroCenter.  Before packing it in, I decided to try one more approach.

Returning to the basic architecture I had planned for the BackUPS, but this time using the native PowerChute Business app included with the SmartUPS (at least it comes with something useful!), I set up UPS support on my vCenter.  Passing through USB worked from the host, and the PowerChute server, console and agent installed without a hitch and successfully located the UPS.  So far so good!

The critical step was now to figure out a way to get the vCenter guest to shut down all of the VMs and the server once PowerChute detected a power event.  Luckily, it wasn’t too difficult and I was able to find this awesome script to handle the ESX side.  Here is the logic:

  • add a custom command in PowerChute.  The custom command calls PuTTY from the command line with the option to run a script on the host upon connection.  The command is inserted into “batchfile_name.cmd” in the APC\agents\commandfiles directory and should be formatted like this (the trailing esxi-hostname is the DNS name or IP of the ESXi host, and login/password are the host credentials):
@SMART "" "C:\Program Files (x86)\putty\putty.exe" -ssh -l login -pw password -m C:\script.sh esxi-hostname
  • the contents of “script.sh” is that awesome script linked above.  The gist of it is:
    • use the ESX command line tools (vim-cmd) to enumerate all running VMs (basic string processing on the output of the list command)
    • loop over that list and suspend/shut them down (a for or while loop construct)
    • shut down (or stand by) the host

Here are the contents of the script:

#!/bin/sh
# Enumerate all registered VMs (strip the header line, keep the Vmid column)
VMS=`vim-cmd vmsvc/getallvms | grep -v Vmid | awk '{print $1}'`

# Pass 1: suspend everything that is currently powered on
for VM in $VMS ; do
  PWR=`vim-cmd vmsvc/power.getstate $VM | grep -v "Retrieved runtime info"`
  if [ "$PWR" = "Powered on" ] ; then
    name=`vim-cmd vmsvc/get.config $VM | grep -i "name =" | awk '{print $3}' | head -1 | cut -d "\"" -f2`
    echo "Powered on: $name"
    echo "Suspending: $name"
    vim-cmd vmsvc/power.suspend $VM > /dev/null &
  fi
done

# Pass 2: poll until every VM reports it is no longer powered on
while true ; do
  RUNNING=0
  for VM in $VMS ; do
    PWR=`vim-cmd vmsvc/power.getstate $VM | grep -v "Retrieved runtime info"`
    if [ "$PWR" = "Powered on" ] ; then
      echo "Waiting..."
      RUNNING=1
    fi
  done
  if [ $RUNNING -eq 0 ] ; then
    echo "Gone..."
    break
  fi
  sleep 1
done

# Finally, put the host itself into standby
echo "Now we suspend the Host..."
vim-cmd hostsvc/standby_mode_enter

I am happy to say that it worked like a charm and successfully shut down all VMs cleanly and brought down the host!  You can set some delays in PowerChute and I set them to 8 minutes for the OS shutdown and 8 minutes as the time required for the custom command to run, but it really won’t matter since the custom command will kill the VM (and PowerChute) anyhow.

A couple of things to be aware of with this approach:

  • the PCBE Agent Service needs “interact with desktop” checked on newer versions of Windows (2k8+).  Make sure to run the SSH client once outside of the script first to deal with any interaction it needs to do (saving fingerprint, etc)
  • the USB passthrough can be a bit flaky in that the USB device doesn’t seem to be available right at first OS boot (so the service may not see the UPS).  Eventually it does refresh and catch up on its own, however

Coming up soon will be the Quadro install and the SSD setup, followed by some (finally) notes on VSAN and accelerated View (both vDGA and vSGA), so stay tuned!


The VMware NGC client is definitely super convenient, being entirely browser-based, but the legacy client undoubtedly had its charms. Chief among those charms is the ability to manage an actual ESXi host rather than just a vCenter instance. Except on a Mac, where it doesn’t work at all. Admittedly this isn’t a huge issue for production, where vCenter will be highly available and the admin console is unlikely to be a Mac, but in a home lab it becomes a huge issue. The solution? Enter WineBottler!

For those not familiar, Wine is a recursive acronym that stands for “Wine Is Not an Emulator”. It dates back to the early days of Linux (1993) and the idea is to provide a containerized Windows OS/API experience on *NIX systems. In a very real way WINE is one of the earliest runs at application virtualization. It’s an extremely nifty idea but, as with all cross-platform “unofficial” app virtualization technologies, it is not 100% effective. The VIM client falls into the edge cases that require some tweaking to get to work. The good news, though, is that it can be done:

Screenshot 2014-07-11 04.37.44

OK, with the proof of life out of the way, let’s walk through exactly what it takes to get this thing working step-by-step.  Note that it will not work straight out of the box.  It will fail and need to be remediated.

Step 1: Download and install WineBottler.  This article is based on the (at time of publication) current stable release 1.6.1.

Step 2: With WineBottler installed, download the MSXML Framework version 3.0 and copy it into the “Winetricks” folder (/Users/username/.cache/winetricks/msxml3).  “Winetricks” are component installs that Wine can inject into the container during packaging (middleware, support packages, etc).  VIM requires .NET 3.5 SP1, which WineBottler has standard, but also requires MSXML version 3.0, which it does not.  The first pass through packaging will generate an error if this step isn’t completed, but the errors are extremely helpful and will provide both a download link for the missing package and the path to copy it to (so no fear if you miss this step).
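If you prefer to do Step 2 from the terminal, the copy amounts to something like the sketch below (the msxml3.msi file name and the Downloads location are assumptions; the WineBottler error dialogue spells out the exact file name and path it wants, so defer to that if it differs):

# Create the winetricks cache folder WineBottler looks in and drop the MSXML 3.0 installer into it
mkdir -p ~/.cache/winetricks/msxml3
cp ~/Downloads/msxml3.msi ~/.cache/winetricks/msxml3/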

Step 3: We’re now ready to bottle up some WINE!  Launch the WineBottler app and click the “Advanced” tab:

Screenshot 2014-07-11 10.42.58

Lots to explain here, so let’s take it one component at a time.

Prefix Template:  this option refers to the actual app container (the virtual environment that WineBottler creates during this sequencing step for the application).  This can be either a new container, or based on a previously created one.  For now we are creating a new template, but later we will be reusing it.

Program to Install: this is the application we are virtualizing.  In our case, at this stage, we want the actual VIM install package (VMware-viclient-all-5.5.0-1281650.exe) which can be downloaded directly from the host at https://esxi-hostname.  This is an installer, so we want to select that option.  Later on we will be repeating this with the actual app, but for now we are going to use the installer to lay the groundwork.

Winetricks: as discussed, these are optional component installs.  Here we want to check “.NET 3.5 SP1”.

Native DLL Overrides:  as the name implies, this powerful option gives us the ability to supplement a standard Windows DLL with an out-of-band version we would include here.  Huge potential with this one, but we do not need it for our purposes.

Bundle:  another powerful option, this gives us the ability to create a stand alone WINE container app.  With this option, the OSX app file created could be copied over to another machine and run without having to install WINE.

Runtime Options, Version, Identifier, Codesign Identity:  these are our important packaging options.  Runtime, as implied, allows us to tweak settings at time of packaging.  None required for our case here.  Version is an admin process option that allows you to version your containers.  Identifier is extremely important because the container path in the OSX filesystem will be named using the Identifier as a prefix, so use a name that makes sense and make a note of it.  I used “com.vmware.vim”.  Codesign Identity is also an admin process field, allowing for validation of the package via a unique identifier.

Silent Install:  allows you to run silent for most of the install (WINE will “auto-click” through the installers).  I left this unchecked.

Once you have checked off .NET 3.5 SP1 Winetrick and assigned an Identifier, click “Install”.  You will be asked to provide a name and location for the OSX app that will be created by the sequencing process:

Screenshot 2014-07-11 10.59.23

 

Step 4: walk through the install.  The install will now kick off in a partially unattended fashion, so watch for the dialogue prompts.  If the overall sequencer Install progress bar stalls, there is a good chance a minimized Windows installer is waiting for input:

Screenshot 2014-07-11 10.59.36

The Windows installer bits will look familiar and will be the base versions of .NET that WINE wants, the .NET 3.5 SP1 option that we selected, and the MSXML 3.0 package that is required.  The process will kickoff with .NET 2.0:

Screenshot 2014-07-11 10.59.58 Screenshot 2014-07-11 11.00.16

You’ll have to click “Finish” as each step completes and at times (during .NET 3.0), the installer will go silent or will act strangely (flashing focus on and off as it rapidly cycles through dialogues unattended).  At times you may need to pull focus back to keep things moving.  Once the .NET 2.0 setup is done, you will get a Windows “restart” prompt.  Weird I know, but definitely perform this step:

Screenshot 2014-07-11 11.10.51

During the XPS Essentials pack installation (part of base WINE package) you will also be prompted about component registration.  Go ahead and register:

Screenshot 2014-07-11 11.12.42

The XML Parser component install (part of base WINE package) will require user registration.  Go ahead and complete it:

Screenshot 2014-07-11 11.14.25

 

.NET 2.0 SP2 will require another restart. Go ahead and do that:

Screenshot 2014-07-11 11.20.34

 

 

With all of the pre-requisites finally out of the way, the core VIM install will finally extract and kickoff:

Screenshot 2014-07-11 11.21.47

You will see the VIM Installer warning about XP.  You can ignore this.  I was able to connect to vCenter without issue:

Screenshot 2014-07-11 11.22.40

The install will now look and feel normal for a bit:

Screenshot 2014-07-11 11.24.22

Until… dum dum duuuuuuuum.  This happens:

hcmon error picture

HCMON is the USB driver for the VMRC remote console (a super awesome VMware feature).  Long story short, for whatever reason, it doesn’t work in WINE.  Have no fear though, this entry is all about getting this working (minus the console capability, sorry!).  Do not OK this dialogue box.  Pause here.

Step 5:  once we acknowledge that dialogue, the installer will roll back and delete the installation which is currently being held in temp storage by WineBottler.  We want to grab that before this happens and put it somewhere safe.  So before clicking OK, go over to /tmp/winebottler_1405091227/nospace/wineprefix/drive_c/Program Files/VMware.  Copy the entire “Infrastructure” folder and paste it somewhere safe, then rename it:

Screenshot 2014-07-11 11.34.11

I dropped it into my Documents folder and renamed it “VMW”.  What we are looking for is to make sure that “Infrastructure/Virtual Infrastructure Client” is fully populated:

Screenshot 2014-07-11 11.36.24

We can now click “OK” to the HCMON error and allow the installer to roll back and WineBottler to complete.  It will ask us to select a Startfile.  There is no good option here since our installer actually didn’t finish correctly (WineBottler doesn’t actually know this).  It doesn’t matter what we select as we just want to get a completed install, so go ahead and select “WineFile”:

Screenshot 2014-07-11 11.39.09

 

This dialogue will complete this step:

Screenshot 2014-07-11 11.40.31

 

Step 6:  At this stage, we do not have a working install.  What we do have is a usable template on which we can build a working install.   First go ahead and launch the app (the shortcut will be where the container was saved in step 4).  Nothing will happen since there is no app, but the environment will be prepared.  This is the important piece.  The next step is to go back into WineBottler, and run a new sequencing, but with the options slightly changed:

Note, we are now selecting the newly created environment as the template (/Applications/VIM Client.app/Contents/Resources in my case).  For our “Program to Install”, we are now selecting: /path to saved client files/Infrastructure/Virtual Infrastructure Client/Launcher/VpxClient.exe and we are letting WineBottler know that this is the actual program and that it should copy the entire folder contents to the container.  We can now go ahead and click Install (it will be quicker this time).  At the end of this install, be sure to select VpxClient.exe as the “startup program” before completing.

Step 7: Unfortunately, we’re not done yet!  The last step is to do some manual copying, since the container will still not be prepared quite right.  Once again, copy the “Infrastructure” hierarchy.  Head over to /Users/username/Library/Application Support/ and find your WineBottler container folder (com.vmware.vim_UUID in my case).  Navigate to drive_c/Program Files/VMware and paste Infrastructure over the existing file structure.
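
Again, if you’d rather script this final overlay than drag folders around in Finder, something like the following rough Python sketch (3.8 or newer, for dirs_exist_ok) captures the idea.  It assumes the files were stashed in ~/Documents/VMW back in step 5, and it globs for the container since the UUID in its name will differ on every build:

```python
# Rough sketch: overlay the saved "Infrastructure" tree onto the new
# WineBottler container (requires Python 3.8+ for dirs_exist_ok).
import glob
import shutil
from pathlib import Path

saved = Path.home() / "Documents" / "VMW" / "Infrastructure"  # stashed in step 5

# The container folder name embeds a generated UUID, so find it by pattern.
containers = glob.glob(
    str(Path.home() / "Library" / "Application Support" / "com.vmware.vim_*")
)
if not containers:
    raise SystemExit("No com.vmware.vim_* container found - did the step 6 sequencing finish?")

target = Path(containers[0]) / "drive_c" / "Program Files" / "VMware" / "Infrastructure"

# Paste over the existing structure, overwriting what the installer left behind.
shutil.copytree(str(saved), str(target), dirs_exist_ok=True)
print("Copied into", target)
```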

With this step done, you should be complete!  The original environment can now be deleted and a new shortcut should exist that works.  Here is a final shot of the VIM client managing vCenter via WineBottler on OSX:

Screenshot 2014-07-11 20.05.38


Depending on how things go, the title for this entry might more appropriately be “the self-healing Mac”.  Only time will tell!  So what is this all about?  Well, recently my trusty companion of 2 years, a mid-2012 15″ MacBook Pro Retina, decided to have a near-death experience (as near as I can tell).

It all started with a single kernel panic while doing some boring daily tasks in Chrome.  Within a 24 hour period the problem accelerated to a continuous kernel panic loop.  My first thought was “recent update”, but searching high and low for clues didn’t yield much.  Basic diagnostics (read as the highest of high level) seemed to imply the hardware was OK, but it really felt like a hardware issue.  Or if not hardware, possibly drivers.  But of course neither of those made much sense.  This was OSX running on a nearly new Mac, after all!  It’s like suggesting that your brand new Toyota Corolla would up and completely die 3 miles off the lot (heavy sarcasm here).

Searching around, I discovered that there were possibly some issues with Mavericks and the Retina that I had maybe been dodging.  It had also been 2 years of accumulating crap (dev tools, strange drivers, virtualization utilities, deep utilities, games), any of which could be suspect.  So I decided I would try a Time Machine rollback to before the first kernel panic, and if that failed, take a time machine back to 1995 and do the classic Windows “fix”: wipe and reinstall (ugh).

The Time Machine restore took literally ages thanks to the bizarrely slow read rates of my backup NAS (detailed here), but it eventually completed (400GB, 24 hours).  Unfortunately, the system wasn’t back for more than 10 minutes before the first kernel panic!  That meant that either the condition actually pre-existed the first known occurrence and had just been lurking, or the issue was in fact hardware.  I moved forward with the clean install.

First I deployed a new version of Mavericks.  Boot up holding Command-R, follow the linked guide, and you’re off to the races.  The reinstall was pretty smooth (erase disk, groan, quit back to the recovery menu, install the new OS) and first boot just felt better.  Of course, you know what they say about placebos!  After an hour of installing my usual suite of apps, upgrading to Mavericks and grabbing the latest updates, the dreaded kernel panic struck!  Things were looking grim.

With little to lose, I decided to try rolling back to Mountain Lion on the outside chance that the latest Mavericks update was causing issues.  One more reinstall, followed by an apps-only install, and I was feeling good.  Until terror struck!  Yes, another kernel panic.  Incidentally, these kernel panics were all over the place (which really suggested RAM).

At this point I became bitter.  Suddenly the “it looks amazing and is all sealed and covered in fairy magic!” Apple approach didn’t seem so great.  Changing out a DIMM on a PC laptop is a cheap and very easy fix.  Hell, in these pages I’ve covered complete teardowns of PC laptops (down to motherboard replacements).  Compounding the issue was that I never opted for AppleCare (yes yes, I know that failing to spend more money on top of a premium $2500 laptop means I deserve what I get if said premium hardware somehow completely dies within 3 years).  Apple’s decision to solder the memory to the motherboard meant I’d be looking at an extremely expensive motherboard swap and a good-sized chunk of downtime (the latter being a really big issue for me).  Starting to feel truly grumpy, I decided to run a few tests.

First, memtest in the OS.  Lots of failures.  Instant failures too.  As a matter of fact, I’ve never seen such a horrific memtest result!  It was honestly a bit of a wonder the thing could even boot!  Thinking that maybe the software result was anomalous (memtest for OSX is a bit old at this point and in theory doesn’t support anything newer than 10.5.x), I decided to run the old faithful Apple Hardware Test.  If nothing else, that utility is always a cool walk down GUI memory lane!

Well depressingly enough, AHT wouldn’t run.  I didn’t think to snap a pic at that point (I wasn’t planning on this entry), but this gives you an idea (stolen from the Apple support forums):

Image

Disclaimer: Not my pic. Error numbers have been changed to protect the guilty!

The actual error code I was faced with was -6002D.  Yep.  That’s generally memory.  So it looked like a total bust.  My Apple honeymoon appeared to be officially over.  I decided to do one final wipe in preparation for the now seemingly inevitable hospital visit, and this time lay down only the bare minimum footprint needed to keep doing a bit of work in the meantime.  One positive outcome of all of this wiping was that the kernel panics had gone from a continuous loop to fairly rare.

After turning in for the night, and struggling through a restless sleep fraught with nightmares of Genius Bar lines stretching to the horizon, I crept downstairs to discover that I was not greeted by this:

Again, not mine… But you get the idea. Seen one, seen em all!

The Mac had made it through the night!  Now this was interesting.  Could it possibly be that something in this lineup had become toxic?  With the cloud and “evergreen” software, it was possible.  After all, since our software library is now real-time and online, it can be hard to avoid the newest version, right?

  • Chrome
  • Lync
  • Skype
  • Office 2011
  • Camtasia
  • Omni Graffle
  • iMovie
  • Garage Band
  • Unarchiver
  • Evernote
  • Dropbox

That is literally the “slim” list that was in place every time the problem would happen post system wipe.  The new list, which seemed stable (against all odds), was solely Office, Lync and Skype.  It was time to do some testing!  Well, the results were interesting, to say the least!  I decided to beat the Mac up a bit.  First, Unigine Heaven 4 in Extreme mode, left running overnight (I was always curious how it would do anyhow):

Screen Shot 2014-06-19 at 7.28.35 PM

A great score it’s not, but banging through maxed-out Unigine overnight without a hitch kind of implies that the GPU (and drivers) are not an issue.  Well, we did suspect that after all, right?  How about taking a closer look at memory?

Screen Shot 2014-06-19 at 7.29.10 PM

Hmmm… OK so far so good…

Screen Shot 2014-06-19 at 7.29.24 PM

Well, it’s not ECC anyhow, and who knows what this code is actually doing, right?  For all we know this is just a register dump.  Time for the big guns.  How about some Prime95 max memory torture testing?  This is another thing I’ve always wanted to subject the Mac to.  No way it survives…

Screen Shot 2014-06-19 at 7.28.19 PM

 

Uh, OK.  It just got serious.  How the hell could this heap, which was unable to even start AHT one clean install ago, somehow now be banging through hours of Prime95 torture?  There was only one thing left to do (well, OK, two).  First up, memtest.  Keeping in mind, of course, that it might be incompatible!

WTF?!

What… the… heck!?  This time the test ran like a charm, exactly as expected.  So not only does it appear that memtest does in fact work fine on Mountain Lion, but the MacBook passed.  With a cautious glimmer of hope starting to form, and more than a bit of fear, it was time for… AHT!

2014-06-19 20.27.58

You have got to be kidding me!  This time not only did AHT run, but it passed the damn test!  At this point I started checking for hidden cameras, aliens and paranormal activity.  It just didn’t make any sense!

So where does this leave us?  Well, at this point I have added everything back in except Chrome and have successfully repeated all of these tests!  Is it somehow possible that Chrome caused this?  But how?  Chrome certainly can’t survive reboots.  Or can it?  With modern laptops and the way they manage power, it’s hard to know if the machine is ever really off.  Is it possible that some software anomaly was leaving the Mac in a state that prevented it from entering AHT, and that this state survived reboots?  It seems impossible, but then none of this makes sense.  How could the Mac have gone from being so unstable it couldn’t even enter AHT to passing it over and over with flying colors and surviving brutal overnight torture tests, with only a software change?  I’ve been doing this a long time (hint… Atari 400, Timex Sinclair 1000, etc.) and have never seen anything like this.  Is it a self-healing Mac?  Is it software so insidious it can survive reboots?  I almost don’t want to know.  One thing is for sure, though: I will be keeping a close eye on this and providing any updates on these pages.  And if I should suddenly vanish?  Tell them to burn the MacBook!