Of Bottlenecks, Bandwidth and Erroneous Tribal Knowledge…


It’s time to set down a few thoughts on a couple of pet peeve topics that continually pop up on enthusiast PC forums so I don’t need to endlessly repeat the same points.  Anyone interested in a primer on these admittedly complex topics will hopefully find this useful.

Bottlenecks

Specifically, in this case, CPU/GPU, but really the same logic applies anywhere two subsystems interact.  First, let’s define a bottleneck.  In real terms, a bottleneck exists when the overall performance of an entire system is determined solely by the performance limit of a single subsystem.  You could be in a 900HP car, but if you have inferior tires that cannot grip the road, your tires would be a textbook example of a real bottleneck.  In a modern computing system, when talking about the 3D pipeline, the term is essentially inapplicable because of the deliberate balance engineers across the industry target between subsystems within a given generation.  I think the typical misunderstanding in this space comes mainly from a fundamental ignorance of how the 3D pipeline, in a modern architecture, actually works.  There are lots of great overviews of this complex topic.  ExtremeTech still keeps an “oldie but a goody” online here: 

http://www.extremetech.com/computing/49076-extremetech-3d-pipeline-tutorial

In a nutshell, a computer system is made up of a number of subsystems which all contribute fundamental capabilities to the overall architecture.  Storage controllers, network controllers, IO controllers, etc. all provide access to communications and storage peripherals so the computer can import and export data, both from users and from other systems, via disk drives, network cards, keyboards, etc.  Sound and graphics subsystems take the burden of creating audio and video output off of the CPU.  And the CPU sits in the middle of all of these subsystems, orchestrating their interaction, running the primary operating system and user interface, and running application code.

In the case of gaming, the CPU is responsible for running the OS and all of the “application programming interfaces”, or “APIs”, which games require to run.  An API is simply a set of pre-written functions that developers can utilize in order to not reinvent the wheel.  Taking a super simple example, JPEG is a pretty universal format for encoding photos into a file.  Back in the old days, if you wanted your application to be able to load and save (and display) JPEG files, you would need to write lots of code to do that.  A modern operating system, on the other hand, might have a “picture handler” API which you could simply talk to using well defined and documented functions.  In this case your application could start by asking the API “can you decode a JPEG?” and if the support is there, the answer will be affirmative, and the application can then utilize this API facility to provide JPEG support.  With gaming APIs, the types of functions provided might be things like “draw a circle” or “map a texture to a circle”.  Any modern OS has lots and lots of APIs that the CPU runs.  On Windows systems the DirectX family provides pretty comprehensive support for gaming and multimedia.  On the Mac OS X operating system, open standards like OpenGL and OpenAL provide similar functionality.
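The capability-query pattern described above can be sketched in a few lines of Python.  To be clear, the class and method names here are made up for illustration; no real OS exposes exactly this API:

```python
# A hypothetical "picture handler" API, sketched to show the
# capability-query pattern described above (illustrative names only).

class PictureHandler:
    _supported = {"jpeg", "png"}

    def can_decode(self, fmt: str) -> bool:
        """The 'can you decode a JPEG?' question from the text."""
        return fmt.lower() in self._supported

handler = PictureHandler()
if handler.can_decode("jpeg"):
    # The application can now rely on the API for JPEG support
    # instead of writing its own decoder.
    print("JPEG support available")
```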

In addition to running the OS and its APIs, and orchestrating all of the foundation subsystems, the CPU is also responsible for running the actual application code (in this case a game).  That includes the algorithms for the game logic, the user interface of the game, managing the game’s database of objects, etc.  A key part of this logic, for a modern 3D game, is the set of algorithms that move and track the position of the user in virtual 3D space.  Lots of math goes into this.  The CPU runs big batches of floating point math that determine where the user was, where they want to go, and where they will be.

At that point, the scene presented to the user must be drawn and sent to the monitor.  In the early days of 3D, the video card was little more than local memory and a frame buffer.  The CPU did just about all of the work and sent a bitmap (just a single image) out to the video card’s local RAM, and the video card circuitry would convert this image data into a signal that the monitor could then display.  The video card would keep the monitor refreshing the same image until that image was updated by the CPU.

Over time, more and more functionality was implemented on the actual video card, freeing up the CPU to do other things.  Initially 2D functions were developed, in order to make rendering graphical user interfaces quicker, and over time the complex 3D functions started to be implemented on consumer video cards as well.

There are many steps in the workflow of a 3D rendering pipeline:

  • geometry must be calculated and a triangle mesh built (for polygon based rendering)
  • the scene must be whittled down (“culled”) based on what the user can “see” from their perspective, in order to reduce the size of the scene being rendered
  • texture data must be applied so the grid of triangles looks like more than a wireframe
  • the entire scene is then lit and refined through techniques like bump mapping (to make uneven surfaces actually look uneven) and shadowing (to make shadows look like they are really being cast by the light sources)
  • finally, anti-aliasing and various types of filtering make the entire image look cleaner and more real

On a modern graphics card, nearly all of this occurs on the GPU.

Understanding this, one can see why it is exceedingly rare for a CPU to actually be a bottleneck for any GPU that can physically work in its motherboard (meaning roughly the same generation).  All the CPU really needs to do is stage the 3D scene: run the application code, stage the image, and then hand off to the GPU and say “ok, render this”.  On a modern card, even triangle setup is on the GPU.

For a CPU to be a true bottleneck, it would have to be unable to stage the graphics data fast enough for any GPU in a set of differently performing GPUs to render it any quicker than the others.  In other words, “bottleneck”, when used correctly in this context, is primarily a comparative term.  The CPU would be a bottleneck if one were to take 4 different GPUs, all at different performance levels, and get the exact same performance on each of them.  At that point, the inability of the CPU to deliver frames at a faster rate is preventing the system from rendering faster.

These days that pretty much never happens.  Why?  Because the GPU is doing 90% of the heavy lifting in the 3D pipeline.  And these systems are completely interrelated: the quicker the CPU gets finished, the quicker the GPU can get started, and the quicker the GPU gets finished, the quicker the CPU can get the next scene teed up.  This leads to lots of misconceptions.  In a nutshell:

  • a quicker CPU will make any GPU give better frame rates 
  • a quicker GPU will make any system get better frame rates
  • the more burden you put on the GPU (super high res, big detail), the more of a bottleneck it is
  • the less burden you put on the GPU, the more you are testing the performance of the CPU

So let’s consider a scenario: a Core i5 CPU paired with a GTX Titan, and then with a GTX 680.  For fun, take any modern game, set the details down to low and the resolution to 1024×768.  At this point, you might be creating an artificial bottleneck because both the GTX 680 and the Titan might be able to complete rendering the scene quicker than the i5 can serve them.  Meaning if the i5 can stage frames at 165 fps, while the GTX 680 could technically be doing 295 and the Titan 380, the system will only be able to do 165 since that is what the i5 can stage.
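The scenario above boils down to a very simple model: the system delivers frames at the rate of the slower stage.  Here is a minimal sketch, using the illustrative frame rates from the scenario (they are assumptions, not benchmarks):

```python
# Toy model of the staging/rendering handoff described above.
# The frame rates are the illustrative numbers from the scenario, not benchmarks.

def effective_fps(cpu_staging_fps, gpu_render_fps):
    """The system can only deliver frames as fast as its slower stage."""
    return min(cpu_staging_fps, gpu_render_fps)

i5_staging = 165   # frames/sec the i5 can stage (assumed)
gtx680     = 295   # frames/sec the GTX 680 could render (assumed)
titan      = 380   # frames/sec the Titan could render (assumed)

print(effective_fps(i5_staging, gtx680))  # 165
print(effective_fps(i5_staging, titan))   # 165 -- identical, so the CPU is the limit
```

Note that both GPUs land on the same number, which is exactly the comparative test for a true CPU bottleneck described earlier.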

Looking at it this way you can see how unlikely it is to encounter a true bottleneck scenario.  Does an i980x “bottleneck” tri-SLI Titans, as many seem to want to believe?  Of course not.  Not if you use the Titans for what they were intended.  I would argue that tri-SLI Titans are nowhere near enough GPU power to handle 3D Vision at 5760×1080 and above at max detail.  Why?  Because at that level of settings the framerates are pretty lousy.  You can search this blog for the incremental improvements in this area from GTX 480, to 580, to 680 in tri-SLI at 3D surround resolution, as I have walked that evolutionary path and have yet to find really acceptable performance.  Reviews of the Titan indicate that it would deliver probably in the neighborhood of 40% more performance in tri-SLI than my 4GB GTX 680 tri-SLI, which is still not nearly enough considering where we are coming from.  The key here is that the complexity is all on the GPU side.  Any CPU that can fit in a socket on a board with a PCI-E slot is going to be able to stage frames quicker than those Titans can render them when you are really brutalizing them.

Memory Bandwidth

So I think that is enough on bottlenecks.  The next topic, thankfully, is much simpler.  Processing units read and write code and data from main memory.  Memory, in turn, is loaded up with code and data transferred from some long term storage medium.  In each step of this process “bandwidth” is involved.  If the data is coming from an SSD or a hard disk, SATA interface bandwidth determines the maximum rate at which the data can come off of the disk.  In turn, PCI-E bandwidth available to the storage controller determines the rate at which that data can make it into main memory.

Most of those paths are (reasonably) well understood.  Where things seem to get fuzzy for people is when the CPU reaches out to main memory to access all of that code and data.  Leaving cache out of the equation, there are simply two primary factors that determine how quickly data can move from memory to CPU and back to memory:

  • data rate
  • data width

That’s it.  To use an old analogy, data width is the number of lanes on the highway and data rate is its speed limit.  Consider two scenarios:

  • 100MHz data rate, 16-bit wide data bus
  • 200MHz data rate, 8-bit wide data bus

Which is more bandwidth?  It’s a trick question.  They are the same.  The math works this way:

  • X MHz = X million cycles per second; a Y-bit data bus ÷ 8 bits per byte = Z bytes per cycle; bandwidth = X × Z megabytes per second

Using that formula you can see that in both cases above the bandwidth is 200MB per second.
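Plugging both scenarios into the formula makes the equivalence obvious.  A quick sketch:

```python
# Bandwidth = data rate (million cycles/sec) * bytes transferred per cycle.

def bandwidth_mb_per_s(data_rate_mhz, bus_width_bits):
    bytes_per_cycle = bus_width_bits / 8   # a Y-bit bus moves Y/8 bytes each cycle
    return data_rate_mhz * bytes_per_cycle

print(bandwidth_mb_per_s(100, 16))  # 200.0 -- wide but slow
print(bandwidth_mb_per_s(200, 8))   # 200.0 -- narrow but fast
```

Doubling the rate while halving the width is a wash, which is the whole point of the trick question.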

The other, vital, component of this however is the number of processors attached to that memory.  This is extremely important with GPUs and is a concept many do not understand.  In average consumer systems (even high end), there is a single CPU package which hosts a number of discrete physical cores (2, 4 and 6 being common).  This means 2, 4, 6 or however many CPU cores are attached to that one memory bus.  There are other memory architectures in which each CPU has its own memory bus and its own block of RAM, but Intel and AMD do not utilize those on their consumer parts (AMD does for its Opteron, multi-socket, server parts, but that is out of scope for this discussion).  So it is important to realize that the more cores you have sharing that bandwidth, the more thinly it is divided up.  In traditional x86/x64 code this isn’t an issue, thanks to caching and really thanks to the type of operations traditional microcomputers are performing in running the OS and drivers and app code.

On a GPU, however, things are a lot different.  Memory bandwidth is vital to keeping the cores fed with a steady stream of data.  Hence many folks have slammed the GK104 for “only” having a 256-bit bus (192GB/s), and point to this as a key “bottleneck” (ugh, that word again) compared to the GK110 (Titan), which has a 384-bit bus (288GB/s).

What is missing from that logic, however, is the fact that the GK110 has 2688 CUDA cores.  The GK104 only has 1536.  That means there are 75% more cores that need to be fed in the Titan.  When you consider the increased core density, the 50% increase in memory bandwidth makes a lot of sense.  It isn’t a differentiator, it’s a mandatory part of the bigger architecture.  This is why you see performance of Titan vs GTX 680 roughly equating to the increased number of CUDA cores.  If you dropped the CUDA cores down to 1536 (too bad it isn’t possible to programmatically do that), I am quite confident that performance would be identical.  The final takeaway here is that bandwidth is something which can be objectively measured as bytes transferred per second, and in overall system design this throughput is specifically matched to the requirements of the subsystems that need to utilize it.  Too little bandwidth creates an absolute bottleneck that no system designer would want, and too much bandwidth wastes silicon and increases cost.  You can bet that any modern semiconductor design team consistently aims for “just right”, and this is an area well understood enough that they generally hit the target.
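You can make the "bandwidth scales with core count" argument concrete by dividing bus bandwidth by CUDA core count.  A back-of-envelope sketch, using the spec numbers from above (this is a rough illustration, not a statement about how GPUs actually schedule memory traffic):

```python
# Rough per-core bandwidth comparison for GK104 (GTX 680) vs GK110 (Titan).
# Core counts and bus bandwidths are the published spec numbers cited above.

chips = {
    "GK104": {"cores": 1536, "bandwidth_gb_s": 192},
    "GK110": {"cores": 2688, "bandwidth_gb_s": 288},
}

for name, chip in chips.items():
    per_core_mb_s = chip["bandwidth_gb_s"] / chip["cores"] * 1024
    print(f"{name}: {per_core_mb_s:.1f} MB/s per CUDA core")
```

Both chips land in the same ballpark per core (roughly 128 vs 110 MB/s), which supports the point that the wider bus is there to feed the extra cores, not to differentiate the product.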

That’s it for now!
