Understanding VRAM and Bandwidth… Do the Maths!

I see lots of discussion around the impact of PCI-E revisions on performance, and nearly all of it is weirdly speculative.  Not surprising, given how often technology debate (which should be scientific) breaks down into pure anecdote, even in the era of Google and Wikipedia.  The accepted “conventional wisdom” seems to be that you “need” PCI-E 3.0 at super high resolutions.  As far as I can tell, this stems from literally one person, on one forum, doing a series of benchmark tests using a really nice high-end setup, and the findings becoming gospel.  These were great contributions to be sure, and I’m not one to suggest that the pro review sites like Anand, [H]ard and Tom’s should be gospel (especially since they all started as hobbyists too), but I do think that a single set of findings should never be considered the final word on anything when it comes to PC testing.  This is especially true when the source is a single contributor and not a well-vetted test bench with a documented history.  The resolution tested in this case happened to be 1600P surround.  Since I have now gone to 1440P surround, and that qualifies as “pretty high”, I figured I’d do some testing of my own.  But first, the promised maths:

Each Frame:

2560h * 1440v = 3686400 pixels

32 bits per pixel = 4 bytes per pixel * 3686400 pixels = 14745600 bytes per frame (14.7MB)

In surround we have 3 of these 1440P frames being stitched together:

14.7MB * 3 = 44.2MB

Now the next exercise is to determine how much actual data transfer we’ll have since, obviously, a single frame is only useful if we’re talking about a static image.  The 1440P QNIX panels can run 120Hz, but in surround mode, thanks to yet another NVIDIA driver bug (YANDB!), you can’t create a custom resolution.  Therefore we are stuck at a maximum of 60fps.  Of course at this resolution even tri-SLI TITAN, as we saw, isn’t going north of 60fps regularly anyway, so it’s ok:

44.2MB / frame * 60fps = 2654208000 bytes per second (2.654GB / second)

As an aside, this is a good point to talk about the path of this data out to the display:

2.654GB / second * 8 bits / byte = 21233664000 bits / second (21.2Gb/s)
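For anyone who wants to re-run the arithmetic with their own resolution or refresh rate, here is a minimal Python sketch of the steps above.  The function and variable names are my own for illustration; the inputs (2560 * 1440, 32 bits per pixel, three panels, 60fps) are the ones used in this post.

```python
# Frame and stream sizes for 1440P surround, using the figures from this post.
WIDTH, HEIGHT = 2560, 1440
BYTES_PER_PIXEL = 4   # 32 bits per pixel
PANELS = 3            # surround = three panels side by side
FPS = 60


def frame_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Uncompressed size of one frame in bytes."""
    return width * height * bytes_per_pixel


single = frame_bytes(WIDTH, HEIGHT)       # one 1440P panel
surround = single * PANELS                # three panels stitched together
bytes_per_second = surround * FPS         # sustained data rate at 60fps
bits_per_second = bytes_per_second * 8    # the same rate expressed in bits

print(f"single frame:   {single:,} bytes ({single / 1e6:.1f}MB)")
print(f"surround frame: {surround:,} bytes ({surround / 1e6:.1f}MB)")
print(f"per second:     {bytes_per_second:,} bytes ({bytes_per_second / 1e9:.3f}GB/s)")
print(f"on the wire:    {bits_per_second:,} bits ({bits_per_second / 1e9:.1f}Gb/s)")
```

Swapping in 3840 * 2160 with a single panel gives the equivalent 4k numbers.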

We can start to see here how 4k (which piles just a bit less than this resolution into one panel) requires a massively high bandwidth connection to feed.  Consider:

  • VGA (DVI-A) = ~9.63 Gbit/s (~400 MHz RAMDAC)
  • DVI-D = 3.96 Gbit/s
  • DVI-D Dual Link = 8.16 Gbit/s
  • HDMI
    v1.0/1.1/1.2 = 3.96 Gbit/s
    v1.3/1.4/1.4a = 8.16 Gbit/s
    v2.0 = 14.6 Gbit/s
  • DisplayPort
    v1.0/1.1 = 8.64 Gbit/s
    v1.2 = 17.28 Gbit/s
    v1.3 = 25.93 Gbit/s

As you can see, with a single panel, we would need DisplayPort 1.3 at least to handle 60Hz refresh of this much data.  As mentioned, 4k is actually a bit lower than 1440P surround: by the same arithmetic (3840 * 2160 pixels at 32 bits per pixel) it requires about 16Gb/s for 60fps.  HDMI color encoding is more wire efficient (4:2:2 chroma) and reduces the bandwidth requirement down to about 7Gb/s or so.  In our surround scenario, we are using three DVI-D dual link connections (1 per panel, 1 per TITAN) and so have plenty of bandwidth.
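The comparison against the list above can be sketched in a few lines of Python.  The link rates are copied straight from that list rather than independently verified, VGA is left out because it is analog, and the helper names are hypothetical.  This is raw pixel data only; a real link also spends some bandwidth on blanking intervals, so treat the results as a lower bound on what is needed.

```python
# Link rates in Gbit/s, copied from the list in this post (not re-verified).
LINKS_GBPS = {
    "DVI-D": 3.96,
    "DVI-D Dual Link": 8.16,
    "HDMI 1.0/1.1/1.2": 3.96,
    "HDMI 1.3/1.4/1.4a": 8.16,
    "HDMI 2.0": 14.6,
    "DisplayPort 1.0/1.1": 8.64,
    "DisplayPort 1.2": 17.28,
    "DisplayPort 1.3": 25.93,
}


def required_gbps(width, height, fps, bits_per_pixel=32):
    """Raw pixel bandwidth in Gbit/s, ignoring blanking overhead."""
    return width * height * bits_per_pixel * fps / 1e9


def links_that_fit(need_gbps):
    """Names of the listed links whose rate covers the requirement."""
    return [name for name, rate in LINKS_GBPS.items() if rate >= need_gbps]


# All three panels as one stream versus one panel per cable.
surround_single_stream = required_gbps(3 * 2560, 1440, 60)
one_panel = required_gbps(2560, 1440, 60)

print(f"one stream: {surround_single_stream:.2f}Gb/s ->", links_that_fit(surround_single_stream))
print(f"per panel:  {one_panel:.2f}Gb/s ->", links_that_fit(one_panel))
```

Driven as one stream, only DisplayPort 1.3 in the list clears the bar; split one panel per cable, each DVI-D dual link only has to carry about 7Gb/s.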

On the assembly side, assuming that we can sustain 60fps, we’re going to need 2.654GB per second of bandwidth to move the data around.  Keep in mind, however, that we’re talking about multiple bandwidth paths here.  The whole point of a high-end video card having a massive amount of local GDDR5 (6GB at 288GB/s in the case of the TITAN) is so it doesn’t have to pull data over PCI-E.  At this point in the evolution of real-time 3D rendering, pretty much the entire 3D pipeline runs on the GPU.  PCI-E bus transfers occur to load texture sets for consumption by the 3D rendering process.  Texture data is typically heavily compressed as well, since modern GPUs have dedicated hardware to deal with texture processing.

So let’s think about what’s actually occurring via PCI-E…  The CPU is executing game logic and determining what the current viewport should look like based on player input and the game logic’s response.  Did you move forward, did you turn, are you outside, etc.  The next step is for the game logic to stage the image.  What that amounts to is a whole bunch of math happening in the 3D engine.  The next step is for the 3D engine code to start the rendering process and update the presentation layer for the player.

3D engines are designed to interoperate with versions of graphics APIs (DirectX, OpenGL, etc).  They do their rendering work by making calls to the API.  In turn, the API interacts with the graphics processing hardware (GPU) by way of its device driver.  The level of hardware functionality exposed to the API is determined by the driver, and anything else is left up to software.  Either way, the 3D engine expects to make API calls and get back a standardized set of return codes (the whole point of a standardized API).

We’ve mentioned that at this point pretty much all of the tasks involved in 3D rendering (geometry setup, triangle setup, transformation, clipping, lighting, texturing, rasterization, rendering, etc) occur on the GPU, so the graphics API modules (DLLs) are primarily asking the device driver to tell the GPU to do things.  What the GPU needs in order to get its work done is the data for the staged scene (all of the math) and a set of directives for how the scene should be manipulated (how it is to be lit, what special effects should be applied, etc).  Most of these inputs are based on viewer positioning, game engine settings and game logic.  For the most part this all amounts to a tremendous amount of floating point math that needs to be done in a very short period of time in order to maintain a high framerate.

Incidentally, this is why the GPU is far more important these days than the CPU, and why the notion of a “CPU bottleneck” is wildly mis-stated most of the time.  Legitimate “CPU bottlenecks” that actually impact real world performance (and aren’t just synthetic benchmark proof points) are pretty rare.

So far most of what we’ve discussed isn’t extremely bandwidth intensive; it’s instruction traffic and primitives.  What is very bandwidth intensive is the last stage of the 3D pipeline, before the image is turned into a 2D surface, and that is texture mapping.  Textures are the magic that make modern 3D images look realistic.  Texture data in a game like Crysis 3, for example, is a heavy amount of data.  This is where engine efficiency comes into play.  GPU VRAM is used to pre-load textures, which are then decompressed and applied in real time, and 3D games tend to be naturally pretty texture efficient.  You stay in one environment for a while (indoors, outdoors, in the air, etc), and it tends to look fairly similar.  Textures can be re-used without breaking suspension of disbelief (office hallways are repetitious in real life, as are forests).  For hypothetical purposes though, let’s assume the worst case: every single pixel value needs to be loaded over PCI-E every single frame.  In that case a full 2.654GB of data would need to be loaded each second from system RAM.  Let’s look at our bandwidth using X79 as a reference:

System RAM bandwidth:

4 channels at 12.8GB/s per channel using 1600MHz DDR3 RAM = 51.2GB/s bandwidth

Total PCI-E bandwidth:

40 PCI-E lanes at 500MB/s per lane assuming PCI-E 2.0 = 20GB/s bandwidth (in each direction)

40 PCI-E lanes at 1GB/s per lane assuming PCI-E 3.0 = 40GB/s bandwidth (in each direction)

PCI-E bandwidth per GPU:

Assuming Tri-SLI and even one other PCI-E card (in my case the Creative Fatal1ty) you get 16/8/8 (32 lanes consumed)

At PCI-E 2.0 this means: 4GB/s per card (remember, data is duplicated across cards in SLI so one card having more bandwidth doesn’t really help)

At PCI-E 3.0 this means: 8GB/s per card
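To put the per-card numbers next to the worst-case requirement, here is a rough sketch.  It takes the per-lane rates and the 16/8/8 lane split from above and assumes, as noted, that the narrowest link is the one that matters because SLI duplicates data across cards.  The names are mine, not anything from a driver or tool.

```python
# Per-lane rates in GB/s and the tri-SLI lane split, both as given in this post.
PER_LANE_GBS = {"PCI-E 2.0": 0.5, "PCI-E 3.0": 1.0}
LANE_SPLIT = (16, 8, 8)
REQUIRED_GBS = 2.654208   # 1440P surround at 60fps, worst case


def per_card_gbs(gen, lanes=LANE_SPLIT):
    """Usable bandwidth per card; data is duplicated across cards in SLI,
    so the narrowest link sets the pace."""
    return min(lanes) * PER_LANE_GBS[gen]


for gen in PER_LANE_GBS:
    available = per_card_gbs(gen)
    used = REQUIRED_GBS / available
    print(f"{gen}: {available:.0f}GB/s per card, {used:.0%} used in the worst case")
```

Even with every pixel crossing the bus every frame, the x8 PCI-E 2.0 link is only about two thirds full.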

As you can see above, even PCI-E 2.0 should have sufficient bandwidth, even in the bonkers 1440P surround scenario, to feed the GPUs at 60fps.  How did this hold up in testing?  Quite well actually.  I saw no difference at all (literally) switching between PCI-E 2.0 and PCI-E 3.0.  I did verify using GPU-Z that PCI-E 3.0 was in use, but at this resolution, and with game engine efficiency being decent, the TITANs just can’t maintain a high enough framerate for PCI-E bandwidth to matter (especially since PCI-E 2.0 x8 is actually quite a lot of bandwidth).

The most interesting case occurred when I dropped Crysis 3 down to “global High” settings from “global Very High”.  At that point the TITANs were able to sustain a pretty regular 60fps with occasional dips down to 45fps, so I switched back to PCI-E 2.0 just to see what would happen.  On PCI-E 2.0, Crysis 3 with “global High” settings was still able to sustain 60fps.  The conclusion here is that even in 4k, or 1440P surround, PCI-E 2.0 is fine.  Even with GPUs that were able to push more pixels (sorely needed in my opinion), PCI-E 2.0 with current high end games would still be fine since any shortfall would be occurring well north of 100fps.
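As a sanity check on where a shortfall could begin, the sketch below turns the earlier worst case around and asks how many full surround frames per second an x8 link could carry if every pixel really did cross the bus each frame.  That assumption is deliberately pessimistic (real games move far less than a full frame of data per frame), so the real-world ceiling sits well above these figures.  The function name is hypothetical.

```python
# One uncompressed 1440P surround frame: 2560 * 1440 pixels, 4 bytes each, 3 panels.
SURROUND_FRAME_BYTES = 2560 * 1440 * 4 * 3   # 44,236,800 bytes


def fps_ceiling(link_bytes_per_second, frame_bytes=SURROUND_FRAME_BYTES):
    """Frames per second a link could carry if every frame crossed it in full."""
    return link_bytes_per_second / frame_bytes


pcie2_x8 = fps_ceiling(8 * 500e6)   # 8 lanes at 500MB/s
pcie3_x8 = fps_ceiling(8 * 1e9)     # 8 lanes at 1GB/s

print(f"PCI-E 2.0 x8 worst-case ceiling: {pcie2_x8:.0f}fps")
print(f"PCI-E 3.0 x8 worst-case ceiling: {pcie3_x8:.0f}fps")
```

Even under that pessimistic assumption, PCI-E 2.0 x8 does not run out until roughly 90fps, comfortably above the 60fps cap in surround.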

