Let’s Code Part 16: Fun with Shaders

 By Shamus Apr 10, 2011 58 comments

letscode1.jpg

It’s been a while since I talked about this series, but Goodfellow has been putting out a steady supply of really interesting work while my attention has been elsewhere. Part 16 of the series is now up, and it’s full of interesting ideas. Let me give a Cliff’s Notes version of what he’s doing:

Sending data to your graphics card is slow. (Relatively speaking.) Your graphics card is sort of like another computer. It has its own memory and its own processors. Your PC sends it a fat wad of data describing the position of the polygons in the world, and the GPU (your graphics card) has a jolly good think. When it’s done, it sends back the finished image. (Basically.) The problem is: There’s a limit to how fast data can be moved between the two. It’s like two bustling cities with vast ten-lane highway systems, but between the two is just a dirt lane.

The traffic between these places is measured in bytes. One byte can hold an integer from 0 to 255. That’s it. In C++, you can make a variable to do exactly this. If you’ve got two bytes, you can store values from 0 to 65535. Most of the time in graphics programming, we’re using variables called float. A float is 4 bytes, and can store non-integer numbers like 3.14 or 0.00001. When you send a vertex off to be rendered, it needs 3 float values, one for the x, y, and z values that say where the vertex is located. At four bytes each, that works out to 12 bytes. We also need three more floats to describe the texture. And we need three more for the surface normal, which is used to describe which way this vertex is facing, for the purposes of lighting.

That’s nine float values. At 4 bytes each, that’s 36 total bytes. If you try to render an object with 1,000 vertex points (chump change) you need 36,000 bytes, which is just over 35 kilobytes. Again, not a big deal. But once you start pumping millions of the dang things through the system you end up with a horrible bottleneck. You can send that data every frame and clog up your dirt road, or you can try to store it all on the graphics card and eat up all your GPU memory, but either way, you’re dealing with a glut of data.

But Goodfellow has implemented a really clever idea. Unlike more traditional games, a Minecraft-style world is made from cubes that are (assuming the programmer is not an idiot) exactly 1 unit in size. So even though you’re using float values that can store stuff like “12,552.08423”, the values are all 1.0, 2.0, 3.0, and so on. They’re simple whole numbers. They would fit in a single byte. In fact, you don’t even need the whole byte. Likewise, surface normals are usually able to define vertices facing any direction – you can make a sphere that is smoothly shaded. However, we’re rendering cubes, and the sides of a cube face in one of six different directions. Instead of three floats at four bytes each, we only need part of a byte.

So what he’s doing is reducing all of these values to integers, and “packing” them together. That is, several different pieces of data are sharing a single byte.

Imagine a guy doing the books for his company. In most cases, he’d fill in each bit of paperwork with the employee’s full name. But because of some freak of luck or extreme nepotism, everyone at the company is named either Adams, Smith, or Zoidberg. He can then save himself some hassle by filling out the paperwork with A, S, or Z as the last name, as long as he translates it back into the full last name when he goes to fill out the paycheck.

Now, this takes some extra thinking on the part of both the CPU and GPU. Goodfellow has to write his game to condense everything into this shorthand. Then he has to write another program for the GPU (called the “shader”) that will take the shorthand and turn it back into the full 36 bytes of data for rendering.

I’ve never heard of anyone doing something like this. I would normally be worried that a process like this would slow things down, but it turns out that Minecraft-style rendering isn’t really taxing the GPU. It has plenty of time for this sort of business. (Remember, the GPUs of today are made to draw bump-mapped polygons with several textures and all kinds of exotic lighting effects on them. Simply drawing flat cube faces leaves the GPU feeling bored and under-appreciated.)

So he’s getting all this for “free”.

There’s a lot more going on. Be sure to check out the full article.

58 comments. It's getting crowded in here.


  1. S. Richmond says:

That guy continues to amaze me with his low-level graphics programming hackery. Always a good read.

  2. Tizzy says:

    Are pictures recomputed from scratch for every single frame? If it is the case, there would be a tremendous savings in computation right here.

    • Piflik says:

      Yeah…the GPU computes the color of each displayed pixel for every frame…there is not much you can do about it…if you would store old information (like the position of polygons), all you would do is outsource the computation from polygons to pixels to the CPU, which is much slower computing such things compared to the GPU.

    • kerin says:

      You might think so, but the only way to tell if the frame you’re trying to render is identical to the last one is… to render it and compare. Which wouldn’t actually save time.

      • Tizzy says:

        I did not mean just checking if the two frames were identical but simply to compute the new frame by modifying the old frame rather than start from scratch. When you’re getting 30 frames per second, they cannot be very different from each other most of the time.

        Of course, it’s a lot easier to restart from scratch. Also, maybe it just doesn’t look good that way.

  3. Eric says:

    So what you’re saying is… voxels are the answer to everything? :D

    Okay, maybe not, but I appreciate your translation of the article. Graphics programming has always fascinated me despite not being a programmer or even much good at math… we look at games and 3D graphics all the time, but rarely do we sit down and try to understand just how the hell it’s all working under the hood, or how many millions of calculations are happening per second. The fact that it even works as well as it does is even more incredible.

  4. Daemian Lucifer says:

    “A float is 4 bytes, and can store non-integer numbers like 3.14 or 0.00001. When you send a vertex off to be rendered, it needs 3 float values, one for the x, y, and z values that say where the vertex is located. At three bytes each, that works out to 9 bytes.”

    Shouldnt that be “At four bytes each, that works out to 12 bytes.”?

    Also,with the current technology of parallel cores,why arent we fusing processors with graphic cards?It would decrease the need for data to go through the motherboard,and thus remove many of the bottlenecks.

    Though,I guess its irrelevant now that we are on the verge of analogue computers.

    • John Magnum says:

      Also,with the current technology of parallel cores,why arent we fusing processors with graphic cards?It would decrease the need for data to go through the motherboard,and thus remove many of the bottlenecks.

      I don’t really know a ton about it, but isn’t this kind of what AMD’s Fusion thing is supposed to be about?

      • Zak McKracken says:

        Yep, it’s exactly that.
The reason why it hasn’t been done a long time ago is that two heat sources in one processor die mean twice the work for cooling. But since processors with multiple cores have become the standard (and at the same time, processors aren’t increasing clock speeds anymore, for other technical reasons), it was mostly a question of how long it took to develop such a thing.
        They’re mostly low-power (consumption) processors right now (think integrated graphics chipset), but higher-powered versions are coming up shortly.

High-end graphics will still rely on separate graphics boards because an enthusiast is not going to just accept the graphics chip that came with the processor, much less the fact that it should have to share the RAM with the CPU (graphics RAM is much, much faster than your regular old PC RAM) — so these things will never really beat high-end graphics cards, even though they don’t have the bus bottleneck, because they gain a shared-RAM bottleneck.

    • Shamus says:

      Fixed.

      I have no idea how I managed to mangle it that bad.

    • Ben says:

      Also,with the current technology of parallel cores,why arent we fusing processors with graphic cards?It would decrease the need for data to go through the motherboard,and thus remove many of the bottlenecks.

Because die space is still at a premium. Single monolithic dies are hugely problematic to make for a number of reasons. First there is the simple fact that it’s harder to get consistently usable devices out of large dies. If you figure that there is some low probability of a defect per transistor on a die, then the larger your die gets, the higher the chance of show-stopping defects. Second, we have literal space limitations; without shrinking the process there are only so many transistors we can put on a die. Finally, large dies tend to have problems with power draw and dissipation; for a number of reasons they tend to use more power than multiple smaller dies.

      AMD’s Fusion is a step in this direction but this technology will likely be a low to mid-range technology because putting two very high performance devices on the same piece of silicon would be an engineering nightmare.

  5. Ambitious Sloth says:

That is a really amazing way to handle graphics in Minecraft, but I don’t think that what you just described is a cure-all. For instance, what about the non-cube blocks like torches or steps? They would have to have floats at numbers like 1.5 or so on. It would still be a relatively simple number, but it ruins the new shorthand method.

    Looking at the original post I see Goodfellow hasn’t figured it out yet either. Oh well, I’m sure there’s a good solution out there somewhere.

    • MichaelG says:

      I just thought of a good way! Now I have to try it and see if it works…

    • Zukhramm says:

      Actually, in Minecraft at least, even those types of block can only fit the place of one meaning the position should still only need to be an integer.

    • Benny Pendentes says:

      Think of the torch as a cube of glass that has a torch embedded in the center (or wherever). Then make the glass completely transparent. The torch still has to align on the cube grid, no fractional position values are required… the apparent ‘offset’ is not an issue that the CPU or GPU need to be aware of, since it is only relevant in that other processor, our heads.

  6. I have absolutely no interest in coding, nor any ability to do it or even properly understand it. It’s always been inherently boring and nonsensical to me.

    But Shamus, not only do I read and enjoy every one of these posts, but I’m slowly understanding more of it, which in turn makes me enjoy the next post more. Must be your natural talent.

    What I’m saying is, don’t stop posting this sort of thing!

  7. James Schend says:

    What? I know you’re trying to simplify this, but … it’s wrong in a couple areas.

    First of all, as Daemian says, you have a typo where a float suddenly becomes 3 bytes instead of 4. (BTW Daemian, we had analog computers in the 50s and 60s– the digital ones replaced them because they were more precise.)

    Secondly, if Minecraft transmitted cube coordinates using only bytes, the maximum size of the world would be 256,256,256 blocks– which it’s clearly not. The view in that screenshot above actually looks larger than an int coordinate (65536,65536,65536) would provide.

If Minecraft tried to pack more data into a single byte than simply a coordinate, the world would be even smaller– max of 128x128x128 blocks. Either I’m an idiot, or something here is just not adding up.

    • Daemian Lucifer says:

      “(BTW Daemian, we had analog computers in the 50s and 60s– the digital ones replaced them because they were more precise.)”

      Yes,yes,but thats not what the article is about.This is about next gen analogue components that are similar to current ones in size and price,but offer multiple states instead of just 2,and permanent memory.

      • James Schend says:

        We have permanent memory now. (The MacBook Air I’m typing on has 120 GB of the stuff.)

        It sounds like you’re talking about computer chips that run on base-3 or higher, instead of binary… that would work a lot better than analog does. (As I’ve said, we’ve already tried and discarded analog.) But it would also require starting from scratch on… everything, from the motherboard, to the memory, to the CPU, to the GPU, to the OS, to the drivers. The workload alone, and the fact that our good ol’ binary computers are still getting better and faster, makes a switch seem pretty unlikely in the near future.

(In true Wikipedia fashion, that article on memristors is *awful*, so maybe I’m not getting the idea. Based on what I’ve read, I could see it possibly replacing traditional memory in a SSD, since the driver could do the “translation”, but I doubt we’ll ever see it in a CPU or GPU.)

        • Daemian Lucifer says:

          Yeah,but like any wikipedia article its useful mostly for the links.Also,considering that the component was found in 2008 and memories using it should appear in 2013,its safe to assume that by 2020 we should have processors using it as well.And by permanent,I mean memory akin to a dvd,only as fast as current ram.

          Also,memristors do provide for a true analogue computers.The way it functions it allows for all the states between its minimum and maximum to exist.For example,integrated in a gpu,a memristor would allow for subtle changes in colour on its own.Its only limitation is the power output and its actual size.However,if I understood correctly,plans are to use them to make computers based on hexadecimal system first.

          • Eroen says:

If memristor memory is created, it would be exactly as analogue as DRAM, which incidentally also supports multi-level storage if you’re clever. In fact, pretty much the same way as multi-level (read: cheap) SSDs have done for a couple of years now.

The real issue, and why this is generally bad, is that you will always have noise. All electric components have it, to different degrees. I’m not familiar enough with memristors to have a guess at how much they are affected, but if you make a small enough change in value significant, you will have erroneous results. Case in point: if you operate in the true analogue mode (infinite valid levels) you will never ever get the result you wanted out. If this is not acceptable, you introduce quantisation.

            • Daemian Lucifer says:

              Unless you are using water cooling,your fan will drown out any noise memristors would put out.

              Well its initial advantages are not tied in with that,it will operate just like any other flash drive,only with bigger capacity.It will also allow for computers to start without booting,because it doesnt require current to keep the settings.So even if you lose power,your interrupted data would be saved.

              And afterwards,even if they stop at base 16 computers,it still will allow for much faster and more powerful components than the ones we have now.At roughly the same price.

              The down side,of course,is that this will allow for more powerful gpus,and that will lead to another boom in graphics first games.*sigh*

              • Rob Maguire says:

                Er, not the ‘sound’ kind of noise, this kind of noise.

              • James Schend says:

                It’s been 25 years, and we still haven’t been able to replace BIOS. I think you’re being really, really optimistic. I also doubt your electronics are going to be able to handle base-16, at least not for a long while… maaaybe base 4.

                And again, half of the advantages you list already exist… computers already can “start without booting”, it’s called hibernation, and it’s been perfected for over a decade now. Especially on a SSD computer, like my aforementioned MacBook Air, it takes about a third of a second to fill RAM from the SSD, so it pretty much always hibernates right away.

                • jwwzeke says:

                  There’s a HUGE difference between hibernation and what memristors are capable of. To hibernate, your MacBook Air needs to “know” it’s shutting down, save RAM to the SSD and finish any operations it was working on.

                  memristors (from what I understand) save their state AT ALL TIMES, at the circuit level, not as part of the intelligence built into the operating system.

                  Here’s the difference: Imagine that you have the power cord unplugged from your Air and you rip the battery from it during the middle of playing a video while also doing some massive calculation. A “hypothetical” machine built from memristors would, when power was reconnected, startup in a few nanoseconds, resume playing the video, and finish whatever math operation it had been in the process of working on when the power went away. It doesn’t need to think to startup, or load things back from disk, it just goes.

                  Basically the concept behind this type of electrical component is as close as you can get to just “stopping time”. When the power stops, time stops, when the power is back, time starts back.

                  Course it’s going to take some time to get this stuff going… but it’s likely going to basically BREAK all existing computer tech when it does.

                • Steve C says:

I’m not sure if it will be memristors, organic computers, quantum computers, optical computing or something else. But one thing for certain is that standard transistors (aka CPUs) made via photo-lithography will be as dead as vacuum tubes sooner than we think. BIOS dead along with it. Most of us never used a computer pre-Commodore 64 and certainly not pre-transistor. We just can’t imagine it because we are so used to transistors.

Moore’s law is surprisingly accurate when you compare punchcards to Watson and everything in between. Just look at the graph. It’s logarithmic. It’s hard for the human mind to conceptualize logarithmic consequences. We aren’t wired for it. But it’s safe to say bye-bye to Base 2 and BIOS within a decade.

            • Zukhramm says:

              “Memristor”, really?

              I don’t care if that thing can travel at the speed or light or make me immortal, I don’t want to use something with that name!

    • Christopher M says:

      The Minecraft world is bigger than 256×256, but the Minecraft visible area is not. All you have to do is do everything in local space – point 0,0,0 is the far upper left of the visible portion of the world – and adjust the data you send accordingly.

      • MichaelG says:

        I work in chunks of 32 by 32 by 32. Each chunk has an origin, which the shader adds in during processing. So the individual blocks can be small integer coordinates without putting any limit on the size of the landscape.

      • James Schend says:

        That makes more sense, but… that’s a pretty small visible world, isn’t it?

        • MichaelG says:

          At 1 block = 1 meter, it’s 1000 meters across. Not huge, but not bad for fully-detailed landscape. I’m working on it… :-)

          In addition to buildings made of cubes, there will be a procedural landscape in the distance, made of polygons. So mountains, etc. Although I’m planning on asteroids in my game. Buildings can be within hollow asteroids, if you happen to live on one.

    • Halceon says:

      I’m not the leading authority on this, but I’m pretty sure Minecraft doesn’t calculate the positions of the whole world when drawing, just the relevant chunks.

  8. Dys says:

    I was briefly saddened by nothing of mine being in that screenshot from twentymine. Then I realised the inventory I dug is right next to the snowglobe. Only problem is, like so many of my works, it is almost completely invisible.

    On topic, I do believe the hardware people are indeed working to fix the bottleneck issue, by integrating the gpu and cpu. I’m not sure exactly why a gpu is technically different to a cpu, presumably something to do with the architecture making assumptions about the work it will be required to do, a luxury the cpu cannot afford. Perhaps a post on that would be enlightening?

    Either way, I have heard suggestions that a decent cpu would be capable of doing everything the gpu is doing now, particularly if you had a multi-chip mainboard.

    • some random dood says:

      Very briefly – present CPUs have large numbers of possible instructions they can execute, over the whole range of possible computer operations used in applications (from finance applications, CAD/CAM, games, e-mail etc). They typically have low core counts (only now starting to get into double figures) so can only operate on very few “things” at the same time (pretty much – they can do one “thing” per core [ignoring hyperthreading and similar tech]). So basically – very flexible, but limited in the number of calculations it can do simultaneously.
GPUs have a very limited number of specialist instructions that are relevant to graphics processing (though recently this has been expanding as it turns out that the key instructions/techniques in graphics processing are also very useful in scientific and financial simulation work, so instructions are being added to aid these functions). They are also set up to process these limited commands by the hundreds (see advertising from ATI or Nvidia about the number of stream processors or whatever their marketing department calls these things now). So briefly – a very limited set of instructions, but dozens to hundreds of processing units to calculate these features.
      As to CPUs getting more cores and doing all the graphics, that’s the route that Intel has been trying to go with Larrabee. So far, it got to be about 2 years late, and pretty much scrapped as a releasable product, but research is underway on the next version of it. (Intel – great CPUs [well, apart from their netburst stuff], crap GPUs. And drivers.)

The path being taken by mixing CPU and GPU parts on the same piece of silicon is interesting (AMD’s version is called “Fusion” – can’t remember what Intel are calling theirs), but I don’t know the details (and this post is more than long enough already). Presently the tech seems to be aimed at low-to-middle-end graphics and possibly acting as a co-processor, taking on those types of computational load if suitable. I think it has to do with the fact that present middle-to-high-end graphics cards come with special graphics memory that can be written to (to allow the frame to be updated), while at the same time the rasteriser is reading the memory to be able to put its contents onto screen. When CPU, GPU, and rasteriser are all sharing access to the same pool of generic computer memory that can only do one thing at a time… (Haven’t seen anything yet on how the manufacturers are going to resolve this.)

      Sources to check out if you want to get more detail: arstechnica.com and anandtech.com are both pretty good for the techy stuff on hardware.

    • silver says:

Well, originally, there weren’t GPUs – the CPU had to figure out how to draw the whole screen and put that info into place for the graphics card which was limited to a direct translation function of “pixel 0,0 is color 0,128,128” to something the monitor understood to mean the same thing. The problem was that the CPU got to be really busy handling everything else in the system — the CPU always getting interrupted by the disk or the network card or the keyboard or some other silly component saying “I have your next piece of data now!” and the CPU has to drop everything and shuffle that data into a memory buffer somewhere. Oh, and it has to run your bloody program, too. And figure out how to turn 3D representations of things into “pixel 0,0 is color 0,128,128”.

      So they said, “hey, instead of the CPU having to do that work, we can make a specialized CPU that doesn’t know jack about handling interrupts or dealing with user programs, all it knows in the world is handling 3D graphics and turning them into flat screen representations.” The GPU is born.

      Now even with multi-processor CPUs and such, they still have a ton of non-graphics-related work to do. I don’t actually imagine they’ll be moving graphics work back onto the interruptable chips anytime soon.

      Also, totally ninja’d by “some random dood” who had a better answer and more up-to-date one on what they’re doing about the dirt road.

  9. MrWhales says:

Shamus also explained to me what a shader is, which is what I’ve always wanted to know. It’s like having 10 fingers to type with, instead of 1. Sure, you could do it with 1, it would just take longer, be harder, and more frustration occurs when something should be Capitalized.

  10. Ben Munson says:

If you use hardware instancing, you can specify the cube’s vertex offset from center, texture coordinate and normal details in one stream. Then you can render one chunk at a time, and upload one of that chunk’s corner world-space positions to a shader constant. Then you can fill your vertex stream with offsets from that shader constant position, which requires three bytes. And one byte for a texture index of some kind. I think that halves the amount of memory needed to four bytes?

  11. Zak McKracken says:

    So the bus is actually a bottleneck?
I remember lots of articles when the change from AGPx1 to AGPx2 came, and later to AGPx4 and AGPx8 … and each new test on Tom’s Hardware (were they better back then, or am I more critical now?) came to the same result that with then-existing games the link between graphics card and CPU was not the bottleneck, and the AGP bandwidth was never fully used.
    Then came PCIe, and PCIe 2.0, and now we have so much more bandwidth that I’m wondering why it is apparently now a problem.
Or is that because Minecraft-like games just have a completely different set of load profiles? Lots of geometry, few textures, almost no special effects. So the ratio of data vs. things that need to be computed by the GPU is much more on the side of data. Is that right? But then … if your world fits into 256x256x256 blocks, then why doesn’t it also fit into the graphic card’s RAM? Should I have read Michael Goodfellow’s posts about it? Dammit, I’m gonna have to, I guess…

    • Ben Munson says:

You normally can’t tell what the bottleneck is until you profile, but IIRC he uploads a full buffer of all the blocks each frame, which more or less stalls the GPU each frame. I’d wager there is a big performance win to be had by separating the blocks into regions that are far from players and don’t need to be updated as frequently, and ones near players that might.

      • MichaelG says:

        Actually, I load the chunks in the background, create a vertex list (and index list) then send those to the display. The only per-frame stuff is a call to render the vertex buffer for each chunk, and the transparent data.

        As I mention in the earlier writeups, transparent data has to be sorted, and so it’s potentially different each frame.

The GPU is spending time rendering. That image Shamus copied is a 1024 by 1024 by 128 set of blocks. Not all are visible, but it’s a lot of work rendering that, at two triangles per exposed face.

        The compression pays for itself by cutting the size of the data in the display. The complete vertex and index lists are around 600 meg for that image. That’s a lot of data for a GPU to crawl over, even if it’s doing next to nothing for those distant points.

    • Shamus says:

      “Lots of geometry, few textures, almost no special effects. So the ratio of data vs. things that need to be computed by the GPU is much more on the side of data. Is that right? ”

      You nailed it.

      Normal Game: Tightly controlled set of very optimized polygons, drawn with extremely complex lighting / texturing.

      Minecraft: Immense set of procedurally generated, extremely lightweight polygons.

      The difference between painting one large complex oil painting vs. scratching out 10,000 doodles.

      Or something like that.

  12. JPH says:

    This is completely unrelated, but how do I change my picture here?

    Presumably it’s using my avatar from some other website that I made a username for a long time ago, but I have no idea what that site is.

  13. Airsoftslayer93 says:

    I always get excited when i see pictures of the server up, you can just about see my buildings in this pic as well… still very interesting stuff, even if i only understand a tenth of it

  14. Spectralist says:

    “We also need three more floats to describe the texture.”
    3 floats for the texture? Shouldn’t that only be two?

  15. Brandon says:

    Shamus, this reminds me of something. When are you going to do more with your own hex terrain series? I found that fascinating and I’ve always loved hex strategy games.

  16. K says:

    I cannot resist but point out that a float cannot store 0.0001 exactly, but only something close to it, due to how the IEEE precision system works. That is one of the examples that work badly. You could use 0.0002, which works fine, but you had to chose one that doesn’t. Blame Murphy! :P

    http://support.microsoft.com/kb/42980 (exact same example)
    http://en.wikipedia.org/wiki/IEEE_754-2008 (math)

    I am such a smart-arse.

  17. But Goodfellow has implemented are really clever idea. > “a really clever idea”?
