Procedural City, Part 8: Optimization Tests

By Shamus Posted Monday Apr 27, 2009

Filed under: Programming, Projects 29 comments

The project is now officially over budget. I expended vast quantities of time obsessing over framerates and benchmarking this weekend. My budget was about 30 hours, and while I haven’t been keeping a rigorous tally of hours, I’m well past the deadline and not yet done. But the end is in sight. I’ve determined to get this thing done this week. All told, it looks like I’ll have sunk 40 hours into it. For perspective, if this were a game with an eighteen-month budget, I would just have missed going gold and admitted that we were going to need another six months. And that our entire staff spent six months of the budget playing Left 4 Dead. Good thing I don’t have investors.

I’m afraid this stretch of the project is likely to be a bit dry. It can’t all be colored pixels and bloom lighting. Sometimes I have to go and fuss over dull numbers and time things, which makes for unspectacular screenshots. I’ll do my best to make this interesting.

By now the program is running like a slow pig, and has been for a while. 30FPS for a little city like this is appallingly slow, and before I move on to grander things I need to know how it’s going to run.

The first step in speeding up a program like this is finding out where the bottlenecks are. There are several aspects of rendering that I look into when facing slowdowns:

  1. CPU bottleneck: The program just isn’t running fast enough and so it isn’t sending polygons to the GPU fast enough.
  2. Throughput bottleneck: The software is capable of sending the polygons faster, and the GPU is capable of rendering them faster, but because of driver or hardware limitations you just can’t send the data fast enough. You’ll run into problems like this if you’re trying to render a whole bunch of lightweight, easy-to-draw polygons. (With “bunch” in this case meaning “hundreds of thousands” or even “millions”.) If you’re drawing a bunch of tiny little triangles with no special lighting effects or texturing (what the heck are you drawing, anyway?) you can run into this bottleneck. I don’t encounter this very often because I’m usually working on the lower end of the tech curve. (Although I’m pretty sure I ran into it during the terrain project. I can’t remember now.) In any case, I understand this bottleneck is one of the reasons for the move from PCI to AGP, and from AGP to PCIe slot interfaces. The newer bus interface lets you pump more data to the GPU. I have… wait, let me write some code to do a count… Okay: 40,804 polygons. That’s not counting a few special-case groups like the cars, ground or sky, but that number is probably within 1,000 polygons of the total. So I don’t think I have to worry about throughput on anything built in this century.
  3. Fill-rate bottleneck: This has always been where I do most of my work. Problems like this usually come from filling the screen with expensive polygons. In contrast to the earlier case where I might have been drawing millions of cheap polygons, perhaps I’m slowing things down by drawing just a few expensive ones. Fifty polygons sounds pretty pathetic compared to a million, but if all fifty polygons cover the entire screen and do complex blending, then you’ll end up with a fill rate problem. Your program is sending those 50 polygons nice and fast, and there are so few they won’t clog up the hardware pipeline, but the graphics card is going to take a long time to draw them. Full-screen effects (like bloom lighting and depth-of-field) can cause fill rate problems.
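(For the curious, the polygon count mentioned above amounts to something like the sketch below. The Building struct here is my guess at the shape of the data; the series never shows the real structures. The only actual logic is that each quad becomes two triangles on the card.)

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-building record. The real project stores quads per
// building; each quad is rendered as two triangles.
struct Building {
  std::size_t quad_count;  // wall/window quads
  std::size_t tri_count;   // any leftover loose triangles
};

// Walk the scene and total up polygons. Special-case groups (cars,
// ground, sky) are deliberately left out, as in the article.
std::size_t CountScenePolygons(const std::vector<Building>& buildings) {
  std::size_t total = 0;
  for (const Building& b : buildings) {
    total += b.quad_count * 2;  // one quad = two triangles
    total += b.tri_count;
  }
  return total;
}
```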

An analogy I like: The CPU is a waitress taking orders. The throughput is the rate at which she can put order strips up to be read by the cook, who is the GPU in this case. Fill rate problems mean he’s not cooking stuff fast enough. (Yes, I know they’re called servers today, and not “waitresses”. But the restaurant used in this analogy is a roadside greasy spoon diner in 1978. She doesn’t care if you call her a waitress, as long as you tip well and the kids don’t make too much noise.)

The third type of slowdown is easy to spot. If making the window smaller speeds things up, then it’s a fill rate problem. If not, then it’s probably a CPU problem. I’m sure I’m not having a fill-rate problem, but I check anyway because you don’t begin research by assuming you know everything.
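Expressed as code, the rule of thumb looks something like this. (A sketch, not anything from the project itself: you’d measure the framerate at two window sizes and compare. The 10% threshold is my own arbitrary cutoff.)

```cpp
#include <string>

// Shrinking the window means far fewer pixels to fill. If that speeds
// things up, the GPU's fill rate was the limit. If the framerate
// barely moves, the GPU was never the problem: the CPU isn't feeding
// it fast enough.
std::string DiagnoseBottleneck(double fps_full_window, double fps_tiny_window) {
  if (fps_tiny_window > fps_full_window * 1.10)  // assumed 10% threshold
    return "fill-rate bound";
  return "CPU bound";
}
```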

I shrink the window. No change in the framerate. I shrink it to almost nothing. Still no improvement. Just for fun, I turn off writing to the z-buffer and change the program to draw the buildings as translucent. This will make all of those polygons many, many times more expensive and will ensure that the GPU has to draw every. single. one. Then I make the program run full-screen.

Take that, fancy-pants hardware! Let’s see how you like choking on 40,000 two-sided alpha-blended, screen-filling polygons!

Hm. That actually looks kind of cool.

No change.


Not only is my graphics hardware not the bottleneck (which I already suspected), but it’s not even being reasonably challenged. Going back to the waitress analogy, here we have a cook that can prepare meals faster than the waitress can write them down. She writes down a four-order meal with appetizers and desserts, and the food is done before she can get back out on the floor to take another order.

As someone pointed out earlier in the series, these new cards are designed for rendering with complex pixel shaders that do fancy bump mapping, specular mapping, texture layering, dynamic light passes, lighting objects at grazing angles, and a whole bunch of other intense math. On every single pixel it draws. Here I’m simply asking it to take one lone texture map and apply it to the polygon, and I doubt I could hope to keep the thing busy with such a lightweight job.

Actually, tests reveal there is one thing it’s slightly sensitive to, which is changing textures. Back in step one I made texture maps for the buildings. Think of rendering as painting. If I ask you to paint a red stroke on the canvas, then a blue one, then a red one, it will take longer than it would to do both red strokes and then the blue. It takes a moment to lower your brush, clean it off, load it up with paint again, and bring it back up to the canvas. I can get a small performance boost by making sure I render all of the buildings that share a common texture at the same time. With eight textures and (roughly) 3,000 buildings, rendering them in a random order will cause the graphics card to have to change paint over 2,500 times. If I sort the buildings, it will only have to do so 8 times. This gives me a modest performance boost of around 10fps. That’s nice, but it’s trivial compared to the real optimizations I’ll need to do. Properly, I should have gone after the low-hanging fruit first and saved inefficiencies like this one for later, but I did this one more or less by accident as part of my tests.
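The sort itself is trivial. Something like the sketch below, with a helper to count the “paint changes”: a texture bind happens whenever the next building’s texture differs from the one currently bound. (Again, the Building struct is my invention for illustration; the real project’s data isn’t shown.)

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical building record carrying the texture handle it uses.
struct Building {
  unsigned texture_id;
  // ... mesh data would live here ...
};

// Count how many times the renderer would have to "change paint" if it
// drew the buildings in this order.
std::size_t CountTextureBinds(const std::vector<Building>& draw_order) {
  std::size_t binds = 0;
  unsigned current = 0;  // 0 = nothing bound yet
  for (const Building& b : draw_order) {
    if (b.texture_id != current) {
      ++binds;
      current = b.texture_id;
    }
  }
  return binds;
}

// The fix: sort by texture before drawing, so all buildings sharing a
// texture render back-to-back. With eight textures, that means exactly
// eight binds per frame no matter how many buildings there are.
void SortByTexture(std::vector<Building>& buildings) {
  std::sort(buildings.begin(), buildings.end(),
            [](const Building& a, const Building& b) {
              return a.texture_id < b.texture_id;
            });
}
```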

(Note that I’m writing this after the fact, and I didn’t keep a perfect record of how much of a performance boost I got from each step. The numbers I give for framerates are vague recollections and guesses.)



29 thoughts on “Procedural City, Part 8: Optimization Tests”

  1. Eldiran says:

    This is actually still very interesting. As someone currently in an algorithms analysis class, it’s nice to see a real application of such an analysis. That said, I should really understand these terms better than I currently do… I don’t think I’ve even heard of fill-rate bottleneck.

    1. WJS says:

      It isn’t really surprising that an algorithms class wouldn’t cover graphics bottlenecks, is it? I mean, they’re pretty different subjects.

  2. Debaser says:

    That screenshot of the floating windows looks AMAZING. Again, I hope this turns out as a screensaver, or at least some background-able screenshots, because it’s looking great.

  3. Marmot says:

    Ah well, a 25% budget overrun is not so horrible (for now). At least it is a time budget.

    Though I have to say, optimization is now becoming a mystery to me. If this thing gets down to 30 fps, I’m wondering how some shiny new games manage to process much more complicated scenes and hundreds of NPCs inside them without sinking to something like 2 fps.

    1. WJS says:

      Because they process much more complicated scenes with hundreds of NPCs (although very few games actually have that many on screen at once) after the optimisation, not before. You can’t really compare a simple scene in an unoptimised program to a complex scene in an optimised one in any meaningful way. There’s no guarantee that Shamus wouldn’t find a bug that, when removed, would boost the fps a thousandfold.

  4. Knowing a little more about what task is done by which component, where the bottlenecks are and how to tell which bottleneck is in play is super useful for all sorts of troubleshooting. More!

  5. Michael Mchenry says:

    I’m really looking forward to learning where your bottleneck is. Aside from trial and error, are you using any profiling tools?

  6. Veloxyll says:

    Streetlights still look a tad high in transparentville! But it does look kinda cool

  7. wererogue says:

    You’re not generating the textures every frame, or somesuch are you?

    If you’re CPU-bound, I suggest having a poke around with a profiler, to see where in your code the time’s being spent.

  8. Rutskarn says:

    Hey, you’re over time budget, trying to deal with the vagaries and capabilities of new technology, and desperately mired in technicalities.

    Just like a programmer on a real game!

  9. ClearWater says:

    That reminded me of Marvin: “Here I am, brain the size of a planet, and they ask me to take one lone texture map and apply it to the polygon. Call that job satisfaction, ’cause I don’t.”

  10. Sean says:

    Yeah, I also wonder if you’re using any profiling tools, and would be curious to see what you’re using (but, judging from the description of the testing you were doing, I’m guessing that you’re not using one).

    In a case like this where you’re strapped for time, the tools will likely point you towards the most egregious bottlenecks, which will save you lots of “fiddle-and-test” changes to try to optimize.

  11. Blurr says:

    Going back to the waitress analogy, here we have a cook that can prepare meals faster than the waitress can write them down.

    I loved that one. I just love the way you mix comedy and information in your articles.

  12. Ell Jay says:

    “And that our entire staff spent six months of the budget playing Left 4 Dead.”

    Ah well, could be worse… Duke Nukem Forever, anyone?

  13. Chris says:

    What about looking into some General Purpose GPU coding? See if you can offload some of that CPU work to the GPU – or, in your analogy, get the cook to start taking orders.

  14. Cybron says:

    You have a randomly capitalized ‘is’ right before the ‘greasy spoon diner’.

    The transparent city looks pretty cool.

  15. Jos Metadi says:

    I’ve done some optimizing before. The problem isn’t necessarily making any one part more efficient. The problem is finding those few areas which have absolutely huge impact instead of wasting time on lots of areas that gain you a percent here, a percent there.

    From the old engineer’s anecdote: $1 for the chalk, $49,999 for knowing where to put the “X”.

  16. Felblood says:

    So, basically, the cook would be having a much harder time, if everyone wasn’t ordering nothing but cold Pop-Tarts, still in the wrapper?

  17. Julian says:

    I don’t know if the streetlights are any lower than last time, but it’s certainly not noticeable that they’re in the 6th storey anymore.
    I find this very interesting, and your analogies are quite understandable for someone like me, who knows what optimizing is, but doesn’t know how it’s done.

  18. King of Men says:

    Marmot: That’s why big-name games have LOTS of programmers devoted to nothing but optimising the engine. But also note that they have pre-generated, not procedural, content, so in Shamus’s analogy the pressure is on the cook, not the waitress.

  19. Jeff says:

    Turning transparency off turns it into an astoundingly beautiful “data” city. Like how they imagined the internet looks in the future. Really, a screensaver would be awesome, especially if it’s one of those that flies through the city, the city being generated as you go…

  20. Auraseer says:

    Here are two silly and obvious questions you have surely already thought of, in increasing order of silliness.

    First: are you running with a debugger attached? That would slow things way, way down. (Relatedly, make sure you are compiling in release mode with compiler optimizations turned on.)

    Second: you wouldn’t happen to have included some debug code to intentionally slow down the framerate, would you? I saw that happen when I worked at a major game company, in development on a MMOG you have heard of. The guy responsible had inadvertently checked in some testing code containing sleep() calls, and then forgot he’d done it.

    Assuming neither of those is the culprit, it does seem you must be CPU-limited. As those above me have suggested, slap a profiler on ‘er and see where all the time is going.

  21. pwiggi says:

    @Eldiran: It’s unlikely you’d encounter concepts like fill-rate (and even throughput) in Algorithms Analysis. With Algorithms, you’re looking at the run-time complexity, which is the theoretical performance, independent of any hardware. Here, Shamus is dealing with hardware-specific issues.

    In other words, the stuff they neglect to mention in most CS programs :P

  22. gtb says:

    That floating window thing is badass looking. The shareholders and I think you should scrap the project you’re working on right now and get a team together to synergize about a new direction for this project based on your floating windows image. Also since you went over budget, we’re going to have to ask that you only use 1/2 your current crew (we’ll swap the rest over to our seventh football game franchise sequel, they need some help over there.) We want you to really own this. Lastly, we need to talk marketing. We’re going to go ahead and cut your budget in half again, so that the marketing team can get this floating window image onto some boxes. We’re throwing the name “Boob Blasters” around, but right now you don’t need to worry about that.

  23. Eldiran says:

    Pwiggi: Ah, you’re right. I’m taking Computer Architecture under the same teacher at the same time, so I get the two classes confused constantly… it’s in Architecture that I’m learning about throughput. I’m not a very good computer scientist… <.<

  24. DGM says:

    Going back to the waitress analogy, here we have a cook that can prepare meals faster than the waitress can write them down. She writes down a four-order meal with appetizers and desserts, and the food is done before she can get back out on the floor to take another order.

    I work at a restaurant in the evenings, so I have to ask: would your graphics card be interested in moonlighting as a cook? It sounds like it has cycles to spare for the job.

  25. Blake says:

    Profiler? I barely know ‘er!

  26. Alexis says:

    This is a really interesting project. I’ve really enjoyed watching it develop. Just a few thoughts on the performance side:

    I’m not surprised you are software limited here. If you are drawing each building in its own draw call, then that’s only a handful of triangles per batch. You probably want to try to get thousands of triangles per draw if possible.

    You definitely want to minimize state changes. It sounds like you have progress there with the texture grouping. You could reduce texture changes even more by grouping multiple textures together into a large texture atlas and modifying the texture coordinates. Another option would be to bind separate textures to each texture unit and then select the per-building texture in a shader (you can fake this in fixed function as well).

    If you are changing the modelview matrix per-building you will want to pre-transform the buildings and just send down world space verts.

    If you are setting per-building material settings you could send those down as vertex attributes instead.

    If you can get rid of the state changes, and merge many buildings into a single vertex buffer, you should be able to get down to only a handful of draw calls per frame. You should be able to draw all the cars at once as well, maybe with point sprites if you want to cut down on buffer overhead.

    This was in OGL right? As a general rule, make sure you are using VBO if you aren’t already. I don’t think you’ll see much savings with the small buffer sizes, but vertex arrays will have some driver overhead.

    I second the suggestion above of using a profiler to look for any hotspots in your code. There are also some good graphics performance tools out there to spot any problems with the workload you are sending to the driver. If you have an nvidia card, they have some nice tools that work with OGL. Also, gdebugger has a free trial and has some performance analysis tools.
