## Procedural City, Part 9: Speeding Things Up

 By Shamus on Apr 28, 2009 Filed under:Programming, Projects 53 comments

 ↶ Procedural City, Part 8: Optimization Tests Procedural City, Part 10: More Performance ↷

Time to optimize the program. If you’ll recall, it’s running at (maybe) 30fps. I want it running at 100fps or better. The big thing that needs to be done is to take a lot of the weight off of the CPU. It’s thrashing through 4,000+ buildings every time it draws the scene, gathering up polygons and sending them to the renderer. Let’s see if we can’t cull those numbers.

The most obvious way to take some of the load off of the CPU is to not attempt to draw stuff that’s out-of-view. Right now when I’m rendering, I go through the list of 3,000 buildings and cram all of the vertex and polygon data into OpenGL. It then takes each and every vertex and runs it through a bunch of mathematical mumbo-jumbo to figure out the answer to the question: Where – given the current location and orientation of the camera – will this particular building end up on-screen? Nowhere? Oh? This building is behind the camera? Oh well. Chuck it. Let’s see the next one.

With a 90 degree viewing angle, the camera isn’t going to see more than 1/4 of the stuff around it. But the program is doing matrix math on everything before it gets to the point where OpenGL realizes I’ve been wasting its time.

There are a lot of techniques you can use to figure out what’s in view, but I want something simple to code and easy to manage. I have a lot of flexibility in my design and not much in my timetable, so it’s better to go for a rough system that solves 90% of the problem than for a robust and complex one that solves 98% of it.

Observation: The city data is more or less 2-dimensional. I can simply look at the angle to the object from the camera position on the horizontal plane. If it’s more than 45 degrees to the left or right, then it’s out of the 90 degree cone of stuff we can see, and it can be discarded before we go bothering OpenGL with it. But, in order to get an angle…

Let’s look at the code I use to get an angle from A to B on a 2d plane. Let’s see… a bunch of floating-point math. Some more floating point math. And an arctangent operation. Crimey. Finally there are some annoying comparisons and faffing about to make sure the angle stays between 0 and 360. This is not something you want to do to 4,000 objects in the scene. (Actually, that’s how many buildings there are. The total count of objects eligible for a ride through the OpenGL funhouse is over 9,000 once you take into account street lights and cars and some other bits.)

Doing an angle check to an object would possibly be faster than cramming it through OpenGL, although that might not be true in all cases. It might be faster to simply draw a one-polygon car than to do a check on all and then draw the genuinely visible ones. But even in a best-case scenario, it would still be giving the CPU a brutal workout while the GPU naps on the couch. To alleviate this, I divide the world into regions:

 Overhead view. The world is divided up on this grid. The red dot is the camera, and the blue is the cone that will be visible to the viewer.

Now instead of doing an angle check on all of those thousands of objects, I’ll just do the check on my 256 regions to see which ones are in view. At render time, I go over the visible regions and render their contents, ignoring everything else.

 Objects inside of the green regions will be drawn. Everything else will be ignored.

This is not without its drawbacks.

1) Buildings sometimes straddle the line between regions, meaning the center of the building might be sitting in a red (hidden) region but a bit of it is poking into a green (visible) region. It would mean that you’d see an empty lot on the edge of the screen, but as you turned to look a building would pop in out of nowhere. This is very noticeable and ugly. Identifying and dealing with these cases can get very tricky.

Solution: I widened the assumed viewing angle to 100 degrees. This means I’ll be rendering (or attempting to render) a lot of extraneous crap just to catch a few difficult objects. If I’m still having speed problems later, this is one thing I can revisit and try to improve.

2) Doing these calculations on a 2D plane is fast, but it assumes you’re looking straight ahead, level. Tilting the view downward will bring nearby red regions into view. As I look down and pan left and right I can see buildings popping in and out of view. That wouldn’t be bad, except that looking down on the city is pretty much the entire point of the project.

Solution: I mark all adjacent regions visible. No big deal. It’s only a few dozen objects. Doing the math to truly sort this stuff out wouldn’t be worth it. I doubt I could even measure the performance improvement from a dozen stragglers.

3) I’m drawing a lot of extra buildings. As soon as the corner of a region enters view, everything in that region gets drawn. I can make the regions smaller, and thus more accurate. This would cull out a lot of stuff that’s still way out of view, at the expense of giving me more angle checks to perform. Right now the world is divided into a grid of 16×16 regions, for a total of 256. If I made the world (say) 32×32 regions, I’d have 1,024 checks to do. I’d have twice the accuracy at the expense of doing four times as many calculations. I could also move to a 8×8 grid and do one-fourth of the calculations, at the expense of having even more overdraw. Hm.

This is a fiendish one, because the sweet spot – the optimal spot – is somewhere in the middle. A 2×2 grid would be worthless and cull no objects. A 128×128 grid would require 16,384 angle checks, which is actually greater than the number of objects in the scene. That’s worse than doing the angle check on each and every building individually. I want the region size to be big enough that every region will contain more than one building, but not so big that I’ll be attempting to render a lot of off-to-the-side stuff. (Note also that this interacts with the texture-sorting I talked about last time. If I’m rendering by region, I can no longer render all of the same-textured objects together. Smaller regions mean more texture switching. Sorting them before each rendering pass would fix this by having the CPU spend a lot more effort in order to save the GPU a little effort, which is the very opposite of what I want to do.)

Solution: Lots of tests later, and it is apparent that my initial choice of a 16×16 grid was already the optimal one. There are reasons why the optimal setting might be different on a different machine. I must not think about this or it will drive me crazy.

So far, I’ve worked the framerate up to 130-ish. I’ve technically met my goal, although framerate is still under 100 when the bloom effect is turned on.

I’m sure I can do better than that. I still have another round of optimizations to make. Ideally, I’d like to get it working so that it’s running at over 100fps even when bloom is on. We’ll see how the next round of changes go.

I’m sure some people will notice that I re-worked the sky to be a single gradient, added some simple radio towers on top of buildings, and added some other bits of sparkle power. None of it is set in stone. These were just experimental visual changes I made over the weekend and not related to any of this optimization work. I’ll have time to fuss with that stuff once I have the rendering working smoothly.

#### Footnotes:

 ↶ Procedural City, Part 8: Optimization Tests Procedural City, Part 10: More Performance ↷

1. Chris says:

I love this series, Shamus. Keep up the good work!

2. Chargone says:

aside from being an interesting look into such things, i really don’t have a thing to say about the project itself.

however, i really must say, that cityscape looks pretty darn spiffy :D

ok, that just didn’t sound right… let’s try something else…

it’s shiny.

very, very shiny.

in a good way, not in a ‘you made it highly reflective!’ way, because, you know, you didn’t….

I’ll stop now :D

3. Groboclown says:

Nice to see this coming along.

There’s some interesting math that can be done on a per-item basis for clipping purposes, but it looks like you’re on a good path for this one. Simple array-lookups based on line intersection will go a long way.

To include the missing buildings when looking down, a simple trick would be to include some pre-computed angle ranges. For example, below a certain amount, switch to calculating a transformed square on the grid; above a certain amount, use the triangle approach; and so on. Many of these calculations don’t need complex exact math.

4. Groboclown says:

Another thought that just came to me. For the buildings that straddle the lines and corners (or come close to them), they can be included in each of the squares that come close to including them. You’ll need to do a small bit of “throw away duplicate” work when processing these steps, though.

5. Agaib says:

It looks very nice, the only thing I think it’s missing is probably the fact that most cities have some unique buildings of some sort. Ones that really pop out as taller, more modern, or just strange. On the other hand, the point is that this is all procedurally generated…

6. JMcNeely says:

Very nice! I can actually stand the bloom at this point, which in my book means that it’s done right. Good job on the sky as well… I actually don’t have anything to say at all about the optimization but as far as the general ‘fiddly-bits’ you put in… I like them… my 2 cents worth.

7. Sashas says:

“But even in a best-case scenario, it would still be giving the GPU a brutal workout while the GPU naps on the couch.”

Unrelatedly, I really like this series. The terrain project is probably my favorite section of this site, and I would love to see more along these lines.

8. Dev Null says:

Need some neon signs for FrobozzCo…

9. Ingvar says:

Ooooh! Believably lit-windows patterns! Any interesting heuristics used?

10. Couldn’t you sub-divide your regions and then perform the check again, but this time only on the sub-regions that are crossed by the edges of the viewer? That way you may get away with the same grid resolution but with fewer checks. I love the way this is looking, especially with the bloom.

11. chuko says:

It’s a lot faster and more straight forward to use vector math than calculate the angles. You can take the cross product of the vectors: one from the camera to the vertex or region, the other from the camera along the edge of the viewing plane. The sign will tell you if it’s visible. It just takes a few products and sums, and you can optimize that by hard coding z=0.

12. Dave says:

Just to say I’m very impressed. That’s one mother of a city! I would say it reminds me of Shanghai, except that visibility there never exceeds 1/4 mile – it’s so convincing. I would have never guessed it was procedural if I didn’t know it.

Any chance you could simulate air traffic? Something similar to the road traffic except nav lights for helos among the buildings and jets on the horizon?

13. This is definitely reaching “seven kinds of awesome” territory. :D

14. Xak says:

I love the irony that an article entitled “Speeding Things Up” took about 5 minutes to load with my internet.

It would be really awesome if you could release some wallpapers of this cityscape as well. Maybe do some ultra-hi-res renders just for the hell of it, and then snap some piccies. I’m always in need of a good cityscape picture for my desktop. =P

15. Factoid says:

Very interesting read. Makes me wish I’d had some REAL programming classes back in my CS days. We wrote almost exclusively in C++, command-line based programs. While those are definitely the fundamentals of programming, I only had one class (an elective) where we did anything GUI based.

I would have loved to learn how to program using APIs, 3d interfaces, etc…

I’m much too out of practice to write that kind of code on my own now. I wouldn’t even know where to begin.

16. Simplex says:

“it’s shiny.

very, very shiny.”

You meant in in Firefly meaning of “shiny”? :)

“Ooooh! Believably lit-windows patterns! Any interesting heuristics used?”
It was explained in earlier post from this category.

17. SatansBestBuddy says:

Huh, I’d have thought that 9000 vid would be over 9 million views by now, given how huge it’s gotten.

18. Groboclown says:

@chuko:
The issue with gradient culling in this scenario is that this problem is about culling objects out of the view fulcrum, rather than individual planes on those objects. It was my original thought on the matter, too.

19. Kell says:

As I was reading through the first paragraphs I was thinking to myself “stick a label under each city block and do an angle check to see which ones are in the fov.” It’s the same principal as checking visdata in the id lineage of FPS engines, isn’t it? ( BSP uses precompiling as well of course )
I’m a total non-coder, so I’m chuffed that I can grasp the basics at least.

Once you’ve got a some vigorous optimisation going on, I think you’re going to find it much easier to work through the final details and finishing touches. I think you’ll realise that the images you’re trying to create are made entirely of light; no surface illumination, just jewels on velvet.
The sky is also better as a simple grade.

Also, sorry to nitpick but I can’t hold it in any longer: the clouds should not be green. There is no reason to have any green in these images at all. Though one could argue some of the office lights might have a greenish tinge, it isn’t really necessary.

I’m looking forward to seeing the cityscape in motion.

20. Nevermind says:

Um… lots of math and arctangent? To figure if objects fall into 90-degree angle? Ur doin it wrong.

Let’s say those edges of field-of-view are directed along vectors (a1,b1) – left edge and (a2,b2) – right edge (assuming camera is at (0,0)). If the unit vector in the direction of view is (u,v) you can trivially find those a and b. Actually, a1=u-v,b1=u+v,a2=u+v,b2=v-u.
Once you have these, any object at coords (x,y) is to the right of first edge, iff a1y-b1x is less than 0. Likewise, it is to the left of the right edge, iff a2y-b2x is greater than 0 (stupid wordpress ate all the signs).
You should only draw objects that are to the right of first edge and to the left of second. That’s just a little math requiring only addition and multiplication. No CPU is gonna choke on that.

21. Groboclown says:

A few more optimizations could be added here – pre-calculate the maximum height of each bounding box, to determine if it should actually be sent to the GPU (by comparing the z-axis on the view fulcrum).

22. Do you have a cap for how tall the buildings can be? (You must, I’m not trying to be insulting.) Anyway, if it’s manageable I thought it might look a bit better if you have some areas with a lower height cap, that way you can get more dramatic distance views. Or it could totally screw everything up.

It’d be interesting to see, though.

23. This is a very expensive form of math, and it’s all being done by the CPU. I could use vertex shaders to offload most of this work onto the GPU.
Well, there’s no need to use vertex shaders to do transform & lighting on the GPU, you could use the fixed shader pipeline (the one included even before shaders were programmable) and still have all work done on the GPU.
In fact, if you did that you could probably dump all your optimizations done here, since shader processors are incredibly fast in doing vector transformations (whereas the CPU is way slower)… and you wouldn’t have to write a single line of HLSL shader code.

• Shamus says:

Lorenz Cuno Klopfenstein: I’ve never come across that for OpenGL. Google searches only link me to high-level diagrams or modern programmable vertex shader examples.

24. Carra says:

I’m not an expert on computer graphics but isn’t clipping usually used to show only the parts that are visible to the camera?

25. Duffy says:

I must say I am thoroughly impressed. The bloom pic looks significantly better.

26. Wolverine says:

I tried to watch the video you are linking, but apparently it is not available in my country (Czech Republic). What was it? Was it yours?
“Was it the video of project in question?” He asked eagerly.

27. wererogue says:

“The most obvious way to take some of the load off of the CPU is to not attempt to draw stuff that’s out-of-view.”

Yikes! I read this and thought “Nonononono… CPU-side frustrum culling isn’t going to help your CPU load!” Then I realized that you were actually trying to *avoid* frustrum culling, and felt a lot better about it!

28. Alexis says:

HW TnL:
In opengl there is no way to determine (via the API) if a given implementation runs the TnL parts of the pipeline on the GPU or the CPU. This choice may even change based on workload or 3d state. Nearly all discrete 3d hardware will do this on the GPU, but it’s sometimes emulated on integrated graphics.

Part of the confusion here is that per-object frustum culling is not done in opengl (either in fixed function or in shaders). This is because the vertex processing parts of the pipeline work on a vertex or primitive level. Your driver is already doing aggressive culling here, but it doesn’t have the information needed to do efficient per-object culling.

Long story short: you are already getting HW TnL, but it won’t help with per-object culling.

If you want to optimize your culling approach a bit more, take a look at quad trees. It’s a hierarchical approach similar to the 2d culling you are doing here.

Having said all that, I still think the geometry of these buildings may be too simple for this culling to be efficient. How many verts/draw call are you getting? What method are you using for the draw calls?

As an experiment, if you increase the number of verts per building and make no other changes, do you see a slowdown in frame rate? Make sure you don’t increase the number of draw calls, just the number of verts per call. (This assumes you aren’t in immediate mode of course). If you don’t see much slowdown, then you are limited by the draw overhead. That means you may get as much benefit by batching together buildings into fewer draw calls as by culling them, and you can save on the frustum culling (say by working at a higher granularity).

Note that all of this is trying to optimize the vertex end of the pipe. Frustum culling (and batching) isn’t going to help any fill rate problems, because offscreen pixels are clipped anyway. Your earlier testing showed that you didn’t have a fill problem, but you’ll want to test that again after you optimize the vertex processing.

29. Julian says:

Here are some things I noticed:
1) That video isn’t available in Argentina, it seems. What is it about?
2) I think the sky actually looks better without the bloom lighting. Is it possible to only bloom windows and lights but keep the sky like normal?

Other than that, as usual, this is fascinating, and I’m looking to seeing it in action, in an available-to-all video.

• WJS says:

I’m pretty sure you’d need to be using shaders for that. I’m also pretty sure that bloom usually requires shaders period, however, and Shamus is getting bloom here in the fixed-function mode, so I’m clearly out of my depth a bit.

30. Avaz says:

To Wolverine #26 and Julian #30:

The video in the link was some footage from a DragonBallZ cartoon where one character asks the other about a foe’s power level, and the answer is “OVER NINE THOUSAAAND!”

31. Miral says:

I know that cards exist that provide hardware T&L but don’t provide programmable shaders. So there must be some way to do it. But there my knowledge ends :)

32. Are ya’ll still speaking English?

33. Christian G. says:

Did you try using Display Lists to store the drawing commands for your objects on your graphic card? I seem to remember them helping a lot with performance in a small program I made.

34. Dr. Strangelove says:

I know nothing about programming or rendering, but re: the use of procedural content to create a city that looks more complex than it really is…

It looks really good. I would not mind that as a backdrop in a current gen game (but I’m less fussy than some).

35. Shamus says:

Alexis: I’m using quad strips whenever I can, but there’s only so long you can make a strip on stuff this simple. So, you can probably get pretty close to the vert count by just assuming all quads. 2 polys = 4 verts, or about 80k verts.

80,000 verts, with just texture mapping info and color. No normals or material properties. Should be pretty lightweight. They’re compiled into a GL list during a three-second loading screen.

Thinking it over: You’re right. A quad tree would be the fastest. But I’m not sure how much more speed it would yield.

I’ll see how much more speed I need.

36. I’ve never come across that for OpenGL. Google searches only link me to high-level diagrams or modern programmable vertex shader examples.

I was just assuming that the programming model was similar to the one in DirectX, but I might be (very) wrong. :)
In fact, based on what Alexis said, you are already getting hardware T&L. You could improve performance by decreasing the Level of Detail of far away buildings, but in that case you’d have to generate more models and everything could get very complex.

By the way, isn’t the performance impact of the bloom filter a bit high? I don’t think that a single pass post-processing effect should cost you half of the frames per second.
Anyway, the result (especially with bloom) looks great! :)

37. Gary says:

The Bloom nightscape looks gorgeous! I love it!

38. Elzair says:

From pictures of cities at night, they seem to look like a damn circus! If you want a bit more realism, you could add in a light source with random colors at the top vertices of each building.

39. Anon says:

You know, once you’ve passed all your geometry to the GPU, you can leave it there and just move your camera. That’d completely remove your bottleneck.

40. Michael Mchenry says:

Sorry for the length, Shamus. Suggestions follow:

Do you actually know whether you’re CPU bound or GPU bound? Are you queuing up your draw operations and executing them all at once? That will enable you to easily track how many milliseconds executing draw calls vs. how many sorting and culling. Maybe that would be a major change to your pipeline, but it’s the sort of thing you can set up in an hour.

Can you easily comment out your draw calls? If you all of a sudden get 5000 frames per second, that tells you something.

I haven’t seen your code, but I have a gut feeling that your bottleneck is making the draw calls.

I’m not sure if you answered whether you’re using vertex buffers. If you’re not, it’s hard to imagine this wouldn’t speed you up tremendously. If your vertex buffers and index buffers are already in GPU memory, you don’t have to pump that triangle list over every time.

Batching a block of buildings in the same index and vertex buffers can dramatically reduce your draw calls further.

Alternative proposals: You might be able to use the same index buffer for many buildings if they have the same topoplogy, but different vertex buffers to account for buildings that are taller/wider/whatever.

You might have some buildings that are identical and therefore use the same buffers.

I also have to agree with Chucko’s suggestion of using the cross product for culling, rather than expensive trig functions despite what Groboclown says. It only culls on one plane, but it still throws out the half of the scene you don’t have to render.

Nevermind’s idea is interesting, but it is 2D, and if the camera is pointed down, you’ll have that problem of some popping you described above. Chucko’s plane is fixed to the camera, so it slants when the camera slants.

If you’re determined to cull 90 degrees instead of 180, you can still use the cross product against two planes (left and right) to make sure your buildings are between the two planes.

Also, if you are sorting your scene in a distinct phase, you can sort based on z to reduce overdraw.

41. Michael Schmahl says:

It seems to me that you should only need two arctangents. You have two lines, with slope (for example, looking at your picture above) -1/2 and +2. Now for each object that needs to be rendered, you only need to check that Δx-Δy/2>0 and 2Δy-Δx>0. (Where Δx is the horizontal distance between the object and camera, and Δy is the vertical distance.)

Still some floating-point math, but no messy arctans, except the two you need to find these lines in the first place.

• WJS says:

Shamuses unit scale is only one window. That should be small enough that casting the numbers to integers wouldn’t reduce accuracy significantly. That is, assuming that the int-conversion wouldn’t be just as expensive as the float calculations, and that the speed difference between int maths and float maths is big enough to make it worthwhile.

42. Cybron says:

Looking good, Shamus. I like the new skybox.

43. Nevermind says:

To Michael Mchenry:

The same method works in 3D – the left and right edges become planes, their normals can be found trivially, and dot product of (x,y) and said normal will tell us, on which side of the plane our object is.

The problem is, a building is more than a single point. But maybe checking all the vertices of a buildig’s bounding box will do the trick. That means 8 checks per building, not that much really.

And the same method can be probably used to cull using the top and bottom planes.

44. kjones says:

Factoid:
Sorry to nitpick, but computer science isn’t actually about any sort of programming – and it definitely isn’t about graphics programming or optimizations.

Programming is to computer science what math is to physics – it helps you get to where you’re going, but it isn’t the point.

45. Michael Mchenry says:

@Nevermind:
Ok, you and I are suggesting the same thing.

And yes, you’d have to check the bounding box to totally prevent popping, but then once you’re doing 8 checks per building, it becomes more important to use some structure like an quadtree to avoid checking every single building.

Given 8 checks per object, it may very well be more efficient just using one plane for culling to get rid of the objects behind the camera than to check against the left and right view frustum.

The truth is that once you’re using index and vertex buffers, and you’re sorting your draw calls so that you’re not swapping textures unnecessarily, the GPU does that frustum culling itself pretty efficiently.

Using a texture atlas might help. It looks like all the building textures might actually fit into one atlas texture.

I can only imagine that every time someone comes up with some great suggestion, Shamus just sees his budget slipping away.

The first 90% of the project was implementing the basic idea, and the other 90% became optimization.

46. Elethiomel says:

Why not just determine what regions each screen edge goes through (using a 2D line drawing algorithm that marks each region the line touches), mark those as “border” regions, determine one region that lies in view (go along the middle of your view from the camera until you hit a region that is not a “border” region), and 2D flood fill from there (stopping at border regions) to mark all regions to be rendered? There are plenty efficient 2D flood fill algorithms.
That’s a total of three angle checks to feed your line drawing algorithm.

47. Dan S. says:

A few suggestions, since everyone else is piling on. :)

(a) Rather than a fixed grid, you can use a quadtree for culling — start with a simple 4×4 grid, determine which quadrants might be visible, divide each into a 4×4 grid, and repeat until you are at a reasonably small resolution! (Note that there is no need for the recursive subdivide step if the quad is entirely (in)visible!)

(b) You can cull grid squares accurately — no fudge factor required — with (8 points at the extrema of each grid “cube”) * (4 culling planes) * (one vector subtraction + one vector dot-product per plane). Fast! Others have commented about this.

Cheers from news.ycombinator.com…

48. Dan S. says:

Quick note on the performance of my note (a) that I meant to add but didn’t get in before the edit period ended:

If you assume a 128×128-sized “finest” region grid (much finer than your 16×16), you can reason as follows with some number fudging:

At each level of the quad-tree, the grid squares that you’ll need to visit will be those cut by the view frustrum — in 2D, two lines; thus at the finest 128×128 grid level, the most it can be is around 256 and on average probably more like 128.

At the 64×64 level of the quadtree above that, you’ll need to visit at most 128 or so… at the 32×32 level, it’s 64, 32 at the 16×16 level, etc.

So you end up with about 512 is-quad-visible tests being pessimistic and probably more like 256, being a bit more optimistic — giving you a 128×128 resolution for easily within an order of magnitude the same # of tests as your current 16×16 resolution!

49. […] – Procedural City, Part 7: The Street-Level Trap – Procedural City, Part 8: Optimization Tests – Procedural City, Part 9: Speeding Things Up – Procedural City, Part 10: More Performance – Procedural City, Part 11: Demonstration Video – […]

50. SteveSegreto says:

Hi Shamus,

Great blog, I’m recommending it to my friends.

One thing I was curious about is why you didn’t just sub-divide the 2-d world into axis-aligned bounding boxes, like say 32×32 of them. Then put the buildings, streets, etc into those AABBs. Then frustum cull the AABBs using a standard plane dot vector algorithm off the net?

A handy one is to take the view (camera) transform every frame and decompose it into six frustum planes, then for each frustum plane, find the box diagonal that points in the same direction as the normal along each axis. Then test if the bounding box is in front of the plane or not. If its behind, its culled.

If a bounding box is culled, don’t submit any of the buildings to the GPU for drawing, and since you only have 32×32 thats just 1024 quick axis-aligned bounding box vs frustum checks per frame.