Programming Vexations Part 12: SOA vs. AOS

By Shamus Posted Thursday Dec 5, 2019

 Programming Vexations Part 11: The SchismPrevious Post

Next PostProgramming Vexations Part 13: Project Files 

In previous entries, I talked about how you sometimes need to worry about small blocks of memory. In some cases, you may even want to worry about how the data is arranged in memory. To show how this works, let’s look at our old friend the space marine:

Space Marines

class SpaceMarine
{
  bool dead;
  char[8] name;
  Vector3 position;
  float stubble;
  short bullets;
  short kills;
  short dead_wives;
  bool cigar;
};

This is a contrived example, but it’s good enough for illustration. This is what the data structure looks like for a single space marine. Here is that same list of fields, except I’ve color-coded them and made boxes illustrating how much memory each field takes up:

You can see that the name takes up 8 bytes of data. The position is a collection of X, Y, and Z values. Each of those is 4 bytes, so the total size of position is 12. We’re using 4 bytes to store how much chin stubble he has and 2 bytes for his bullet supply. You get the idea. If you add these all up, we need 32 bytes to store a single marine.

When these are stored in memory, they’ll be packed together like so:

Just remember that memory addresses aren’t arranged in a grid like this. Memory is a linear thing, which would mean it’s just one row that stretches on for billions of bytes. But that’s kind of hard to show in an image, so I’m having the memory wrap to another line to make this readable.

Conceptually, this is how programmers want to think of a space marine. He’s this self-contained thing and we manipulate his data every frame to make gameplay stuff happen. We run some sort of game update on him and change his position, subtract some bullets as he shoots stuff, add to his kill counter, and maybe grow out his chin stubble a little. Whatever.

But this probably isn’t the most common use-case for a space marine. This most common thing isn’t updating the marine, but checking for interactions with other objects in the scene.

The Most Common Operation

Judging by Gears of War, I'm betting the most common operation is Marine.Shout()

Let’s say in our scene, we’ve got a bunch of shit flying around that needs to interact with space marines. Bullets and lasers need to damage themAfter creating the above diagrams, I’m finally noticing I didn’t give the marines a hitpoint variable. So I guess all hits are an insta-kill in this game?. Explosions need to shove them around. Debris and physics objects need to bounce off of them. Particles need to avoid passing through them. Crushy machinery should squish them. Vehicles should run them over. Landmines should go off if a space marine stands on them. And so on.

So every frame, every one of these objects needs to crank through the entire list of space marines and test to see where it is so it can decide if an interaction is needed. For the vast majority of these tests, the answer will be “no”.

Like I said earlier in this series, your processor has a tiny pool of memory called the L1 cache. If the L1 cache is 64 bytes, then it can hold exactly 2 space marines at a time. So as all those game objects are cranking through the list, it will look like this:

check SpaceMarine #1

check SpaceMarine #2 (cache miss) check SpaceMarine #3 check SpaceMarine #4 (cache miss) check SpaceMarine #5 check SpaceMarine #6

(cache miss)

And so on…

If there are 16 SpaceMarines and 100 gameplay objects that need to interact with them, then you need to crank through the list of marines 100 times. That means we’re going to do 1,600 position checks, which will result in 800 cache missesActually, the L2 cache is really important and will help us out a lot here, but I’m trying to keep this as simple as possible..

These cache misses are invisible in the code. The programmer doesn’t write a line of code saying to get another block of memory. That’s all handled by the compiler. They just write a couple of lines to loop through all the space marines and check their positions. This invisibility is one of the reasons that programmers like me are often blind to their true cost. The thinking will go something like this:

“Man, all these landmines are doing a distance calculation every frame, and that involves doing a square root. Those are expensive! Maybe I should optimize that somehow.” But meanwhile the square root operation is a minority of the cost. Most processor cycles are being spent idle while waiting for things to arrive from main memory.

So What is the Programmer Supposed to Do?

Here’s the problem you’re trying to solve:

All of your various game objects (landmines, bullets, explosions, etc) need to do per-frame checks to see if they’re interacting with space marines. Every object needs to check itself against every marine. At this stage, you don’t care about anything except the marine’s position. The problem is that you’ve got all this other data – like the marine’s name and number of wives – mixed in with those position values. You don’t care about that data, but it winds up taking up space in the cache and causing performance-destroying cache misses.

A list of three SpaceMarines would look like this in memory:

The long distance between the address of the position variable for SpaceMarine #1 and the position variable of SpaceMarine #2 is the problem. Keep in mind that this is a very simplified structure. In GoodRobot, each robot had dozens of properties. And that was just a simple 2D game. Your real-world 3D game is going to have a huge distance between the position of SpaceMarine #1 and the position of SpaceMarine #2. Very likely, you’ll get a cache miss for every one.

I’ve never worked on a AAA project like this, but based on discussions I get the sense that the most common way to handle this is to copy the data into an array. Just before you do your checks, you copy all the position values into a temporary buffer. Then you do your checks against the buffer. In the rareRare from the program’s point of view. To the player, bullets are hitting marines CONSTANTLY. To the program, it did tens of thousands of checks before it found a case where a bullet hit somebody. occasion that a bullet actually hits someone, it can then interact with the full SpaceMarine data structure to see what happens.

The problem is that there’s an overhead to copying this data into a buffer. You’re now storing the same data in two different places. If the position of a SpaceMarine changesExplosions are notoriously shove-y. then you have to synchronize the data with the temp buffer or you might get weird behavior. This adds a lot of complexity to the code, particularly if you’re interacting with many different data values, some of those values need to change, and sometimes you have to deal with cascading changes like I discussed earlier in this series.

This is messy. Isn’t there some way we can have the best of both worlds? Is there any way we can skip this messy, confusing step with the temp buffer without all the performance problems?

Like I said, a list of three SpaceMarines would look like this:

If all of the SpaceMarines were packed in such a way that similar variables were stored together, then it wouldn’t be a big deal to crank through the list and check all the positions. A list of three SpaceMarines would be packed like so:

If the L1 cache is 64 bytes and a single position value is 12, then the positions of five consecutive SpaceMarines could be held in the L1 cache at the same time. Instead of triggering a cache miss with every other check, you’d get this:

check SpaceMarine #1

check SpaceMarine #2

check SpaceMarine #3

check SpaceMarine #4

check SpaceMarine #5

(cache miss)

That’s going to be more than twice as fast!

So why can’t they be packed like this?

Well, they can. Sort of.

Vexation #6: Refactoring Code is Not Programming

The problem is that in C++, packing the attributes of your marines together would require you to completely re-write the code. Everything related to SpaceMarines – every allocation, every check, every interaction, every deletion – would need to be changed. You’d have to give up your object-oriented class structure and break the variables into seperate lists. You’ll no longer have a list of 50 SpaceMarines. Instead you’ll have a list of 50 names, 50 positions, 50 cigar flags, and so on. Instead of deleting marine #5, you’ll have to delete name #5, position #5, stubble #5, bullets #5, and so on. If you forget one, then the lists will get out of sync with each other and you’ll get a bunch of strange bugs. If another programmer comes along, they won’t have that nice class definition to help them understand how this part of the code works.

In short, doing things the “right” way makes for lousy, verbose, clumsy, confusing code. Worse, switching between these two memory layouts is not easy. If you’re trying to compare the two, then it’s going to take a lot of fussing to gather the data.

These two different ways of storing data are referred to this:

Array of Structs:(AoS) A list of objects (like marines) where objects are stored one after another.

Struct of Arrays: (SoA) An object that contains lists of fields, so similar fields are stored together.

You can think of this as a list of containers vs. a container of lists. These are two conceptually different ideas and the implementation for them is very different. If you realize your project needs to use the other memory layoutOr if you’re curious and want to compare the two. then you need to re-write basically everything.

In gaming, you’re very often doing batch operations where you only care about one part of a large structure. Like, maybe we have a series of loops that look over all of the space marines and checks for a particular situation.

//Some C-ish pseudo-code
//Let's assume these marines have a hitpoint value, even though the earlier ones didn't.
foreach (SpaceMarine Jack in MarineList) {
  if (Jack.Hitpoints () <= 0)
    Jack.die (); 
}

If you have a list of 50 space marines and you’re using AoS, then you get a cache miss every single time you check a marine. However, if you do happen to find a marine that needs to die(), the die() function will probably run nice and fast because the entire soon-to-be-dead marine will already be loaded into the L1 cache. If that same list is stored in SoA format, then checking marines will be blazing fast, but processing the death of a particular marine will be slow because his individual fields are all scattered around in memory.

So to know which system is best, we need to ask ourselves: Which is the most common operation, checking or die()ing?

Assuming this is your typical video game, then obviously checking will be massively more common than die()ing. If you’re got 50 marines, the game runs at 60fps, you check for death every frame, and marines die an average of one every ten seconds or so, then you will need to check hitpoints 30,000 times for every time you need to call die(). Checking hitpoints is massively more common than calling die(), so it makes sense to optimize for checking.

Sadly, none of the C languages offer a way to store your data as a SoA without giving up all of your object-oriented design. To the best of my knowledge, no major language does.

I don’t know why this is. I’m not being snarky. I’m not saying, “I don’t know why those dum-dum language designers didn’t think of this.” There’s probably a shining good reason why compilers work this way, it’s just that those reasons fall outside of my area of knowledge. My guess is that our modern-day concerns over cache misses are relatively new in terms of language development.

If a language designed for games offered you a way to switch between a SoA and AoS as needed, then it would be really useful in cases where you need to do bulk checking on game objects.

JAI will supposedly have this feature, and I wonder if it’s not going to pop up in some other languages in the next few years. I find it hard to believe that video games are the only domain that would benefit from this.

Footnotes:

[1] After creating the above diagrams, I’m finally noticing I didn’t give the marines a hitpoint variable. So I guess all hits are an insta-kill in this game?

[2] Actually, the L2 cache is really important and will help us out a lot here, but I’m trying to keep this as simple as possible.

[3] Rare from the program’s point of view. To the player, bullets are hitting marines CONSTANTLY. To the program, it did tens of thousands of checks before it found a case where a bullet hit somebody.

[4] Explosions are notoriously shove-y.

[5] Or if you’re curious and want to compare the two.

Shamus Young is a programmer, an author, and nearly a composer. He works on this site full time. If you'd like to support him, you can do so via Patreon or PayPal.

 Programming Vexations Part 11: The SchismPrevious Post

Next PostProgramming Vexations Part 13: Project Files 

From The Archives:

84 comments

84 thoughts on “Programming Vexations Part 12: SOA vs. AOS”

Olivier FAURE says:

Thursday Dec 5, 2019 at 7:01 am

I for one am proud to be the first in this thread to say “D totally has this feature”.

Well, sort of. D has a very powerful compile-time reflection and metaprogramming syntax which allows for libraries that can make it easy-ish to switch between array-of-struct and struct-of-array. I’m told there’s a pretty good SoA library, but I haven’t tried it.

Reply
1. Retsam says:
  
  Friday Dec 6, 2019 at 2:25 pm
  
  It seems to be supported in Rust in basically the same way as D, with a compile-time macro. You define your struct, e.g. Marine, and put a “derive” annotation on it, and the compiler automatically generates a MarineVec data-structure that has basically the same API as an AoS, but is stored as an SoA. (“Vec” is basically “array” in rustaceanese)
  
  What’s neat about this approach is that it’s not something that needed to be baked into the language explicitly, and it seems that unlike the Jai approach, where a struct is either marked as SOA or not, with this approach you can use MarineVec for an SoA, and still make a normal AoS as Vec<Marine>. It would be a little harder to switch back and forth with the Rust approach (though it might be as easy as a find/replace of Vec<Marine> for MarineVec), compared to just adding the SOA keyword in JAI.
  
  Reply
Dev Null says:

Thursday Dec 5, 2019 at 7:15 am

Ok, so I get that you’re giving a simplified example, and maybe I’m leaning on the cracks in that simplification a bit too hard but…

Whats wrong with an object-oriented approach that stores the data in different places. Marine has a location attribute that’s just a pointer to the world (or map, or whatever) object that contains it. WorldMap contains an attribute which is Contents which is a list of coordinates and pointers back to the various objects in it. Coordinates are only in one place – so you don’t have the multiple updates problem – and they’re all in one tight-packed lump in memory. I guess the question is, how often do you start from a single object and want to ask where it is (which is now going to require a secondary lookup it didn’t need before), _without_ wanting to compare the locations with a bunch of other objects. And no, drawing everything isn’t on that list, because to draw everything you’re going to start with WorldMap and say WorldMap.Contents.Draw().

Hmmm. I can already see something of a problem. Collision detection requires not just your coordinates, but your dimensions. Still (and given that I’m making this up on the fly at 1am after a beer or two) it seems like some variation on this theme ought to help solve our memory caching problem.

Reply
1. Addie says:
  
  Thursday Dec 5, 2019 at 7:55 am
  
  Proper collision detection does require your dimensions, but it’s normally done in two passes. Assuming no space marine is more than eg. 2m tall, then no part of them is going to be more than 1m away from their centre. So do a first pass of all bullets to eliminate the ones which are more than 1m away from them: that’s a single test, and quite cheap. You can then do a second pass on a much shorter list of bullets, usually against a box over each of their limbs, torso, … to test whether it’s actually a hit, and does extra damage for a headshot etc.
  
  The reason that there’s not an OO approach that stores the data in different places is basically just that that’s not how it’s been done in the past. The compiler generates a memory layout for all the variables in your structure* and then passes it around to all interested parties – in C/C++ this is done with header files. The exact details are not important per se, as long as it’s consistent. No reason that you couldn’t organise it the other way, except that every 3rd-party library and external bit of code would have to have to understand this arrangement as well. Ideally you’d be able to change between them on the fly, and have special cases – your graphics card wants to see colours as a list of (red,green,blue,alpha) structures, positions as (x,y,z,w) structures, for instance, which is kind of an array of structures of arrays. You’d probably want to keep all of your structs together for debugging, too – that’s already difficult enough – and spread them out in a more efficient way for release testing.
  
  No reason that you couldn’t add this in to C or C++, although those languages are already a tangle; Jon Blow obviously thinks the better solution is to start over. That’s generally a common refrain of software engineers, though – Q. How many programmers does it take to change a lightbulb? A. Why change the lightbulb when we’ve the perfect opportunity to rearchitect the whole building!
  
  * actually, Shamus’ example is too simple – each variable would probably be lined up to the nearest 4-byte boundary, take up even more space and cause more cache misses. Similarly, there’s no point in using shorts for anything, unless they’re in an array – they take up the same amount of space as an int, otherwise.
  
  Reply
  1. methermeneus says:
    
    Friday Dec 6, 2019 at 9:44 am
    
    * actually, Shamus’ example is too simple – each variable would probably be lined up to the nearest 4-byte boundary, take up even more space and cause more cache misses. Similarly, there’s no point in using shorts for anything, unless they’re in an array – they take up the same amount of space as an int, otherwise.
    
    Well, unless he uses __attribute__((__packed__)) or #pragma pack or whatever the appropriate equivalent is for the compiler he’s using. Which is probably a complexity he was hoping the audience wasn’t pedantic enough to care about, since then he’d have to explain that line to the non-programmers (probably also the higher-level programmers) in the audience.
    
    Reply
    1. methermeneus says:
      
      Friday Dec 6, 2019 at 10:00 am
      
      Missed my edit window:
      To be fair, in a program that has to store massive amounts of data, like a video game, packing is a pretty fair assumption, and almost no one demos a pattern with packing, on the assumption you’ll use it as appropriate in your own code.
      
      Reply
2. Bloodsquirrel says:
  
  Thursday Dec 5, 2019 at 8:48 am
  
  Marine has a location attribute that’s just a pointer to the world (or map, or whatever) object that contains it.
  
  I don’t think this would help you very much. First off, your pointer is going to take up almost as much memory as the vector itself, and now in order to access it you have to go to two different places in memory (first the SpaceMarine, and then where vector itself is stored), almost guaranteeing not just an L1 cache miss, but probably an L2 cache miss as well. But it’s even worse than that, because you can’t just have a pointer to your coordinates in World’s list of coordinates, because World’s list is going to have to change when you add/remove items from it, so your pointer might wind up being invalid.
  
  Second, if you want to iterate through each SpaceMarine in order to see whether one is being hit by a bullet, then how does this help you? The only thing you have that’s continuous in memory is the coordinates. Is each one a SpaceMarine, though, or is it some other object? Do you have a specific list of SpaceMarine coordinates?
  
  Third, this kind of pointers-to-pointers and complex, interwoven structure of data references is a real bitch to manage, especially when it comes time to save/load your game, since now you have to both somehow store all of those links in a save format and then rebuild them all when you load the save. I’ve gone down this path while screwing around with writing a MUD, and found out that it’s far, far easier to have simpler data structure, even if they’re slightly less performant, than it is to have to reduce complex ones to a save state and reconstruct them again.
  
  Reply
3. Blake says:
  
  Thursday Dec 5, 2019 at 6:20 pm
  
  I’ve worked on console games that use an approach similar to what you’re proposing, but I’d call it a data-oriented approach not an object-oriented one.
  In this case it was one version of an ‘Entity Component System’.
  
  Basically each type of gameobject was an id and a list components. For example ‘SpaceMarine’ might have a ‘positionComponent’ and a ‘physicsComponent’.
  The components didn’t own their relevant data, but instead listed what kinds of data they needed. The data itself was stored in contiguous pools of memory which the components registered themselves with.
  
  For example ‘positionComponent’ might say it needs an entry in the ‘positionPool’ and the ‘directionPool’, while the ‘physicsComponent’ might also need the ‘positionPool’ as well as the ‘modelPool’. The space marine might be id:4, so the position, direction and model pools would all end up with an entry for id:4.
  
  The physics system then would be initialised with pointers the positionPool, the directionPool, and the list of ids registered of active physics objects, and would then be able to work on the tightly-packed positions and models and do the calculations it needed without even knowing if something was a space-marine or a surprisingly explosive crate.
  
  This system was originally set up to deal with the PS3 SPUs, but it works well for any multi-core architecture as it allows you to write lots of small processes that can run in parallel – you just need to ensure that you don’t have any overlapping outputs. If you do it probably means you need to break things up further.
  
  Again I wouldn’t call this an object-oriented approach, you’re never telling a single space-marine to update itself for example, it’s very much a data-oriented approach that considers what each system needs to work with and then registering each objects id with that system.
  
  Reply
Groboclown says:

Thursday Dec 5, 2019 at 7:35 am

My favorite computer book, Inside the C++ Object Model, doesn’t describe why the struct of arrays isn’t used, but instead discusses the internal mechanisms behind the array of classes.

Now, classes have vastly different needs from the compiler than structs, so this isn’t a good comparison. However, for objects, compilers have huge amounts of analysis and smart handling for virtual functions, multiple inheratance, and other object aspects that are passed with the argument and carefully managed by the compiler. Managing the virtual table is immensely complex so it works with all kinds of use cases.

Additionally, a structure of arrays, you would require two variables to declare a pointer to a structure – the pointer to the array and the index in the array. Most C++ internals require this to be sent as two descrete arguments.

On another note, that structure you defined for the space marine is extremely sub-optimal, because the commonly accessed elements do not align on word boundaries. I believe C++ is supposed to keep structs memory exactly as spelled out, but in some cases (which ones?) the compiler will expand the memory taken by the data structure to word align the values, causing the structure to take up even more memory than the programmer thinks.

Reply
1. Blake says:
  
  Thursday Dec 5, 2019 at 6:41 pm
  
  “classes have vastly different needs from the compiler than structs”
  In C++ the only difference between classes and structs is that one has private members by default and the other has public members by default. Perhaps you meant something else here?
  
  And to clarify the alignment issue, the compilers will (unless you explicitly tell them not to with #pragma pack directives) ensure that each variable starts at an offset that is a multiple of its own alignment. For all basic types that will be equal to its size. There will also be padding at the end of a structure to ensure consecutive ones remain aligned.
  
  Looking at the space marine example:
  
  bool dead; // bools are 1 byte
  char[8] name; // chars are 1 byte, this is 8 of them
  Vector3 position; // 3*float – 4 bytes each
  float stubble; // 4 bytes
  short bullets; // 2 bytes for a short
  short kills; // 2 bytes
  short dead_wives; // 2 bytes
  bool cigar; // one byte
  
  This would result in:
  
  bool dead; // +1 = 1 byte total
  char[8] name; // +(8*1) = 9 bytes total
  // 3 bytes of invisible padding so that position starts at a multiple of 4 (12)
  Vector3 position; // 12 + (3*4) = 24 bytes toal
  float stubble; // +4 = 28 bytes total
  short bullets; // +2 = 30 bytes total
  short kills; // +2 = 32 bytes total
  short dead_wives; // +2 = 34 bytes total
  bool cigar; // +1 = 35 bytes total
  // 1 byte of invisible padding, so that the total is 36, which will keep the position value aligned if 2 marines are in an array.
  
  This problem can get worse if you start using larger objects that require alignment, like in SSE programming you might use a __m256 which is a 32-byte sized type that needs 32 byte alignment. if you had just one of them and a bool in a structure you’d always be wasting 32 bytes on padding.
  
  The rule of thumb is just to align your members from largest to smallest, as the smaller members will always be aligned if the larger ones were. This means you can only get padding at the end and only the minimum required for that structure.
  
  // ‘Fixed’ space marine
  Vector3 position; // (3*4) = 12 bytes toal
  float stubble; // +4 = 16 bytes total
  short bullets; // +2 = 18 bytes total
  short kills; // +2 = 20 bytes total
  short dead_wives; // +2 = 22 bytes total
  bool dead; // +1 = 23 byte total
  char[8] name; // +(8*1) = 31 bytes total
  bool cigar; // +1 = 32 bytes total
  // No bytes of padding as 32 is already 4-byte aligned.
  
  Reply
  1. Leeward says:
    
    Thursday Dec 5, 2019 at 9:34 pm
    
    Shamus, you could have saved yourself this if you had included a note for pedants saying you understand alignment. Do you?
    
    Reply
    1. Shamus says:
      
      Thursday Dec 5, 2019 at 9:45 pm
      
      Given how ridiculously contrived the example is and how the whole thing is described for non-programmers, I was sure that NOBODY would be crazy enough to ask about padding. I mean, that’s obviously way beyond the scope of the article, right?
      
      Apparently not.
      
      For the record, I’m aware of padding but I very rarely worry about it. I’ve never worked on AAA projects, so for me padding is a bit of fun compiler trivia and not a daily concern.
      
      Reply
      1. Leeward says:
        
        Thursday Dec 5, 2019 at 10:18 pm
        
        Good to know. By the way, I didn’t intend to come off as a jerk. Your path is pretty unconventional and I genuinely wondered if you’d come across it.
        
        Reply
  2. Groboclown says:
    
    Friday Dec 6, 2019 at 4:38 pm
    
    Blake says: “In C++ the only difference between classes and structs is that one has private members by default and the othr has public members by default. Perhaps you meant something else here?”
    
    Classes also have inheritance, which introduces casting complexity, such as having the parent method work correctly with a child object with different fields. (The multiple inheritance ‘diamonds’ like IOStream are fun!) Once you introduce a “virtual” keyword, classes contain the virtual table, which also needs special consideration for inheritance. The best example of this is with a virtual destructor – there are Very Bad side effects for getting this wrong.
    
    Reply
    1. The Puzzler says:
      
      Saturday Dec 7, 2019 at 1:27 pm
      
      C++ structs can also do inheritance. Traditionally we use classes for stuff like that, but if you wanted to use virtual struct functions and so on, you could.
      
      Reply
      1. Richard says:
        
        Sunday Dec 8, 2019 at 4:49 am
        
        Honestly, in C++ the difference between struct and class is now entirely semantic.
        
        People tend to use struct for things they want to be POD (Plain Old Data), despite the fact that they very rarely actually are POD anyway.
        
        Reply
  3. Roger Hågensen says:
    
    Sunday Dec 8, 2019 at 10:55 am
    
    x86 (and x86-64) do not need alignment nor padding to read. SIMD would need it tough (can’t recall how much penalty you’d get if it’s not aligned).
    
    Ordering things from largest to smallest is not that dumb, but it would be better to group related data together, like 1 long 4 chars then 1 long etc. This could allow faster looping (by skipping or reading in 4 bytes at a time).
    
    Another thing is bitpacking. I’m sure that in this example the stubble states could probably be stored in 2 or 3 bits (0-3 or 0-7 beard lengths), dead could be a single bit, so could cigar.
    
    And is really 2 bytes needed for dead wives? I quick calculation tells me that if the marine actually reaches retirement before dying then 2 bytes would mean he would be limited to marrying only twice per day for 80 years. Even with 1 byte the marine could marry each 4 months. 2 bits (0 to 3 wives) would probably be enough or maybe 3 bits if it polygamy marriages. (the salary of a marine would not be able to sustain a very large family.)
    
    So bitpacking that stuff would reduce from 8 bytes to 4 and actually still have bits left over for future flags and states.
    
    There is also a second vector missing related to orientation/rotation/angle of the marine. And speaking of vectors, half precision (Float16) might be enough, this would make it possible to have position+angle take up half the space.
    
    Also the name should probably be a pointer to a string instead, this would halve the used space in the structure.
    
    As to the structure itself, due to both position and angle vectors totalling 6 values. Using a pointer to a vector in a vector list might be better. That way the vector list could have a id matching the marine ids and the hit loop could just loop through all marines, if a marine is hit it could then lookup the id in the marine list.
    
    Just because a language is object oriented does not mean you have to make everything a self contained object, OOP is just a tool in the toolbox. As long as the code is readable and efficient (and stable) you can do whatever you want to reach your goal.
    
    Reply
Knut says:

Thursday Dec 5, 2019 at 7:36 am

I guess cache misses are not so important in general application programming, which is what most programming languages are made for. Also it’s easier for the programmers when the different attributes for a struct (or object in OOP) are together, and as we know, programmer hours are expensive, and most programming is actually reading and understanding code. So it makes sense to store data in the most programmer friendly form.

Also, and I’m speculating here, maybe it’s more difficult to make good debugging tools etc with a SoA language?

Reply
1. Blake says:
  
  Thursday Dec 5, 2019 at 6:49 pm
  
  If the language is aware of the feature, then SoA isn’t a big deal, you just need smarter pointers that hold the array length and a bool for whether or not it is SoA. From there the math is pretty trivial.
  Retrofitting it into an existing language would be painful though, as you’d want SoA to be an option on a type not a requirement, which would mean (if you want things to be fast) you’d need that class to have some kind of [[SoA-aware]] attribute, then everywhere that could reference an array of them would need to be using a different pointer type.
  In C-family languages that would mean anywhere someone took a single pointer to one would also need to change as it would need to be a pointer to the start of the array, an index into the array, and whether or not it was an SoA one.
  
  It could certainly be done, but it’d be a large feature change used by a very small number of people who already have workarounds for the problem so it likely wouldn’t be worth the cost.
  
  Reply
  1. Decius says:
    
    Thursday Dec 5, 2019 at 9:14 pm
    
    You can’t put a bool into a pointer.
    
    At that level, it’s all bytes, and you could use a bit for the equivalent… but in reality you’ll end up using a whole byte anyway.
    
    Reply
    1. Addie says:
      
      Friday Dec 6, 2019 at 4:17 am
      
      Like Blake say, if you’re talking smart(er) pointers, you most certainly could write your own versions of std::unique_ptr and std::shared_ptr that carry a bool around with them as well.
      
      Since 64-bit raw pointers have a lot more bits then anyone has memory at the moment, there’s nothing stopping you from encoding additional information into the top couple of bits, either – you’ll just need to remember to strip them off before dereferencing the pointer. Nothing stopping you except good common sense, that is – C++ programmers have certainly been known to do worse things.
      
      Reply
      1. Decius says:
        
        Friday Dec 6, 2019 at 11:22 am
        
        ‘bool’ is a higher level construct than a pointer. You can’t put a bool in a pointer any more than you can put a comment or a file in one.
        
        Reply
        
        The Puzzler says:
        
        Saturday Dec 7, 2019 at 1:31 pm
        
        They’re talking about advanced structures for containing a pointer, rather than the low level pointer itself.
        
        Reply
        
        Decius says:
        
        Saturday Dec 7, 2019 at 2:13 pm
        
        You can’t talk about the abstractions in a discussion about cache misses and metal.
        
        Reply
        
        The Puzzler says:
        
        Saturday Dec 7, 2019 at 5:55 pm
        
        Smart pointers are a pretty std thing at this point…
        
        Reply
        
        Decius says:
        
        Monday Dec 9, 2019 at 11:38 am
        
        Everything that makes them smart is stuff that occurs far above the level of cache misses.
        At the level of cache misses, they’re just sequences of bits, because that’s all that anything is at that level.
        
        Reply
      2. Roger Hågensen says:
        
        Sunday Dec 8, 2019 at 11:02 am
        
        Yeah don’t do that please. Using the “top part” has caused issues in the past and still today.
        
        Some games and programs can’t be easily updated to utilize the full 4GB RAM that is available to 32bit apps on x64 systems since doing so will suddenly cause use/overwriting of those appropriated bits.
        Some programs can have a small tool set the exe flag that expands max memory from 2GB to 4GB but if a program does tricks with memory pointers it may crash or corrupt data.
        
        While 64bit pointers are much larger the OS will randomly give addresses to apps (as anti-hacking/anti-virus measures) , there is no guarantee the OS won’t suddenly put the game right smack in the middle of the pointer black magic the game is using and it’ll corrupt itself.
        
        The safest thing is to treat a memory pointer as a opaque and don’t mess with it at all.
        
        Reply
        
        Decius says:
        
        Monday Dec 9, 2019 at 11:40 am
        
        That’s safe, but low-performance.
        Dealing with an OS that doesn’t require or let you do your own memory management is a different situation.
        
        Reply
        
        Ruethus says:
        
        Friday Dec 13, 2019 at 11:55 pm
        
        In a college class this past spring, my Operating Systems class had a whole segment and project built around pointer virtualization, where the “pointer” the program gets and the real pointer are different, and page tables were used to translate the virtual addresses back into the real ones. This sounds like the random distribution you mention, but as part of the virtualization process, you shouldn’t be able to corrupt your own data unless you mess up your own internals, because the OS shouldn’t be re-allocating memory pages that are already in use.
        Admittedly, my knowledge of OS-level functionality is mostly derived from things professors told me, but I’m not sure that the corruption you’re talking about should be possible without OS-level problems (it’s also possible that what I learned was OS-specific and the professor didn’t say).
        
        Reply
    2. Chad says:
      
      Friday Dec 6, 2019 at 4:30 am
      
      That approach is common but not universal; there are systems that use ‘tag bits’, where they reserve (via convention and accessor functions/macros) a bit or two for specific purposes. Language implementations do this, for example. The most common conceptual example is probably signed versus unsigned types in C-like languages; one bit is reserved for whether the whole number is negative or not (yes, this reduces the available range of numbers that fit in the type).
      
      There are language designs that play around with “smart pointers” that are more than just memory addresses, but that’s not common in C, since (among other reasons) pointer arithmetic is one of the things that makes it possible to write really fast/efficient C code. (Also, not coincidentally, why C can break so hard when errors creep in.)
      
      Reply
      1. Decius says:
        
        Friday Dec 6, 2019 at 11:24 am
        
        CHANGES the available range of numbers that fit in the type, or at most reduces the range by 1 (if you count -0 as not in the range).
        
        Reply
      2. Richard says:
        
        Sunday Dec 8, 2019 at 5:28 am
        
        It is very common (on x86, amd64 and ARM) to use the least-significant bits as a flag, because it is either very slow (x86/amd64) or simply Not Allowed (ARM) to access unaligned data.
        
        So because an odd pointer Cannot Exist, if the LSB is set then the ‘real’ value must actually be with it cleared, and the flag is set.
        (In fact you usually have at least two bits, details vary by platform)
        
        With this idea pointer arithmetic still works as long as all items in the list have the same flag value – eg “is in a SoA”/”is in an AoS”
        
        It’s also commonly used for techniques like small-string-optimisation.
        – 8 ASCII characters will fit in the 64bits of a modern pointer, and short strings are common, so wouldn’t it be great if we could just store the whole string in the pointer we’d normally be using to point at the memory where the long string would be? For “free”?
        
        Reply
        
        Decius says:
        
        Monday Dec 9, 2019 at 11:47 am
        
        So, pointer arithmetic doesn’t work using that technique unless you make assumptions that preclude the low-level optimization that you’d want to use.
        
        Having a pointer to static data is just wrong to begin with; sure, it technically works, but you could just as well put the data in the place where you named the pointer.
        
        Reply
    3. Kyte says:
      
      Sunday Dec 8, 2019 at 1:29 pm
      
      You can, actually. That’s what tagged pointers are. For example, if you assume all pointers are aligned in 2-byte boundaries, you guarantee the least significant bit will always be zero. At that point, you’ve got one free bit to use to carry data, and you just spend a tiny bit of computing to AND it out before dereferencing the pointer.
      
      Reply
      1. Decius says:
        
        Monday Dec 9, 2019 at 11:53 am
        
        Why read it and AND it out with 0000000000000000000000000000000000000000000000000000000000000001 (assuming you’re using a 64-bit pointer) and then compare the result with 0000000000000000000000000000000000000000000000000000000000000001 and then JNE instead of reading it and doing a JPE or JPO?
        
        Someone actually skilled could probably dereference the pointer and get the parity flag from it using zero additional cycles.
        
        Reply
2. Leeward says:
  
  Thursday Dec 5, 2019 at 9:40 pm
  
  Cache misses are important for any domain that’s got performance requirements. That includes everything from network switches (not ready for the next packet? It gets dropped.) to spreadsheets (who wants to spend time waiting for their cells to update?)
  
  I don’t know why this pattern isn’t built into languages either. It’s not because other domains don’t care about performance though. Cache profilers are a thing.
  
  Reply
Joshua says:

Thursday Dec 5, 2019 at 7:55 am

Spelling error: “Array of Stucts:”

Reply
Bloodsquirrel says:

Thursday Dec 5, 2019 at 8:26 am

There’s probably a shining good reason why compilers work this way, it’s just that those reasons fall outside of my area of knowledge. My guess is that our modern-day concerns over cache misses are relatively new in terms of language development.

I think it has more to do with how an SOA breaks polymorphism. In an object-oriented language, you don’t just have SpaceMarine. You have the base class SpaceMarine, and the derived classes SniperMarine, SargeantMarine, and DriverMarine. Each of those base classes has additional properties, which would mean that you couldn’t create a SOA that contains a mix of all four classes, since those additional properties aren’t provided for in the arrays that you would need just for SpaceMarine. But that’s exactly what you want to do in OO programming. You’re supposed to be able to have an array of SpaceMarine, where you don’t know and don’t care whether it’s a regular SpaceMarnie, SargeantMarine, or DriverMarine.

Of course, this is where even your AOS breaks down in an OO language: You usually wouldn’t have an array that consists of SpaceMarines sequentially stored in memory. You’d have an array of pointers to SpaceMarines, with each SpaceMarine being stored somewhere else in memory and the value of each entry of the array merely containing the address to the full SpaceMarine. This allows you to have a mixture of classes that inherit from SpaceMarine in a single array, even though each one may take up a different amount of memory.

This is, of course, even worse for cache misses than an AOS would be, but most OO languages are designed primarily for maintaining OO methodology. C++ is the odd duck out, in that the manual access it gives you to memory allows you to do things like specify whether you want to declare a “SpaceMarine spaceMarine” or “SpaceMarinePtr spaceMarinePtr”, where the former means you’re allocating the memory specifically for a SpaceMarine (and not any derived class) on the stack, and where latter means that you’re just declaring a pointer to a SpaceMarine, which can be given the memory address of any derived class of SpaceMarine that you either create on the heap with “new SniperMarine()”, an existing SpaceMarine from some other list, or, if you’re desperate to create trouble for yourself, a SpaceMarine that you created on the stack with “SpaceMarine spaceMarine” and then deferenced.

In Java, none of this is possible. The only way you can declare or handle a class is through a pointer. Except that they’re not called pointers, because all of the manual manipulation of them that you can do in C++ is unavailable. Structs don’t exist at all. In C#, you can have a struct, which will operate in the way that the article’s example would, or you can have a class, which behaves in the way that Java classes do. You can’t create a subclass of a struct, however, and the generally recommended practice in C# is to reserve structs for simple types (such as a vector).

So why don’t languages implement SOA? Well, for most of them, it would mean accessing and worrying about memory in ways that the language is otherwise designed to avoid. Java doesn’t even guarantee that a simple array of integers is actually made up of a continuous block of memory. So the real question is “why doesn’t C++ implement SOA?”, and the answer to that question is probably that C++ consists of C with OO features bolted onto it. The truth is that the SOA is sort of a niche thing- in most applications, the performance-critical parts of your code don’t need OO features, and the parts of your code that need OO features are engaging in far more random memory access that makes a SOA useless from a performance perspective.

Case in point: You wouldn’t just be iterating through a list of SpaceMarines to check whether each one is at location X,Y,Z. You’d have the SpaceMarines stored in a quadtree (or some other data structure), that will allow you to determine how many SpaceMarines are in a given area far more efficiently.

Reply
1. Groboclown says:
  
  Thursday Dec 5, 2019 at 8:45 am
  
  Excellent analysis. I’ll extend this a bit by adding that the SOA vs AoS only has meaning when dealing with the low-level data stores, i.e. structs and arrays, and does not apply to objects/classes. At this level, you’re writing C code and the compiler assumes you know what you’re doing, abd it’s not much more than macros over assembly. All management of memory, and mechanisms for iterating through the memory are left to the programmer.
  
  On a side note I don’t know when the compiler is allowed to add padding to structs for word boundary alignment. To support things like file format data structures it needs to allow explicit memory mappings.
  
  Reply
  1. DavidJCobb says:
    
    Thursday Dec 5, 2019 at 2:26 pm
    
    On a side note I don’t know when the compiler is allowed to add padding to structs for word boundary alignment. To support things like file format data structures it needs to allow explicit memory mappings.
    
    I’ve not double-checked, but to my knowledge, MSVC at least will try to keep a struct’s members aligned to their own size. If you have three bools and an int16_t, then you’d end up with one padding byte between the bools and the short (i.e. the two-byte value gets aligned on a two-byte boundary). There are declarations you can place before a member to specify a custom alignment, though I don’t remember offhand what they’re called.
    
    Generally, if you need to read or write files, you’ll just manually write the code to load/save field-by-field (such that there is no padding in the file), or (if you’re in control of the data in the first place) you may just blindly read or write entire structs and count on you always using the same compiler with the same alignment behavior. (Bethesda did the latter for some specific data structures in Skyrim’s game data files.)
    
    Reply
2. Blake says:
  
  Thursday Dec 5, 2019 at 6:55 pm
  
  I think it’s even worse than that for cpp because you can normally take a pointer to an object, but what would that mean if you were taking a pointer to the 10th space marine in an SoA array of them?
  Your ‘pointer’ would actually need the pointer to the start of the array, the length of the array, the index into that array, and whether or not the array was SoA to begin with as the data for that marine would be in different places based on all those factors.
  
  It could be done, but I think it’d be best to leave it out of the language and instead make it a library based solution once compile-time reflection gets into the language (hopefully C++23).
  
  Reply
  1. Echo Tango says:
    
    Thursday Dec 5, 2019 at 8:04 pm
    
    Would you actually want a normal pointer to an SoA-style space marine? The indexes into those arrays are already the the useful way to get to a specific space marine’s data, and if you wanted to free the memory or move it around, that’s probably an operation best handled by whatever code is maintaining those arrays in the first place. (Not walking off the ends, keeping track of which space marines are dead, and can have their memory re-used the next time a marine is spawned in the world, increasing the sizes of the arrays if a level needs to spawn more than the default length…)
    
    Reply
  2. Decius says:
    
    Thursday Dec 5, 2019 at 9:11 pm
    
    Your ‘pointer’ consists of the address of the location in memory where those things are stored.
    
    Because, as you suggested, the size of the pointer is not fixed.
    
    Reply
3. Richard says:
  
  Sunday Dec 8, 2019 at 5:36 am
  
  It doesn’t actually break polymorphism.
  
  The ‘data-orientated’ way of looking at the problem is to think of your objects as being composed of ‘behaviours’.
  
  So a SpaceMarine is composed of the behaviours “HasPosition”, “CanJump”, “CanRun”, “CanDie”, “HasStubble”
  
  Behaviours inherit other behaviours – “CanRun” inherits “CanWalk”.
  
  The Unity game engine is based around this idea – all the ‘GameObjects’ have a list of behaviours, and you can add and remove behaviours at both compile and runtime.
  – Eg if your SpaceMarine falls too far, you might remove “CanJump” and “CanRun”, and add the behaviour “CanCrawl” until they get a medkit.
  
  Reply
  1. Bloodsquirrel says:
    
    Sunday Dec 8, 2019 at 9:59 am
    
    It doesn’t actually break polymorphism.
    
    Well, yes, it does. What you’re describing is a programming methodology that uses inheritance and polymorphism in a different way. But on the level of the language itself, you still have the same problem: The language is designed to allow inheritance. A SOA is incompatible with inheritance. It doesn’t matter whether you’re extending “SpaceMarine” or “CanWalk”- either way, you’re faced with the problem that your “CanWalk” might have a subclass with additional data, and since your language is designed to always allow a “CanRun” to be used wherever a “CanWalk” has been declared, supporting a language-level SOA for it is impossible.
    
    Or, perhaps more importantly, it doesn’t matter whether or not you have chosen to extend SpaceMarine or not- the language designers don’t know whether you will or won’t when they’re designing the language. The language has to support doing in a consistent manner- and in the case of an SOA, that means restricting it to “sealed” or “final” (non-inheritable) classes. Which- of course- means that you’re breaking polymorphism.
    
    It’s still possible, sure, but it goes against the design goals of the language for the sake of a niche application that your language isn’t generally used for in the first place. And, again, most OO languages don’t even have the basic prerequisite for making a SOA useful in the first place- a guarantee of continuous memory allocation. If you’re programming in Java or Python that’s been abstracted completely away from you. The entire idea does not make sense in those languages, and that’s fine, because you don’t need it for the applications that they’re typically used for.
    
    Reply
Pester says:

Thursday Dec 5, 2019 at 8:28 am

Spelling error: “diagrames” in the first note [1].

Reply
Michael Brazier says:

Thursday Dec 5, 2019 at 8:32 am

Most languages in the C family optimize for AoS because, in the typical programming domain, if you need to work with any fields of an object, you’ll probably need to work with all its fields; thus, caching the whole object helps performance. Operations that do the same thing to thousands of objects at once are rare. So, most languages optimize for AoS precisely to avoid cache misses as much as possible – they just make different assumptions about what use cases are common.

The only domain other than physics simulations (of which video games are a subset) where batch operations are common is databases – which have special languages, most notably SQL, to describe them. A relational database is arguably a really big SoA, and the problems with using object-oriented languages to program databases are well-documented; it’s not that surprising that you’re seeing similar problems in video game programming.

Reply
1. Retsam says:
  
  Friday Dec 6, 2019 at 1:18 pm
  
  I think this is the best answer in the thread – it’s not that “AoS is better than SoA for avoiding cache misses”, it’s that “AoS is better than SoA in specific circumstances” – like a “fireGun” method might look like:
  
  fireGun(Jack) { if(Jack.bullets > 0) { Jack.bullets -= 1; Jack.gun.fire(); if(Jack.cigar) { // Remember, smoking is cool. Jack.smoke(); } } else { Jack.gun.click(); } }
  
  In this total realistic AAA code, if “Jack’s” data is spread all over a SoA, then each attribute is going to be a cache miss: Jack’s cigar attribute isn’t stored anywhere near his bullet count in memory. Whereas in an AoS, all of Jack’s data would likely load into cache after the first access.
  
  Even in AAA game dev, I imagine code that looks like this is probably more common than code that loops over an array of marines and accesses one property from each: but it just so happens that the bits of code that benefit from the SoA technique happens to be the most performance critical code in the app – the collision detection logic is going to happen exponentially more often than the “fireGun” logic, so it makes sense to optimize for it.
  
  Reply
Leipävelho says:

Thursday Dec 5, 2019 at 9:14 am

At the end of this series Shamus is going to surprise us all with a finished a game.

Reply
1. Duoae says:
  
  Thursday Dec 5, 2019 at 12:13 pm
  
  .But we have to write the compiler for it, as the twist? :D
  
  Reply
Cubic says:

Thursday Dec 5, 2019 at 9:21 am

Modern databases may provide some implementation guidance, see ‘column-oriented DBMS’.

https://en.wikipedia.org/wiki/Column-oriented_DBMS

Reply
Chris says:

Thursday Dec 5, 2019 at 9:38 am

Since that Vector3 is presumably 3 floats, in practice there would be three bytes of padding after name in order to align the first element of Vector3. There would also be a byte of padding at the end in order to align the next element in an array of SpaceMarine. Those 4 bytes of padding could be eliminated by moving a bool and a short immediately after name. Many compilers have extensions to force tighter packing, but whether or not it improves performance is hard to say without actually measuring it. Unaligned accesses can be slower, even on x86 (e.g. unaligned SIMD loads).

The data-oriented SOA example with “dead dead dead” followed immediately by “name name name” is misleading. In practice you wouldn’t normally have the different variable groups packed next to each other, particularly because you couldn’t add a fourth marine without moving everything around. All that matters is that the same variables are grouped together, and it’s perfectly fine, and perhaps even preferred, if other variable groupings are out on their own cache lines, or even their own pages. The point is that you don’t litter your cache with irrelevant data (striding over it) while you’re iterating over your marines.

Sadly, none of the C languages offer a way to store your data as a SoA without giving up all of your object-oriented design.

I’ve never actually used something like this in a real program, but this got me thinking about it how to tackle it within C++ OOP:

// Backing “SOA”
static float entity_x[1024];
static float entity_y[1024];
static float entity_z[1024];
static int entity_next;

struct Entity {
int i;
Entity() { i = entity_next++; }
// getters/setters
float &x() { return entity_x[i]; }
float &y() { return entity_y[i]; }
float &z() { return entity_z[i]; }
};

This would need more of course (rule of three/five), and a smarter “allocator”, but I think this is the gist of it. Each entity is really just an integer, an index, and the accessors reach into the backing SOA at that index. Hopefully compilers wouldn’t be too confused and could still generate vectorized loops from this.

Reply
1. Rolf Andreassen says:
  
  Friday Dec 6, 2019 at 12:15 am
  
  As written, that doesn’t solve the problem of cache misses. To get the position of a single marine, I have to look up x[i], y[i], z[i]; and because you declared them as separate arrays those could be on opposite sides of main memory for all you know. There’s no very obvious way for the compiler to know that an x lookup is going to be followed by a y and a z lookup and get all three into cache, and indeed, even if it did know that and placed the three arrays sequentially, x[i] is still separated from y[i] by 1024*sizeof(float).
  
  Now, clearly, there’s a fix for that:
  
  static float position[1024*3];
  
  // class boilerplate
  
  void Marine::getPos(float& x, float& y, float& z) {
  x = position[idx];
  y = position[idx+1];
  z = position[idx+2];
  }
  
  (where I’ve got vectors as separate x, y, z variables, but obviously feel free to insert an actual class with, like, vector math features if you’re writing non-example code). Now the compiler can hardly avoid loading up the three floats into cache at the same time. But although the fix is pretty easy, this does demonstrate how annoying it is to have to think about this memory-layout stuff every damn time, instead of just zapping the variables into an object. Worse still, what if you sometimes but not always needed the stubble along with the position?
  
  Reply
  1. Chris says:
    
    Friday Dec 6, 2019 at 8:42 am
    
    It really doesn’t matter if x, y, and z are far apart in memory, for two reasons. The first is virtual memory. In my example, each of x, y, and z are on their own pages (each array is 4kB), and those pages could be mapped anywhere in physical memory, including right next to each other, regardless of their virtual addresses. When you’re talking about big pieces of data, their relative position really doesn’t matter. (Exception: Jumps between code can be encoded more efficiently when they’re close together, but only if the compiler knows this.)
    
    The second, and more important reason, is that the cache isn’t a single window into main memory. It’s a bunch of little windows called cache lines (64 bytes wide on x86). Each of x, y, and z will load on its own cache line. Further, modern CPUs have memory-level parallelism, and so, if the program is written/compiled properly, can fetch x, y, and z in parallel — essentially fetching all three in the same amount of time it takes to fetch one. If you’re reading x, y, and z in order across each array, as you would when iterating over all marines, then the CPU will even predict the program’s behavior by pre-fetching the next part of the array before it’s needed. This is what makes data-oriented design so much more efficient.
    
    You’re right that this is all less efficient when we’re accessing only a single entity. That’s 3 cache lines, 192 bytes of cache, spent just to access 12 bytes. The traditional AOS way would fit the whole entity in a single cache line, or at worst straddle across two. But SOA isn’t optimizing for this case. It’s optimizing for accessing parts of many entities, such as when we want to check some property of every marine in the game.
    
    Even better, with common fields packed together, the compiler has much more opportunity to vectorize, so that multiple marines can be processed in a single instruction. For example, suppose we wanted to count the number of marines underground (y < 0). Once the loop is warmed up (pipeline filled, cache is pre-fetching), a high-end x86 CPU (i.e. with AVX-512) could make this check for 16 marines every clock cycle, single-threaded. Adding arbitrarily new properties to the marine wouldn't impact this performance, either, since they're stored separately.
    
    Reply
    1. Rolf Andreassen says:
      
      Friday Dec 6, 2019 at 11:00 pm
      
      That’s very interesting; thanks! It also demonstrates how difficult it is to optimise this stuff without being a specialist in it – my knowledge of what the hardware does was out of date, and boom, my reasoning is suddenly way off.
      
      Reply
      1. Roger Hågensen says:
        
        Sunday Dec 8, 2019 at 11:08 am
        
        You don’t have to be a specialist i it, you just have to specialise the optimisation for the task, and finding the right way to do that is not easy. A list of marine position an a list of clutter on the ground might need quite different optimisations.
        
        Reply
Decius says:

Thursday Dec 5, 2019 at 11:12 am

Distance comparisons don’t have to involve a square root unless you’re adding distances together.

If you want to see if (x1,y1,z1) is closer than 10 units from (x2,y2,z2), you do (x1-x2)^2+(y1-y2)^2+(z1-z2)^2<100, comparing the square of the distance between the points with the square of the threshold.

For finding which point is closer, find the point with the smallest squared distance, &tc.

If your coordinates are ints, you can get the squared distance without even using floating point operations, and save another handful of cycles.

But you lose ALL of that optimization and more when you do something stupid like iterate over every pair of objects to check collision, rather than only checking collision between objects in the same or adjacent areas… a practice that requires that the areas know what objects are in them.

Reply
Decius says:

Thursday Dec 5, 2019 at 11:25 am

Oh, the real reason why AoS vs SoA design is hard is because it used to be the programmer’s job to do memory allocation, and when it became the compiler’s job to do memory allocation they used the best practices of programmers doing memory allocation, which were focused on being able to correctly find the right bytes over and over again- things like indexes and offsets.

The ideal situation would be that each data value is stored in only one location, and the ones that will often be operated on together are stored together. The computer can easily store the location data of all of the marines together, and the inventory of each marine together, so that it can check to see if a marine is hit by a bullet quickly, and then drop everything that they are carrying. But that means that when you ask “where is this marine in memory” you get “It’s position starts at #marineposition+12*#marineid, their inventory is at #marines+56*#marineid, and how brightly their cigar is burning requires you to look at the light source index.”

Reply
1. Roger Hågensen says:
  
  Sunday Dec 8, 2019 at 11:24 am
  
  “it used to be the programmer’s job to do memory allocation”
  
  I’m so happy someone said this. As someone who prefer unmanaged modular procedural programming I always cringe when I see managed code where the programmer has little to no control over memory.
  
  It’s similar to the same issue with general optimization. In the past games and software was more closely optimized to the hardware, these days there is so much wasted potential (even on consoles).
  
  Many just throw stuff into the loops these days too instead of preparing variables before entering the loop, maybe getting stuff into the CPU registers so that you can run the loop almost fully on the CPU registers avoiding any extra calculations or memory reads inside the loop (unless you really have to).
  
  If you are lucky a modern language and compiler has been created such that it’s able to do these things for you. But it’s still hard to beat ASM or C compiled exactly the way it was coded. The C and C++ language has not changed much, what has changed dramatically is the standard library.
  
  As the GHz speed of CPUs have now plateaued at around 4-5 GHz max, and computation is being solved through throwing more cores at it, parallelism and cache optimization is where future gains will be. Especially cache as the CPU cache is really expensive, way more expensive than DDR4 and GDDR6/HBM2, and not just due to the memory itself but also due to the die size it eats up (which could be used for cores).
  
  Besides certain tasks, core count is not that much of any issue yet as most games and programs use 4-8 threads/cores. The last few years the average sneaked up from 4 to 6 I think. Sure the OS eats up a few threads too. But a 8 core 16 thread CPU should last a long time, the cache though is another matter. A lot of the security issues with Intel’s CPUs where due to shortcuts taken to speed up the caching (or avoid caching) or cache ahead of the actual use. With this weaknesses fixed the performance are reduced and the only quick solution is to add more physical CPU cache as you can’t do as much branch prediction so instead you have to keep more previous stuff around for longer.
  
  Reply
  1. Decius says:
    
    Monday Dec 9, 2019 at 8:47 pm
    
    I’ve never managed to do so myself, but I admire the programmers who are able to defy the difference between data and code by having an area of memory that does both.
    
    Completely unmaintainable code, of course; even more so than simply self-modifying assembly where the logical destination of a jump is context-dependent.
    
    Reply
The Coach says:

Thursday Dec 5, 2019 at 12:12 pm

Clearly “Stubble” is the hit point variable.

Bullets and Lasers damage and reduce a Space Marine’s Stubble, and clean shaven Marines die instantly…

Reply
1. Echo Tango says:
  
  Thursday Dec 5, 2019 at 8:07 pm
  
  This is true – all space marines are clean-shaven in their caskets during the funeral. Now we know why!
  
  Reply
Jamey says:

Thursday Dec 5, 2019 at 12:23 pm

One option (with low level languages) is to handle the memory yourself for an array of marine positions, and in your class store a pointer to the location in the array for the position of that marine. With the usual caveats of it’s a giant pain the the butt and very error prone and hard to debug. But that’s the price for squeezing out every cycle of performance.

Reply
Riley says:

Thursday Dec 5, 2019 at 12:33 pm

My understanding is that most engines solve this with components. https://gameprogrammingpatterns.com/component.html
Basically the space marine holds pointers to smaller objects that can be more efficiently iterated through with fewer cashe misses.

Reply
C.J.Geringer says:

Thursday Dec 5, 2019 at 2:21 pm

“After creating the above diagrams, I’m finally noticing I didn’t give the marines a hitpoint variable. So I guess all hits are an insta-kill in this game?”

Stubble is HP. when clean shaved the marine dies.

Reply
1. Roger Hågensen says:
  
  Sunday Dec 8, 2019 at 11:35 am
  
  As a RPG mechanic it would make more sense to use larger beard as a chance of a gas mask leaking for example, so clean shaven would improve the masks seal around the face.
  
  Reply
Rick says:

Friday Dec 6, 2019 at 2:20 am

Speaking of Good Robot, I spotted Good Robot’s logo in the Programming Tutorial PDF linked from this page on DFRobot.com (the file is hosted on GitHub.

It’s on page 85.

Reply
1. Echo Tango says:
  
  Friday Dec 6, 2019 at 9:14 am
  
  Apparently the culture around copyright/patents/trademarks is a lot looser in Asia, especially around electronics/physical goods. /shrug
  
  Reply
Noah says:

Friday Dec 6, 2019 at 3:40 am

Languages so low-level that you directly manage memory layout (e.g. C, C++, Java) want to give you more customisability than a language-layer AoS implementation could easily do – though with enough horrifying bending over backwards I’ll bet you could do this with C++ templates for your specific use case, even with some level of switching back and forth. But to get a good AoS layout for your specific use case you’d need to custom-code it.

Why? Because think about reallocation and object deletion in AoS. If you have a lot of objects deleted out of the middle, how do you handle that? Free-memory bitmap? Copying structures and updating pointers to them? There’s no good general-purpose high-performance answer, and you’re specifically doing this for the performance. This is a *complicated* tradeoff, which at least in C or C++ would clearly belong in an external library (where you could have several implementations with very different tradeoffs) not in the standard library.

If JAI wants to do this in a “generic” way then it’s going to need to be *very* opinionated about how and when you get to delete objects – does it keep a free bitmap, which is a waste for implementations with very little deletion? Does it update pointers, which is complex and messy and not thread-friendly? Does it say “delete rarely and in big batches,” which is a giant headache for the developer?

Languages that are so high-level that the language could reasonably abstract this away with only about that language’s usual level of overhead (e.g. Python, Ruby) are already so high-overhead that if you’re trying to mess around with memory layout to save L1 cache, you’re already completely hosed. In those languages L1 cache locality is already completely broken and you’re trying to save on things like database queries and network round trips that are multiple orders of magnitude slower.

Reply
1. Addie says:
  
  Friday Dec 6, 2019 at 4:37 am
  
  C++ does have this in the standard library – for example, std::vector’s backing store is guaranteed to be contiguous in memory, for compatibility with C code that expects that. Insertions might trigger the store for the vector to be reallocated somewhere else; deletions cause everything after that to be moved down an address in memory. The expectation is that, if you’re directly accessing the backing store, you should know what you’re doing to manage that, and just use the standard vector access methods if not.
  
  The C++ standard library templates also generally allow you to specify a custom allocator, for when it really matters to you – mostly it doesn’t, but it is there in the standard library if you need it.
  
  I’d imagine that for most struct-of-array operations, iteration is going to be a great deal more common than insertion or deletion. If Jai’s iterators were modified to automatically skip items marked as deleted, then that would be fairly transparent to the end programmer – you could then call a once-per-(frame / second / whenever) synchronised method to compact all of the vectors in memory again. Jon Blow has said that Jai can swap between AoS and SoA storage in a simple way, which I’d visualised as a system of annotations, like Java code would – this could be extended to specify how insertion / deletion takes place in an array if that’s required.
  
  Reply
  1. Decius says:
    
    Monday Dec 9, 2019 at 8:57 pm
    
    “deletions cause everything after that to be moved down an address in memory”
    
    Well, there’s your performance problem.
    
    If your problem is on the scale of cache misses, adding a read/write cycle to your data and rebuilding the index every frame is not going to improve performance. That would be like someone occasionally having trouble finding the 2% milk, so the supermarket shuts down and does a full inventory after each customer.
    
    Reply
2. Bloodsquirrel says:
  
  Friday Dec 6, 2019 at 9:17 am
  
  Languages so low-level that you directly manage memory layout (e.g. C, C++, Java)
  
  Java?
  
  Reply
  1. Echo Tango says:
    
    Friday Dec 6, 2019 at 4:07 pm
    
    I was going to ask the same thing – isn’t Java a garbage-collected language without direct memory access?
    
    Reply
    1. tmtvl says:
      
      Friday Dec 6, 2019 at 6:46 pm
      
      Well, there are ways of directly shoving bytecode into the JVM, but that’s even clumsier than putting assembly in your C.
      
      Reply
Jack V says:

Friday Dec 6, 2019 at 5:45 am

Hm. Now I think about it, you MIGHT be able to do this with C++ magic. But it’s not going to be very pretty.

Something like, have a SpacemarineData struct. Have a SpacemarineVector class (actually a general GameVector class templated to use SpacemarineData in this case), which internally allocates a blob of memory the size of N * SpacemarineData. But it doesn’t store N contiguous SpacemarineData structs, instead for each data member it stores an array of N values. And when you index an element from it, it returns a magic dataless wrapper class that remembers the pointer to each separately and serves them up when you use the appropriate member access -> operator. So now, accessing an individual marine is a bit slow. But all the xyz coords are contiguous and to optimise you can have GameVector functions that lets you access one specific data member only, which (hopefully) when compiled turn into a loop over an array of simple values.

I don’t know if anyone actually tries to do that, or just hardcodes the equivalent by using an index to refer to space marines and just knowing you need to know what array to index it into.

Reply
tmtvl says:

Friday Dec 6, 2019 at 9:28 am

Amethyst (a game engine for Rust) kinda combines AOS with SOA: Entity and Component.

Because idiomatic Rust isn’t quite as close to the metal as raw C I’m not sure how perfomant this is, but it has it’s advantages.

Reply
Kathryn the Avionics Hardware Engineer says:

Saturday Dec 7, 2019 at 9:50 am

*starts reading thread*

*backs away slowly*

(I actually did read most of it. You guys lost me, yes, but not as quickly as mechanical engineers do, so there’s that! Meanwhile, I’m just waiting for the day we have a thread on radiation effects on electronics.)

Reply
Richard says:

Sunday Dec 8, 2019 at 5:03 am

Sadly, none of the C languages offer a way to store your data as a SoA without giving up all of your object-oriented design. To the best of my knowledge, no major language does.

C++ does permit this, transparently and in the exact way you mean.
In fact, in C++ you can even flip them back and forth to quickly benchmark which is better.

C++ allows a class/struct to overload operator new() – the function that allocates the memory and creates a new object.

So, you override operator new() in your “position” class to allocate all “position” data elements in a single contiguous block.
(Obviously there are details around what to do if there are too many for a single block and thread-safety but nothing ‘hard’)

Then in use, all the “position” data elements are magically stuck together and you don’t even notice.

You can do this for each of your ‘often-walked’ types.

Of course, to get the full benefit you also need to alter the iterators used to walk the lists so it walks the “positions” rather than the “marines”.

Reply
1. Richard says:
  
  Sunday Dec 8, 2019 at 5:16 am
  
  This is one of the things that game-specific libraries and ‘engines’ do – and why they often have engine-specific types for things like Vector2D, Vector3D, Vector4D, Matrix4x4 etc.
  
  It’s not because there isn’t a standard matrix library, it’s because they can then ensure that “Position” vectors are optimised for the way that specific library or engine deals with them.
  
  The moment you start having a language decide this kind of thing, it steps into game engine territory – there is One True Way and if thou disagrees, thou shall be cast unto the void*.
  
  Reply
  1. Cubic says:
    
    Monday Dec 9, 2019 at 10:05 am
    
    Underrated.
    
    Reply
raifield says:

Sunday Dec 8, 2019 at 8:00 pm

I have nothing especially useful to add to this article, but I do want to say I enjoy reading all of the Programming Vexations now that I’ve (essentially) been told I need to learn Python in order to keep my job. Fortunately Python seems easier to learn and use than C++ or C#, at least for a neophyte such as myself.

Reply
1. Pinkhair says:
  
  Monday Dec 9, 2019 at 2:36 am
  
  I am hoping for Shamus to get into Houdini, and start talking about the vexations of VEX script.
  
  Reply
2. Bloodsquirrel says:
  
  Monday Dec 9, 2019 at 5:28 am
  
  It’s easy to learn, but also really easy to get yourself in trouble with if you aren’t disciplined. You can get a lot of run-time errors that C++ or C# would have caught at compile time.
  
  Reply
default_ex says:

Monday Dec 9, 2019 at 10:18 am

I honestly never understood the problem of array of structs or struct of arrays. Either way to do any real work with it you have to do a bunch of math on the indexer. However I do tend to favor spatial arrays of structs. That is arrays of structures that have geometric indexing properties (both in the array of structs and the structs data layout). That’s part of why I find Hilbert Curves so fascinating, they facilitate that into any dimension count where each added dimension is a relationship between the objects. To me that’s separating a huge part of what makes the argument with a compromise in effort to construct the dimensions in a meaningful way.

Reply

Thanks for joining the discussion. Be nice, don't post angry, and enjoy yourself. This is supposed to be fun. Your email address will not be published. Required fields are marked*

You can enclose spoilers in <strike> tags like so:
<strike>Darth Vader is Luke's father!</strike>

You can make things italics like this:
Can you imagine having Darth Vader as your <i>father</i>?

You can make things bold like this:
I'm <b>very</b> glad Darth Vader isn't my father.

You can make links like this:
I'm reading about <a href="http://en.wikipedia.org/wiki/Darth_Vader">Darth Vader</a> on Wikipedia!

You can quote someone like this:
Darth Vader said <blockquote>Luke, I am your father.</blockquote>

T w e n t y S i d e d