(apologies in advance — this will be a rather technical post. It’s eventually intended for fdiv.net, but while it’s getting migrated I figured I’d plop it down here.)
Instance variables (or, more briefly, ivars) are a pretty simple concept in object-oriented languages: they're data carried along inside an object, helping to define its state. While it would be fun to elaborate more on this, this particular article isn't intended to teach the basics of OOP, so if you're unsure of what an ivar is, this article probably isn't for you.
In Objective-C (and possibly other languages), ivars have an interesting property: they're effectively volatile. volatile is one of those shady regions of C and C-derived languages (like Objective-C) that few people talk about. What it means is that the data behind the variable can change at any moment, so the compiler should reload it from memory every time it's used. The alternative is to load it once and keep it in a register, which is typically simpler and faster, but sometimes wrong if the data really is volatile. This is all well and good, and it performs exactly as intended. However, all of that reloading brings some negative performance characteristics along with it. To demonstrate, we'll use the chunk of code below.
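First, though, a quick aside on volatile itself. Here's a minimal, standalone C sketch of what it buys you (the names done and spin_until_done are made up for illustration; they're not part of the test program):

/* A sketch: `done` might be flipped at any moment by a signal handler
 * or another thread (hypothetical names, not part of the test program). */
static volatile int done = 0;

static void spin_until_done(void)
{
    while(!done)   /* volatile forces a fresh load of `done` on every pass; */
        ;          /* without it, the compiler could read `done` once into  */
                   /* a register and spin forever on the stale copy.        */
}

Because the compiler can never cache a volatile value in a register, every single access pays for a trip out to memory. Keep that in mind; it's the cost the ivars below will be paying. And here's the test program itself: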
#import <Cocoa/Cocoa.h>
#include <math.h>   /* sinf(), cosf() */

#define ELEMENTS (1024*1024*4)
#define ROUNDS   (50)

typedef struct Vertex { float x, y, z; } Vertex;

@interface PointCloud : NSObject
{
    Vertex *vertices;
    Vertex min, max;
    unsigned int count;
}
-(id)initWithCount:(unsigned int)count;
-(void)calculateMaxMin1;
-(void)calculateMaxMin2;
-(void)calculateMaxMin3;
@end

@implementation PointCloud

-(id)initWithCount:(unsigned int)pCount
{
    if(self = [super init])
    {
        count = pCount;
        vertices = (Vertex*)malloc(sizeof(Vertex)*count);

        unsigned int i;
        for(i=0;i<count;++i)
        {
            vertices[i].x = 4. * sinf(i);
            vertices[i].y = 4. * cosf(i);
            vertices[i].z = 4. * sinf(i)+cosf(i);
        }
    }
    return self;
}

-(void)calculateMaxMin1
{
    if(count == 0)
        return;

    min.x = max.x = vertices[0].x;
    min.y = max.y = vertices[0].y;
    min.z = max.z = vertices[0].z;

    unsigned int i;
    for(i = 1; i < count; ++i)
    {
        if(vertices[i].x > max.x) max.x = vertices[i].x;
        if(vertices[i].y > max.y) max.y = vertices[i].y;
        if(vertices[i].z > max.z) max.z = vertices[i].z;

        if(vertices[i].x < min.x) min.x = vertices[i].x;
        if(vertices[i].y < min.y) min.y = vertices[i].y;
        if(vertices[i].z < min.z) min.z = vertices[i].z;
    }
}

-(void)calculateMaxMin2
{
    if(count == 0)
        return;

    min.x = max.x = vertices[0].x;
    min.y = max.y = vertices[0].y;
    min.z = max.z = vertices[0].z;

    unsigned int i, lCount = count;
    for(i = 1; i < lCount; ++i)
    {
        if(vertices[i].x > max.x) max.x = vertices[i].x;
        if(vertices[i].y > max.y) max.y = vertices[i].y;
        if(vertices[i].z > max.z) max.z = vertices[i].z;

        if(vertices[i].x < min.x) min.x = vertices[i].x;
        if(vertices[i].y < min.y) min.y = vertices[i].y;
        if(vertices[i].z < min.z) min.z = vertices[i].z;
    }
}

-(void)calculateMaxMin3
{
    if(count == 0)
        return;

    unsigned int i, lCount = count;
    Vertex *verts = vertices;
    Vertex lMax = verts[0], lMin = verts[0];

    for(i = 1; i < lCount; ++i)
    {
        if(verts[i].x > lMax.x) lMax.x = verts[i].x;
        if(verts[i].y > lMax.y) lMax.y = verts[i].y;
        if(verts[i].z > lMax.z) lMax.z = verts[i].z;

        if(verts[i].x < lMin.x) lMin.x = verts[i].x;
        if(verts[i].y < lMin.y) lMin.y = verts[i].y;
        if(verts[i].z < lMin.z) lMin.z = verts[i].z;
    }
    max = lMax;
    min = lMin;
}
@end

int main()
{
    unsigned int i;
    double then, now;

    printf("Initializing Object...\n");
    PointCloud *p = [[PointCloud alloc] initWithCount:ELEMENTS];

    printf("Performing tests...\n");
    then = CFAbsoluteTimeGetCurrent();
    for(i=0;i<ROUNDS;++i)
        [p calculateMaxMin1];
    now = CFAbsoluteTimeGetCurrent();
    printf("Method1: %f\n", (now-then)/ROUNDS);

    then = CFAbsoluteTimeGetCurrent();
    for(i=0;i<ROUNDS;++i)
        [p calculateMaxMin2];
    now = CFAbsoluteTimeGetCurrent();
    printf("Method2: %f\n", (now-then)/ROUNDS);

    then = CFAbsoluteTimeGetCurrent();
    for(i=0;i<ROUNDS;++i)
        [p calculateMaxMin3];
    now = CFAbsoluteTimeGetCurrent();
    printf("Method3: %f\n", (now-then)/ROUNDS);

    return 0;
}
So, in a nutshell: we’ve got a Vertex structure, which holds an x, y, and z coordinate. Then, we have a PointCloud object, which holds a bunch of points. It also stores a max vertex, and a min vertex. These would be used to create crude axis-aligned bounding boxes, or for set normalization or something.
And our task is finding the maximum and minimum x, y, and z values in the cloud (bonus points for using the phrase “in the cloud” without invoking buzzwords ;). The algorithm is pretty simple: set our max/min values to the first point in the cloud, then check every other point to see if it's higher than our max or lower than our min; if it is, update the max or min as needed and continue. In Computer Science class, they'll tell you that this algorithm is O(N), which doesn't suck too severely. For an unordered set it's also optimal, since you can't know the max or min without examining every point at least once.
Anyway, to solve our problem, I have supplied 3 example methods: calculateMaxMin1, calculateMaxMin2, and calculateMaxMin3. They all follow the same pattern, but they have measurably different performance characteristics.
Let’s talk about these methods. The first method uses only ivars. The loop checks against count (an ivar), and it updates max/min (ivars) based on the vertices array (also an ivar).
The second method, calculateMaxMin2, is basically the same as calculateMaxMin1, except that instead of reading the count ivar on every pass through the loop, it copies it once into a local variable called “lCount” and loops against that.
The third and final method, calculateMaxMin3, follows the same pattern, but uses local copies of everything: the count, the max/min vertices, and the vertices pointer. It writes max and min back to the ivars only after the loop finishes.
A casual glance at the code would lead one to believe that the running time for any of the three methods would be similar (with the last one possibly taking a teeny-tiny bit longer because it’s doing a couple more copies… probably immeasurably small though, as in 15 nanoseconds tops). Instead of guessing though, we’ll run it and see what happens.
Running the program (for 4,194,304 points), we get this as the output:
cwright@phendrana:~/projects/ivar>./ivar
Initializing Object…
Performing tests…
Method1: 0.043917
Method2: 0.038990
Method3: 0.024590

Interpreting the results, we see that Method1 takes about 0.044 seconds to run, Method2 takes a little less at 0.039 seconds, and Method3 takes a stunning 0.025 seconds to complete the exact same task.
Why does Method3 take so much less time? It takes only about 56% as long as Method1, yet it's doing the same amount of work!
The answer lies in how the compiler/optimizer treats ivars compared to local variables.
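Before dropping down to assembly, it helps to remember what an ivar reference actually is: inside a method, max.x really means self->max.x. So, conceptually (this is a sketch of the C-level meaning, not actual compiler output), the inner loop of calculateMaxMin1 looks like this to the compiler:

    for(i = 1; i < self->count; ++i)                /* count: re-read each pass   */
    {
        if(self->vertices[i].x > self->max.x)       /* load vertices, load max.x  */
            self->max.x = self->vertices[i].x;      /* store max.x back to memory */
        /* ...and the same dance for y, z, and min */
    }

Every one of those self-> hops is a load or store that hits memory. A local variable, by contrast, is private to the method, so the compiler is free to park it in a register for the entire loop.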
Let’s pull out some assembly output to see what’s happening behind the scenes. (This is going to get a bit more complicated, but don’t worry!)
calculateMaxMin1’s inner loop assembly code looks like this:
+59 00001a2d f30f100401 movss (%ecx,%eax),%xmm0
+64 00001a32 0f2e4214 ucomiss 0x14(%edx),%xmm0
+68 00001a36 7605 jbe 0x00001a3d
+70 00001a38 f30f114214 movss %xmm0,0x14(%edx) (Vertex)max
+75 00001a3d f30f10440104 movss 0x04(%ecx,%eax),%xmm0
+81 00001a43 0f2e4218 ucomiss 0x18(%edx),%xmm0
+85 00001a47 7605 jbe 0x00001a4e
+87 00001a49 f30f114218 movss %xmm0,0x18(%edx)
+92 00001a4e f30f10440108 movss 0x08(%ecx,%eax),%xmm0
+98 00001a54 0f2e421c ucomiss 0x1c(%edx),%xmm0
+102 00001a58 7605 jbe 0x00001a5f
+104 00001a5a f30f11421c movss %xmm0,0x1c(%edx)
+109 00001a5f f30f100c01 movss (%ecx,%eax),%xmm1
+114 00001a64 f30f104208 movss 0x08(%edx),%xmm0 (Vertex)min
+119 00001a69 0f2ec1 ucomiss %xmm1,%xmm0
+122 00001a6c 7605 jbe 0x00001a73
+124 00001a6e f30f114a08 movss %xmm1,0x08(%edx) (Vertex)min
+129 00001a73 f30f104c0104 movss 0x04(%ecx,%eax),%xmm1
+135 00001a79 f30f10420c movss 0x0c(%edx),%xmm0
+140 00001a7e 0f2ec1 ucomiss %xmm1,%xmm0
+143 00001a81 7605 jbe 0x00001a88
+145 00001a83 f30f114a0c movss %xmm1,0x0c(%edx)
+150 00001a88 f30f104c0108 movss 0x08(%ecx,%eax),%xmm1
+156 00001a8e f30f104210 movss 0x10(%edx),%xmm0
+161 00001a93 0f2ec1 ucomiss %xmm1,%xmm0
+164 00001a96 7605 jbe 0x00001a9d
+166 00001a98 f30f114a10 movss %xmm1,0x10(%edx)
+171 00001a9d 43 incl %ebx
+172 00001a9e 83c00c addl $0x0c,%eax
+175 00001aa1 3b5a20 cmpl 0x20(%edx),%ebx (unsigned int)count
+178 00001aa4 7287 jb 0x00001a2d
Note all the conditional jumps (the jbe instructions), the overall length (+59 through +180, or 121 bytes in total), and how our ivars are loaded and stored through self, which lives in %edx: ucomiss 0x14(%edx),%xmm0 compares the current vertex against the max.x ivar, for example, and movss %xmm0,0x14(%edx) stores the new max.x back.
Now let’s contrast it with calculateMaxMin3’s inner loop:
+62 00001ba2 f30f10500c movss 0x0c(%eax),%xmm2
+67 00001ba7 0f28c2 movaps %xmm2,%xmm0
+70 00001baa f30f5fc5 maxss %xmm5,%xmm0
+74 00001bae 0f28e8 movaps %xmm0,%xmm5
+77 00001bb1 f30f104810 movss 0x10(%eax),%xmm1
+82 00001bb6 0f28d9 movaps %xmm1,%xmm3
+85 00001bb9 f30f5fdc maxss %xmm4,%xmm3
+89 00001bbd 0f28e3 movaps %xmm3,%xmm4
+92 00001bc0 f30f104014 movss 0x14(%eax),%xmm0
+97 00001bc5 0f28d8 movaps %xmm0,%xmm3
+100 00001bc8 f30f5f5df4 maxss 0xf4(%ebp),%xmm3
+105 00001bcd f30f115df4 movss %xmm3,0xf4(%ebp)
+110 00001bd2 f30f5d55f8 minss 0xf8(%ebp),%xmm2
+115 00001bd7 f30f1155f8 movss %xmm2,0xf8(%ebp)
+120 00001bdc f30f5dcf minss %xmm7,%xmm1
+124 00001be0 0f28f9 movaps %xmm1,%xmm7
+127 00001be3 f30f5dc6 minss %xmm6,%xmm0
+131 00001be7 0f28f0 movaps %xmm0,%xmm6
+134 00001bea 42 incl %edx
+135 00001beb 83c00c addl $0x0c,%eax
+138 00001bee 39da cmpl %ebx,%edx
+140 00001bf0 72b0 jb 0x00001ba2
No jump instructions at all, except for the loop's jb at the end. The size is +62 through +142 (80 bytes, roughly a third smaller), and most of our variables now live in registers: xmm0 through xmm7 are all in use, and the stack (via %ebp) is touched only twice, at 0xf4 and 0xf8, where a little spillage landed. Aside from those two spill slots and the vertex loads themselves, everything stays in a register, ready for screaming-fast access. (Granted, most ivars will sit in the L1 cache, so they're only about half as fast as registers, but as the timings above show, even that half-speed hit is measurable.)
So, let's analyze this a bit. Our loop code is about a third smaller, so each iteration executes fewer instructions. But that's only part of the benefit. We also have no conditional jumps inside the loop body; no jumps means no mispredicted jumps, which helps keep the CPU pipeline full. And finally, there's almost no memory traffic beyond the vertex data itself (plus a tiny bit of stack spillage), so more CPU-to-RAM (or CPU-to-L1-cache) bandwidth goes toward loading relevant data instead of reloading and re-storing ivars all the time.
Why are ivars implemented this way? Because a method needs to work on fresh data at all times: if another thread comes along and changes an ivar on the object, it wouldn't make sense to keep using stale data. Local variables, in contrast, aren't at all likely to be modified by another thread (it can be done, but it takes a rather contrived setup). Since the compiler knows exactly when a local's value changes, it can let it live in a register for as long as possible, for optimum speed.
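To make that concrete, here's a contrived little sketch (a hypothetical Worker class, not part of the test program) of the kind of cross-thread ivar use that freshness is protecting:

@interface Worker : NSObject
{
    BOOL keepRunning;
}
-(void)processAll;
-(void)stopEarly;
@end

@implementation Worker
-(void)processAll
{
    keepRunning = YES;
    while(keepRunning)       /* ivar: reloaded from memory every pass, so a */
    {                        /* change made on another thread gets noticed  */
        /* do one chunk of work */
    }
}

-(void)stopEarly             /* called from some other thread */
{
    keepRunning = NO;
}
@end

If processAll had copied keepRunning into a local at the top (the way calculateMaxMin3 copies count), -stopEarly would have no effect: the loop would keep spinning on its cached copy. That's the trade-off you accept when you hoist ivars into locals, and it's why the trick belongs only in loops where you know nothing else is mutating them mid-flight.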
Disclaimer stuff: This is a silly problem to solve, but it illustrates the point nicely. There are better solutions (using SSE to check 4 values at once, and then coalescing at the end, for example), but that’s not the point. Here we see how we can speed up (by almost a factor of 2!) loop intensive code by simply avoiding ivars in heavily trafficked loops.
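For the curious, here's a rough sketch of that SSE idea (a hypothetical calculateMaxMinSSE method that isn't in the program above, untested and unbenchmarked; it assumes count is at least 2). Each unaligned load grabs a whole vertex plus the next vertex's x, the min/max accumulate four lanes at a time, and the lanes get coalesced after the loop:

#include <xmmintrin.h>   /* at the top of the file: _mm_loadu_ps, _mm_min_ps, _mm_max_ps */

-(void)calculateMaxMinSSE
{
    if(count < 2)
        return;   /* tiny clouds can fall back to the scalar version */

    Vertex *verts = vertices;
    unsigned int i, lCount = count;

    /* Lanes are {x, y, z, next-x}: lanes 0 and 3 both track x candidates. */
    __m128 vMin = _mm_loadu_ps(&verts[0].x);
    __m128 vMax = vMin;
    for(i = 1; i + 1 < lCount; ++i)   /* stop one early: the last vertex has */
    {                                 /* no "next x" to read beyond it       */
        __m128 v = _mm_loadu_ps(&verts[i].x);
        vMin = _mm_min_ps(vMin, v);
        vMax = _mm_max_ps(vMax, v);
    }

    /* Coalesce the lanes back into the min/max ivars. */
    float mins[4], maxs[4];
    _mm_storeu_ps(mins, vMin);
    _mm_storeu_ps(maxs, vMax);
    min.x = mins[0] < mins[3] ? mins[0] : mins[3];
    min.y = mins[1];
    min.z = mins[2];
    max.x = maxs[0] > maxs[3] ? maxs[0] : maxs[3];
    max.y = maxs[1];
    max.z = maxs[2];

    /* The final vertex was never fully loaded, so fold it in by hand. */
    Vertex last = verts[lCount - 1];
    if(last.x < min.x) min.x = last.x;  if(last.x > max.x) max.x = last.x;
    if(last.y < min.y) min.y = last.y;  if(last.y > max.y) max.y = last.y;
    if(last.z < min.z) min.z = last.z;  if(last.z > max.z) max.z = last.z;
}

Note that this version still touches the min/max ivars only once, after the loop, which is the same discipline calculateMaxMin3 uses; the SIMD part is just icing.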
Compiler command line to build: gcc-4.2 ivar.m -o ivar -Os -g -framework Cocoa