Pricey ivars « Perpendiculous

2009.07.02

Pricey ivars

Filed under: Programming — cwright @ 12:20 am

(apologies in advance — this will be a rather technical post. It’s eventually intended for fdiv.net, but while it’s getting migrated I figured I’d plop it down here.)

Instance Variables (or, more briefly, ivars) are a pretty simple trend in Object Oriented Languages. They’re data that get carried along inside an object, helping to define its state. While it would be fun to elaborate more on this, this particular article isn’t intended to teach the basics of OOP. So if you’re unsure of what an ivar is, this article probably isn’t for you.

In Objective-C, (and possibly other languages), ivars have an interesting property — they’re volatile. volatile is one of those shady regions of C and C-derived languages (like Objective-C) that few people talk about, but what it means is that the data behind the variable can change at any moment, so the compiler should reload it from memory whenever it’s used (the alternative is to load it once, and store it in a register — this is typically simpler and faster, but sometimes wrong if the data is in fact volatile). This is all well and good, and it performs exactly as intended. However, because of this reloading, we get some negative performance characteristics to go along with it.

To demonstrate, we’ll have this chunk of code:
#import <Cocoa/Cocoa.h>
#define ELEMENTS (1024*1024*4)
#define ROUNDS (50)
typedef struct Vertex
{
	float x, y, z;
} Vertex;
@interface PointCloud : NSObject
{
	Vertex *vertices;
	Vertex min, max;
	unsigned int count;
}
-(id)initWithCount:(unsigned int)count;
-(void)calculateMaxMin1;
-(void)calculateMaxMin2;
-(void)calculateMaxMin3;
@end
@implementation PointCloud
-(id)initWithCount:(unsigned int)pCount
{
	if(self = [super init])
	{
		count = pCount;
		vertices = (Vertex*)malloc(sizeof(Vertex)*count);
		unsigned int i;
		for(i=0;i<count;++i)
		{
			vertices[i].x = 4. * sinf(i);
			vertices[i].y = 4. * cosf(i);
			vertices[i].z = 4. * sinf(i)+cosf(i);
		}
	}
	return self;
}
-(void)calculateMaxMin1
{
	if(count == 0)
		return;
	min.x = max.x = vertices[0].x;
	min.y = max.y = vertices[0].y;
	min.z = max.z = vertices[0].z;
	unsigned int i;
	for(i = 1; i < count; ++i)
	{
		if(vertices[i].x > max.x)
			max.x = vertices[i].x;
		if(vertices[i].y > max.y)
			max.y = vertices[i].y;
		if(vertices[i].z > max.z)
			max.z = vertices[i].z;

		if(vertices[i].x < min.x)
			min.x = vertices[i].x;
		if(vertices[i].y < min.y)
			min.y = vertices[i].y;
		if(vertices[i].z < min.z)
			min.z = vertices[i].z;
	}
}
-(void)calculateMaxMin2
{
	if(count == 0)
		return;
	min.x = max.x = vertices[0].x;
	min.y = max.y = vertices[0].y;
	min.z = max.z = vertices[0].z;
	unsigned int i, lCount = count;
	for(i = 1; i < lCount; ++i)
	{
		if(vertices[i].x > max.x)
			max.x = vertices[i].x;
		if(vertices[i].y > max.y)
			max.y = vertices[i].y;
		if(vertices[i].z > max.z)
			max.z = vertices[i].z;
		if(vertices[i].x < min.x)
			min.x = vertices[i].x;
		if(vertices[i].y < min.y)
			min.y = vertices[i].y;
		if(vertices[i].z < min.z)
			min.z = vertices[i].z;
	}
}
-(void)calculateMaxMin3
{
	if(count == 0)
		return;
	unsigned int i, lCount = count;
	Vertex *verts = vertices;
	Vertex lMax = verts[0], lMin = verts[0];
	for(i = 1; i < lCount; ++i)
	{
		if(verts[i].x > lMax.x)
			lMax.x = verts[i].x;
		if(verts[i].y > lMax.y)
			lMax.y = verts[i].y;
		if(vertices[i].z > lMax.z)
			lMax.z = verts[i].z;
		if(verts[i].x < lMin.x)
			lMin.x = verts[i].x;
		if(verts[i].y < lMin.y)
			lMin.y = verts[i].y;
		if(verts[i].z < lMin.z)
			lMin.z = verts[i].z;
	}
	max = lMax;
	min = lMin;
}
@end
int main()
{
	unsigned int i;
	double then, now;
	printf("Initializing Object...\n");
	PointCloud *p = [[PointCloud alloc] initWithCount:ELEMENTS];
	printf("Performing tests...\n");
	then = CFAbsoluteTimeGetCurrent();
	for(i=0;i<ROUNDS;++i)
		[p calculateMaxMin1];
	now = CFAbsoluteTimeGetCurrent();
	printf("Method1: %f\n", (now-then)/ROUNDS);
	then = CFAbsoluteTimeGetCurrent();
	for(i=0;i<ROUNDS;++i)
		[p calculateMaxMin2];
	now = CFAbsoluteTimeGetCurrent();
	printf("Method2: %f\n", (now-then)/ROUNDS);

	then = CFAbsoluteTimeGetCurrent();
	for(i=0;i<ROUNDS;++i)
		[p calculateMaxMin3];
	now = CFAbsoluteTimeGetCurrent();
	printf("Method3: %f\n", (now-then)/ROUNDS);
	return 0;
}
So, in a nutshell: we’ve got a Vertex structure, which holds an x, y, and z coordinate. Then, we have a PointCloud object, which holds a bunch of points. It also stores a max vertex, and a min vertex. These would be used to create crude axis-aligned bounding boxes, or for set normalization or something.

And our task is finding the maximum and minimum x, y, and z values in the cloud. (bonus points for using the phrase “in the cloud” without using buzzwords ;). The algorithm to do this is pretty simple: set our max/min value to the initial point in the cloud, and then check every other point to see if it’s higher than our max, or lower than our min. If it is, set the max or min as needed, and continue. In Computer Science class, they’ll tell you that this algorithm is O(N), which doesn’t suck too severely. I believe it’s the optimal way to solve this problem (for an unordered set, at least), but I could be mistaken.

Anyway, to solve our problem, I have supplied 3 example methods: calculateMaxMin1, calculateMaxMin2, and calculateMaxMin3. They all follow the same pattern, but they have measurably different performance characteristics.

Let’s talk about these methods. The first method uses only ivars. The loop checks against count (an ivar), and it updates max/min (ivars) based on the vertices array (also an ivar).

The second method, calculateMaxMin2, is basically the same as calculateMaxMin1, except that instead of vertexCount, it stores the count locally in a local variable called “lCount”.

The third and final method, calculateMaxMin3, still follows the above pattern, but uses local variables of count, max/min, and the vertices array.

A casual glance at the code would lead one to believe that the running time for any of the three methods would be similar (with the last one possibly taking a teeny-tiny bit longer because it’s doing a couple more copies… probably immeasurably small though, as in 15 nanoseconds tops). Instead of guessing though, we’ll run it and see what happens.

Running the program (for 4,194,304 points), we get this as the output:

cwright@phendrana:~/projects/ivar>./ivar
Initializing Object…
Performing tests…
Method1: 0.043917
Method2: 0.038990
Method3: 0.024590

Interpreting the results, we see that Method1 takes about 0.044 seconds to run, Method2 takes a little less, at 0.039 seconds, and Method3 takes a stunning 0.025 seconds to complete the exact same task.

Why does Method3 take so much less time? It takes just 55% as long as Method1, yet it’s doing the same amount of work!

The answer lies in how the compiler/optimizer treats ivars compared to local variables.

Let’s pull out some assembly output to see what’s happening behind the scenes. (This is going to get a bit more complicated, but don’t worry!)

calculateMaxMin1’s inner loop assembly code looks like this:

+59    00001a2d f30f100401            movss        (%ecx,%eax),%xmm0
+64    00001a32 0f2e4214                ucomiss    0x14(%edx),%xmm0
+68    00001a36 7605                    jbe        0x00001a3d
+70    00001a38 f30f114214            movss        %xmm0,0x14(%edx) (Vertex)max
+75    00001a3d f30f10440104            movss        0x04(%ecx,%eax),%xmm0
+81    00001a43 0f2e4218                ucomiss    0x18(%edx),%xmm0
+85    00001a47 7605                    jbe        0x00001a4e
+87    00001a49 f30f114218            movss        %xmm0,0x18(%edx)
+92    00001a4e f30f10440108            movss        0x08(%ecx,%eax),%xmm0
+98    00001a54 0f2e421c                ucomiss    0x1c(%edx),%xmm0
+102    00001a58 7605                    jbe        0x00001a5f
+104    00001a5a f30f11421c            movss        %xmm0,0x1c(%edx)
+109    00001a5f f30f100c01            movss        (%ecx,%eax),%xmm1
+114    00001a64 f30f104208            movss        0x08(%edx),%xmm0 (Vertex)min
+119    00001a69 0f2ec1                ucomiss    %xmm1,%xmm0
+122    00001a6c 7605                    jbe        0x00001a73
+124    00001a6e f30f114a08            movss        %xmm1,0x08(%edx) (Vertex)min
+129    00001a73 f30f104c0104            movss        0x04(%ecx,%eax),%xmm1
+135    00001a79 f30f10420c            movss        0x0c(%edx),%xmm0
+140    00001a7e 0f2ec1                ucomiss    %xmm1,%xmm0
+143    00001a81 7605                    jbe        0x00001a88
+145    00001a83 f30f114a0c            movss        %xmm1,0x0c(%edx)
+150    00001a88 f30f104c0108            movss        0x08(%ecx,%eax),%xmm1
+156    00001a8e f30f104210            movss        0x10(%edx),%xmm0
+161    00001a93 0f2ec1                ucomiss    %xmm1,%xmm0
+164    00001a96 7605                    jbe        0x00001a9d
+166    00001a98 f30f114a10            movss        %xmm1,0x10(%edx)
+171    00001a9d 43                    incl        %ebx
+172    00001a9e 83c00c                addl        $0x0c,%eax
+175    00001aa1 3b5a20                cmpl        0x20(%edx),%ebx (unsigned int)count
+178    00001aa4 7287                    jb        0x00001a2d

Note all the jumps (jbe, highlighted in red), the overall length (+59 to +180 bytes, or 121 bytes in total (180-59)), and how our ivars are loaded/stored (movss (%ecx,%eax),%xmm0 loads max.x, for example, highlighted in blue).

Now let’s contrast it with calculateMaxMin3’s inner loop:

+62    00001ba2 f30f10500c            movss        0x0c(%eax),%xmm2
+67    00001ba7 0f28c2                movaps    %xmm2,%xmm0
+70    00001baa f30f5fc5                maxss        %xmm5,%xmm0
+74    00001bae 0f28e8                movaps    %xmm0,%xmm5
+77    00001bb1 f30f104810            movss        0x10(%eax),%xmm1
+82    00001bb6 0f28d9                movaps    %xmm1,%xmm3
+85    00001bb9 f30f5fdc                maxss        %xmm4,%xmm3
+89    00001bbd 0f28e3                movaps    %xmm3,%xmm4
+92    00001bc0 f30f104014            movss        0x14(%eax),%xmm0
+97    00001bc5 0f28d8                movaps    %xmm0,%xmm3
+100    00001bc8 f30f5f5df4            maxss        0xf4(%ebp),%xmm3
+105    00001bcd f30f115df4            movss        %xmm3,0xf4(%ebp)
+110    00001bd2 f30f5d55f8            minss        0xf8(%ebp),%xmm2
+115    00001bd7 f30f1155f8            movss        %xmm2,0xf8(%ebp)
+120    00001bdc f30f5dcf                minss        %xmm7,%xmm1
+124    00001be0 0f28f9                movaps    %xmm1,%xmm7
+127    00001be3 f30f5dc6                minss        %xmm6,%xmm0
+131    00001be7 0f28f0                movaps    %xmm0,%xmm6
+134    00001bea 42                    incl        %edx
+135    00001beb 83c00c                addl        $0x0c,%eax
+138    00001bee 39da                    cmpl        %ebx,%edx
+140    00001bf0 72b0                    jb        0x00001ba2

No jump instructions, except for the loop’s jb at the end. Size is +62 to +142 (80 bytes, almost 30% smaller!), and most of our variables live in registers (note how xmm0-xmm7 are all utilized, and %ebp (stack spillage) is only used twice, at 0xf4, and 0xf8. Aside from the blue highlighted load/store lines, everything lives in a register, ready for screaming fast access (granted, most ivars will live in L1, so they’re only half as fast as registers, but as we see from the profiling above, that half-speed hit is actually measurable, even from L1!).

So, let’s analyze this a bit. Our loop code’s smaller (so it can complete about 33% more loops per unit of time). But that’s only part of the benefit. We also get no jumps. No jumps means no mis-predicted jumps, which helps the CPU pipeline stay loaded. And finally, we have almost no memory loading except for the vertex data itself (and a tiny bit of stack spillage). This means more CPU-RAM (or CPU-L1 Cache even) bandwidth is spent on loading relevant data, instead of reloading/storing ivars all the time.

Why are ivars implemented this way? Because a method needs to work on fresh data at all times — if another thread comes along and changes an ivar on the object, it wouldn’t make sense to keep using stale data. Local variables, in contrast, aren’t at all likely to get modified by another thread (you can do it, but it’s a rather contrived example). As such, the compiler knows exactly when a value changes, and can let it live on the cpu for as long as possible, for optimum speed.

Disclaimer stuff: This is a silly problem to solve, but it illustrates the point nicely. There are better solutions (using SSE to check 4 values at once, and then coalescing at the end, for example), but that’s not the point. Here we see how we can speed up (by almost a factor of 2!) loop intensive code by simply avoiding ivars in heavily trafficked loops.

Perpendiculous Programming, Personal Finance, and Personal musings

2009.07.02

Pricey ivars

No Comments »

Leave a comment