Perpendiculous Programming, Personal Finance, and Personal musings

2009.07.02

Pricey ivars

Filed under: Programming — cwright @ 12:20 am

(apologies in advance — this will be a rather technical post.  It’s eventually intended for fdiv.net, but while it’s getting migrated I figured I’d plop it down here.)

Instance Variables (or, more briefly, ivars) are a pretty simple trend in Object Oriented Languages.  They’re data that get carried along inside an object, helping to define its state.  While it would be fun to elaborate more on this, this particular article isn’t intended to teach the basics of OOP.  So if you’re unsure of what an ivar is, this article probably isn’t for you.

In Objective-C, (and possibly other languages), ivars have an interesting property — they’re volatilevolatile is one of those shady regions of C and C-derived languages (like Objective-C) that few people talk about, but what it means is that the data behind the variable can change at any moment, so the compiler should reload it from memory whenever it’s used (the alternative is to load it once, and store it in a register — this is typically simpler and faster, but sometimes wrong if the data is in fact volatile). This is all well and good, and it performs exactly as intended. However, because of this reloading, we get some negative performance characteristics to go along with it.

To demonstrate, we’ll have this chunk of code:

#import <Cocoa/Cocoa.h>
#define ELEMENTS (1024*1024*4)
#define ROUNDS (50)
typedef struct Vertex
{
	float x, y, z;
} Vertex;
@interface PointCloud : NSObject
{
	Vertex *vertices;
	Vertex min, max;
	unsigned int count;
}
-(id)initWithCount:(unsigned int)count;
-(void)calculateMaxMin1;
-(void)calculateMaxMin2;
-(void)calculateMaxMin3;
@end
@implementation PointCloud
-(id)initWithCount:(unsigned int)pCount
{
	if(self = [super init])
	{
		count = pCount;
		vertices = (Vertex*)malloc(sizeof(Vertex)*count);
		unsigned int i;
		for(i=0;i<count;++i)
		{
			vertices[i].x = 4. * sinf(i);
			vertices[i].y = 4. * cosf(i);
			vertices[i].z = 4. * sinf(i)+cosf(i);
		}
	}
	return self;
}
-(void)calculateMaxMin1
{
	if(count == 0)
		return;
	min.x = max.x = vertices[0].x;
	min.y = max.y = vertices[0].y;
	min.z = max.z = vertices[0].z;
	unsigned int i;
	for(i = 1; i < count; ++i)
	{
		if(vertices[i].x > max.x)
			max.x = vertices[i].x;
		if(vertices[i].y > max.y)
			max.y = vertices[i].y;
		if(vertices[i].z > max.z)
			max.z = vertices[i].z;

		if(vertices[i].x < min.x)
			min.x = vertices[i].x;
		if(vertices[i].y < min.y)
			min.y = vertices[i].y;
		if(vertices[i].z < min.z)
			min.z = vertices[i].z;
	}
}
-(void)calculateMaxMin2
{
	if(count == 0)
		return;
	min.x = max.x = vertices[0].x;
	min.y = max.y = vertices[0].y;
	min.z = max.z = vertices[0].z;
	unsigned int i, lCount = count;
	for(i = 1; i < lCount; ++i)
	{
		if(vertices[i].x > max.x)
			max.x = vertices[i].x;
		if(vertices[i].y > max.y)
			max.y = vertices[i].y;
		if(vertices[i].z > max.z)
			max.z = vertices[i].z;
		if(vertices[i].x < min.x)
			min.x = vertices[i].x;
		if(vertices[i].y < min.y)
			min.y = vertices[i].y;
		if(vertices[i].z < min.z)
			min.z = vertices[i].z;
	}
}
-(void)calculateMaxMin3
{
	if(count == 0)
		return;

	unsigned int i, lCount = count;
	Vertex *verts = vertices;
	Vertex lMax = verts[0], lMin = verts[0];
	for(i = 1; i < lCount; ++i)
	{
		if(verts[i].x > lMax.x)
			lMax.x = verts[i].x;
		if(verts[i].y > lMax.y)
			lMax.y = verts[i].y;
		if(vertices[i].z > lMax.z)
			lMax.z = verts[i].z;
		if(verts[i].x < lMin.x)
			lMin.x = verts[i].x;
		if(verts[i].y < lMin.y)
			lMin.y = verts[i].y;
		if(verts[i].z < lMin.z)
			lMin.z = verts[i].z;
	}
	max = lMax;
	min = lMin;
}
@end
int main()
{
	unsigned int i;
	double then, now;
	printf("Initializing Object...\n");
	PointCloud *p = [[PointCloud alloc] initWithCount:ELEMENTS];
	printf("Performing tests...\n");
	then = CFAbsoluteTimeGetCurrent();
	for(i=0;i<ROUNDS;++i)
		[p calculateMaxMin1];
	now = CFAbsoluteTimeGetCurrent();
	printf("Method1: %f\n", (now-then)/ROUNDS);
	then = CFAbsoluteTimeGetCurrent();
	for(i=0;i<ROUNDS;++i)
		[p calculateMaxMin2];
	now = CFAbsoluteTimeGetCurrent();
	printf("Method2: %f\n", (now-then)/ROUNDS);

	then = CFAbsoluteTimeGetCurrent();
	for(i=0;i<ROUNDS;++i)
		[p calculateMaxMin3];
	now = CFAbsoluteTimeGetCurrent();
	printf("Method3: %f\n", (now-then)/ROUNDS);
	return 0;
}

So, in a nutshell:  we’ve got a Vertex structure, which holds an x, y, and z coordinate.  Then, we have a PointCloud object, which holds a bunch of points.  It also stores a max vertex, and a min vertex.  These would be used to create crude axis-aligned bounding boxes, or for set normalization or something.

And our task is finding the maximum and minimum x, y, and z values in the cloud.  (bonus points for using the phrase “in the cloud” without using buzzwords ;).  The algorithm to do this is pretty simple:  set our max/min value to the initial point in the cloud, and then check every other point to see if it’s higher than our max, or lower than our min.  If it is, set the max or min as needed, and continue.  In Computer Science class, they’ll tell you that this algorithm is O(N), which doesn’t suck too severely.  I believe it’s the optimal way to solve this problem (for an unordered set, at least), but I could be mistaken.

Anyway, to solve our problem, I have supplied 3 example methods:  calculateMaxMin1, calculateMaxMin2, and calculateMaxMin3.  They all follow the same pattern, but they have measurably different performance characteristics.

Let’s talk about these methods.  The first method uses only ivars.  The loop checks against count (an ivar), and it updates max/min (ivars) based on the vertices array (also an ivar).

The second method, calculateMaxMin2, is basically the same as calculateMaxMin1, except that instead of vertexCount, it stores the count locally in a local variable called “lCount”.

The third and final method, calculateMaxMin3, still follows the above pattern, but uses local variables of count, max/min, and the vertices array.

A casual glance at the code would lead one to believe that the running time for any of the three methods would be similar (with the last one possibly taking a teeny-tiny bit longer because it’s doing a couple more copies… probably immeasurably small though, as in 15 nanoseconds tops).  Instead of guessing though, we’ll run it and see what happens.

Running the program (for 4,194,304 points), we get this as the output:

cwright@phendrana:~/projects/ivar>./ivar
Initializing Object…
Performing tests…
Method1: 0.043917
Method2: 0.038990
Method3: 0.024590

Interpreting the results, we see that Method1 takes about 0.044 seconds to run, Method2 takes a little less, at 0.039 seconds, and Method3 takes a stunning 0.025 seconds to complete the exact same task.

Why does Method3 take so much less time?  It takes just 55% as long as Method1, yet it’s doing the same amount of work!

The answer lies in how the compiler/optimizer treats ivars compared to local variables.

Let’s pull out some assembly output to see what’s happening behind the scenes.  (This is going to get a bit more complicated, but don’t worry!)

calculateMaxMin1’s inner loop assembly code looks like this:

+59    00001a2d  f30f100401              movss          (%ecx,%eax),%xmm0
+64    00001a32  0f2e4214                  ucomiss      0x14(%edx),%xmm0

+68    00001a36  7605                      jbe          0x00001a3d
+70    00001a38  f30f114214              movss          %xmm0,0x14(%edx) (Vertex)max
+75    00001a3d  f30f10440104              movss          0x04(%ecx,%eax),%xmm0
+81    00001a43  0f2e4218                  ucomiss      0x18(%edx),%xmm0
+85    00001a47  7605                      jbe          0x00001a4e
+87    00001a49  f30f114218              movss          %xmm0,0x18(%edx)
+92    00001a4e  f30f10440108              movss          0x08(%ecx,%eax),%xmm0
+98    00001a54  0f2e421c                  ucomiss      0x1c(%edx),%xmm0
+102    00001a58  7605                      jbe          0x00001a5f
+104    00001a5a  f30f11421c              movss          %xmm0,0x1c(%edx)
+109    00001a5f  f30f100c01              movss          (%ecx,%eax),%xmm1
+114    00001a64  f30f104208              movss          0x08(%edx),%xmm0 (Vertex)min
+119    00001a69  0f2ec1                  ucomiss      %xmm1,%xmm0
+122    00001a6c  7605                      jbe          0x00001a73
+124    00001a6e  f30f114a08              movss          %xmm1,0x08(%edx) (Vertex)min
+129    00001a73  f30f104c0104              movss          0x04(%ecx,%eax),%xmm1
+135    00001a79  f30f10420c              movss          0x0c(%edx),%xmm0
+140    00001a7e  0f2ec1                  ucomiss      %xmm1,%xmm0
+143    00001a81  7605                      jbe          0x00001a88
+145    00001a83  f30f114a0c              movss          %xmm1,0x0c(%edx)
+150    00001a88  f30f104c0108              movss          0x08(%ecx,%eax),%xmm1
+156    00001a8e  f30f104210              movss          0x10(%edx),%xmm0
+161    00001a93  0f2ec1                  ucomiss      %xmm1,%xmm0
+164    00001a96  7605                      jbe          0x00001a9d
+166    00001a98  f30f114a10              movss          %xmm1,0x10(%edx)
+171    00001a9d  43                      incl          %ebx
+172    00001a9e  83c00c                  addl          $0x0c,%eax
+175    00001aa1  3b5a20                  cmpl          0x20(%edx),%ebx (unsigned int)count
+178    00001aa4  7287                      jb          0x00001a2d

Note all the jumps (jbe, highlighted in red), the overall length (+59 to +180 bytes, or 121 bytes in total (180-59)), and how our ivars are loaded/stored (movss (%ecx,%eax),%xmm0 loads max.x, for example, highlighted in blue).

Now let’s contrast it with calculateMaxMin3’s inner loop:

+62    00001ba2  f30f10500c              movss          0x0c(%eax),%xmm2
+67    00001ba7  0f28c2                  movaps      %xmm2,%xmm0
+70    00001baa  f30f5fc5                  maxss          %xmm5,%xmm0
+74    00001bae  0f28e8                  movaps      %xmm0,%xmm5
+77    00001bb1  f30f104810              movss          0x10(%eax),%xmm1
+82    00001bb6  0f28d9                  movaps      %xmm1,%xmm3
+85    00001bb9  f30f5fdc                  maxss          %xmm4,%xmm3
+89    00001bbd  0f28e3                  movaps      %xmm3,%xmm4
+92    00001bc0  f30f104014              movss          0x14(%eax),%xmm0
+97    00001bc5  0f28d8                  movaps      %xmm0,%xmm3
+100    00001bc8  f30f5f5df4              maxss          0xf4(%ebp),%xmm3
+105    00001bcd  f30f115df4              movss          %xmm3,0xf4(%ebp)
+110    00001bd2  f30f5d55f8              minss          0xf8(%ebp),%xmm2
+115    00001bd7  f30f1155f8              movss          %xmm2,0xf8(%ebp)

+120    00001bdc  f30f5dcf                  minss          %xmm7,%xmm1
+124    00001be0  0f28f9                  movaps      %xmm1,%xmm7
+127    00001be3  f30f5dc6                  minss          %xmm6,%xmm0
+131    00001be7  0f28f0                  movaps      %xmm0,%xmm6
+134    00001bea  42                      incl          %edx
+135    00001beb  83c00c                  addl          $0x0c,%eax
+138    00001bee  39da                      cmpl          %ebx,%edx
+140    00001bf0  72b0                      jb          0x00001ba2

No jump instructions, except for the loop’s jb at the end.  Size is +62 to +142 (80 bytes, almost 30% smaller!), and most of our variables live in registers (note how xmm0-xmm7 are all utilized, and %ebp (stack spillage) is only used twice, at 0xf4, and 0xf8.  Aside from the blue highlighted load/store lines, everything lives in a register, ready for screaming fast access (granted, most ivars will live in L1, so they’re only half as fast as registers, but as we see from the profiling above, that half-speed hit is actually measurable, even from L1!).

So, let’s analyze this a bit.  Our loop code’s smaller (so it can complete about 33% more loops per unit of time).  But that’s only part of the benefit.  We also get no jumps.  No jumps means no mis-predicted jumps, which helps the CPU pipeline stay loaded.  And finally, we have almost no memory loading except for the vertex data itself (and a tiny bit of stack spillage).  This means more CPU-RAM (or CPU-L1 Cache even) bandwidth is spent on loading relevant data, instead of reloading/storing ivars all the time.

Why are ivars implemented this way?  Because a method needs to work on fresh data at all times — if another thread comes along and changes an ivar on the object, it wouldn’t make sense to keep using stale data.  Local variables, in contrast, aren’t at all likely to get modified by another thread (you can do it, but it’s a rather contrived example).  As such, the compiler knows exactly when a value changes, and can let it live on the cpu for as long as possible, for optimum speed.

Disclaimer stuff:  This is a silly problem to solve, but it illustrates the point nicely.  There are better solutions (using SSE to check 4 values at once, and then coalescing at the end, for example), but that’s not the point.  Here we see how we can speed up (by almost a factor of 2!) loop intensive code by simply avoiding ivars in heavily trafficked loops.

Compiler command line to build:   gcc-4.2 ivar.m -o ivar -Os -g -framework Cocoa

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress  •  Hosted by Kosada