<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Perpendiculous</title>
	<atom:link href="http://perpendiculo.us/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://perpendiculo.us</link>
	<description>Programming, Personal Finance, and Personal musings</description>
	<lastBuildDate>Thu, 12 Aug 2010 10:47:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Ghost in the shell</title>
		<link>http://perpendiculo.us/?p=218</link>
		<comments>http://perpendiculo.us/?p=218#comments</comments>
		<pubDate>Thu, 12 Aug 2010 10:47:37 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Meta]]></category>
		<category><![CDATA[Personal]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=218</guid>
		<description><![CDATA[I was planning on some really cool technical junk for tonight&#8217;s post, but midway through the day I caught word of a friend&#8217;s death.Part of maintaining a non-anonymous public online presence means continually maintaining (trying to maintain?) implied privacy &#8212; people don&#8217;t like to see critiques of decisions they make, even when the situation is [...]]]></description>
			<content:encoded><![CDATA[<p>I was planning on some really cool technical junk for tonight&#8217;s post, but midway through the day I caught word of a friend&#8217;s death.<span id="more-218"></span>Part of maintaining a non-anonymous public online presence means continually maintaining (trying to maintain?) implied privacy &#8212; people don&#8217;t like to see critiques of decisions they make, even when the situation is sanitized as much as possible so as to be anonymous;  I don&#8217;t like throwing people under the bus, but I do get a kick out of scrutinizing bad choices so I can try to make better ones (I routinely burn down my own ideas and decisions;  I&#8217;m impartial as to where they come from).  Because of this strict anonymity, many written things here become self-centric (because I can deal with me complaining) or so ambiguous that they&#8217;ve lose all meaning.</p>
<p>Another difficulty is that of honoring.  Not for the person being honored, but for those that person may have wronged, or for those that feel they&#8217;ve been wronged.  It&#8217;s easy to deify someone you&#8217;ve not known your whole life, and it&#8217;s easy to censor memories when reality makes heroes less than heroic at all times.</p>
<p>Today I&#8217;m going to break that second rule and pay my respects to Dayle Jellings.  While he was almost certainly not perfect, I am ignorant of his misdeeds and have no desire to become privy to them.  I know that he rendered me service and friendship unlike that of any other, all while never asking for anything in return.  Transportation, ice cream, laundry money, hair cuts, fellowship, those are the things I remember him for.  Even after I moved out he still kept in touch, traveling far out of his way just to say hello.  After we left he remembered each of us years later.  When one of us made a poor decision, he didn&#8217;t abandon or condemn.</p>
<p>A Lauri Anderson lyric describes losing a father as a whole library burning down.  For me, Jellings was more than a library;  he was a friend.  He will be missed.</p>
<p>His facebook account has him immortalized (at least updated with a eulogy of sorts), complete with a friend wishing him a speedy recovery.  Such hollow wishes always choke me up.  How powerless we really are, how fragile life is.</p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=218</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cache is King -or- Things are about to get MESI</title>
		<link>http://perpendiculo.us/?p=213</link>
		<comments>http://perpendiculo.us/?p=213#comments</comments>
		<pubDate>Sun, 01 Aug 2010 06:24:51 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=213</guid>
		<description><![CDATA[A few days ago I was chatting with some friends, and the topic of caching came up. I mentioned MESI, which is the basis for modern multicore cache coherence (There are variants like MOESI and MERSI, but the general idea is the same).  It then occurred to me that I&#8217;ve never actually made a test [...]]]></description>
			<content:encoded><![CDATA[<p>A few days ago I was chatting with some friends, and the topic of caching came up.  I mentioned <a href="http://en.wikipedia.org/wiki/MESI_protocol">MESI</a>, which is the basis for modern multicore cache coherence (There are variants like MOESI and MERSI, but the general idea is the same).  It then occurred to me that I&#8217;ve never actually made a test to see the effects of MESI in action.<span id="more-213"></span>To begin, ere&#8217;s an easy-as-pie pathological performance demonstration:</p>
<pre>cwright@phendrana:~/projects/test/cache&gt;gcc -framework Foundation source.m -o source -Os</pre>
<pre>cwright@phendrana:~/projects/test/cache&gt;./source</pre>
<pre>  &amp;value1: 0x2014</pre>
<pre>  &amp;value2: 0x2018</pre>
<pre>  same cache line!</pre>
<pre>  elapsed: 12639.155030ms</pre>
<pre>cwright@phendrana:~/projects/test/cache&gt;gcc -framework Foundation source.m -o source</pre>
<pre>cwright@phendrana:~/projects/test/cache&gt;./source</pre>
<pre>  &amp;value1: 0x2020</pre>
<pre>  &amp;value2: 0x2080</pre>
<pre>  different cache line!</pre>
<pre>  elapsed: 5614.167988ms</pre>
<p>That above nonsense is me compiling the same piece of code (analyzed later) being compiled with optimization (and taking ~12.6 seconds to complete), and without optimization (taking ~5.6 seconds to complete).  Wait a minute.  Optimized takes over 2x as long?  If that&#8217;s not a teaser, I don&#8217;t know what is.</p>
<p>Modern CPUs have a few levels of cache between their registers (the fastest memory around) and the Harddrive (the slowest, unless you count the network).  You&#8217;ve probably heard of one of the most important caches, RAM &#8212; it&#8217;s filled with stuff from the harddrive (and temporary working data) because accessing the harddrive is so laughably slow that if you understood it you&#8217;d never refer to your computer making that grinding noise as &#8220;thinking&#8221; &#8212; it&#8217;s actually doing quite the opposite:  not thinking, but waiting for the slow disk to feed it more data.</p>
<p>Compared to the CPU, the harddrive is a glacier &#8212; it takes hundreds of thousands of clock cycles to get a response.  It&#8217;s like sending a letter across the ocean on a ship.  Compared to the CPU, even RAM, which is orders of magnitude faster than the harddrive, is really slow &#8212; it takes hundreds of cycles to get a response.  So to work around the slowness of RAM, a few more caches are in place:  L1, L2, and sometimes L3.  These are made of very fast, very expensive memory, and they&#8217;re often very small.  L1 and L2 live on the CPU, with L1 being per-core, and L2 usually being shared among all cores on the CPU, and on some newer processors L3 is also on the CPU and also shared (but sometimes it&#8217;s off the CPU).  The L1 cache is further divided into L1-I and L1-D, for Instructions and Data, but that&#8217;s a topic for a different day.  Generally L1 will be the smallest and the fastest, L2 will be slower but larger, and L3 will be slower but larger still.</p>
<p>RAM is divided into bytes.  In the old days, you&#8217;d even read and write it as individual bytes.  (some fancy-pants microcontrollers even offered bit addressing, but they typically worked on bytes and just hid that detail from you).  These days, its read and written in larger chunks (there are some exceptions, but generally reads and writes work on much much larger units).  These chunks are typically Cache Lines.  A cache line on a current X86 processor is 64 bytes.  So when your program reads byte at location 0, behind your back the cache will also fetch bytes 1-63 for you.  This may seem totally bizarre and wasteful, but the reason for it is because often programs that ask for a value at a certain location will ask for nearby locations shortly after (phenomena referred to as &#8220;spatial locality&#8221; and &#8220;locality of reference&#8221;).  The cache will hold on to that entire line until it seems like the program&#8217;s not interested in it any longer, at which point it evicts it, and asks for a new line.  You probably undergo a similar optimization when you&#8217;re putting on your socks and shoes:  you grab them in pairs, and put them on one at a time.  You could put one on, and then go and grab the second and put it on, but it&#8217;s easier to just get them all at once.</p>
<p>When a CPU core holds a cache line, it&#8217;s free to modify it as it needs to &#8212; this is known as the M state (for Modified).  When a line is &#8220;M&#8221;, it no longer matches what&#8217;s in other caches and RAM &#8212; this is ok, as long as it makes it back there eventually.  If a line in cache holds no valid data, it&#8217;s Invalid, or in the I state.  If a CPU holds a line unmodified, and no one else has it, it&#8217;s &#8220;Exclusive&#8221;, or E state.  And finally, when a line is held by 2 or more cores, it&#8217;s in the Shared, or S, state &#8212; this is where it begins to get complicated.</p>
<p>Let&#8217;s say that 2 cores are working on the same cache line.  Core 1 changes one of the bytes.  This means that Core 2&#8242;s version is Invalid (I state!).  So when Core 2 needs to read a byte from that line, it notes that it needs to get the latest version &#8212; this involves asking Core 1 for its latest copy.  Core 2 then changes a byte, invalidating Core 1&#8242;s line, so Core 1 then asks for Core 2&#8242;s copy, and the cycle repeats.  This, as you can imagine, gets really slow &#8212; the cache line is bouncing around between cores, continually invalidating each other, causing them to stall.</p>
<p>For memory that doesn&#8217;t get written to (the instructions that make up the program, for example), this isn&#8217;t a problem &#8212; the lines are Shared (S state), and never modified, so both cores can happily operate simultaneously, each using their own identical copy with no stalls or invalidations.</p>
<p>So, how did we get the situation above?  Here&#8217;s the source code:</p>
<pre>#import &lt;Foundation/Foundation.h&gt;
#import &lt;libkern/OSAtomic.h&gt;

static volatile uint32_t value1 = 1;
static volatile uint32_t padding1[16] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
static volatile uint32_t value2 = 1;
static volatile uint32_t paddings[16] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
static volatile uint32_t count = 2;

#define ROUNDS (0x8000000)

void *counterThread(void *arg)
{
 unsigned int i;
 for(i=0;i&lt;ROUNDS;++i)
    OSAtomicIncrement32(arg);
 OSAtomicDecrement32(&amp;count);
 return 0;
}

int main()
{
 pthread_t pth;

 printf("&amp;value1: %p\n", &amp;value1);
 printf("&amp;value2: %p\n", &amp;value2);
 if((ptrdiff_t)&amp;value2 - (ptrdiff_t)&amp;value1 &lt; 64)
    printf("same cache line!\n");
 else
    printf("different cache line!\n");

 double now, then = CFAbsoluteTimeGetCurrent();

 pthread_create(&amp;pth, NULL, counterThread, &amp;value1);
 pthread_create(&amp;pth, NULL, counterThread, &amp;value2);

 while(count)
    usleep(1);
 now = CFAbsoluteTimeGetCurrent() - then;
 printf("elapsed: %fms\n", now*1000);

 return 0;
</pre>
<pre>}
</pre>
<p>What we&#8217;re doing is spawning 2 threads, and having each of them count at a certain location. Note that our timing is not particularly accurate &#8212; it&#8217;s intended to give us the general magnitude of the duration, not the fine-grained timing information we have in other profiling posts.</p>
<p>When optimization is disabled, the compiler doesn&#8217;t do any cleanup, and just uses the variables as they&#8217;re described at the top (value1, value2, padding1, padding2).  Because of the Padding variables (64bytes of them, to be exact), we guaranteed that our value1 and value2 variables are at least 64bytes apart (thus, on different cache lines) in memory &#8212; this allows each CPU core to operate on a different cache line, never interrupting the other cores.</p>
<p>When we enable optimization, the compiler reasons about the code some.  It notices that we never actually use padding1 and padding2, and so it removes them (normally, fewer variables will improve performance, as the remaining live variables will be together in cache).  In this case, it puts these two variables adjacent (on the same cache line), unaware that they&#8217;re about to be hammered by 2 different Cores with 2 different Caches.  So even though the threads are operating on distinctly separate regions of memory, they&#8217;re the same logical cache line, and thus performance issues emerge.</p>
<p>One simple way to fix this is to tell the compiler to align each of the variables to its own cache line using the &#8220;aligned&#8221; attribute.  This will ensure that we have plenty of space around our values.</p>
<p>Variables aren&#8217;t typically the source of cache line contention like this;  the more frequent culprits are spinlocks and mutexs &#8212; thankfully, on OS X mutexes are big enough to be cache-line aligned already, but spinlocks are not.  So if you have a large block of spinlock definitions at the top of some source file, it might be useful to separate them some if contention is actually rears its head.</p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=213</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>NX in action</title>
		<link>http://perpendiculo.us/?p=158</link>
		<comments>http://perpendiculo.us/?p=158#comments</comments>
		<pubDate>Mon, 28 Jun 2010 05:49:46 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=158</guid>
		<description><![CDATA[NX, or the No eXecute bit, is an interesting technology that prevents instructions on the stack from getting executed.  The reason for this is security (stack smashing becomes a bit more difficult for a would-be attacker), and the implications are typically few and far between. The way it works is by marking stack memory pages [...]]]></description>
			<content:encoded><![CDATA[<p>NX, or the No eXecute bit, is an interesting technology that prevents instructions on the stack from getting executed.  The reason for this is security (stack smashing becomes a bit more difficult for a would-be attacker), and the implications are typically few and far between.<span id="more-158"></span>  The way it works is by marking stack memory pages as non-executable (but still read/writable; historically, x86 only had read/write, and read implied execute).  Pretty simple stuff.</p>
<p>One of the more interesting aspects (on OS X at least) is that in 32bit mode, the heap is automatically executable (no need for <code>mmap()</code> and friends), while in 64bit mode, even the heap is NX&#8217;d.  This makes heap overflows much more difficult to exploit;  you&#8217;re pretty much stuck needing to do a return-to-libc attack (which is still possible, mind you &#8212; NX does nothing to prevent that sort of attack).</p>
<p>Here&#8217;s some example code:</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

void (*f)();

int main()
{
        void *ptr = malloc(16);
        memset(ptr, 0xc3, 16);    // 0xc3 = RET
        f = ptr;

        printf("Executing from heap\n");
        f();
        printf("we're still alive.\n");

        char buffer[16];
        memset(buffer, 0xc3, 16);
        f = buffer;
        printf("next up, from the stack\n");
        f();
        printf("we're still alive.\n");

        return 0;
}</pre>
<p>Compile using <code>gcc nx.c -o nx -m32</code>, and you&#8217;ll see the program crash on the &#8220;next up, from the stack&#8221; step.  Swap in <code>-m64</code> instead of <code>-m32</code>, and you&#8217;ll see it crash immediately after &#8220;Executing from the heap.&#8221;</p>
<p>None of this is particularly new or earth-shattering, but it&#8217;s a neat little concept to play with.  For self-modifying code paths, or <acronym title="Just In Time">JIT</acronym> compilers, this can be require a slight detour (though JITs should be using <code>mmap()</code> by now anyway).</p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=158</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>past trends</title>
		<link>http://perpendiculo.us/?p=175</link>
		<comments>http://perpendiculo.us/?p=175#comments</comments>
		<pubDate>Tue, 11 May 2010 04:09:53 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=175</guid>
		<description><![CDATA[Visualizing the badness of purchasing a new car.The graph of our networth over time from wedding day to current.  Note the general positive trend, up until march 2010.  That would be the car purchase&#8230;  (even relocating across the country, losing one job, and purchasing a bunch of new stuff didn&#8217;t swing February down&#8230; crazy.)]]></description>
			<content:encoded><![CDATA[<p>Visualizing the badness of purchasing a new car.<span id="more-175"></span>The graph of our networth over time from wedding day to current.  Note the general positive trend, up until march 2010.  That would be the car purchase&#8230;  (even relocating across the country, losing one job, and purchasing a bunch of new stuff didn&#8217;t swing February down&#8230; crazy.)</p>
<div id="attachment_176" class="wp-caption alignright" style="width: 635px"><a href="http://perpendiculo.us/wp-content/uploads/2010/05/cars.png"><img class="size-full wp-image-176" title="cars" src="http://perpendiculo.us/wp-content/uploads/2010/05/cars.png" alt="" width="625" height="343" /></a><p class="wp-caption-text">Purchasing a new car starts an aggressive downward trend.. let&#39;s hope we can hold out</p></div>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=175</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>First Week -or- The Craziest Story Ever Told</title>
		<link>http://perpendiculo.us/?p=165</link>
		<comments>http://perpendiculo.us/?p=165#comments</comments>
		<pubDate>Mon, 08 Feb 2010 05:12:28 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Meta]]></category>
		<category><![CDATA[Personal]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=165</guid>
		<description><![CDATA[I&#8217;ve been chilling in my new Cupertino apartment for about 3 days now.  Jet Lag still makes me wake up between 5 and 6am local time, but strangely allows me to stay up till midnight.  When I need to go somewhere, I walk, and until Alisa gets here next week, I likely won&#8217;t have much [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been chilling in my new Cupertino apartment for about 3 days now.  Jet Lag still makes me wake up between 5 and 6am local time, but strangely allows me to stay up till midnight.  When I need to go somewhere, I walk, and until Alisa gets here next week, I likely won&#8217;t have much in the way of amenities.  Not that I&#8217;m in dire need, mind you &#8212; it&#8217;s just not a priority at the moment.<span id="more-165"></span>So anyway, what happened?  Dialing the Way Back machine back to October 2009, you may recall a certain <a title="Reality Distortion Field Deflector" href="http://perpendiculo.us/?p=155" target="_blank">Reality Distortion Field Deflection</a>.  At the time, it looked like I was still Columbus bound for the time being.  However, some phone calls started coming in again in December, and by just before Christmas the process had restarted, albeit with a few changes.  In January some paperwork came my way, I wrote my name and social security number a bunch of times and sent them back, and then I waited.</p>
<p>My start date was set at February First.  This was established on January 29th, which didn&#8217;t leave a lot of time.  Being full of reckless bravado, I welcomed the challenge.</p>
<p>Packing was a mad dash &#8212; stuffing in lots of my clothes, but also some of Alisa&#8217;s less-frequently-used stuff.  The bulk of packing was unfortunately left to Alisa after my departure.</p>
<p>The flight had a layover in Las Vegas.  As always, the people flying to Las Vegas are looking for a good time, and this time was no exception.  Except that one of the passengers behind me had a little bit too much to drink, and decided that he was &#8220;done with the plane&#8221; half an hour before we landed.  He started unloading the overhead bins, and being obnoxious in general (all the while his boss was trying to get him to calm down).  Being seated in the next-to-last row on the plane, it took no less than 79 hours to get off the plane because people still haven&#8217;t mastered the art of fetching stuff from overhead _Before_ they need to move (so they don&#8217;t hold up everyone else) or simply having light (or no!) carry-on at all.  Due to the delay, our inebriated friend started talking about blowing up the plane, figuring that it&#8217;d encourage people to get off faster (and possibly for mild comic effect).  Instead all it did was make some passengers look horrified (in the &#8220;I pity what&#8217;s going to happen to you&#8221; sense, not the &#8220;I&#8217;m scared for my life!&#8221; sense) and make the flight attendants scramble to deal with him.</p>
<p>Estimating a week-long apartment search, I booked a hotel through Thursday morning, and a car through Thursday evening &#8212; I figured the shorter duration would force me to take action, or at least force me to sleep on a park bench for a few nights if I failed.  The 55-60 degree weather of San Jose was a welcomed change from the 8 degree winter that Ohio was offering at the time, but even so park benches weren&#8217;t overly inviting.</p>
<p>Alisa pulled through and arranged some apartments using elite time-zone tricks like calling after-hours in Ohio (which is still business hours in Cali).  Additionally, a co-worker drove me through the nearby area on the way to dinner one evening, so I got to check out some more (from the safety of Troy&#8217;s Audi).  When a suitable place was determined (a 25 minute walk to/from the office, a 20 minute walk to/from Target and the bank, a 15 minute walk to/from church, a 10 minute walk to Whole Foods, and most important, a 5 minute walk to/from Panda Express), I pulled the necessary $2175 for rent, deposit, and application fees out of savings, and landed some living quarters Thursday morning (actually, there were 2 transactions, one of which was deposit + application fees;  that happened on Wednesday.  Rent was due when I actually got approved and signed the lease), and moved in my two suitcases.   (A fabulous reason to have an emergency fund &#8212; also, expenses piled up really quickly with plane tickets and car rentals and hotel stays and buying food, so having credit cards handy makes a huge difference despite what <a title="Dave Ramsey" href="http://www.daveramsey.com/article/the-truth-about-credit-card-debt/" target="_blank">Dave Ramsey unwisely counsels</a>.)</p>
<p>The next tricky bit was getting from the Airport (amusingly, San Jose is abbreviated &#8220;SJ&#8221; all over the freeway signs.  SJ is also an abbreviation for <a title="Steve Jobs as SJ" href="http://americanhistory.si.edu/collections/comphist/sj1.html#tools" target="_blank">Steve Jobs</a>, my new boss (no, I haven&#8217;t met him)) back to Cupertino.  Returning the car went ok (as long as you throw way too much money at them, and don&#8217;t kill the car, they don&#8217;t care all that much), but finding a Taxi for less than $60 to cross town was difficult.  To further complicate matters, it was monsoon season, so every minute I was outside looking for transit I was getting wetter and wetter.  Eventually someone overheard a quote, and offered a ride instead (citing $60 to cross town as high way robbery) &#8212; He was driving to Mountain View, which is just north of Cupertino, so it wasn&#8217;t too far out of his way.  We chatted some, and he&#8217;s in marketing, but was formerly a software engineer as well, so we had some interesting discussions along the way.  I offered him the cash I had upon safe arrival ($40), but he declined, saying &#8220;Welcome to California&#8221; &#8212; despite my usual low impression of people generally, California (or at least the Bay area) offers some extremely hospitable residents &#8212; a few years back at WWDC smokris and I got some pleasant breaks while ironing out transit (except when it came to the Taxi guy at the end).</p>
<p>Saturday was spent wandering around getting acclimated some more.  I picked up some basics (shower curtain, razors, trash can, food+juice for sunday, ordered internet (only Comcast is available where our apartment is, that sucks)), and also discovered that my key card doesn&#8217;t grant me access to my office building on weekends.  That&#8217;s a shame (as I&#8217;m accustomed to dumping 60+ hours a week into what I do because it&#8217;s a work of passion for me), but maybe I&#8217;ll work out some agreement and get access or something later.</p>
<p>Sunday was a nice relaxing day.  I got to catch a nap, and meet some new people.</p>
<p>Work-wise, it&#8217;s been stellar.  I&#8217;m surprised how effective an office is for me (I share an office with Troy at the moment, so he&#8217;s probably much less effective with my constant barrage of questions/thoughts), and the people are great.  The paranoia and secrecy you hear about is entirely true &#8212; the only other organizations that maintain this many secrets are probably the US Government and the KGB.  At first this was a bit saddening (coming from Kosada, where I knew basically everything that was going on), but after a day or two I realized how much more focused I could be.  There are rough edges of course (like trying to work with something &#8220;secret&#8221;), but those are exceptional cases.  This tends to make each team an expert on their particular project, which in turn allows for wildly fast turnaround times and equally impressive wide-scale refactoring/redesign that goes almost completely unnoticed on the outside.</p>
<p>The coworkers are great.  I&#8217;ve learned a lot already, and the group seems really close.  There&#8217;s a NeXTie who dates back to before OpenGL from what I can tell, and several other very talented, very sharp people.  And they all do things besides pressing buttons all day!  That&#8217;s amazing to me (computer nerds get a bad rap for living in their mom&#8217;s basement until they&#8217;re 35) for some reason (not that previously anyone was that bad, it&#8217;s just a stereotype that I expect for some reason).  After spending so many years reverse engineering the project I&#8217;m now working on, I felt pretty at home almost instantly, and immediately dug in and starting learning how to deal with bugs, features, and the development environment in general in an environment where the schedule is much more rigid than my previous experience.  There are all kinds of cool little details I&#8217;d love to chat about, but I&#8217;ve probably said more than I should have already.</p>
<p>And the lunches are fantastic!  It&#8217;s nice to be able to eat such a wide variety of food, and socialize with fellow engineers.  It&#8217;s almost impossible to walk through the courtyard area of Infinite Loop and not have a hugely goofy grin on my face <img src='http://perpendiculo.us/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=165</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reality Distortion Field Deflector (RDFD)</title>
		<link>http://perpendiculo.us/?p=155</link>
		<comments>http://perpendiculo.us/?p=155#comments</comments>
		<pubDate>Wed, 14 Oct 2009 22:17:20 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=155</guid>
		<description><![CDATA[Mid last month (September, for those keeping score at home), a peculiar email arrived in my inbox.  Therein, I was referred by an Apple employee with an opportunity to potentially work there.  To say the least, my interest was piqued.  After all, after spending the past 2 years up to my elbows in some of [...]]]></description>
			<content:encoded><![CDATA[<p>Mid last month (September, for those keeping score at home), a peculiar email arrived in my inbox.  Therein, I was referred by an Apple employee with an opportunity to potentially work there.  To say the least, my interest was piqued.  After all, after spending the past 2 years up to my elbows in some of their software&#8217;s guts, reverse engineering, patching, and exploring, I&#8217;d like to think I had some authority on the subject.</p>
<p><span id="more-155"></span></p>
<p>Being an employee referral, some cool perks came my way.  Specifically, I got to skip the phone interview entirely, and got flown out there, put up in a nice hotel (Cypress Hotel), and had a fun dark blue Mazda 5 waiting for me.  Very smooth, to say the least.  I&#8217;m pretty sure that everyone who gets interviewed receives such treatment.</p>
<p>From first contact to flight out there, it was 2 weeks.  Pretty darn fast.</p>
<p>After resting up that evening in San Jose, I awoke the next morning for the interview.  I was told that ongoings in said interview are confidential.  In general, things went pretty smoothly.  Getting to meet the QC team was quite exciting, and meeting several talented engineers is also always great.  Finally, the recruiter came in, we threw around some numbers, and then I left for my flight home.  My total time in interviews was around 4-4.5 hours, and my total time in San Jose was about 16 hours.  Very brief.</p>
<p>Within a week, the background check company had contacted me to verify my employment.  Paystubs, Articles of Incorporation, and all that went in.  Then, a turn for the worst:  a week of absolute silence.  2 weeks without hearing anything is never a good sign.</p>
<p>Today, the rejection letter finally came.  Total time from first contact to rejection:  28 days.  Citing budgetary issues etc.  Whether or not that&#8217;s valid is of no concern to me.  At least it frees me up to continue working on things here.  Though I am a bit worried about the future of the technology I&#8217;ve so lovingly dealt with&#8230;</p>
<p>So, there you have it:  a brief trip into the heart of the Reality Distortion Field itself, and then a successful deflection.</p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=155</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>@synchronized, NSLock, pthread, OSSpinLock showdown, done right</title>
		<link>http://perpendiculo.us/?p=133</link>
		<comments>http://perpendiculo.us/?p=133#comments</comments>
		<pubDate>Wed, 23 Sep 2009 05:19:50 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=133</guid>
		<description><![CDATA[Somewhere out there on the internet, there&#8217;s a &#8220;showdown&#8221; between @synchronized, NSLock, pthread mutexes, and OSSpinLock. It aims to measure their performance relative to each other, but uses sloppy code to perform the measuring. As a result, while the performance ordering is correct (@synchronized is the slowest, OSSpinLock is the fastest), the relative cost is [...]]]></description>
			<content:encoded><![CDATA[<p>Somewhere out there on the internet, there&#8217;s a &#8220;showdown&#8221; between @synchronized, NSLock, pthread mutexes, and OSSpinLock. It aims to measure their performance relative to each other, but uses sloppy code to perform the measuring. As a result, while the performance ordering is correct (@synchronized is the slowest, OSSpinLock is the fastest), the relative cost is severely misrepresented. Herein I attempt to rectify that benchmark.</p>
<p><span id="more-133"></span>Locking is absolutely required for critical sections. These arise in multithreaded code, and sometimes their performance can have severe consequences in applications. The problem with the aforementioned benchmark is that it did a bunch of extraneous work while it was locking/unlocking. It was doing the same amount of extraneous work, so the relative order was correct (the fastest was still the fastest, the slowest still the slowest, etc), but it didn&#8217;t properly show just how much faster the fastest was.</p>
<p>In the benchmark, the author used autorelease pools, allocated objects, and then released them all.  While locking.  This is a pretty reasonable use-case, but by no means the only one.  For most high-performance, multithreaded code, you&#8217;ll spend a _bunch_ of time trying to make the critical sections as small and fast as possible.  Large, slow critical sections effectively undo the multithreading speed up by causing threads to block each other out unnecessarily.  So when you&#8217;ve trimmed the critical sections down to the minimum, another sometimes-justified optimization is to optimize the amount of time spent locking/unlocking itself.</p>
<p>Just to make things exciting though, not all locking primitives are created equal.  Two of the 4 mentioned have special properties that can affect how long they take, and how the operate under pressure.  I&#8217;ll get to that towards the end.</p>
<p>First up, here&#8217;s my &#8220;no-nonsense&#8221; microbench code:</p>
<pre>#import &lt;Foundation/Foundation.h&gt;
#import &lt;objc/runtime.h&gt;
#import &lt;objc/message.h&gt;
#import &lt;libkern/OSAtomic.h&gt;
#import &lt;pthread.h&gt;

#define ITERATIONS (1024*1024*32)

static unsigned long long disp=0, land=0;

int main()
{
 double then, now;
 unsigned int i, count;
 pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
 OSSpinLock spinlock = OS_SPINLOCK_INIT;

 NSAutoreleasePool *pool = [NSAutoreleasePool new];

 NSLock *lock = [NSLock new];
 then = CFAbsoluteTimeGetCurrent();
 for(i=0;i&lt;ITERATIONS;++i)
 {
 [lock lock];
 [lock unlock];
 }
 now = CFAbsoluteTimeGetCurrent();
 printf("NSLock: %f sec\n", now-then);    

 then = CFAbsoluteTimeGetCurrent();
 IMP lockLock = [lock methodForSelector:@selector(lock)];
 IMP unlockLock = [lock methodForSelector:@selector(unlock)];
 for(i=0;i&lt;ITERATIONS;++i)
 {
 lockLock(lock,@selector(lock));
 unlockLock(lock,@selector(unlock));
 }
 now = CFAbsoluteTimeGetCurrent();
 printf("NSLock+IMP Cache: %f sec\n", now-then);    

 then = CFAbsoluteTimeGetCurrent();
 for(i=0;i&lt;ITERATIONS;++i)
 {
 pthread_mutex_lock(&amp;mutex);
 pthread_mutex_unlock(&amp;mutex);
 }
 now = CFAbsoluteTimeGetCurrent();
 printf("pthread_mutex: %f sec\n", now-then);

 then = CFAbsoluteTimeGetCurrent();
 for(i=0;i&lt;ITERATIONS;++i)
 {
 OSSpinLockLock(&amp;spinlock);
 OSSpinLockUnlock(&amp;spinlock);
 }
 now = CFAbsoluteTimeGetCurrent();
 printf("OSSpinlock: %f sec\n", now-then);

 id obj = [NSObject new];

 then = CFAbsoluteTimeGetCurrent();
 for(i=0;i&lt;ITERATIONS;++i)
 {
 @synchronized(obj)
 {
 }
 }
 now = CFAbsoluteTimeGetCurrent();
 printf("@synchronized: %f sec\n", now-then);

 [pool release];
 return 0;
}</pre>
<p>We do 5 tests:  We test NSLock, NSLock with IMP caching, pthread mutexes, OSSpinLocks, and then finally @synchronized.  We simply lock and unlock 33554432 times (that&#8217;s 1024*1024*32 for those keeping score at home <img src='http://perpendiculo.us/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> , and see how long it takes.  No allocation, no releases, no autorelease pools, nothing.  Just pure lock/unlock goodness.  I ran the test a few times, and averaged the results (so overall, the results are from something like 100 million lock/unlock cycles each)</p>
<ol>
<li>NSLock: 3.5175 sec</li>
<li>NSLock+IMP Cache: 3.1165 sec</li>
<li>Mutex: 1.5870 sec</li>
<li>SpinLock: 1.0893</li>
<li>@synchronized: 9.9488 sec</li>
</ol>
<div class="wp-caption alignnone" style="width: 629px"><img title="Lock Performance" src="http://perpendiculo.us/wp-content/uploads/2009/09/LockPerformance.png" alt="Lock Performance" width="619" height="265" /><p class="wp-caption-text">Lock Performance</p></div>
<p>From the above graph, we can see a couple thing:  First, @synchronized is _Really_ expensive &#8212; like, 3 times as expensive as anything else.  We&#8217;ll get into why that is in a moment.  Otherwise, we see that NSLock and NSLock+IMP Cache are pretty close &#8212; these are built on top of pthread mutexes, but we have to pay for the extra ObjC overhead.  Then there&#8217;s Mutex (pthread mutexes) and SpinLock &#8212; these are pretty close, but even then SpinLock is almost 30% faster than Mutex.  We&#8217;ll get into that one too.  So from top to bottom we have almost an order of magnitude difference between the worst and best.</p>
<p>The nice part about these all is that they all take about the same amount of code &#8212; using NSLock takes as many lines as a pthread mutex, and the same number for a spinlock.  @synchronized saves a line or two, but with a cost like that it quickly looks unappealing in all but the most trivial of cases.</p>
<p>So, what makes @sychronized and SpinLock so different from the others?</p>
<p>@synchronized is very heavy weight because it has to set up an exception handler, and it actually ends up taking a few internal locks on its way there.  So instead of a simple cheap lock, you&#8217;re paying for a couple locks/unlocks just to acquire your measly lock.  Those take time.</p>
<p>OSSpinLock, on the other hand, doesn&#8217;t even enter the kernel &#8212; it just keeps reloading the lock, hoping that it&#8217;s unlocked.  This is terribly inefficient if locks are held for more than a few nanoseconds, but it saves a costly system call and a couple context switches.  Pthread mutexes actually use an OSSpinLock first, to keep things running smoothly where there&#8217;s no contention.  When there is, it resorts to heavier, kernel-level locking/tasking stuff.</p>
<p>So, if you&#8217;ve got hotly-contested locks, OSSpinLock probably isn&#8217;t for you (unless your critical sections are _Really_ _Fast_).  Pthread mutexes are a tiny bit more expensive, but they avoid the power-wasting effects of OSSpinLock.</p>
<p>NSLock is a pretty wrapper on pthread mutexes.  They don&#8217;t provide much else, so there&#8217;s not much point in using them over pthread mutexes.</p>
<p>Of course, standard optimization disclaimers apply:  don&#8217;t do it until you&#8217;re sure you&#8217;ve chosen the correct algorithms, have profiled to find hotspots, and have found locking to be one of those hot items.  Otherwise, you&#8217;re wasting your time on something that&#8217;s likely to provide minimal benefits.</p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=133</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CollegeAdvantage referral bonus again</title>
		<link>http://perpendiculo.us/?p=140</link>
		<comments>http://perpendiculo.us/?p=140#comments</comments>
		<pubDate>Sat, 12 Sep 2009 17:33:53 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=140</guid>
		<description><![CDATA[Last year, CollegeAdvantage&#8217;s 529 college savings plan offered a referral bonus where those opening new accounts, as well as those referring them, received $25 bonuses.  They&#8217;ve recently reinstated that bonus, so I figured I&#8217;d post some shameless self-promoting in the hopes of getting y&#8217;all some free college savings (for kids, etc), and get some for [...]]]></description>
			<content:encoded><![CDATA[<p>Last year, CollegeAdvantage&#8217;s 529 college savings plan offered a referral bonus where those opening new accounts, as well as those referring them, received $25 bonuses.  They&#8217;ve recently reinstated that bonus, so I figured I&#8217;d post some shameless self-promoting in the hopes of getting y&#8217;all some free college savings (for kids, etc), and get some for me as well (for our future children).<span id="more-140"></span>Signing up is quick and easy, and it only requires $25.  You can enroll at <a title="CollegeAdvantage" href="http://www.collegeadvantage.com/cms.aspx?SectionID=68" target="_blank">http://www.collegeadvantage.com/cms.aspx?SectionID=68</a> &#8212; be sure to use &#8220;2439614&#8243; as your referral number (that&#8217;s me).</p>
<p>Once opened, you can put money aside for college expenses later on while benefiting from some serious tax advantages.  You can read more about them on wikipedia: <a title="529 Plan on wikipedia" href="http://en.wikipedia.org/wiki/529_plan" target="_blank">http://en.wikipedia.org/wiki/529_plan</a>.  CollegeAdvantage offers a bunch of funds, and they&#8217;re generally quite high-quality funds.  With a 529, you can change the beneficiary at any time, so you can open it with yourself as the designated recipient (I&#8217;ve done that, as we have no kids right now), and change it later (once kids arrive).  Or, if you have kids now, you can set it to them today.</p>
<p>This offer&#8217;s good till December 15th, 2009, and requires an initial deposit of $25.  You&#8217;ll receive your $25 bonus shortly thereafter.</p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=140</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>alloc, allocWithZone showdown</title>
		<link>http://perpendiculo.us/?p=134</link>
		<comments>http://perpendiculo.us/?p=134#comments</comments>
		<pubDate>Sat, 05 Sep 2009 12:14:29 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=134</guid>
		<description><![CDATA[This brief post will explore some more mundane (but still measurable!) aspects of optimizing Objective-C software.  This time around, we&#8217;ll talk about object allocation.When creating objects in Objective-C, there are approximately 5 ways about it.  Each one has a slightly different execution time cost, executable size cost, and line-of-code-to-write cost.  For this, I&#8217;m mostly going [...]]]></description>
			<content:encoded><![CDATA[<p>This brief post will explore some more mundane (but still measurable!) aspects of optimizing Objective-C software.  This time around, we&#8217;ll talk about object allocation.<span id="more-134"></span>When creating objects in Objective-C, there are approximately 5 ways about it.  Each one has a slightly different execution time cost, executable size cost, and line-of-code-to-write cost.  For this, I&#8217;m mostly going to explore time and lines-of-code costs.</p>
<p>These methods look like the following (one method per line):</p>
<pre>id anObject = [NSObject new]; // 3 total messages
id anObject = [[NSObject alloc] init]; // 3 total messages
id anObject = [[NSObject allocWithZone:NULL] init]; // 2 total messages
id anObject = [anotherObject copy]; // 3 total messages
id anObject = [anotherObject copyWithZone:NULL]; // 2 total messages</pre>
<p>(There&#8217;s also a lower-level objc-runtime-only set of function calls to accomplish the above, but that&#8217;s more verbose than most people are willing to go &#8212; if you have to drop to that level for acceptable performance, your object model is fundamentally wrong, and should be revised.)</p>
<p>These all accomplish the same thing, with varying levels of performance.  -copy and -copyWithZone won&#8217;t actually work in the above because NSObject doesn&#8217;t implement copy.  But the top three can illustrate some interesting characteristics.</p>
<p>Here&#8217;s our breakdown (I did a few runs and averaged them &#8212; I ran the microbench on a 1.83GHz Core2 Duo MacBook running Mac OS X 10.5.8, as well as on a 2.4GHz Core2 Duo MacBook Pro running Mac OS X 10.6.0):</p>
<p>10.5.8/1.83GHz (32 bit):<br />
new time: 8.413659 (420.682949ns/iteration)<br />
alloc/init time: 8.282779 (414.138952ns/iteration)<br />
allocWithZone/init time: 8.030380 (401.519001ns/iteration)</p>
<p>10.6.0/2.4GHz (32 bit):</p>
<p>new time: 6.439551 (321.977550ns/iteration)<br />
alloc/init time: 6.351412 (317.570600ns/iteration)<br />
allocWithZone/init time: 5.972713 (298.635650ns/iteration)</p>
<p>In 64bit mode, we have much better performance all around:</p>
<p>10.5.8/1.83GHz (64bit):</p>
<p>new time: 5.590315 (279.515749ns/iteration)<br />
alloc/init time: 5.339382 (266.969100ns/iteration)<br />
allocWithZone/init time: 5.173148 (258.657402ns/iteration)</p>
<p>10.6.0/2.4GHz (64bit):</p>
<p>new time: 4.058811 (202.940547ns/iteration)<br />
alloc/init time: 3.873028 (193.651399ns/iteration)<br />
allocWithZone/init time: 3.846562 (192.328098ns/iteration)</p>
<p>(yes, that&#8217;s right &#8212; object allocation in 64bits on Leopard at 1.83GHz is faster than 32bit allocations on Snow Leopard at 2.4GHz &#8212; that&#8217;s actually quite impressive evidence for switching to 64bit sooner, rather than later <img src='http://perpendiculo.us/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Since we have different underlying hardware and operating systems, we aren&#8217;t going to generalize between the two sets, except to say that 2.4GHz/Snow Leopard is faster than 1.83GHz Leopard (a big surprise &#8212; 2.4 is simply faster all around, even without Snow Leopard).</p>
<p>In both 32 bit cases, doing the alloc/init dance gives a tiny ~1.5% performance improvement.  More interestingly, allocWithZone/init give a 4.5%-7% performance improvement.  In cases where lots of objects are getting allocated frequently, switching to allocWithZone instead of new can provide some slight but measurable benefits.   (Note that we didn&#8217;t profile other object classes in this test &#8212; we just wanted to test the raw allocation overhead.  In real life, this overhead is still present, but generally lower-weight.  Our optimization above will still reduce that overhead.)</p>
<p>64 Bit isn&#8217;t quite as dramatic, but the effect is still present &#8212; -WithZone is still faster by a tiny margin.</p>
<p>-copy and -copyWithZone have a relationship similar to alloc/allocWithZone, in that copy simply calls copyWithZone:NULL (alloc simply calls allocWithZone:NULL) &#8211; the benefit of using the -WithZone: variant up front is that you&#8217;re skipping the extra message-send (saving a few cache lines, some stack, and some potential objective-c method lookups).</p>
<p>It may seem like 10-20ns isn&#8217;t a very long time (and indeed, 20 billionths of a second is pretty darn short in human terms), but in heavy allocation contexts this can add up very quickly.  Stacking up with thousands of allocations, it&#8217;s possible to shave off larger portions of time employing this.  For no additional lines of code, a tiny overhead reduction can be a pleasant pick-me-up for almost no additional effort.</p>
<p>Note that with all optimization, it should only take place after working code has been written and profiled to locate hotspots, and after optimal (or at least somewhat tuned) algorithms have been employed (optimizing one-time init code generally doesn&#8217;t help anything, optimizing cold code doesn&#8217;t help much, and using fast slow algorithms is worse than using slow fast algorithms almost all of the time.)</p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=134</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pricey ivars</title>
		<link>http://perpendiculo.us/?p=116</link>
		<comments>http://perpendiculo.us/?p=116#comments</comments>
		<pubDate>Thu, 02 Jul 2009 04:20:19 +0000</pubDate>
		<dc:creator>cwright</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://perpendiculo.us/?p=116</guid>
		<description><![CDATA[(apologies in advance &#8212; this will be a rather technical post.  It&#8217;s eventually intended for fdiv.net, but while it&#8217;s getting migrated I figured I&#8217;d plop it down here.) Instance Variables (or, more briefly, ivars) are a pretty simple trend in Object Oriented Languages.  They&#8217;re data that get carried along inside an object, helping to define [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>(apologies in advance &#8212; this will be a rather technical post.  It&#8217;s eventually intended for fdiv.net, but while it&#8217;s getting migrated I figured I&#8217;d plop it down here.)</p>
<p>Instance Variables (or, more briefly, ivars) are a pretty simple trend in Object Oriented Languages.  They&#8217;re data that get carried along inside an object, helping to define its state.  While it would be fun to elaborate more on this, this particular article isn&#8217;t intended to teach the basics of OOP.  So if you&#8217;re unsure of what an ivar is, this article probably isn&#8217;t for you.<span id="more-116"></span></p>
<p>In Objective-C, (and possibly other languages), ivars have an interesting property &#8212; they&#8217;re <code>volatile</code>.  <code>volatile</code> is one of those shady regions of C and C-derived languages (like Objective-C) that few people talk about, but what it means is that the data behind the variable can change at any moment, so the compiler should reload it from memory whenever it&#8217;s used (the alternative is to load it once, and store it in a register &#8212; this is typically simpler and faster, but sometimes wrong if the data is in fact volatile).  This is all well and good, and it performs exactly as intended.  However, because of this reloading, we get some negative performance characteristics to go along with it.</p>
<p>To demonstrate, we&#8217;ll have this chunk of code:</p>
<pre><span style="color: #993366;">#import</span> <span style="color: #ff0000;">&lt;Cocoa/Cocoa.h&gt;</span></pre>
<pre><span style="color: #800000;">#define</span> ELEMENTS (<span style="color: #0000ff;">1024</span>*<span style="color: #0000ff;">1024</span>*<span style="color: #0000ff;">4</span>)
<span style="color: #800000;">#define</span> ROUNDS (<span style="color: #0000ff;">50</span>)</pre>
<pre><span style="color: #ff00ff;">typedef struct</span> Vertex
{
<span style="color: #ff00ff;">	float</span> x, y, z;
} Vertex;</pre>
<pre><span style="color: #ff00ff;">@interface</span> PointCloud : NSObject
{
	Vertex *vertices;
	Vertex min, max;
<span style="color: #ff00ff;">	unsigned int</span> count;
}</pre>
<pre>-(id)initWithCount:(unsigned int)count;</pre>
<pre>-(void)calculateMaxMin1;
-(void)calculateMaxMin2;
-(void)calculateMaxMin3;</pre>
<pre><span style="color: #ff00ff;">@end</span></pre>
<pre><span style="color: #ff00ff;">@implementation</span> PointCloud
-(id)initWithCount:(unsigned int)pCount
{
	if(self = [super init])
	{
		count = pCount;
		vertices = (Vertex*)malloc(sizeof(Vertex)*count);</pre>
<pre><span style="color: #ff00ff;">		unsigned int</span> i;
		for(i=<span style="color: #0000ff;">0</span>;i&lt;count;++i)
		{
			vertices[i].x = <span style="color: #0000ff;">4.</span> * sinf(i);
			vertices[i].y = <span style="color: #0000ff;">4.</span> * cosf(i);
			vertices[i].z = <span style="color: #0000ff;">4.</span> * sinf(i)+cosf(i);
		}
	}
	return self;
}</pre>
<pre>-(void)calculateMaxMin1
{
	if(count == <span style="color: #0000ff;">0</span>)
		return;</pre>
<pre>	min.x = max.x = vertices[<span style="color: #0000ff;">0</span>].x;
	min.y = max.y = vertices[<span style="color: #0000ff;">0</span>].y;
	min.z = max.z = vertices[<span style="color: #0000ff;">0</span>].z;</pre>
<pre><span style="color: #ff00ff;">	unsigned int</span> i;</pre>
<pre>	for(i = <span style="color: #0000ff;">1</span>; i &lt; count; ++i)
	{
		if(vertices[i].x &gt; max.x)
			max.x = vertices[i].x;
		if(vertices[i].y &gt; max.y)
			max.y = vertices[i].y;
		if(vertices[i].z &gt; max.z)</pre>
<pre>			max.z = vertices[i].z;

		if(vertices[i].x &lt; min.x)
			min.x = vertices[i].x;
		if(vertices[i].y &lt; min.y)
			min.y = vertices[i].y;
		if(vertices[i].z &lt; min.z)
			min.z = vertices[i].z;
	}
}</pre>
<pre>-(void)calculateMaxMin2
{
	if(count == <span style="color: #0000ff;">0</span>)
		return;</pre>
<pre>	min.x = max.x = vertices[<span style="color: #0000ff;">0</span>].x;
	min.y = max.y = vertices[<span style="color: #0000ff;">0</span>].y;
	min.z = max.z = vertices[<span style="color: #0000ff;">0</span>].z;</pre>
<pre><span style="color: #ff00ff;">	unsigned int</span> i, lCount = count;</pre>
<pre>	for(i = <span style="color: #0000ff;">1</span>; i &lt; lCount; ++i)
	{
		if(vertices[i].x &gt; max.x)
			max.x = vertices[i].x;
		if(vertices[i].y &gt; max.y)
			max.y = vertices[i].y;
		if(vertices[i].z &gt; max.z)
			max.z = vertices[i].z;</pre>
<pre>		if(vertices[i].x &lt; min.x)
			min.x = vertices[i].x;
		if(vertices[i].y &lt; min.y)
			min.y = vertices[i].y;
		if(vertices[i].z &lt; min.z)
			min.z = vertices[i].z;
	}
}</pre>
<pre>-(void)calculateMaxMin3
{
	if(count == <span style="color: #0000ff;">0</span>)
		return;</pre>
<pre></pre>
<pre><span style="color: #ff00ff;">	unsigned int</span> i, lCount = count;
	Vertex *verts = vertices;
	Vertex lMax = verts[0], lMin = verts[0];</pre>
<pre>	for(i = <span style="color: #0000ff;">1</span>; i &lt; lCount; ++i)
	{
		if(verts[i].x &gt; lMax.x)
			lMax.x = verts[i].x;
		if(verts[i].y &gt; lMax.y)
			lMax.y = verts[i].y;
		if(vertices[i].z &gt; lMax.z)
			lMax.z = verts[i].z;</pre>
<pre>		if(verts[i].x &lt; lMin.x)
			lMin.x = verts[i].x;
		if(verts[i].y &lt; lMin.y)
			lMin.y = verts[i].y;
		if(verts[i].z &lt; lMin.z)
			lMin.z = verts[i].z;
	}
	max = lMax;
	min = lMin;
}
<span style="color: #ff00ff;">@end</span></pre>
<pre><span style="color: #ff00ff;">int</span> main()
{
<span style="color: #ff00ff;">	unsigned int</span> i;
<span style="color: #ff00ff;">	double</span> then, now;</pre>
<pre>	printf(<span style="color: #ff0000;">"Initializing Object...\n"</span>);</pre>
<pre>	PointCloud *p = [[PointCloud alloc] initWithCount:ELEMENTS];</pre>
<pre>	printf(<span style="color: #ff0000;">"Performing tests...\n"</span>);</pre>
<pre>	then = CFAbsoluteTimeGetCurrent();
	for(i=<span style="color: #0000ff;">0</span>;i&lt;ROUNDS;++i)
		[p calculateMaxMin1];
	now = CFAbsoluteTimeGetCurrent();
	printf(<span style="color: #ff0000;">"Method1: %f\n"</span>, (now-then)/ROUNDS);</pre>
<pre>	then = CFAbsoluteTimeGetCurrent();
	for(i=<span style="color: #0000ff;">0</span>;i&lt;ROUNDS;++i)
		[p calculateMaxMin2];</pre>
<pre>	now = CFAbsoluteTimeGetCurrent();
	printf(<span style="color: #ff0000;">"Method2: %f\n"</span>, (now-then)/ROUNDS);

	then = CFAbsoluteTimeGetCurrent();
	for(i=<span style="color: #0000ff;">0</span>;i&lt;ROUNDS;++i)
		[p calculateMaxMin3];
	now = CFAbsoluteTimeGetCurrent();
	printf(<span style="color: #ff0000;">"Method3: %f\n"</span>, (now-then)/ROUNDS);</pre>
<pre>	return <span style="color: #0000ff;">0</span>;
}</pre>
<p>So, in a nutshell:  we&#8217;ve got a Vertex structure, which holds an x, y, and z coordinate.  Then, we have a PointCloud object, which holds a bunch of points.  It also stores a max vertex, and a min vertex.  These would be used to create crude axis-aligned bounding boxes, or for set normalization or something.</p>
<p>And our task is finding the maximum and minimum x, y, and z values in the cloud.  (bonus points for using the phrase &#8220;in the cloud&#8221; without using buzzwords <img src='http://perpendiculo.us/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .  The algorithm to do this is pretty simple:  set our max/min value to the initial point in the cloud, and then check every other point to see if it&#8217;s higher than our max, or lower than our min.  If it is, set the max or min as needed, and continue.  In Computer Science class, they&#8217;ll tell you that this algorithm is O(N), which doesn&#8217;t suck too severely.  I believe it&#8217;s the optimal way to solve this problem (for an unordered set, at least), but I could be mistaken.</p>
<p>Anyway, to solve our problem, I have supplied 3 example methods:  calculateMaxMin1, calculateMaxMin2, and calculateMaxMin3.  They all follow the same pattern, but they have measurably different performance characteristics.</p>
<p>Let&#8217;s talk about these methods.  The first method uses only ivars.  The loop checks against count (an ivar), and it updates max/min (ivars) based on the vertices array (also an ivar).</p>
<p>The second method, calculateMaxMin2, is basically the same as calculateMaxMin1, except that instead of vertexCount, it stores the count locally in a local variable called &#8220;lCount&#8221;.</p>
<p>The third and final method, calculateMaxMin3, still follows the above pattern, but uses local variables of count, max/min, and the vertices array.</p>
<p>A casual glance at the code would lead one to believe that the running time for any of the three methods would be similar (with the last one possibly taking a teeny-tiny bit longer because it&#8217;s doing a couple more copies&#8230; probably immeasurably small though, as in 15 nanoseconds tops).  Instead of guessing though, we&#8217;ll run it and see what happens.</p>
<p>Running the program (for 4,194,304 points), we get this as the output:</p>
<p style="padding-left: 30px;">cwright@phendrana:~/projects/ivar&gt;./ivar<br />
Initializing Object&#8230;<br />
Performing tests&#8230;<br />
Method1: 0.043917<br />
Method2: 0.038990<br />
Method3: 0.024590</p>
<p>Interpreting the results, we see that Method1 takes about 0.044 seconds to run, Method2 takes a little less, at 0.039 seconds, and Method3 takes a stunning 0.025 seconds to complete the exact same task.</p>
<p>Why does Method3 take so much less time?  It takes just 55% as long as Method1, yet it&#8217;s doing the same amount of work!</p>
<p>The answer lies in how the compiler/optimizer treats ivars compared to local variables.</p>
<p>Let&#8217;s pull out some assembly output to see what&#8217;s happening behind the scenes.  (This is going to get a bit more complicated, but don&#8217;t worry!)</p>
<p>calculateMaxMin1&#8242;s inner loop assembly code looks like this:</p>
<p><span style="color: #0000ff;">+59    00001a2d  f30f100401              movss          (%ecx,%eax),%xmm0<br />
+64    00001a32  0f2e4214                  ucomiss      0&#215;14(%edx),%xmm0</span><br />
<span style="color: #ff0000;"> +68    00001a36  7605                      jbe          0x00001a3d</span><br />
<span style="color: #0000ff;">+70    00001a38  f30f114214              movss          %xmm0,0&#215;14(%edx)</span> (Vertex)max<br />
<span style="color: #0000ff;">+75    00001a3d  f30f10440104              movss          0&#215;04(%ecx,%eax),%xmm0</span><br />
<span style="color: #0000ff;">+81    00001a43  0f2e4218                  ucomiss      0&#215;18(%edx),%xmm0</span><br />
<span style="color: #ff0000;">+85    00001a47  7605                      jbe          0x00001a4e</span><br />
<span style="color: #0000ff;">+87    00001a49  f30f114218              movss          %xmm0,0&#215;18(%edx)</span><br />
<span style="color: #0000ff;">+92    00001a4e  f30f10440108              movss          0&#215;08(%ecx,%eax),%xmm0</span><br />
<span style="color: #0000ff;">+98    00001a54  0f2e421c                  ucomiss      0x1c(%edx),%xmm0</span><br />
<span style="color: #ff0000;">+102    00001a58  7605                      jbe          0x00001a5f</span><br />
<span style="color: #0000ff;">+104    00001a5a  f30f11421c              movss          %xmm0,0x1c(%edx)</span><br />
<span style="color: #0000ff;">+109    00001a5f  f30f100c01              movss          (%ecx,%eax),%xmm1</span><br />
<span style="color: #0000ff;">+114    00001a64  f30f104208              movss          0&#215;08(%edx),%xmm0</span> (Vertex)min<br />
+119    00001a69  0f2ec1                  ucomiss      %xmm1,%xmm0<br />
<span style="color: #ff0000;">+122    00001a6c  7605                      jbe          0x00001a73</span><br />
<span style="color: #0000ff;">+124    00001a6e  f30f114a08              movss          %xmm1,0&#215;08(%edx)</span> (Vertex)min<br />
<span style="color: #0000ff;">+129    00001a73  f30f104c0104              movss          0&#215;04(%ecx,%eax),%xmm1</span><br />
<span style="color: #0000ff;">+135    00001a79  f30f10420c              movss          0x0c(%edx),%xmm0</span><br />
+140    00001a7e  0f2ec1                  ucomiss      %xmm1,%xmm0<br />
<span style="color: #ff0000;">+143    00001a81  7605                      jbe          0x00001a88</span><br />
<span style="color: #0000ff;">+145    00001a83  f30f114a0c              movss          %xmm1,0x0c(%edx)</span><br />
<span style="color: #0000ff;">+150    00001a88  f30f104c0108              movss          0&#215;08(%ecx,%eax),%xmm1</span><br />
<span style="color: #0000ff;">+156    00001a8e  f30f104210              movss          0&#215;10(%edx),%xmm0</span><br />
+161    00001a93  0f2ec1                  ucomiss      %xmm1,%xmm0<br />
<span style="color: #ff0000;">+164    00001a96  7605                      jbe          0x00001a9d</span><br />
<span style="color: #0000ff;">+166    00001a98  f30f114a10              movss          %xmm1,0&#215;10(%edx)</span><br />
+171    00001a9d  43                      incl          %ebx<br />
+172    00001a9e  83c00c                  addl          $0x0c,%eax<br />
<span style="color: #0000ff;">+175    00001aa1  3b5a20                  cmpl          0&#215;20(%edx),%ebx</span> (unsigned int)count<br />
+178    00001aa4  7287                      jb          0x00001a2d</p></blockquote>
<p>Note all the jumps (jbe, highlighted in red), the overall length (+59 to +180 bytes, or 121 bytes in total (180-59)), and how our ivars are loaded/stored (movss (%ecx,%eax),%xmm0 loads max.x, for example, highlighted in blue).</p>
<p>Now let&#8217;s contrast it with calculateMaxMin3&#8242;s inner loop:</p>
<blockquote><p><span style="color: #0000ff;">+62    00001ba2  f30f10500c              movss          0x0c(%eax),%xmm2</span><br />
+67    00001ba7  0f28c2                  movaps      %xmm2,%xmm0<br />
+70    00001baa  f30f5fc5                  maxss          %xmm5,%xmm0<br />
+74    00001bae  0f28e8                  movaps      %xmm0,%xmm5<br />
<span style="color: #0000ff;">+77    00001bb1  f30f104810              movss          0&#215;10(%eax),%xmm1</span><br />
+82    00001bb6  0f28d9                  movaps      %xmm1,%xmm3<br />
+85    00001bb9  f30f5fdc                  maxss          %xmm4,%xmm3<br />
+89    00001bbd  0f28e3                  movaps      %xmm3,%xmm4<br />
<span style="color: #0000ff;">+92    00001bc0  f30f104014              movss          0&#215;14(%eax),%xmm0</span><br />
+97    00001bc5  0f28d8                  movaps      %xmm0,%xmm3<br />
<span style="color: #0000ff;">+100    00001bc8  f30f5f5df4              maxss          0xf4(%ebp),%xmm3<br />
+105    00001bcd  f30f115df4              movss          %xmm3,0xf4(%ebp)<br />
+110    00001bd2  f30f5d55f8              minss          0xf8(%ebp),%xmm2<br />
+115    00001bd7  f30f1155f8              movss          %xmm2,0xf8(%ebp)</span><br />
+120    00001bdc  f30f5dcf                  minss          %xmm7,%xmm1<br />
+124    00001be0  0f28f9                  movaps      %xmm1,%xmm7<br />
+127    00001be3  f30f5dc6                  minss          %xmm6,%xmm0<br />
+131    00001be7  0f28f0                  movaps      %xmm0,%xmm6<br />
+134    00001bea  42                      incl          %edx<br />
+135    00001beb  83c00c                  addl          $0x0c,%eax<br />
+138    00001bee  39da                      cmpl          %ebx,%edx<br />
+140    00001bf0  72b0                      jb          0x00001ba2</p></blockquote>
<p>No jump instructions, except for the loop&#8217;s jb at the end.  Size is +62 to +142 (80 bytes, almost 30% smaller!), and most of our variables live in registers (note how xmm0-xmm7 are all utilized, and %ebp (stack spillage) is only used twice, at 0xf4, and 0xf8.  Aside from the blue highlighted load/store lines, everything lives in a register, ready for screaming fast access (granted, most ivars will live in L1, so they&#8217;re only half as fast as registers, but as we see from the profiling above, that half-speed hit is actually measurable, even from L1!).</p>
<p>So, let&#8217;s analyze this a bit.  Our loop code&#8217;s smaller (so it can complete about 33% more loops per unit of time).  But that&#8217;s only part of the benefit.  We also get no jumps.  No jumps means no mis-predicted jumps, which helps the CPU pipeline stay loaded.  And finally, we have almost no memory loading except for the vertex data itself (and a tiny bit of stack spillage).  This means more CPU-RAM (or CPU-L1 Cache even) bandwidth is spent on loading relevant data, instead of reloading/storing ivars all the time.</p>
<p>Why are ivars implemented this way?  Because a method needs to work on fresh data at all times &#8212; if another thread comes along and changes an ivar on the object, it wouldn&#8217;t make sense to keep using stale data.  Local variables, in contrast, aren&#8217;t at all likely to get modified by another thread (you can do it, but it&#8217;s a rather contrived example).  As such, the compiler knows exactly when a value changes, and can let it live on the cpu for as long as possible, for optimum speed.</p>
<p>Disclaimer stuff:  This is a silly problem to solve, but it illustrates the point nicely.  There are better solutions (using SSE to check 4 values at once, and then coalescing at the end, for example), but that&#8217;s not the point.  Here we see how we can speed up (by almost a factor of 2!) loop intensive code by simply avoiding ivars in heavily trafficked loops.</p>
<p>Compiler command line to build:   gcc-4.2 ivar.m -o ivar -Os -g -framework Cocoa</p>
]]></content:encoded>
			<wfw:commentRss>http://perpendiculo.us/?feed=rss2&amp;p=116</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
