gcc optimization case study

January 4, 2011

I’ve used gcc for years, with varying levels of optimization… then the other day, I got really bored and started experimenting to see if gcc optimization flags really matter at all.

Trivial Example

It’s probably not the best example I could’ve come up with, but I hammered out the following code as a tiny and trivial test case.


/*
 * bar.c - trivial gcc optimization test case
 */

#include <stdio.h>

int main(int argc, char **argv)
{
	long long i, j, count;

	j = 1;
	count = (100*1000*1000LL);	/* 100 million iterations */
	for (i = 0; i < count; i++)
		j = j << 1;
	printf("j = %lld\n", j);
	return 0;
}

The Makefile is pretty straightforward, nothing super interesting here:


CC=gcc
CFLAGS=-O0
OBJS=bar.o
TARGET=foo

$(TARGET): $(OBJS)
	$(CC) $(CFLAGS) -o $(TARGET) $(OBJS)

clean:
	-rm -f $(OBJS) $(TARGET) *~
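
Side note: since GNU make lets you override variables from the command line, there’s no need to edit the Makefile to switch optimization levels; something like this does the trick:

$ make clean && make CFLAGS="-O2 -funroll-loops"
$ time ./foo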

It’s pretty boring code that loops 100 million times doing a left shift. Ignoring the fact that the result is useless, I then built it with a few different gcc flags, with the following results. Runtime is in seconds, the assembly was generated with objdump -S, and I’m only showing the loop code below.

gcc flags           runtime (avg of 5)  loop disassembly
-O0                 0.2314 sec          29: shlq -0x10(%rbp)
                                        2d: addq $0x1,-0x8(%rbp)
                                        32: mov  -0x8(%rbp),%rax
                                        36: cmp  -0x18(%rbp),%rax
                                        3a: jl   29
-O1                 0.078 sec           e:  add  %rsi,%rsi
                                        11: add  $0x1,%rax
                                        15: cmp  $0x5f5e100,%rax
                                        1b: jne  e
-O2                 0.077 sec           same loop as -O1
-O3                 0.077 sec           same loop, different return
-Os                 0.072 sec           7:  inc  %rax
                                        a:  add  %rsi,%rsi
                                        d:  cmp  $0x5f5e100,%rax
                                        13: jne  7
-O2 -funroll-loops  0.014 sec           10: add  $0x8,%rax
                                        14: shl  $0x8,%rsi
                                        18: cmp  $0x5f5e100,%rax
                                        1e: jne  10

Runtime measured with “time ./foo” on a Core i7-920 quad-core CPU with 6 GB of DDR3 memory.

Interestingly enough, the fastest gcc options for this useless code sample are -O2 -funroll-loops. It’s faster because gcc performs an 8-way unroll and therefore does approximately 1/8th the loop work: in this trivial example, it literally replaces 8 left-shift-by-one operations with a single left-shift-by-8.
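
In C terms, the transformation amounts to something like this (a rough hand-written sketch; gcc of course does this on the generated code, not the source):

/* Rough sketch of the 8-way unroll: eight shift-by-one iterations
 * collapse into one shift-by-eight per pass. */
for (i = 0; i < count; i += 8)
	j = j << 8;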

So that’s semi-interesting, but all I’ve proved so far is that gcc does in fact optimize, and that a trivial loop gets much faster when you let it.

Multi-threaded app

I was curious to see how this plays out in non-trivial programs, and I had some code at work that needed this kind of analysis anyway, so I gave it a whirl on my multi-threaded app. I can’t share the source code or the disassembly here, but I saw some very odd results that I’m honestly not 100% sure how to explain.

flags               runtime (approx)
-O0                 8.5 sec
-O2                 12.5 sec
-O3                 13 sec
-Os                 7.9 sec
-Os -march=core2    7.7 sec

What I find really interesting is that with -O2 and -O3, the application runtime gets WORSE instead of better. My best guess is that with -O2’s aggressive inlining, the code size blows up and the instruction cache hit rate suffers. I haven’t investigated it and probably won’t take the time, but I found it rather fascinating to see such a swing from compiler flags alone.
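
If you want to sanity-check the code-size theory on a binary of your own, the size utility from binutils gives a quick read (bar.c and foo here are just the names from the trivial example above):

$ gcc -O1 -o foo bar.c && size foo
$ gcc -O3 -o foo bar.c && size foo   # compare the text column between the two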

FIO benchmark

Anyone who has visited my blog before probably knows that I’m a big fan of Jens Axboe’s fio benchmark. I decided to run a similar comparison, using fio to benchmark /dev/ram0.

I’m using fio 1.44.2, compiled from source on my local Ubuntu 10.04 amd64 system.


$ sudo ./fio --bs=4k --direct=1 --filename=/dev/ram0 \
    --numjobs=4 --iodepth=8 --ioengine=libaio --group_reporting \
    --time_based --runtime=60 --rw=randread --name=randread

There’s no real rhyme or reason to the workload I chose (4k blocks, 4 jobs, iodepth=8, etc.), I just wanted something with a few threads and some outstanding IO to have a good chance of hitting high bandwidth.

flags               bandwidth (MB/s)  avg latency (us)
-O0                 1208              1.04
-O1                 1524              0.83
-O2                 1645              13.95 or 0.78 (very odd)
-O3                 1676              0.76
-Os                 1543              3.29
-O2 -funroll-loops  1667              0.77 or 13.33 (odd again)

Summary

With three very different examples, I get three very different sets of results.

In my trivial, stupid for loop example, unrolling the loop made a lot of sense and using -O2 didn’t matter a whole lot.

In my multi-threaded app, the higher optimization levels actually made the runtime worse, while optimizing for size and targeting the architecture helped.

And with fio, the results are pretty much what you’d expect – higher levels of optimization make the benchmark faster.

What’s the conclusion?

Compiler flags matter, but like all optimizations, their usefulness and impact are highly workload- and application-dependent.

So… you just gotta try them all and see what’s best for your project. I know, killer conclusion, but hey… I just tell it like I see it.

Thanks for reading, and I’d love to hear your comments/feedback below.

atto


Measure always, optimize sparingly

April 19, 2010

At work today, I found myself falling into a classic coder mental trap, one that’s worth sharing.

We have an automated test suite – we call them our “sanity tests” – that we’re expected to run prior to checking code into the repository. It’s a quick validation run that helps keep the code moving in a positive direction: we know almost instantly if we’ve made changes that break the system, and we can back out bad code before anyone else is affected by it. It’s really the only sane way to code; I honestly don’t know how people and teams stay sane without… sanity tests.

Between each test run, we try to initialize the system back to a “known good” state, so failures from a previous test don’t result in false positives in later tests. The system “reset” routine destroys some intermediary data, and depending on the amount of data generated, it can take a while.

Or so I thought.

But I digress.

The total test time is very small, a few seconds on a nice Linux box and maybe half a minute on a slower Windows system. But for whatever reason, the time it takes to reset the system has been bugging me. I keep thinking to myself, if only I could shave the 8.37 seconds down to 4.83 seconds… that would be totally awesome.

So finally, I broke down and gave in to my obsessive compulsive nature and mentally geared up for a good half hour of debug/hack/burn to optimize the reset.

More than half an hour later, I put the finishing touches on the routine that destroys the intermediary data – because that was obviously the slowest, most annoying part.

Right?

Uh… no.

I literally smacked my own head in disgust.

I had fallen for the classic engineer “infinite optimization loop.” The special OCD place we all go where everything is open for tweaking, improving, and optimizing to death. The place where ROI is always infinite and there are no time constraints.

I went back and measured the reset routine, and the piece I had optimized accounted for only 1/3 of the reset time. My optimization was roughly a 50% speedup on that piece, which works out to 1/6 of the reset time: only about 0.3 seconds shaved off the best total system time.
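
Measuring first would have been cheap, too. Here’s a minimal sketch of the kind of wall-clock timing that would have pointed me at the right piece (reset_system() is a hypothetical stand-in for our real reset routine; link with -lrt on older glibc):

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the real reset routine under suspicion. */
static void reset_system(void)
{
	/* ... destroy intermediary data, etc. ... */
}

int main(void)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	reset_system();
	clock_gettime(CLOCK_MONOTONIC, &end);
	printf("reset took %.3f sec\n",
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}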

Holy cow, I’m a flippin’ rockstar… move over Elvis, here comes dork boy.

When I was in the middle of this grand epiphany, I remembered the sage advice of my uncle – a carpenter of 30+ years: “measure twice, cut once.”

In software engineering, we’re lucky because many physical constraints just don’t apply. There is no board, no saw. So there’s no reason we can’t cut a hundred times, measure, and go back and cut some more.

But the problem is, sometimes we get so far removed from physical constraints that we can all too easily believe there are none. Time, money, and schedules are at odds with the Utopian ideal of the engineer’s infinite optimization loop.

So I’d like to propose a motto for coders of all races, creeds, and compilers:

Measure always, optimize sparingly.

Atto
