
gcc optimization case study

January 4, 2011

I’ve used gcc for years, with varying levels of optimization… then the other day, I got really bored and started experimenting to see if gcc optimization flags really matter at all.

Trivial Example

It’s probably not the best example I could’ve come up with, but I hammered out the following code as a tiny and trivial test case.


/*
 * bar.c - trivial gcc optimization test case
 */
#include <stdio.h>

int main(int argc, char ** argv)
{
	long long i, j, count;

	j = 1;
	count = (100*1000*1000LL);
	for (i = 0; i < count; i++)
		j = j << 1;
	printf("j = %lld\n", j);

	return 0;
}

The Makefile is pretty straightforward, nothing super interesting here:


CC=gcc
CFLAGS=-O0
OBJS=bar.o
TARGET=foo

$(TARGET): $(OBJS)
	$(CC) $(CFLAGS) -o $(TARGET) $(OBJS)

clean:
	-rm -f $(OBJS) $(TARGET) *~

It’s pretty boring code that loops 100 million times doing a left shift. Ignoring the fact that the result is useless, I ran gcc with a few different flags and got the following results. Runtime is in seconds, the asm was generated with objdump -S, and only the loop code is shown below.

gcc flags            runtime (avg of 5)   loop disassembly
-O0                  0.2314 sec           29: shlq   -0x10(%rbp)
                                          2d: addq   $0x1,-0x8(%rbp)
                                          32: mov    -0x8(%rbp),%rax
                                          36: cmp    -0x18(%rbp),%rax
                                          3a: jl     29
-O1                  0.078 sec            e:  add    %rsi,%rsi
                                          11: add    $0x1,%rax
                                          15: cmp    $0x5f5e100,%rax
                                          1b: jne    e
-O2                  0.077 sec            same loop
-O3                  0.077 sec            same loop, different return
-Os                  0.072 sec            7:  inc    %rax
                                          a:  add    %rsi,%rsi
                                          d:  cmp    $0x5f5e100,%rax
                                          13: jne    8
-O2 -funroll-loops   0.014 sec            10: add    $0x8,%rax
                                          14: shl    $0x8,%rsi
                                          18: cmp    $0x5f5e100,%rax
                                          1e: jne    10

Runtime was measured with “time ./foo” on a Core i7-920 quad-core CPU with 6 GB of DDR3 memory.
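Something like the following is enough to rerun any row of the table (the exact CFLAGS vary per row; a CFLAGS value passed on the make command line overrides the -O0 default in the Makefile):


$ make clean && make CFLAGS="-O2 -funroll-loops"
$ time ./foo
$ objdump -S foo | less     # find the loop inside main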

Interestingly enough, the fastest gcc compile options for this useless code sample are -O2 and -funroll-loops. It is faster because the 8-way unroll does approximately 1/8th the work: in this trivial example, it literally replaces eight shift-left-by-one operations with a single shift-left by 8.
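To put that back into C terms, here is a hand-written sketch of what the loop in bar.c effectively becomes (reusing the variables from the listing above; this is an illustration, not the compiler’s literal output):


	/* rough equivalent of the -O2 -funroll-loops loop */
	for (i = 0; i < count; i += 8)
		j = j << 8;	/* eight shift-by-one steps fused into one shift-by-eight */

This only divides evenly because count (100 million) is a multiple of 8; in general the compiler also has to emit fix-up code for any leftover iterations.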

So that’s semi-interesting, but all I’ve proved so far is that gcc does in fact optimize, and that a trivial loop runs much faster when it does.

Multi-threaded app

I was curious to see how this plays out on non-trivial programs, and I had some code at work that needed this kind of analysis anyway – so I gave it a whirl on my multi-threaded app. I can’t share the source code or the disassembly, and I’m honestly not 100% sure what’s going on – but I saw some very odd results.

flags              runtime (approx)
-O0                8.5 sec
-O2                12.5 sec
-O3                13 sec
-Os                7.9 sec
-Os -march=core2   7.7 sec

What I find really interesting is that with -O2 and -O3, the application runtime gets WORSE instead of better. My best guess is that with -O2 and its aggressive inlining, the code size blows up and the instruction cache hit rate suffers. I haven’t investigated it and probably won’t take the time, but I found it rather fascinating to see such a change from compiler flags alone.
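For anyone who does want to chase it down, one cheap sanity check on the code-size theory is to compare the text segment of the different builds with size (app.c below is just a stand-in for the real source):


$ gcc -O0 -o app_O0 app.c && size app_O0
$ gcc -O2 -o app_O2 app.c && size app_O2

If the text column is dramatically bigger at -O2, the cache theory at least becomes plausible; perf stat’s cache-miss counters would be the next step.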

FIO benchmark

Anyone who has visited my blog before probably knows that I’m a big fan of Jens Axboe’s fio benchmark. I decided to record a similar set of results by using fio to benchmark /dev/ram0.

I’m using fio 1.44.2, compiled from source on my local Ubuntu 10.04 amd64 system.
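If /dev/ram0 doesn’t already exist on your system, loading the brd module is one way to get a ram disk (the size below is just an example; rd_size is in KB):


$ sudo modprobe brd rd_nr=1 rd_size=1048576   # one 1 GB ram disk
$ ls -l /dev/ram0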


$ sudo ./fio --bs=4k --direct=1 --filename=/dev/ram0 \
    --numjobs=4 --iodepth=8 --ioengine=libaio --group_reporting \
    --time_based --runtime=60 --rw=randread --name=randread

There’s no real rhyme or reason for the workload I chose (4k blocks, 4 jobs, iodepth=8, etc.), I just wanted something with a few threads and some outstanding IO to have a good chance of getting high bandwidth.

flags                bandwidth (MB/s)   avg latency (us)
-O0                  1208               1.04
-O1                  1524               0.83
-O2                  1645               13.95 or 0.78 (very odd)
-O3                  1676               0.76
-Os                  1543               3.29
-O2 -funroll-loops   1667               0.77 or 13.33 (odd again)

Summary

With three very different examples, I get three very different sets of results.

In my trivial, stupid for loop example, unrolling the loop made a lot of sense and using -O2 didn’t matter a whole lot.

In my multi-threaded app, most optimizations actually made the runtime worse, but optimizing for size (-Os) and for the specific architecture (-march) made a difference.

And with fio, the results are pretty much what you’d expect – higher levels of optimization make the benchmark faster.

What’s the conclusion?

Compiler flags matter, but like all optimizations, their usefulness and impact are highly workload- and application-dependent.

So… you just gotta try them all, see what is best for your project. I know, killer conclusion, but hey… I just tell it like I see it.

Thanks for reading, I’d love to hear your comments/feedback below.

atto
