gcc optimization case study
I’ve used gcc for years, with varying levels of optimization… then the other day, I got really bored and started experimenting to see if gcc optimization flags really matter at all.
Trivial Example
It’s probably not the best example I could’ve come up with, but I hammered out the following code as a tiny and trivial test case.
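    /*
     * bar.c - trivial gcc optimization test case
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        long long i, j, count;

        j = 1;
        count = (100*1000*1000LL);
        /* j overflows after 63 shifts (undefined for a signed type),
         * so only the timing is meaningful, not the printed value */
        for (i = 0; i < count; i++)
            j = j << 1;
        printf("j = %lld\n", j);
        return 0;
    }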
The Makefile is pretty straightforward, nothing super interesting here:
    CC=gcc
    CFLAGS=-O0
    OBJS=bar.o
    TARGET=foo

    $(TARGET): $(OBJS)
    	$(CC) $(CFLAGS) -o $(TARGET) $(OBJS)

    clean:
    	-rm -f $(OBJS) $(TARGET) *~
It’s pretty boring code that loops 100 million times doing a left shift. Ignoring the fact that the result is useless, I then ran gcc with a few different flags with the following results. Runtime is in seconds, and the asm was generated with objdump -S and I’m only showing the loop code below.
| gcc flags | runtime (avg of 5) | loop disassembly |
|---|---|---|
| -O0 | 0.2314 sec | 29: shlq -0x10(%rbp) 2d: addq $0x1,-0x8(%rbp) 32: mov -0x8(%rbp),%rax 36: cmp -0x18(%rbp),%rax 3a: jl 29 |
| -O1 | 0.078 sec | e: add %rsi,%rsi 11: add $0x1,%rax 15: cmp $0x5f5e100,%rax 1b: jne e |
| -O2 | 0.077 sec | same loop |
| -O3 | 0.077 sec | same loop, different return |
| -Os | 0.072 sec | 7: inc %rax a: add %rsi,%rsi d: cmp $0x5f5e100,%rax 13: jne 8 |
| -O2 -funroll-loops | 0.014 sec | 10: add $0x8,%rax 14: shl $0x8,%rsi 18: cmp $0x5f5e100,%rax 1e: jne 10 |
Runtime measured with “time ./foo” on a Core i7-920 quad-core CPU with 6 GB of DDR3 memory
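For anyone who wants to repeat the "average of 5" measurement without the process startup cost that `time ./foo` includes, here is a minimal sketch of in-process timing. It is not how the numbers above were collected: `shift_loop` and `average_seconds` are hypothetical helper names, and it assumes a POSIX system with `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Same work as the loop in bar.c; unsigned so the overflowing shifts are well defined. */
unsigned long long shift_loop(long long count)
{
    unsigned long long j = 1;
    for (long long i = 0; i < count; i++)
        j <<= 1;
    return j;
}

/* Hypothetical helper: average wall-clock seconds over `runs` calls of shift_loop. */
double average_seconds(int runs, long long count)
{
    assert(runs > 0);
    volatile unsigned long long sink = 0;  /* stores the result so the call is not dropped */
    double total = 0.0;

    for (int r = 0; r < runs; r++) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        sink = shift_loop(count);
        clock_gettime(CLOCK_MONOTONIC, &end);
        total += (double)(end.tv_sec - start.tv_sec)
               + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    }
    (void)sink;
    return total / runs;
}
```

Calling `average_seconds(5, 100*1000*1000LL)` from `main` and printing the result gives a number comparable in spirit to the table, though at higher optimization levels the compiler may simplify a loop this simple, so it is worth checking the disassembly as well.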
Interestingly enough, the fastest gcc compile options for this useless code sample are -O2 -funroll-loops. It is faster because gcc performs an 8-way unroll, so the loop runs roughly one-eighth as many iterations. This works in this trivial example because gcc literally replaces 8 left-shift-by-one operations with a single shift-left by 8.
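To make that transformation concrete, here is a hand-written sketch of what the unrolled loop amounts to in C. This is an illustration, not gcc's actual output; the function names are made up for this example, and `j` is unsigned so that the overflowing shifts are well defined.

```c
#include <assert.h>

/* One shift per iteration, as in bar.c. */
unsigned long long shift_rolled(long long count)
{
    unsigned long long j = 1;
    for (long long i = 0; i < count; i++)
        j <<= 1;
    return j;
}

/* Hand-written 8-way unroll: eight shifts by one collapse into one shift by eight. */
unsigned long long shift_unrolled8(long long count)
{
    /* This sketch has no remainder loop, so it only handles multiples of 8. */
    assert(count % 8 == 0);
    unsigned long long j = 1;
    for (long long i = 0; i < count; i += 8)
        j <<= 8;
    return j;
}
```

Both functions return the same value for any multiple of 8, including the 100 million iterations used here, but the second one goes around the loop one-eighth as many times.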
So that’s semi-interesting, but all I’ve proved so far is gcc does in fact optimize and it is indeed much faster when you optimize a trivial loop example.
Multi-threaded app
I was curious to see how this plays out on non-trivial programs, and I had some code at work that needed this kind of analysis anyways – so I gave it a whirl on my multi-threaded app. I can’t break out the source code or the disassembly, and I’m honestly not 100% sure why – but I saw some very odd results.
| flags | runtime (approx) |
|---|---|
| -O0 | 8.5 sec |
| -O2 | 12.5 sec |
| -O3 | 13 sec |
| -Os | 7.9 sec |
| -Os -march=core2 | 7.7 sec |
What I find really interesting is that with -O2 and -O3, the application runtime gets WORSE instead of better. My best guess for this is that with -O2 and aggressive inlining, the code size blows up and it’s worse for cache hit rates. I haven’t investigated it and probably won’t take the time, but I found it rather fascinating to see such a change just from compiler flags.
FIO benchmark
Anyone who has visited my blog before probably knows that I’m a big fan of Jens Axboe’s fio benchmark. I decided to record a similar result from using fio to benchmark /dev/ram0.
I’m using fio 1.44.2, compiled from source on my local Ubuntu 10.04 amd64 system.
$ sudo ./fio --bs=4k --direct=1 --filename=/dev/ram0 \
    --numjobs=4 --iodepth=8 --ioengine=libaio --group_reporting \
    --time_based --runtime=60 --rw=randread --name=rand-read
There’s no real rhyme or reason for the workload I chose (4k, 4 forks, iodepth=8, etc). I just wanted something with a few threads and some outstanding IO to have a good chance of getting high bandwidth.
| flags | bandwidth (MB/s) | avg latency (us) |
|---|---|---|
| -O0 | 1208 | 1.04 |
| -O1 | 1524 | 0.83 |
| -O2 | 1645 | 13.95 OR 0.78, very odd |
| -O3 | 1676 | 0.76 |
| -Os | 1543 | 3.29 |
| -O2 -funroll-loops | 1667 | 0.77 OR 13.33, odd again |
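The bandwidth column is easier to compare as a gain over the -O0 baseline. A minimal sketch of that arithmetic follows; `percent_gain` is a hypothetical helper name, and the only inputs are the numbers from the table above.

```c
#include <assert.h>

/* Percentage gain of `value` over `baseline`, e.g. bandwidth at -O3 versus -O0. */
double percent_gain(double baseline, double value)
{
    assert(baseline > 0.0);
    return (value / baseline - 1.0) * 100.0;
}
```

With the table's figures, `percent_gain(1208, 1676)` comes out to roughly 39% for -O3, and `percent_gain(1208, 1543)` to roughly 28% for -Os.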
Summary
With three very different examples, I get three very different sets of results.
In my trivial, stupid for loop example, unrolling the loop made a lot of sense, and going past -O1 to -O2 or -O3 didn’t matter a whole lot.
In my multi-threaded app, most optimizations actually made the runtime worse, but size and architecture made a difference.
And with fio, the results are pretty much what you’d expect – higher levels of optimization make the benchmark faster.
What’s the conclusion?
Compiler flags matter, but like all optimizations, their usefulness and impact are highly workload and application dependent.
So… you just gotta try them all, see what is best for your project. I know, killer conclusion, but hey… I just tell it like I see it.
Thanks for reading, I’d love to hear your comments/feedback below.
atto

Hyperthreading is screwing up your cpu times. The processor is running 8 threads simultaneously on 4 cores and interleaving the instructions. It is effectively double counting cpu cycles.
This is a super interesting microcosm you developed. Whenever I change compilers on my system and have to recreate the use flags for general purpose binaries, I always get a copy of nbench (http://www.tux.org/~mayer/linux/bmark.html) and try candidate stable flags on the benchmark binaries for the compiler. I find that this is useful for keeping my 32 bit system closer to 64 bit performance. There are a lot of aggressive stable flags which boost performance significantly for nbench, especially if the compiler isn’t activating sseX or isn’t taking advantage of math fpu (-mfpmath=) correctly. This can lead to memory and floating point performance boosts of about 20% on an atom chip, while remaining 100% within the ‘stable’ purview, without being “highly workload and application dependent.” Thanks for the post.