Archive

My blog is dead; long live my blog

December 2, 2011

It’s been a long time since my last post, so at this point my blog is pretty much dead. I never was much of a blogger anyways, so I’m kind of surprised I posted so much last year. I started a new job this year, and sometimes things get busy and I disappear for a while… but the long and short of it is, my blog is dead 🙂

Looking back over my prior posts, I realized a few things that might be useful for other aspiring bloggers out there:

  1. Think twice, three times before posting a blog entry to reddit/digg/etc. If you have a catchy title, you can pull down a few thousand hits per day easily, but you should really think “Is my post worth sharing? Or does it need some work before a few hundred/thousand people wander over?”
  2. Figure out “why am I blogging?” and stick to that reason. If you’re blogging because you’re bored, your posts will probably be boring. If you’re blogging because you’re genuinely curious about something, and you put in the time to make interesting posts… then you’ve got a great reason to blog.
  3. Blogging takes time. Especially for highly technical posts, or posts about a project/experiment/some code you wrote. You should expect to spend several hours writing, reviewing, and proofreading each post.
  4. Blogging to get hits is a crappy excuse for blogging. Blog because you’re passionate about something; to heck with hits and link karma.

I don’t have much else to say right now, but while I’m typing, here are a couple of links for fun/profit:

  1. Bit hacks – best collection of bit hacks I’ve run across… need to know how to hack bits optimally? Read on…
  2. Jenkins CI – life without some kind of continuous integration is a pain. Everyone should always test code before hitting the ‘submit’ button, but reality != fiction, and CI makes your life easier. Check it out, Jenkins is pretty cool.

ttfn


Core Dumps Rock

April 17, 2011

So it’s been a while since my last post, and I find myself thinking about core dumps lately.

I won’t be wowing you with my awesome knowledge or teaching obscure implementation details regarding core dumps… I guess my bottom line message is “core dumps rock.”

Some of the more hard-core guys like Linus don’t believe in debuggers (real men don’t need ’em, eh?) but for the rest of us mere mortals debuggers are vital tools. And while I agree with Linus’s reasoning against debuggers, especially in the kernel — I guess I’m a pragmatist at heart. Use whatever tool helps you get the job done…

Anyways, core dumps are awesome. Especially if any of the following are true:
– you’re chasing an intermittent bug that only happens once in N runs or doesn’t reproduce with a debugger attached
– your code is running on an embedded device and there’s no JTAG (hence no realtime debugging/debugger)
– you want to save a snapshot of the system/app/problem for debugging later

Core dumps (and what triggers them) vary from OS to OS, but they generally contain some/all of the following:
– the page tables/memory for the crashing process, or a full RAM dump on some embedded targets
– the processor registers, including MMU configuration and other hardware configuration/state info

This info can be loaded into a debugger later on to reconstruct a backtrace and examine variables, and it can go a long way toward solving the crash at hand.
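
If you want to try this at home, here’s a minimal sketch (the file name and build/debug commands are just my suggestions, nothing project-specific) that crashes on purpose so you can practice loading the resulting core into gdb:

/*
 * crasher.c - crash on purpose to generate a core dump.
 *
 * Build:           gcc -g -o crasher crasher.c
 * Before running:  `ulimit -c unlimited` so the kernel will write a core file.
 * After the crash: `gdb ./crasher core`, then `bt` for the backtrace.
 */
#include <stdio.h>

int main(void)
{
	int *p = NULL;

	printf("about to dereference NULL...\n");
	*p = 42;	/* SIGSEGV; the kernel dumps core if ulimit allows */

	return 0;
}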

I’ve used core dumps several times recently and they’ve been a real lifesaver. The exact mechanics of using core dumps are already documented ad nauseam elsewhere, so I won’t repeat that info here. Just wanted to put in a plug for core dumps… Because when you need ’em, they… Rock.

Atto


gcc optimization case study

January 4, 2011

I’ve used gcc for years, with varying levels of optimization… then the other day, I got really bored and started experimenting to see if gcc optimization flags really matter at all.

Trivial Example

It’s probably not the best example I could’ve come up with, but I hammered out the following code as a tiny and trivial test case.


/*
 * bar.c - trivial gcc optimization test case
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	long long i, j, count;

	j = 1;
	count = (100*1000*1000LL);
	for (i = 0; i < count; i++)
		j = j << 1;
	printf("j = %lld\n", j);

	return 0;
}

The Makefile is pretty straightforward, nothing super interesting here:


CC=gcc
CFLAGS=-O0
OBJS=bar.o
TARGET=foo

$(TARGET): $(OBJS)
	$(CC) $(CFLAGS) -o $(TARGET) $(OBJS)

clean:
	-rm -f $(OBJS) $(TARGET) *~

It’s pretty boring code that loops 100 million times doing a left shift. Ignoring the fact that the result is useless, I then ran gcc with a few different flags, with the following results. Runtime is in seconds, the asm was generated with objdump -S, and I’m only showing the loop code below.

gcc flags / runtime (avg of 5) / loop disassembly:

-O0: 0.2314 sec
    29: shlq -0x10(%rbp)
    2d: addq $0x1,-0x8(%rbp)
    32: mov  -0x8(%rbp),%rax
    36: cmp  -0x18(%rbp),%rax
    3a: jl   29

-O1: 0.078 sec
    e:  add %rsi,%rsi
    11: add $0x1,%rax
    15: cmp $0x5f5e100,%rax
    1b: jne e

-O2: 0.077 sec (same loop as -O1)

-O3: 0.077 sec (same loop, different return)

-Os: 0.072 sec
    7:  inc %rax
    a:  add %rsi,%rsi
    d:  cmp $0x5f5e100,%rax
    13: jne 8

-O2 -funroll-loops: 0.014 sec
    10: add $0x8,%rax
    14: shl $0x8,%rsi
    18: cmp $0x5f5e100,%rax
    1e: jne 10
Runtime measured with “time ./foo” on a Core i7-920 quad-core CPU with 6 GB of DDR3 memory

Interestingly enough, the fastest gcc compile options for this useless code sample are -O2 -funroll-loops. It’s faster because gcc performs an 8-way unroll, so the loop does roughly 1/8th the work: in this trivial example it literally replaces 8 left-shift-by-one operations with a single shift-left by 8.
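
In C terms, the unrolled loop behaves roughly like this (my reconstruction from the disassembly above, not anything gcc actually emits as source):

/*
 * unrolled.c - roughly what -O2 -funroll-loops turns bar.c's loop into:
 * 8 iterations per pass, with the eight 1-bit shifts folded into a
 * single shift-left by 8.
 */
#include <stdio.h>

int main(void)
{
	long long i, j = 1, count = (100*1000*1000LL);

	for (i = 0; i < count; i += 8)
		j = j << 8;	/* replaces eight j = j << 1 steps */
	printf("j = %lld\n", j);

	return 0;
}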

So that’s semi-interesting, but all I’ve proved so far is that gcc does in fact optimize, and that a trivial loop gets much faster when it does.

Multi-threaded app

I was curious to see how this plays out on non-trivial programs, and I had some code at work that needed this kind of analysis anyways – so I gave it a whirl on my multi-threaded app. I can’t share the source code or the disassembly, and I’m honestly not 100% sure what’s going on – but I saw some very odd results.

flags               runtime (approx)
-O0                 8.5 sec
-O2                 12.5 sec
-O3                 13 sec
-Os                 7.9 sec
-Os -march=core2    7.7 sec

What I find really interesting is that with -O2 and -O3, the application runtime gets WORSE instead of better. My best guess is that with -O2 and aggressive inlining, the code size blows up and the cache hit rate suffers. I haven’t investigated it and probably won’t take the time, but I found it rather fascinating to see such a change just from compiler flags.

FIO benchmark

Anyone who has visited my blog before probably knows that I’m a big fan of Jens Axboe’s fio benchmark. I decided to record a similar result from using fio to benchmark /dev/ram0.

I’m using fio 1.44.2, compiled from source on my local Ubuntu 10.04 amd64 system.


$ sudo ./fio --bs=4k --direct=1 --filename=/dev/ram0 \
  --numjobs=4 --iodepth=8 --ioengine=libaio --group_reporting \
  --time_based --runtime=60 --rw=randread --name=rand-read

There’s no real rhyme or reason for the workload I chose (4k blocks, 4 jobs, iodepth=8, etc.), I just wanted something with a few threads and some outstanding IO to have a good chance of getting high bandwidth.

flags               bandwidth (MB/s)   avg latency (us)
-O0                 1208               1.04
-O1                 1524               0.83
-O2                 1645               13.95 or 0.78 (very odd)
-O3                 1676               0.76
-Os                 1543               3.29
-O2 -funroll-loops  1667               0.77 or 13.33 (odd again)

Summary

With three very different examples, I get three very different sets of results.

In my trivial, stupid for loop example, unrolling the loop made a lot of sense and using -O2 didn’t matter a whole lot.

In my multi-threaded app, most optimizations actually made the runtime worse, but size and architecture made a difference.

And with fio, the results are pretty much what you’d expect – higher levels of optimization make the benchmark faster.

What’s the conclusion?

Compiler flags matter, but like all optimizations, their usefulness and impact are highly workload- and application-dependent.

So… you just gotta try them all, see what is best for your project. I know, killer conclusion, but hey… I just tell it like I see it.

Thanks for reading – I’d love to hear your comments/feedback below.

atto


2010 in review

January 2, 2011

The stats helper monkeys at WordPress.com mulled over how this blog did in 2010, and here’s a high level summary of its overall blog health:

Healthy blog!

The Blog-Health-o-Meter™ reads Wow.

Crunchy numbers


The average container ship can carry about 4,500 containers. This blog was viewed about 17,000 times in 2010. If each view were a shipping container, your blog would have filled about 4 fully loaded ships.

In 2010, there were 22 new posts, not bad for the first year! There were 23 pictures uploaded, taking up a total of 3 MB. That’s about 2 pictures per month.

The busiest day of the year was May 12th with 4,426 views. The most popular post that day was Three Things I Love About C.

Where did they come from?

The top referring sites in 2010 were reddit.com, news.ycombinator.com, Google Reader, twitter.com, and healthfitnesstherapy.com.

Some visitors came searching, mostly for hg vs git, sigtrap, git vs hg, i love c, and hg vs git 2010.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

  1. Three Things I Love About C (May 2010), 19 comments
  2. SVN to Git (September 2010)
  3. True Zero-Copy with XIP vs PRAMFS (November 2010), 6 comments
  4. Cloning a clone: Hg vs Git (May 2010), 12 comments
  5. TortoiseGit – round 2, fight! (April 2010), 1 comment

Sorting for Chumps

December 9, 2010

A terse post about sorting, written for chumps by a chump

If you’ve never spent time on Urban Dictionary, you’re missing out on one of the finer things in life.

Chump. A sucka that tries to act cool, but is really a fool and tries to act tough, but really isn’t

source

I’m a chump who’s been out of University for a few years now, so I’ve long since forgotten anything that I don’t use in my daily job. Lately, I’ve been going back over some of the things I used to know and trying to re-learn basic fun things like sorting and runtime analysis.

I won’t bore you with the distinction between big-O and big-theta, but there’s a fun table on Wikipedia’s Big O notation page for the curious.

For chumps like me who appreciate how stunningly cool the math is, but just want real-world performance data… you’ve come to the right place.

Test methodology

As preparation for this blog post, I wrote a small sorting library and “integer sorting” program which you can find on github. I named it libsort just because I’m cool like that. The code isn’t the best I’ve ever written, but most of it was hacked out on iSSH so… there you go.

libsort currently implements the following algorithms:

  • bubblesort: O(n²) runtime, O(1) extra space
  • mergesort: O(n log n) runtime, O(n) extra space
  • quicksort: O(n log n) runtime, O(log n) stack space (in-place)

I started with bubblesort because it looked easy (remember, I’m a chump :)) and it’s also O(n²), so it should make for a good comparison against faster algorithms.
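
For reference, here’s a minimal bubblesort sketch (just the textbook shape of the algorithm, not the actual libsort code):

#include <stddef.h>

/*
 * Textbook bubblesort: repeatedly swap adjacent out-of-order elements
 * until the array is sorted. O(n^2) compares/swaps, O(1) extra space.
 */
void bubblesort(int *a, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++) {
		for (size_t j = 0; j + 1 < n - i; j++) {
			if (a[j] > a[j + 1]) {
				int tmp = a[j];
				a[j] = a[j + 1];
				a[j + 1] = tmp;
			}
		}
	}
}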

Mergesort and Quicksort really weren’t that much harder to implement… I picked slightly optimized versions and coded them up based on Wikipedia’s pseudo-code. Man I love pseudo-code…

I also ran two versions of quicksort and mergesort, with some minor changes like malloc’ing once beforehand rather than having each recursive call malloc temporary space. You’ll see this show up in the graphs as quicksort_onemalloc below, etc.
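
To illustrate the one-malloc idea, here’s a sketch of the pattern (my own illustration with a hypothetical mergesort_onemalloc name, not the actual libsort code): allocate the scratch buffer once up front and pass it down, instead of having every recursive call hit malloc().

#include <stdlib.h>
#include <string.h>

/* Recursive half: sorts a[lo..hi) using one shared scratch buffer. */
static void msort(int *a, int *tmp, size_t lo, size_t hi)
{
	if (hi - lo < 2)
		return;

	size_t mid = lo + (hi - lo) / 2;
	msort(a, tmp, lo, mid);
	msort(a, tmp, mid, hi);

	/* Merge a[lo..mid) and a[mid..hi) into tmp, then copy back. */
	size_t i = lo, j = mid, k = lo;
	while (i < mid && j < hi)
		tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
	while (i < mid)
		tmp[k++] = a[i++];
	while (j < hi)
		tmp[k++] = a[j++];
	memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

int mergesort_onemalloc(int *a, size_t n)
{
	int *tmp = malloc(n * sizeof(int));	/* one allocation, reused throughout */

	if (tmp == NULL)
		return -1;
	msort(a, tmp, 0, n);
	free(tmp);
	return 0;
}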

Test machine

The system I ran the tests on looks like this:

  • Intel Core i7-920 @ 2.67 GHz (quad-core with 2 threads/core = 8 virtual CPUs)
  • 6 GB of DDR3 memory (sorry, don’t know the bus speed)
  • Ubuntu 10.04 Desktop amd64
  • a regular old SATA HDD

The Results

You can download the spreadsheet I put together from the raw data. Or you can be a chump like me and just look at the graphs.

Result #1: bubblesort sucks.

Bubblesort does SO poorly once you get beyond 10,000 elements that I stopped measuring it. O(n²) really hurts… 100K elements took over 110 seconds. That’s really, really bad.

Result #2: quicksort did slightly better than mergesort, but not enough to really matter a whole lot for most purposes.

The graph above doesn’t really do justice to just how fast sorting is for small lists. Both of these sorting algorithms can sort over 1 MILLION integers in less than a quarter of a second. You really don’t see much difference until you get past 100 million elements, and then my implementation of an in-place quicksort starts to really shine.

Looking at it in a log10-scale layout:

Looking at it on a log scale shows a pretty much linear increase in log-time, which sounds pretty good to this chump.

That’s all for now, please check out the libsort code on github and let me know if I’ve made any stupid mistakes, etc.

Thanks for reading – now let me know what you think!


True Zero-Copy with XIP vs PRAMFS

November 24, 2010

Introduction

A couple of weeks ago, Louis Brandy wrote a blog post entitled “Memory mapped IO for fun and profit,” which I found rather fascinating. I’ve always loved memory mapped files, for a lot of the reasons listed on his blog. It really makes your software simpler to be able to treat a chunk of a file like… a chunk of memory, letting the OS handle the IO behind the scenes as deemed appropriate.

I really liked his blog post; it was a great example of a use case where mmap() didn’t just make the code easier to debug and/or understand, it actually provided a real, noticeable performance boost.

But one thing that’s always bugged me about MMAP’d files is that they really aren’t “zero copy.” They’re more like “one copy,” because you still have to fault and fetch/flush pages via the normal IO mechanism. What if there was a way to make MMAP’d files actually zero-copy? Granted, you either need RAM or something else that is direct-mapped into the processor (PCM/CMOx/Memristors, anyone?), but I’d like to test my theory that this could lead to higher performance than the current mechanism.
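
For anyone who hasn’t played with memory mapped files, here’s a minimal read-only sketch of the idea (error handling mostly trimmed for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd;
	struct stat st;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	/* Map the whole file; the kernel faults pages in on demand. */
	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* The file is now "just memory", e.g. count its newlines. */
	long lines = 0;
	for (off_t i = 0; i < st.st_size; i++)
		if (p[i] == '\n')
			lines++;
	printf("%ld lines\n", lines);

	munmap(p, st.st_size);
	close(fd);
	return 0;
}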

The diagram below shows the various paths that files can take through the system; I’ve tried to color-code it, but probably made a mess of it.

The green arrows show the normal I/O flow, with two copies of the data retrieved from storage; the red arrows show the normal MMAP’d files; and the blue line labelled “mmap + direct_access” is what I’m all keen to test.

The basic idea I’m trying to convey is that conventional IO (either from a RAM disk or a regular SATA HDD) involves a fair amount of copying. Data is copied into the user program’s stack/heap, but it’s also copied into the page cache. So mmap’d files are neat because you can skip some of the copying and do I/O behind the scenes to flush dirty pages, etc. But what I’m after is a true zero-copy, which requires a RAMdisk.

I did some digging, and found the following snippet from Documentation/filesystems/xip.txt:

A block device operation named direct_access is used to retrieve a reference (pointer) to a block on-disk. The reference is supposed to be cpu-addressable, physical address and remain valid until the release operation is performed.

I did some more digging, and I think what I’m really looking for is like ext2’s XIP feature — but not so much from a “I want to execute this code from NOR” standpoint, more like I want MMAP’d files which are directly addressable by the CPU.

And then there’s PRAMFS, a new-ish filesystem which doesn’t use block devices or the block layer at all – preferring to directly control the target [RAM/PRAM/???] with some page table protection thrown in for fun.

So, I set out to test which of these three methods gives the best bang for our proverbial buck.

If you just want to see the graphs/data, feel free to skip to the conclusion.

The Contenders

In no order of preference, here are the various IO methods I plan to test:

  • ext2 + direct_access
  • pramfs
  • mmap() + block IO
  • libaio + block IO [no mmap]

My current plan is to use the fio benchmark with the “mmap” IO engine, because a) I have no useful examples / use cases, b) fio really is a flexible IO tester 🙂

Creating the ext2 + direct_access method was actually quite complex to set up: I decided to write a custom RAMdisk with direct_access (xiprd), and I ended up rebuilding my kernel for ext2+XIP support. I’m not going to go into detail on getting this setup working; google “how to compile a kernel the Ubuntu way” if you’re curious.
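
For the curious, the direct_access hook lives in block_device_operations, and in the 2.6.3x era a RAM-disk implementation ends up looking roughly like the sketch below. This is just an illustration of the idea, not my actual xiprd code:

#include <linux/blkdev.h>
#include <linux/mm.h>
#include <linux/module.h>

/* Hypothetical RAM-disk backing store: one physically contiguous buffer. */
static void *ramdisk_base;	/* kmalloc'd, so virt_to_page() is valid */

/* Hand back a CPU-addressable pointer and pfn for the requested sector. */
static int xiprd_direct_access(struct block_device *bdev, sector_t sector,
			       void **kaddr, unsigned long *pfn)
{
	*kaddr = ramdisk_base + (sector << 9);	/* 512-byte sectors */
	*pfn = page_to_pfn(virt_to_page(*kaddr));
	return 0;
}

static const struct block_device_operations xiprd_ops = {
	.owner		= THIS_MODULE,
	.direct_access	= xiprd_direct_access,
};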

Benchmark machine setup:

  • i7-920 @ 2.67 GHz
  • 6GB DDR3 memory
  • Ubuntu 10.04 with vanilla kernel 2.6.36.1 + ftrace + ext2 XIP

ext2 + direct_access

Once I got my custom kernel with ext2 “xip” (execute in place) support built in, I loaded my custom block device driver (xiprd) with a 2 GB RAM-disk. Of course, trying to run the fio MMAP benchmark resulted in a crash and a system reboot… followed by a harried bug hunt.

This is still a work in progress. I think the bug has something to do with my using vmalloc() instead of kmalloc() (vmalloc’d memory is only virtually contiguous, so calling virt_to_page() on those addresses hands back bogus pfns), but switching to kmalloc() limits my RAMdisk size to 1MB. Using vmalloc, I keep getting the error:


page_offset=14
[ 1720.844373] ramdisk=ffff880106e00000, kaddr=ffff880106e0e000, pfn=1076750
[ 1720.954378] fio[3749]: segfault at 7fe0265b9328 ip 000000000041b17e sp 00007fff43825f40 error 4 in fio[400000+3b000]

Hacking xiprd to use kmalloc, I then had to force mkfs.ext2 to use a block size of 4K (-b 4096), otherwise the mount failed.

$ sudo fio --bs=4k --ioengine=mmap --iodepth=1 \
--numjobs=10 --size=80k --direct=1 --runtime=10 --directory=/mnt/xip/ \
 --name=rand-read --rw=randread --group_reporting --time_based
(laying out files)
rand-read: (groupid=0, jobs=10): err= 0: pid=4363
  read : io=21079MB, bw=2108MB/s, iops=539567, runt= 10001msec
    clat (usec): min=2, max=28356, avg=13.28, stdev=53.77
  cpu          : usr=12.75%, sys=55.71%, ctx=31149, majf=0, minf=5396522
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=5396212/0, short=0/0
     lat (usec): 4=0.05%, 10=48.70%, 20=50.97%, 50=0.07%, 100=0.01%
     lat (usec): 250=0.07%, 500=0.03%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.02%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=21079MB, aggrb=2108MB/s, minb=2158MB/s, maxb=2158MB/s, mint=10001msec, maxt=10001msec

OK, 2.1 GB/s of random read bandwidth and 13.28 us round-trip using direct_access, not too shabby. However, this is still using a 1MB RAM disk, so I need to look into fixing the vmalloc issues.

Update: I tried booting with mem=3072m, and hacked xiprd to use __va(0x10000000) instead of vmalloc’ing. I also tried an ioremap_cache() which failed. Using __va() appeared to work, but rebooted the system when trying to make the filesystem, so this is definitely a WIP.

mmap + pramfs

Work in progress: I patched pramfs into the kernel source, rebuilt it, rebooted with mem=3072m, and tried creating a pramfs filesystem with

sudo mount -t pramfs -o physaddr=0x100000000,init=2000M,bs=4k none /mnt/pram

When I boot the kernel with only 3GB of RAM, 0x100000000 is a valid system RAM address which is not in use – so this works fine. But when I run fio against it, my system locks up when laying out the last file. So this is still a work in progress, maybe I ran past the RAM or who knows. I’ll keep trying and update if I find anything useful.

I tried a few more times, with mmap and libaio engines but couldn’t get it to work – so I filed bug 3118126 for the curious.

UPDATE: I finally got back around to compiling PRAMFS without memory protection and XIP. I guess the kernel doesn’t like setting PTEs for RAM it’s not tracking, which makes sense.

Results for libaio:

sudo fio --bs=4k --ioengine=libaio --iodepth=1 --numjobs=10 \
 --size=180m  --direct=1 --runtime=60 --directory=/mnt/pram/ \
 --name=rand-read --rw=randread --time_based --group_reporting
(laying out files, etc)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [841M/0K /s] [210K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=1741
  read : io=49262MB, bw=840623KB/s, iops=210155, runt= 60008msec
    slat (usec): min=33, max=30023, avg=37.08, stdev=14.81
    clat (usec): min=0, max=20027, avg= 0.41, stdev= 1.21
    bw (KB/s) : min=48520, max=105888, per=12.47%, avg=104790.37, stdev=408.00
  cpu          : usr=2.49%, sys=77.28%, ctx=55025, majf=0, minf=356
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=12611029/0, short=0/0
     lat (usec): 2=99.98%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=49262MB, aggrb=840623KB/s, minb=860798KB/s,
    maxb=860798KB/s, mint=60008msec, maxt=60008msec

For the impatient, that’s ~840 MB/s with libaio.

How about mmap + PRAMFS?

sudo fio --bs=4k --ioengine=mmap --iodepth=1 --numjobs=10 \
--size=180m  --direct=1 --runtime=60 --directory=/mnt/pram/ \
 --name=rand-read --rw=randread --time_based --group_reporting
(cut)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [6827M/0K /s] [1707K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=2260
  read : io=382030MB, bw=6367MB/s, iops=1630K, runt= 60001msec
    clat (usec): min=1, max=22316, avg= 4.35, stdev=21.06
    bw (KB/s) : min=44208, max=347336, per=1.31%, avg=85323.68, stdev=525.67
  cpu          : usr=48.01%, sys=31.28%, ctx=60143, majf=460800, minf=97339361
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=97799801/0, short=0/0
     lat (usec): 2=0.01%, 4=38.14%, 10=61.33%, 20=0.01%, 50=0.47%
     lat (usec): 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=382030MB, aggrb=6367MB/s, minb=6520MB/s, maxb=6520MB/s, mint=60001msec, maxt=60001msec

Awesome! 6.3 GB/s of read speed using MMAP. I’m totally loving MMAP, it rocks.

mmap + block IO

I chose ext4 for this test, because it’s the most prevalent at the time of this writing and because rumor pegs it as being more stable than btrfs.

I loaded my xiprd driver so it would be a fairer comparison – RAM disk against RAM disk, hopefully fewer oranges and more apples to apples.


$ sudo fio --bs=4k --ioengine=mmap --iodepth=1 --numjobs=10 --size=180m \
 --direct=1 --runtime=60 --directory=/mnt/xip/ --name=rand-read \
--rw=randread --time_based --group_reporting
(laying out files)
rand-read: (groupid=0, jobs=10): err= 0: pid=2504
  read : io=392968MB, bw=6549MB/s, iops=1677K, runt= 60001msec
    clat (usec): min=1, max=24708, avg= 4.89, stdev=33.22
    bw (KB/s) : min=188848, max=367640, per=3.86%, avg=258557.40, stdev= 0.00
  cpu          : usr=49.12%, sys=30.48%, ctx=55807, majf=460797, minf=100139571
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=100599920/0, short=0/0
     lat (usec): 2=0.01%, 4=29.03%, 10=70.76%, 20=0.17%, 50=0.03%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=392968MB, aggrb=6549MB/s, minb=6707MB/s, maxb=6707MB/s, mint=60001msec, maxt=60001msec

The main highlights here are the 6549 MB/s of rand read bandwidth, and 4.89 usec of clat. That completely blows direct_access+MMAP out of the water, and is pretty awesome for a MMAP'd ramdisk.

libaio + block IO

Just for complete and total fairness, I re-ran the fio against my RAM-disk with libaio (async IO) instead of MMAP:

 $ sudo fio --bs=4k --ioengine=libaio --iodepth=1 --numjobs=10 \
--size=180m --direct=1 --runtime=60 --directory=/mnt/xip/ \
--name=rand-read --rw=randread --time_based --group_reporting
(laying out files)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [4521M/0K /s] [1130K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=4835
  read : io=264571MB, bw=4410MB/s, iops=1129K, runt= 60000msec
    slat (usec): min=2, max=32699, avg= 6.53, stdev=33.72
    clat (usec): min=0, max=21591, avg= 0.66, stdev=10.33
    bw (KB/s) : min=161100, max=368408, per=6.85%, avg=309377.89, stdev=12854.65
  cpu          : usr=18.43%, sys=60.72%, ctx=74361, majf=0, minf=367
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=67730119/0, short=0/0
     lat (usec): 2=99.84%, 4=0.15%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=264571MB, aggrb=4410MB/s, minb=4515MB/s, maxb=4515MB/s, mint=60000msec, maxt=60000msec

Skipping to the bottom line, the MMAP'd RAMdisk was ~2 GB/s faster and ~2 us lower latency than libaio. That is pretty awesome.

Conclusion(s)

UPDATE: I finally added some PRAMFS numbers below. Enjoy!

Regular, memory-mapped files win by a small amount when it comes to bandwidth -- but PRAMFS is a very close contender.

How about latency?

Regular mmap'd files ALMOST win here also. PRAMFS wins by a hair on this test, but only when using MMAP - PRAMFS + libaio is rather... not good. This is *not* what I expected, but is rather fascinating. Granted, all of this data is collected against a 2 GB RAM disk... but it is pretty interesting that regular bio-based transfers between the page cache and my RAM disk beat out a 1MB mmap'd, kmalloc'd, direct_access()'d RAMdisk.

What does this mean in real life?

I have no idea... probably the only real conclusion you can draw from all this data / graphs is something like

  1. Memory-mapped files can be faster for some workloads
  2. XIP is no match for regular MMAP'd files, in its present state

Oh, and here's a link to the spreadsheet (mmap_fio_data.xls) I used to generate the above graphs.

Atto

Extreme Dollhouse Programming

November 13, 2010

My kids woke up at 5am on Saturday morning, which was not surprising or particularly unusual.

Being the awesome, totally cool dad that I am :-), I let my wife sleep in while I entertained the kids.

Also being a geek / engineer, it obviously wasn’t good enough to just play with my kids’ toys… I soon found myself balancing dollhouse people on top of dollhouse furniture on top of dollhouses.

Ladies and Gentlemen, without further ado I give you… Extreme Dollhouse Programming.

I learned three things from playing with dollhouse toys:

  1. engineer + dollhouse = weird things happen
  2. it’s really hard to balance odd shapes while kids are trying to knock them down
  3. a lot of software is built just like these rickety furniture stacks

On #1, what else can I really say… weird stuff happens when you let engineers out of their cubicles. My wife tells me that I should get my head examined, because normal people might play house or act out episodes from Lifetime TV. Leave it to an engineer to stack Grandma, 6 chairs, and a toilet up in the sky.

#2 almost goes without saying… but it’s quite entertaining to see just how fast you can rebuild your tower before it gets knocked down by your toddler. Think of it like a game of reverse speed Jenga. Someday it’ll be a competitive Olympic sport… I can almost see my gold medal now 😉

And #3 is my lame excuse of a tie-in to justify this post.

But seriously though, how many projects have you worked on where you felt like the whole project could come crashing down at any minute? How many complex software systems are thrown together at high speed, held together by baling wire and ugly Perl scripts? How many projects have no formal requirements, or worse yet no real customers?

How many software projects are really, carefully, methodically planned and executed in an elegant way?

This is not a critique / rant where I tear into the software industry and make stupid arguments like “software developers suck” — I’m thinking more about the way I approach software development, and thinking we all have room for improvement.

And while there are some great ideas found in eXtreme Programming, Agile, etc – I don’t think there’s one true software development style or approach. More like there are some good ideas out there, and everyone should use these ideas to improve themselves and get better at the “craft.”

So here’s to building better dollhouses, better software, self improvement, and all that jazz.

Atto
