Archive for November, 2010

True Zero-Copy with XIP vs PRAMFS

November 24, 2010

Introduction

A couple of weeks ago, Louis Brandy wrote a blog post entitled “Memory mapped IO for fun and profit,” which I found rather fascinating. I’ve always loved memory mapped files, for a lot of the reasons listed on his blog. It really makes your software simpler to be able to treat a chunk of memory like… a chunk of memory, letting the OS handle the IO behind the scenes as deemed appropriate.

I really liked his blog post: it was a great example of a use case where mmap() didn’t just make the code easier to debug and understand, it actually provided a real, noticeable performance boost.

But one thing that’s always bugged me about MMAP’d files is that they really aren’t “zero copy.” They’re more like “one copy,” because you still have to fault pages in and flush them out via the normal IO mechanism. What if there were a way to perform truly zero-copy MMAP’d file IO? Granted, you need either RAM or something else that is direct-mapped into the processor (PCM/CMOx/Memristors, anyone?), but I’d like to test my theory that this could lead to higher performance than the current mechanism.

The diagram below shows the various paths that files can take through the system. I’ve tried to color-code it, but probably made a mess of it.

The green arrows show the normal I/O flow, with two copies of the data retrieved from storage; the red arrows show normal MMAP’d files; and the blue line labelled “mmap + direct_access” is what I’m all keen to test.

The basic idea I’m trying to convey is that conventional IO (either from a RAM disk or a regular SATA HDD) involves a fair amount of copying. Data is copied into the user program’s stack/heap, but it’s also copied into the page cache. So mmap’d files are neat because you can skip some of the copying and do I/O behind the scenes to flush dirty pages, etc. But what I’m after is a true zero-copy, which requires a RAMdisk.
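To make the copy counting concrete, here’s a minimal userspace sketch of the two approaches (the file path and the lack of error handling are illustrative only):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/xip/testfile", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Conventional read(): storage -> page cache -> buf.
     * Every byte is copied twice. */
    char *buf = malloc(st.st_size);
    read(fd, buf, st.st_size);

    /* mmap(): the page cache pages are mapped straight into our address
     * space, so there is only one copy (storage -> page cache). With XIP
     * or PRAMFS even that copy goes away. */
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    printf("first byte: %d vs %d\n", buf[0], map[0]);

    munmap(map, st.st_size);
    free(buf);
    close(fd);
    return 0;
}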

I did some digging, and found the following snippet in Documentation/filesystems/xip.txt:

A block device operation named direct_access is used to retrieve a reference (pointer) to a block on-disk. The reference is supposed to be cpu-addressable, physical address and remain valid until the release operation is performed.

I did some more digging, and I think what I’m really looking for is something like ext2’s XIP feature — not so much from an “I want to execute this code from NOR” standpoint; more that I want MMAP’d files which are directly addressable by the CPU.

And then there’s PRAMFS, a new-ish filesystem which doesn’t use block devices or the block layer at all – preferring to directly control the target [RAM/PRAM/???] with some page table protection thrown in for fun.

So, I set out to test which of these four methods gives the best bang for our proverbial buck.

If you just want to see the graphs and data, feel free to skip to the conclusion.

The Contenders

In no order of preference, here are the IO methods I plan to test:

  • ext2 + direct_access
  • pramfs
  • mmap() + block IO
  • libaio + block IO [no mmap]

My current plan is to use the fio benchmark with the “mmap” IO engine, because a) I have no useful examples or use cases of my own, and b) fio really is a flexible IO tester 🙂

The ext2 + direct_access method was actually quite complex to set up: I had to write a custom RAM disk driver with direct_access support (xiprd), and I ended up rebuilding my kernel for ext2 + XIP support… I’m not going to go into detail on getting this working; google for “how to compile a kernel the Ubuntu way” if you’re curious.
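For reference, the hook itself is small. Here’s a hedged sketch of roughly what xiprd’s direct_access looks like against the 2.6.36 block_device_operations signature (xiprd_base is a placeholder name for the driver’s backing buffer, not the actual driver source):

#include <linux/blkdev.h>   /* block_device_operations */
#include <linux/mm.h>       /* virt_to_page, page_to_pfn */
#include <linux/module.h>   /* THIS_MODULE */

/* Start of the kmalloc'd (physically contiguous) RAM disk. */
extern void *xiprd_base;

/* Hand the caller a CPU-addressable pointer and a page frame number
 * for the requested sector, instead of queueing a bio. */
static int xiprd_direct_access(struct block_device *bdev, sector_t sector,
                               void **kaddr, unsigned long *pfn)
{
    void *addr = xiprd_base + (sector << 9);   /* 512-byte sectors */

    *kaddr = addr;
    *pfn = page_to_pfn(virt_to_page(addr));    /* valid for kmalloc/__va only */
    return 0;
}

static const struct block_device_operations xiprd_ops = {
    .owner         = THIS_MODULE,
    .direct_access = xiprd_direct_access,
};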

Benchmark machine setup:

  • i7-920 @ 2.67 GHz
  • 6GB DDR3 memory
  • Ubuntu 10.04 with vanilla kernel 2.6.36.1 + ftrace + ext2 XIP

ext2 + direct_access

Once I got my custom kernel with ext2 “xip” (execute in place) support built in, I loaded my custom block device driver (xiprd) with a 2 GB RAM-disk. Of course, trying to run the fio MMAP benchmark resulted in a crash and a system reboot… followed by a harried bug hunt.

This is still a work in progress. I think the bug has something to do with my using vmalloc() instead of kmalloc(), but switching to kmalloc() limits my RAM disk size to 1MB. Using vmalloc, I keep getting this error:


page_offset=14
[ 1720.844373] ramdisk=ffff880106e00000, kaddr=ffff880106e0e000, pfn=1076750
[ 1720.954378] fio[3749]: segfault at 7fe0265b9328 ip 000000000041b17e sp 00007fff43825f40 error 4 in fio[400000+3b000]
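My working theory, for what it’s worth: virt_to_page()/virt_to_phys() are only valid for addresses in the kernel’s linear mapping, i.e. memory from kmalloc() or __va(). vmalloc() memory is virtually contiguous but physically scattered, so the pfn would have to be looked up through the page tables, one page at a time. A hedged one-line sketch of the fix inside direct_access:

#include <linux/vmalloc.h>   /* vmalloc_to_pfn */

*pfn = vmalloc_to_pfn(addr);   /* instead of page_to_pfn(virt_to_page(addr)) */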

After hacking xiprd to use kmalloc, I had to force mkfs.ext2 to use a block size of 4K (-b 4096); otherwise the mount failed.

$ sudo fio --bs=4k --ioengine=mmap --iodepth=1 \
--numjobs=10 --size=80k --direct=1 --runtime=10 --directory=/mnt/xip/ \
 --name=rand-read --rw=randread --group_reporting --time_based
(laying out files)
rand-read: (groupid=0, jobs=10): err= 0: pid=4363
  read : io=21079MB, bw=2108MB/s, iops=539567, runt= 10001msec
    clat (usec): min=2, max=28356, avg=13.28, stdev=53.77
  cpu          : usr=12.75%, sys=55.71%, ctx=31149, majf=0, minf=5396522
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=5396212/0, short=0/0
     lat (usec): 4=0.05%, 10=48.70%, 20=50.97%, 50=0.07%, 100=0.01%
     lat (usec): 250=0.07%, 500=0.03%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.02%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=21079MB, aggrb=2108MB/s, minb=2158MB/s, maxb=2158MB/s, mint=10001msec, maxt=10001msec

OK, 2.1 GB/s of random read bandwidth and a 13.28 µs average round-trip using direct_access, not too shabby. However, this is still using a 1MB RAM disk, so I need to look into fixing the vmalloc issues.

Update: I tried booting with mem=3072m and hacked xiprd to use __va(0x10000000) instead of vmalloc’ing. I also tried ioremap_cache(), which failed. Using __va() appeared to work, but the system rebooted when I tried to make the filesystem, so this is definitely a WIP.

mmap + pramfs

This one is still a work in progress. I patched pramfs into the kernel source, rebuilt it, rebooted with mem=3072m, and tried creating a pramfs filesystem with:

sudo mount -t pramfs -o physaddr=0x100000000,init=2000M,bs=4k none /mnt/pram

When I boot the kernel with only 3GB of RAM, 0x100000000 is a valid system RAM address which is not in use, so this works fine. But when I run fio against it, my system locks up while laying out the last file. So this is still a work in progress; maybe I ran past the end of the RAM, or who knows. I’ll keep trying and update if I find anything useful.
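For context, the trick here is that booting with mem=3072m makes the kernel ignore everything above 3GB, leaving that RAM free for pramfs to claim. A hedged sketch of how a driver could grab such a region (the names and sizes are mine, matching the mount options above; this is not pramfs’s actual code):

#include <linux/init.h>
#include <linux/io.h>        /* ioremap_cache, iounmap */
#include <linux/ioport.h>    /* request_mem_region */
#include <linux/module.h>

#define PRAM_PHYS 0x100000000ULL    /* 4GB: above what mem=3072m manages */
#define PRAM_SIZE (2000UL << 20)    /* 2000 MB, matching init=2000M */

static void __iomem *pram_virt;

static int __init pram_claim_init(void)
{
    /* Stake a claim so nothing else grabs the region */
    if (!request_mem_region(PRAM_PHYS, PRAM_SIZE, "pram-test"))
        return -EBUSY;

    /* Cacheable mapping of the raw RAM. Note that ioremap_cache()
     * failed for me in xiprd, so treat this as a sketch, not gospel. */
    pram_virt = ioremap_cache(PRAM_PHYS, PRAM_SIZE);
    if (!pram_virt) {
        release_mem_region(PRAM_PHYS, PRAM_SIZE);
        return -ENOMEM;
    }
    return 0;
}
module_init(pram_claim_init);
MODULE_LICENSE("GPL");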

I tried a few more times, with mmap and libaio engines but couldn’t get it to work – so I filed bug 3118126 for the curious.

UPDATE: I finally got back around to compiling PRAMFS without memory protection and XIP. I guess the kernel doesn’t like setting PTEs for RAM it’s not tracking, which makes sense.

Results for libaio:

sudo fio --bs=4k --ioengine=libaio --iodepth=1 --numjobs=10 \
 --size=180m  --direct=1 --runtime=60 --directory=/mnt/pram/ \
 --name=rand-read --rw=randread --time_based --group_reporting
(laying out files, etc)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [841M/0K /s] [210K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=1741
  read : io=49262MB, bw=840623KB/s, iops=210155, runt= 60008msec
    slat (usec): min=33, max=30023, avg=37.08, stdev=14.81
    clat (usec): min=0, max=20027, avg= 0.41, stdev= 1.21
    bw (KB/s) : min=48520, max=105888, per=12.47%, avg=104790.37, stdev=408.00
  cpu          : usr=2.49%, sys=77.28%, ctx=55025, majf=0, minf=356
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=12611029/0, short=0/0
     lat (usec): 2=99.98%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=49262MB, aggrb=840623KB/s, minb=860798KB/s,
    maxb=860798KB/s, mint=60008msec, maxt=60008msec

For the impatient, that’s ~840 MB/s with libaio.

How about mmap + PRAMFS?

sudo fio --bs=4k --ioengine=mmap --iodepth=1 --numjobs=10 \
--size=180m --direct=1 --runtime=60 --directory=/mnt/pram/ \
 --name=rand-read --rw=randread --time_based --group_reporting
(cut)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [6827M/0K /s] [1707K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=2260
  read : io=382030MB, bw=6367MB/s, iops=1630K, runt= 60001msec
    clat (usec): min=1, max=22316, avg= 4.35, stdev=21.06
    bw (KB/s) : min=44208, max=347336, per=1.31%, avg=85323.68, stdev=525.67
  cpu          : usr=48.01%, sys=31.28%, ctx=60143, majf=460800, minf=97339361
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=97799801/0, short=0/0
     lat (usec): 2=0.01%, 4=38.14%, 10=61.33%, 20=0.01%, 50=0.47%
     lat (usec): 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=382030MB, aggrb=6367MB/s, minb=6520MB/s, maxb=6520MB/s, mint=60001msec, maxt=60001msec

Awesome! 6.3 GB/s of random read bandwidth using MMAP. I’m totally loving MMAP; it rocks.

mmap + block IO

I chose ext4 for this test, because it’s the most prevalent Linux filesystem at the time of this writing, and because rumor pegs it as more stable than btrfs.

I loaded my xiprd driver so it would be a fairer comparison – RAM disk against RAM disk; hopefully fewer oranges, and more apples to apples.


$ sudo fio --bs=4k --ioengine=mmap --iodepth=1 --numjobs=10 --size=180m \
 --direct=1 --runtime=60 --directory=/mnt/xip/ --name=rand-read \
--rw=randread --time_based --group_reporting
(laying out files)
rand-read: (groupid=0, jobs=10): err= 0: pid=2504
  read : io=392968MB, bw=6549MB/s, iops=1677K, runt= 60001msec
    clat (usec): min=1, max=24708, avg= 4.89, stdev=33.22
    bw (KB/s) : min=188848, max=367640, per=3.86%, avg=258557.40, stdev= 0.00
  cpu          : usr=49.12%, sys=30.48%, ctx=55807, majf=460797, minf=100139571
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=100599920/0, short=0/0
     lat (usec): 2=0.01%, 4=29.03%, 10=70.76%, 20=0.17%, 50=0.03%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=392968MB, aggrb=6549MB/s, minb=6707MB/s, maxb=6707MB/s, mint=60001msec, maxt=60001msec

The main highlights here are the 6549 MB/s of random read bandwidth and the 4.89 µs average completion latency (clat). That completely blows direct_access + MMAP out of the water, and is pretty awesome for an MMAP'd RAM disk.

libaio + block IO

Just for complete and total fairness, I re-ran fio against my RAM disk with libaio (async IO) instead of MMAP:

 $ sudo fio --bs=4k --ioengine=libaio --iodepth=1 --numjobs=10 \
--size=180m --direct=1 --runtime=60 --directory=/mnt/xip/ \
--name=rand-read --rw=randread --time_based --group_reporting
(laying out files)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [4521M/0K /s] [1130K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=4835
  read : io=264571MB, bw=4410MB/s, iops=1129K, runt= 60000msec
    slat (usec): min=2, max=32699, avg= 6.53, stdev=33.72
    clat (usec): min=0, max=21591, avg= 0.66, stdev=10.33
    bw (KB/s) : min=161100, max=368408, per=6.85%, avg=309377.89, stdev=12854.65
  cpu          : usr=18.43%, sys=60.72%, ctx=74361, majf=0, minf=367
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=67730119/0, short=0/0
     lat (usec): 2=99.84%, 4=0.15%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=264571MB, aggrb=4410MB/s, minb=4515MB/s, maxb=4515MB/s, mint=60000msec, maxt=60000msec

Skipping to the bottom line: the MMAP'd RAM disk came out ~2 GB/s and ~2 µs ahead. That is pretty awesome.

Conclusion(s)

UPDATE: I finally added some PRAMFS numbers below. Enjoy!

Regular memory-mapped files win by a small amount when it comes to bandwidth – but PRAMFS is a very close contender.

How about latency?

Regular mmap'd files ALMOST win here also. PRAMFS wins by a hair on this test, but only when using MMAP – PRAMFS + libaio is rather... not good. This is *not* what I expected, but it is rather fascinating. Granted, all of this data was collected against a 2 GB RAM disk... but it is pretty interesting that regular bio-based transfers between the page cache and my RAM disk beat out a 1MB mmap'd, kmalloc'd, direct_access()'d RAM disk.

What does this mean in real life?

I have no idea... probably the only real conclusions you can draw from all these graphs and data are something like:

  1. Memory-mapped files can be faster for some workloads
  2. XIP is no match for regular MMAP'd files, in its present state

Oh, and here's a link to the spreadsheet (mmap_fio_data.xls) I used to generate the above graphs.

Atto

Extreme Dollhouse Programming

November 13, 2010

My kids woke up at 5am on Saturday morning, which was not surprising or particularly unusual.

Being the awesome, totally cool dad that I am :-), I let my wife sleep in while I entertained the kids.

Also being a geek / engineer, it obviously wasn’t good enough to just play with my kids’ toys… I soon found myself balancing dollhouse people on top of dollhouse furniture on top of dollhouses.

Ladies and Gentlemen, without further ado I give you… Extreme Dollhouse Programming.

I learned three things from playing with dollhouse toys:

  1. engineer + dollhouse = weird things happen
  2. it’s really hard to balance odd shapes while kids are trying to knock them down
  3. a lot of software is built just like these rickety furniture stacks

On #1, what else can I really say… weird stuff happens when you let engineers out of their cubicles. My wife tells me that I should get my head examined, because normal people might play house or act out episodes from Lifetime TV. Leave it to an engineer to stack Grandma, six chairs, and a toilet up in the sky.

#2 almost goes without saying… but it’s quite entertaining to see just how fast you can rebuild your tower before it gets knocked down by your toddler. Think of it like a game of reverse speed Jenga. Someday it’ll be a competitive Olympic sport… I can almost see my gold medal now 😉

And #3 is my lame excuse of a tie-in to justify this post.

But seriously though, how many projects have you worked on where you felt like the whole project could come crashing down at any minute? How many complex software systems are thrown together at high speed, held together by baling wire and ugly Perl scripts? How many projects have no formal requirements, or worse yet no real customers?

How many software projects are really, carefully, methodically planned and executed in an elegant way?

This is not a critique / rant where I tear into the software industry and make stupid arguments like “software developers suck” — I’m thinking more about the way I approach software development, and thinking we all have room for improvement.

And while there are some great ideas found in eXtreme Programming, Agile, etc. – I don’t think there’s one true software development style or approach. It’s more that there are some good ideas out there, and everyone should use them to improve themselves and get better at the “craft.”

So here’s to building better dollhouses, better software, self-improvement, and all that jazz.

Atto


Joel on Software

November 3, 2010

I’m sitting at the Garden Court hotel in Palo Alto, California where there are no less than three bubbling fountains within earshot. I’m seated in an open-air courtyard (see photos below) where I’m impatiently waiting for 3:00 to roll around.

Joel Spolsky of “Joel on Software” fame is doing a world tour to pitch FogBugz and Kiln. I’m not familiar with FogBugz (I’ve been a JIRA/Crucible user for a while), but I’m mildly excited to hear about it, especially if he demos Evidence Based Scheduling. Kiln? Well, yes, I am very curious to find out more about what they’ve done with Mercurial, especially with the recent announcement that Atlassian purchased BitBucket. I’ve been wondering for years why people spend so much time worrying about revision control tools and forget about the whole rest of the stack – GitHub is a perfect example of going beyond just revision control to include the whole change flow, from code to bugs to review and so on.

So yes, I’m excited to learn more about Kiln… But the real reason I’m here is Joel.

It’s silly, I know, but I’ve been following Joel’s blog for a number of years now, and I’m sad that he’s stopped. I’m sure that after 10 years of blogging he’s on to new and better things – but Joel has always stood out in my mind as someone who makes things happen, someone who makes a difference. Am I personally a better developer because of Joel’s ramblings? Maybe; it’s hard to prove. But he made me think new thoughts and helped me see the world in a slightly different light, see things from another perspective. Do I agree with his thinking that .NET is awesome and open source is pointless/misguided? Not in the slightest. But I am excited to finally meet the man who challenged my way of thinking and ultimately made me a better person through his long and analogy-ridden blog posts.

Well it’s getting close to the event, so I’m going to go join in the fun.

Best,

Atto

(Photos: fountain, courtyard, iPad)
