True Zero-Copy with XIP vs PRAMFS

November 24, 2010

Introduction

A couple of weeks ago, Louis Brandy wrote a blog post entitled “Memory mapped IO for fun and profit,” which I found rather fascinating. I’ve always loved memory mapped files, for a lot of the reasons listed on his blog. It really makes your software simpler to be able to treat a chunk of memory like… a chunk of memory, letting the OS handle the IO behind the scenes as deemed appropriate.
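Just to make that concrete, here's the kind of thing I mean: a toy C snippet (my own throwaway example, not anything from Louis's post) that maps a file and then simply treats it like an array.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;

        /* Map the whole file; the kernel faults pages in behind the scenes. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* No read() calls, no user-space buffer management, just pointer math. */
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
                sum += p[i];
        printf("byte sum = %ld\n", sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
}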

I really liked his blog post; it was a great example of a use case where mmap() didn't just make the code easier to debug and understand, it actually provided a real, noticeable performance boost.

But one thing that's always bugged me about MMAP'd files is that they really aren't "zero copy." They're more like "one copy," because you still have to fault pages in and fetch/flush them via the normal IO mechanism. What if there were a way to do truly zero-copy MMAP'd files? Granted, you either need RAM or something else that is direct-mapped into the processor (PCM/CMOx/memristors, anyone?), but I'd like to test my theory that this could lead to higher performance than the current mechanism.

The diagram below shows the various paths that files can take through the system; I've tried to color-code it, but probably made a mess of it.

The green arrows show the normal I/O flow, with two copies of the data retrieved from storage; the red arrows show normal MMAP'd files; and the blue line labelled "mmap + direct_access" is what I'm all keen to test.

The basic idea I’m trying to convey is that conventional IO (either from a RAM disk or a regular SATA HDD) involves a fair amount of copying. Data is copied into the user program’s stack/heap, but it’s also copied into the page cache. So mmap’d files are neat because you can skip some of the copying and do I/O behind the scenes to flush dirty pages, etc. But what I’m after is a true zero-copy, which requires a RAMdisk.

I did some digging, and found the following snippet in Documentation/filesystems/xip.txt:

A block device operation named direct_access is used to retrieve a reference (pointer) to a block on-disk. The reference is supposed to be cpu-addressable, physical address and remain valid until the release operation is performed.
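For reference, in kernels of this vintage the hook hangs off struct block_device_operations. Going from my reading of the 2.6.3x headers (so treat the exact layout as approximate), it looks roughly like this:

/* include/linux/blkdev.h, circa 2.6.36 (abridged) */
struct block_device_operations {
        int (*open)(struct block_device *, fmode_t);
        int (*release)(struct gendisk *, fmode_t);
        /* ... */
        /* return a CPU-addressable kernel pointer and pfn for a given sector */
        int (*direct_access)(struct block_device *, sector_t,
                             void **, unsigned long *);
        /* ... */
        struct module *owner;
};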

I did some more digging, and I think what I'm really looking for is something like ext2's XIP feature — but not so much from an "I want to execute this code from NOR" standpoint; more like I want MMAP'd files which are directly addressable by the CPU.

And then there’s PRAMFS, a new-ish filesystem which doesn’t use block devices or the block layer at all – preferring to directly control the target [RAM/PRAM/???] with some page table protection thrown in for fun.

So, I set out to test which of these approaches gives the best bang for our proverbial buck.

If you just want to see the graphs and data, feel free to skip to the conclusion.

The Contenders

In no particular order of preference, here are the various IO methods I plan to test:

  • ext2 + direct_access
  • pramfs
  • mmap() + block IO
  • libaio + block IO [no mmap]

My current plan is to use the fio benchmark with the "mmap" IO engine, because a) I don't have any useful example use cases of my own, and b) fio really is a flexible IO tester 🙂

Creating the ext2 + direct_access setup was actually quite complex: I decided to write a custom RAM disk driver with direct_access support (xiprd), and I ended up rebuilding my kernel for ext2 + XIP support… I'm not going to go into detail on getting this set up and working; google for "how to compile a kernel the Ubuntu way" if you're curious.
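That said, the interesting part of a driver like xiprd is actually tiny. Below is a rough sketch of what the direct_access path of a kmalloc-backed RAM disk could look like; this is my own illustrative code (names like xiprd_data are made up), not the actual xiprd source, and all the usual gendisk/request-queue registration boilerplate is left out:

/* Illustrative sketch only, not the real xiprd driver. */
#define XIPRD_BYTES   (1024 * 1024)      /* kmalloc-backed variant: 1MB store */
static void *xiprd_data;                 /* allocated with kmalloc() at init */

static int xiprd_direct_access(struct block_device *bdev, sector_t sector,
                               void **kaddr, unsigned long *pfn)
{
        loff_t offset = (loff_t)sector << 9;           /* 512-byte sectors */

        if (offset >= XIPRD_BYTES)
                return -ERANGE;

        *kaddr = xiprd_data + offset;                  /* CPU-addressable pointer */
        *pfn   = virt_to_phys(*kaddr) >> PAGE_SHIFT;   /* OK for kmalloc memory */
        return 0;
}

static const struct block_device_operations xiprd_fops = {
        .owner         = THIS_MODULE,
        .direct_access = xiprd_direct_access,
};

ext2's XIP code then hands these pointers straight out to user mappings instead of going through the page cache; at least, that's my understanding of how the pieces fit together.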

Benchmark machine setup:

  • i7-920 @ 2.67 GHz
  • 6GB DDR3 memory
  • Ubuntu 10.04 with vanilla kernel 2.6.36.1 + ftrace + ext2 XIP

ext2 + direct_access

Once I got my custom kernel with ext2 “xip” (execute in place) support built in, I loaded my custom block device driver (xiprd) with a 2 GB RAM-disk. Of course, trying to run the fio MMAP benchmark resulted in a crash and a system reboot… followed by a harried bug hunt.

This is still a work in progress. I think the bug has something to do with my using vmalloc() instead of kmalloc(), but switching to kmalloc() limits my RAM disk size to 1MB. Using vmalloc, I keep getting this error:


page_offset=14
[ 1720.844373] ramdisk=ffff880106e00000, kaddr=ffff880106e0e000, pfn=1076750
[ 1720.954378] fio[3749]: segfault at 7fe0265b9328 ip 000000000041b17e sp 00007fff43825f40 error 4 in fio[400000+3b000]
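My (totally unverified) hunch about the vmalloc failure: vmalloc'd memory isn't part of the kernel's linear mapping, so a pfn computed with virt_to_phys() on it would be garbage, and the page handed to user space would point somewhere random. If that's what's going on, the fix in the sketch above would look something like:

        /* Untested guess: vmalloc'd buffers need a per-page lookup. */
        if (is_vmalloc_addr(*kaddr))
                *pfn = vmalloc_to_pfn(*kaddr);              /* walks the page tables */
        else
                *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;  /* linear map (kmalloc) only */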

After hacking xiprd to use kmalloc, I then had to force mkfs.ext2 to use a block size of 4K (-b 4096), otherwise the mount failed.

$ sudo fio --bs=4k --ioengine=mmap --iodepth=1 \
--numjobs=10 --size=80k --direct=1 --runtime=10 --directory=/mnt/xip/ \
 --name=rand-read --rw=randread --group_reporting --time_based
(laying out files)
rand-read: (groupid=0, jobs=10): err= 0: pid=4363
  read : io=21079MB, bw=2108MB/s, iops=539567, runt= 10001msec
    clat (usec): min=2, max=28356, avg=13.28, stdev=53.77
  cpu          : usr=12.75%, sys=55.71%, ctx=31149, majf=0, minf=5396522
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=5396212/0, short=0/0
     lat (usec): 4=0.05%, 10=48.70%, 20=50.97%, 50=0.07%, 100=0.01%
     lat (usec): 250=0.07%, 500=0.03%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.02%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=21079MB, aggrb=2108MB/s, minb=2158MB/s, maxb=2158MB/s, mint=10001msec, maxt=10001msec

OK, 2.1 GB/s of random read bandwidth and a 13.28 us average round trip using direct_access, not too shabby. However, this is still using a 1MB RAM disk, so I need to look into fixing the vmalloc issues.

Update: I tried booting with mem=3072m and hacked xiprd to use __va(0x10000000) instead of vmalloc'ing. I also tried ioremap_cache(), which failed. Using __va() appeared to work, but the system rebooted when I tried to make the filesystem, so this is definitely a WIP.

mmap + pramfs

This one is still a work in progress. I patched pramfs into the kernel source, rebuilt it, rebooted with mem=3072m, and tried creating a pramfs filesystem with:

sudo mount -t pramfs -o physaddr=0x100000000,init=2000M,bs=4k none /mnt/pram

When I boot the kernel with only 3GB of RAM, 0x100000000 is a valid system RAM address which is not in use, so this works fine. But when I run fio against it, my system locks up when laying out the last file. So this is still a work in progress; maybe I ran past the end of the RAM, or who knows. I'll keep trying and update if I find anything useful.

I tried a few more times with the mmap and libaio engines but couldn't get it to work, so I filed bug 3118126 for the curious.

UPDATE: I finally got back around to compiling PRAMFS without memory protection and XIP. I guess the kernel doesn't like setting PTEs for RAM it's not tracking, which makes sense.

Results for libaio:

sudo fio --bs=4k --ioengine=libaio --iodepth=1 --numjobs=10 \
 --size=180m  --direct=1 --runtime=60 --directory=/mnt/pram/ \
 --name=rand-read --rw=randread --time_based --group_reporting
(laying out files, etc)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [841M/0K /s] [210K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=1741
  read : io=49262MB, bw=840623KB/s, iops=210155, runt= 60008msec
    slat (usec): min=33, max=30023, avg=37.08, stdev=14.81
    clat (usec): min=0, max=20027, avg= 0.41, stdev= 1.21
    bw (KB/s) : min=48520, max=105888, per=12.47%, avg=104790.37, stdev=408.00
  cpu          : usr=2.49%, sys=77.28%, ctx=55025, majf=0, minf=356
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=12611029/0, short=0/0
     lat (usec): 2=99.98%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=49262MB, aggrb=840623KB/s, minb=860798KB/s,
    maxb=860798KB/s, mint=60008msec, maxt=60008msec

For the impatient, that’s ~840 MB/s with libaio.

How about mmap + PRAMFS?

sudo fio --bs=4k --ioengine=mmap --iodepth=1 --numjobs=10 \
--size=180m  --direct=1 --runtime=60 --directory=/mnt/pram/ \
 --name=rand-read --rw=randread --time_based --group_reporting
(cut)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [6827M/0K /s] [1707K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=2260
  read : io=382030MB, bw=6367MB/s, iops=1630K, runt= 60001msec
    clat (usec): min=1, max=22316, avg= 4.35, stdev=21.06
    bw (KB/s) : min=44208, max=347336, per=1.31%, avg=85323.68, stdev=525.67
  cpu          : usr=48.01%, sys=31.28%, ctx=60143, majf=460800, minf=97339361
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=97799801/0, short=0/0
     lat (usec): 2=0.01%, 4=38.14%, 10=61.33%, 20=0.01%, 50=0.47%
     lat (usec): 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=382030MB, aggrb=6367MB/s, minb=6520MB/s, maxb=6520MB/s, mint=60001msec, maxt=60001msec

Awesome! 6.3 GB/s of read speed using MMAP. I’m totally loving MMAP, it rocks.

mmap + block IO

I chose ext4 for this test because it's the most prevalent filesystem at the time of this writing, and because rumor pegs it as being more stable than btrfs.

I loaded my xiprd driver so it would be a fairer comparison – RAM disk against RAM disk, hopefully fewer oranges and more apples to apples.


$ sudo fio --bs=4k --ioengine=mmap --iodepth=1 --numjobs=10 --size=180m \
 --direct=1 --runtime=60 --directory=/mnt/xip/ --name=rand-read \
--rw=randread --time_based --group_reporting
(laying out files)
rand-read: (groupid=0, jobs=10): err= 0: pid=2504
  read : io=392968MB, bw=6549MB/s, iops=1677K, runt= 60001msec
    clat (usec): min=1, max=24708, avg= 4.89, stdev=33.22
    bw (KB/s) : min=188848, max=367640, per=3.86%, avg=258557.40, stdev= 0.00
  cpu          : usr=49.12%, sys=30.48%, ctx=55807, majf=460797, minf=100139571
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=100599920/0, short=0/0
     lat (usec): 2=0.01%, 4=29.03%, 10=70.76%, 20=0.17%, 50=0.03%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=392968MB, aggrb=6549MB/s, minb=6707MB/s, maxb=6707MB/s, mint=60001msec, maxt=60001msec

The main highlights here are the 6549 MB/s of random read bandwidth and the 4.89 usec average clat. That completely blows direct_access + MMAP out of the water, and is pretty awesome for an MMAP'd RAM disk.

libaio + block IO

Just for complete and total fairness, I re-ran fio against my RAM disk with libaio (async IO) instead of MMAP:

 $ sudo fio --bs=4k --ioengine=libaio --iodepth=1 --numjobs=10 \
--size=180m --direct=1 --runtime=60 --directory=/mnt/xip/ \
--name=rand-read --rw=randread --time_based --group_reporting
(laying out files)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [4521M/0K /s] [1130K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=4835
  read : io=264571MB, bw=4410MB/s, iops=1129K, runt= 60000msec
    slat (usec): min=2, max=32699, avg= 6.53, stdev=33.72
    clat (usec): min=0, max=21591, avg= 0.66, stdev=10.33
    bw (KB/s) : min=161100, max=368408, per=6.85%, avg=309377.89, stdev=12854.65
  cpu          : usr=18.43%, sys=60.72%, ctx=74361, majf=0, minf=367
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=67730119/0, short=0/0
     lat (usec): 2=99.84%, 4=0.15%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=264571MB, aggrb=4410MB/s, minb=4515MB/s, maxb=4515MB/s, mint=60000msec, maxt=60000msec

Skipping to the bottom line, the MMAP'd RAM disk came in roughly 2 GB/s higher in bandwidth and about 2 us lower in latency. That is pretty awesome.

Conclusion(s)

UPDATE: I finally added some PRAMFS numbers below. Enjoy!

Regular memory-mapped files win by a small amount when it comes to bandwidth, but PRAMFS is a very close contender.

How about latency?

Regular mmap'd files ALMOST win here also. PRAMFS wins by a hair on this test, but only when using MMAP – PRAMFS + libaio is rather… not good. This is *not* what I expected, but it is rather fascinating. Granted, all of this data was collected against a 2 GB RAM disk… but it is pretty interesting that regular bio-based transfers between the page cache and my RAM disk beat out a 1MB mmap'd, kmalloc'd, direct_access()'d RAM disk.

What does this mean in real life?

I have no idea… probably the only real conclusion you can draw from all of this data and these graphs is something like:

  1. Memory-mapped files can be faster for some workloads
  2. XIP is no match for regular MMAP'd files, in its present state

Oh, and here's a link to the spreadsheet (mmap_fio_data.xls) I used to generate the above graphs.

