
True Zero-Copy with XIP vs PRAMFS


Introduction

A couple of weeks ago, Louis Brandy wrote a blog post entitled “Memory mapped IO for fun and profit,” which I found rather fascinating. I’ve always loved memory mapped files, for a lot of the reasons listed on his blog. It really makes your software simpler to be able to treat a chunk of memory like… a chunk of memory, letting the OS handle the IO behind the scenes as deemed appropriate.

I really liked his blog post; it was a great example of a use case where mmap() didn't just make the code easier to debug and/or understand, it actually provided a real, noticeable performance boost.

But one thing that's always bugged me about MMAP'd files is that they really aren't "zero copy." They're more like "one copy," because you still have to fault and fetch/flush pages via the normal IO mechanism. What if there were a way to perform truly zero-copy MMAP'd files? Granted, you need either RAM or something else that is direct-mapped into the processor's address space (PCM/CMOx/Memristors anyone?), but I'd like to test my theory that this could lead to higher performance than the current mechanism.

The diagram below shows the various paths that file data can take through the system. I've tried to color-code it, but probably made a mess of it.

The green arrows show the normal I/O flow, with two copies of the data retrieved from storage; the red arrows show normal MMAP'd file I/O; and the blue line labelled "mmap + direct_access" is what I'm all keen to test.

The basic idea I'm trying to convey is that conventional IO (whether from a RAM disk or a regular SATA HDD) involves a fair amount of copying. Data is copied into the user program's stack/heap, but it's also copied into the page cache. So mmap'd files are neat because you can skip some of that copying and do the I/O behind the scenes to flush dirty pages, etc. But what I'm after is a true zero copy, which requires a RAM disk (or some other storage the CPU can address directly).
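To make the difference concrete, here's a rough userspace sketch of the two access patterns. This is purely illustrative (the path is made up and error handling is omitted), not code from the benchmark:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/xip/datafile", O_RDONLY);   /* made-up path */
    struct stat st;
    fstat(fd, &st);

    /* Conventional IO: the kernel copies data from the page cache (or
     * the device) into this private buffer; at least one extra copy. */
    char *buf = malloc(st.st_size);
    read(fd, buf, st.st_size);

    /* mmap: the file's pages are mapped straight into our address space.
     * Faults still pull them into the page cache ("one copy"), but with
     * XIP/direct_access the mapping could point at the storage itself. */
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    printf("first byte via read(): %d, via mmap(): %d\n", buf[0], map[0]);

    munmap(map, st.st_size);
    free(buf);
    close(fd);
    return 0;
}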

I did some digging, and found the following snippet in Documentation/filesystems/xip.txt:

A block device operation named direct_access is used to retrieve a reference (pointer) to a block on-disk. The reference is supposed to be cpu-addressable, physical address and remain valid until the release operation is performed.
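In the 2.6.3x kernels that hook lives in struct block_device_operations. Just to give a feel for it, here's a bare-bones sketch of what the hook might look like for a physically contiguous (kmalloc'd) RAM disk. This is not my actual xiprd code, and rd_base / rd_size are made-up placeholders:

#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/errno.h>
#include <asm/io.h>

static void *rd_base;           /* kmalloc'd backing store (placeholder)  */
static unsigned long rd_size;   /* size of the backing store in bytes     */

/* Return a CPU-addressable pointer and page frame number for a sector;
 * the ext2 XIP code then maps that pfn straight into userspace. */
static int rd_direct_access(struct block_device *bdev, sector_t sector,
                            void **kaddr, unsigned long *pfn)
{
        unsigned long offset = sector << 9;     /* 512-byte sectors */

        if (offset >= rd_size)
                return -ERANGE;

        *kaddr = rd_base + offset;
        /* Only valid because the buffer is physically contiguous and
         * direct-mapped (kmalloc), not vmalloc'd. */
        *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
        return 0;
}

static const struct block_device_operations rd_fops = {
        .owner         = THIS_MODULE,
        .direct_access = rd_direct_access,
};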

I did some more digging, and I think what I'm really looking for is something like ext2's XIP feature; not so much from an "I want to execute this code from NOR" standpoint, but more that I want MMAP'd files which are directly addressable by the CPU.

And then there’s PRAMFS, a new-ish filesystem which doesn’t use block devices or the block layer at all – preferring to directly control the target [RAM/PRAM/???] with some page table protection thrown in for fun.

So, I set out to test which of these three methods gives the best bang for our proverbial buck.

If you just want to see the graphs and data, feel free to skip to the conclusion.

The Contenders

In no order of preference, here are the various IO methods I plan to test:

  • ext2 + direct_access
  • pramfs
  • mmap() + block IO
  • libaio + block IO [no mmap]

My current plan is to use the fio benchmark with the "mmap" IO engine, because a) I have no useful real-world examples or use cases of my own, and b) fio really is a flexible IO tester 🙂

Creating the ext2 + direct_access setup was actually quite complex: I decided to write a custom RAM-disk driver with direct_access support (xiprd), and I ended up rebuilding my kernel for ext2 + XIP support… I'm not going to go into detail on getting this set up and working; google for "how to compile a kernel the Ubuntu way" if you're curious.

Benchmark machine setup:

  • i7-920 @ 2.67 GHz
  • 6GB DDR3 memory
  • Ubuntu 10.04 with vanilla kernel 2.6.36.1 + ftrace + ext2 XIP

ext2 + direct_access

Once I got my custom kernel with ext2 “xip” (execute in place) support built in, I loaded my custom block device driver (xiprd) with a 2 GB RAM-disk. Of course, trying to run the fio MMAP benchmark resulted in a crash and a system reboot… followed by a harried bug hunt.

This is still a work in progress. I think the bug has something to do with my using vmalloc() instead of kmalloc(), but switching to kmalloc() limits my RAM-disk size to 1MB. Using vmalloc, I keep getting this error:


page_offset=14
[ 1720.844373] ramdisk=ffff880106e00000, kaddr=ffff880106e0e000, pfn=1076750
[ 1720.954378] fio[3749]: segfault at 7fe0265b9328 ip 000000000041b17e sp 00007fff43825f40 error 4 in fio[400000+3b000]

Hacking xiprd to use kmalloc(), I then had to force mkfs.ext2 to use a block size of 4K (-b 4096); otherwise the mount failed.

$ sudo fio --bs=4k --ioengine=mmap --iodepth=1 \
--numjobs=10 --size=80k --direct=1 --runtime=10 --directory=/mnt/xip/ \
 --name=rand-read --rw=randread --group_reporting --time_based
(laying out files)
rand-read: (groupid=0, jobs=10): err= 0: pid=4363
  read : io=21079MB, bw=2108MB/s, iops=539567, runt= 10001msec
    clat (usec): min=2, max=28356, avg=13.28, stdev=53.77
  cpu          : usr=12.75%, sys=55.71%, ctx=31149, majf=0, minf=5396522
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=5396212/0, short=0/0
     lat (usec): 4=0.05%, 10=48.70%, 20=50.97%, 50=0.07%, 100=0.01%
     lat (usec): 250=0.07%, 500=0.03%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.02%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=21079MB, aggrb=2108MB/s, minb=2158MB/s, maxb=2158MB/s, mint=10001msec, maxt=10001msec

OK, 2.1 GB/s of random-read bandwidth and a 13.28 usec average completion latency using direct_access; not too shabby. However, this is still using a 1MB RAM disk, so I need to look into fixing the vmalloc issues.

Update: I tried booting with mem=3072m, and hacked xiprd to use __va(0x10000000) instead of vmalloc'ing. I also tried ioremap_cache(), which failed. Using __va() appeared to work, but the system rebooted when I tried to make the filesystem, so this is definitely a WIP.
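For what it's worth, my best guess at the root cause: vmalloc() memory is only virtually contiguous, so deriving the pfn from the base address (or via virt_to_phys()) is only valid for direct-mapped, physically contiguous memory, and the XIP path ends up mapping the wrong physical page into the process. If that's right, the fix would be a per-page lookup along these lines; again, an untested sketch using the same placeholder names as the earlier direct_access sketch:

#include <linux/blkdev.h>
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Untested sketch: direct_access for a vmalloc'd RAM disk.  Each block
 * has to be resolved through the vmalloc page tables rather than by
 * assuming physical contiguity. */
static int rd_vmalloc_direct_access(struct block_device *bdev, sector_t sector,
                                    void **kaddr, unsigned long *pfn)
{
        unsigned long offset = sector << 9;
        void *vaddr;
        struct page *page;

        if (offset >= rd_size)
                return -ERANGE;

        vaddr = rd_base + offset;                /* rd_base is vmalloc'd here */
        page = vmalloc_to_page(vaddr);
        if (!page)
                return -EFAULT;

        *kaddr = vaddr;
        *pfn = page_to_pfn(page);                /* correct pfn for this page */
        return 0;
}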

mmap + pramfs

This is a work in progress: I patched pramfs into the kernel source, rebuilt it, rebooted with mem=3072m, and tried creating a pramfs filesystem with:

sudo mount -t pramfs -o physaddr=0x100000000,init=2000M,bs=4k none /mnt/pram

When I boot the kernel with only 3GB of RAM, 0x100000000 is a valid system RAM address which is not in use, so this works fine. But when I run fio against it, my system locks up while laying out the last file. So this is still a work in progress; maybe I ran past the end of the RAM, who knows. I'll keep trying and update if I find anything useful.

I tried a few more times with the mmap and libaio engines but couldn't get it to work, so I filed bug 3118126 for the curious.

UPDATE: I finally got back around to compiling PRAMFS without memory protection and XIP. I guess the kernel doesn't like setting PTEs for RAM it isn't tracking, which makes sense.

Results for libaio:

sudo fio --bs=4k --ioengine=libaio --iodepth=1 --numjobs=10 \
 --size=180m  --direct=1 --runtime=60 --directory=/mnt/pram/ \
 --name=rand-read --rw=randread --time_based --group_reporting
(laying out files, etc)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [841M/0K /s] [210K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=1741
  read : io=49262MB, bw=840623KB/s, iops=210155, runt= 60008msec
    slat (usec): min=33, max=30023, avg=37.08, stdev=14.81
    clat (usec): min=0, max=20027, avg= 0.41, stdev= 1.21
    bw (KB/s) : min=48520, max=105888, per=12.47%, avg=104790.37, stdev=408.00
  cpu          : usr=2.49%, sys=77.28%, ctx=55025, majf=0, minf=356
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=12611029/0, short=0/0
     lat (usec): 2=99.98%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=49262MB, aggrb=840623KB/s, minb=860798KB/s,
    maxb=860798KB/s, mint=60008msec, maxt=60008msec

For the impatient, that’s ~840 MB/s with libaio.

How about mmap + PRAMFS?

sudo fio --bs=4k --ioengine=mmap --iodepth=1 --numjobs=10 \
 --size=180m  --direct=1 --runtime=60 --directory=/mnt/pram/ \
 --name=rand-read --rw=randread --time_based --group_reporting
(cut)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [6827M/0K /s] [1707K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=2260
  read : io=382030MB, bw=6367MB/s, iops=1630K, runt= 60001msec
    clat (usec): min=1, max=22316, avg= 4.35, stdev=21.06
    bw (KB/s) : min=44208, max=347336, per=1.31%, avg=85323.68, stdev=525.67
  cpu          : usr=48.01%, sys=31.28%, ctx=60143, majf=460800, minf=97339361
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=97799801/0, short=0/0
     lat (usec): 2=0.01%, 4=38.14%, 10=61.33%, 20=0.01%, 50=0.47%
     lat (usec): 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=382030MB, aggrb=6367MB/s, minb=6520MB/s, maxb=6520MB/s, mint=60001msec, maxt=60001msec

Awesome! 6.3 GB/s of read bandwidth using MMAP. I'm totally loving MMAP; it rocks.

mmap + block IO

I chose ext4 for this test because it's the most prevalent Linux filesystem at the time of this writing, and because rumor pegs it as more stable than btrfs.

I loaded my xiprd driver so it would be a fairer comparison: RAM disk against RAM disk, and hopefully fewer oranges and more apples to apples.


$ sudo fio --bs=4k --ioengine=mmap --iodepth=1 --numjobs=10 --size=180m \
 --direct=1 --runtime=60 --directory=/mnt/xip/ --name=rand-read \
--rw=randread --time_based --group_reporting
(laying out files)
rand-read: (groupid=0, jobs=10): err= 0: pid=2504
  read : io=392968MB, bw=6549MB/s, iops=1677K, runt= 60001msec
    clat (usec): min=1, max=24708, avg= 4.89, stdev=33.22
    bw (KB/s) : min=188848, max=367640, per=3.86%, avg=258557.40, stdev= 0.00
  cpu          : usr=49.12%, sys=30.48%, ctx=55807, majf=460797, minf=100139571
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=100599920/0, short=0/0
     lat (usec): 2=0.01%, 4=29.03%, 10=70.76%, 20=0.17%, 50=0.03%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=392968MB, aggrb=6549MB/s, minb=6707MB/s, maxb=6707MB/s, mint=60001msec, maxt=60001msec

The main highlights here are the 6549 MB/s of random-read bandwidth and the 4.89 usec average completion latency (clat). That completely blows direct_access + MMAP out of the water, and is pretty awesome for an MMAP'd RAM disk.

libaio + block IO

Just for complete and total fairness, I re-ran fio against my RAM disk with libaio (async IO) instead of MMAP:

 $ sudo fio --bs=4k --ioengine=libaio --iodepth=1 --numjobs=10 \
--size=180m --direct=1 --runtime=60 --directory=/mnt/xip/ \
--name=rand-read --rw=randread --time_based --group_reporting
(laying out files)
Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [4521M/0K /s] [1130K/0 iops] [eta 00m:00s]
rand-read: (groupid=0, jobs=10): err= 0: pid=4835
  read : io=264571MB, bw=4410MB/s, iops=1129K, runt= 60000msec
    slat (usec): min=2, max=32699, avg= 6.53, stdev=33.72
    clat (usec): min=0, max=21591, avg= 0.66, stdev=10.33
    bw (KB/s) : min=161100, max=368408, per=6.85%, avg=309377.89, stdev=12854.65
  cpu          : usr=18.43%, sys=60.72%, ctx=74361, majf=0, minf=367
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=67730119/0, short=0/0
     lat (usec): 2=99.84%, 4=0.15%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=264571MB, aggrb=4410MB/s, minb=4515MB/s, maxb=4515MB/s, mint=60000msec, maxt=60000msec

Skipping to the bottom line, the MMAP'd RAM disk was ~2 GB/s faster in bandwidth and ~2 usec faster in latency. That is pretty awesome.

Conclusion(s)

UPDATE: I finally added some PRAMFS numbers below. Enjoy!

Regular memory-mapped files win by a small amount when it comes to bandwidth, but PRAMFS is a very close contender.

How about latency?

Regular mmap'd files ALMOST win here too. PRAMFS wins by a hair on this test, but only when using MMAP; PRAMFS + libaio is rather... not good. This is *not* what I expected, but it is rather fascinating. Granted, all of this data was collected against a 2 GB RAM disk... but it is pretty interesting that regular bio-based transfers between the page cache and my RAM disk beat out a 1MB mmap'd, kmalloc'd, direct_access()'d RAM disk.

What does this mean in real life?

I have no idea... probably the only real conclusion you can draw from all of this data and these graphs is something like:

  1. Memory-mapped files can be faster for some workloads
  2. XIP is no match for regular MMAP'd files, in its present state

Oh, and here's a link to the spreadsheet (mmap_fio_data.xls) I used to generate the above graphs.

Atto

Comments
  1. novelocrat
    November 25, 2010 at 6:48 am

    I’m pretty sure mmap()ing from a ramdisk should be zero copy, at least if the particular ramdisk driver allows it. Why shouldn’t the backing storage be mapped directly, if the filesystem and device don’t need to be able to journal data modifications?

    • November 25, 2010 at 7:39 am

      Good question. The ramdisk I used for these tests is a custom block device I wrote, so it definitely doesn’t support the mmap() system call directly. Struct block_device_operations has no mmap, just direct_access. I made the ramdisk a block device so I could benchmark libaio vs direct_access.

      Feel free to examine the source code on github and make suggestions/improvements…

  2. November 26, 2010 at 2:05 am

    Interesting result!
    I, too, have always found mmap-ed files to be a very useful and interesting concept. It does take a 64-bit architecture for them to be really useful, as with 32-bit it is too easy to run out of address space. Luckily, these are nearly ubiquitous these days.
    The only drawback to using mmap-ed memory structures as-is compared to old fashioned ‘loading’ is that all pointers are relative. This can give a (small) performance hit when accessing heavily self-referential structures such as trees. (handling pointers between multiple mmapped objects is an even more interesting problem 🙂

    • November 26, 2010 at 6:39 am

      Thanks for the comment and the detail. Yes, 64-bit computers make mmap’ing much more possible/useful. Not sure why self-referential structures take a performance hit – can you explain further?

  3. November 26, 2010 at 6:56 am

Well, to be clear, I was referring to data structures that are usable as-is after mmapping. If you use mmapped I/O simply as a replacement for traditional I/O, you won't have this problem, but you'd still have the same loading and allocation overhead constructing data structures.

    One place where this can come in useful is paging in games. Instead of a long loading sequence, you want to be able to just mmap a part of your world into memory and use it instantly.

For example: Normally, when you load a tree from a file with old-fashioned I/O, you'd simply allocate memory for each node, and make the nodes refer to each other using pointers.

    In the mmapped case, as you don’t know in advance where an mmapped memory block will end up in the address space, you can’t simply use pointers — you’d have to either do relocation on load (again causing unwanted overhead on loading) or do relocation on-the-fly when accessing the data (also known as lazy relocation). The second option is clearly preferable.

However, this adds some access overhead depending on how complex your relocation scheme is. If you just add an offset, it is pretty quick. But if you have multiple mmapped files which somehow need to refer to each other (for example, models referring to textures) it can become pretty hairy.
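    A minimal sketch of the offset-based scheme being described, with made-up struct and field names:

    #include <stddef.h>
    #include <stdint.h>

    /* A node in an mmap'd file stores byte offsets instead of pointers, so
     * the structure is usable no matter where mmap() places the file. */
    struct node {
        int32_t  value;
        uint32_t left_off;    /* offset of left child within the file, 0 = none  */
        uint32_t right_off;   /* offset of right child within the file, 0 = none */
    };

    /* Lazy relocation: convert an offset into a usable pointer at access
     * time.  The extra add (and branch) on every hop is the overhead
     * mentioned above. */
    static inline struct node *child(void *map_base, uint32_t off)
    {
        return off ? (struct node *)((char *)map_base + off) : NULL;
    }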

  4. Marco
    December 30, 2010 at 10:42 am

    Good report and benchmarking activity. I have a comment about the latency results comparing pramfs + libaio and the other approaches. The latency is due to the memory protection: it needs extra locking and TLB flushes. Maybe without it, the latency would be comparable to the "pramfs + mmap" use case.

    • December 31, 2010 at 6:24 am

      Extra locking and flushing would definitely add to the latency. Either way, pramfs is pretty cool and has some definite uses so keep up the good work. Thanks for the comment!

  5. Marco
    December 31, 2010 at 4:03 am

    Instead of using your xiprd device driver, you should use the classic RAM block device but with XIP support; you can find the option under Device Drivers -> Block devices in the kernel config menu.

    • December 31, 2010 at 6:22 am

      Thanks for the tip – I originally started out writing xiprd because I was curious to tinker with direct_access, and this post grew from that. I’ll take a look at the standard ram disk when I get a chance. Thanks for stopping by!

  6. June 9, 2013 at 3:05 am

    thus maps the file to a region of virtual memory. Memory-mapped files are the critical piece of the storage engine in MongoDB. By using memory mapped files MongoDB can treat the contents of its data files as if they were in memory. This provides MongoDB with an extremely fast and simple method for accessing and manipulating data.

