Earlier this fall, we wrote about why you might want to consider Extism for your next WebAssembly project. We got into the nitty-gritty of the work you have to do in order to simply send strings to WebAssembly plugins. Today, we’ll walk through what Extism does behind the scenes to make it easy to communicate with plug-in guests using high-level types and, more importantly, how that performs in your application.
Extism is a WebAssembly framework that provides a shared virtual memory space (a kernel) accessible to both the host and the guest, allowing for RPC-like interactions between the two, along with a set of minimal platform capabilities (logging, HTTP requests, and getting and setting global variables).
Breaking that down: “virtual memory” is a technique that puts a layer of indirection between an address space and offsets into storage. It is most famously implemented in concert between your operating system and your processor’s memory management unit (“MMU”). Virtual memory gives processes the illusion of a contiguous address space backed by memory. It can be implemented in terms of page tables or segmentation registers, but generally the idea is to form a sort of map from address numbers to destination storage addresses.
Extism provides a software version of virtual memory for Wasm. The host exposes a series of functions¹ that enable Wasm guest programs to load and store values in a virtual memory space controlled by the host. The host also provides `alloc` and `free` functions to the guest for creating and releasing blocks of virtual memory.
Input and output use a separate set of functions apart from general virtual memory access. This means that guest exports don’t need to take any arguments explicitly; instead, they use `input_load_u64` or `input_load_u8` to access input memory. Extism’s `output_set` lets a plugin indicate, by offset, which virtual memory region to use as output data.
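To make the shape of this interface concrete, here’s a rough sketch of how a Rust guest might declare these imports by hand. The import module name, the exact signatures, and the extra length parameter on `output_set` are assumptions on my part; the authoritative definitions live in the Extism kernel, and the PDKs wrap all of this for you.

```rust
// A hand-written sketch of the Extism kernel imports from a Rust guest.
// Module name and signatures are assumptions, not the canonical definitions.
#[link(wasm_import_module = "extism:host/env")]
extern "C" {
    fn input_length() -> u64;                // size of the current input, in bytes
    fn input_load_u64(offset: u64) -> u64;   // read 8 bytes of input memory
    fn input_load_u8(offset: u64) -> u8;     // read 1 byte of input memory
    fn alloc(length: u64) -> u64;            // reserve a block of virtual memory
    fn free(offset: u64);                    // release a previously allocated block
    fn load_u64(offset: u64) -> u64;         // read 8 bytes of virtual memory
    fn load_u8(offset: u64) -> u8;           // read 1 byte of virtual memory
    fn store_u64(offset: u64, value: u64);   // write 8 bytes of virtual memory
    fn store_u8(offset: u64, value: u8);     // write 1 byte of virtual memory
    fn output_set(offset: u64, length: u64); // mark a region as the plugin output
}
```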
Putting it all together: during an Extism plugin call, the host moves the plugin input to the Extism virtual memory region. The host then transfers control to the guest. The guest uses `input_length` to determine the amount of local memory to allocate, then uses `input_load_u64` to copy the input out of the Extism virtual memory space into local memory one 64-bit double-word at a time, copying any spare bytes using `input_load_u8`.
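As an illustration, the guest side of that input copy might look something like the following. This is a sketch against the imports declared above, not code lifted from an Extism PDK; it assumes Wasm’s little-endian byte order.

```rust
// Copy the plugin input out of Extism virtual memory into a local buffer,
// 8 bytes at a time, then pick up any remaining bytes one at a time.
unsafe fn read_input() -> Vec<u8> {
    let len = input_length() as usize;
    let mut buf = vec![0u8; len];
    let words = len >> 3; // number of whole 8-byte chunks
    for i in 0..words {
        let value = input_load_u64((i << 3) as u64);
        buf[i << 3..(i << 3) + 8].copy_from_slice(&value.to_le_bytes());
    }
    for i in (words << 3)..len {
        buf[i] = input_load_u8(i as u64); // spare bytes
    }
    buf
}
```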
After the guest performs some processing, it calls `alloc` to allocate virtual memory for output. The guest then copies its output from local memory into virtual memory using `store_u64(offset, value)`, again picking up any spare bytes using `store_u8(offset, value)`. Finally, the guest calls `output_set(offset)` to set the output pointer to the newly allocated virtual address.
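And the mirror image on the way out, under the same assumptions (including the hypothetical two-argument `output_set`):

```rust
// Copy the plugin's result from local memory into a freshly allocated block
// of Extism virtual memory, then point the output at that block.
unsafe fn write_output(data: &[u8]) {
    let offset = alloc(data.len() as u64);
    let words = data.len() >> 3;
    for i in 0..words {
        let mut chunk = [0u8; 8];
        chunk.copy_from_slice(&data[i << 3..(i << 3) + 8]);
        store_u64(offset + ((i << 3) as u64), u64::from_le_bytes(chunk));
    }
    for i in (words << 3)..data.len() {
        store_u8(offset + i as u64, data[i]); // spare bytes
    }
    output_set(offset, data.len() as u64);
}
```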
The goal of all of this is to coordinate access to memory shared between the host and guest. One alternative approach would be to expose the guest’s memory allocator directly to the host. This approach is slightly more efficient than Extism’s, but puts control over the address space firmly in the plug-in’s corner. Depending on the guest language, it can be difficult to get a build that exposes allocation functions reliably; and even then, using the guest’s address space makes it difficult to diagnose memory corruption issues when they crop up. By using a virtual memory staging area, Extism trades a bit of performance for usability — drawing a clear line around memory issues and unifying the debugging experience across guest languages².
That said, you might start to wonder how this performs in practice. I certainly did! Luckily, Zach pointed me in the direction of Extism’s Criterion benchmarks. In particular, he wrote a benchmark, `reflect`, that tests the roundtrip time from host to guest, to host function and back, then back to the host; this benchmark is parameterized with payloads of 64KiB, 640KiB, and 6.25MiB (65536 × 10⁰, 10¹, and 10² bytes.)
Given what we know from the above, we have four loops over Extism kernel functions to copy memory to and from virtual memory:
- One to copy memory from the input into the plugin.
- One to copy memory from the plugin into Extism memory before calling the host function.
- One to copy memory back to the local plugin from the host function output in Extism memory.
- One to copy plugin memory into Extism output.
Each loop divvies up Extism function calls between 8-byte `u64` calls and 1-byte `u8` calls, so each loop should take `(input_size >> 3) + (input_size & 7)` calls, or 819,200 function calls in each direction for `reflect_100`’s 6.25MiB payload. Across all four loops, that’s roughly 3.3MM function calls. Whew! All of the host interactions with Extism virtual memory use `memcpy` directly — in this case, there should be four `memcpy` calls total. There are a few ancillary calls between the plugin and Extism — one `input_length()` call and one `output_set()` call, for example — but they’re background noise against the millions of load/store calls.
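If you want to check that arithmetic yourself, there’s nothing Extism-specific about it; here’s the napkin math as a tiny program:

```rust
// Kernel calls needed to move reflect_100's 6.25MiB payload.
fn main() {
    let input_size: u64 = 65536 * 100; // 6,553,600 bytes (6.25MiB)
    let calls_per_loop = (input_size >> 3) + (input_size & 7); // u64 calls + spare u8 calls
    let total_calls = 4 * calls_per_loop; // four copy loops per reflect invocation
    println!("{calls_per_loop} calls per loop, {total_calls} calls total");
    // prints: 819200 calls per loop, 3276800 calls total
}
```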
How fast is this? How slow is this?
```
reflect/reflect 1    time:   [223.95 µs 224.45 µs 224.99 µs]
                     thrpt:  [277.79 MiB/s 278.45 MiB/s 279.07 MiB/s]
reflect/reflect 10   time:   [2.0941 ms 2.0965 ms 2.0989 ms]
                     thrpt:  [297.78 MiB/s 298.12 MiB/s 298.45 MiB/s]
reflect/reflect 100  time:   [21.936 ms 21.958 ms 21.982 ms]
                     thrpt:  [284.33 MiB/s 284.63 MiB/s 284.93 MiB/s]
```
Well, on my M2 MacBook Pro³, `reflect(65536e1)` estimates 2.1ms per invocation, while `reflect(65536e2)` takes about 21.9ms. Criterion reports that our throughput is around 284.63MiB/s. That’s 149,230,348 Extism kernel calls per second for the large payload (`(4 * 819200) / 21.958ms = ~149230.3488 ops/ms`, multiplied out to seconds.) Each of those calls transfers 8 bytes, so multiplied out we’re transferring 1.11GiB/s. Divide that again by four roundtrips and we get 298,460,697.69 bytes per second — or 284.63MiB/s, exactly what Criterion reports as the throughput. Criterion arrived at this number more directly: `(65536e2 / 21.958ms) * 1000`. This suggests to me that our estimate of the number of kernel calls is probably not far off.
So, how does this stack up?
Comparing this to the Node Addon API benchmarks on the same machine, the granular speed of calling an Extism kernel function seems to be “within the ballpark”: this machine runs about 37MM ops/s on the low end and ~159MM ops/s on the high end (for `property_descriptor_noexcept` getters and “no arguments core function”, respectively.)
Comparing this to `dd if=/dev/zero of=test bs=65536 count=400`, well, we’re nowhere close to as fast: this M2 Mac hits 1.37GiB/s, roughly five times Extism’s `reflect` throughput. But it’s not all doom and gloom. Memory takes a `W`-shaped path through `reflect`: it bounces the same bytes from the host to the guest to the host to the guest and finally back to the host. How does a `V` shape perform? Or simply writing bytes in? Well, those numbers are pretty rosy:
```
consume/consume 65536 bytes    time:   [48.917 µs 49.008 µs 49.101 µs]
                               thrpt:  [1.2431 GiB/s 1.2454 GiB/s 1.2477 GiB/s]
consume/consume 655360 bytes   time:   [373.00 µs 373.48 µs 373.97 µs]
                               thrpt:  [1.6321 GiB/s 1.6342 GiB/s 1.6363 GiB/s]
consume/consume 6553600 bytes  time:   [3.8832 ms 3.8912 ms 3.8994 ms]
                               thrpt:  [1.5653 GiB/s 1.5685 GiB/s 1.5718 GiB/s]
echo/echo 65536 bytes          time:   [77.862 µs 78.030 µs 78.197 µs]
                               thrpt:  [799.27 MiB/s 800.98 MiB/s 802.70 MiB/s]
echo/echo 655360 bytes         time:   [666.71 µs 667.28 µs 667.87 µs]
                               thrpt:  [935.81 MiB/s 936.64 MiB/s 937.43 MiB/s]
echo/echo 6553600 bytes        time:   [6.8523 ms 6.8598 ms 6.8675 ms]
                               thrpt:  [910.08 MiB/s 911.11 MiB/s 912.11 MiB/s]
```
To avoid burying the lede: flushing bytes to Wasm is fast! It’s within the ballpark of our `dd` benchmark. Looking at the numbers:

```
> bytes_oneway = (819200 / 3.8912 * 8000) / (1024 << 20)
1.5685432835629112 // ~1.57GiB/s
> bytes_twoway = ((2 * 819200) / 6.8598 * 8000) / 2 / (1024 << 10)
911.1052800373188 // ~911.1MiB/s
> bytes_fourway = ((4 * 819200) / 21.958 * 8000) / 4 / (1024 << 10)
284.6343018489845 // ~284.63MiB/s
```
So, what we really want to know is, roughly, how much time will it cost to transfer a given amount of data?
Inverting the relationship between operations per millisecond, we get the time per operation: `1 / (819200 / 3.8912) = 0.00000475`, or 4.75ns, from our one-way test and `1 / (819200 * 4 / 21.958) = 0.0000067`, or 6.7ns, from our four-way test. That’s a pretty wide variance per operation: I’d guess that the three extra `memcpy` calls are responsible for the gap (two for moving memory into and out of the host function, and one for moving the return value out of the guest.)
That said, 4.75 nanoseconds per call ain’t bad⁴.
This gives us a nice way to guess how long moving a given payload between the guest and host will take: `input_size / 8 * 0.00000475`. Trying that with `65536e1`, we get an estimate of 387µs, which is 14µs higher than our observed value. Still, the error gets worse for larger payloads, so it might be best to fudge that and add a nanosecond per roundtrip. Again, this is all back-of-the-napkin math!
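For what it’s worth, here’s that rule of thumb packaged as a hypothetical helper. The 4.75ns constant and the nanosecond-per-roundtrip fudge come straight from the measurements above; they’re properties of my machine, not of Extism:

```rust
use std::time::Duration;

// Back-of-the-napkin estimate: one kernel call per 8 bytes at ~4.75ns per
// call, fudged upward by a nanosecond for each additional roundtrip.
fn estimate_transfer(bytes: u64, roundtrips: u64) -> Duration {
    let ns_per_call = 4.75 + roundtrips.saturating_sub(1) as f64;
    let calls = ((bytes >> 3) + (bytes & 7)) * roundtrips;
    Duration::from_nanos((calls as f64 * ns_per_call) as u64)
}

fn main() {
    // 640KiB one-way, comparable to `consume 655360 bytes` (~373µs observed):
    println!("{:?}", estimate_transfer(655_360, 1)); // ~389µs
}
```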
From what we’ve demonstrated, integrating an Extism plugin is unlikely to become a performance bottleneck in your application by itself, especially for common sizes of data. While it’s useful to have a mental model of how Extism achieves its ergonomics, hopefully this walkthrough gives you an idea of how the Extism team thinks about balancing trade-offs. After all, to make all software programmable, not only does Extism have to make plugins fast, it also has to make integrating them safe, simple, and straightforward to debug.
Footnotes

1. `load_u64`, `load_u8`, `store_u64`, and `store_u8`. Every load and store function comes in two variants: `u64` and `u8`, for operating on 8-byte chunks and 1-byte values respectively. ↩
2. That is not to say that a `realloc`-based approach isn’t feasible, just that it benefits from additional tooling written carefully for each guest language. For example, the Component Model exposes guest memory allocation functions as part of its canonical ABI; but it also generates host interfaces to safely interact with that memory using an interface definition language. ↩
3. Remember, we’re not doing Science (TM) here! ↩
4. This is fast in part because Extism takes advantage of wasmtime’s `Linker`: the Extism kernel is actually implemented as another Wasm module! Zach has spent a lot of time making sure the Extism kernel is as safe as it is fast. We’re even able to reuse the kernel in our Go SDK, based on Wazero! ↩
Whether you're curious about WebAssembly or already putting it into production, we've got plenty more to share.