Back of the Napkin Wasm Performance: How Does Extism Work?

Check out the Extism project on GitHub!

Earlier this fall, we wrote about why you might want to consider Extism for your next WebAssembly project. We got into the nitty-gritty of the work it currently takes just to send strings to WebAssembly plugins. In this post, we’ll walk through what Extism does behind the scenes to make it easy to communicate with plugin guests using high-level types and, more importantly, how that performs in your application.

Extism is a WebAssembly framework that provides a shared virtual memory space (a kernel) accessible to both the host and the guest, allowing for RPC-like interactions between the two, along with a minimal set of platform capabilities (logging, HTTP requests, and getting and setting global variables).

Breaking that down: “virtual memory” is a technique that puts a layer of indirection between an address space and offsets into storage. It is most famously implemented in concert between your operating system and your processor hardware’s memory management unit (“MMU”). This virtual memory gives processes the illusion of a contiguous address space backed by memory. It can be implemented in terms of page tables or segmentation registers, but generally the idea is to form a sort of map from address numbers to destination storage addresses.

Extism provides a software version of virtual memory for Wasm. The host exposes a series of functions¹ that enable Wasm guest programs to load and store values in a virtual memory space controlled by the host. The host also provides alloc and free functions to the guest for creating new blocks of virtual memory.

Input and output use a separate set of functions apart from general virtual memory access. This means that guest exports don’t need to take any arguments explicitly, instead using input_load_u64 or input_load_u8 to access input memory. Extism’s output_set lets a plugin indicate which virtual memory region to use as output data using an offset.

Putting it all together: during an Extism plugin call, the host moves the plugin input to the Extism virtual memory region. The host then transfers control to the guest. The guest uses input_length to determine the amount of local memory to allocate, then uses input_load_u64 to copy the input out of the Extism virtual memory space into local memory one 64-bit word at a time, copying any spare bytes using input_load_u8.

After the guest performs some processing, it calls alloc to allocate virtual memory for output. The guest then copies its output from local memory into virtual memory using store_u64(offset, value), again picking up any spare bytes using store_u8(offset, value). Finally, the guest calls output_set to point the output at the offset of the newly allocated virtual memory.
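The call flow described above can be sketched as a small simulation. This is a rough model, not Extism’s real kernel: the MockKernel class, its fields, readOutput, guestEcho, and callPlugin are all hypothetical names invented for illustration, and only the kernel function names (input_length, input_load_u64, input_load_u8, alloc, store_u64, store_u8, output_set) come from the text. It assumes little-endian words and a simple bump allocator.

```javascript
// Illustrative stand-in for the host-controlled virtual memory space.
// NOT Extism's real kernel: the class shape and allocator are assumptions.
class MockKernel {
  constructor(input) {
    this.input = input;                      // input bytes staged by the host
    this.memory = new Uint8Array(64 * 1024); // fixed-size virtual memory
    this.blocks = new Map();                 // offset -> length of each allocation
    this.next = 8;                           // bump-allocator cursor
    this.output = null;                      // offset chosen via output_set
    this.calls = 0;                          // count of guest -> kernel calls
  }
  input_length() { this.calls++; return this.input.length; }
  input_load_u64(offset) {
    this.calls++;
    const view = new DataView(this.input.buffer, this.input.byteOffset);
    return view.getBigUint64(offset, true);
  }
  input_load_u8(offset) { this.calls++; return this.input[offset]; }
  alloc(length) {
    this.calls++;
    if (this.next + length > this.memory.length) throw new Error("out of virtual memory");
    const offset = this.next;
    this.blocks.set(offset, length);
    this.next += length;
    return offset;
  }
  store_u64(offset, value) {
    this.calls++;
    new DataView(this.memory.buffer).setBigUint64(offset, value, true);
  }
  store_u8(offset, value) { this.calls++; this.memory[offset] = value; }
  output_set(offset) { this.calls++; this.output = offset; }
  // Host side: read back the block the guest marked as output.
  readOutput() {
    const length = this.blocks.get(this.output);
    return this.memory.slice(this.output, this.output + length);
  }
}

// A "guest" that copies its input into local memory, then back out as output.
function guestEcho(kernel) {
  const length = kernel.input_length();
  const local = new Uint8Array(length);
  const localView = new DataView(local.buffer);
  const words = length >> 3; // number of whole 8-byte words

  // Copy input into local memory: whole words first, then spare bytes.
  for (let i = 0; i < words; i++) {
    localView.setBigUint64(i * 8, kernel.input_load_u64(i * 8), true);
  }
  for (let i = words * 8; i < length; i++) local[i] = kernel.input_load_u8(i);

  // (Real processing would happen here.)

  // Copy local memory into freshly allocated virtual memory.
  const out = kernel.alloc(length);
  for (let i = 0; i < words; i++) {
    kernel.store_u64(out + i * 8, localView.getBigUint64(i * 8, true));
  }
  for (let i = words * 8; i < length; i++) kernel.store_u8(out + i, local[i]);
  kernel.output_set(out);
}

// Host side of one plugin call: stage input, run the guest, read output.
function callPlugin(inputBytes) {
  const kernel = new MockKernel(inputBytes);
  guestEcho(kernel);
  return { output: kernel.readOutput(), calls: kernel.calls };
}
```

Calling callPlugin(new TextEncoder().encode("hello")) returns the same five bytes plus a count of kernel calls, which makes it easy to see how the number of calls grows with payload size.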

The goal of all of this is to coordinate access to memory shared between the host and guest. One alternative approach would be to expose the guest’s memory allocator directly to the host. This approach is slightly more efficient than Extism’s, but puts control over the address space firmly in the plug-in’s corner. Depending on the guest language, it can be difficult to get a build that exposes allocation functions reliably; and even then, using the guest’s address space makes it difficult to diagnose memory corruption issues when they crop up. By using a virtual memory staging area, Extism trades a bit of performance for usability — drawing a clear line around memory issues and unifying the debugging experience across guest languages².

That said, you might start to wonder how this performs in practice. I certainly did! Luckily, Zach pointed me in the direction of Extism’s Criterion benchmarks. In particular, he wrote a benchmark, reflect, that tests the roundtrip time from host to guest, to host function and back, then back to the host; this benchmark is parameterized with payloads of 64KiB, 640KiB, and 6.25MiB (65536 times 10 to the power of 0, 1, and 2).

Given what we know from the above, we have 4 loops over Extism kernel functions to copy memory to and from virtual memory:

One to copy memory from the input into the plugin.
One to copy memory from the plugin into Extism memory before calling the host function.
One to copy memory back to the local plugin from the host function output in Extism memory.
One to copy plugin memory into Extism output.

Each loop divvies up Extism function calls between 8-byte u64 calls and 1-byte u8 calls, so each loop should take (input_size >> 3) + (input_size & 7) function calls. For reflect_100’s 6.25MiB payload (6,553,600 bytes), that is 819,200 function calls per loop, or roughly 3.3MM function calls across all four loops. Whew! All of the host interactions with Extism virtual memory use memcpy directly; in this case, there should be four memcpy calls total. There are a few ancillary calls between the plugin and Extism (one input_length() call and one output_set() call, for example), but they’re background noise against the hundreds of thousands of load/store calls.
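As a quick check of the call-count arithmetic, here is a small sketch. The function names callsPerLoop and reflectCalls and the constant REFLECT_LOOPS are made up for this example; the formula and the count of four loops are the ones given in the text.

```javascript
// Kernel calls needed by one copy loop: one u64 call per whole 8-byte word,
// plus one u8 call per leftover byte. Equivalent to (size >> 3) + (size & 7).
function callsPerLoop(inputSize) {
  return Math.floor(inputSize / 8) + (inputSize % 8);
}

// The reflect benchmark copies the payload four times (the four loops above).
const REFLECT_LOOPS = 4;

function reflectCalls(inputSize) {
  return REFLECT_LOOPS * callsPerLoop(inputSize);
}

// Print the counts for the three benchmark payload sizes.
for (const size of [65536, 655360, 6553600]) {
  console.log(size, callsPerLoop(size), reflectCalls(size));
}
```

All three benchmark payloads are multiples of 8, so the leftover-byte term is zero for them; it only matters for payloads like a 19-byte string, which would need 2 u64 calls and 3 u8 calls per loop.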

How fast is this? How slow is this?

reflect/reflect 1       time:   [223.95 µs 224.45 µs 224.99 µs]
                        thrpt:  [277.79 MiB/s 278.45 MiB/s 279.07 MiB/s]
reflect/reflect 10      time:   [2.0941 ms 2.0965 ms 2.0989 ms]
                        thrpt:  [297.78 MiB/s 298.12 MiB/s 298.45 MiB/s]
reflect/reflect 100     time:   [21.936 ms 21.958 ms 21.982 ms]
                        thrpt:  [284.33 MiB/s 284.63 MiB/s 284.93 MiB/s]

Well, on my M2 MacBook Pro³, reflect(65536e1) takes an estimated 2.1ms per invocation, while reflect(65536e2) takes about 21.9ms. For the large payload, Criterion reports a throughput of around 284.63MiB/s. That works out to 149,230,348 Extism kernel calls per second ((4 * 819200)/21.958ms = ~149230.3488 ops/ms, multiplied out to seconds). Each of those calls transfers 8 bytes, so multiplied out we’re transferring 1.11GiB/s. Divide that by the four copy loops and we get 298,460,697.69 bytes per second, or 284.63MiB/s: exactly what Criterion reports as the throughput. Criterion arrived at this number more directly: (65536e2/21.958ms) * 1000. This suggests to me that our estimate of the number of kernel calls is probably not far off.

So, how does this stack up?

Comparing this to the Node Addon API benchmarks on the same machine, the granular speed of calling an Extism kernel function seems to be “within the ballpark”: this machine runs about 37MM ops/s on the low end and ~159MM ops/s on the high end (for property_descriptor_noexcept getters and “no arguments core function”, respectively).

Comparing this to dd if=/dev/zero of=test bs=65536 count=400, well, we’re nowhere close to as fast: this M2 Mac hits 1.37GiB/s, roughly five times as fast as Extism’s reflect throughput. But it’s not all doom and gloom. Memory takes a W-shaped path through reflect: it bounces the same bytes from the host to the guest to the host to the guest and finally back to the host. How does a V shape perform? Or simply writing bytes in? Well, those numbers are pretty rosy:

consume/consume 65536 bytes
                        time:   [48.917 µs 49.008 µs 49.101 µs]
                        thrpt:  [1.2431 GiB/s 1.2454 GiB/s 1.2477 GiB/s]
consume/consume 655360 bytes
                        time:   [373.00 µs 373.48 µs 373.97 µs]
                        thrpt:  [1.6321 GiB/s 1.6342 GiB/s 1.6363 GiB/s]
consume/consume 6553600 bytes
                        time:   [3.8832 ms 3.8912 ms 3.8994 ms]
                        thrpt:  [1.5653 GiB/s 1.5685 GiB/s 1.5718 GiB/s]
echo/echo 65536 bytes   time:   [77.862 µs 78.030 µs 78.197 µs]
                        thrpt:  [799.27 MiB/s 800.98 MiB/s 802.70 MiB/s]
echo/echo 655360 bytes  time:   [666.71 µs 667.28 µs 667.87 µs]
                        thrpt:  [935.81 MiB/s 936.64 MiB/s 937.43 MiB/s]
echo/echo 6553600 bytes time:   [6.8523 ms 6.8598 ms 6.8675 ms]
                        thrpt:  [910.08 MiB/s 911.11 MiB/s 912.11 MiB/s]

To avoid burying the lede: flushing bytes to Wasm is fast! It’s within the ballpark of our dd benchmark. Looking at the numbers:

> bytes_oneway = (819200/3.8912 * 8000) / (1024 << 20)
1.5685432835629112 // 1.5GiB/s
> bytes_twoway = ((2 * 819200)/6.8598 * 8000) / 2 / (1024 << 10)
911.1052800373188 // 911.1MiB/s
> bytes_fourway = (4 * 819200)/21.958 * 8000 / 4 / (1024 << 10)
284.6343018489845 // 284.63MiB/s

So, what we really want to know is, roughly, how much time will it cost to transfer a given amount of data?

Inverting operations per millisecond gives us the time per operation: 1/(819200/3.8912) = 0.00000475ms, or 4.75ns, from our one-way test and 1/(819200 * 4 / 21.958) = 0.0000067ms, or 6.7ns, from our four-way test. That’s a pretty wide spread per operation: I’d guess that the three extra memcpy calls are responsible for the gap (two for moving memory into and out of the host function, and one for moving the return value out of the guest).

That said, 4.75 nanoseconds per call ain’t bad⁴.

This gives us a nice way to guess how long moving a given payload between the guest and host will take, in milliseconds: input size / 8 * 0.00000475. Trying that with 65536e1, we get an estimate of 389µs, which is about 16µs higher than our observed value of 373µs. Still, the error gets worse for higher values, so it might be best to fudge that and add a nanosecond per roundtrip. Again, this is all back of the napkin math!
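The napkin estimate can be wrapped in a tiny helper. This is only a sketch: estimateTransferMicros and NS_PER_CALL are names invented here, the 4.75ns figure is the one-way number measured above on one machine, and counting leftover bytes as one call each is an assumption carried over from the earlier call-count formula.

```javascript
// Per-call cost from the one-way consume benchmark in this post (one machine,
// one run): treat it as a rough constant, not a guarantee.
const NS_PER_CALL = 4.75;

// Estimate the time to move `bytes` across the boundary once, in microseconds.
// Assumes one kernel call per whole 8-byte word plus one per leftover byte.
function estimateTransferMicros(bytes, nsPerCall = NS_PER_CALL) {
  const calls = Math.floor(bytes / 8) + (bytes % 8);
  return (calls * nsPerCall) / 1000;
}

// Compare the estimates against Criterion's observed consume timings.
for (const size of [65536, 655360, 6553600]) {
  console.log(size, estimateTransferMicros(size).toFixed(1) + "µs");
}
```

Passing a different nsPerCall, such as the 6.7ns figure from the four-way test, gives a more pessimistic bound for workloads that also call host functions.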

From what we’ve demonstrated, integrating an Extism plugin is unlikely to become a performance bottleneck in your application by itself, especially for common sizes of data. While it’s useful to have a mental model of how Extism achieves its ergonomics, hopefully this walkthrough also gives you an idea of how the Extism team thinks about balancing trade-offs. After all, to make all software programmable, not only does Extism have to make plugins fast, it also has to make integrating them safe, simple, and straightforward to debug.

¹ load_u64, load_u8, store_u64, and store_u8. Every load and store function comes in two variants: u64 and u8, for operating on 8-byte chunks and 1-byte values respectively.
² That is not to say that a realloc-based approach isn’t feasible, just that it benefits from additional tooling written carefully for each guest language. For example, the Component model exposes guest memory allocation functions as part of its canonical ABI; but it also generates host interfaces to safely interact with that memory using an interface definition language.
³ Remember, we’re not doing Science (TM) here!
⁴ This is fast in part because Extism takes advantage of wasmtime’s Linker: the Extism kernel is actually implemented as another Wasm module! Zach has spent a lot of time making sure the Extism kernel is as safe as it is fast. We’re even able to reuse the kernel in our Go SDK, based on Wazero!
