Database UDFs, or user-defined functions, are functions created by users to add custom functionality to a database. With UDFs, users can extend a database beyond its built-in functions and tailor it to their specific needs. UDFs are commonly used to perform complex calculations, manipulate data, or carry out custom operations that the database does not support natively. They reduce the amount of application code that needs to be written and improve performance by avoiding the need to move data out of the database.
Supercharge UDFs with WebAssembly + Extism
We’ve previously discussed the impact of UDFs on modern database workloads, as well as the myriad benefits that WebAssembly (Wasm) brings to the mix. In this post, we’ll show an example of this by augmenting PostgreSQL with an extension that runs Wasm-powered UDFs. We’ll accomplish this feat by using the WebAssembly framework, Extism, which gives us several nice features on top of core Wasm:
- Extism has Plug-in Development Kits (PDKs) for several programming languages, enabling users to write UDFs in their favorite language and then compile them to Wasm.
- With Extism, we don’t need to worry about the low-level implementation details for each Wasm runtime or how memory allocation/deallocation is handled.
- Extism hosts provide developer-friendly constructs on top of Wasm’s sandboxing and capability-based security features.
What’s pgrx?
pgrx is a Rust framework for developing PostgreSQL extensions. It abstracts away the internal implementation details of PostgreSQL and allows us to focus on writing our extensions in idiomatic Rust. We can easily define functions from Rust:
#[pg_extern]
fn to_lowercase(input: &str) -> String {
input.to_lowercase()
}
and they’ll be available in SQL:
pg_extism=# select to_lowercase('HELLO WORLD!');
to_lowercase
--------------
hello world!
(1 row)
Introducing pg_extism
pg_extism uses pgrx and the Extism Rust SDK to create a PostgreSQL extension that can run Extism plugins. Users can easily define their own custom plugins and use them in their SQL queries. The implementation is based on these two functions:
extism_call takes the path of an Extism plugin (a .wasm file), the name of a function exported by the plugin, and an input JSON object. It then instantiates the plugin using Extism, calls the function, and returns the result as a JSON object.
#[pg_extern]
fn extism_call(path: &str, name: &str, input: Json) -> Result<Json, Error> {
let json_string = serde_json::to_string(&input.0).unwrap();
let mut plugin = new_plugin(path);
let data = match plugin.call(name, json_string) {
Ok(v) => v,
Err(e) => error!("Error while calling plugin: {}", e),
};
let output = match std::str::from_utf8(data) {
Ok(v) => v,
Err(e) => error!("Invalid UTF-8 sequence: {}", e),
};
let response_json: serde_json::Value = serde_json::from_str(output).unwrap();
Ok(pgx::Json(response_json))
}
extism_define takes the path of an Extism plugin (a .wasm file) and a function name, and creates an SQL function.
#[pg_extern]
fn extism_define(path: &str, name: &str) -> Result<(), Error> {
let mut plugin = new_plugin(path);
if !plugin.has_function("metadata") {
error!("Expected a `metadata` function.");
}
let metadata_json = match plugin.call("metadata", "") {
Ok(v) => v,
Err(err) => error!("Failed to call metadata function: {}", err),
};
let metadata: PluginMetadata = match serde_json::from_slice(metadata_json) {
Ok(v) => v,
Err(err) => error!("Failed to deserialize metadata: {}", err),
};
// Write an SQL function that calls `extism_call` when it's run
// See below for examples of the generated SQL
let sql = generate_dynamic_function(path, name, &metadata);
Ok(pgx::Spi::run(&sql)?)
}
In PostgreSQL, users only need to call extism_define explicitly:
select extism_define('/path/to/chatgpt.wasm', 'llm');
This will define an SQL function called llm with the following schema:
CREATE OR REPLACE FUNCTION llm(prompt TEXT, payload TEXT) RETURNS TEXT AS $$
DECLARE
result_json json;
input_param json;
BEGIN
-- Construct JSON object from parameters
input_param := json_build_object('payload', payload,'prompt', prompt);
-- Call the extism_call function using the provided parameters
SELECT extism_call('/path/to/chatgpt.wasm', 'chatgpt', input_param) INTO result_json;
-- Return the desired field from the result JSON
RETURN (result_json->'response')::TEXT;
EXCEPTION
WHEN others THEN
-- Handle exceptions if necessary
RAISE NOTICE 'An error occurred: %', SQLERRM;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
Notice how we’re using extism_call to invoke the Extism plugin when the SQL function is called. If the plugin returns an array, we generate a similar SQL function:
select extism_define('/path/to/read_csv.wasm', 'read_csv');
CREATE OR REPLACE FUNCTION read_csv(path TEXT) RETURNS SETOF JSON AS $$
DECLARE
result_json json;
input_param json;
BEGIN
-- Construct JSON object from parameters
input_param := json_build_object('path', path);
-- Call the extism_call function using the provided parameters
SELECT extism_call('/path/to/read_csv.wasm', '_start', input_param) INTO result_json;
-- Return the desired field from the result JSON
RETURN QUERY SELECT value::JSON FROM json_array_elements((result_json->'rows')::json);
EXCEPTION
WHEN others THEN
-- Handle exceptions if necessary
RAISE NOTICE 'An error occurred: %', SQLERRM;
END;
$$ LANGUAGE plpgsql;
Notice how we’re using json_array_elements to turn the JSON array into SETOF JSON, so that it can be used in SQL queries more naturally.
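To make that reshaping concrete, here is a small, stdlib-only Go sketch of the same transformation the generated SQL performs: the plugin's response object goes in, and one JSON document per element of its rows array comes out. (ExplodeRows is a hypothetical helper for illustration, not part of pg_extism.)

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExplodeRows mimics what the generated SQL does with json_array_elements:
// it takes the plugin's response object and returns one JSON document per
// element of its "rows" array.
func ExplodeRows(response []byte) ([]json.RawMessage, error) {
	var envelope struct {
		Rows []json.RawMessage `json:"rows"`
	}
	if err := json.Unmarshal(response, &envelope); err != nil {
		return nil, err
	}
	return envelope.Rows, nil
}

func main() {
	resp := []byte(`{"rows":[{"name":"a"},{"name":"b"}]}`)
	rows, err := ExplodeRows(resp)
	if err != nil {
		panic(err)
	}
	for _, r := range rows {
		fmt.Println(string(r)) // one row per array element
	}
}
```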
Writing plugins for the database
How does generate_dynamic_function know which parameters are necessary, what the return type is, and which Extism function to invoke? Each pg_extism plugin must export a function called metadata that returns a JSON object like this:
{
"entryPoint": "name of exported wasm function to call",
"parameters": {
"param1": "String",
"param2": "Json",
"param3": "Number"
},
"returnType": "StringArray",
"returnField": "name of the data field on the response JSON object"
}
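For reference, here is a sketch of how a host might deserialize that metadata. The struct mirrors the PluginMetadata fields used by the Go plugin later in this post; ParseMetadata is a hypothetical helper, not part of pg_extism.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PluginMetadata mirrors the metadata contract each plugin must satisfy.
type PluginMetadata struct {
	EntryPoint  string            `json:"entryPoint"`
	Parameters  map[string]string `json:"parameters"`
	ReturnType  string            `json:"returnType"`
	ReturnField string            `json:"returnField"`
}

// ParseMetadata decodes the JSON returned by a plugin's metadata export.
func ParseMetadata(raw []byte) (PluginMetadata, error) {
	var md PluginMetadata
	err := json.Unmarshal(raw, &md)
	return md, err
}

func main() {
	raw := []byte(`{"entryPoint":"chatgpt","parameters":{"prompt":"String","payload":"String"},"returnType":"String","returnField":"response"}`)
	md, err := ParseMetadata(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(md.EntryPoint, md.ReturnType, md.ReturnField)
	// → chatgpt String response
}
```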
ChatGPT plugin
This Extism plugin sends a prompt to OpenAI’s ChatGPT API and returns the result. Here’s the implementation in JavaScript (using Extism’s JS PDK):
function metadata() {
const metadata = {
entryPoint: "chatgpt",
parameters: {
"prompt": "String",
"payload": "String",
},
returnType: "String",
returnField: "response",
};
Host.outputString(JSON.stringify(metadata));
}
function chatgpt() {
const { prompt, payload } = JSON.parse(Host.inputString());
let body = JSON.stringify({
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "user",
"content": `${prompt} ${payload}`,
},
],
});
let httpResponse = Http.request(
{
url: "https://api.openai.com/v1/chat/completions",
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${Config.get("openai_apikey")}`,
},
},
body,
);
let response = JSON.parse(httpResponse.body);
if (!response.choices || !response.choices[0]) {
throw new Error(`Unexpected response from OpenAI: ${httpResponse.body}`);
}
console.log(
"LLM: Received Response from OpenAI",
response.choices[0].message.content,
);
Host.outputString(JSON.stringify({
response: response.choices[0].message.content,
}));
}
module.exports = { chatgpt, metadata };
Users can use the plugin in SQL like this:
pg_extism=# select extism_define('/path/to/chatgpt.wasm', 'llm');
pg_extism=# select llm('summarize this paragraph in less than 5 words: ', 'this tablet is great for reading the text is not a good as it is on my tab s2 9.7 in but this is about $200 cheaper i got it because i wanted a more natural book feel reading my books and the tab s2 was not doing it for me');
llm
-----------------------------------
Good tablet, cheaper than Tab S2.
(1 row)
ReadCSV plugin
This plugin is implemented in Go, using Extism’s Go PDK.
// Some code is omitted for brevity
//export metadata
func metadata() int32 {
plugin := PluginMetadata{
EntryPoint: "_start", // we want the Host to call the main method
Parameters: map[string]string{
"path": "String",
},
ReturnType: "JsonArray",
ReturnField: "rows",
}
out, err := json.Marshal(plugin)
if err != nil {
panic(err)
}
pdk.Output(out)
return 0
}
func read_csv() int32 {
params := Params{}
err := json.Unmarshal(pdk.Input(), &params)
if err != nil {
panic(err)
}
file, err := os.Open(params.Path)
if err != nil {
panic(err)
}
defer file.Close()
reader := csv.NewReader(file)
headers, err := reader.Read()
if err != nil {
panic(err)
}
var jsonArray []map[string]interface{}
for {
record, err := reader.Read()
if err != nil {
break
}
jsonObject := make(map[string]interface{})
for i, value := range record {
jsonObject[headers[i]] = value
}
jsonArray = append(jsonArray, jsonObject)
}
buff, err := json.Marshal(map[string]interface{}{"rows": jsonArray})
if err != nil {
panic(err)
}
mem := pdk.AllocateBytes(buff)
pdk.OutputMemory(mem)
return 0
}
func main() {
read_csv()
}
To test the plugins, I have downloaded a subset of this Amazon Reviews dataset from Kaggle.
-- Create a reviews table
CREATE TABLE reviews (
id SERIAL PRIMARY KEY,
product_id TEXT,
name TEXT,
rating NUMERIC,
review_text TEXT,
review_summary TEXT,
review_sentiment TEXT
);
-- Insert data from /path/to/reviews.csv into reviews table
INSERT INTO reviews (product_id, name, rating, review_text)
SELECT
json_data->>'id' AS product_id,
json_data->>'name' AS name,
(json_data->>'rating')::numeric AS rating,
json_data->>'review_text' AS review_text
FROM
read_csv('/path/to/reviews.csv') AS json_data;
-- Ask ChatGPT to summarize and run sentiment analysis on the reviews
update reviews set
review_summary = llm('summarize this in 5 words', review_text),
review_sentiment = llm('give me sentiment of this review (from -1 to 1): ', review_text);
-- Check the results
pg_extism=# select rating, review_summary, review_sentiment from reviews limit 3;
rating | review_summary | review_sentiment
--------+--------------------------------------------------------+------------------
5 | "Perfect size and functional design." | "1"
4 | "Efficient, user-friendly, convenient, touchy button." | "0.8"
4 | "Convert PDF to Kindle, email articles." | "0.4"
Future work
While our plugin system for PostgreSQL works, it’s not very fast. Here are some steps that can make it faster and more efficient:
Make communication more efficient by using a different serialization protocol
While JSON is very easy to work with, it can be slow and inefficient. This can add a lot of overhead in chatty scenarios. We can use a binary serialization format like Protocol Buffers or MessagePack to make things faster by both providing faster serialization/deserialization and decreasing the payload size.
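As a rough, stdlib-only illustration of the size difference (not MessagePack or Protocol Buffers themselves), compare a JSON encoding of numeric rows with a fixed-width binary layout; the Row type and field names here are hypothetical:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// Row is a hypothetical record passed between the host and a plugin.
type Row struct {
	ID    uint32  `json:"id"`
	Score float64 `json:"score"`
}

// EncodedSizes returns the payload size of a batch of rows as JSON
// versus a fixed-width binary layout (4 bytes + 8 bytes per row).
func EncodedSizes(rows []Row) (jsonLen, binLen int) {
	j, _ := json.Marshal(rows)
	var buf bytes.Buffer
	for _, r := range rows {
		binary.Write(&buf, binary.LittleEndian, r.ID)
		binary.Write(&buf, binary.LittleEndian, r.Score)
	}
	return len(j), buf.Len()
}

func main() {
	rows := make([]Row, 1000)
	for i := range rows {
		rows[i] = Row{ID: uint32(i), Score: float64(i) * 0.5}
	}
	j, b := EncodedSizes(rows)
	// The binary layout avoids field names, quotes, and digit strings,
	// so the payload is roughly half the size for this shape of data.
	fmt.Printf("json=%d bytes, binary=%d bytes\n", j, b)
}
```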
Pre-initialize plugins using wizer
Before running Wasm modules, Wasm runtimes need to link the appropriate imported functions and initialize their state. Wizer is a Bytecode Alliance tool that allows you to pre-initialize modules and save a snapshot of them. We could do this in extism_define, and then run the pre-initialized plugin in extism_call. This can make initialization up to 6 times faster!
Optimize modules using binaryen
binaryen is a Wasm optimizer and compiler toolchain that can make Wasm modules smaller and faster by stripping away unnecessary code and optimizing the wasm opcodes.
Conclusion
We’re excited by all of the opportunities Wasm unlocks by allowing us to extend systems in a safe and scalable manner. We were able to write PostgreSQL functions easily in two of our favorite programming languages: JavaScript and Go. If you’re also excited about WebAssembly, we want to talk to you! Please reach out on Twitter, email us at hello@dylibso.com, join our newsletter, or hop into the Discord and talk Extism & plug-ins with us.
The source code for pg_extism is available at https://github.com/dylibso/pg_extism
- Muhammad Azeez, Senior Software Engineer