docs: json datatype rfc #4515

Merged
merged 16 commits into from
Sep 19, 2024
141 changes: 141 additions & 0 deletions docs/rfcs/2024-08-06-json-datatype.md
@@ -0,0 +1,141 @@
---
Feature Name: Json Datatype
Tracking Issue: https://github.com/GreptimeTeam/greptimedb/issues/4230
Date: 2024-8-6
Author: "Yuhan Wang <[email protected]>"
---

# Summary
This RFC proposes a method for storing and querying JSON data in the database.

# Motivation
JSON is widely used across various scenarios. Direct support for writing and querying JSON can significantly enhance the database's flexibility.

# Details

## User Interface
The feature introduces a new data type for the database, similar to the common JSON type. Data is written as JSON strings and can be queried using functions.

For example:
```SQL
CREATE TABLE IF NOT EXISTS test (
ts TIMESTAMP TIME INDEX,
a INT,
b JSON
);

INSERT INTO test VALUES(
0,
0,
'{
"name": "jHl2oDDnPc1i2OzlP5Y",
"timestamp": "2024-07-25T04:33:11.369386Z",
"attributes": { "event_attributes": 48.28667 }
}'
);

SELECT json_get(b, 'name') FROM test;
+---------------------+
| b.name |
+---------------------+
| jHl2oDDnPc1i2OzlP5Y |
+---------------------+

SELECT json_get_by_paths(b, 'attributes', 'event_attributes') + 1 FROM test;
+-------------------------------+
| b.attributes.event_attributes |
+-------------------------------+
| 49.28667 |
+-------------------------------+

```
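The path-based lookup used by `json_get_by_paths` can be pictured as a walk through nested objects, one key per step. The following is a minimal sketch under stated assumptions, not the actual implementation: the `Json` enum, `get_by_paths`, and `example_row` are hypothetical names introduced here for illustration, and the real function operates on the stored column rather than on an in-memory tree.

```rust
use std::collections::BTreeMap;

/// Minimal JSON value model (hypothetical; for illustration only).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Walks `paths` key by key through nested objects.
/// Returns `None` if a key is missing or an intermediate value is not an object.
#[allow(dead_code)]
fn get_by_paths<'a>(value: &'a Json, paths: &[&str]) -> Option<&'a Json> {
    let mut current = value;
    for key in paths {
        match current {
            Json::Object(fields) => current = fields.get(*key)?,
            _ => return None,
        }
    }
    Some(current)
}

/// Builds the document from the INSERT statement in the example above.
#[allow(dead_code)]
fn example_row() -> Json {
    let mut attributes = BTreeMap::new();
    attributes.insert("event_attributes".to_string(), Json::Number(48.28667));

    let mut root = BTreeMap::new();
    root.insert(
        "name".to_string(),
        Json::String("jHl2oDDnPc1i2OzlP5Y".to_string()),
    );
    root.insert(
        "timestamp".to_string(),
        Json::String("2024-07-25T04:33:11.369386Z".to_string()),
    );
    root.insert("attributes".to_string(), Json::Object(attributes));
    Json::Object(root)
}
```

In this sketch, `get_by_paths(&example_row(), &["attributes", "event_attributes"])` reaches the nested number, while a missing key or a path that descends into a non-object value yields `None`.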

## Storage

### Schema Inference
Unlike other types, JSON data does not have a fixed schema. To store it, we introduce dynamic schema inference for each JSON column.

For example:
```JSON
{
"a": "jHl2oDDnPc1i2OzlP5Y",
"b": "2024-07-25T04:33:11.369386Z",
"c": { "d": 48.28648 }
}
```
This will be parsed at runtime and stored as a corresponding `Struct` type in Arrow:
```Rust
Struct(
Field("a", Utf8),
Field("b", Utf8),
Field("c", Struct(Field("d", Float64))),
)
```

Dynamic schema inference can improve compression in some scenarios. See [benchmark](https://github.com/CookiePieWw/json-format-in-parquet-benchmark/) for more information.
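
The inference step can be sketched as a recursive mapping from a JSON value to a type description. This is only an illustration under assumptions: `Json`, `DataType`, and `infer` are hypothetical stand-ins defined here rather than Arrow's or the database's types, every number is mapped to `Float64` for simplicity, and arrays are left out.

```rust
use std::collections::BTreeMap;

/// Minimal JSON value model (hypothetical; arrays omitted for brevity).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Object(BTreeMap<String, Json>),
}

/// Simplified stand-in for an Arrow-like data type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Boolean,
    Float64,
    Utf8,
    /// Named child fields, in key order.
    Struct(Vec<(String, DataType)>),
}

/// Infers a type for `value`; objects become structs whose fields
/// are inferred recursively.
#[allow(dead_code)]
fn infer(value: &Json) -> DataType {
    match value {
        Json::Null => DataType::Null,
        Json::Bool(_) => DataType::Boolean,
        // Assumption: all numbers are treated as Float64 in this sketch.
        Json::Number(_) => DataType::Float64,
        Json::String(_) => DataType::Utf8,
        Json::Object(fields) => DataType::Struct(
            fields
                .iter()
                .map(|(name, child)| (name.clone(), infer(child)))
                .collect(),
        ),
    }
}
```

Applied to the example document above, this sketch yields a struct with `a` and `b` as `Utf8` and `c` as a nested struct containing `d` as `Float64`, matching the shape shown in the RFC.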

### Schema Change
The schema must remain consistent for a column within a table. When inserting data with different schemas, schema changes may occur. There are two types of schema changes:

1. Field Addition

Newly added fields are incorporated into the schema, and previously inserted data treats the new fields as null:
```Rust
Struct(
Field("a", Utf8),
)
+
Struct(
Field("a", Utf8),
Field("e", Int32)
)
=
Struct(
Field("a", Utf8),
Field("e", Int32)
)
```

2. Field Modification

Compatible fields are widened to the wider of the two types, similar to integral promotion in C:
```Rust
Struct(
Field("a", Int16),
)
+
Struct(
Field("a", Int32),
)
=
Struct(
Field("a", Int32),
)
```

Incompatible fields fall back to a binary array that stores the JSONB encoding:
```Rust
Struct(
Field("a", Struct(Field("b", Float64))),
)
+
Struct(
Field("a", Int32),
)
=
Struct(
Field("a", BinaryArray), // JSONB
)
```

Like schema inference, schema changes are performed automatically without manual configuration.
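
The three rules above (field addition, widening of compatible types, and fallback to JSONB) can be combined into one recursive merge. The sketch below is a hedged illustration, not the proposed implementation: `DataType` and `merge` are hypothetical names, the only compatible pair modeled is `Int16`/`Int32` because that is the pair the RFC shows, any other mismatch falls back to `Binary`, and nullability of added fields is not modeled.

```rust
/// Simplified stand-in for an Arrow-like data type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int16,
    Int32,
    Float64,
    Utf8,
    /// Named child fields.
    Struct(Vec<(String, DataType)>),
    /// Fallback: raw JSONB bytes.
    Binary,
}

/// Merges the schema of newly inserted data into the existing schema.
#[allow(dead_code)]
fn merge(old: &DataType, new: &DataType) -> DataType {
    use DataType::*;
    match (old, new) {
        // Identical types need no change.
        (a, b) if a == b => a.clone(),
        // Field modification: widen compatible integer types.
        (Int16, Int32) | (Int32, Int16) => Int32,
        // Structs merge field by field.
        (Struct(old_fields), Struct(new_fields)) => {
            let mut merged = old_fields.clone();
            for (name, new_type) in new_fields {
                match merged.iter_mut().find(|(n, _)| n == name) {
                    // Field exists in both: merge the two types recursively.
                    Some((_, existing)) => {
                        let unified = merge(existing, new_type);
                        *existing = unified;
                    }
                    // Field addition: append the new field.
                    None => merged.push((name.clone(), new_type.clone())),
                }
            }
            Struct(merged)
        }
        // Incompatible types fall back to JSONB stored as binary.
        _ => Binary,
    }
}
```

Fields that exist only in the old schema are kept as they are, which corresponds to reading them as null in newly inserted rows that omit them.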

# Drawbacks

1. This datatype is best suited for data with similar schemas. Varying schemas can lead to frequent schema changes and fallback to JSONB.
2. Schema inference and schema changes add write overhead in exchange for a better compression rate.

# Alternatives

1. JSONB: A widely used binary representation format for JSON.
2. JSONC: A tape representation format for JSON with similar writing and query performance and better compression in some cases. See [discussion](https://github.com/apache/datafusion/issues/7845#issuecomment-2068061465) and [repo](https://github.com/CookiePieWw/jsonc) for more information.