docs: update rfc according to impl
CookiePieWw committed Sep 17, 2024
1 parent df03e91 commit 9b2a311
Showing 1 changed file with 66 additions and 50 deletions.
116 changes: 66 additions & 50 deletions docs/rfcs/2024-08-06-json-datatype.md
@@ -13,10 +13,21 @@ JSON is widely used across various scenarios. Direct support for writing and que

# Details

## User Interface
The feature introduces a new data type, `JSON`, for the database. Similar to the common JSON type, data is written as JSON strings and can be queried using functions.
## Storage and Query

For example:
The type system of GreptimeDB is based on the types of arrow/datafusion: each type has a corresponding physical type from arrow/datafusion. The JSON type is therefore built on top of the `Binary` type and reuses its existing `Value` and `Vector` implementations. Inside the storage layer and the query engine, the JSON type behaves the same as the Binary type.

This raises 2 problems: the insertion interface and the query interface.

## Insertion

Users commonly write JSON data as strings, so we need to convert between string and binary data. There are 2 ways to do this:

1. MySQL and PostgreSQL servers provide auto-conversion between string and JSON data. When a string is inserted into a JSON column, the server tries to parse the string as JSON and converts it to binary data of JSON type. Non-JSON strings are rejected.

2. A function `parse_json` is provided to convert a string to JSON data. The function returns binary data of JSON type. If the string is not valid JSON, the function returns an error.
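
Both ways share one conversion step: validate the string, then emit binary data. A minimal sketch of that step follows, under stated assumptions: the name `parse_json_sketch` and the error type are hypothetical, and compact UTF-8 JSON stands in for the real binary encoding, which this sketch does not reproduce.

```python
import json


class InvalidJsonError(ValueError):
    """Raised when the input string is not valid JSON (hypothetical error type)."""


def _reject_constant(name):
    # Python's json module accepts NaN/Infinity by default; standard JSON does not.
    raise ValueError(f"non-standard JSON constant: {name}")


def parse_json_sketch(text: str) -> bytes:
    """Validate `text` as JSON and return a binary representation of it.

    Assumption: compact UTF-8 JSON is used here only as a placeholder for
    the binary format that the database actually stores.
    """
    try:
        value = json.loads(text, parse_constant=_reject_constant)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        raise InvalidJsonError(f"invalid JSON: {err}") from err
    # Re-serialize so that equivalent inputs share one canonical byte form.
    return json.dumps(value, separators=(",", ":")).encode("utf-8")
```

In this sketch the server-side auto-conversion and the explicit function would call the same routine; only the place where it runs differs.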

For example, in MySQL client:
```SQL
CREATE TABLE IF NOT EXISTS test (
ts TIMESTAMP TIME INDEX,
@@ -34,70 +45,75 @@ INSERT INTO test VALUES(
}'
);

SELECT json_get(b, 'name') FROM test;
+---------------------+
| b.name |
+---------------------+
| jHl2oDDnPc1i2OzlP5Y |
+---------------------+

SELECT CAST(json_get_by_paths(b, 'attributes', 'event_attributes') AS DOUBLE) + 1 FROM test;
+-------------------------------+
| b.attributes.event_attributes |
+-------------------------------+
| 49.28667 |
+-------------------------------+

INSERT INTO test VALUES(
0,
0,
parse_json('{
"name": "jHl2oDDnPc1i2OzlP5Y",
"timestamp": "2024-07-25T04:33:11.369386Z",
"attributes": { "event_attributes": 48.28667 }
}')
);
```
Both statements are valid.

## Storage and Query
For the former, the conversion is done by the server; for the latter, it is done by the query engine.

Data of `JSON` type is stored as JSONB format in the database. For storage layer and query engine, data is represented as a binary array and can be queried through pre-defined JSON functions. For clients, data is shown as strings.
## Query Interface

Insertions of `JSON` go through the following steps:
Correspondingly, users prefer to see JSON data displayed as strings, so we need to convert binary data back to string data. There are also 2 ways to do this: auto-conversion on MySQL and PostgreSQL servers, and the function `json_to_string`.

1. The client gets JSON strings and sends them to the frontend.
2. The frontend encodes the JSON strings to binary data in JSONB format and sends it to the datanode.
3. The datanode stores the binary data in the database.
For example, in MySQL client:
```SQL
SELECT b FROM test;

SELECT json_to_string(b) FROM test;
```
```
Insertion:
                          Encode               Store
         JSON Strings ┌────────────┐ JSONB ┌────────────┐ JSONB
 client ------------->│  Frontend  │------>│  Datanode  │------> Storage
                      └────────────┘       └────────────┘
```
Both queries return the JSON string.

Specifically, we mark binary data of JSON type with an entry in the `metadata` of its `Field` in the arrow/datafusion schema. Frontend servers can then identify the type of the binary data and convert it to string data if necessary. For functions with a JSON return type, however, the metadata method is not applicable. JSON functions should therefore specify their return type explicitly, such as `json_get_int` and `json_get_float`, which return `INT` and `FLOAT` respectively.
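
The metadata idea can be illustrated with a small model. This is a sketch, not the arrow/datafusion API: the `Field` class, the metadata key `is_json`, and `render_cell` are hypothetical stand-ins, and UTF-8 JSON again stands in for the stored binary format.

```python
from dataclasses import dataclass, field

JSON_MARKER_KEY = "is_json"  # hypothetical metadata key, chosen for illustration


@dataclass
class Field:
    """Simplified stand-in for a schema field: name, physical type, metadata."""
    name: str
    data_type: str
    metadata: dict = field(default_factory=dict)


def render_cell(column: Field, raw: bytes):
    """Return what a frontend would show for one cell of `column`.

    A Binary column carrying the JSON marker is decoded to a string;
    any other column is passed through unchanged.
    """
    is_json = (
        column.data_type == "Binary"
        and column.metadata.get(JSON_MARKER_KEY) == "true"
    )
    return raw.decode("utf-8") if is_json else raw


json_column = Field("b", "Binary", {JSON_MARKER_KEY: "true"})
plain_binary = Field("blob", "Binary")
```

A column produced by a function has no such marker on its output field, so in this model it would be shown as raw bytes. That is the reason the RFC has JSON functions return concrete types instead.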

The data of `JSON` type is represented by `Binary` data type in arrow. There are 2 types of JSON queries: get JSON elements through keys and compute over JSON elements.
## Functions
Similar to the common JSON type, data is written as JSON strings and can be queried with functions.

For the former, the query engine performs queries directly over binary data. We provide functions like `json_get` and `json_get_by_paths` to extract JSON elements through keys.
For example:
```SQL
CREATE TABLE IF NOT EXISTS test (
ts TIMESTAMP TIME INDEX,
a INT,
b JSON
);

For the latter, users need to manually specify the data type of the JSON elements for computing. Users can use `CAST` to convert the JSON elements to the specified data type. Computation without explicit conversion will result in an error.
INSERT INTO test VALUES(
0,
0,
'{
"name": "jHl2oDDnPc1i2OzlP5Y",
"timestamp": "2024-07-25T04:33:11.369386Z",
"attributes": { "event_attributes": 48.28667 }
}'
);

Queries of `JSON` go through the following steps:
SELECT json_get_string(b, 'name') FROM test;
+---------------------+
| b.name |
+---------------------+
| jHl2oDDnPc1i2OzlP5Y |
+---------------------+

1. The client sends the query to the frontend, and the frontend sends it to datafusion, which is the query engine of GreptimeDB.
2. Datafusion performs the query over binary data in JSONB format and returns binary data to the frontend.
3. If no computation is needed, the frontend directly decodes the binary data to JSON strings and returns them to clients.
4. If computation is needed, the binary data is decoded and converted to the specified data type to perform the computation. There's no need for further decoding in the frontend.
SELECT json_get_float(b, 'attributes.event_attributes') FROM test;
+--------------------------------+
| b.attributes.event_attributes |
+--------------------------------+
| 48.28667 |
+--------------------------------+

```
```
Queries without computation, decoding in frontend:
                          Decode                Query
         JSON Strings ┌────────────┐ JSONB ┌──────────────┐ JSONB
 client <-------------│  Frontend  │<------│  Datafusion  │<------ Storage
                      └────────────┘       └──────────────┘
Queries with computation, decoding in datafusion:
                                                                           Query
         Data of Specified Type ┌────────────┐ Data of Specified Type ┌──────────────┐ JSONB
 client <-----------------------│  Frontend  │<-----------------------│  Datafusion  │<------ Storage
                                └────────────┘                        └──────────────┘
```
And more functions can be added in the future.
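
To make the typed getters concrete, the sketch below models how a function with an explicit return type might extract a value. It rests on assumptions: the `_sketch` names are hypothetical, paths are treated as dot-separated object keys as in the example above, a missing path or a mismatched type yields `None`, and UTF-8 JSON stands in for the stored binary format.

```python
import json


def _lookup(document, path: str):
    """Walk dot-separated object keys; return None if any step is missing."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def json_get_int_sketch(data: bytes, path: str):
    """Return the integer at `path`, or None if absent or not an integer."""
    value = _lookup(json.loads(data), path)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return None
    return value


def json_get_float_sketch(data: bytes, path: str):
    """Return the number at `path` as a float, or None if absent or non-numeric."""
    value = _lookup(json.loads(data), path)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)
```

Because each getter fixes its return type up front, the result can take part in arithmetic directly and the frontend needs no JSON-specific decoding for it.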

# Drawbacks

As a general purpose data type, JSONB may not be as efficient as specialized data types for specific scenarios.
As a general purpose JSON data type, JSONB may not be as efficient as specialized data types for specific scenarios.

# Alternatives
