Commit d7b3bfa

docs: add new contributing doc (#341)
* docs: add new contributing doc
* docs: add CONTRIBUTING english doc
* docs: format doc
1 parent 9ba5100 commit d7b3bfa

4 files changed

+229
-21
lines changed

CONTRIBUTING-zh_CN.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
1+
# 贡献指南
2+
3+
[English](./CONTRIBUTING.md) | 简体中文
4+
5+
## 开发
6+
7+
> [!TIP]
8+
> 在开始之前,你需要先确保你本地的 java 环境已经设置好了,否则你将无法从语法文件中生成。你可以运行 `java --version` 来检查。
9+
10+
- **安装依赖**
11+
12+
```bash
13+
pnpm install
14+
```
15+
16+
- **编译 g4 文件**
17+
18+
```bash
19+
# 编译全部 g4 文件
20+
pnpm antlr4
21+
# 指定编译某种语言
22+
pnpm antlr4 --lang mysql
23+
```
24+
25+
- **运行单元测试**
26+
27+
```bash
28+
pnpm test
29+
```
30+
31+
- **运行性能基准测试**
32+
33+
```bash
34+
pnpm benchmark
35+
```
36+
37+
## 源码目录
38+
39+
- `src/grammar`: 存放 g4 文件(语法文件)
40+
- `src/lib`: 从 g4 语法文件生成的产物(通过运行 `pnpm antlr4` 命令生成)
41+
- `src/parser`: SQL 解析器类的实现
42+
- `src/parser/common`: SQL 解析器的基类和工具方法
43+
- `test`: 单元测试
44+
- `benchmark`: 性能基准测试
45+
46+
## 如何添加一种新的 SQL 语言
47+
48+
1. **添加新的语法文件**
49+
50+
将新的 g4 语法文件添加到 `src/grammar/<SQL name>`,语法文件命名采用大驼峰格式,语法文件内的语法规则需要符合以下要求:
51+
52+
- 根规则统一命名为 `program`
53+
- 支持解析多条 SQL 语句;
54+
- 开启[忽略大小写选项](https://github.com/antlr/antlr4/blob/dev/doc/options.md#caseinsensitive)(如果该 SQL 语言不区分大小写);
55+
- 所有关键字的词法规则需以 `KW_` 开头(例如 `KW_SELECT: 'SELECT';`),这有助于在自动补全功能中区分关键字词法规则;
56+
57+
2. **从语法文件生成文件**
58+
59+
运行以下命令从新的语法文件生成相应的文件:
60+
61+
```bash
62+
pnpm antlr4 --lang <SQL name>
63+
```
64+
65+
确认在 `src/lib/<SQL name>/` 目录下生成了相应的 Lexer、Parser、Listener 和 Visitor 文件。
66+
67+
3. **实现 SQL 解析器类**
68+
69+
创建文件 `src/parser/<SQL name>/index.ts` 并实现相应的 SQL 解析器类,该类应继承自 `BasicSQL` 基类,首先实现 `createLexerFromCharStream` 和 `createParserFromTokenStream` 方法,其他方法可以暂时为空。
70+
71+
4. **添加基础单元测试**
72+
73+
在 `test/parser/<SQL name>` 下添加基础单元测试,包括:
74+
75+
- 词法分析器
76+
- 访问者
77+
- 监听器
78+
- `parser.validate` 方法
79+
80+
你可以参考其他 SQL 解析器的单元测试。
81+
82+
5. **SQL 语法单元测试**
83+
84+
在 `test/parser/<SQL name>/syntax` 目录下添加 SQL 语法的单元测试,确保**覆盖所有 SQL 语法规则**,建议根据官方语法文档逐条添加测试,以确保语法文件的准确性。
85+
86+
6. **实现 SQLSplitListener**
87+
88+
实现 `SQLSplitListener` 并在 SQL 解析器类中添加 `splitListener` getter,同时添加 `parser.splitSQLByStatement` 方法的单元测试,用于将 SQL 按语句切分。
89+
90+
7. **自动补全功能**
91+
92+
实现自动补全功能所需的 `processCandidates` 和 `preferredRules` 方法,在开始这一步之前,需要熟悉 [antlr4-c3](https://github.com/mike-lischke/antlr4-c3),然后在 `test/parser/<SQL name>/suggestion` 目录下添加与自动补全相关的单元测试。
93+
94+
8. **上下文信息收集**
95+
96+
实现 `SQLEntityCollector` 类和 `createEntityCollector` 方法,用于收集 SQL 上下文信息,从而增强自动补全功能,详情请参考[这里](https://github.com/DTStack/dt-sql-parser/discussions/250#discussioncomment-8215715),然后在 `test/parser/<SQL name>/contextCollect` 目录下添加实体收集方法的单元测试。
97+
98+
## 语法文件来源
99+
100+
SQL 语法文件通常较为复杂,如果你想在 dt-sql-parser 中添加一种新的 SQL,不建议从头开始编写,可以考虑以下来源,按推荐顺序排列:
101+
102+
1. **SQL 官方仓库**
103+
104+
有些 SQL 官方仓库使用 Antlr4 作为 SQL 解析器,可以在其源码中找到对应的语法文件,例如:
105+
- [TrinoSQL](https://github.com/trinodb/trino/blob/385/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4)
106+
- [SparkSQL](https://github.com/apache/spark/blob/v3.5.0/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4)
107+
108+
来自官方仓库的语法文件通常最为可靠、稳定且性能较好。
109+
110+
2. **grammars-v4 仓库**
111+
112+
这是 Antlr 官方维护的语法文件仓库,包含多种 SQL 语法文件,这里的文件相对可靠。
113+
114+
3. **社区/其他开源仓库**
115+
116+
从社区或其他开源仓库获取的语法文件可能不太可靠,可能需要大量时间来修复语法规则。

CONTRIBUTING.md

Lines changed: 106 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,33 +1,118 @@
1-
# dt-sql-parser
1+
# CONTRIBUTING
22

3-
## Get Start
3+
English | [简体中文](./CONTRIBUTING-zh_CN.md)
44

5-
installing the dependencies after cloned project:
5+
## Development
66

7-
```bash
8-
yarn install
9-
```
7+
> [!TIP]
8+
> Before starting, make sure your local Java environment is set up; otherwise you will not be able to generate the parser files from the grammar files. You can check this by running `java --version`.
109
11-
- test
10+
- **Install dependencies**
1211

13-
```bash
14-
yarn test
15-
```
12+
```bash
13+
pnpm install
14+
```
1615

17-
## Compile the grammar sources
16+
- **Compile g4 Files**
1817

19-
Compile one language:
18+
```bash
19+
# Compile all g4 files
20+
pnpm antlr4
21+
# Compile for a specific language
22+
pnpm antlr4 --lang mysql
23+
```
2024

21-
```bash
22-
yarn antlr4 --lang=mysql
23-
```
25+
- **Run Unit Tests**
2426

25-
Compile all languages:
27+
```bash
28+
pnpm test
29+
```
2630

27-
```bash
28-
yarn antlr4 --all
29-
```
31+
- **Run Benchmark Tests**
3032

31-
## Branch Organization
33+
```bash
34+
pnpm benchmark
35+
```
3236

33-
## Source Code Organization
37+
## Directory Overview
38+
39+
- `src/grammar`: Contains g4 files (grammar files)
40+
- `src/lib`: Generated files from g4 grammar (produced by running `pnpm antlr4`)
41+
- `src/parser`: Implementations of SQL Parser classes
42+
- `src/parser/common`: Base classes and utility methods for SQL Parsers
43+
- `test`: Unit tests
44+
- `benchmark`: Benchmark tests
45+
46+
## How to Add a New SQL Language
47+
48+
1. **Add New Grammar Files**
49+
50+
Add the new g4 grammar file to `src/grammar/<SQL name>`. Name the file in PascalCase. The grammar rules within the file should adhere to the following:
51+
52+
- The root rule should be named `program`.
53+
- Support parsing multiple SQL statements.
54+
- Enable the [`caseInsensitive` option](https://github.com/antlr/antlr4/blob/dev/doc/options.md#caseinsensitive) (if the SQL language is case-insensitive).
55+
- Lexer rules for all keywords should be prefixed with `KW_` (e.g., `KW_SELECT: 'SELECT';`). This makes it easy to distinguish keyword lexer rules in the autocomplete feature.
56+
57+
2. **Generate Files from Grammar**
58+
59+
Run the following command to generate files from the new grammar:
60+
61+
```bash
62+
pnpm antlr4 --lang <SQL name>
63+
```
64+
65+
Check that the corresponding Lexer, Parser, Listener, and Visitor files are generated in the `src/lib/<SQL name>/` directory.
66+
67+
3. **Implement SQL Parser Class**
68+
69+
Create the file `src/parser/<SQL name>/index.ts` and implement the corresponding SQL Parser class. This class should extend the `BasicSQL` base class. Start by implementing the `createLexerFromCharStream` and `createParserFromTokenStream` methods; other methods can be left empty for now.
70+
71+
4. **Add Basic Unit Tests**
72+
73+
Add basic unit tests in `test/parser/<SQL name>` for:
74+
75+
- Lexer
76+
- Visitor
77+
- Listener
78+
- `parser.validate` method
79+
80+
You can reference tests from other SQL parsers.
81+
82+
5. **SQL Syntax Unit Tests**
83+
84+
Add unit tests for SQL syntax in the `test/parser/<SQL name>/syntax` directory. Ensure coverage of **all** SQL syntax rules. It is recommended to add tests item by item, following the official syntax documentation, to verify the accuracy of the grammar file.
85+
86+
6. **Implement SQLSplitListener**
87+
88+
Implement the `SQLSplitListener` and add the `splitListener` getter in the SQL Parser class. Also, add unit tests for the `parser.splitSQLByStatement` method, which splits SQL into individual statements.
89+
90+
7. **Autocomplete Features**
91+
92+
Implement the `processCandidates` and `preferredRules` methods required for autocomplete. Before starting this step, familiarize yourself with [antlr4-c3](https://github.com/mike-lischke/antlr4-c3). Then add autocomplete-related unit tests in `test/parser/<SQL name>/suggestion`.
93+
94+
8. **Context Information Collection**
95+
96+
Implement the `SQLEntityCollector` class and the `createEntityCollector` method in the SQL Parser class to collect SQL context information, which enhances the autocomplete functionality. For more details, see [this discussion](https://github.com/DTStack/dt-sql-parser/discussions/250#discussioncomment-8215715).
97+
98+
Then, add tests for entity collection methods in `test/parser/<SQL name>/contextCollect`.
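
Step 1 asks for a `KW_` prefix on keyword lexer rules so that autocomplete code can tell keywords apart from other tokens. The following is a minimal TypeScript sketch of that idea only. The `symbolicNames` array and the `extractKeywords` helper are hypothetical names chosen for illustration; they are not part of dt-sql-parser's API and do not show how the library actually processes candidates.

```typescript
// Hypothetical list of lexer rule names, similar in shape to what a
// generated lexer might expose. The values are made up for this sketch.
const symbolicNames: (string | null)[] = [
    null,
    'KW_SELECT',
    'KW_FROM',
    'IDENTIFIER',
    'STRING_LITERAL',
    'KW_WHERE',
];

const KEYWORD_PREFIX = 'KW_';

// Keep only the rule names that carry the KW_ prefix and strip the prefix,
// leaving the keyword text that could be offered as a suggestion.
function extractKeywords(names: (string | null)[]): string[] {
    return names
        .filter(
            (name): name is string =>
                name !== null && name.startsWith(KEYWORD_PREFIX)
        )
        .map((name) => name.slice(KEYWORD_PREFIX.length));
}

const keywords = extractKeywords(symbolicNames);
console.log(keywords);
```

Without a shared prefix, this kind of filter would need a hand-maintained list of keyword rule names for every SQL language.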
99+
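
The statement splitting in step 6 can be pictured as slicing the input by the character range of each statement, where the ranges are recorded by a listener while it walks the parse tree. The sketch below assumes those ranges have already been collected. `StatementRange`, `sliceStatements`, and the hard-coded indices are illustrative assumptions and do not reproduce the real `SQLSplitListener` or `splitSQLByStatement` implementation.

```typescript
// Hypothetical shape of one recorded statement: inclusive character offsets
// of the first and last character of the statement in the original input.
interface StatementRange {
    startIndex: number;
    endIndex: number;
}

// Cut the original SQL text into one string per recorded statement range.
function sliceStatements(sql: string, ranges: StatementRange[]): string[] {
    return ranges.map(({ startIndex, endIndex }) =>
        // endIndex is inclusive, while String.prototype.slice excludes its end.
        sql.slice(startIndex, endIndex + 1)
    );
}

const sql = 'SELECT 1; SELECT 2;';

// In a real parser these ranges would come from the parse tree; here they
// are written by hand for the two statements above.
const ranges: StatementRange[] = [
    { startIndex: 0, endIndex: 8 },
    { startIndex: 10, endIndex: 18 },
];

const statements = sliceStatements(sql, ranges);
console.log(statements);
```

A unit test for the split method can follow the same pattern: feed in several statements and assert on the number of pieces and the text of each one.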
100+
## Sources for Grammar Files
101+
102+
SQL grammar files can be quite complex. If you want to add a new SQL language to dt-sql-parser, it is not recommended to start from scratch. Consider the following sources, listed in order of preference:
103+
104+
1. **Official SQL Repositories**:
105+
106+
Some official SQL repositories use ANTLR4 for SQL parsing. You can find the corresponding grammar files in their source code. For example:
107+
- [TrinoSQL](https://github.com/trinodb/trino/blob/385/core/trino-parser/src/main/antlr4/io/trino/sql/parser/SqlBase.g4)
108+
- [SparkSQL](https://github.com/apache/spark/blob/v3.5.0/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4)
109+
110+
Grammar files from official repositories are generally the most reliable, stable, and performant.
111+
112+
2. **grammars-v4 Repository**:
113+
114+
This is the official grammar repository maintained by the ANTLR project. It includes a variety of SQL grammar files, which are typically reliable.
115+
116+
3. **Community/Other Open Source Repositories**:
117+
118+
Grammar files obtained from the community or other open source repositories may be less reliable and often require significant time to fix grammar issues.

README-zh_CN.md

Lines changed: 3 additions & 0 deletions
@@ -433,6 +433,9 @@ dt-sql-parser 的自动补全功能依赖于 [antlr4-c3](https://github.com/mike
433433

434434
<br/>
435435

436+
## 贡献指南
437+
438+
[CONTRIBUTING-zh_CN](./CONTRIBUTING-zh_CN.md)
436439

437440
## 许可证
438441

README.md

Lines changed: 4 additions & 0 deletions
@@ -436,6 +436,10 @@ For the editor, this strategy is also more intuitive. After the user enters `SHO
436436
437437
<br/>
438438
439+
## Contributing
440+
441+
Refer to [CONTRIBUTING](./CONTRIBUTING.md)
442+
439443
## License
440444
441445
[MIT](./LICENSE)
