From fb0853bc139e18eb6a95a05600c640fcc21e7e72 Mon Sep 17 00:00:00 2001 From: Aleksei Zinovev Date: Thu, 5 Dec 2024 10:40:06 +0100 Subject: [PATCH] Add guide for custom SQL database support with HSQLDB (#986) * Add guide for custom SQL database support with HSQLDB This commit introduces documentation detailing the process of extending the Kotlin DataFrame library to support custom SQL databases, using HSQLDB as an example. The guide includes prerequisites, implementation of a custom database type, and example code for managing database tables and schemas. Additionally, updates have been made to reflect the possibility of registering custom SQL databases in existing files. * Add Gradle instructions to custom SQL database guide --- docs/StardustDocs/d.tree | 1 + docs/StardustDocs/topics/readSqlDatabases.md | 84 ++++----- .../topics/readSqlFromCustomDatabase.md | 169 ++++++++++++++++++ 3 files changed, 200 insertions(+), 54 deletions(-) create mode 100644 docs/StardustDocs/topics/readSqlFromCustomDatabase.md diff --git a/docs/StardustDocs/d.tree b/docs/StardustDocs/d.tree index 0beebd495a..caa426de53 100644 --- a/docs/StardustDocs/d.tree +++ b/docs/StardustDocs/d.tree @@ -46,6 +46,7 @@ + diff --git a/docs/StardustDocs/topics/readSqlDatabases.md b/docs/StardustDocs/topics/readSqlDatabases.md index 134c62ffe6..181d2fb914 100644 --- a/docs/StardustDocs/topics/readSqlDatabases.md +++ b/docs/StardustDocs/topics/readSqlDatabases.md @@ -32,10 +32,11 @@ Also, there are a few **extension functions** available on `Connection`, **NOTE:** This is an experimental module, and for now, we only support four databases: MS SQL, MariaDB, MySQL, PostgreSQL, and SQLite. +Moreover, since release 0.15 we support the possibility to register custom SQL database, read more in our [guide](readSqlFromCustomDatabase.md). + Additionally, support for JSON and date-time types is limited. Please take this into consideration when using these functions. 
- ## Getting started with reading from SQL database in Gradle Project In the first, you need to add a dependency @@ -70,7 +71,7 @@ implementation("com.mysql:mysql-connector-j:$version") Maven Central version could be found [here](https://mvnrepository.com/artifact/com.mysql/mysql-connector-j). -For SQLite: +For **SQLite**: ```kotlin implementation("org.xerial:sqlite-jdbc:$version") @@ -78,7 +79,7 @@ implementation("org.xerial:sqlite-jdbc:$version") Maven Central version could be found [here](https://mvnrepository.com/artifact/org.xerial/sqlite-jdbc). -For MS SQL: +For **MS SQL**: ```kotlin implementation("com.microsoft.sqlserver:mssql-jdbc:$version") @@ -158,7 +159,7 @@ otherwise, it will be considered non-nullable for the newly created `DataFrame` These functions read all data from a specific table in the database. Variants with a limit parameter restrict how many rows will be read from the table. -**readSqlTable(dbConfig: DbConnectionConfig, tableName: String, limit: Int, inferNullability: Boolean): AnyFrame** +**readSqlTable(dbConfig: DbConnectionConfig, tableName: String, limit: Int, inferNullability: Boolean, dbType: DbType?): AnyFrame** Read all data from a specific table in the SQL database and transform it into an `AnyFrame` object. @@ -166,6 +167,9 @@ The `dbConfig: DbConnectionConfig` parameter represents the configuration for a created under the hood and managed by the library. Typically, it requires a URL, username, and password. +The `dbType` parameter is the type of database, could be a custom object, provided by user, optional, default is `null`, +to know more, read the [guide](readSqlFromCustomDatabase.md). 
+ ```kotlin import org.jetbrains.kotlinx.dataframe.io.DbConnectionConfig @@ -180,7 +184,7 @@ The `limit: Int` parameter allows setting the maximum number of records to be re val users = DataFrame.readSqlTable(dbConfig, "Users", limit = 100) ``` -**readSqlTable(connection: Connection, tableName: String, limit: Int, inferNullability: Boolean): AnyFrame** +**readSqlTable(connection: Connection, tableName: String, limit: Int, inferNullability: Boolean, dbType: DbType?): AnyFrame** Another variant, where instead of `dbConfig: DbConnectionConfig` we use a JDBC connection: `Connection` object. @@ -210,7 +214,7 @@ val users = connection.readDataFrame("Users", 100) connection.close() ``` -**Connection.readDataFrame(sqlQueryOrTableName: String, limit: Int, inferNullability: Boolean): AnyFrame** +**Connection.readDataFrame(sqlQueryOrTableName: String, limit: Int, inferNullability: Boolean, dbType: DbType?): AnyFrame** Read all data from a specific table in the SQL database and transform it into an `AnyFrame` object. @@ -222,7 +226,7 @@ It should not contain `;` symbol. All other parameters are described above. -**DbConnectionConfig.readDataFrame(sqlQueryOrTableName: String, limit: Int, inferNullability: Boolean): AnyFrame** +**DbConnectionConfig.readDataFrame(sqlQueryOrTableName: String, limit: Int, inferNullability: Boolean, dbType: DbType?): AnyFrame** If you do not have a connection object or need to run a quick, isolated experiment reading data from an SQL database, @@ -233,7 +237,7 @@ you can delegate the creation of the connection to `DbConnectionConfig`. These functions execute an SQL query on the database and convert the result into a `DataFrame` object. If a limit is provided, only that many rows will be returned from the result. 
-**readSqlQuery(dbConfig: DbConnectionConfig, sqlQuery: String, limit: Int, inferNullability: Boolean): AnyFrame** +**readSqlQuery(dbConfig: DbConnectionConfig, sqlQuery: String, limit: Int, inferNullability: Boolean, dbType: DbType?): AnyFrame** Execute a specific SQL query on the SQL database and retrieve the resulting data as an AnyFrame. @@ -249,7 +253,7 @@ val dbConfig = DbConnectionConfig("URL_TO_CONNECT_DATABASE", "USERNAME", "PASSWO val df = DataFrame.readSqlQuery(dbConfig, "SELECT * FROM Users WHERE age > 35") ``` -**readSqlQuery(connection: Connection, sqlQuery: String, limit: Int, inferNullability: Boolean): AnyFrame** +**readSqlQuery(connection: Connection, sqlQuery: String, limit: Int, inferNullability: Boolean, dbType: DbType?): AnyFrame** Another variant, where instead of `dbConfig: DbConnectionConfig` we use a JDBC connection: `Connection` object. @@ -301,6 +305,8 @@ The `dbType: DbType` parameter specifies the type of our database (e.g., Postgre supported by a library. Currently, the following classes are available: `H2, MsSql, MariaDb, MySql, PostgreSql, Sqlite`. +Also, users have an ability to pass objects, describing their custom databases, more information in [guide](readSqlFromCustomDatabase.md). + ```kotlin import org.jetbrains.kotlinx.dataframe.io.db.PostgreSql import java.sql.ResultSet @@ -308,9 +314,9 @@ import java.sql.ResultSet val df = DataFrame.readResultSet(resultSet, PostgreSql) ``` -**readResultSet(resultSet: ResultSet, connection: Connection, limit: Int, inferNullability: Boolean): AnyFrame** +**readResultSet(resultSet: ResultSet, connection: Connection, limit: Int, inferNullability: Boolean, dbType: DbType?): AnyFrame** -Another variant, where instead of `dbType: DbType` we use a JDBC connection: `Connection` object. +Another variant, we use a JDBC connection: `Connection` object. 
```kotlin import java.sql.Connection @@ -340,7 +346,7 @@ val df = rs.readDataFrame(connection, 10) connection.close() ``` -**ResultSet.readDataFrame(connection: Connection, limit: Int, inferNullability: Boolean): AnyFrame** +**ResultSet.readDataFrame(connection: Connection, limit: Int, inferNullability: Boolean, dbType: DbType?): AnyFrame** Reads the data from a `ResultSet` and converts it into a `DataFrame`. @@ -352,7 +358,7 @@ that the `ResultSet` belongs to. These functions read all data from all tables in the connected database. Variants with a limit parameter restrict how many rows will be read from each table. -**readAllSqlTables(dbConfig: DbConnectionConfig, limit: Int, inferNullability: Boolean): Map\<String, AnyFrame>** +**readAllSqlTables(dbConfig: DbConnectionConfig, limit: Int, inferNullability: Boolean, dbType: DbType?): Map\<String, AnyFrame>** Retrieves data from all the non-system tables in the SQL database and returns them as a map of table names to `AnyFrame` objects. @@ -368,7 +374,7 @@ val dbConfig = DbConnectionConfig("URL_TO_CONNECT_DATABASE", "USERNAME", "PASSWO val dataframes = DataFrame.readAllSqlTables(dbConfig) ``` -**readAllSqlTables(connection: Connection, limit: Int, inferNullability: Boolean): Map\<String, AnyFrame>** +**readAllSqlTables(connection: Connection, limit: Int, inferNullability: Boolean, dbType: DbType?): Map\<String, AnyFrame>** Another variant, where instead of `dbConfig: DbConnectionConfig` we use a JDBC connection: `Connection` object. @@ -389,7 +395,7 @@ The purpose of these functions is to facilitate the retrieval of table schema. By providing a table name and either a database configuration or connection, these functions return the [DataFrameSchema](schema.md) of the specified table. -**getSchemaForSqlTable(dbConfig: DbConnectionConfig, tableName: String): DataFrameSchema** +**getSchemaForSqlTable(dbConfig: DbConnectionConfig, tableName: String, dbType: DbType?): DataFrameSchema** This function captures the schema of a specific table from an SQL database. 
@@ -405,7 +411,7 @@ val dbConfig = DbConnectionConfig("URL_TO_CONNECT_DATABASE", "USERNAME", "PASSWO val schema = DataFrame.getSchemaForSqlTable(dbConfig, "Users") ``` -**getSchemaForSqlTable(connection: Connection, tableName: String): DataFrameSchema** +**getSchemaForSqlTable(connection: Connection, tableName: String, dbType: DbType?): DataFrameSchema** Another variant, where instead of `dbConfig: DbConnectionConfig` we use a JDBC connection: `Connection` object. @@ -427,7 +433,7 @@ These functions return the schema of an SQL query result. Once you provide a database configuration or connection and an SQL query, they return the [DataFrameSchema](schema.md) of the query result. -**getSchemaForSqlQuery(dbConfig: DbConnectionConfig, sqlQuery: String): DataFrameSchema** +**getSchemaForSqlQuery(dbConfig: DbConnectionConfig, sqlQuery: String, dbType: DbType?): DataFrameSchema** This function executes an SQL query on the database and then retrieves the resulting schema. @@ -443,7 +449,7 @@ val dbConfig = DbConnectionConfig("URL_TO_CONNECT_DATABASE", "USERNAME", "PASSWO val schema = DataFrame.getSchemaForSqlQuery(dbConfig, "SELECT * FROM Users WHERE age > 35") ``` -**getSchemaForSqlQuery(connection: Connection, sqlQuery: String): DataFrameSchema** +**getSchemaForSqlQuery(connection: Connection, sqlQuery: String, dbType: DbType?): DataFrameSchema** Another variant, where instead of `dbConfig: DbConnectionConfig` we use a JDBC connection: `Connection` object. @@ -472,11 +478,11 @@ val schema = connection.getDataFrameSchema("SELECT * FROM Users WHERE age > 35") connection.close() ``` -**Connection.getDataFrameSchema(sqlQueryOrTableName: String): DataFrameSchema** +**Connection.getDataFrameSchema(sqlQueryOrTableName: String, dbType: DbType?): DataFrameSchema** Retrieves the schema of an SQL query result or an SQL table using the provided database configuration. 
-**DbConnectionConfig.getDataFrameSchema(sqlQueryOrTableName: String): DataFrameSchema** +**DbConnectionConfig.getDataFrameSchema(sqlQueryOrTableName: String, dbType: DbType?): DataFrameSchema** Retrieves the schema of an SQL query result or an SQL table using the provided database configuration. @@ -507,6 +513,8 @@ The `dbType: DbType` parameter specifies the type of our database (e.g., Postgre supported by a library. Currently, the following classes are available: `H2, MariaDb, MySql, PostgreSql, Sqlite`. +Also, users have an ability to pass objects, describing their custom databases, more information in [guide](readSqlFromCustomDatabase.md). + ```kotlin import org.jetbrains.kotlinx.dataframe.io.db.PostgreSql import java.sql.ResultSet @@ -514,42 +522,10 @@ import java.sql.ResultSet val schema = DataFrame.getSchemaForResultSet(resultSet, PostgreSql) ``` -**getSchemaForResultSet(connection: Connection, sqlQuery: String): DataFrameSchema** - -Another variant, where instead of `dbType: DbType` we use a JDBC connection: `Connection` object. 
- -```kotlin -import java.sql.Connection -import java.sql.DriverManager - -val connection = DriverManager.getConnection("URL_TO_CONNECT_DATABASE") - -val schema = DataFrame.getSchemaForResultSet(resultSet, connection) - -connection.close() -``` - ### Extension functions for schema reading from the ResultSet The same example, rewritten with the extension function: -```kotlin -import java.sql.Connection -import java.sql.DriverManager - -val connection = DriverManager.getConnection("URL_TO_CONNECT_DATABASE") - -val schema = resultSet.getDataFrameSchema(connection) - -connection.close() -``` - -if you are using this extension function - -**ResultSet.getDataFrameSchema(connection: Connection): DataFrameSchema** - -or - ```kotlin import org.jetbrains.kotlinx.dataframe.io.db.PostgreSql import java.sql.ResultSet @@ -566,7 +542,7 @@ based on These functions return a list of all [`DataFrameSchema`](schema.md) from all the non-system tables in the SQL database. They can be called with either a database configuration or a connection. -**getSchemaForAllSqlTables(dbConfig: DbConnectionConfig): Map\<String, DataFrameSchema>** +**getSchemaForAllSqlTables(dbConfig: DbConnectionConfig, dbType: DbType?): Map\<String, DataFrameSchema>** This function retrieves the schema of all tables from an SQL database and returns them as a map of table names to [`DataFrameSchema`](schema.md) objects. @@ -583,7 +559,7 @@ val dbConfig = DbConnectionConfig("URL_TO_CONNECT_DATABASE", "USERNAME", "PASSWO val schemas = DataFrame.getSchemaForAllSqlTables(dbConfig) ``` -**getSchemaForAllSqlTables(connection: Connection): Map\<String, DataFrameSchema>** +**getSchemaForAllSqlTables(connection: Connection, dbType: DbType?): Map\<String, DataFrameSchema>** This function retrieves the schema of all tables using a JDBC connection: `Connection` object and returns them as a list of [`DataFrameSchema`](schema.md). 
diff --git a/docs/StardustDocs/topics/readSqlFromCustomDatabase.md b/docs/StardustDocs/topics/readSqlFromCustomDatabase.md
new file mode 100644
index 0000000000..be70f88a31
--- /dev/null
+++ b/docs/StardustDocs/topics/readSqlFromCustomDatabase.md
@@ -0,0 +1,169 @@
+[//]: # (title: How to Extend DataFrame Library for Custom SQL Database Support: Example with HSQLDB)
+
+# How to Extend DataFrame Library for Custom SQL Database Support: Example with HSQLDB
+
+This guide demonstrates how advanced users can extend the Kotlin DataFrame library to support a custom SQL database,
+using HSQLDB as an example. By following these steps,
+you will be able to integrate your custom database into the DataFrame library,
+allowing for seamless DataFrame creation, manipulation, and querying.
+
+This guide is intended for Gradle projects,
+but the experience will be similar in Kotlin Notebooks,
+as demonstrated in this [Kotlin DataFrame SQL Example](https://github.com/zaleslaw/KotlinDataFrame-SQL-Examples/blob/master/notebooks/customdb.ipynb).
+
+---
+
+## Prerequisites
+
+1. **Create a Gradle Project**:
+
+Add the following dependencies and dataframe plugin to your `build.gradle.kts`:
+
+```kotlin
+plugins {
+    id("org.jetbrains.kotlinx.dataframe") version "$dataframe_version"
+}
+
+dependencies {
+    implementation("org.jetbrains.kotlinx:dataframe:$dataframe_version")
+    implementation("org.hsqldb:hsqldb:$version")
+}
+```
+
+2. **Install HSQLDB**:
+
+Follow the [HSQLDB Quick Guide](https://www.tutorialspoint.com/hsqldb/hsqldb_quick_guide.htm) to set up HSQLDB locally.
+
+3. **Start the HSQLDB Server**:
+
+Launch a terminal or command prompt and execute the following command:
+
+```bash
+java -classpath lib/hsqldb.jar org.hsqldb.server.Server --database.0 file:hsqldb/demodb --dbname.0 testdb
+```
+
+## Implementing Custom Database Type Support
+
+To enable HSQLDB integration, implement a custom `DbType` by overriding required methods.
+
+
+**Create the HSQLDB Type**
+
+```kotlin
+/**
+ * Represents the HSQLDB database type.
+ *
+ * This class provides methods to convert data from a ResultSet to the appropriate type for HSQLDB,
+ * and to generate the corresponding column schema.
+ */
+public object HSQLDB : DbType("hsqldb") {
+    override val driverClassName: String
+        get() = "org.hsqldb.jdbcDriver"
+
+    override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? {
+        return null
+    }
+
+    override fun isSystemTable(tableMetadata: TableMetadata): Boolean {
+        val locale = Locale.getDefault()
+        fun String?.containsWithLowercase(substr: String) = this?.lowercase(locale)?.contains(substr) == true
+        val schemaName = tableMetadata.schemaName
+        val name = tableMetadata.name
+        return schemaName.containsWithLowercase("information_schema") ||
+            schemaName.containsWithLowercase("system") ||
+            name.containsWithLowercase("system_")
+    }
+
+    override fun buildTableMetadata(tables: ResultSet): TableMetadata =
+        TableMetadata(
+            tables.getString("TABLE_NAME"),
+            tables.getString("TABLE_SCHEM"),
+            tables.getString("TABLE_CAT"),
+        )
+
+    override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? {
+        return null
+    }
+}
+```
+
+**Defining Helper Functions**
+
+Define utility functions to manage database connections and tables.
+For example purposes, we create a small function that can populate the table with a schema and some sample data.
+
+```kotlin
+const val URL = "jdbc:hsqldb:hsql://localhost/testdb"
+const val USER_NAME = "SA"
+const val PASSWORD = ""
+
+
+fun removeTable(con: Connection): Int {
+    val stmt = con.createStatement()
+    return stmt.executeUpdate("""DROP TABLE orders""")
+}
+
+fun createAndPopulateTable(con: Connection) {
+    val stmt = con.createStatement()
+    stmt.executeUpdate(
+        """CREATE TABLE IF NOT EXISTS orders (
+            id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
+            item VARCHAR(50) NOT NULL,
+            price DOUBLE NOT NULL,
+            order_date DATE
+        );
+        """.trimIndent()
+    )
+
+    stmt.executeUpdate(
+        """INSERT INTO orders (item, price, order_date)
+        VALUES ('Laptop', 1500.00, NOW())""".trimIndent()
+    )
+
+    stmt.executeUpdate(
+        """INSERT INTO orders (item, price, order_date)
+        VALUES ('Smartphone', 700.00, NOW())""".trimIndent()
+    )
+}
+```
+
+**Define the Table Schema**
+
+Use the `@DataSchema` annotation to define a [**custom data schema**](schemasCustom.md) for the `orders` table.
+
+```kotlin
+@DataSchema
+interface Orders {
+    val id: Int
+    val item: String
+    val price: Double
+    val orderDate: java.util.Date
+}
+```
+
+**End-to-End Example**
+
+Finally, use the following code to create, populate, read, and delete the table in HSQLDB.
+
+```kotlin
+fun main() {
+    DriverManager.getConnection(URL, USER_NAME, PASSWORD).use { con ->
+        createAndPopulateTable(con)
+
+        val df = con
+            .readDataFrame("SELECT * FROM orders", dbType = HSQLDB)
+            .rename { all() }.into { it.name.lowercase(Locale.getDefault()).toCamelCaseByDelimiters(DELIMITERS_REGEX) }
+            .cast<Orders>(verify = true)
+
+        df.filter { it.price > 800 }.print()
+
+        removeTable(con)
+    }
+}
+```
+
+Running the `main` function above will output filtered rows from the `orders` table where `price > 800`.
+
+It will also demonstrate how to define and use custom SQL database extensions in the DataFrame library.
+
+Find a full example project [here](https://github.com/zaleslaw/KotlinDataFrame-SQL-Examples/tree/master/src/main/kotlin/customdb).
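
Note on the rename step in the end-to-end example: the guide calls `toCamelCaseByDelimiters(DELIMITERS_REGEX)` without showing where either name comes from. The sketch below is a standalone stand-in for that step, assuming the database reports unquoted column names in upper case (for example `ORDER_DATE`), which is why the example lowercases and camel-cases them before casting to `Orders`. The function name `toCamelCase` and the delimiter pattern are hypothetical and are not part of the DataFrame API.

```kotlin
// Hypothetical helper, not part of the DataFrame library: converts a JDBC
// column label such as "ORDER_DATE" into a camelCase property name.
fun toCamelCase(columnLabel: String): String {
    // Assumed delimiter set: underscores, whitespace, and hyphens.
    val parts = columnLabel
        .lowercase()
        .split(Regex("[_\\s-]+"))
        .filter { it.isNotEmpty() }
    if (parts.isEmpty()) return columnLabel
    // Keep the first part lowercase and capitalize the first letter of the rest.
    return parts.first() + parts.drop(1).joinToString("") { part ->
        part.replaceFirstChar { it.uppercase() }
    }
}

fun main() {
    println(toCamelCase("ORDER_DATE")) // orderDate
    println(toCamelCase("ID"))         // id
}
```

With a helper like this, the rename lambda in the example could read `into { toCamelCase(it.name) }`, so that the resulting column names line up with the `id`, `item`, `price`, and `orderDate` properties of the `Orders` schema.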