From 1fe4d5cd92a8b795702fd40e813c086d474858b2 Mon Sep 17 00:00:00 2001 From: Antonin Delpeuch Date: Tue, 7 Mar 2023 10:40:33 +0100 Subject: [PATCH 1/6] Reformat extension migration instructions --- .../migrating-older-extensions.md | 117 ++++++++++-------- 1 file changed, 67 insertions(+), 50 deletions(-) diff --git a/docs/technical-reference/migrating-older-extensions.md b/docs/technical-reference/migrating-older-extensions.md index fc5ce3c4..812a8eac 100644 --- a/docs/technical-reference/migrating-older-extensions.md +++ b/docs/technical-reference/migrating-older-extensions.md @@ -4,6 +4,18 @@ title: Migrating older extensions sidebar_label: Migrating older extensions --- +This page lists changes in OpenRefine that require significant adaptations from extensions. + +Table of contents: +* Migrating from Ant to Maven (October 2018, between 3.0 and 3.1-beta) +* Migrating to Wikimedia's i18n jQuery plugin (November 2018, between 3.1-beta and 3.1) +* Migrating from org.json to Jackson (December 2018, between 3.1 and 3.2-beta) +* Many changes for 4.0 + * Migrating from in-memory project data storage to the runner architecture + * Changes in project serialization format + * Changes in package names + * Changes in Maven module structure + ## Migrating from Ant to Maven {#migrating-from-ant-to-maven} ### Why are we doing this change? {#why-are-we-doing-this-change} @@ -22,39 +34,42 @@ You will need to write a `pom.xml` in the root folder of your extension to confi For any library that your extension depends on, you should try to find a matching artifact in the Maven Central repository. If you can find such an artifact, delete the `.jar` file from your extension and add the dependency in your `pom.xml` file. 
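For instance, if your extension previously shipped a copy of Apache Commons Lang (a hypothetical example: substitute the coordinates of the library you actually found on Maven Central), the dependency would be declared as:

```xml
<dependencies>
  <!-- hypothetical example: replace with the coordinates of your own dependency -->
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.12.0</version>
  </dependency>
</dependencies>
```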
If you cannot find such an artifact, it is still possible to incorporate your own `.jar` file using the `maven-install-plugin`, which you can configure in your `pom.xml` file as follows:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-install-plugin</artifactId>
  <version>2.5.2</version>
  <executions>
    <execution>
      <id>install-wdtk-datamodel</id>
      <phase>process-resources</phase>
      <configuration>
        <file>${basedir}/lib/my-proprietary-library.jar</file>
        <repositoryLayout>default</repositoryLayout>
        <groupId>com.my.company</groupId>
        <artifactId>my-library</artifactId>
        <version>0.5.3-SNAPSHOT</version>
        <packaging>jar</packaging>
        <generatePom>true</generatePom>
      </configuration>
      <goals>
        <goal>install-file</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

And add the dependency to the `<dependencies>` section as usual:

```xml
<dependency>
  <groupId>com.my.company</groupId>
  <artifactId>my-library</artifactId>
  <version>0.5.3-SNAPSHOT</version>
</dependency>
```

## Migrating to Wikimedia's i18n jQuery plugin {#migrating-to-wikimedias-i18n-jquery-plugin}

The migration was made between 3.1-beta and 3.1, with this commit: https://githu

You will need to update your translation files, merging the nested objects into one global object with concatenated keys.
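To illustrate the expected transformation, here is a hypothetical translation file before and after flattening (the `core-dialogs/cancel` key is taken from the example below; the other key is made up):

```json
{
    "core-dialogs": {
        "cancel": "Cancel",
        "ok": "OK"
    }
}
```

becomes:

```json
{
    "core-dialogs/cancel": "Cancel",
    "core-dialogs/ok": "OK"
}
```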
You can do this by running the following Python script on each of your JSON translation files:

```python
import json
import sys

with open(sys.argv[1], 'r') as f:
    j = json.loads(f.read())

result = {}

def translate(obj, path):
    # leaves are the translated strings: store them under the concatenated key
    if type(obj) == str:
        result['/'.join(path)] = obj
    else:
        for k, v in obj.items():
            translate(v, path + [k])

translate(j, [])

with open(sys.argv[1], 'w') as f:
    f.write(json.dumps(result, ensure_ascii=False, indent=4))
```

Then the JavaScript files which retrieve the translated strings should be updated: `$.i18n._('core-dialogs')['cancel']` becomes `$.i18n('core-dialogs/cancel')`.
You can do this with the following `sed` script: From 377e9df09321ddc8be6cc826adff18f1a8eb2658 Mon Sep 17 00:00:00 2001 From: Antonin Delpeuch Date: Wed, 5 Apr 2023 18:07:54 +0200 Subject: [PATCH 2/6] Start structure for 4.0 extension migration guide --- .../migrating-older-extensions.md | 29 ++++++++++++++++++- 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/docs/technical-reference/migrating-older-extensions.md b/docs/technical-reference/migrating-older-extensions.md index 812a8eac..10e02baa 100644 --- a/docs/technical-reference/migrating-older-extensions.md +++ b/docs/technical-reference/migrating-older-extensions.md @@ -10,11 +10,12 @@ Table of contents: * Migrating from Ant to Maven (October 2018, between 3.0 and 3.1-beta) * Migrating to Wikimedia's i18n jQuery plugin (November 2018, between 3.1-beta and 3.1) * Migrating from org.json to Jackson (December 2018, between 3.1 and 3.2-beta) -* Many changes for 4.0 +* Changes for 4.0 * Migrating from in-memory project data storage to the runner architecture * Changes in project serialization format * Changes in package names * Changes in Maven module structure + * Changes in the HTTP API offered by OpenRefine's backend ## Migrating from Ant to Maven {#migrating-from-ant-to-maven} @@ -184,3 +185,29 @@ Example: `WikibaseSchema` [before](https://github.com/OpenRefine/OpenRefine/blob Any class that is stored in OpenRefine's preference now needs to implement the `com.google.refine.preferences.PreferenceValue` interface. The static `load` method and the `write` method used previously for deserialization should be deleted and regular Jackson serialization and deserialization should be implemented instead. Note that you do not need to explicitly serialize the class name, this is already done for you by the interface. 
Example: `TopList` [before](https://github.com/OpenRefine/OpenRefine/blob/3.1/main/src/com/google/refine/preference/TopList.java) and [after](https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/preference/TopList.java) + +## Changes for 4.0 + +Most changes for 4.0 happen in the backend. The frontend code remains mostly the same. +* If your extension only makes frontend changes, you might be able to migrate it without much trouble (perhaps it already works out of the box?). It is worth checking the section on frontend architecture changes and the HTTP API changes if you are making + calls to the backend yourself. +* If your extension includes backend functionality, there might be more work involved. Although an incremental migration (starting from your existing code) might be possible, it might be easier to rewrite those features from scratch following our guide for + extension developers. + +### Migrating from in-memory project data storage to the runner architecture + +### Changes in project serialization format + +### Changes in package names + +### Changes in Maven module structure + +### Changes in the HTTP API offered by OpenRefine's backend + +#### Pagination changes +#### Changes in applying operations +#### get-models command +#### Options of the CSV/TSV importer +#### Removal of the reconciliation pool + + From 2c90bcea07f5e151198a7dbbaf7d9e50e945f3eb Mon Sep 17 00:00:00 2001 From: Antonin Delpeuch Date: Mon, 10 Apr 2023 14:57:58 +0200 Subject: [PATCH 3/6] Document changes in the HTTP API --- .../migrating-older-extensions.md | 114 +++++++++++++++--- 1 file changed, 97 insertions(+), 17 deletions(-) diff --git a/docs/technical-reference/migrating-older-extensions.md b/docs/technical-reference/migrating-older-extensions.md index 10e02baa..50db195c 100644 --- a/docs/technical-reference/migrating-older-extensions.md +++ b/docs/technical-reference/migrating-older-extensions.md @@ -6,17 +6,6 @@ sidebar_label: Migrating older 
extensions This page lists changes in OpenRefine that require significant adaptations from extensions. -Table of contents: -* Migrating from Ant to Maven (October 2018, between 3.0 and 3.1-beta) -* Migrating to Wikimedia's i18n jQuery plugin (November 2018, between 3.1-beta and 3.1) -* Migrating from org.json to Jackson (December 2018, between 3.1 and 3.2-beta) -* Changes for 4.0 - * Migrating from in-memory project data storage to the runner architecture - * Changes in project serialization format - * Changes in package names - * Changes in Maven module structure - * Changes in the HTTP API offered by OpenRefine's backend - ## Migrating from Ant to Maven {#migrating-from-ant-to-maven} ### Why are we doing this change? {#why-are-we-doing-this-change} @@ -188,10 +177,12 @@ Example: `TopList` [before](https://github.com/OpenRefine/OpenRefine/blob/3.1/ma ## Changes for 4.0 +Version 4.0 features [better support for large datasets and long-running operations](https://github.com/OpenRefine/OpenRefine/wiki/Changes-for-4.0). + Most changes for 4.0 happen in the backend. The frontend code remains mostly the same. * If your extension only makes frontend changes, you might be able to migrate it without much trouble (perhaps it already works out of the box?). It is worth checking the section on frontend architecture changes and the HTTP API changes if you are making calls to the backend yourself. -* If your extension includes backend functionality, there might be more work involved. Although an incremental migration (starting from your existing code) might be possible, it might be easier to rewrite those features from scratch following our guide for +* If your extension includes backend functionality, there might be more work involved. Although an incremental migration (starting from your existing code) might be possible, it might be easier to rewrite those features mostly from scratch following our guide for extension developers. 
### Migrating from in-memory project data storage to the runner architecture

### Changes in the HTTP API offered by OpenRefine's backend

#### The `get-rows` command

The `get-rows` command offered by the backend to fetch batches of rows or records has changed. In 3.x, the command expected:
* `engine`: the configuration of the engine, indicating whether the rows or records mode should be used, as well as the active facets;
* `limit`: a page size;
* `start`: the number of filtered rows/records before the page.

Note that the `start` parameter is not always the id of the first row or record to return: if facets are applied, there might be rows/records filtered out before the requested page, in which case the id of the first row returned will be greater than the `start` parameter. For the backend, this is inefficient: all rows before the requested page must be processed to check whether they match the facets or not. It was also the source of UX issues, as the scrolling position in the grid could often not be preserved after an operation was applied.

In the new architecture, the command now expects:
* `engine`: the configuration of the engine, as before;
* `limit`: a page size, as before;
* exactly one of:
  * `start`: a lower bound on the first row/record id to return;
  * `end`: an upper bound on the last row/record id to return.

If no facets are applied, the combination of `start` and `limit` gives the same results as in the previous version, with the id of the first row returned equal to the value of the `start` parameter. But when facets are applied, the behaviour differs: the backend starts inspecting the row/record at the given `start` offset and returns the first `limit` matching rows/records.

The format of the response has changed too.
In 3.x, the contents of reconciliation objects used to be stored separately, in a `pool` object. Those reconciliation objects are now stored directly in the cell objects they belong to, and the reconciliation pool was removed.

Corresponding issues: [#3562](https://github.com/OpenRefine/OpenRefine/issues/3562), PR [#5411](https://github.com/OpenRefine/OpenRefine/pull/5411).

#### The `get-models` command

The output of the `get-models` command has been impacted in several ways:
* the `recordsModel` field was removed;
* the `hasRecords` field it contained was moved to the `columnModel` field.

The `hasRecords` field has also changed meaning. It used to be set to `true` when the grid contained more rows than records. Both for performance reasons and UX considerations, we have changed this to indicate whether the importer and operations leading to the current project state created a record structure by design. This should be a more faithful indication of whether the records mode should be offered to the user in this project state.

Corresponding issues: [#5661](https://github.com/OpenRefine/OpenRefine/issues/5661) and commit [64c552bb1](https://github.com/OpenRefine/OpenRefine/commit/64c552bb1503031fea07e0a299831a0f5b73fee5).

#### Applying operations

In 3.x, each operation that the user could run on a project came with the following Java classes in the backend:
* an Operation class, which holds the metadata for the operation and is responsible for its JSON serialization (which is exposed in the history tab, among other places);
* a Change class (often reused by different operations), which is responsible for actually applying the operation to the project (carrying out the corresponding transformation);
* a Command class, which exposes an HTTP API to initiate the operation on a project.
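For reference, the JSON serialization of an operation, as shown in the history tab, looks roughly like this hedged sketch (modelled on the `core/text-transform` operation; the exact fields vary from one operation to another):

```json
{
    "op": "core/text-transform",
    "engineConfig": { "mode": "row-based", "facets": [] },
    "columnName": "name",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column name using expression value.trim()"
}
```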
Therefore, each operation came with its own HTTP endpoint to apply it, and the frontend could call that endpoint when the user clicked a menu item or validated a dialog, for instance.

In 4.x, those dedicated HTTP endpoints were removed in favour of the generic `apply-operations` command, which was already used by the Undo/Redo tab to let the user apply a sequence of operations defined by a JSON array.

In the frontend, a new utility method was introduced: `Refine.postOperation`. This method can be used to apply an operation by supplying the same JSON representation one would find in the history tab. Under the hood, it calls the `apply-operations` command. If an extension used the JavaScript functions `Refine.postProcess` or `Refine.postCoreProcess`, we recommend migrating it to use `Refine.postOperation` instead. Note that the JSON serialization of the operation and the parameters expected by the dedicated command in 3.x do not always match perfectly, so it is worth double-checking the syntax when doing the migration. See PR [#5559](https://github.com/OpenRefine/OpenRefine/pull/5559/files#diff-865c93310c384b82a3d586f825aa52005a7a4705320d6a7f96219dc4bc029979) for examples of such migrations in the core tool.

Corresponding issues: [#5539](https://github.com/OpenRefine/OpenRefine/issues/5539), PR [#5559](https://github.com/OpenRefine/OpenRefine/pull/5559).

#### Support for sampling in facet evaluation

The `compute-facets` command supports sampling, to evaluate facets only on a subset of rows or records as a way of speeding up the computation.
This feature can be enabled by adding an `aggregationLimit` field to the engine configuration JSON passed to the backend, as follows:

```json
{
    "mode": "record-based",
    "facets": [
        {
            "type": "list",
            "name": "country",
            "columnName": "country",
            "expression": "value",
            "omitBlank": false,
            "omitError": false,
            "selection": [],
            "selectBlank": false,
            "selectError": false,
            "invert": false
        }
    ],
    "aggregationLimit": 10000
}
```

This will cap the evaluation of facets at the given limit. The number of rows or records actually processed might vary and is returned in the response, using the following fields:
* `aggregatedCount`: the number of rows or records which were actually processed;
* `filteredCount`: out of those processed rows or records, how many matched the filters defined by the facets;
* `limitReached`: `true` when the aggregation stopped because of the `aggregationLimit` set in the request, `false` when the entire dataset was processed.

#### Options of the CSV/TSV importer

The CSV/TSV importer has a new option: `multiLine`. This boolean option should be set to `true` when each row of the project is known to correspond to a single line in the source file. This implies that cells are known not to contain unescaped newline characters.

For backwards compatibility, `multiLine` is considered `false` if it is not present in the importing options. But it is enabled by default by the new UI, since it enables significant performance improvements.

Extensions or third-party tools should consider adding `multiLine: true` to the importing options they pass to OpenRefine when creating a project, since this will make it possible to process significantly larger datasets.
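For instance, the importing options passed when creating a project from a CSV file might look along these lines (a hedged sketch: apart from `multiLine`, the option names shown are assumptions based on the 3.x separator-based importer):

```json
{
    "separator": ",",
    "encoding": "UTF-8",
    "headerLines": 1,
    "multiLine": true
}
```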
+ From 23673902b8754bbe16bb07c4606c5c8299a84043 Mon Sep 17 00:00:00 2001 From: Antonin Delpeuch Date: Fri, 14 Jul 2023 16:19:48 +0200 Subject: [PATCH 4/6] Add details about new data access and processing in 4.0 --- .../migrating-older-extensions.md | 146 ++++++++++++++++-- 1 file changed, 134 insertions(+), 12 deletions(-) diff --git a/docs/technical-reference/migrating-older-extensions.md b/docs/technical-reference/migrating-older-extensions.md index 50db195c..56ee7096 100644 --- a/docs/technical-reference/migrating-older-extensions.md +++ b/docs/technical-reference/migrating-older-extensions.md @@ -185,16 +185,147 @@ Most changes for 4.0 happen in the backend. The frontend code remains mostly the * If your extension includes backend functionality, there might be more work involved. Although an incremental migration (starting from your existing code) might be possible, it might be easier to rewrite those features mostly from scratch following our guide for extension developers. +### Changes in package names + +The first issue you might encounter when trying to migrate an extension to 4.0 is that the package names have changed from `com.google.refine.*` to `org.openrefine.*`. +You are encouraged to run such a replacement on your extension to update the import statements. + +### Changes in Maven module structure + +OpenRefine's code base was also made more modular, to make it easier for extensions to declare finer dependencies on the parts of OpenRefine they actually depend on. Those modules are available on Maven Central in the `org.openrefine` group id. 
* `refine-model` contains the core classes which define the data model of the application (`Project`, `Row`, `Cell`…);
* `refine-workflow` contains the application logic of the tool: all operations, facets, importers, exporters and clusterers available in the tool without extensions;
* `refine-testing` contains testing utilities that can be reused by extensions for their own unit tests;
* `refine-grel` contains the implementation of the so-called General Refine Expression Language, OpenRefine's default expression language;
* `refine-local-runner` contains the default implementation of the `Runner`/`Grid`/`ChangeData` interfaces, optimized for data cleaning workflows executed on a single machine;
* `refine-util` contains various utilities.

On top of that, as in 3.x, the `main` Butterfly module exposes the application functionality via an HTTP API, and the `server` module runs the actual Butterfly server, offering access to the `main` module and all installed extensions.

### Migrating from in-memory project data storage to the runner architecture

Before 4.0, the project data could be accessed simply via mutable fields of the `Project` class. For instance, `project.rows` was simply a `List<Row>` which could be modified freely.

Since 4.0, project data is encapsulated in the `Grid` interface, which represents the state of the project at a given point in its history. The `Grid` interface encompasses the following fields:
* the column model (the list of columns and their metadata);
* the cells, grouped into rows or records depending on the needs;
* the overlay models, generally defined by extensions, which make it possible to store additional information in the project and benefit from the versioning mechanism.

#### Accessing project data

To access the grid in a project, use `project.getCurrentGrid()`.
This gives you access to the underlying data, for instance [`Grid::rowCount`](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#rowCount()) or [`Grid::getRow`](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#getRow(long)), which retrieves a row by its index.
Note that it is worth thinking twice about how you access grid data using the methods offered by the `Grid` interface, to make sure it is as efficient as possible. For instance, if you wanted to do something for each row in the project, you could do something like this:

```java
// Warning, do not do this! See efficient version below
for (long i = 0; i != grid.rowCount(); i++) {
    Row row = grid.getRow(i);
    // do something with the row
}
```

Running this code would be rather inefficient with the default implementation of `Grid`, as accessing an individual row might involve opening files and scanning them to find the appropriate row. This typically happens when the project data does not fit in memory. Instead, use the following:

```java
try (CloseableIterator<Row> iterator = grid.iterateRows(RowFilter.ANY_ROW)) {
    for (Row row : iterator) {
        // do something with the row
    }
}
```

This variant has the benefit of doing a single pass on the grid, which is much more efficient.
The try-with-resources block ensures that any files opened to support the iterator are closed when we leave the loop (be it because the end of the grid was reached, or because the iteration was interrupted by a `break`, `return` or `throw` statement).

Check [the documentation of the `Grid` interface](https://javadoc-v4.openrefine.org/org/openrefine/model/grid) to find the method that best fits your needs. As a fallback solution, you can always use `Grid::collectRows()` to obtain a standard Java `List`, but of course this forces the loading of the entire project data in memory.
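The closing guarantee is not specific to OpenRefine: it is the standard behaviour of Java's try-with-resources. A self-contained sketch (plain Java, no OpenRefine classes; `ClosingIterable` is a made-up stand-in for `CloseableIterator`) showing that the resource is closed even when the loop exits early:

```java
import java.util.Iterator;
import java.util.List;

public class CloseableIteratorDemo {
    // Minimal stand-in for a closeable iterator: an Iterable that must be closed.
    static class ClosingIterable implements Iterable<String>, AutoCloseable {
        private final List<String> rows;
        boolean closed = false;

        ClosingIterable(List<String> rows) { this.rows = rows; }

        @Override
        public Iterator<String> iterator() { return rows.iterator(); }

        @Override
        public void close() { closed = true; } // a real implementation would release file handles here
    }

    public static void main(String[] args) {
        ClosingIterable iterable = new ClosingIterable(List.of("a", "b", "c"));
        try (ClosingIterable it = iterable) {
            for (String row : it) {
                if (row.equals("b")) {
                    break; // early exit: close() still runs when the block is left
                }
            }
        }
        System.out.println(iterable.closed); // prints "true"
    }
}
```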
#### Modifying project data

To make changes to the grid, you need to run an operation on the project so that the changes are properly logged in the history. The operation will derive a new grid, which will become the current one. This is done by implementing an [`Operation`](https://javadoc-v4.openrefine.org/org/openrefine/operations/operation) and running it via `project.getHistory().addEntry(operation)`.

The `Grid` interface provides various methods to derive a new grid with some changes. For instance, to execute the same transformation on all rows, one can use the [`Grid::mapRows(RowMapper, ColumnModel)`](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#mapRows(org.openrefine.model.RowMapper,org.openrefine.model.ColumnModel)) method.
Its first argument supplies a function which is applied to each row, and the second is the new column model of the resulting grid (which might not be the same as in the initial grid, for instance when adding a new column).
Note that there is no guarantee on the order in which the mapping function will be executed, as the execution might be eager or lazy, sequential or parallel, depending on the implementation. As such, this function should be pure.

Often, you will want to run transformations that are not pure, or which should be executed only once for each row because they are expensive.
This means that the data produced by the transformation should be persisted, that the progress of computing this data should be reported to the user, and that the user should be able to pause and resume this computation.
All those features are available to you, at the small cost of going through a slightly more complicated API.
The transformation is implemented in two steps:
* deriving a `ChangeData` object from the original grid. This `ChangeData` object contains the results of the expensive or stateful computation run on the grid, indexed by the identifiers of the row/record they were generated from.
  This derivation is obtained by applying a function to each row, via a `RowChangeDataProducer` (or `RecordChangeDataProducer` in records mode).
* then, this `ChangeData` is joined back with the original grid, to obtain the final grid. This makes use of a `RowChangeDataJoiner` (or similarly, a `RecordChangeDataJoiner`) which, given a row from the old grid and the result of the expensive computation on that row, returns the corresponding row of the new grid.

In addition, to make it possible to recover from crashes (which can happen during the computation of a `ChangeData` object), the `Grid` interface makes it possible to supply an incomplete `ChangeData` object from a previous attempt to compute the operation, so that the new `ChangeData` object can avoid recomputing the rows that were already computed.
This also requires supplying a serializer object to define how the expensively computed data can be saved on disk.

All in all, the code to implement such an operation will generally look like this:

```java
    protected static class MyChangeDataProducer implements RowChangeDataProducer<Long> {
        // ... implements a function which computes an expensive Long from a Row
    }

    protected static class MyChangeDataJoiner implements RowChangeDataJoiner<Long> {
        // ... implements a function which inserts the computed Long value inside a Row, for instance in a new column
    }

    protected static class MyChangeDataSerializer implements ChangeDataSerializer<Long> {
        // ...
// defines how our Long values are represented as strings (which is needed to persist them)
    }

    // main method of the operation
    @Override
    public ChangeResult apply(Grid projectState, ChangeContext context) throws OperationException {

        ChangeData<Long> changeData = null;
        try {
            changeData = context.getChangeData("expensive_longs", new MyChangeDataSerializer(), existingChangeData -> {
                return projectState.mapRows(engine.combinedRowFilters(), new MyChangeDataProducer(), existingChangeData);
            });
        } catch (IOException e) {
            throw new IOOperationException(e);
        }

        MyChangeDataJoiner joiner = new MyChangeDataJoiner();
        Grid joined = projectState.join(changeData, joiner, projectState.getColumnModel());

        return new ChangeResult(joined, GridPreservation.PRESERVES_RECORDS, null);
    }
```

Real-world examples can be found in the `ExpressionBasedOperation` or `PerformWikibaseEditsOperation` classes.

#### Creating new grids

There are also situations where we need to create a new `Grid` instance without applying a transformation to an existing one.
This is for example the case in any importer, which needs to create a project from scratch. It can also be helpful for operations which cannot easily express their changes using the transformation methods offered by the `Grid` interface. In that case, they can take all the data out of the original grid, run an arbitrary algorithm on it and create a new grid to store the result (this should only be used as a fallback solution, since it generally comes with poor memory management).

Creating grids can be done via [the `Runner` interface](https://javadoc-v4.openrefine.org/org/openrefine/model/runner), which acts as a factory class for grids. It offers multiple options:
* from a list of rows, which is only viable if the grid is small enough to fit in memory;
* from an iterable of rows, which makes it possible to avoid loading all rows in memory.
As a consequence, the iterable source will generally be iterated on multiple times, on demand, when methods of the resulting grid are called. +* by loading it from a file on disk, if the grid has been serialized in the expected format; +* by reading a text file (or collection of text files in the same folder), interpreting each line as a one-cell row. This can be useful as a basis to write importers which use a textual format. ### Changes in the HTTP API offered by OpenRefine's backend +#### Use of HTTP status codes + +In 3.x and before, all commands systematically returned the HTTP 200 status code, regardless of whether they executed successfully or not. +In 4.0, more meaningful status codes were introduced. We encourage extensions to do the same. +A catch-all event listener reports any failing command to the user. + +It is important to note that error status codes should only be returned in cases where an error signals a problem in OpenRefine or an extension. +For instance, previewing an expression which contains a syntax error should not return an HTTP status code representing an error, because it is expected that users submit invalid expressions and such errors are displayed in a specific way in the expression +preview dialog. + #### The `get-rows` command The `get-rows` command offered by the backend to fetch batches of rows or records has changed. In 3.x, the command expected: @@ -281,13 +412,4 @@ This will cap the evaluation of facets to the given limit. The number of rows or * `filteredCount`: out of those processed rows or records, how many matched the filters defined by the facets; * `limitReached`: true when the aggregation stopped because of the `aggregationLimit` set in the request, false when the entire dataset was processed. -#### Options of the CSV/TSV importer - -The CSV/TSV importer has got a new option: `multiLine`. This boolean option should be set to `true` when each row of the project is known to correspond to a single line in the source file. 
This implies that cells are known not to contain unescaped newline -characters. - -For backwards compatibility, `multiLine` is considered `false` if it is not present in the importing options. But it is enabled by default by the new UI, since it enables significant performance improvements. - -Extensions or third-party tools should consider adding `multiLine: true` to the importing options they pass to OpenRefine when creating a project, since this will make it possible to process significantly larger datasets. - From d9f19f381fb308c2466e3fc24b28beb00dff1355 Mon Sep 17 00:00:00 2001 From: Antonin Delpeuch Date: Sat, 15 Jul 2023 14:48:14 +0200 Subject: [PATCH 5/6] Expand migration instructions after attempting to migrate the CommonsExtension --- .../migrating-older-extensions.md | 88 ++++++++++++++++++- 1 file changed, 86 insertions(+), 2 deletions(-) diff --git a/docs/technical-reference/migrating-older-extensions.md b/docs/technical-reference/migrating-older-extensions.md index 56ee7096..a311c510 100644 --- a/docs/technical-reference/migrating-older-extensions.md +++ b/docs/technical-reference/migrating-older-extensions.md @@ -185,10 +185,21 @@ Most changes for 4.0 happen in the backend. The frontend code remains mostly the * If your extension includes backend functionality, there might be more work involved. Although an incremental migration (starting from your existing code) might be possible, it might be easier to rewrite those features mostly from scratch following our guide for extension developers. -### Changes in package names +### Changes in package and class names The first issue you might encounter when trying to migrate an extension to 4.0 is that the package names have changed from `com.google.refine.*` to `org.openrefine.*`. You are encouraged to run such a replacement on your extension to update the import statements. 
+The following Bash command can be run in a source directory, performing the replacement on all files contained in subdirectories: +```bash +find . -type f -exec sed -i 's/com\.google\.refine/org.openrefine/g' {} \; +``` +Note that this must be done in Java files (both main and test classes), but also in the `controller.js` file where components are registered. + +On top of this, the following classes have been renamed: +* `com.google.refine.model.Column` became `org.openrefine.model.ColumnMetadata`, to make it clear that this class only stores metadata and none of the actual data in the column; +* `com.google.refine.model.ReconStats` was removed: those statistics used to be part of the column metadata, but they are now computed at the same time as facets. They are now represented by `org.openrefine.browsing.columns.ColumnStats` which stores + broader statistics than just reconciliation status. +* Other reconciliation model classes, such as the `Recon` or `ReconCandidate` classes, were moved from `com.google.refine.model` to `org.openrefine.model.recon`. ### Changes in Maven module structure @@ -211,6 +222,33 @@ Since 4.0, project data is encapsulated in the `Grid` interface, which represent * the cells, grouped into rows or records depending on the needs; * the overlay models, generally defined by extensions, which make it possible to store additional information in the project and benefit from the versioning mechanism. +#### Immutability of core data model classes + +All classes involved in representing the state of the project, such as `Grid`, `Row`, `Cell`, `Recon` and others, are now immutable. +This was introduced to make sure that any changes made to the project are done by deriving a new grid and adding it to the history, +ensuring proper versioning. 
The use of immutable classes is also widely regarded as a good practice which makes it easier to guarantee the correctness of data processing applications, especially in the presence of parallelism (which is used in OpenRefine).

A lot of the code changes involved in migrating an extension will be directly related to this change.
For instance, while in 3.x you could do something like this:
```java
Column column = new Column();
column.setName("My column");
column.setReconConfig(config);
```

In 4.0, the setters have been removed and the corresponding code looks like this:
```java
ColumnMetadata column = new ColumnMetadata("My column")
        .withReconConfig(config);
```

#### Serializability of classes

Many classes in OpenRefine are now required to be serializable with Java serialization.
This is done to enable integrations with distributed execution engines such as Apache Spark (see the [refine-spark](https://github.com/OpenRefine/refine-spark) extension).
This should generally not cause much trouble during migration, beyond your IDE prompting you to add `serialVersionUID` fields to those classes.

 #### Accessing project data

 To access the grid in a project, use `project.getCurrentGrid()`. This gives you access to the underlying data, for instance `[Grid::rowCount](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#rowCount())` or `[Grid::getRow](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#getRow(long))` which retrieves a row by its index.

 Note that it is worth thinking twice about how you access grid data using the methods offered by the `Grid` interface, to make sure it is as efficient as possible. For instance, if you wanted to do something for each row in the project, you could do something like this:

@@ -233,7 +271,7 @@ try (CloseableIterator<IndexedRow> iterator = grid.iterateRows(RowFilter.ANY_ROW
     }
 }
 ```
-This variant has the benefit of doing a single pass on the grid, which is much more efficient.
+This variant has the benefit of doing a single pass on the grid, opening any required file only once, which is much more efficient.
The try-with-resources block ensures that any files opened to support the iterator are closed when we leave the loop (be it because the end of the grid was reached, or because the iteration was interrupted by a `break`, `return` or `throw` statement).

@@ -412,4 +450,50 @@ This will cap the evaluation of facets to the given limit. The number of rows or

 * `filteredCount`: out of those processed rows or records, how many matched the filters defined by the facets;
 * `limitReached`: true when the aggregation stopped because of the `aggregationLimit` set in the request, false when the entire dataset was processed.

### Changes in GREL

In 4.0, two separate classes of GREL functions were introduced:
* pure functions, which only perform a (generally lightweight) computation on their arguments, without interacting with any external system (disk, network, other OpenRefine components). Those functions extend the `PureFunction` abstract class. This is for instance the case of the `trim()` or `parseJson()` functions;
* all other functions, which are allowed to perform side effects or access contextual data beyond their own arguments. This is the case of the `facetCount()` function (since it is able to access project data which is not supplied to it as an argument) or the `cross()` function (which is even able to access other projects).

If your extension defines custom GREL functions, they implement the `Function` interface in OpenRefine 3.x. For each of those functions, it is worth checking whether it is pure. If so, change it to extend the `PureFunction` abstract class instead: this makes it possible to evaluate it on the fly (lazily). Otherwise, any expression that relies on such an impure function will be treated as expensive to compute, meaning that deriving a new column based on that expression will be treated as a long-running operation, which stores its results on disk.
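To make the pure/impure distinction concrete, here is a self-contained sketch. The `GrelFunction` interface and the two classes below are simplified stand-ins invented for illustration; they are not OpenRefine's actual `Function`/`PureFunction` API:

```java
// Simplified stand-in for a GREL function interface (illustrative only,
// not OpenRefine's real API).
interface GrelFunction {
    Object call(Object[] args);
}

// A pure function: its result depends only on its arguments, so the
// evaluator is free to re-run it at any time, lazily and on the fly.
class TrimFunction implements GrelFunction {
    @Override
    public Object call(Object[] args) {
        return args[0].toString().trim();
    }
}

// An impure function: it reads contextual state beyond its arguments
// (here, a stand-in for project-level facet data), so expressions using
// it must be treated as expensive and their results stored on disk.
class FacetCountFunction implements GrelFunction {
    private final java.util.Map<String, Long> facetCounts;

    FacetCountFunction(java.util.Map<String, Long> facetCounts) {
        this.facetCounts = facetCounts;
    }

    @Override
    public Object call(Object[] args) {
        return facetCounts.getOrDefault(args[0].toString(), 0L);
    }
}
```

Only functions of the first kind can safely be re-evaluated at any point without side effects, which is what makes lazy evaluation possible.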
### Changes in importers

The structure of the importers has changed, mostly due to the migration to immutable data structures.
In OpenRefine 3.x, importers were passed an empty project that they had to fill with data, hence relying crucially on the mutability of the `Project` class.

This can be seen in the signature of the `ImportingParser::parse` method:
```java
public void parse(
    Project project,
    ProjectMetadata metadata,
    ImportingJob job,
    List<ObjectNode> fileRecords,
    String format,
    int limit,
    ObjectNode options,
    List<Exception> exceptions);
```

Instead, in 4.0, importers do not interact with `Project` instances at all. Their task is simply to return a `Grid` given the importing parameters they have been passed. To be able to do so, they are also passed a `Runner` instance, since this factory object is required to create grids.

```java
public Grid parse(
    Runner runner,
    ProjectMetadata metadata,
    ImportingJob job,
    List<ImportingFileRecord> fileRecords,
    String format,
    long limit,
    ObjectNode options) throws Exception;
```
Also note the migration to throwing exceptions when encountering errors, instead of collecting them in a list.

From cb32b0c8cc940addd8955f6afe67c2cb12993ef1 Mon Sep 17 00:00:00 2001
From: Antonin Delpeuch
Date: Sat, 2 Mar 2024 11:11:05 +0100
Subject: [PATCH 6/6] Fix markup

---
 docs/technical-reference/migrating-older-extensions.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/technical-reference/migrating-older-extensions.md b/docs/technical-reference/migrating-older-extensions.md
index a311c510..1b450eb7 100644
--- a/docs/technical-reference/migrating-older-extensions.md
+++ b/docs/technical-reference/migrating-older-extensions.md
@@ -251,7 +251,7 @@ This should generally not cause much trouble during migration, beyond your IDE p

 #### Accessing project data

-To access the grid in a project, use `project.getCurrentGrid()`.
This gives you access to the underlying data, for instance `[Grid::rowCount](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#rowCount())` or `[Grid::getRow](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#getRow(long))` which retrieves a row by its index.
+To access the grid in a project, use `project.getCurrentGrid()`. This gives you access to the underlying data, for instance [`Grid::rowCount`](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#rowCount()) or [`Grid::getRow`](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#getRow(long)) which retrieves a row by its index.

 Note that it is worth thinking twice about how you access grid data using the methods offered by the `Grid` interface, to make sure it is as efficient as possible. For instance, if you wanted to do something for each row in the project, you could do something like this:

@@ -281,10 +281,10 @@ entire project data in memory.

 #### Modifying project data

 To make changes to the grid, you need to run an operation on the project so that the changes are properly logged in the history. The operation will derive a new grid which will become the
-current one. This is done by implementing an `[Operation](https://javadoc-v4.openrefine.org/org/openrefine/operations/operation)` and running it via `project.getHistory().addEntry(operation)`.
+current one. This is done by implementing an [`Operation`](https://javadoc-v4.openrefine.org/org/openrefine/operations/operation) and running it via `project.getHistory().addEntry(operation)`.

 The `Grid` interface provides various methods to derive a new grid with some changes. For instance, to execute the same transformation on all rows, one can use
-the `[Grid::mapRows(RowMapper, ColumnModel)](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#mapRows(org.openrefine.model.RowMapper,org.openrefine.model.ColumnModel))` method.
+the [`Grid::mapRows(RowMapper, ColumnModel)`](https://javadoc-v4.openrefine.org/org/openrefine/model/grid#mapRows(org.openrefine.model.RowMapper,org.openrefine.model.ColumnModel)) method.
Its first argument supplies a function which is applied to each row, and the second is the new column model of the resulting grid (which might not be the same as in the initial grid, for instance when adding a new column). Note that there is no guarantee on the order in which the mapping function will be executed: the execution might be eager or lazy, sequential or parallel depending on the implementation. As such, this function should be pure.
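To build intuition for this contract, the behaviour of `mapRows` can be sketched with a self-contained analogy. `MiniGrid` and `RowMapper` below are simplified stand-ins, not OpenRefine's real classes: mapping a pure function over an immutable list of rows derives a new grid and leaves the original untouched:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Simplified stand-in for the Grid::mapRows contract described above
// (not OpenRefine's real API): rows are immutable, and the mapper must
// be a pure function of the row index and the row content.
interface RowMapper {
    List<String> call(long rowIndex, List<String> row);
}

class MiniGrid {
    private final List<List<String>> rows;

    MiniGrid(List<List<String>> rows) {
        this.rows = List.copyOf(rows); // defensive, immutable copy
    }

    // Derives a new grid without mutating this one, mirroring how 4.0
    // operations produce a new Grid that is appended to the history.
    MiniGrid mapRows(RowMapper mapper) {
        return new MiniGrid(IntStream.range(0, rows.size())
                .mapToObj(i -> mapper.call(i, rows.get(i)))
                .collect(Collectors.toList()));
    }

    List<List<String>> getRows() {
        return rows;
    }
}
```

Because no execution order is guaranteed, the mapper must not rely on mutable external state; this is the purity requirement stated above.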