Typerighter

Typerighter is the server-side part of a service to check a document against a set of user-defined rules. It's designed to work like a spelling or grammar checker. It contains two services, the checker and the rule manager – see architecture for more information.

We use it at the Guardian to check content against our style guide. Max Walker, the subeditor who inspired the creation of Typerighter, has written an introduction here.

To understand our goals for the tool, see the vision document.

For setup, see the docs directory.

For an example of a Typerighter client (the part that presents the spellcheck-style interface to the user), see prosemirror-typerighter.

How it works: an overview

The Typerighter Rule Manager produces a JSON artefact (stored in S3) which is ingested by the Checker service. This artefact represents all the rules in our system, currently including user-defined regex rules, user-defined LanguageTool pattern rules (defined as XML) and LanguageTool core rules (pre-defined rules from LanguageTool). Historically, rules were derived from a Google Sheet rather than the Rule Manager.
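
To make the three rule kinds concrete, here is a minimal sketch of how they could be modelled and routed to the matcher that handles them. All type names, field names and example values below are hypothetical, chosen for illustration; the real artefact schema is defined by the Rule Manager and may differ.

```scala
// Hypothetical model of the rule kinds carried by the artefact.
// Names and fields are illustrative assumptions, not the real schema.
sealed trait ArtefactRule { def id: String }

// A user-defined rule expressed as a regular expression.
final case class UserRegexRule(id: String, pattern: String, description: String)
    extends ArtefactRule

// A user-defined LanguageTool pattern rule, stored as XML.
final case class UserLanguageToolXmlRule(id: String, xml: String) extends ArtefactRule

// A reference to a rule that ships with LanguageTool itself.
final case class LanguageToolCoreRule(id: String, languageToolRuleId: String)
    extends ArtefactRule

object RuleRouting {
  // Regex rules go to the RegexMatcher; both kinds of LanguageTool rule
  // go to the LanguageToolMatcher.
  def matcherFor(rule: ArtefactRule): String = rule match {
    case _: UserRegexRule                                     => "RegexMatcher"
    case _: UserLanguageToolXmlRule | _: LanguageToolCoreRule => "LanguageToolMatcher"
  }

  // Group rule ids by the matcher responsible for them.
  def groupByMatcher(rules: List[ArtefactRule]): Map[String, List[String]] =
    rules.groupBy(matcherFor).map { case (matcher, rs) => matcher -> rs.map(_.id) }
}

object RuleRoutingExample {
  // Example rules with made-up ids and content.
  val rules: List[ArtefactRule] = List(
    UserRegexRule("r1", "\\bteh\\b", "Common typo"),
    UserLanguageToolXmlRule("x1", "<rule/>"),
    LanguageToolCoreRule("c1", "UPPERCASE_SENTENCE_START")
  )
  val grouped: Map[String, List[String]] = RuleRouting.groupByMatcher(rules)
}
```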

Each rule in the service corresponds to a Matcher that receives the document and passes back a list of RuleMatch. We have the following Matcher implementations:

  • RegexMatcher uses regular expressions
  • LanguageToolMatcher is powered by the LanguageTool project, and uses a combination of native LanguageTool rules and user-defined XML rules as its corpus

Matches contain the range that the match applies to, a description of why the match has occurred, and any relevant suggestions. See the RuleMatch interface for the full description.
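
The relationship between a Matcher and its matches can be sketched as follows. This is a simplified illustration, not the project's code: the field names on `RuleMatch`, the `check` signature and the `SketchRegexRule` type are assumptions, and the real types carry more information.

```scala
import scala.util.matching.Regex

// Hypothetical, simplified match: the range, the reason, and any suggestions.
final case class RuleMatch(
    ruleId: String,
    fromPos: Int, // inclusive start offset in the document
    toPos: Int, // exclusive end offset in the document
    message: String,
    suggestions: List[String]
)

// Hypothetical, simplified trait: receive a document, pass back matches.
trait Matcher {
  def check(document: String): List[RuleMatch]
}

// Illustrative regex rule: a pattern, a description, an optional replacement.
final case class SketchRegexRule(
    id: String,
    pattern: Regex,
    description: String,
    replacement: Option[String]
)

// Applies every regex rule to the document and reports each hit as a RuleMatch.
final class SketchRegexMatcher(rules: List[SketchRegexRule]) extends Matcher {
  def check(document: String): List[RuleMatch] =
    for {
      rule <- rules
      m <- rule.pattern.findAllMatchIn(document).toList
    } yield RuleMatch(rule.id, m.start, m.end, rule.description, rule.replacement.toList)
}

object RegexMatcherExample {
  val matcher = new SketchRegexMatcher(
    List(
      SketchRegexRule(
        "no-very-unique",
        "(?i)\\bvery unique\\b".r,
        "'Unique' is absolute; drop 'very'.",
        Some("unique")
      )
    )
  )

  // Yields one match covering offsets 2 to 13 with the suggestion "unique".
  val matches: List[RuleMatch] = matcher.check("A very unique offer.")
}
```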

Architecture

Roles

  • Rule owner: a person responsible for maintaining the rules that Typerighter consumes.
  • Rule user: a person checking their copy with the checker service.

The system consists of two Scala services:

  • The rule-manager service, which is responsible for the lifecycle of Typerighter's corpus of rules, and publishes them as an artefact.
  • The checker service, which consumes that artefact and responds to requests to check copy against the corpus of rules by returning matches.

They're arranged like so:

flowchart LR
  checker[Checker service]
  manager[Manager service]
  sheet[Google Sheet]
  client[Typerighter client]
  s3[(typerighter-rules.json)]
  db[(Postgres DB)]
  owner{{Rule owner role}}
  user{{Rule user role}}

  sheet--"Get rules"-->manager
  manager--"Write rules"-->db
  db--"Read rules"--> manager
  manager--"Write rule artefact"-->s3
  checker--"Read rule artefact"-->s3
  client--"Request matches"-->checker

  owner-."Force manager to re-fetch sheet".->manager
  user-."Request document check".->client
  owner-."Edit rules".->sheet

The checker service

Typerighter is built to manage document checks of every kind, including checks that we haven't yet thought of. To that end, each running checker service instantiates a MatcherPool, which is responsible for managing incoming checks, including parallelism, backpressure, and ensuring that checks are given to the appropriate matchers.

A MatcherPool accepts any matcher instance that satisfies the Matcher trait. The two core Matcher implementations are RegexMatcher, which checks copy with regular expressions, and LanguageToolMatcher, which checks copy with an instance of JLanguageTool. The MatcherPool is excited to accommodate new matchers in the future! Here's a diagram to illustrate:

flowchart TD
   CH(["Check requests"])
   MP-."matches[]".->CH
   MP[MatcherPool]--has many--->MS
   CH-.document.->MP
   subgraph MS[Matchers]
    R[RegexMatcher]
    L[LanguageToolMatcher]
    F[...FancyHypotheticalAIMatcher]
   end
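
The fan-out shown in the diagram can be sketched with plain Scala futures. This is a deliberately small illustration under stated assumptions: `SimpleMatcherPool`, `PoolMatcher`, `PoolMatch` and `PhraseMatcher` are hypothetical names, and the sketch covers only running matchers in parallel and merging their results. The backpressure and routing that the real MatcherPool is responsible for are left out.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical, simplified match type for this sketch.
final case class PoolMatch(matcherName: String, fromPos: Int, toPos: Int, message: String)

// Hypothetical stand-in for the Matcher trait.
trait PoolMatcher {
  def name: String
  def check(document: String): List[PoolMatch]
}

// Runs every registered matcher concurrently and merges the results,
// ordered by position in the document. No backpressure or routing here.
final class SimpleMatcherPool(matchers: List[PoolMatcher])(implicit ec: ExecutionContext) {
  def check(document: String): Future[List[PoolMatch]] =
    Future
      .traverse(matchers)(m => Future(m.check(document)))
      .map(_.flatten.sortBy(_.fromPos))
}

// Toy matcher that flags every occurrence of a fixed phrase.
final class PhraseMatcher(val name: String, phrase: String, message: String)
    extends PoolMatcher {
  def check(document: String): List[PoolMatch] =
    Iterator
      .iterate(document.indexOf(phrase))(i => document.indexOf(phrase, i + 1))
      .takeWhile(_ >= 0)
      .map(i => PoolMatch(name, i, i + phrase.length, message))
      .toList
}

object SimpleMatcherPoolExample {
  import ExecutionContext.Implicits.global

  val pool = new SimpleMatcherPool(
    List(
      new PhraseMatcher("jargon", "going forward", "Prefer 'in future'."),
      new PhraseMatcher("filler", "very", "Consider removing 'very'.")
    )
  )

  // Blocks for the result; acceptable in a demo, not in a request handler.
  def run(document: String): List[PoolMatch] =
    Await.result(pool.check(document), 5.seconds)
}
```

Because the pool depends only on the trait, adding a new kind of matcher means adding another element to the list passed to the pool.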

Implementation

Both the Checker and Rule Manager services are built in Scala with the Play framework. Data in the Rule Manager is stored in a Postgres database, queried via ScalikeJDBC.

Google credentials are fetched from SSM using AWS Credentials or Instance Role.

It's worth noting that, at the moment, there are a fair few assumptions built into this repository that are Guardian-specific:

We'd be delighted to participate in discussions, or consider PRs, that aim to make Typerighter easier to use in a less institutionally specific context.

Integration

The prosemirror-typerighter plugin provides an integration for the ProseMirror rich text editor.

If you'd like to provide your own integration, this service will function as a standalone REST platform, but you'll need to use pan-domain-authentication to provide a valid auth cookie with your requests.

Upgrading LanguageTool

LanguageTool has core rules that we use, and as we upgrade LT, these could change underneath us.

There's a script to see if rules have changed as a result of an upgrade in ./script/js/compare-rule-xml.js.

Formatting

Prettier formatting

Prettier is installed in the client app using the Guardian's recommended config. To format files you can run npm run format:write. A formatting check will run as part of CI.

To configure the IntelliJ Prettier plugin to format on save see the guide here. To configure the VS Code Prettier plugin see here.

Scala formatting

Typerighter uses Scalafmt to ensure consistent linting across all Scala files.

To lint all files, you can run sbt scalafmtAll. To confirm all files are linted correctly, you can run sbt scalafmtCheckAll.

You can configure your IDE to format Scala files on save according to the linting rules defined in .scalafmt.conf.

For IntelliJ there is a guide to set up automated linting on save here and here. For Visual Studio Code with Metals see here.

Automatic formatting

The project contains a pre-commit hook which will automatically run the Scala formatter on all staged files. To enable this, run ./script/setup from the root of the project.

Developer how-tos

Connecting to the rule-manager database in CODE or PROD

Sometimes it's useful to connect to the databases running in AWS to inspect the data locally.

We can use ssm-scala to create an SSH tunnel that exposes the remote database on a local port. For example, to connect to the CODE database, we can run:

ssm ssh -x -t typerighter-rule-manager,CODE -p composer --rds-tunnel 5000:rule-manager-db,CODE

You should then be able to connect to the database on localhost:5000. You'll need to use the username and password specified in AWS Parameter Store at /${STAGE}/flexible/typerighter-rule-manager/db.default.username and db.default.password.

Don't forget to kill the connection once you're done! Here's a handy one-liner: kill $(lsof -ti :{PORT_NUMBER})