Typerighter is the server-side part of a service to check a document against a set of user-defined rules. It's designed to work like a spelling or grammar checker. It contains two services, the checker and the rule manager – see architecture for more information.
We use it at the Guardian to check content against our style guide. Max Walker, the subeditor who inspired the creation of Typerighter, has written an introduction here.
To understand our goals for the tool, see the vision document.
For setup, see the docs directory.
For an example of a Typerighter client (the part that presents the spellcheck-style interface to the user), see prosemirror-typerighter.
The Typerighter Rule Manager produces a JSON artefact (stored in S3) which is ingested by the Checker service. This artefact represents all the rules in our system, currently including user-defined regex rules, user-defined LanguageTool pattern rules (defined as XML) and LanguageTool core rules (pre-defined rules from LanguageTool). Historically, rules were derived from a Google Sheet rather than from the Rule Manager.
Each rule in the service corresponds to a Matcher that receives the document and passes back a list of RuleMatch. We have the following Matcher implementations:
- RegexMatcher uses regular expressions
- LanguageToolMatcher is powered by the LanguageTool project, and uses a combination of native LanguageTool rules and user-defined XML rules as its corpus
Each match contains the range it applies to, a description of why the match has occurred, and any relevant suggestions – see the RuleMatch interface for the full description.
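To make the shape of this contract concrete, here is a minimal, self-contained Scala sketch. The names SketchMatch, SketchMatcher and SimpleRegexMatcher, and all of their fields, are illustrative assumptions rather than Typerighter's actual Matcher or RuleMatch definitions; the real RuleMatch interface remains the authoritative description.

```scala
import scala.util.matching.Regex

// Hypothetical, simplified stand-in for RuleMatch: the range the match
// applies to, why it occurred, and any suggested replacements.
final case class SketchMatch(
    from: Int,
    to: Int,
    message: String,
    suggestions: List[String]
)

// Hypothetical stand-in for the Matcher trait: receives the document text
// and passes back a list of matches.
trait SketchMatcher {
  def check(text: String): List[SketchMatch]
}

// A regex-backed matcher for a single rule (an assumption for illustration,
// not the real RegexMatcher).
final class SimpleRegexMatcher(
    pattern: Regex,
    message: String,
    suggestion: Option[String]
) extends SketchMatcher {
  def check(text: String): List[SketchMatch] =
    pattern
      .findAllMatchIn(text)
      .map(m => SketchMatch(m.start, m.end, message, suggestion.toList))
      .toList
}

object MatcherSketch {
  def main(args: Array[String]): Unit = {
    val rule = new SimpleRegexMatcher(
      "(?i)\\bteh\\b".r,
      "Possible typo",
      Some("the")
    )
    // Prints one SketchMatch per occurrence of "teh" in the text.
    rule.check("Fix teh typo in teh text").foreach(println)
  }
}
```

Keeping the trait down to a single method that maps text to matches is what would let very different engines (regular expressions, LanguageTool, or something else) sit behind the same interface.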
- Rule owner: a person responsible for maintaining the rules that Typerighter consumes.
- Rule user: a person checking their copy with the checker service.
The system consists of two Scala services:
- The rule-manager service, which is responsible for the lifecycle of Typerighter's corpus of rules, and publishes them as an artefact
- The checker service, which consumes that artefact and responds to requests to check copy against the corpus of rules with matches.
They're arranged like so:
flowchart LR
checker[Checker service]
manager[Manager service]
sheet[Google Sheet]
client[Typerighter client]
s3[(typerighter-rules.json)]
db[(Postgres DB)]
owner{{Rule owner role}}
user{{Rule user role}}
sheet--"Get rules"-->manager
manager--"Write rules"-->db
db--"Read rules"--> manager
manager--"Write rule artefact"-->s3
checker--"Read rule artefact"-->s3
client--"Request matches"-->checker
owner-."Force manager to re-fetch sheet".->manager
user-."Request document check".->client
owner-."Edit rules".->sheet
Typerighter is built to manage document checks of every kind, including checks that we haven't yet thought of. To that end, each running checker service instantiates a MatcherPool, which is responsible for managing incoming checks, including parallelism and backpressure, and for ensuring that checks are given to the appropriate matchers.
A MatcherPool accepts any matcher instance that satisfies the Matcher trait. The two core Matcher implementations are RegexMatcher, which checks copy with regular expressions, and LanguageToolMatcher, which checks copy with an instance of JLanguageTool. The MatcherPool is excited to accommodate new matchers in the future! Here's a diagram to illustrate:
flowchart TD
CH(["Check requests"])
MP-."matches[]".->CH
MP[MatcherPool]--has many--->MS
CH-.document.->MP
subgraph MS[Matchers]
R[RegexMatcher]
L[LanguageToolMatcher]
F[...FancyHypotheticalAIMatcher]
end
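The fan-out in the diagram above can be sketched in a few lines of Scala. This is a toy model under stated assumptions, not the real MatcherPool: PoolMatch, PoolMatcher, SketchMatcherPool and the maxInFlight parameter are hypothetical names, and backpressure is approximated here with a simple in-flight counter that rejects new checks when the pool is busy, which may differ from how Typerighter actually handles it.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical, simplified stand-ins for the real Matcher and RuleMatch types.
final case class PoolMatch(matcherId: String, from: Int, to: Int, message: String)

trait PoolMatcher {
  def id: String
  def check(text: String): List[PoolMatch]
}

// Illustrative pool: fans one document out to every registered matcher in
// parallel, and rejects new checks once maxInFlight checks are running.
final class SketchMatcherPool(matchers: List[PoolMatcher], maxInFlight: Int)(
    implicit ec: ExecutionContext
) {
  private val inFlight = new AtomicInteger(0)

  def check(text: String): Future[List[PoolMatch]] =
    if (inFlight.incrementAndGet() > maxInFlight) {
      // Crude backpressure: refuse the work rather than queue it unboundedly.
      inFlight.decrementAndGet()
      Future.failed(new IllegalStateException("Pool is at capacity"))
    } else {
      Future
        .traverse(matchers)(m => Future(m.check(text))) // one task per matcher
        .map(_.flatten)                                 // merge all matches
        .andThen { case _ => inFlight.decrementAndGet() }
    }
}

object PoolSketch {
  // A trivial matcher that flags the first occurrence of a literal word.
  def literal(word: String): PoolMatcher = new PoolMatcher {
    val id = s"literal-$word"
    def check(text: String): List[PoolMatch] = {
      val i = text.indexOf(word)
      if (i < 0) Nil
      else List(PoolMatch(id, i, i + word.length, s"Found '$word'"))
    }
  }

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    val pool =
      new SketchMatcherPool(List(literal("teh"), literal("alot")), maxInFlight = 4)
    val matches = Await.result(pool.check("teh cat sleeps alot"), 5.seconds)
    matches.foreach(println)
  }
}
```

Because the pool only depends on the trait, adding a new kind of matcher in this sketch means adding one more element to the list passed to the constructor.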
Both the Checker and Rule Manager services are built in Scala with the Play framework. Data in the Rule Manager is stored in a Postgres database, queried via ScalikeJDBC.
Google credentials are fetched from SSM using AWS Credentials or Instance Role.
It's worth noting that, at the moment, there are a fair few assumptions built into this repository that are Guardian-specific:
- We assume the use of AWS cloud services, and default to the eu-west-1 region. This is configurable on a per-project basis with the configuration parameter aws.region.
- Building and deployment is handled by riff-raff, the Guardian's deployment platform.
- Configuration is handled by simple-configuration.
We'd be delighted to participate in discussions, or consider PRs, that aim to make Typerighter easier to use in a less institutionally specific context.
The prosemirror-typerighter plugin provides an integration for the Prosemirror rich text editor.
If you'd like to provide your own integration, this service will function as a standalone REST platform, but you'll need to use pan-domain-authentication to provide a valid auth cookie with your requests.
LanguageTool has core rules that we use, and as we upgrade LT, these could change underneath us.
There's a script to see if rules have changed as a result of an upgrade in ./script/js/compare-rule-xml.js.
Prettier is installed in the client app using the Guardian's recommended config. To format files, you can run npm run format:write. A formatting check will run as part of CI.
To configure the IntelliJ Prettier plugin to format on save see the guide here. To configure the VS Code Prettier plugin see here.
Typerighter uses Scalafmt to ensure consistent formatting across all Scala files.
To format all files, you can run sbt scalafmtAll
To confirm all files are formatted correctly, you can run sbt scalafmtCheckAll
You can configure your IDE to format Scala files on save according to the rules defined in .scalafmt.conf
For IntelliJ, there are guides to setting up automated formatting on save here and here. For Visual Studio Code with Metals, see here.
The project contains a pre-commit hook which will automatically run the Scala formatter on all staged files. To enable this, run ./script/setup from the root of the project.
Sometimes it's useful to connect to the databases running in AWS to inspect the data locally.
We can use ssm-scala to create an SSH tunnel that exposes the remote database on a local port. For example, to connect to the CODE database, we can run:
ssm ssh -x -t typerighter-rule-manager,CODE -p composer --rds-tunnel 5000:rule-manager-db,CODE
You should then be able to connect to the database on localhost:5000. You'll need to use the username and password specified in AWS Parameter Store at /${STAGE}/flexible/typerighter-rule-manager/db.default.username and db.default.password.
Don't forget to kill the connection once you're done! Here's a handy one-liner (note the colon before the port, which lsof needs in order to treat the value as a port): kill $(lsof -ti :{PORT_NUMBER})