2025 Q4 Roadmap

Live tracking of work continues in this GH project board: https://github.com/orgs/edgi-govdata-archiving/projects/32

🔄 = Carry-over from Q3 (#183).


## Overall Status

This year…
- Q1 was spent getting basic systems that had been turned off back online and getting a basic handle again on the infrastructure and projects.
- Q2 was focused on critical automation and a few basic upkeep tasks. I was traveling for half the quarter.
- Q3 felt like things were a bit on pause while I had other professional projects going on. Slow progress.

There is a *lot* of leftover work here. For 2026 Q1, I expect to drop anything left in the “non-critical” section that doesn’t get done this quarter.


## Critical Goals

- [ ] 🔄 Public Tracker.

    A bunch of things about the publishing process for [EDGI’s current public web tracker](https://envirodatagov.org/enviro-fed-web-tracker/) are problematic and need more automation, and the general presentation of it (as a Google Sheet in an iframe, with a separate “about” page) is not great either. We want to solve both these problems with a simple static site that gets generated automatically from a simplified Google Sheet.

- [ ] Address IA needs for web-monitoring-diff: edgi-govdata-archiving/web-monitoring-diff#202 (This may depend on edgi-govdata-archiving/web-monitoring-diff#145; see below.)

- [ ] 🔄 Bring web-monitoring-diff back up to date and off life-support.
    - [ ] Wrap up the small refactor I started two years ago before everything shut down. It’s kind of in the way of other work. (edgi-govdata-archiving/web-monitoring-diff#145)
    - [ ] Replace cChardet, which is blocking us from compatibility with current versions of Python. (edgi-govdata-archiving/web-monitoring-diff#165)
    - [ ] And/or support installations without cChardet. (edgi-govdata-archiving/web-monitoring-diff#196)

- [x] 🔄 Automate checking for and recording network errors.

    In Q1, we added the ability to record these in the DB, but there is no automated system that creates those records. In Q2/Q3 I said this hasn’t been a huge thorn in our sides, but it sure is now! I was previously concerned about recording too many *spurious* errors, but in hindsight, it would have been much better to get this automated and then start to address any problematic cases.

    Finally done. It required a variety of updates to get this across the finish line:
    - https://github.com/edgi-govdata-archiving/web-monitoring-processing/pull/869
    - https://github.com/edgi-govdata-archiving/web-monitoring-db/pull/1289
    - https://github.com/edgi-govdata-archiving/web-monitoring-crawler/pull/6
    - https://github.com/edgi-govdata-archiving/web-monitoring-crawler/commit/7d29830719792ddcd0e3d79b96acab5fd99169e4
    - https://github.com/edgi-govdata-archiving/web-monitoring-crawler/commit/ef0e74502fbf1a93e3f67475f2306cb4d1833f2b
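
For the cChardet items above, one possible shape for “support installations without cChardet” is an optional import with a fallback. This is only a minimal sketch, not web-monitoring-diff’s actual code: the `sniff_encoding` helper and its parameters are hypothetical, and `charset_normalizer` is just one assumed candidate replacement.

```python
# Hypothetical sketch: use an optional detector when installed, otherwise
# fall back to a standard-library-only guess.
import codecs
import importlib

# Byte order marks, longest first (the UTF-32 LE BOM starts with the UTF-16 LE one).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def _load_detector():
    """Return the `detect` function of the first installed detector, or None."""
    for module_name in ("cchardet", "charset_normalizer"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        return module.detect
    return None


DETECT = _load_detector()


def sniff_encoding(raw, detect=DETECT, fallback="windows-1252"):
    """Guess an encoding name for `raw` bytes."""
    # A byte order mark is unambiguous, so check it before guessing.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # Ask the optional detector, if one was importable (or passed in).
    if detect is not None:
        guess = detect(raw).get("encoding")
        if guess:
            return guess.lower()
    # No detector or no answer: accept strict UTF-8, else use the fallback.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback
```

Passing `detect=None` forces the standard-library path, which is also a convenient way to test behavior on installs where neither optional package is available.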


## Non-Critical Goals

- [ ] Add more caching (ElastiCache?) and similar tooling [back] into the stack. We are OK doubling the current spend (maybe more!) on AWS. *Ideally,* we want to get to a place where we are more comfortable sharing this tool widely with the public.

- [ ] Admin tooling for adding/updating pages.
    - [ ] Add a new page/URL. This is more complex than it looks, and I've never gotten around to it, but after adding a bunch for PEDP, it feels more important than in the past.
    - [ ] Add URLs to a page/change canonical URLs/promote redirects to canonical/alternative URLs.
    - [ ] Add/remove tags and maintainers.

    This stuff ties in with possibly merging the UI into the DB/API as one server. I suspect I may start by building out a new/alternative, non-React-based UI with this stuff and go from there.

- [ ] 🔄 Regulation research tooling. TBH this has been in a bit of a holding pattern. That may continue; need to understand how to balance this against web monitoring.

- [ ] 🔄 Clean up docs a bit; try and make them more current.

- [ ] 🔄 Alert analysts of possible new pages (and maybe removed pages?). This has always been a significant need that’s never been solved well. Some ideas we could do independently or together:
    - [ ] Keep track of seen URLs from all the links in all the pages we monitor. (edgi-govdata-archiving/web-monitoring#173)
    - [ ] Keep track of `sitemap.xml` for sites that have them (e.g. `epa.gov`). It would be neat to turn these into a git repo so they are also browsable through time, but that’s not the core need.
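
The `sitemap.xml` idea could look roughly like the sketch below: parse each snapshot of a sitemap into a set of URLs, then compare it with the previous snapshot to surface candidate new and removed pages. It assumes the standard sitemaps.org `<urlset>` format. The function names and example URLs are hypothetical, and fetching, sitemap index files, and storage are all left out.

```python
# Hypothetical sketch: diff two sitemap snapshots to find added/removed URLs.
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def diff_sitemaps(previous, current):
    """Return (added, removed) URL lists between two snapshots, sorted."""
    return sorted(current - previous), sorted(previous - current)


OLD = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.gov/a</loc></url>
  <url><loc>https://example.gov/b</loc></url>
</urlset>"""

NEW = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.gov/a</loc></url>
  <url><loc>https://example.gov/c</loc></url>
</urlset>"""

added, removed = diff_sitemaps(sitemap_urls(OLD), sitemap_urls(NEW))
print("possible new pages:", added)
print("possible removed pages:", removed)
```

The same set difference would work for the “seen URLs from links” idea, with the sets built from links extracted from monitored pages instead of from a sitemap.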


## Other Ideas and Nice-to-Haves

Not top priority, but these things are on the brain.

- **Minimize web-monitoring-ui or even merge it into web-monitoring-db.**

    The way these are split up makes a lot of things very hard to do or work on, makes infrastructure harder (more things to deploy, more CPU + memory requirements), and invites all kinds of weird little problems (CORS, cross-origin caching, login complexity, etc.). We originally designed things as microservices because of the way the team was structured and the skills people had, but that turned out to be over-ambitious in practice (in my opinion). Today, it’s an even more active problem when just one person is maintaining them all.

    We also had all kinds of ambitious ideas about the UI project giving analysts a direct interface to their task lists, being able to post their annotations/analysis directly to the DB, and so on. This never got done, and would require a *lot* more work, both in the UI, and in ways for the analysts to get their data back out or query it, before it would ever be better than the analysts working directly with spreadsheets as they do today. As things stand today, this stuff would be neat, but I don’t think it is ever going to get done.

    If we drop all these ideas, the UI really doesn’t need to be nearly as complex or special as it is. It also doesn’t need its own server. At the simplest, it could be served as a static site (via GH pages, from CloudFront/S3, or even just from the API server as an asset). It could also just be normal front-end code in the web-monitoring-db server, but that requires a lot more rewriting and rethinking (it *does* pave a nicer, more monolithic path back towards including annotations/analysis forms for analysts, though).

    This would be some nice cleanup, but could turn into a big project. So a bit questionable.

- **Consider whether web-monitoring-db should be rewritten in Python, and be more monolithic.** The above stuff about merging away web-monitoring-ui feeds directly into this. Web-monitoring-db is really the odd duck here, written in Ruby and Rails while everything else is Python (or JS if it’s front-end). This was originally done because the first stuff I helped out with at EDGI was Ruby-based, and I thought there was a crew of Ruby folks who would be contributing. That turned out not to be true. I think Rails is fantastic, but the plethora of languages and frameworks here has historically made contributing to this project very hard. Rewriting it in Python would also make it easier to pull other pieces (e.g. the differ, all the import and processing scripts, all the task sheet stuff) together, and would reduce some code duplication.

    I don’t expect this to go anywhere — this project is probably much too big and unrealistic at this point. But I want to log it.

- **Get rid of Kubernetes.** It’s been clear to me for several years now that managing your own Kubernetes cluster is not worthwhile for a project of this size. (I’m not sure it’s worthwhile for any org that cannot afford a dedicated (dev)ops/SRE person to own it.) Managed Kubernetes (AWS EKS, Google GKE, etc.) is better, but also still tends to be more complicated and obtuse than an infrastructure provider’s own stuff (e.g. AWS ECS+Fargate).

    This is also a big project on its own that probably won’t happen. Additionally, it’s possible it could be more expensive than the current situation (we have our services very efficiently and tightly packed into 3 EC2 instances, and you can’t make decisions that are quite as granular on ECS, for example), although there are other management tradeoffs.

    Note that a simplified, more monolithic structure as discussed above also makes it easier to run this project on other systems/services/infrastructure types. BUT we are probably somewhat coupled to AWS at this point, where all our data is.
