Importer project cleanup #41

kellnerd · 2023-09-11T16:34:20Z

Continuation of #40, depends on metabrainz/bookbrainz-data-js#309

Most commits are only deleting unused stuff, so this one should be easy to review.
Additionally I've fixed a rare bug in ca8f751, improved a few types and resolved a few linter errors. (Now that the outdated legacy code is gone, the eslint output finally became clearer as it contained less irrelevant messages.)

As you can see above, I haven't moved any files yet as I'm still unsure what would be a good folder structure. Until we have found one, I would prefer not to move any files as this only makes it very hard to trace the git history of a file. 8965604 is still annoying me from time to time in this regard. (I wonder why it was decided to have two unrelated project in one repository, that has tripped me a few times already when I accidentally ran npm or yarn inside the root directory...)

While I haven't encountered one of these previously, they cause an infinite retry loop without this patch. Strangely enough the logged error was an author validation error, although these do not get thrown... I was not able to reproduce this error with the same sample (OL100223A): > Invalid Name section - Invalid name section name null > Invalid name section sort name null > Invalid name section language null

Delete type definitions which have been moved to bookbrainz-data, see metabrainz/bookbrainz-data-js@7679731

The exploration tool was used to collect statistics about available JSON attributes in OL dumps. If we ever need this again, it will be easier to rewrite it from scratch.

Also uninstall the now unused `async` package.

Now that the outdated legacy code is gone, the eslint output finally became clearer as it contained less irrelevant messages.

The config.sample.json file and the importer README are reasonably self-explanatory for now.

MonkeyDo · 2023-09-12T17:09:42Z

Looks all good so far !

I agree with moving files in another PR. Happy to have an IRC chat to figure out folder structure if you want.

Convert identifier mapping into an enum?

It might indeed make sense to do that.
Looking at you, identifier-links.js
We could deploy that as a minor version of the ORM after deploying the TS cleanup major version, if that works for you.

kellnerd · 2023-09-12T18:29:55Z

Happy to have an IRC chat to figure out folder structure if you want.

Let's postpone that until I've got the OpenLibrary importer (including relationships) working, then I will have a better understanding how complex each producer implementation will be and how many common code is left which does not belong into bb-data.
Currently it's hard to make a call whether we should have a simple "one flat folder per producer" hierarchy and simply move queue.ts into helpers/ or its own folder.
The only thing which I'm certain about is that it would make sense to have only executable scripts like bbiq.ts directly inside the src/ folder.

We could deploy that as a minor version of the ORM after deploying the TS cleanup major version, if that works for you.

Since the identifier mapping enum will be used by both bb-site and bb-utils, it should definitely live in bb-data in the end.
I agree with having that in a later ORM release only as it's not that relevant for my GSoC project.
So far I've only done the refactorings which also improve DX for GSoC and only kept a list of all the other things that I've noticed which could also be improved (there are quite a lot).

MonkeyDo · 2023-09-13T09:31:43Z

So does that mean that work on this PR is done, and it's ready for final review and merging?

kellnerd · 2023-09-13T12:26:19Z

Yes, I would say the work is done.
After metabrainz/bookbrainz-data-js#309 has been merged and the new version has been released on npm, I will do a version bump commit here and then it can be merged.

MonkeyDo

I set up everything from scratch following the instructions in the readme, and successfully produced messages for 100 authors, then started the consumer which successfully consumed 100 messages.
Everything is working fine and the code looks much cleaner, nice work !

Even though I get errors on insertion in the DB, the tool works great and the CLI is very usable (testing the failure queue while I was at it).

IMO the default mechanism should be to post failures to the failure queue, with the ability to pass an option to change the failure queue name or disable it. (probably fine to change this in a separate PR once we've used the tool a bit more, maybe my opinion was actually 💩 )

kellnerd · 2023-10-19T17:23:58Z

Good to hear that you have successfully tested it, with 100 messages even 😮
So far I've only tested with sets of a dozen authors because I still think there is too much redundant logging making the output hard to parse.

Most of the current errors seem to be the result from languages which are missing in the BB (sample) database but are present in MB. The reason for these rare languages appearing in our sample data at all is the bad performance of the language detection for author aliases, which are way too short to reliably do that. I think we should rather leave the alias language blank.

I agree that changing the defaults to have a failure queue makes sense. Better store these messages by default than losing them by default. I think the primary motivation for the current default is that the CLI parameter was easier to implement that way 😁

MonkeyDo · 2023-10-19T17:34:28Z

Oh, I tested the producer with more previously but didn't run the importer on it:

Indeed the logs are quite verbose, and indeed all I see are language-related issues.
I don't think the alias language can be null though :/ Not sure how to deal with that.

As for the CLI option, how about allowing both naming and disabling with the same option with --failure-queu my_failure_queue and --failure-queu false?
Or do you prefer a separate --no-failure-queu option instead (incompatible with the --failure-queu option)?

kellnerd added 11 commits August 7, 2023 21:22

other(ts): define a type for OpenLibrary entity types

f4be6bf

refactor(ts): prefer shared types from bookbrainz-data

15abafc

Delete type definitions which have been moved to bookbrainz-data, see metabrainz/bookbrainz-data-js@7679731

chore: delete old OL producer CLI which has been superseded

320e27d

chore: delete old OpenLibrary exploration tool

ee47aa9

The exploration tool was used to collect statistics about available JSON attributes in OL dumps. If we ever need this again, it will be easier to rewrite it from scratch.

chore: delete old consumer CLI which has been superseded

6d761fd

chore: delete buggy old queue implementation which is now unused

bf7ab53

chore: delete unused asyncCluster which never worked as expected

f118cce

Also uninstall the now unused `async` package.

chore: resolve a few relevant linter errors

81b3a33

Now that the outdated legacy code is gone, the eslint output finally became clearer as it contained less irrelevant messages.

chore: drop npm scripts which depend on now deleted files

725db0e

chore: delete outdated config file README

bcf47cd

The config.sample.json file and the importer README are reasonably self-explanatory for now.

kellnerd mentioned this pull request Sep 11, 2023

Refresh importer project #40

Merged

19 tasks

kellnerd added 4 commits October 16, 2023 12:49

refactor(ts): prefer shared type QueuedEntity from bookbrainz-data

bb3dc0a

fix(consumer): follow the latest changes of the ORM

ce6ebcc

chore(deps): update bookbrainz-data to v5.0.0

52212de

fix(consumer): log serialized JSON instead of [object Object]

e6da505

MonkeyDo approved these changes Oct 19, 2023

View reviewed changes

MonkeyDo merged commit 7c3a71e into metabrainz:master Oct 19, 2023

kellnerd deleted the importer-cleanup branch October 22, 2023 16:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Importer project cleanup #41

Importer project cleanup #41

kellnerd commented Sep 11, 2023 •

edited

Loading

MonkeyDo commented Sep 12, 2023

kellnerd commented Sep 12, 2023

MonkeyDo commented Sep 13, 2023

kellnerd commented Sep 13, 2023

MonkeyDo left a comment •

edited

Loading

kellnerd commented Oct 19, 2023

MonkeyDo commented Oct 19, 2023 •

edited

Loading

Importer project cleanup #41

Importer project cleanup #41

Conversation

kellnerd commented Sep 11, 2023 • edited Loading

MonkeyDo commented Sep 12, 2023

kellnerd commented Sep 12, 2023

MonkeyDo commented Sep 13, 2023

kellnerd commented Sep 13, 2023

MonkeyDo left a comment • edited Loading

Choose a reason for hiding this comment

kellnerd commented Oct 19, 2023

MonkeyDo commented Oct 19, 2023 • edited Loading

kellnerd commented Sep 11, 2023 •

edited

Loading

MonkeyDo left a comment •

edited

Loading

MonkeyDo commented Oct 19, 2023 •

edited

Loading