Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Importer project cleanup #41

Merged
merged 15 commits into from
Oct 19, 2023
Merged

Conversation

kellnerd
Copy link
Contributor

@kellnerd kellnerd commented Sep 11, 2023

Continuation of #40, depends on metabrainz/bookbrainz-data-js#309

Most commits are only deleting unused stuff, so this one should be easy to review.
Additionally I've fixed a rare bug in ca8f751, improved a few types and resolved a few linter errors. (Now that the outdated legacy code is gone, the eslint output finally became clearer as it contained less irrelevant messages.)

As you can see above, I haven't moved any files yet as I'm still unsure what would be a good folder structure. Until we have found one, I would prefer not to move any files as this only makes it very hard to trace the git history of a file. 8965604 is still annoying me from time to time in this regard. (I wonder why it was decided to have two unrelated project in one repository, that has tripped me a few times already when I accidentally ran npm or yarn inside the root directory...)

While I haven't encountered one of these previously, they cause an
infinite retry loop without this patch.

Strangely enough the logged error was an author validation error,
although these do not get thrown...
I was not able to reproduce this error with the same sample (OL100223A):

> Invalid Name section - Invalid name section name null
> Invalid name section sort name null
> Invalid name section language null
Delete type definitions which have been moved to bookbrainz-data, see
metabrainz/bookbrainz-data-js@7679731
The exploration tool was used to collect statistics about available JSON
attributes in OL dumps.
If we ever need this again, it will be easier to rewrite it from scratch.
Also uninstall the now unused `async` package.
Now that the outdated legacy code is gone, the eslint output finally
became clearer as it contained less irrelevant messages.
The config.sample.json file and the importer README are reasonably
self-explanatory for now.
@kellnerd kellnerd mentioned this pull request Sep 11, 2023
19 tasks
@MonkeyDo
Copy link
Member

Looks all good so far !

I agree with moving files in another PR. Happy to have an IRC chat to figure out folder structure if you want.

Convert identifier mapping into an enum?

It might indeed make sense to do that.
Looking at you, identifier-links.js
We could deploy that as a minor version of the ORM after deploying the TS cleanup major version, if that works for you.

@kellnerd
Copy link
Contributor Author

Happy to have an IRC chat to figure out folder structure if you want.

Let's postpone that until I've got the OpenLibrary importer (including relationships) working, then I will have a better understanding how complex each producer implementation will be and how many common code is left which does not belong into bb-data.
Currently it's hard to make a call whether we should have a simple "one flat folder per producer" hierarchy and simply move queue.ts into helpers/ or its own folder.
The only thing which I'm certain about is that it would make sense to have only executable scripts like bbiq.ts directly inside the src/ folder.

We could deploy that as a minor version of the ORM after deploying the TS cleanup major version, if that works for you.

Since the identifier mapping enum will be used by both bb-site and bb-utils, it should definitely live in bb-data in the end.
I agree with having that in a later ORM release only as it's not that relevant for my GSoC project.
So far I've only done the refactorings which also improve DX for GSoC and only kept a list of all the other things that I've noticed which could also be improved (there are quite a lot).

@MonkeyDo
Copy link
Member

So does that mean that work on this PR is done, and it's ready for final review and merging?

@kellnerd
Copy link
Contributor Author

Yes, I would say the work is done.
After metabrainz/bookbrainz-data-js#309 has been merged and the new version has been released on npm, I will do a version bump commit here and then it can be merged.

Copy link
Member

@MonkeyDo MonkeyDo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set up everything from scratch following the instructions in the readme, and successfully produced messages for 100 authors, then started the consumer which successfully consumed 100 messages.
Everything is working fine and the code looks much cleaner, nice work !

Even though I get errors on insertion in the DB, the tool works great and the CLI is very usable (testing the failure queue while I was at it).

IMO the default mechanism should be to post failures to the failure queue, with the ability to pass an option to change the failure queue name or disable it. (probably fine to change this in a separate PR once we've used the tool a bit more, maybe my opinion was actually 💩 )

@MonkeyDo MonkeyDo merged commit 7c3a71e into metabrainz:master Oct 19, 2023
@kellnerd
Copy link
Contributor Author

Good to hear that you have successfully tested it, with 100 messages even 😮
So far I've only tested with sets of a dozen authors because I still think there is too much redundant logging making the output hard to parse.

Most of the current errors seem to be the result from languages which are missing in the BB (sample) database but are present in MB. The reason for these rare languages appearing in our sample data at all is the bad performance of the language detection for author aliases, which are way too short to reliably do that. I think we should rather leave the alias language blank.

I agree that changing the defaults to have a failure queue makes sense. Better store these messages by default than losing them by default. I think the primary motivation for the current default is that the CLI parameter was easier to implement that way 😁

@MonkeyDo
Copy link
Member

MonkeyDo commented Oct 19, 2023

Oh, I tested the producer with more previously but didn't run the importer on it:
image

Indeed the logs are quite verbose, and indeed all I see are language-related issues.
I don't think the alias language can be null though :/ Not sure how to deal with that.

As for the CLI option, how about allowing both naming and disabling with the same option with --failure-queu my_failure_queue and --failure-queu false?
Or do you prefer a separate --no-failure-queu option instead (incompatible with the --failure-queu option)?

@kellnerd kellnerd deleted the importer-cleanup branch October 22, 2023 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants