Clean up a number of small issues and improve error checking by afontenot · Pull Request #205 · thisisparker/xword-dl

afontenot · 2024-08-07T02:13:34Z

I'm piling a large number of commits here that basically just improve code quality, with no (intended) changes in behavior (other than fixing several probable bugs caused by typos).

Summary:

When methods of urllib.parse are used, it's incorrect to only import urllib and in fact this only worked because importing the request module fixed it
Fixed several easily overlooked issues, e.g. typo == for =, missing reference to self on a method, accidentally using an undefined setting key with no default value (which happened to work because None is falsey, and False was the default behavior). Removed some code that looked broken and dead as well.
Clean up cases where .get() was used as a non-failing way to get the value for some key on an object, but then the result immediately fails with a type error if .get() returns None. These errors are harder to understand and debug than the clear KeyError: 'key' or equivalent.
Clean up a ton of cases where a subclass implementation had a different API for a method than its parent, causing breakage if something called a method on an instance of the class while assuming the parent API.
Try to improve the error handling by dealing with all the cases where type annotations in dependencies indicated that we do not deal with all error conditions. BeautifulSoup is very bad about silently running None and depending on the situation an error like this may not be caught for many lines, making debugging more difficult.

The last of these changes is the most involved, and it's possible I went a little too far to make the type checker happy, e.g. dealing with cases that are impossible (I think!) like soup.find("TagName") returning a NavigableString. There are also a few cases where existing error checking (with AttributeError) which did look sufficiently robust was replaced with a solution that the type checker could analyze instead. If need be some of these changes could be reverted.

afontenot · 2024-08-07T02:14:58Z

I realize reviewing this is a big ask. No hurry of course. Also, (a) I'm happy to help with anything I can in getting this in a mergeable state, and (b) I'm happy to see selected commits from this branch cherry-picked into main, and I can rebase my changes and force push here once that happens.

thisisparker · 2024-08-07T18:02:21Z

This is amazing and greatly appreciated. I do want to review it with some care but I anticipate I will probably merge all these commits, maybe with some very minor changes.

(It probably goes without saying, but this project started with pretty humble ambitions and my skills have grown along with it! With the exception of a pretty major refactor a few years back, it hasn't gotten this kind of holistic review, and I am very grateful for it coming from a pair of "outside" eyes.)

afontenot · 2024-08-09T05:35:04Z

Great! Was also meaning to ask whether there is a particular code style you prefer to use for this repository? There seems to be a mix of styles from what I've seen, but I'd like to clean up at least the code I've touched for consistency. If you don't have a preference my go-to is ruff format with ruff.

afontenot · 2024-08-11T03:17:28Z

I've added a few more commits. Other than the usual cleaning up of nits I've noticed, there's the following two significant changes:

I refactored the behavior of features like authentication and embedded URLs so that xword_dl.py is entirely neutral with respect to downloader plugins. Any plugin that offers the requisite features will be allowed to use the authentication option, for example.
Building on this, I refactored the way importing plugins worked pretty significantly, in a way that gets rid of a lot of the introspection required previously. Now the downloader package only makes public the function get_plugins(), and (for now) xword_dl.py uses this to populate a list of the available plugin classes. With this change the downloader package's __init__.py is much cleaner (IMHO), and the none of the code needs the inspect module any more. (As a result of this and some of the small changes, Pyright issues zero complaints for xword-dl now.)

I think it would be quite nice to try to refactor some of the download logic out to a separate file, if you're okay with that idea. My idea for how to structure this is to have only main() in xword_dl.py (along with helper functions), and have main() instantiate a Downloader class. With this class, we can make the importing of plugins more explicit (instead of a file global), and centralize functions like filtering downloader plugins that are currently spread throughout the file.

Other than a few small issues (e.g. the single-file utils package with a compressed namespace), the project is linter clean as well! The only other significant change I wanted was a little more error checking around requests.

afontenot · 2024-08-11T21:44:38Z

Note to self: need to fixup all the 3.10+ style typing annotations type | type to use Union instead before this gets merged.

afontenot · 2024-08-26T01:27:51Z

Just FYI, I'm pushing pause on this until you have time to review the changes (no pressure). There's already way more than enough here. I think the headline feature is that the codebase now passes pylint and pyright other than a couple of nits I was unable to fix without heavy refactoring.

I still plan to make the type annotations Python 3.9 compatible before this gets merged, but that fixup can happen after review.

The one significant future change I think would be helpful to add to this would be factoring the puzzle handling code out of xword_dl.py and into its own file. I'd like to make it possible to plug in alternative file formats, like ipuz (libipuz), by implementing the same puzzle API as puzpy.

Many files import urllib, but this is a package that contains the module `urllib.parse`. This works, when it does (I haven't checked every case) because `requests` is also imported, and it imports the module correctly, which makes it available to xword-dl under the urllib namespace.

Because of the way the settings dict is populated, if the preserve_html option is not provided, the key will not exist. `settings.get()` usually works because `None` is false-y, but has the potential to fail unexpectedly

Clean up some uses of get where the code either immediately assumes that the result is not None or else provides a replacement value, e.g. through a ternary expression. Keeping this clean makes error messages more debuggable when unexpected errors aren't caught.

Many sub-classes modify their parents' methods and alter the allowed arguments. This is a problem because if a function call is written on the API of the parent class, it will result in an error if it uses keyword arguments unavailable in the child class, and may fail or yield unexpected results if it uses positional arguments in a different order. Even when this can be guaranteed not to be the case through analysis of the code, it's still a bad practice that can break newly added code unexpectedly. I've fixed all of these issues that I could find.

This is mainly just catching and dealing with type mismatch errors reported by Pyright. This catches a lot of cases where HTML parsing can fall through, but not all of them, e.g. IndexError and KeyError are not type errors. (But failing to prove an optional type has a value is.) How you deal with these issues is ultimately a matter of some preference. In two or three places, I removed a perfectly adequate check that looked for an AttributeError, since exception catching doesn't satisfy the type checker. But this commit could be revised to leave these checks in or even do more of them. A follow up to this should probably add checks on the result of every `request.get` and `request.post`, as I only did that here where it was convenient.

These lines are an implementation of `guess_date_from_id`, which is supposed to turn a puzzle ID into a `datetime`, but they can't possibly have worked because you can't call `datetime.strftime` that way, and in any case the IDs now appear to be random hashes without date information.

This commit changes how importing the downloaders happens slightly. The point is to avoid making any assumptions in xword_dl.py about what plugins are available and what methods they have. This has a number of advantages, for example, that it's trivial for someone to add a downloader plugin that requires authentication without adding another special case to the main control flow. This also happens to make code checkers happy as a side effect, because they can't determine whether the plugin classes are in the downloader namespace or not without executing the downloader __init__ code. The work in this commit is prefatory to an additional refactor which would create a generic Downloader class that automatically finds plugins and abstracts the control flow away from the application logic in xword_dl.py.

Gets rid of a lot of the complicated introspection required to pick up the available downloader classes, and also moves all of this logic out of xword_dl.py. This change could potentially be amended to add the plugins to the downloader namespace as before, but after this change along with the last several, this has no real benefit and leaves the namespace cluttered.

With this commit the entire repository is Pyright clean!!

Simplifies __get_subclasses(), and adds type annotations to important methods in __init__.py, xword_dl.py, and utils.py. The result of this work was to reveal a number of issues with API inconsistency, which this commit fixes as well.

More appropriate as this function isn't specific to the downloader types.

afontenot · 2025-04-03T23:48:02Z

I still plan to make the type annotations Python 3.9 compatible before this gets merged

This might no longer be needed, Python 3.9 will be EOL in October.

thisisparker · 2025-05-27T17:17:31Z

I would like to merge this and I'm very sorry it took so long! (I'm glad you saw that that I'm back in it, since I've been at Recurse Center for the past few months.) The only merge conflicts are around the Puzzmo downloader, which I think I can resolve by just linting my newer version.

It feels funny to ask this now, but do you have any issues with my merging this as-is? Obviously I wouldn't have needed to ask last year but it's been so long I feel like I want to give you a chance to chime in!

thisisparker · 2025-05-27T17:35:39Z

Actually I have two questions, if you don't mind!

the __get_subclasses function in __init__.py has [T] which I think is a type parameter list, new in 3.12? Sorry, I'll admit I'm not that fluent with the type hinting stuff. As it stands, the automated status checker is choking on it using 3.11. What do you think would be the best way to resolve that?
again, have to plead some ignorance here, but if you'd be willing to explain the move to classmethod for existing staticmethods I would love to hear!

afontenot · 2025-05-27T22:15:25Z

which I think is a type parameter list, new in 3.12?

Yes, this is just the Python 3.12 way of doing a generic function, which it seems I used by accident. I'll add a commit fixing this.

again, have to plead some ignorance here, but if you'd be willing to explain the move to classmethod for existing staticmethods I would love to hear!

Sure! The idea is that because all these downloader classes are subclasses of BaseDownloader, their methods need to have the same method signatures as the method they're overriding from the base class. This is important to type checkers and just generally seems like a good policy.

Because I tried to keep this PR purely a refactor and type checking group of commits, changing all the signatures to use @classmethod was forced because you already use @classmethod in some of the downloaders, e.g.

xword-dl/xword_dl/downloader/simplydailydownloader.py

Lines 23 to 27 in cb4ad4b

    
           @classmethod 
        
           def matches_url(cls, url_components): 
        
                return (cls.website in url_components.netloc and 
        
                       f'/{cls.url_subdir}/' in url_components.path)

Of course, it wouldn't be too hard to change this to avoid using @classmethod, but when I was thinking about it at the time, it made a lot of sense to me that someone writing a matches_url function for a downloader might want to have access to class attributes. On the other hand, if you'd rather not take this approach, I'm happy to change it.

It feels funny to ask this now, but do you have any issues with my merging this as-is?

Well, I think there's a non-zero chance of breaking something on a PR this big - I haven't tested it in a while and there have been some commits since then. But if you're not planning a release in the next couple of weeks I think it makes sense to merge it and use any breakage we find to write additional tests. If anything making the program more strict (and type checker clean) should help reduce this kind of breakage in the future, and maybe you can add a type checker to the CI. I'm excited about being able to use these changes as a base for other work, especially #204.

That said, let me add that fix for 3.11 type checking before you merge anything. I'm going to force push as well so I can change the author name 😅.

afontenot · 2025-05-27T22:49:38Z

Okay, I think this is good to go on my end. Only remaining issue is the puzzmo conflict, and maybe the CI will find something.

thisisparker · 2025-05-28T16:39:10Z

I spotted one teeny thing with the keyword listing, and it looks like the CI is stumbling on something having to do with the matches_url changes, on some but not all outlets. I'm going to take a look, but do you know why that might be?

thisisparker · 2025-05-28T16:56:31Z

OK I've got a clearer picture of it. I think it's failing on every test that uses the by_url path, and I think it's because BaseDownloader is being included as a supported_outlet when command_only=False. I think this is probably best fixed in downloader/__init__.py's get_plugins, which I'm looking at now!

thisisparker · 2025-05-28T17:15:27Z

More context as I'm figuring it out: I think the issue is that the supported_downloaders comprehension wants to filter out downloaders that have not implemented their own matches_url function. Instead, it's matching everything because everything is inheriting the matches_url from BaseDownloader. I think I have a fix but I want to make sure it passes muster...

… implementation

thisisparker · 2025-05-28T17:40:20Z

Alright, how'd I do? CI won't mark as "passing" because it doesn't have access to my secrets in your fork, but I can independently confirm it works on NYT puzzles.

I did this in the way I thought was neatest, which was adding another filter to the get_supported_outlets function. It is a little finicky to make sure there's a subclass implementation of the function but I think this works alright! I also ran it through ruff format, which I agree, would be nice to implement across the repo.

afontenot

Checking the __func__ attribute seems not very clean to me, but I'm not seeing any obvious way to do better than that. I've left some comments, here's a commit (different branch) that implements the changes I'm suggesting: afontenot@9114ef3

I can pick it over to this branch if you want. If so I'll run ruff format on the changes first.

afontenot · 2025-05-29T02:48:39Z

xword_dl/xword_dl.py

-        for d in get_supported_outlets(command_only=False)
-        if hasattr(d, 'matches_embed_url')
-    ]
+    supported_downloaders = get_supported_outlets(command_only=False, matches_url=True)


This introduces a bug I think - you replace checking for matches_embed_url with a check for matches_url.

afontenot · 2025-05-29T02:54:01Z

xword_dl/xword_dl.py

 from puz import Puzzle

-from .downloader import get_plugins
+from .downloader import get_plugins, __bd


I don't like this, that __bd is being re-exported from the init file, where it's used for a specific purpose. You could do

from .downloader.basedownloader import BaseDownloader

instead.

afontenot · 2025-05-29T02:57:50Z

xword_dl/xword_dl.py



-def get_supported_outlets(command_only=True):
+def get_supported_outlets(command_only=True, matches_url=False):


This approach seems error prone. These parameters behave like filters, but the way you're applying them results in the first selected filter being used and the rest ignored. I think it would make more sense to build a list of plugins after applying all the filters and return that. That way no one will come along in the future and introduce a bug by passing the wrong arguments.

afontenot · 2025-05-29T03:00:20Z

xword_dl/xword_dl.py

-def get_supported_outlets(command_only=True):
+def get_supported_outlets(command_only=True, matches_url=False):
    if command_only:
        return [d for d in plugins if hasattr(d, 'command') and d.command]


nit: because I introduced command, matches_url, etc to the BaseDownloader class for type checking reasons, it makes sense to remove the hasattr checks. These attributes will always be defined for all plugins, in fact that's the cause of the issue we're trying to fix here.

afontenot · 2025-05-29T03:37:28Z

xword_dl/xword_dl.py

        return [d for d in plugins if hasattr(d, 'command') and d.command]
+    if matches_url:
+        return [d for d in plugins if hasattr(d, 'matches_url')
+                and getattr(d.matches_url, '__func__', None) is not getattr(__bd.matches_url, '__func__', None)]


I believe the fallback condition None is unnecessary here because unless the plugin overrides the method incorrectly, __func__ should be defined.

thisisparker · 2025-05-30T01:17:40Z

I agree with all the notes, thank you for your attention on this! Please do pick over that change with the ruff format and then I think we're good to go.

Up next for me is a bunch of chores related to CI and packaging and similar, so hopefully we can get that ruff format stuff applied across the board!

Thanks again :)

afontenot · 2025-05-30T05:23:47Z

Okay! I think this is good.

…arker#205) * Fix import of urllib.parse module Many files import urllib, but this is a package that contains the module `urllib.parse`. This works, when it does (I haven't checked every case) because `requests` is also imported, and it imports the module correctly, which makes it available to xword-dl under the urllib namespace. * Fix reference to undefined setting Because of the way the settings dict is populated, if the preserve_html option is not provided, the key will not exist. `settings.get()` usually works because `None` is false-y, but has the potential to fail unexpectedly * fix reference to undefined method * Avoid use of .get() when existence is immediately assumed Clean up some uses of get where the code either immediately assumes that the result is not None or else provides a replacement value, e.g. through a ternary expression. Keeping this clean makes error messages more debuggable when unexpected errors aren't caught. * Fix API consistency issues Many sub-classes modify their parents' methods and alter the allowed arguments. This is a problem because if a function call is written on the API of the parent class, it will result in an error if it uses keyword arguments unavailable in the child class, and may fail or yield unexpected results if it uses positional arguments in a different order. Even when this can be guaranteed not to be the case through analysis of the code, it's still a bad practice that can break newly added code unexpectedly. I've fixed all of these issues that I could find. * Add considerably more error checking This is mainly just catching and dealing with type mismatch errors reported by Pyright. This catches a lot of cases where HTML parsing can fall through, but not all of them, e.g. IndexError and KeyError are not type errors. (But failing to prove an optional type has a value is.) How you deal with these issues is ultimately a matter of some preference. In two or three places, I removed a perfectly adequate check that looked for an AttributeError, since exception catching doesn't satisfy the type checker. But this commit could be revised to leave these checks in or even do more of them. A follow up to this should probably add checks on the result of every `request.get` and `request.post`, as I only did that here where it was convenient. * fix erroneous equality check used in place of an assignment * Remove some incorrect (and probably dead) code These lines are an implementation of `guess_date_from_id`, which is supposed to turn a puzzle ID into a `datetime`, but they can't possibly have worked because you can't call `datetime.strftime` that way, and in any case the IDs now appear to be random hashes without date information. * handle error a bit more robustly * remove unused imports * avoid unnecessary 'import *' behavior * remove unused variables and dead code * make all exception catching explicit and refactor * Refactor imports of downloader plugins This commit changes how importing the downloaders happens slightly. The point is to avoid making any assumptions in xword_dl.py about what plugins are available and what methods they have. This has a number of advantages, for example, that it's trivial for someone to add a downloader plugin that requires authentication without adding another special case to the main control flow. This also happens to make code checkers happy as a side effect, because they can't determine whether the plugin classes are in the downloader namespace or not without executing the downloader __init__ code. The work in this commit is prefatory to an additional refactor which would create a generic Downloader class that automatically finds plugins and abstracts the control flow away from the application logic in xword_dl.py. * Make import of downloader modules much cleaner Gets rid of a lot of the complicated introspection required to pick up the available downloader classes, and also moves all of this logic out of xword_dl.py. This change could potentially be amended to add the plugins to the downloader namespace as before, but after this change along with the last several, this has no real benefit and leaves the namespace cluttered. * remove unnecessary import in __init__.py * Simplify overcomplicated use of hasattr With this commit the entire repository is Pyright clean!! * fix issue where refactor broke authentication * Additional typing improvements Simplifies __get_subclasses(), and adds type annotations to important methods in __init__.py, xword_dl.py, and utils.py. The result of this work was to reveal a number of issues with API inconsistency, which this commit fixes as well. * Make __get_subclasses generic More appropriate as this function isn't specific to the downloader types. * type checking: use Python 3.11 compatible generic signature * fix typo * check for non-empty commands in help text * correctly filter out downloaders that have no independent matches_url implementation * ruff format * update min version to 3.10 because of typing union syntax * plugin feature detection: clean up method override checks --------- Co-authored-by: Parker Higgins <parker@parkerhiggins.net>

afontenot force-pushed the cleanups branch from a621ee0 to 3a09b2c Compare August 7, 2024 02:39

afontenot force-pushed the cleanups branch from ea1091c to 4931763 Compare August 11, 2024 21:48

afontenot added 20 commits April 3, 2025 19:20

Fix reference to undefined setting

07436f0

Because of the way the settings dict is populated, if the preserve_html option is not provided, the key will not exist. `settings.get()` usually works because `None` is false-y, but has the potential to fail unexpectedly

fix reference to undefined method

9d46578

fix erroneous equality check used in place of an assignment

f7a9ff8

handle error a bit more robustly

bb06551

remove unused imports

511af77

avoid unnecessary 'import *' behavior

5c4afff

remove unused variables and dead code

c88309b

make all exception catching explicit and refactor

f6a6e34

remove unnecessary import in __init__.py

2953b49

Simplify overcomplicated use of hasattr

bee4ed4

With this commit the entire repository is Pyright clean!!

fix issue where refactor broke authentication

7fd4f82

Additional typing improvements

7b6d538

Simplifies __get_subclasses(), and adds type annotations to important methods in __init__.py, xword_dl.py, and utils.py. The result of this work was to reveal a number of issues with API inconsistency, which this commit fixes as well.

Make __get_subclasses generic

086a637

More appropriate as this function isn't specific to the downloader types.

afontenot force-pushed the cleanups branch from 458bcb6 to d2d757a Compare April 3, 2025 23:35

afontenot mentioned this pull request Apr 3, 2025

Refactor creation of puz files and add optional PUZv2 support #204

Open

type checking: use Python 3.11 compatible generic signature

cc31675

afontenot force-pushed the cleanups branch from d2d757a to cc31675 Compare May 27, 2025 22:43

afontenot and others added 3 commits May 27, 2025 18:51

fix typo

d07b922

check for non-empty commands in help text

262206e

Merge branch 'main' into cleanups

c7a82fa

thisisparker added 2 commits May 28, 2025 13:35

correctly filter out downloaders that have no independent matches_url…

5c3bee0

… implementation

ruff format

8d83dfd

update min version to 3.10 because of typing union syntax

c35d77d

afontenot commented May 29, 2025

View reviewed changes

plugin feature detection: clean up method override checks

a43d753

thisisparker merged commit cb08343 into thisisparker:main May 30, 2025
1 check failed



		def get_supported_outlets(command_only=True):
		def get_supported_outlets(command_only=True, matches_url=False):

Conversation

afontenot commented Aug 7, 2024

Uh oh!

afontenot commented Aug 7, 2024

Uh oh!

thisisparker commented Aug 7, 2024

Uh oh!

afontenot commented Aug 9, 2024

Uh oh!

afontenot commented Aug 11, 2024

Uh oh!

afontenot commented Aug 11, 2024

Uh oh!

afontenot commented Aug 26, 2024

Uh oh!

afontenot commented Apr 3, 2025

Uh oh!

thisisparker commented May 27, 2025

Uh oh!

thisisparker commented May 27, 2025

Uh oh!

afontenot commented May 27, 2025

Uh oh!

afontenot commented May 27, 2025

Uh oh!

thisisparker commented May 28, 2025

Uh oh!

thisisparker commented May 28, 2025

Uh oh!

thisisparker commented May 28, 2025

Uh oh!

thisisparker commented May 28, 2025

Uh oh!

afontenot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

afontenot May 29, 2025

Choose a reason for hiding this comment

Uh oh!

afontenot May 29, 2025

Choose a reason for hiding this comment

Uh oh!

afontenot May 29, 2025

Choose a reason for hiding this comment

Uh oh!

afontenot May 29, 2025

Choose a reason for hiding this comment

Uh oh!

afontenot May 29, 2025

Choose a reason for hiding this comment

Uh oh!

thisisparker commented May 30, 2025

Uh oh!

afontenot commented May 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

afontenot left a comment •

edited

Loading