Change GPL dependencies to MIT-compatible counterparts by danielpunkass · Pull Request #90 · thisisparker/xword-dl

danielpunkass · 2023-01-07T19:25:33Z

The Unidecode and html2text dependencies are GPL2 and GPL3 licensed, so they are not compatible with the MIT licensing of this project as a whole. See #61

… ISC, an MIT-compatible license.

…, with html_text, which is MIT licensed.

danielpunkass · 2023-01-07T23:21:51Z

I think this is ready for review now. The needs of xword-dl for both Unidecode and html2text seem light-weight enough that I suspect the replacement MIT-licensed libraries will fit the bill.

thisisparker · 2023-01-17T18:30:42Z

Thank you for the PR! Holding on this just for the moment because I might have to make some changes to how I'm using html2text anyway for #89. I'll keep you posted!

danielpunkass · 2023-01-17T19:42:09Z

Cool, no rush!

afontenot · 2025-06-02T16:47:53Z

@danielpunkass @thisisparker I think it would be really nice to see some version of this merged. I've created a pull request against the merging branch with what I think is an improvement. Rather than reimplementing the entire anyascii function, we can just pass the codepoints of the string that need replacement.

Those familiar with Unicode might be surprised you can do this. It's actually because anyascii itself iterates by codepoints, so this approach at least doesn't introduce any issues that don't already exist with that package. That doesn't mean that there aren't issues, e.g. anyascii("🏳️‍⚧️") returns :flag_white::transgender_symbol: which is not really ideal (and implies that we can't eliminate the demojize function with anyascii).

Because anyascii caches lookups on a module level and iterates by codepoint anyway, it's perfectly efficient to filter out latin-1 codepoints and pass the rest through anyascii.

requirements.txt

xword_dl/downloader/amuselabsdownloader.py

xword_dl/downloader/wsjdownloader.py

requirements.txt

xword_dl/util/utils.py

danielpunkass · 2025-06-02T18:19:05Z

Thanks for the attention to the PR @afontenot ! I will have to get my head back around some of the rationale for the way I approached things but I will review and revise the PR branch soon based on your reviews.

Improve use of anyascii

danielpunkass · 2025-06-02T19:14:49Z

I've updated the branch with most of the tweaks from @afontenot, and merged latest main to make the PR merge easier if/when accepted.

…te the html_text call to extract_text

xword_dl/downloader/wsjdownloader.py

danielpunkass · 2025-06-03T17:57:43Z

There's something wrong with branch that is causing one of the tests to fail:

⭐ Run Main Test New York Times rebus special chars
[Status check on supported outlets/Install xword-dl and test existing outlets]   🐳  docker exec cmd=[bash -e /var/run/act/workflow/27] user= workdir=
| Traceback (most recent call last):
|   File "/opt/hostedtoolcache/Python/3.11.12/arm64/bin/xword-dl", line 8, in <module>
|     sys.exit(main())
|              ^^^^^^
|   File "/opt/hostedtoolcache/Python/3.11.12/arm64/lib/python3.11/site-packages/xword_dl/xword_dl.py", line 292, in main
|     puzzle, filename = by_keyword(args.source, **options)
|                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|   File "/opt/hostedtoolcache/Python/3.11.12/arm64/lib/python3.11/site-packages/xword_dl/xword_dl.py", line 49, in by_keyword
|     puzzle = dl.download(puzzle_url)
|              ^^^^^^^^^^^^^^^^^^^^^^^
|   File "/opt/hostedtoolcache/Python/3.11.12/arm64/lib/python3.11/site-packages/xword_dl/downloader/basedownloader.py", line 85, in download
|     puzzle = self.parse_xword(xword_data)
|              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|   File "/opt/hostedtoolcache/Python/3.11.12/arm64/lib/python3.11/site-packages/xword_dl/downloader/newyorktimesdownloader.py", line 176, in parse_xword
|     puzzle.rebus()
|   File "/opt/hostedtoolcache/Python/3.11.12/arm64/lib/python3.11/site-packages/puz.py", line 308, in rebus
|     return self.helpers.setdefault('rebus', Rebus(self))
|                                             ^^^^^^^^^^^
|   File "/opt/hostedtoolcache/Python/3.11.12/arm64/lib/python3.11/site-packages/puz.py", line 559, in __init__
|     for item in parse_dict(solutions_str).items()
|                 ^^^^^^^^^^^^^^^^^^^^^^^^^
|   File "/opt/hostedtoolcache/Python/3.11.12/arm64/lib/python3.11/site-packages/puz.py", line 764, in parse_dict
|     return dict(p.split(':') for p in s.split(';') if ':' in p)
|            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| ValueError: dictionary update sequence element #0 has length 4; 2 is required

I'll figure out what's going on. Probably some dependency disparity.

afontenot · 2025-06-03T18:40:29Z

There's something wrong with branch that is causing one of the tests to fail:

...

I'll figure out what's going on. Probably some dependency disparity.

Looks to me like a discrepancy between how anyascii and unidecode are handling the NYT puzzle in question. xword-dl is latinizing the rebus answers, and it looks like probably anyascii is returning a :shortcode: for one of the puzzle clues here. The library that's supposed to save the puzzle file, puzpy, is creating a dict by splitting on : and this is almost certainly a bug - it ought to be using .partition(). Unfortunately I can't see what clue is causing the issue because I don't have a NYT subscription.

afontenot · 2025-06-03T18:47:52Z

requirements.txt

 pyyaml==6.0.2
 xmltodict==0.14.2
-lxml==5.4.0
+lxml==5.3.1


nit: you reverted this to the pre-main version by mistake: cb4ad4b

The string in question that's tripping up puz.py is:

0:Q:clubs:; 1:A:spades:; 2:2:hearts:; 3:K:clubs:; 4:10:hearts:; 5:10:clubs:; 6:A:hearts:; 7:7:diamonds:; 8:Q:hearts:; 9:A:clubs:;10:10:spades:;11:K:hearts:;12:J:hearts:;

Maybe the third colon in each is a problem.

Looks like the unidecode stripped the colons.

Got latinize: Q:clubs: Got unidecode: Qclubs

Ah, yeah, I see now your new latinize is introducing the colons in place of the non-ascii characters:

Got raw: Q♣

I'll defer to your opinion about what we should do in this example. The "Qclubs" translation wasn't great either, but it avoided the coincidental introduction of colons to the otherwise colon-free answer.

Thanks again for all the thoughtful feedback!

Something else to consider about the colons situation is whether we should strip all colons from the result of anyascii or check first to make sure there aren't intentional colons in the text.

Yes, the NYT supports the notion of multiple correct answers, and for example in this puzzle there is a pretty exhaustive list of "correct" answers for each of these squares. For example:

[ "Q\u2663", "QUEEN", "QUEENOFCLUBS", "Q", "CLUB", "C", "QUEEN/CLUB", "CLUB/QUEEN", "Q/C", "C/Q", "QC", "CQ" ],

Oh, for my concern about there being colons in the source text, that's easily addressable since you've already got it going one code point at a time. We can simply not pass the character to anyascii if it happens to be a single colon character, and strip any colons from the anyascii result. I think I'll work on that later and amend the PR with that as at least a short-term fix to replicate unidecode behavior.

Aaand ... no need, because the c < 0xFF test already includes the colon character. OK, easy-peasy.

OK I made one more push with a change to strip all colons from anyascii, at least as a stop-gap fix to hopefully get the tests passing. I think the PR could be considered as-is even while @thisisparker may want to mull the other questions about how to better handle non-ASCII answer text.

I think there are maybe some consistency concerns with unconditionally stripping the colons from anyascii, because elsewhere in the code emojis are converted to text with emoji.demojize, which uses this approach. So e.g. emoji.demojize("🏳️‍⚧️") returns :transgender_flag:. If we change the behavior of anyascii, it will no longer match this.

I do agree it's worth waiting at this point for Parker to weigh in on how best to handle representing rebus text.

requirements.txt

.github/workflows/status-check-outlets.yml

… anyascii, since they wreak havoc with downstream puz.py parsing.

thisisparker · 2025-06-26T18:49:14Z

Looking at the rebus thing in particular: I think we should not be applying any transformation to grid fill, and rebus entries count as grid fill. (I say that even though I am the one that introduced that transformation in #172.) I think for NYT the best behavior would probably be to test the valid answers and return the first one that doesn't break .puz encoding — and even in a future scenario where we can count on UTF-8, I don't think we should pass emoji in that field.

This is all kind of no-win, because entering rebuses is iffy in most circumstances and this is a lossier operation than I like to do on these, but I don't have any better ideas.

danielpunkass · 2025-07-11T02:39:14Z

@thisisparker Thanks for chiming in. How are you feeling about the PR on the whole, as far as considering taking the leap and merging into main so that the code base is in fact compliant with its license?

The handling of the non-ascii characters might be something to handle separately, if at all. The state of this PR as it stands should pass all tests and behave comparably to the main branch. It would be great to have the MIT license locked in, especially because keeping GPL in the code base might encourage somebody to add more GPL code in the future.

afontenot · 2025-07-11T04:50:34Z

FWIW I don't think there's a license compliance issue with the code as is. MIT licensed software is allowed to use a GPL licensed dependency. For example, if I want to extract some part of xword-dl and use it in my proprietary crossword software, I can do that (so long as I comply with the terms of the MIT license). The issue with having a GPL dependency is that there's no way for people to convey a functional version of xword-dl under the terms of the MIT license. Shipping a fully functional version of xword-dl would require including the GPL dependency, meaning that the resulting combined work would be covered by the GPL.

So it would be accurate to say that the xword-dl code is MIT licensed, but distributions of the software (e.g. by Debian or Arch Linux) happen under the terms of the GPL because they are tainted by the dependency.

GNU / FSF cover this in some detail here: https://www.gnu.org/licenses/gpl-faq.html#WhatDoesCompatMean

danielpunkass · 2025-07-11T05:43:39Z

That is indeed a fair distinction, but I think that generally people who are browsing projects and see a permissive license like MIT, they assume they can use the project as-is under those terms. Right now the project can only be used as a whole under the GPL license, so the license tag is misleading.

… a proxy server when downloading from the New Yorker.

Switch from Unidecode, which has a GPL license, to anyascii, which is…

079521c

… ISC, an MIT-compatible license.

danielpunkass marked this pull request as draft January 7, 2023 19:25

Fix format of the anyascii dependency

1ac1388

danielpunkass mentioned this pull request Jan 7, 2023

Clarify project license #61

Closed

Replace dependency on html2text, which has GPL licensing requirements…

d7e9af5

…, with html_text, which is MIT licensed.

danielpunkass marked this pull request as ready for review January 7, 2023 23:22

danielpunkass added 2 commits April 8, 2025 09:07

Merge remote-tracking branch 'upstream/main' into no-gpl

4cd99a9

Bring no-gpl branch up to date with latest from upstream/main

c1f9f82

Improve use of anyascii

acc4aba

Because anyascii caches lookups on a module level and iterates by codepoint anyway, it's perfectly efficient to filter out latin-1 codepoints and pass the rest through anyascii.