Change GPL dependencies to MIT-compatible counterparts#90
Change GPL dependencies to MIT-compatible counterparts#90danielpunkass wants to merge 29 commits intothisisparker:mainfrom
Conversation
… ISC, an MIT-compatible license.
…, with html_text, which is MIT licensed.
|
I think this is ready for review now. The needs of xword-dl for both Unidecode and html2text seem light-weight enough that I suspect the replacement MIT-licensed libraries will fit the bill. |
|
Thank you for the PR! Holding on this just for the moment because I might have to make some changes to how I'm using |
|
Cool, no rush! |
|
@danielpunkass @thisisparker I think it would be really nice to see some version of this merged. I've created a pull request against the merging branch with what I think is an improvement. Rather than reimplementing the entire Those familiar with Unicode might be surprised you can do this. It's actually because |
Because anyascii caches lookups on a module level and iterates by codepoint anyway, it's perfectly efficient to filter out latin-1 codepoints and pass the rest through anyascii.
|
Thanks for the attention to the PR @afontenot ! I will have to get my head back around some of the rationale for the way I approached things but I will review and revise the PR branch soon based on your reviews. |
Improve use of anyascii
|
I've updated the branch with most of the tweaks from @afontenot, and merged latest main to make the PR merge easier if/when accepted. |
…te the html_text call to extract_text
|
There's something wrong with branch that is causing one of the tests to fail: I'll figure out what's going on. Probably some dependency disparity. |
Looks to me like a discrepancy between how |
requirements.txt
Outdated
| pyyaml==6.0.2 | ||
| xmltodict==0.14.2 | ||
| lxml==5.4.0 | ||
| lxml==5.3.1 |
There was a problem hiding this comment.
nit: you reverted this to the pre-main version by mistake: cb4ad4b
There was a problem hiding this comment.
The string in question that's tripping up puz.py is:
0:Q:clubs:; 1:A:spades:; 2:2:hearts:; 3:K:clubs:; 4:10:hearts:; 5:10:clubs:; 6:A:hearts:; 7:7:diamonds:; 8:Q:hearts:; 9:A:clubs:;10:10:spades:;11:K:hearts:;12:J:hearts:;
Maybe the third colon in each is a problem.
There was a problem hiding this comment.
Looks like the unidecode stripped the colons.
Got latinize: Q:clubs:
Got unidecode: Qclubs
There was a problem hiding this comment.
Ah, yeah, I see now your new latinize is introducing the colons in place of the non-ascii characters:
Got raw: Q♣
There was a problem hiding this comment.
I'll defer to your opinion about what we should do in this example. The "Qclubs" translation wasn't great either, but it avoided the coincidental introduction of colons to the otherwise colon-free answer.
There was a problem hiding this comment.
Thanks again for all the thoughtful feedback!
Something else to consider about the colons situation is whether we should strip all colons from the result of anyascii or check first to make sure there aren't intentional colons in the text.
Yes, the NYT supports the notion of multiple correct answers, and for example in this puzzle there is a pretty exhaustive list of "correct" answers for each of these squares. For example:
[
"Q\u2663",
"QUEEN",
"QUEENOFCLUBS",
"Q",
"CLUB",
"C",
"QUEEN/CLUB",
"CLUB/QUEEN",
"Q/C",
"C/Q",
"QC",
"CQ"
],
There was a problem hiding this comment.
Oh, for my concern about there being colons in the source text, that's easily addressable since you've already got it going one code point at a time. We can simply not pass the character to anyascii if it happens to be a single colon character, and strip any colons from the anyascii result. I think I'll work on that later and amend the PR with that as at least a short-term fix to replicate unidecode behavior.
There was a problem hiding this comment.
Aaand ... no need, because the c < 0xFF test already includes the colon character. OK, easy-peasy.
There was a problem hiding this comment.
OK I made one more push with a change to strip all colons from anyascii, at least as a stop-gap fix to hopefully get the tests passing. I think the PR could be considered as-is even while @thisisparker may want to mull the other questions about how to better handle non-ASCII answer text.
There was a problem hiding this comment.
I think there are maybe some consistency concerns with unconditionally stripping the colons from anyascii, because elsewhere in the code emojis are converted to text with emoji.demojize, which uses this approach. So e.g. emoji.demojize("🏳️⚧️") returns :transgender_flag:. If we change the behavior of anyascii, it will no longer match this.
I do agree it's worth waiting at this point for Parker to weigh in on how best to handle representing rebus text.
… anyascii, since they wreak havoc with downstream puz.py parsing.
|
Looking at the rebus thing in particular: I think we should not be applying any transformation to grid fill, and rebus entries count as grid fill. (I say that even though I am the one that introduced that transformation in #172.) I think for NYT the best behavior would probably be to test the valid answers and return the first one that doesn't break .puz encoding — and even in a future scenario where we can count on UTF-8, I don't think we should pass emoji in that field. This is all kind of no-win, because entering rebuses is iffy in most circumstances and this is a lossier operation than I like to do on these, but I don't have any better ideas. |
|
@thisisparker Thanks for chiming in. How are you feeling about the PR on the whole, as far as considering taking the leap and merging into main so that the code base is in fact compliant with its license? The handling of the non-ascii characters might be something to handle separately, if at all. The state of this PR as it stands should pass all tests and behave comparably to the main branch. It would be great to have the MIT license locked in, especially because keeping GPL in the code base might encourage somebody to add more GPL code in the future. |
|
FWIW I don't think there's a license compliance issue with the code as is. MIT licensed software is allowed to use a GPL licensed dependency. For example, if I want to extract some part of xword-dl and use it in my proprietary crossword software, I can do that (so long as I comply with the terms of the MIT license). The issue with having a GPL dependency is that there's no way for people to convey a functional version of xword-dl under the terms of the MIT license. Shipping a fully functional version of xword-dl would require including the GPL dependency, meaning that the resulting combined work would be covered by the GPL. So it would be accurate to say that the xword-dl code is MIT licensed, but distributions of the software (e.g. by Debian or Arch Linux) happen under the terms of the GPL because they are tainted by the dependency. GNU / FSF cover this in some detail here: https://www.gnu.org/licenses/gpl-faq.html#WhatDoesCompatMean |
|
That is indeed a fair distinction, but I think that generally people who are browsing projects and see a permissive license like MIT, they assume they can use the project as-is under those terms. Right now the project can only be used as a whole under the GPL license, so the license tag is misleading. |
… a proxy server when downloading from the New Yorker.
The Unidecode and html2text dependencies are GPL2 and GPL3 licensed, so they are not compatible with the MIT licensing of this project as a whole. See #61