
Partial implementation of fix for issue 1395. #2206

Closed

fatotak wants to merge 5 commits into mihonapp:main from fatotak:bugfix/1395_different_chapters_with_same_name

Conversation

@fatotak
Contributor

@fatotak fatotak commented Jun 16, 2025

Fixed: new downloads work perfectly.
Working: reading works as far as I can tell.

Not done/TODO:

  1. Before migrating old chapters, one thing needs to be done first: detect the chapters that were affected by this issue and delete them.
  2. Only after step 1 is done can we migrate the old chapters that were not affected.

Test artefacts as below:

  1. A chapter with a duplicate name/scanlator downloads fine. You can see that two files were downloaded with different file names.

[screenshot]

  2. The downloaded chapter loads correctly in the reader.

[screenshot]

Fixed: new downloads.
Working: reading.
Todo: migrate old chapters to the new name, and delete and force a redownload of problem chapters (chapters within a manga that share the same name and scanlator).
Review snippet (old line, then the new function):

    val newChapterName = sanitizeChapterName(chapterName)

    fun getChapterDirName(chapter: Chapter): String {
        val newChapterName = sanitizeChapterName(chapter.name)
        val hash = md5(chapter.url).takeLast(6)
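For illustration, the naming scheme in that snippet can be sketched in Python (the helper names here are hypothetical stand-ins; the real implementation is the Kotlin above):

```python
import hashlib

def sanitize_chapter_name(name: str) -> str:
    # Stand-in for Mihon's filename sanitizer: replace characters
    # that are unsafe on FAT-style filesystems with underscores.
    return "".join(c if c.isalnum() or c in " ._-" else "_" for c in name)

def get_chapter_dir_name(name: str, url: str) -> str:
    # Disambiguate chapters that share a name and scanlator by
    # appending the last 6 hex digits of md5(chapter.url).
    suffix = hashlib.md5(url.encode("utf-8")).hexdigest()[-6:]
    return f"{sanitize_chapter_name(name)}_{suffix}"
```

Two chapters named "Bonus" with different URLs now map to distinct directories, which is the point of the change.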

This comment was marked as resolved.

@Syer10

Syer10 commented Jun 16, 2025

What if the URL changes? A lot of sources change the url periodically and this would mean downloads from any of those sources won't stay downloaded.

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

What if the URL changes? A lot of sources change the url periodically and this would mean downloads from any of those sources won't stay downloaded.

I used the url for two reasons:

  1. Initially I thought of this as a change for the extension, so I was limiting my thinking to the data extensions have.
  2. Someone else mentioned hashing the url in the ticket. Different chapters with same name get replaced on Download #1395 (comment)

Given it's in the base build, not the extension, there are two fields we can use for this change:

  1. url
  2. chapter.id

No other field provides the level of uniqueness required.
If you prefer to use chapter.id, I can change it to that (i.e. either append the id as-is, take the last n bits of the id, or hash the id itself into a 4-byte array). I didn't look into the specific conditions where chapter.id changes (i.e. do db entries get replaced under certain conditions?).

@Syer10

Syer10 commented Jun 16, 2025

chapter.id also changes when the url changes; there is no way to provide the level of uniqueness you want unless it's done on the extension side.

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

unless it's done on the extension side.

  1. It won't be done on the extension side. They said it's a base-build issue, referencing this ticket (1395): Comick handling of same chapters from same scanlator with different quality overwrite each other. keiyoushi/extensions-source#9262
  2. They don't have more data than the base build. They have less (only site data, which they give you), with no history data, which is in the db.

@Syer10

Syer10 commented Jun 16, 2025

Then maybe a pre-processor should be added on Mihon's side before chapters are added to the database that adds an Alt1/Alt2/Alt3/etc to the scanlator field if it finds duplicates, based on the order they are in the list given by extensions. Changing the download scheme will cause more problems than not.

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

Then maybe a pre-processor should be added on Mihon's side before chapters are added to the database that adds an Alt1/Alt2/Alt3/etc to the scanlator field if it finds duplicates.

And how will the pre-processor determine it's alt vs non-alt?
url? it comes back to the same thing.
presence of other rows? then it's no different from just using chapter.id instead of chapter.url

Changing the download scheme will cause more problems than not.

What exact problems?
When url changes, it won't find the file, but that will happen under the "alt" thing as well, if the determining factor in assigning "alt" is url. If it's NOT url, then what is it?

@Syer10

Syer10 commented Jun 16, 2025

What exact problems?
When url changes, it won't find the file, but that will happen under the "alt" thing as well, if the determining factor in assigning "alt" is url. If it's NOT url, then what is it?

What if the URL changes? A lot of sources change the url periodically and this would mean downloads from any of those sources won't stay downloaded.

And how will the pre-processor determine it's alt vs non-alt? url? it comes back to the same thing.

As I said in my previous message, based on the order they are in the list given by extensions. Which means that if the pre-processor finds an exact duplicate (chapter title + scanlator), it will add an AltX to the scanlator field of the duplicate. There is no way to determine what is original and what is alt, but we can infer based on the order the dupes are listed in.

@MajorTanya
Member

What exact problems?
When url changes, it won't find the file, but that will happen under the "alt" thing as well, if the determining factor in assigning "alt" is url. If it's NOT url, then what is it?

What if the URL changes? A lot of sources change the url periodically and this would mean downloads from any of those sources won't stay downloaded.

And how will the pre-processor determine it's alt vs non-alt? url? it comes back to the same thing.

As I said in my previous message, based on the order they are in the list given by extensions. Which means that if the pre-processor finds an exact duplicate (chapter title + scanlator), it will add an AltX to the scanlator field of the duplicate. There is no way to determine what is original and what is alt, but we can infer based on the order the dupes are listed in.

Basically, a list of "Chapter 1", "Bonus", "Bonus", "Chapter 2", etc is handed over by the extension, and you are suggesting to name them "Chapter 1.cbz", "Bonus (1).cbz", "Bonus (2).cbz", "Chapter 2.cbz" in the exact order given by the source.
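A minimal sketch of that pre-processor idea, in Python for illustration (the function name and shape are assumptions, not Mihon code):

```python
def add_alt_suffixes(chapters):
    # chapters: (title, scanlator) pairs in the order the extension
    # returned them. The first occurrence keeps its scanlator; each
    # later duplicate gets " Alt1", " Alt2", ... appended.
    seen = {}
    result = []
    for title, scanlator in chapters:
        key = (title, scanlator)
        count = seen.get(key, 0)
        seen[key] = count + 1
        result.append((title, scanlator if count == 0 else f"{scanlator} Alt{count}"))
    return result
```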

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

As I said in my previous message, based on the order they are in the list given by extensions. Which means that if the pre-processor finds an exact duplicate (chapter title + scanlator), it will add an AltX to the scanlator field of the duplicate. There is no way to determine what is original and what is alt, but we can infer based on the order the dupes are listed in.

That order is already in the db: chapter.sourceOrder. But that value isn't constant and isn't reliable at all, according to my inspection of the data in my db.

The source can, at any time, insert a chapter anywhere.
Let's say they have 2 copies of 'Bonus' to start, which are chapter 5.5, chapter 9.5.
Next they add a third copy of 'Bonus', which is chapter 4.5.

In the filesystem, we have Bonus = 5.5, Bonus (2) = 9.5, and.... Bonus (3) = 9.5, with NO 4.5.

Using the URL, we won't have such issues.

I think it is far less reliable than using url to match.
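The failure mode described above can be made concrete: if names are assigned purely by list position, inserting a new chapter silently re-points existing names at different chapters (illustrative Python; the helper is hypothetical):

```python
def order_based_names(chapter_numbers):
    # Map each chapter number to a name assigned purely by its
    # position in the source's list: "Bonus", "Bonus (2)", ...
    return {
        num: "Bonus" if i == 0 else f"Bonus ({i + 1})"
        for i, num in enumerate(chapter_numbers)
    }

before = order_based_names([5.5, 9.5])
after = order_based_names([4.5, 5.5, 9.5])  # source inserts chapter 4.5
```

The file written as "Bonus" originally held chapter 5.5, but after the insertion the name "Bonus" refers to chapter 4.5, which is the mismatch being argued about here.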

@MajorTanya
Member

The source can, at any time, insert a chapter anywhere. I think it is far less reliable than using url to match.

As Syer10 said before, there are plenty of sources who (semi-)frequently change chapter URLs, which makes the URL unreliable as well.

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

The source can, at any time, insert a chapter anywhere. I think it is far less reliable than using url to match.

As Syer10 said before, there are plenty of sources who (semi-)frequently change chapter URLs, which makes the URL unreliable as well.

Does upload date change when they change url?

@Syer10

Syer10 commented Jun 16, 2025

Your example uses the same upload date for both chapters

@MajorTanya
Member

The source can, at any time, insert a chapter anywhere. I think it is far less reliable than using url to match.

As Syer10 said before, there are plenty of sources who (semi-)frequently change chapter URLs, which makes the URL unreliable as well.

Does upload date change when they change url?

I'm not sure but I believe some sources don't have an upload date in which case the extension falls back to (today) or something.

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

Your example uses the same upload date for both chapters

I know that... but I'm looking to determine some kind of pattern. We have no single reliable element right now - url, timestamp, id, etc. - but if we can leverage the db and do two lookups instead of one, we can still achieve some uniqueness while being able to later find the entry.

Actually, in my example, if we can just store the filename in the db, then it doesn't matter if they change the url later. It's the same chapter anyway, just different quality, so what's important is to keep both files and have each row in the UI point to a different file. BUT the example in the ticket is the one that forces us to "get it right", because those are actually totally different chapters suffering the same symptoms.

@raxod502
Contributor

Per @AntsyLich in #1395 (comment), using the URL should be fine, given that there's already data fixing logic for the specific sources that have unstable URLs.

@Syer10

Syer10 commented Jun 16, 2025

Per @AntsyLich in #1395 (comment), using the URL should be fine, given that there's already data fixing logic for the specific sources that have unstable URLs.

While that is true, having to rename each and every folder of all downloaded chapters for one source will be a huge amount of filesystem operations; I don't think it's viable to trust the device to do it all in good time.

@raxod502
Contributor

It's possible to have the reader check for both old-style and new-style downloads. This is in accordance with the reader logic Mihon already has to check for four(!) different formats of downloaded chapter, due in part to past migrations. Renaming everything therefore isn't something that has to be done as a synchronous migration, and even further, we don't have to enable it for every source.

In your alternative proposal, can you elaborate on how you would plan to handle cases where chapters are added or removed in the middle of a series? At first glance, it seems like this would catastrophically break the database correctness since already-downloaded-chapters would be renumbered and thus interchanged with each other - an even worse outcome than chapters needing to be redownloaded.

@Syer10

Syer10 commented Jun 16, 2025

I think you misunderstood what I was saying.
Say you download 300 chapters with this new naming format, then the source changes the chapter URL. All those new downloads are now invalid and have to be moved since they rely on an MD5 of the URL. That is a huge amount of filesystem operations. And some won't even get moved because chapters with an invalid number don't get matched to new versions.

I know my solution is not perfect, but I feel it's better than this one, since this comes with a lot of caveats. It also must be said that the issue this is fixing is quite small: almost no sources have duplicate chapter names and scanlators; it's only a small few that have this issue.

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

It's possible to have the reader check for both old-style and new-style downloads. This is in accordance with the reader logic Mihon already has to check for four(!) different formats of downloaded chapter, due in part to past migrations. Renaming everything therefore isn't something that has to be done as a synchronous migration, and even further, we don't have to enable it for every source.

This code doesn't proactively rename old files (as it can't, due to not being able to identify the old problem chapters). It only:

  1. names new downloads according to the new pattern.
  2. retains the ability to find old files (and as a result, problem chapters downloaded before the change will continue to have the problem).

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

Say you download 300 chapters with this new naming format, then the source changes the chapter URL. All those new downloads are now invalid and have to be moved since they rely on an MD5 of the URL. That is a huge amount of filesystem operations. And some won't even get moved because chapters with an invalid number don't get matched to new versions.

Nope, a file move isn't a huge amount of filesystem ops.
File handle changes are minimal in impact.
It's content writes that harm the filesystem.

Mihon has a bigger issue with filesystem iops (download to temp -> zip -> delete temp).
I run my own fork (since about 2 months ago) that downloads directly to zip, since my PR addressing that wasn't merged into main (it saves a huge amount of flash wear, and IO performance is much better as well). If you need, I can get some numbers in a few days, as I still run official on my tablet, so I can do a side-by-side comparison by downloading the same title/chapter/source at the same time and timing them.

@raxod502
Contributor

How would you feel about having the hash-based chapter names be a flag that can be enabled or disabled by a particular source? The default can be the present behavior of not disambiguating chapters with the same names, and sources that have duplicate chapters can set the flag to append a hash based on the URL.

I think this would address everyone's concerns, yes? No need to worry about sources whose URLs are unstable, since we would not need to enable the flag for those, and no need to worry about a perception of how "small" the problem is, because the solution can be applied only to the sources that are problematic, without affecting the others.

IMO, we would also want to add a check to the general downloader code that would at least detect when there has been a filename collision, and report it as an error to the user, so that they can know to go to the issue tracker and report that a source probably should have this flag enabled for it, to enable correct downloads.
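The suggested collision check is simple to state; a hedged Python sketch (the function name and return shape are assumptions):

```python
def find_name_collisions(chapters):
    # chapters: (name, scanlator) pairs for one manga. Returns the
    # keys that occur more than once, so the downloader can surface
    # an error instead of silently overwriting files.
    seen = set()
    collisions = []
    for key in chapters:
        if key in seen and key not in collisions:
            collisions.append(key)
        seen.add(key)
    return collisions
```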

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

How would you feel about having the hash-based chapter names be a flag that can be enabled or disabled by a particular source? The default can be the present behavior of not disambiguating chapters with the same names, and sources that have duplicate chapters can set the flag to append a hash based on the URL.

How will the source know? It's not like it can predict the future.

Rather than that, an indicator on the source about the stability of its urls may be viable (a static/hard-coded value in the extension)?

  • For those that are relatively stable, use url hashing (I dare say 100% of the sources I've been using/testing with are at or near 100% stable).
  • For those that are not stable, live with the original problem.

The main issue with that is that the interface between Mihon and extensions may require modification.

@AntsyLich
Member

I think you misunderstood what I was saying. Say you download 300 chapters with this new naming format, then the source changes the chapter URL. All those new downloads are now invalid and have to be moved since they rely on an MD5 of the URL. That is a huge amount of filesystem operations. And some won't even get moved because chapters with an invalid number don't get matched to new versions.

I was intending to introduce unavailable chapter to counter chapter removal. Which should also affect this as we won't need to migrate downloads to another readded chapter.

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

Which should also affect this as we won't need to migrate downloads to another readded chapter.

If you do this, does it mean chapter.id will be the stable identifier? (so suffix the chapter file with chapter.id instead of md5(chapter.url))

@AntsyLich
Member

Which should also affect this as we won't need to migrate downloads to another readded chapter.

If you do this, does it mean chapter.id will be the stable identifier? (so suffix the chapter file with chapter.id instead of md5(chapter.url))

I suppose yeah

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

I suppose yeah

Updated.

@Syer10

Syer10 commented Jun 16, 2025

This means that we won't be able to move the downloads to a new Mihon install though

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

This means that we won't be able to move the downloads to a new Mihon install though

Doesn't export/import transfer the db?

@AntsyLich
Member

This means that we won't be able to move the downloads to a new Mihon install though

Doesn't export/import transfer the db?

That won't work if the other db already has a chapter with id 1 tho

@fatotak
Contributor Author

fatotak commented Jun 16, 2025

That won't work if the other db already has a chapter with id 1 tho

Can we add the last known file name to the db?
New allocations (chapter.filename == null) will use the id, but non-new allocations will use the stored field data.
Then the export/import scenario wouldn't be an issue.

A dedicated db field + a deterministic suffix should be good for all cases, after your fix for chapter shift.
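The lookup order being proposed here - stored filename first, minting an id-based name only for brand-new chapters - might look like this (illustrative Python; the field names are assumptions, not Mihon's schema):

```python
def resolve_chapter_filename(chapter):
    # chapter: dict with "filename" (None until first download),
    # "name", and "id". Existing rows keep whatever name was stored,
    # so a later id change (e.g. after export/import into another db)
    # doesn't orphan the file on disk.
    if chapter.get("filename"):
        return chapter["filename"]
    chapter["filename"] = f"{chapter['name']}_{chapter['id']}"
    return chapter["filename"]
```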

…d is more appropriate for new file naming format.
@fatotak
Contributor Author

fatotak commented Jun 18, 2025

Any progress on the review/merge?

@raxod502
Contributor

raxod502 commented Jul 6, 2025

@AntsyLich or @MajorTanya, would one of you perhaps be able to comment on what is needed from contributors to move towards a merged fix for this bug? I'm happy to help out.

@AntsyLich
Member

I was on hiatus, but other than that I did leave some comments in the Discord server. TL;DR is that I don't like the extra db IO these changes result in.

@raxod502
Contributor

raxod502 commented Jul 9, 2025

Ah, sorry, I was unaware that discussion was happening there. For future reference, the mentioned discussion can be viewed at https://discord.com/channels/1195734228319617024/1197870876557848647/1391637343038996482 and subsequent messages.

@AntsyLich
Member

Ah, sorry,

No need to be sorry

@raxod502
Contributor

raxod502 commented Jul 13, 2025

@fatotak and @AntsyLich - I've filed fatotak#4 against this PR. The additional changes:

  • Set the implementation to use hashed chapter URLs as the disambiguator, rather than chapter IDs. (This was decided on in the preceding discussion, and has the advantage that the filesystem layout does not depend on arbitrary internal database features, and existing downloads need not be thrown away if the database is lost.)
  • Eliminate the added database calls in UpdatesScreenModel, resolving issues with increased database IO that were previously caused by the change.
  • Migrate method calls back to taking separate parameters for the individual features of a chapter and manga that are relevant to locating them on the filesystem, rather than taking the entire domain objects, which encode much more information than is actually needed. (If it is desired to reduce the number of method parameters, a new domain object can be added that contains the chapter name, scanlator, and URL: this could then be used in place of the current parameters, and fields added in case more information is required in the future.)

The present code should be backwards compatible with existing downloaded chapters, but use the new scheme to download new chapters. As mentioned in the original PR description, we could also take action to correct existing, already-downloaded chapters that were affected by this bug and therefore contain incorrect contents. This is not yet implemented.

I tested this by compiling a development build of Mihon with the changes (as well as a custom Hiveworks extension to work around keiyoushi/extensions-source#9648) and downloading some chapters that have the same title. They render correctly and distinctly, unlike before. This is what the filesystem looks like:

bluejay:/sdcard/MihonDev/downloads/Hiveworks Comics (EN)/SMBC $ ls -lA                                                                                                                       
total 2184
-rw-rw---- 1 u0_a143 media_rw 514698 2025-07-13 15:12 AI_91a377.cbz
-rw-rw---- 1 u0_a143 media_rw 166897 2025-07-13 15:11 AI_92a53c.cbz
-rw-rw---- 1 u0_a143 media_rw 340925 2025-07-13 15:02 Rolling_6e8ef1.cbz
-rw-rw---- 1 u0_a143 media_rw 674535 2025-07-13 15:01 Summary_0e7956.cbz
-rw-rw---- 1 u0_a143 media_rw 529171 2025-07-13 15:12 ai_2d0872.cbz

(There are two chapters called AI and one called ai in lowercase, yes.)

@AntsyLich
Member

@raxod502 awesome. From a quick glance I'm OK with the implementation, but if possible we should take the opportunity of changing the download name schema to address some other related issues (e.g. #1517, #1633, #2280, and potentially #2132, though I'm not married to the idea).

The general plan would be to:

  • Append hash to manga folder
  • Cap the manga and chapter name (25? for each)
  • Restrict names to alphanumeric
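That plan could be sketched as follows (illustrative Python; the 25-character cap comes from the comment above, the 6-character hash mirrors the earlier snippet, and the helper name is hypothetical):

```python
import hashlib

def build_capped_dir_name(raw_name: str, url: str, cap: int = 25) -> str:
    # Restrict to alphanumerics (everything else becomes "_"),
    # cap the visible name, then append a short url hash so that
    # truncated names can't collide with each other.
    safe = "".join(c if c.isalnum() else "_" for c in raw_name)[:cap]
    return f"{safe}_{hashlib.md5(url.encode('utf-8')).hexdigest()[-6:]}"
```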

@BrutuZ
Contributor

BrutuZ commented Jul 14, 2025

I wish this were followed by a change to Local Source to use the metadata included with chapters, to avoid both clipped titles and the hash suffix in the chapter list :nyoron:

@raxod502
Contributor

I didn't know about local sources! Reading the documentation for that, it looks like we should continue to support the non-hash-appended filename structure as something Mihon is able to read, because it wouldn't make sense to ask people to compute hashes when providing their own files.

The character set and length restrictions can be applied on download, but we can still allow Mihon to read a more general set of file structures for the local source.

I will add the suggested changes to fatotak#4 if @fatotak doesn't get to it first.

@BrutuZ
Contributor

BrutuZ commented Jul 15, 2025

I think you misunderstood: the effect on Local Source would be on the chapter names when transplanting chapters downloaded from extensions.
Per Antsy's suggestion, Vol. 1 Chapter 1 - Let's start this run of the mill regurgitated weak to strong isekai shounen trash adventure would download as ScanlatorGroup_Vol__1_Chapter_1___Let_s_f00ba7.cbz, and that would also be how it shows up in the Local Source's chapter list (sans the .cbz extension). You'd lose most of the title and append yet another element to it, making it harder to visually parse, especially for sources that do not prepend titles with the numbering.

It wouldn't need to recalculate the hash. In fact it couldn't, since the chapter URL is the chapter's file/folder name itself.

@cuong-tran
Contributor

  • Cap the manga and chapter name (25? for each)

I think the length cap is already in place:

@raxod502
Contributor

So the use case you're referring to is when downloading manga using one source and then... viewing it again via local source, instead of using the original source?

@BrutuZ
Contributor

BrutuZ commented Jul 16, 2025

Correct.
It's not that uncommon of a use case; it can come up if a source removes chapters or entries, or goes down altogether. You might want to reuse previously downloaded chapters despite migration, which doesn't include those.

…ters_with_same_name

Improvements to PR#2206
@raxod502
Contributor

Friends - I have filed fatotak#5 against this PR. It incorporates two substantive changes:

1. Correct filename truncation

Existing logic in DiskUtil truncated filenames to 240 characters. However, Android imposes a limit of 255 bytes on a filename, and each character is at least one (and may be more than one) byte. Update this code to impose the likely-intended limit of 240 bytes instead.
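The difference matters for any non-ASCII name: a naive slice counts characters, while the filesystem limit is in bytes. A sketch of byte-correct truncation (illustrative Python, not the actual DiskUtil code):

```python
def truncate_filename_bytes(name: str, max_bytes: int = 240) -> str:
    # Truncate on UTF-8 byte length without ever splitting a
    # multi-byte character in half.
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    # errors="ignore" silently drops a trailing partial byte sequence
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

A name of 100 three-byte characters is 300 bytes: a character-based slice to 240 keeps all 100 characters, while byte-based truncation keeps only the first 80.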

Leave the old implementation of DiskUtil.buildValidFilename in the code, renaming it to DiskUtil.buildValidFilenameLegacy. Introduce instead a new function DiskUtil.sanitizeFilename, and migrate callers. This is so that we can retain one call to the old function in DownloadProvider.getLegacyChapterDirNames, so that chapters written using the old code can still be read by newer versions of Mihon, while all newly written chapters will use the correct truncation logic.

As was pointed out by @bunny on Discord, the original code in Mihon seems to have been taken from https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/java/android/os/FileUtils.java;l=1136, except that the critical subroutine responsible for correctly truncating filenames based on byte length was replaced with an incorrect substring slice. (It is also possible that an older version of the upstream code was buggy, and it is that version that was copied.)

Note that #2280 shows the exact truncation behavior implemented by the upstream code linked above. With the new implementation, truncation will occur at an earlier stage by Mihon, which will prevent that issue.

Based on discussion with @BrutuZ and @Foffs in Discord, code inspection, and some testing, I also believe that #1517 and #1633 are caused by too-long filenames rather than (as is claimed by the reporters) invalid characters. I suggest that we merge the fix in my new PR, have those reporters try the new version, and provide more detail about the offending filenames if errors persist.

See #2206 (comment) for where this was requested. Note that I opted not to add hashes to the manga directories, as I don't think duplicate manga names under the same source are nearly as common as duplicate chapter names under the same manga, and using only one hash in the directory structure simplifies things. I also opted to leave the existing restrictions on length and character set alone (standard FAT metacharacters disallowed, max length 240 bytes) since I think those are reasonable and the issue was actually that the maximum length constraint wasn't being enforced correctly.

I believe this to resolve all the issues linked in the comment above, except for #2132 since it seems basically unrelated, and since the feature request and its motivation remain rather unclear to me even after reading most of the discussion.

2. Use XML metadata in local source

When the chapter list on a local-source manga entry is refreshed, each chapter is checked for an embedded ComicInfo.xml file; if it is present, then the data contained therein is used for the chapter title and number, rather than the filename and heuristics for parsing based on the filename. This ensures that if a manga entry is downloaded from one source and then viewed through the local source interface, the display of the chapter list retains correct titles and ordering.

There is a performance implication: refreshing the chapter list for a local-source manga will take longer as Mihon must check to see if XML files are included. There is an existing feature in Mihon where including a file in the manga directory called .noxml will prevent XML files from being searched for source metadata. That feature is now extended to also suppress XML files from being searched for chapter metadata, which eliminates the performance penalty in the case that the user has no need for this feature.
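The lookup described above might be sketched like this (illustrative Python; Mihon's actual LocalSource is Kotlin, and real ComicInfo.xml files can carry more fields than shown here):

```python
import os
import xml.etree.ElementTree as ET
import zipfile

def chapter_title_from_cbz(manga_dir: str, cbz_path: str):
    # Honor the existing .noxml opt-out, then look for an embedded
    # ComicInfo.xml and prefer its <Title> over filename heuristics.
    if os.path.exists(os.path.join(manga_dir, ".noxml")):
        return None
    with zipfile.ZipFile(cbz_path) as zf:
        if "ComicInfo.xml" not in zf.namelist():
            return None
        root = ET.fromstring(zf.read("ComicInfo.xml"))
        return root.findtext("Title")
```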

Some minor refactoring in LocalSource was done to reduce code duplication.

See #2206 (comment) for where this was requested.

@raxod502
Contributor

fatotak@4febec5 and fatotak@e2f0337 implement changes suggested by @Syer10 on Discord to use a more memory-efficient string truncation algorithm and to eschew buildValidFilenameLegacy since they believe the new implementation (now renamed from sanitizeFilename back to the original buildValidFilename) is backwards compatible with the original implementation.

@raxod502
Contributor

Per request of @AntsyLich in Discord, I've filed my own PR #2305 to incorporate all code changes in one place. Further updates will take place there (but can also be merged into the original PR by its author, if they so desire).

@AntsyLich
Member

Superseded by #2305

@AntsyLich AntsyLich closed this Jul 24, 2025
