Rename strings barfs with a string that doesn't match that pattern.  #6

@notbaab

Given a file list like this:

ShowName.S02E01.Anthropology.101.720p.WEB-DL.x264-LeRalouf.mkv
ShowName.S02E02.Accounting.For.Lawyers.720p.WEB-DL.x264-LeRalouf.mkv
ShowName.S02E03.The.Psychology.Of.Letting.Go.720p.WEB-DL.x264-LeRalouf.mkv
....
sample/

the rename code does not rename the files, even though clear patterns are present. Instead, it should keep track of the substrings that match the majority of the strings. The failure happens in GetCommonSubstrings when it calls normalize_comparision.

The issue happens because, while walking down the list of files, it only compares the current string with the previous one. It then keeps only what matches in those two strings and ignores everything else it has seen, so a single unrelated entry such as sample/ wipes out all the matches found so far.
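A minimal sketch of the failure mode, in Python for illustration only. `shared_tokens` and `sequential_common` are hypothetical stand-ins, not the project's actual GetCommonSubstrings, and splitting names on "." is an assumption made to keep the example short.

```python
def shared_tokens(a, b):
    """Tokens (split on '.') of a that also appear in b, in a's order."""
    b_tokens = set(b.split("."))
    return [t for t in a.split(".") if t in b_tokens]


def sequential_common(names):
    """Walk the list, comparing each name only with the previous one."""
    common = None
    for prev, cur in zip(names, names[1:]):
        pair = shared_tokens(prev, cur)
        # Keep only what survives this pair; one empty pair empties everything.
        common = pair if common is None else [t for t in common if t in pair]
    return common or []


names = [
    "ShowName.S02E01.Anthropology.101.720p.WEB-DL.x264-LeRalouf.mkv",
    "ShowName.S02E02.Accounting.For.Lawyers.720p.WEB-DL.x264-LeRalouf.mkv",
    "ShowName.S02E03.The.Psychology.Of.Letting.Go.720p.WEB-DL.x264-LeRalouf.mkv",
    "sample/",
]

print(sequential_common(names[:3]))  # five shared tokens
print(sequential_common(names))      # [] once sample/ is included
```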

A simple approach: if getting all common substrings for a pair returns nothing at all, don't normalize against that pair. I would say try that approach first.
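Roughly what I mean by the simple approach, again as a Python sketch with hypothetical helper names and the same assumption that names are split on ".". The only change from a plain sequential walk is that an empty comparison is skipped instead of being applied.

```python
def shared_tokens(a, b):
    """Tokens (split on '.') of a that also appear in b, in a's order."""
    b_tokens = set(b.split("."))
    return [t for t in a.split(".") if t in b_tokens]


def sequential_common_skip_empty(names):
    """Sequential walk that ignores pairs with nothing in common."""
    common = None
    for prev, cur in zip(names, names[1:]):
        pair = shared_tokens(prev, cur)
        if not pair:
            # Nothing matched at all: treat the pair as noise, don't normalize.
            continue
        common = pair if common is None else [t for t in common if t in pair]
    return common or []


names = [
    "ShowName.S02E01.Anthropology.101.720p.WEB-DL.x264-LeRalouf.mkv",
    "sample/",
    "ShowName.S02E02.Accounting.For.Lawyers.720p.WEB-DL.x264-LeRalouf.mkv",
    "ShowName.S02E03.The.Psychology.Of.Letting.Go.720p.WEB-DL.x264-LeRalouf.mkv",
]

print(sequential_common_skip_empty(names))
```

Note this only helps when the odd entry shares nothing with its neighbours; an entry that shares a single stray token would still shrink the result.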

A perfect approach would be to compare every string to every other string and, if a string doesn't match any other string, simply exclude it. Generally this works with a pretty small input, so an n^2 solution isn't impractical. I think I would rather throw more space at it than more compute, though.
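The all-pairs idea could look something like this. It is a Python sketch under the same assumptions as before: `shared_tokens` and `drop_outliers` are hypothetical names, and "matching" here just means sharing at least one "."-separated token.

```python
def shared_tokens(a, b):
    """Tokens (split on '.') of a that also appear in b, in a's order."""
    b_tokens = set(b.split("."))
    return [t for t in a.split(".") if t in b_tokens]


def drop_outliers(names):
    """Keep only names that share something with at least one other name (O(n^2))."""
    kept = []
    for i, a in enumerate(names):
        others = (b for j, b in enumerate(names) if j != i)
        if any(shared_tokens(a, b) for b in others):
            kept.append(a)
    return kept


names = [
    "ShowName.S02E01.Anthropology.101.720p.WEB-DL.x264-LeRalouf.mkv",
    "ShowName.S02E02.Accounting.For.Lawyers.720p.WEB-DL.x264-LeRalouf.mkv",
    "ShowName.S02E03.The.Psychology.Of.Letting.Go.720p.WEB-DL.x264-LeRalouf.mkv",
    "sample/",
]

print(drop_outliers(names))  # sample/ is excluded, the three episodes remain
```

The surviving names can then go through the existing sequential walk unchanged.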

A third approach: keep the sequential walk and store anything removed in a map, with the value being how many strings it was removed from. Anything removed from a high percentage of the strings then actually gets removed from all of them. This has a different set of issues, though. Suppose the given input is:

ShowName.S02E01.Anthropology.101.720p.WEB-DL.x264-LeRalouf.mkv
ShowName.S02E02.Accounting.For.Lawyers.720p.WEB-DL.x264-LeRalouf.mkv
ShowName.S02E03.The.Psychology.Of.Letting.Go.720p.WEB-DL.x264-LeRalouf.mkv
ShowName.S02E04.The.Psychology.Of.writing.Go.720p.WEB-DL.x264-LeRalouf.mkv

The above would produce a map that looks like:

720p.WEB-DL.x264-LeRalouf.mkv -> 2
Go.720p.WEB-DL.x264-LeRalouf.mkv -> 2

So further steps are needed to make sure that no entry is a subset of another. I'm not sure what the best solution would be. The three approaches discussed seem easy enough to implement, so maybe just prototype them and see which does best?
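For the subset problem in the map approach, one possible extra step is to credit each entry with the counts of any longer entry that ends with it. This is a Python sketch only: `fold_subsets` is a hypothetical name, the map is the one from the example above taken as given, and it assumes each count covers a different set of strings so the counts can simply be added.

```python
def fold_subsets(removed):
    """Add to each entry the counts of longer entries that end with it."""
    totals = {}
    for cand in removed:
        totals[cand] = sum(
            count
            for other, count in removed.items()
            # Match on a '.' boundary so partial tokens don't count.
            if other == cand or other.endswith("." + cand)
        )
    return totals


removed = {
    "720p.WEB-DL.x264-LeRalouf.mkv": 2,
    "Go.720p.WEB-DL.x264-LeRalouf.mkv": 2,
}

totals = fold_subsets(removed)
total_strings = 4
threshold = 0.75  # hypothetical cutoff for "a high percentage"
frequent = [c for c, n in totals.items() if n / total_strings >= threshold]

print(totals)
print(frequent)  # only the shorter suffix clears the cutoff
```

With the example map, the shorter suffix ends up credited with all four strings while the longer one stays at two, so only the shorter one would be removed.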