Skip to content

MyAnimeList Scrapper#6

Open
pjeanjean wants to merge 12 commits into
Neraste:masterfrom
pjeanjean:master
Open

MyAnimeList Scrapper#6
pjeanjean wants to merge 12 commits into
Neraste:masterfrom
pjeanjean:master

Conversation

@pjeanjean
Copy link
Copy Markdown
Contributor

I created a scrapper to get the artists of anime themes from MyAnimeList.

The scrapper is used by the anime parser, and will automatically get the data from MyAnimeList if it successfully finds the exact title of the anime it parses.
If an exact match is not found, the scrapper will ask the user which title from MAL is the correct one.

The data is then stored in a JSON file, and used again in all the subsequent parsing.

Current limitations:

  • the JSON file must be placed in the root directory of dakaraneko
  • there is no setting to disable the scrapper

@Neraste Neraste self-requested a review August 11, 2018 11:00
Copy link
Copy Markdown
Owner

@Neraste Neraste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start. Most of my comments are code style/code logic oriented.

The first point you have to improve is the saved file. If the default file is as simple as an empty dictionary, just remove it. Your code should look for such a file first and use it if it exists (meaning the script has been launched at least once) or create an empty dictionary instead. This way, the file only exists if it is not empty. For practical use, this allows the user to create a symbolic link to put the file whereever they want. Moreover, you should reconsider the structure of your saved data, as you use a list for no practical reason.

The second point concerns the enableness of the scrapper. With this PR, the scrapper is always on, which is not practical. Since there are no context given to the file name parser by Dakara (it will be the case), I think the best move is to use an environment variable to check if the scrapper is enabled or not. Since the scrapper does interact with third-party elements (the MAL server), I think it would be safer to disable it by default.

Comment thread mal_scrapper.py Outdated
Comment thread mal_scrapper.py Outdated
Comment thread mal_scrapper.py Outdated
Comment thread mal_scrapper.json
@@ -0,0 +1 @@
[{}]
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is your data a list of a unique dictionary? You carry data[0] everywhere in your code for no practical reason.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the reason is that a JSON file needs to start with an array.
So I can only include my dictionary inside an array...

Copy link
Copy Markdown
Owner

@Neraste Neraste Aug 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not not exact. RFC 4627 states that a JSON text is a serialized object or array. Newer RFC 7159 extends this with number, string, or true, false and null (cf. end of page 3). So a JSON text containing a dictionary is perfectly valid.

Comment thread mal_scrapper.py Outdated
Comment thread mal_scrapper.py Outdated
Comment thread mal_scrapper.py Outdated
Comment thread mal_scrapper.py
link = animes_list[i].find('a', {'class': 'hoverinfo_trigger'})['href']
found[anime] = link # We store the result and continue

if anime == name: # If we find a perfect match, that's great!
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some Python native tools for strings similarity detection (difflib.SequenceMatcher) that you can use.

Copy link
Copy Markdown
Contributor Author

@pjeanjean pjeanjean Aug 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I wanted to look for a perfect match only was to check that all the anime titles were following the MAL convention.
However, after the first scrapping, I agree that it could be a bit less strict.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I did not mean to replace the strict equality. I think you can try the proximity detection with a high ratio (at least 80 %) before prompting the user.

Comment thread mal_scrapper.py
animes_found = list(found)
print()
for i in range(len(animes_found)):
print(str(i) + ": " + animes_found[i])
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feeder is an automatic tool and I do not like the idea to request the user to do something during the process. As you stated, it behaves weirdly with the progress bar.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly agree, but I also think that the user will not have to do something very often (after the first scrapping, it should rarely happen actually).
The main issue I have with removing this process is that some cases, like "Yuru Yuri S2", will become a lot more complicated to handle.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this as it for now. We will change this behavior later (cf. a comment of mine in the main discussion).

Comment thread mal_scrapper.py Outdated
@Neraste
Copy link
Copy Markdown
Owner

Neraste commented Aug 11, 2018

Associate the scrapper to the parser was a good idea at first, but it has some drawbacks (no context available, interactive script whereas the parser is non-interactive, hard time to turn on/off the scrapping itself). But there is a way for improvement.

The scrapping and the generation of the JSON file could be dissociated to become a stand-alone executable script, that can be easily manipulated. Within karaneko, nekommons gives a function to list a directory cleanly and nekoparse allows to parse file names.

The parser would be modified to only read this JSON file and extract data from it. Moreover, in the near future, dakara-server will be able to create songs by feeding from files or by reading such a JSON file.

Comment thread mal_scrapper.py
json_file_path = './mal_scrapper.json'


class File():
Copy link
Copy Markdown
Owner

@Neraste Neraste Sep 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open by itself is a context manager, so you don't need this bypass class. You simply do:

with open(filename) as file:
    content = file.read()

Comment thread mal_scrapper.py
return get_artists(tags, True)

# Index not in list despite scrapping... Can't do anything else here
print('\nWARNING MAL_Scrapper: ' + name + ' ' + tags['link_type'] + str(tags['link_nb']) + ' not found')
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use logging module for warnings.

Comment thread mal_scrapper.py
# First steap: finding the page of the anime, using the search engine
response = requests.get('https://myanimelist.net/anime.php?q="' + name + '"')
if not response.ok:
print('\nWARNING MAL_Scrapper: can\'t connect to MyAnimeList')
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use logging module for warnings.

Moreover, if response.ok is false, it does not mean the connection to a server failed, but that the response code is 4** or 5**. See my previous comment for connection failure.

Comment thread mal_scrapper.py
"""

# First steap: finding the page of the anime, using the search engine
response = requests.get('https://myanimelist.net/anime.php?q="' + name + '"')
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add an error management here, in case the connection with the server failed. Basically, what you have to do is to encapsulate the requests.get call within a try/except block, and catch a requests.exceptions.RequestException error. You can either return or stop the execution of the program.

Comment thread mal_scrapper.py
parsed = bs.BeautifulSoup(response.text, features='html.parser')
animes_list = parsed.find('div', {'class': 'js-block-list'}).table.findAll('tr')

for i in range(1, kept_results_max + 1): # We won't keep more than kept_results_max entries
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware that you never use the first element of the list (of index 0). If this is voluntary, you should add a comment.

A more Pythonic way to iterate over a portion of a list:

for anime_row in animes_list[:kept_results_max]:
    pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants