MyAnimeList Scrapper#6
Conversation
There was a problem hiding this comment.
This is a good start. Most of my comments are code style/code logic oriented.
The first point you have to improve is the saved file. If the default file is as simple as an empty dictionary, just remove it. Your code should look for such a file first and use it if it exists (meaning the script has been launched at least once) or create an empty dictionary instead. This way, the file only exists if it is not empty. For practical use, this allows the user to create a symbolic link to put the file whereever they want. Moreover, you should reconsider the structure of your saved data, as you use a list for no practical reason.
The second point concerns the enableness of the scrapper. With this PR, the scrapper is always on, which is not practical. Since there are no context given to the file name parser by Dakara (it will be the case), I think the best move is to use an environment variable to check if the scrapper is enabled or not. Since the scrapper does interact with third-party elements (the MAL server), I think it would be safer to disable it by default.
| @@ -0,0 +1 @@ | |||
| [{}] | |||
There was a problem hiding this comment.
Why is your data a list of a unique dictionary? You carry data[0] everywhere in your code for no practical reason.
There was a problem hiding this comment.
Well, the reason is that a JSON file needs to start with an array.
So I can only include my dictionary inside an array...
| link = animes_list[i].find('a', {'class': 'hoverinfo_trigger'})['href'] | ||
| found[anime] = link # We store the result and continue | ||
|
|
||
| if anime == name: # If we find a perfect match, that's great! |
There was a problem hiding this comment.
There are some Python native tools for strings similarity detection (difflib.SequenceMatcher) that you can use.
There was a problem hiding this comment.
The reason I wanted to look for a perfect match only was to check that all the anime titles were following the MAL convention.
However, after the first scrapping, I agree that it could be a bit less strict.
There was a problem hiding this comment.
Sure. I did not mean to replace the strict equality. I think you can try the proximity detection with a high ratio (at least 80 %) before prompting the user.
| animes_found = list(found) | ||
| print() | ||
| for i in range(len(animes_found)): | ||
| print(str(i) + ": " + animes_found[i]) |
There was a problem hiding this comment.
The feeder is an automatic tool and I do not like the idea to request the user to do something during the process. As you stated, it behaves weirdly with the progress bar.
There was a problem hiding this comment.
I mostly agree, but I also think that the user will not have to do something very often (after the first scrapping, it should rarely happen actually).
The main issue I have with removing this process is that some cases, like "Yuru Yuri S2", will become a lot more complicated to handle.
There was a problem hiding this comment.
Let's keep this as it for now. We will change this behavior later (cf. a comment of mine in the main discussion).
|
Associate the scrapper to the parser was a good idea at first, but it has some drawbacks (no context available, interactive script whereas the parser is non-interactive, hard time to turn on/off the scrapping itself). But there is a way for improvement. The scrapping and the generation of the JSON file could be dissociated to become a stand-alone executable script, that can be easily manipulated. Within karaneko, nekommons gives a function to list a directory cleanly and nekoparse allows to parse file names. The parser would be modified to only read this JSON file and extract data from it. Moreover, in the near future, dakara-server will be able to create songs by feeding from files or by reading such a JSON file. |
| json_file_path = './mal_scrapper.json' | ||
|
|
||
|
|
||
| class File(): |
There was a problem hiding this comment.
open by itself is a context manager, so you don't need this bypass class. You simply do:
with open(filename) as file:
content = file.read()| return get_artists(tags, True) | ||
|
|
||
| # Index not in list despite scrapping... Can't do anything else here | ||
| print('\nWARNING MAL_Scrapper: ' + name + ' ' + tags['link_type'] + str(tags['link_nb']) + ' not found') |
There was a problem hiding this comment.
Please use logging module for warnings.
| # First steap: finding the page of the anime, using the search engine | ||
| response = requests.get('https://myanimelist.net/anime.php?q="' + name + '"') | ||
| if not response.ok: | ||
| print('\nWARNING MAL_Scrapper: can\'t connect to MyAnimeList') |
There was a problem hiding this comment.
Please use logging module for warnings.
Moreover, if response.ok is false, it does not mean the connection to a server failed, but that the response code is 4** or 5**. See my previous comment for connection failure.
| """ | ||
|
|
||
| # First steap: finding the page of the anime, using the search engine | ||
| response = requests.get('https://myanimelist.net/anime.php?q="' + name + '"') |
There was a problem hiding this comment.
You should add an error management here, in case the connection with the server failed. Basically, what you have to do is to encapsulate the requests.get call within a try/except block, and catch a requests.exceptions.RequestException error. You can either return or stop the execution of the program.
| parsed = bs.BeautifulSoup(response.text, features='html.parser') | ||
| animes_list = parsed.find('div', {'class': 'js-block-list'}).table.findAll('tr') | ||
|
|
||
| for i in range(1, kept_results_max + 1): # We won't keep more than kept_results_max entries |
There was a problem hiding this comment.
Beware that you never use the first element of the list (of index 0). If this is voluntary, you should add a comment.
A more Pythonic way to iterate over a portion of a list:
for anime_row in animes_list[:kept_results_max]:
pass
I created a scrapper to get the artists of anime themes from MyAnimeList.
The scrapper is used by the anime parser, and will automatically get the data from MyAnimeList if it successfully finds the exact title of the anime it parses.
If an exact match is not found, the scrapper will ask the user which title from MAL is the correct one.
The data is then stored in a JSON file, and used again in all the subsequent parsing.
Current limitations: