MyAnimeList Scrapper by pjeanjean · Pull Request #6 · Neraste/dakaraneko

pjeanjean · 2018-08-09T21:36:03Z

I created a scrapper to get the artists of anime themes from MyAnimeList.

The scrapper is used by the anime parser, and will automatically get the data from MyAnimeList if it successfully finds the exact title of the anime it parses.
If an exact match is not found, the scrapper will ask the user which title from MAL is the correct one.

The data is then stored in a JSON file, and used again in all the subsequent parsing.

Current limitations:

- the JSON file must be placed in the root directory of dakaraneko
- there is no setting to disable the scrapper

Neraste

This is a good start. Most of my comments are code style/code logic oriented.

The first point to improve is the saved file. If the default file is as simple as an empty dictionary, just remove it. Your code should look for such a file first and use it if it exists (meaning the script has been run at least once), or create an empty dictionary otherwise. This way, the file only exists if it is not empty. In practice, this also lets the user create a symbolic link to put the file wherever they want. Moreover, you should reconsider the structure of your saved data, as you use a list for no practical reason.
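The load-or-create logic described above could be sketched as follows. This is only an illustration, not the code of the PR: the helper names `load_data` and `save_data` are hypothetical, and the path reuses the `./mal_scrapper.json` name from the diff.

```python
import json
import os

# Path reused from the diff; open() follows a symbolic link placed here
JSON_FILE_PATH = './mal_scrapper.json'


def load_data(path=JSON_FILE_PATH):
    """Return the saved dictionary, or an empty one if no file exists yet."""
    if not os.path.exists(path):
        return {}

    with open(path, encoding='utf-8') as file:
        return json.load(file)


def save_data(data, path=JSON_FILE_PATH):
    """Write the dictionary to disk, but only when it has content."""
    if not data:
        return

    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=4, ensure_ascii=False)
```

With this shape, no default file has to be shipped with the repository, and the data is a plain dictionary rather than a list wrapping one.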

The second point concerns enabling and disabling the scrapper. With this PR, the scrapper is always on, which is not practical. Since no context is given to the file name parser by Dakara (it will be the case eventually), I think the best move is to use an environment variable to check whether the scrapper is enabled. Since the scrapper interacts with a third party (the MAL server), I think it would be safer to disable it by default.
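A minimal sketch of such an opt-in switch is shown below. The variable name `MAL_SCRAPPER_ENABLED` and the accepted values are assumptions made for the example; the review does not specify them.

```python
import os

# Hypothetical variable name; the review does not specify one
ENABLE_VARIABLE = 'MAL_SCRAPPER_ENABLED'


def is_scrapper_enabled(environ=os.environ):
    """Return True only if the variable is explicitly set to a truthy value."""
    value = environ.get(ENABLE_VARIABLE, '')
    return value.strip().lower() in ('1', 'true', 'yes', 'on')
```

When the variable is unset, the function returns False, so the scrapper stays off unless the user asks for it, for instance by prefixing the command with `MAL_SCRAPPER_ENABLED=1`.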

Neraste · 2018-08-11T11:45:37Z

@@ -0,0 +1 @@
+[{}]


Why is your data a list containing a single dictionary? You carry data[0] everywhere in your code for no practical reason.

Well, the reason is that a JSON file needs to start with an array.
So I can only include my dictionary inside an array...

This is not exact. RFC 4627 states that a JSON text is a serialized object or array. The newer RFC 7159 extends this to numbers, strings, and the literals true, false and null (cf. end of page 3). So a JSON text containing a dictionary is perfectly valid.
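This can be checked quickly with the standard json module. The entry below is a placeholder, not data from the PR.

```python
import json

# A top-level object is valid JSON text and loads as a plain dict
data = json.loads('{}')
assert data == {}

# Entries are then reached directly, without the extra data[0] indirection
data['Some Anime'] = {'OP1': 'Some Artist'}  # placeholder values
text = json.dumps(data)
print(text)
```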

Neraste · 2018-08-11T12:04:35Z

+        link = animes_list[i].find('a', {'class': 'hoverinfo_trigger'})['href']
+        found[anime] = link # We store the result and continue
+
+        if anime == name: # If we find a perfect match, that's great!


There are native Python tools for string similarity detection (difflib.SequenceMatcher) that you can use.

The reason I wanted to look for a perfect match only was to check that all the anime titles were following the MAL convention.
However, after the first scrapping, I agree that it could be a bit less strict.

Sure. I did not mean to replace the strict equality. I think you can try the proximity detection with a high ratio (at least 80 %) before prompting the user.
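A possible shape for this two-step matching is sketched below. The function name `find_match` and the case-insensitive comparison are assumptions made for the example; only difflib.SequenceMatcher and the 80 % threshold come from the discussion.

```python
from difflib import SequenceMatcher


def find_match(name, candidates, threshold=0.8):
    """Return the best candidate for name, or None if nothing is close enough."""
    # Strict equality is tried first, as in the original code
    if name in candidates:
        return name

    best, best_ratio = None, 0.0
    for candidate in candidates:
        # ratio() is a similarity score between 0.0 and 1.0
        ratio = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio

    return best if best_ratio >= threshold else None
```

Only when this function returns None would the user need to be prompted.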

Neraste · 2018-08-11T12:06:13Z

+        animes_found = list(found)
+        print()
+        for i in range(len(animes_found)):
+            print(str(i) + ": " + animes_found[i])


The feeder is an automatic tool and I do not like the idea of asking the user to do something during the process. As you stated, it behaves weirdly with the progress bar.

I mostly agree, but I also think that the user will not have to do something very often (after the first scrapping, it should rarely happen actually).
The main issue I have with removing this process is that some cases, like "Yuru Yuri S2", will become a lot more complicated to handle.

Let's keep this as it is for now. We will change this behavior later (cf. a comment of mine in the main discussion).

Neraste · 2018-08-11T18:18:53Z

Associating the scrapper with the parser was a good idea at first, but it has some drawbacks (no context available, an interactive script whereas the parser is non-interactive, difficulty turning the scrapping itself on and off). But there is room for improvement.

The scrapping and the generation of the JSON file could be separated from the parser to become a stand-alone executable script that can be easily manipulated. Within karaneko, nekommons provides a function to list a directory cleanly and nekoparse allows parsing file names.

The parser would be modified to only read this JSON file and extract data from it. Moreover, in the near future, dakara-server will be able to create songs by feeding from files or by reading such a JSON file.

Neraste · 2018-09-30T04:12:06Z

+json_file_path = './mal_scrapper.json'
+
+
+class File():


open by itself is a context manager, so you don't need this bypass class. You simply do:

    with open(filename) as file:
        content = file.read()

Neraste · 2018-09-30T04:19:11Z

+            return get_artists(tags, True)
+
+        # Index not in list despite scrapping... Can't do anything else here
+        print('\nWARNING MAL_Scrapper: ' + name + ' ' + tags['link_type'] + str(tags['link_nb']) + ' not found')


Please use the logging module for warnings.
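A minimal sketch of the same warning emitted through logging. The logger name `mal_scrapper` and the helper name are hypothetical; the message mirrors the print call in the diff.

```python
import logging

logger = logging.getLogger('mal_scrapper')  # hypothetical logger name


def warn_theme_not_found(name, link_type, link_nb):
    """Emit a warning through logging instead of printing it."""
    # %-style arguments let logging skip formatting when the level is disabled
    logger.warning('%s %s%s not found', name, link_type, link_nb)
```

The calling application then decides where warnings go and how they are formatted, which also avoids the leading newline needed to work around the progress bar.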

Neraste · 2018-09-30T04:25:15Z

+    # First steap: finding the page of the anime, using the search engine
+    response = requests.get('https://myanimelist.net/anime.php?q="' + name + '"')
+    if not response.ok:
+        print('\nWARNING MAL_Scrapper: can\'t connect to MyAnimeList')


Please use the logging module for warnings.

Moreover, response.ok being false does not mean that the connection to the server failed, but that the response status code is 4xx or 5xx. See my other comment about connection failures.

Neraste · 2018-09-30T04:32:28Z

+    """
+
+    # First steap: finding the page of the anime, using the search engine
+    response = requests.get('https://myanimelist.net/anime.php?q="' + name + '"')


You should add error handling here, in case the connection to the server fails. Basically, you have to wrap the requests.get call in a try/except block and catch requests.exceptions.RequestException. You can then either return or stop the execution of the program.
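The try/except block could look like the sketch below, which also handles error status codes. The function name `fetch_search_page`, the logger name and the 10-second timeout are assumptions made for the example, and the query is passed through `params` rather than concatenated into the URL as in the diff.

```python
import logging

import requests

logger = logging.getLogger('mal_scrapper')  # hypothetical logger name


def fetch_search_page(name):
    """Return the HTML of the MAL search page, or None on failure."""
    try:
        response = requests.get(
            'https://myanimelist.net/anime.php',
            params={'q': '"' + name + '"'},  # requests URL-encodes the query
            timeout=10,  # assumed value; without a timeout the call may hang
        )
        # Turn 4xx and 5xx status codes into a requests.exceptions.HTTPError
        response.raise_for_status()

    except requests.exceptions.RequestException as error:
        # Covers connection failures, timeouts and the HTTPError raised above
        logger.warning('Request to MyAnimeList failed: %s', error)
        return None

    return response.text
```

Returning None keeps the feeder running when MAL is unreachable; re-raising the exception would be the choice if the program should stop instead.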

Neraste · 2018-09-30T04:40:12Z

+    parsed = bs.BeautifulSoup(response.text, features='html.parser')
+    animes_list = parsed.find('div', {'class': 'js-block-list'}).table.findAll('tr')
+
+    for i in range(1, kept_results_max + 1): # We won't keep more than kept_results_max entries


Beware that you never use the first element of the list (index 0). If this is intentional, you should add a comment.

A more Pythonic way to iterate over a portion of a list:

    for anime_row in animes_list[:kept_results_max]:
        pass
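Note that the slice above starts at index 0. If skipping the first row is intended, the slice can start at 1 instead, as in this sketch. The list content is a stand-in for the table rows, and the idea that index 0 is a header row is an assumption.

```python
kept_results_max = 3  # illustrative value
animes_list = ['header', 'row 1', 'row 2', 'row 3', 'row 4']  # stand-in for the <tr> elements

# Same rows as range(1, kept_results_max + 1), without index arithmetic.
# Index 0 is skipped on the assumption that it is the table header row.
kept = animes_list[1:kept_results_max + 1]
for anime_row in kept:
    print(anime_row)
```

Unlike indexing with range, a slice does not raise IndexError when the list holds fewer rows than kept_results_max.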

pjeanjean added 11 commits August 4, 2018 13:51

Prototype for MyAnimeList artist scrapper

76ef14c

Various fixes and added automatic scrapping if theme not found

5b62564

Forgot the case where the anime is not found

f7b81a9

Mistakes were made + compatibility with the parser

63185bb

Edited to use BeautifulSoup + various fixes with artists extraction

5ed6476

Fixed typo

d26cc1b

Added subtitle to avoid issues with homonyms

9572f3c

Added some regular expressions + usage of theme id

ef5ec8d

Fixes

2f41cf7

Edited parsers to better use the scrapper

0563e77

Made JSON human readable + fixes with regexps

bd5ab80

Neraste self-requested a review August 11, 2018 11:00

Neraste added the enhancement label Aug 11, 2018

Neraste reviewed Aug 11, 2018


Various fixes + added support for alternative titles

0ff71ff

Neraste reviewed Sep 30, 2018


