Description
Issue:
The en dash characters in word_count.txt cause an error when following the "Run your first Spark Job" tutorial. There are only two occurrences of this character: here, "from 1913–74.", and here, "near–bankruptcy".
To Recreate:
Using spark-2.3.2-bin-hadoop2.7 on Ubuntu 18 with pyspark/Python 2.7, installed following the instructions from lecture 5. Go to the directory where you cloned python-spark-tutorial and run the following from lecture 6:
spark-submit ./rdd/WordCount.py
The execution halts about halfway through the frequency counter with the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 4: ordinal not in range(128)
Spoiler: it's the dash. I'm not sure whether or not the non-ASCII en dash (U+2013) was intentional, so I'm posting this.
Work-Around:
I changed the two en dash characters to ASCII hyphens, "from 1913-74." and "near-bankruptcy", which solved the issue for me. A related Stack Overflow thread describes someone running into a similar problem with Python 2.7 and using the same solution.
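For anyone who would rather keep word_count.txt untouched, a minimal sketch of the underlying problem and a code-side work-around (assuming the failure happens when the script prints words to an ASCII-configured stdout, as the traceback suggests):

```python
# -*- coding: utf-8 -*-
# Sketch of the error from the issue: a word containing U+2013 (en dash)
# cannot be represented by the 'ascii' codec, which is what Python 2.7
# uses by default when printing to a non-UTF-8 terminal.
word = u"1913\u201374"  # "1913-74" written with an en dash

try:
    word.encode("ascii")  # this is the step that raises the error
except UnicodeEncodeError as exc:
    print("reproduced: %s" % exc)

# One work-around: explicitly encode to UTF-8 before writing the word
# out, instead of editing the en dashes out of word_count.txt.
encoded = word.encode("utf-8")  # bytes, safe to write to any stream
```

This only changes how the word is emitted; the word counts themselves are unaffected either way.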