Problem 2 for you
Reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
UTF-8, a transformation format of ISO 10646 (You only need to refer to section 3, “UTF-8 definition”).
Write a Python program that takes the absolute path of an input file as its first parameter. The program reads the file in binary mode and assumes that it contains text in big-endian UTF-16, in which every 2 bytes correspond to one character and directly encode that character's Unicode code point. The program encodes each character in UTF-8 (between 1 and 3 bytes) and writes the encoded bytes to a file called utf8encoder_out.txt. If you use Python 2.7, name your program utf8encoder.py; if you use Python 3.4, name your program utf8encoder3.py. For example, your program will be expected to handle:
> python utf8encoder.py /path/to/input
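The reading and writing steps described above could be sketched as follows. This is only one possible layout, not a required structure: the helper names split_code_units, read_code_units and write_bytes are hypothetical, and the sketch relies on the assignment's assumption that every 2 bytes form one big-endian code unit. It is written to run under both Python 2.7 and Python 3.

```python
import struct

# Output file name given in the assignment text.
OUTPUT_NAME = "utf8encoder_out.txt"


def split_code_units(data):
    """Split big-endian UTF-16 bytes into a list of 16-bit integers.

    Assumes, as the assignment does, that every 2 bytes hold one
    code point. A trailing odd byte, if any, is ignored.
    """
    count = len(data) // 2
    # ">" selects big-endian, "H" is an unsigned 16-bit integer.
    return list(struct.unpack(">%dH" % count, data[:count * 2]))


def read_code_units(path):
    """Read a file in binary mode and return its code points."""
    with open(path, "rb") as f:
        return split_code_units(f.read())


def write_bytes(path, byte_values):
    """Write a list of integers in the range 0..255 as raw bytes."""
    with open(path, "wb") as f:
        f.write(bytearray(byte_values))
```

In a complete program, the input path would come from sys.argv[1], each code point returned by read_code_units would be converted to its UTF-8 byte values, and the combined list would be passed to write_bytes(OUTPUT_NAME, ...).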
You can test your program by running it on the following input files and comparing the program output to the corresponding output files.
english_in.txt, english_out.txt
arabic_in.txt, arabic_out.txt
gujarati_in.txt, gujarati_out.txt
The actual test will be done with similar files.
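One way to compare the program output with an expected output file is a byte-by-byte check that reports where the two first differ. The sketch below is a suggestion only; the function names first_difference and compare_files are hypothetical and not part of the assignment.

```python
def first_difference(actual, expected):
    """Return the index of the first differing byte, or -1 if identical."""
    a_bytes = bytearray(actual)
    e_bytes = bytearray(expected)
    for i, (a, e) in enumerate(zip(a_bytes, e_bytes)):
        if a != e:
            return i
    if len(a_bytes) != len(e_bytes):
        # One sequence is a prefix of the other.
        return min(len(a_bytes), len(e_bytes))
    return -1


def compare_files(actual_path, expected_path):
    """Compare two files in binary mode; see first_difference."""
    with open(actual_path, "rb") as f:
        actual = f.read()
    with open(expected_path, "rb") as f:
        expected = f.read()
    return first_difference(actual, expected)
```

For example, after running the program on english_in.txt, compare_files("utf8encoder_out.txt", "english_out.txt") returns -1 when the two files match exactly.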
Notes:
You may assume that the input is proper UTF-16 and encodes proper characters, so there's no need to test the input or worry about the security issues that are brought up in the UTF-8 specification.
The point of the exercise is encoding the characters. You may use external resources to learn how to read, process and write binary strings in Python, but computation of the actual byte values must be your own work.
You may use any type of arithmetic for computing the byte values (bitwise, integer, floating point, etc.) as long as it gives the correct results.
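To illustrate the last note, the sketch below computes the same UTF-8 byte values in two ways, once with bitwise operators and once with integer division and remainder. It follows the bit layout for 1-, 2- and 3-byte sequences from the "UTF-8 definition" section of the reading, and it assumes code points in the range 0x0000 to 0xFFFF, as the assignment states. The function names are hypothetical, and this is an illustration of the arithmetic rather than a reference solution.

```python
def encode_bitwise(cp):
    """Return the UTF-8 byte values for a code point below 0x10000."""
    if cp < 0x80:
        # 0xxxxxxx
        return [cp]
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    # 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


def encode_integer(cp):
    """Same result using only integer division and remainder."""
    if cp < 128:
        return [cp]
    if cp < 2048:
        return [192 + cp // 64, 128 + cp % 64]
    return [224 + cp // 4096,
            128 + (cp // 64) % 64,
            128 + cp % 64]
```

Shifting right by 6 bits is the same as integer division by 64, and masking with 0x3F is the same as taking the remainder modulo 64, which is why both functions agree for every input in the assumed range.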