Update README.md
chiachienhung authored Apr 25, 2023
1 parent f159228 commit 0028008
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions LangCC/README.md
@@ -4,7 +4,7 @@ This dataset is created for intermediate training purpose, in order to encode kn
You can simply download the full data from [here](https://drive.google.com/drive/folders/18WSQQp6omwgR1KHzmyw13KQBBdd3voHB?usp=sharing) or you can modify the scripts for your own usage.


-## Download files from OpenSubtitles
+## Download files from CC-100
```
* Chinese
wget https://data.statmt.org/cc-100/zh-Hans.txt.xz -O zh-Hans.txt.xz
@@ -23,4 +23,4 @@ wget https://data.statmt.org/cc-100/de.xz -O de.txt.xz
```
python langcc_extract.py --input_lang_file "./de.txt.xz" --save_file_name "./langcc/cc_de_500K.txt" --max_line 500000
python langcc_prep.py --input_lang_file="./langcc/cc_de_500K.txt" --save_train_file_name="./langcc/cc_de_train_200K.txt" --save_test_file_name="./langcc/cc_de_test_10K.txt"
-```
+```
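
The diff does not show `langcc_extract.py` itself, so the following is only a minimal sketch of what an extraction step like the one invoked above might do, assuming it streams the `.xz` file and keeps the first `--max_line` lines. The function name `extract_first_lines` and the choice to skip blank lines are assumptions made for illustration, not the repository's actual implementation.

```python
import lzma


def extract_first_lines(input_path, output_path, max_lines):
    """Copy at most max_lines non-empty lines from an .xz text file.

    Hypothetical helper: the real langcc_extract.py may filter or
    select lines differently.
    """
    written = 0
    # lzma.open in text mode decompresses as it reads, so the full
    # corpus never has to fit in memory.
    with lzma.open(input_path, "rt", encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if written >= max_lines:
                break
            text = line.strip()
            if not text:
                # Assumption: blank lines carry no training text, so skip them.
                continue
            dst.write(text + "\n")
            written += 1
    return written
```

Under these assumptions, `extract_first_lines("./de.txt.xz", "./langcc/cc_de_500K.txt", 500000)` would roughly correspond to the first command above; the `./langcc` output directory has to exist beforehand.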
