Commit

Add more reliable numbers/stats (#24)
* some numbers on how well it performs

* some numbers on how well it performs

* Update README.md

* diff between ftfy and chardet
Ousret authored Oct 11, 2019
1 parent 19d2075 commit cfa2fda
Showing 3 changed files with 635 additions and 1 deletion.
10 changes: 9 additions & 1 deletion README.md
@@ -36,7 +36,7 @@ This project offers you an alternative to **Universal Charset Encoding Detector*

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
| ------------- | :-------------: | :------------------: | :------------------: |
| `Fast` | ❌<br> | <br> | ✅ <br> |
| `Fast` | ❌<br> | <br> | ✅ <br> |
| `Universal**` ||||
| `Reliable` **without** distinguishable standards ||||
| `Reliable` **with** distinguishable standards ||||
@@ -45,6 +45,12 @@ This project offers you an alternative to **Universal Charset Encoding Detector*
| `Detect spoken language` ||| N/A |
| `Supported Encoding` | 30 | :tada: [90](https://charset-normalizer.readthedocs.io/en/latest/support.html) | 40

| Package | Accuracy | Mean per file (ns) | File per sec (est) |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 93.5 % | 126 081 168 ns | 7.931 file/sec |
| [cchardet](https://github.com/PyYoshi/cChardet) | 97.0 % | 1 668 145 ns | **599.468 file/sec** |
| charset-normalizer | **97.25 %** | 209 503 253 ns | 4.773 file/sec |
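
The last column appears to follow from the mean time per file: files per second is roughly 10^9 divided by the mean nanoseconds per file. The sketch below checks that relationship against the three rows above; the helper name `files_per_second` is an illustration and not part of any of the benchmarked packages.

```python
# Sanity check: derive "File per sec (est)" from "Mean per file (ns)".
# The figures are copied from the table; files_per_second is a hypothetical helper.
NS_PER_SECOND = 1_000_000_000

mean_ns_per_file = {
    "chardet": 126_081_168,
    "cchardet": 1_668_145,
    "charset-normalizer": 209_503_253,
}


def files_per_second(mean_ns: int) -> float:
    """Estimate throughput from the mean processing time of one file."""
    return NS_PER_SECOND / mean_ns


for package, mean_ns in mean_ns_per_file.items():
    print(f"{package}: {files_per_second(mean_ns):.3f} file/sec")
```

This is only an estimate of sequential throughput; it assumes every file costs the mean time and ignores any start-up overhead.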

<p align="center">
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://image.noelshack.com/fichiers/2019/31/5/1564761473-ezgif-5-cf1bd9dd66b0.gif" alt="Cat Reading Text" width="200"/>

@@ -119,6 +125,8 @@ What I want is to get readable text, the best I can.

In a way, **I'm brute forcing text decoding.** How cool is that ? 😎

Don't confuse the **ftfy** package with charset-normalizer or chardet. ftfy's goal is to repair already-decoded Unicode strings, whereas charset-normalizer converts raw files in an unknown encoding to Unicode.
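
To make the distinction concrete, here is a minimal standard-library sketch of the two kinds of problem. It does not use ftfy or charset-normalizer and does not reproduce how either works internally; `candidate_decodings` and `repair_mojibake` are hypothetical helpers written only for this illustration.

```python
from typing import Iterable, List


def candidate_decodings(payload: bytes, encodings: Iterable[str]) -> List[str]:
    """Detection-style problem: the input is raw bytes in an unknown encoding.

    Returns the encodings that can decode the payload without raising an error.
    """
    fitting = []
    for name in encodings:
        try:
            payload.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding table cannot fit the binary content
        fitting.append(name)
    return fitting


def repair_mojibake(text: str) -> str:
    """Repair-style problem: the input is a str that was decoded with the wrong codec.

    Handles only one case: UTF-8 bytes that were mistakenly read as Latin-1.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage, leave the string unchanged


raw = "café".encode("utf-8")          # bytes, encoding unknown to the reader
print(candidate_decodings(raw, ["ascii", "utf-8", "latin-1"]))

broken = raw.decode("latin-1")        # a str that is already garbled
print(repair_mojibake(broken))
```

The first function takes `bytes` and the second takes `str`: that difference in input type is the practical way to tell which kind of tool a given problem calls for.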

## 🍰 How

- Discard all charset encoding table that could not fit the binary content.