Commit

Allow CDATA; do not parse script childs as raw content
giulianopz committed Sep 6, 2024
1 parent b7d83b3 commit d13a9ea
Showing 10 changed files with 540 additions and 599 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

 A Go port of Mozilla [Readability.js](https://github.com/mozilla/readability), an algorithm based on heuristics (e.g. link density, text similarity, number of images, etc.) that [just somehow work well](https://stackoverflow.com/a/4240037) and powers the [Firefox Reader View](https://support.mozilla.org/kb/firefox-reader-view-clutter-free-web-pages) offering a distraction-free reading experience for articles, blog posts, and other text-heavy web pages by removing ads, GDPR-compliant cookie banners and other unsolicited junk.

-This port uses only the minimal DOM parser bundled with the original lib, resorting to the Go stdlib (`net/html`) in case of a failure. The source code is aligned with the latest commit ([97db40b](https://github.com/mozilla/readability/commit/97db40ba035a2de5e42d1ac7437893cf0da31d76)) on the main branch.
+The source code is aligned with the latest commit ([97db40b](https://github.com/mozilla/readability/commit/97db40ba035a2de5e42d1ac7437893cf0da31d76)) on the main branch.


 ## A Bit of History
17 changes: 13 additions & 4 deletions cmd/readability/readability.go
@@ -4,13 +4,17 @@ import (
 	"flag"
 	"fmt"
 	"io"
+	"log/slog"
 	"net/http"
 	"os"

 	"github.com/giulianopz/go-readability"
 )

-var output string
+var (
+	output  string
+	verbose bool
+)

 func handle(err error) {
 	if err != nil {
@@ -27,10 +31,15 @@ func main() {

 	flag.StringVar(&output, "output", "text", "the result output format: 'text' or 'html'")
 	flag.StringVar(&output, "o", "text", "the result output format: 'text' or 'html'")
+	flag.BoolVar(&verbose, "verbose", false, "enable logs")
+	flag.BoolVar(&verbose, "v", false, "enable logs")
 	flag.Parse()

-	url := flag.Arg(0)
+	if !verbose {
+		slog.SetDefault(slog.New(slog.NewTextHandler(io.Discard, nil)))
+	}

+	url := flag.Arg(0)
 	if url == "" {
 		exit("missing url")
 	}
@@ -41,14 +50,14 @@ func main() {
 	bs, err := io.ReadAll(resp.Body)
 	handle(err)

-	parser, err := readability.New(string(bs), url, readability.LogLevel(-1))
+	parser, err := readability.New(string(bs), url)
 	handle(err)

 	res, err := parser.Parse()
 	handle(err)

 	if output == "html" {
-		fmt.Print(res.Content)
+		fmt.Print(res.HTMLContent)
 	} else {
 		fmt.Print(res.TextContent)
 	}