Some ideas for very useful and helpful functions. #234

ZeroDot1 · 2021-04-07T13:05:05Z

Add a URL/Domain extractor.
With this function it should be possible to extract all URLs/domains from any text and save them to a file so that they can be easily checked at a later time without significant time and effort.
Simply a useful function for blacklist developers.
Add a search function (yes, I know you can do that with Linux tools; it would just be very handy to be able to do everything with one program).
With the search function it should be possible to find all URLs/domains in a file that contain a given string and save them to another file. It should also be possible to search for several keywords at once by listing them separated by commas.
This function is also very helpful for blacklist developers.
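
For illustration, a minimal sketch of the requested keyword search, assuming one URL/domain per line in the source file and a case-insensitive substring match. The function names (`parse_keywords`, `filter_lines`, `search_file`) are hypothetical and are not part of PyFunceble.

```python
from typing import Iterable, Iterator, List


def parse_keywords(raw: str) -> List[str]:
    """Split a comma-separated keyword string, dropping blank entries."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def filter_lines(lines: Iterable[str], keywords: List[str]) -> Iterator[str]:
    """Yield every non-empty line that contains at least one keyword."""
    for line in lines:
        entry = line.strip()
        if entry and any(keyword in entry.lower() for keyword in keywords):
            yield entry


def search_file(source_path: str, target_path: str, raw_keywords: str) -> int:
    """Write matching entries from source_path to target_path; return the count."""
    keywords = parse_keywords(raw_keywords)
    count = 0
    # The source is read lazily, line by line, so large inputs are not loaded at once.
    with open(source_path, encoding="utf-8", errors="replace") as source, \
            open(target_path, "w", encoding="utf-8") as target:
        for entry in filter_lines(source, keywords):
            target.write(entry + "\n")
            count += 1
    return count
```

A call such as `search_file("domains.txt", "hits.txt", "ads,track")` would then save every entry containing "ads" or "track"; both file names are placeholders.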

keczuppp · 2021-04-07T16:44:51Z

There exist tools for it already:

ZeroDot1 : Add a URL/Domain extractor.

https://www.google.pl/search?q=domain+extractor

ZeroDot1 : Add a search function

That is a job for regular expressions:
https://regexr.com/
https://regex101.com/
Also, any good text editor (like Notepad++) has regex support.

ZeroDot1 · 2021-04-07T16:59:42Z

There exist tools for it already:

Most of the online tools are simply unusable because they do not support the extraction of entire domains with subdomains and long TLDs such as .stream.
In addition, most tools are very limited, e.g. by a size limit on input files.
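
As a rough sketch of what "entire domains with subdomains and long TLDs" could mean in practice, the pattern below accepts any number of labels followed by an alphabetic TLD of 2 to 63 characters, so `.stream` passes. The regex and the name `extract_domains` are assumptions made for illustration; they are not taken from PyFunceble or from any of the tools mentioned.

```python
import re
from typing import List

# One or more dot-terminated labels, then an alphabetic TLD (2-63 characters).
# The lookarounds stop a match from starting or ending inside a longer token.
DOMAIN_RE = re.compile(
    r"(?<![A-Za-z0-9-])"
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+"
    r"[A-Za-z]{2,63}"
    r"(?![A-Za-z0-9-])"
)


def extract_domains(text: str) -> List[str]:
    """Return every domain-like string in text, lower-cased, in order of appearance."""
    return [match.group(0).lower() for match in DOMAIN_RE.finditer(text)]
```

File names such as `index.html` also fit this pattern, so the output of a purely syntactic extractor still needs to be validated.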

keczuppp · 2021-04-08T10:06:16Z

ZeroDot1 : Most of the online tools are simply unusable because they do not support the extraction of entire domains with subdomains and long TLDs such as .stream.

This one looks good and has no problem with it: https://en.rakko.tools/tools/62/. However, it has a 50,000-line limit.

ZeroDot1 · 2021-04-08T12:50:03Z

this one looks good

Sorry, no, the tool does not work.
I have tested it with different text inputs. From an HTML page with over 95 links, only 32 URLs/domains were extracted.

keczuppp · 2021-04-08T15:14:59Z

ZeroDot1 : Sorry, no the tool does not work.

It does work very well, but it extracts only domains, not URLs.
I just missed that you wanted to extract URLs as well as domains.

Then there are several browser addons I use myself to extract domains/URLs from a webpage:
https://addons.mozilla.org/pl/firefox/addon/web-link-extractor/
https://addons.mozilla.org/pl/firefox/addon/link-gopher/
https://chrome.google.com/webstore/detail/link-gopher/bpjdkodgnbfalgghnbeggfbfjpcfamkf
https://chrome.google.com/webstore/detail/link-grabber/caodelkhipncidmoebgbbeemedohcdma?hl=pl

ZeroDot1 · 2021-04-09T01:55:16Z

It does work very well,

None of this is what I need.
Your suggested tools are nothing more than online helpers.
I need tools that work completely offline.
I can't process 2GB text files with any of your suggestions.

keczuppp · 2021-04-09T07:20:51Z

ZeroDot1 : 2GB text files

Wow.

Are you sure you don't want to split the file into smaller chunks?
https://stackoverflow.com/questions/18208524/how-do-i-read-a-text-file-of-about-2-gb
https://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files
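
Instead of splitting the file on disk, a script can also read it in fixed-size chunks. The sketch below is one way to do that, assuming the candidates are separated by whitespace; `iter_tokens` and the chunk size are hypothetical choices. A token that straddles a chunk boundary is carried over to the next chunk so that it is not cut in half.

```python
import io
from typing import Iterator, TextIO


def iter_tokens(stream: TextIO, chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield whitespace-separated tokens while holding only one chunk in memory."""
    carry = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = carry + chunk
        tokens = buffer.split()
        # If the buffer does not end in whitespace, its last token may
        # continue in the next chunk, so hold it back for now.
        if tokens and not buffer[-1].isspace():
            carry = tokens.pop()
        else:
            carry = ""
        yield from tokens
    if carry:
        yield carry


if __name__ == "__main__":
    sample = io.StringIO("alpha.example.com beta.example.org\ngamma.example.net")
    for token in iter_tokens(sample, chunk_size=7):
        print(token)
```

Memory use stays near one chunk plus the longest single token, so a 2GB input does not have to fit in RAM unless it contains no whitespace at all.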

Also this is a good tool I use personally sometimes:
https://www.digitalvolcano.co.uk/textcrawler.html

What does it do?

TextCrawler is a fantastic tool for anyone who works with text files. This powerful program enables you to instantly find and replace words and phrases across multiple files and folders. It utilises a flexible Regular Expression engine to enable you to create sophisticated searches, preview replace, perform batch operations, extract text from files and more. It is fast and easy to use, and as powerful as you need it to be.

spirillen · 2021-04-09T16:30:04Z

I can't process 2GB text files with any of your suggestions.

😆 🤣

Try Linux 😜

That said, there is a Python module that can do this, since you don't have (e)grep available out of the box.
My question to you (@ZeroDot1) is: are your source files in any kind of "standard" format? Or would it be better to convert them first into some standard format that PyFunceble can then process?

keczuppp · 2021-04-10T07:15:29Z

Yeah, I'm curious as well how he ended up with a 2GB text file...

Is it a log that accumulated over time (weeks/months),
or is it a file combined from many smaller files (filter lists, HTML web pages, logs, etc.)?

I just can't imagine how a single 2GB text file could ever be created under normal conditions.
I don't think a lot of people need this; it is a rather specialized feature request needed by individuals for specific tasks, not something many people would benefit from. However, I always keep an open mind.

spirillen : Try Linux

Perhaps you were just joking; anyway, it seems he doesn't like to use Linux:

ZeroDot1 : #234 (comment) : (yes I know you can do that with Linux, it would just be very handy to be able to do everything with one program).

ZeroDot1 · 2021-04-10T16:53:23Z

I use Linux most of the time.
Yes, this is really a very special function, but I think it will be very useful and helpful for me and for others.

The data are completely mixed files that are combined into one file.
The function should simply be able to read all text independent of the file format, because it makes very little sense with an extraction function to limit the function to specific file formats.

I think these functions should be easy to add with just a few changes to PyFunceble.

With PyFunceble it is already possible to read RAW files directly from the internet. If the function could be modified so that any file or, e.g., a website can be given as the source and only the URLs/domains are extracted, it would be very useful and helpful.

At the moment the function is simply limited to RAW files.

ZeroDot1 · 2021-04-10T16:54:53Z

Try Linux

I have used Linux since 1999.

By the way, Linux was the first system I used. I might use Linux until one day I can look at the radishes from below :D

keczuppp · 2021-04-10T18:21:14Z

As for domains extraction:

As I said in #13 (comment): when extracting domains with Adblock Decoder, the "Decode everything" mode gives too many useless false positives, which clutter the output list and turn the output into a garbage dump.

There is a risk the same might happen with a 2GB mixed file, even if it doesn't contain Adblock Filter lists, or if you are lucky, it might not, but it depends on the content.

As for URLs extraction:

Can have false hits as well:
https://pypi.org/project/urlextract/
https://mathiasbynens.be/demo/url-regex

The only solution seems to be to extract everything and leave all false hits/garbage for the user to deal with.
Paraphrasing WYSIWYG ==> WYGIWYG (What You Give is What You Get)
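
To give an idea of how some of that garbage could be dropped before any availability test, here is a small syntax filter. It is only a sketch under stated assumptions: `looks_like_domain` is a hypothetical helper, the file-extension denylist is an incomplete sample, and the check is much simpler than a real syntax checker.

```python
import re

# A single DNS label: 1-63 characters, alphanumeric at both ends.
LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")

# Hypothetical, incomplete sample of file extensions that look like TLDs in mixed text.
FILE_EXTENSIONS = {"html", "htm", "php", "js", "css", "png", "jpg", "gif", "txt"}


def looks_like_domain(candidate: str) -> bool:
    """Return True if candidate is plausibly a domain name (syntax only)."""
    name = candidate.strip().rstrip(".").lower()
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    if len(labels) < 2:
        return False
    if not all(LABEL_RE.fullmatch(label) for label in labels):
        return False
    tld = labels[-1]
    # Reject IP-like strings and version numbers (numeric last label) and common file names.
    return not tld.isdigit() and tld not in FILE_EXTENSIONS
```

A filter like this removes only the obvious false hits; whether a surviving name really exists still has to be tested separately.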

keczuppp · 2021-04-11T13:05:50Z

Try this offline tool https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Web-Link-Extractor-Linas.shtml,
I've tested it on EasyList and:

The result contains some garbage: result.zip
You can test your 2 GB file with it to see whether the tool can handle such big files and how much garbage it produces.
By the way, this tool has the search/filter feature you requested in your initial comment: it can extract only the URLs that contain given phrases.

ZeroDot1 · 2021-04-13T13:16:39Z

Try this offline tool https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Web-Link-Extractor-Linas.shtml,

I cannot use this tool; I use Linux as my system.

I don't have a VM at the moment because I have no free disk space and can't buy a new SSD right now.

Somewhere in my archive I also have software I wrote myself in C# to extract all possible URLs.
I can't use that right now either, and a solution that works directly on Linux from the command line would be best.
I think the best solution would be to use PyFunceble to extract domains and subdomains from completely mixed text.

@funilrys What do you think about this idea? Would it be possible?

keczuppp · 2021-04-13T16:16:19Z

And what about Wine?

ZeroDot1 : I think the best solution would be to use PyFunceble to extract domains and subdomains from completely mixed text.

It can be done, but the result will contain many false hits / rubbish, you will have to waste time by:

cleaning the output on your own,
or waiting until PyFunceble checks the entries for availability / syntax validity

spirillen · 2021-04-16T12:36:32Z

@ZeroDot1 wrote:

Try Linux

I have used Linux since 1999.

By the way, Linux was the first system I used. I might use Linux until one day I can look at the radishes from below :D

I know, I must have been tired, as I confused you with someone else 😃 😪 You are on Arch, I know...

@ZeroDot1 wrote:
With PyFunceble it is already possible to read RAW files directly from the internet. If the function could be modified so that any file or, e.g., a website can be given as the source and only the URLs/domains are extracted, it would be very useful and helpful.

Sounds like an integration of BeautifulSoup could come in handy!!!

This said, I do understand why you (@ZeroDot1) would like to integrate it into @PyFunceble directly.

This is not an objection, just a thought about the big picture: would it be handier to write this as a standalone tool that can extract all URLs/domains from any source?
Why: I could use such a tool to extract URLs/domains when I'm working on the Adult Contents project, saving me a lot of time compared with extracting sources through @gorhill's uBlock Origin logger. The tool could then use the @PyFunceble API to test on the fly.

What do the rest of you think? @ZeroDot1 @keczuppp @mitchellkrogza @funilrys

@keczuppp wrote: It can be done, but the result will contain many false hits / rubbish, you will have to waste time by:

Not true if you are using proper code bases

keczuppp · 2021-04-16T16:17:34Z

spirillen : Sounds like an integration of BeautifulSoup could come in handy!!!

I saw it before, but I didn't mention it because it supports only HTML and XML, which is not what the OP requested: he asked to extract from any text.
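
The limitation can be shown with a short sketch. To stay dependency-free it uses Python's standard `html.parser` as a stand-in for BeautifulSoup (this is a swapped-in technique, not BeautifulSoup's API); `LinkCollector`, `extract_links` and `hostnames` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from start tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return all href/src values found in the markup."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def hostnames(links: List[str]) -> List[str]:
    """Return the unique host names of the links, in order of first appearance."""
    hosts: List[str] = []
    for link in links:
        try:
            host = urlsplit(link).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if host and host not in hosts:
            hosts.append(host)
    return hosts
```

A domain that appears only as plain text inside a paragraph is never seen by a parser like this, which is why markup-based extraction alone does not cover the "any text" case.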

keczuppp wrote: It can be done, but the result will contain many false hits / rubbish, you will have to waste time by:

spirillen : Not true if you are using proper code bases

1

I already covered this before (I wrote "might / can" instead of "will"):

keczuppp : #234 (comment) : There is a risk the same might happen with a 2GB mixed file, even if it doesn't contain Adblock Filter lists, or if you are lucky, it might not, but it depends on the content.

keczuppp : #234 (comment) : Can have false hits as well

2

You quoted my statement out of context. In my comment #234 (comment), my statement sits directly below the quote it refers to, which means I was referring to that quote and not speaking generally. The quote reads "ZeroDot1 : completely mixed text.", where "mixed" most likely means "random", which is the opposite of "spirillen: use proper (prepared/custom) code base". Hence I can't agree with you saying "not true" in this case. But yes, setting the quote aside and speaking generally, it is possible to avoid false hits if you cherry-pick the input content (which I also mentioned before).

spirillen : What do the rest of you think?

Maybe it can be like Adblock Decoder and exist in both forms: integrated into PyFunceble and as a standalone tool.

spirillen · 2021-06-11T11:10:56Z

Hey @ZeroDot1

As I re-read your suggestion, I'm sitting here thinking:
What would you prefer?

A separate tool that can extract a list from arbitrary input and then test the entries through the PyFunceble API
A tool integrated into PyFunceble
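
The first option could be structured roughly as below. This is only a sketch: the `check` callback stands in for whatever the PyFunceble API would provide, and neither `extract_and_check` nor the simple candidate regex is part of PyFunceble.

```python
import re
from typing import Callable, Iterable, Iterator, Tuple

# Deliberately simple candidate pattern; a real tool would need a stricter one.
CANDIDATE_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,63}", re.IGNORECASE)


def extract_and_check(
    lines: Iterable[str], check: Callable[[str], str]
) -> Iterator[Tuple[str, str]]:
    """Extract unique domain candidates and pair each with the checker's verdict."""
    seen = set()
    for line in lines:
        for match in CANDIDATE_RE.finditer(line):
            domain = match.group(0).lower()
            if domain in seen:
                continue  # each candidate is tested only once
            seen.add(domain)
            yield domain, check(domain)
```

Keeping the checker as a plain callback means the same extractor could run standalone (with a no-op checker) or be wired to PyFunceble later, which fits either of the two options.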

Related to: ~~Adult Contents submit program #python~~, New url location: https://mypdns.org/my-privacy-dns/porn-records/-/issues/59

funilrys · 2021-06-29T22:36:44Z

The search "function" won't be my priority. Other tools should be able to handle it better.

But some dedicated tools that proxy our internal decoders (like the adblock-decoder) may be provided in the future.

Let's keep this open.

funilrys · 2021-09-21T20:14:10Z

By the way, the PyFunceble Web Worker project provides some endpoints for the decoding/conversion of inputs.

It basically exposes the (internal) converter of PyFunceble behind a web server / API. I can't and don't want to host such a service (yet), but it can be a good alternative for some people... I'm still ready to fix the issues reported there, though.

spirillen mentioned this issue Sep 6, 2021

[FEATURE] Sorting and duplicate deletion of output PyFunceble/adblock-decoder#2

Open

funilrys self-assigned this Oct 7, 2022

funilrys added the enhancement label Oct 7, 2022

garry-ut99 mentioned this issue Oct 17, 2023

Text is empty while searching large (>2Gb) files stefankueng/grepWin#300

Open
