Some ideas for very useful and helpful functions. #234

Open
ZeroDot1 opened this issue Apr 7, 2021 · 20 comments

@ZeroDot1
Contributor

ZeroDot1 commented Apr 7, 2021

  • Add a URL/Domain extractor.
    With this function it should be possible to extract all URLs/domains from any text and save them to a file so that they can be easily checked at a later time without significant time and effort.
    Simply a useful function for blacklist developers.

  • Add a search function (yes I know you can do that with Linux, it would just be very handy to be able to do everything with one program).
    With the search function it should be possible to find all URLs/domains in a file that contain a given string and save them to another file. It should also be possible to search for several keywords at once, separated by commas.
    This function is also very helpful for blacklist developers.
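A minimal sketch of the two requested functions, assuming a plain regex approach and line-by-line streaming so that large inputs never have to fit in memory. The patterns and the names `extract`, `search`, `parse_keywords` and `save` are made up for illustration; none of this is PyFunceble code.

```python
import re
from typing import Iterable, Iterator, List

# Assumed patterns for illustration only; they are not PyFunceble's own rules.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
DOMAIN_RE = re.compile(
    r"(?<![\w.-])(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}(?![\w-])",
    re.IGNORECASE,
)


def extract(lines: Iterable[str]) -> Iterator[str]:
    """Yield every URL and every domain-like token, one input line at a time."""
    for line in lines:
        for match in URL_RE.finditer(line):
            yield match.group(0)
        for match in DOMAIN_RE.finditer(line):
            yield match.group(0).lower()


def parse_keywords(raw: str) -> List[str]:
    """Split a comma-separated keyword string and drop empty parts."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def search(entries: Iterable[str], raw_keywords: str) -> Iterator[str]:
    """Yield each entry that contains at least one keyword (case-insensitive)."""
    keywords = parse_keywords(raw_keywords)
    for entry in entries:
        entry = entry.strip()
        if entry and any(keyword in entry.lower() for keyword in keywords):
            yield entry


def save(entries: Iterable[str], destination: str) -> int:
    """Write unique entries to `destination`, one per line; return the count."""
    seen = set()
    with open(destination, "w", encoding="utf-8") as dst:
        for entry in entries:
            if entry not in seen:
                seen.add(entry)
                dst.write(entry + "\n")
    return len(seen)
```

The pieces chain together: opening a source file with `errors="replace"` and calling `save(search(extract(src), "ads,track"), "hits.txt")` would stream the input and keep only matching entries (both the keywords and the file name are placeholders). The URL pattern keeps trailing punctuation, and the `seen` set still grows with the number of unique hits, so this is a starting point rather than a finished tool.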

@keczuppp

keczuppp commented Apr 7, 2021

There exist tools for it already:

ZeroDot1 : Add a URL/Domain extractor.

ZeroDot1 : Add a search function

@ZeroDot1
Contributor Author

ZeroDot1 commented Apr 7, 2021

There exist tools for it already:

Most of the online tools are simply unusable because they do not support the extraction of entire domains with subdomains and long TLDs such as .stream.
In addition, most tools are very limited, e.g. by a size limit on input files.

@keczuppp

keczuppp commented Apr 8, 2021

ZeroDot1 : Most of the online tools are simply unusable because they do not support the extraction of entire domains with subdomains and long TLDs such as .stream.

@ZeroDot1
Contributor Author

ZeroDot1 commented Apr 8, 2021

  • this one looks good

Sorry, no the tool does not work.
I have tested it with different text inputs. From an HTML page with over 95 links, only 32 URLs/domains were extracted.

@keczuppp

keczuppp commented Apr 8, 2021

ZeroDot1 : Sorry, no the tool does not work.

It does work very well, but it extracts only domains, not URLs.
I just missed that you wanted to extract URLs as well as domains.

Then there are several browser addons I use myself to extract domains/URLs from a webpage:
https://addons.mozilla.org/pl/firefox/addon/web-link-extractor/
https://addons.mozilla.org/pl/firefox/addon/link-gopher/
https://chrome.google.com/webstore/detail/link-gopher/bpjdkodgnbfalgghnbeggfbfjpcfamkf
https://chrome.google.com/webstore/detail/link-grabber/caodelkhipncidmoebgbbeemedohcdma?hl=pl

@ZeroDot1
Contributor Author

ZeroDot1 commented Apr 9, 2021

It does work very well,

None of this is what I need.
Your suggested tools are nothing more than online helpers.
I need tools that work completely offline.
I can't process 2GB text files with any of your suggestions.

@keczuppp

keczuppp commented Apr 9, 2021

ZeroDot1 : 2GB text files

Wow.

Are you sure you don't want to split the file into smaller chunks?:
https://stackoverflow.com/questions/18208524/how-do-i-read-a-text-file-of-about-2-gb
https://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files

Also this is a good tool I use personally sometimes:
https://www.digitalvolcano.co.uk/textcrawler.html

What does it do?

TextCrawler is a fantastic tool for anyone who works with text files. This powerful program enables you to instantly find and replace words and phrases across multiple files and folders. It utilises a flexible Regular Expression engine to enable you to create sophisticated searches, preview replace, perform batch operations, extract text from files and more. It is fast and easy to use, and as powerful as you need it to be.

@spirillen
Contributor

I can't process 2GB text files with any of your suggestions.

😆 🤣

Try Linux 😜

That said, there is a Python module that can do this, since you don't have (e)grep available out of the box.
My question to you (@ZeroDot1) is: are your source files in any kind of "standard" format, or would it be better to convert them first into some standard format that PyFunceble can then process?

@keczuppp

keczuppp commented Apr 10, 2021

Yeah, I'm curious as well, how did he end up with a 2GB text file...

  • is it a log that accumulated over time (weeks/months)
  • or is it a file combined from many smaller files (filter lists, HTML web pages, logs, etc.)

I just can't imagine how a single 2GB text file could ever be created under normal conditions.
I don't think many people need this; it is rather a premium feature request needed by individuals for specific tasks, not something many people would benefit from. However, I'm always open-minded.

spirillen : Try Linux

Perhaps you were just joking; anyway, it seems he doesn't want to use Linux tools for this:

ZeroDot1 : #234 (comment) : (yes I know you can do that with Linux, it would just be very handy to be able to do everything with one program).

@ZeroDot1
Contributor Author

I use Linux most of the time.
Yes, this is really a very special function, but I think it will be very useful and helpful for me and for others.

The data are completely mixed files that are combined into one file.
The function should simply be able to read all text independent of the file format, because it makes very little sense with an extraction function to limit the function to specific file formats.

I think that with just a few changes to PyFunceble these functions should be easy to implement.

With PyFunceble it is already possible to read RAW files directly from the internet. If the function could be modified so that any file or, e.g., a website can be given as the source and only the URLs/domains are extracted, it would be very useful and helpful.

At the moment the function is simply limited to RAW files.

@ZeroDot1
Contributor Author

ZeroDot1 commented Apr 10, 2021

Try Linux

I have used Linux since 1999.

By the way, Linux was the first system I used. I might use Linux until one day I can look at the radishes from below :D

@keczuppp

keczuppp commented Apr 10, 2021

As for domains extraction:

As I said in #13 (comment): when extracting domains with the Adblock Decoder, the "Decode everything" mode gives too many useless false positives, which clutter the output list and turn it into a garbage dump.

There is a risk the same might happen with a 2GB mixed file, even if it doesn't contain Adblock Filter lists, or if you are lucky, it might not, but it depends on the content.

As for URLs extraction:

Can have false hits as well:
https://pypi.org/project/urlextract/
https://mathiasbynens.be/demo/url-regex


The only solution seems to be to extract everything and leave all false hits/garbage for the user to deal with.
Paraphrasing WYSIWYG ==> WYGIWYG (What You Give is What You Get)
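To show what the user-side cleanup could look like, here is a rough sketch of a cheap syntax filter that drops the most obvious false hits (file names, version numbers, malformed labels) before anything is tested. The name `looks_like_domain` and the small extension deny-list are assumptions made up for this example; a real tool would more likely consult the IANA TLD list, and this check would reject punycode TLDs such as `xn--...`.

```python
import re

# Hypothetical deny-list: file extensions that regex extractors often
# mistake for TLDs. It is deliberately tiny and only illustrative.
FILE_EXTENSIONS = {"png", "jpg", "gif", "css", "js", "html", "php", "txt"}

# One DNS label: 1-63 characters, letters/digits/hyphens, no hyphen at the edges.
LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def looks_like_domain(candidate: str) -> bool:
    """Cheap syntax check only: no network access, no availability test."""
    name = candidate.lower().rstrip(".")
    labels = name.split(".")
    if len(name) > 253 or len(labels) < 2:
        return False
    tld = labels[-1]
    if not tld.isalpha() or len(tld) < 2 or tld in FILE_EXTENSIONS:
        return False
    return all(LABEL_RE.fullmatch(label) is not None for label in labels)
```

A filter like this cannot tell whether a name is registered or reachable; it only reduces the amount of garbage handed to the slower checks.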

@keczuppp

keczuppp commented Apr 11, 2021

Try this offline tool https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Web-Link-Extractor-Linas.shtml,
I've tested it on EasyList and:

  • the result contains some garbage: result.zip
  • you can test your 2 GB file with it to see whether the tool can handle such big files and how much garbage it produces
  • by the way, this tool has the search/filter feature you requested in your initial comment: it can extract only the URLs that contain given phrases

@ZeroDot1
Contributor Author

Try this offline tool https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Web-Link-Extractor-Linas.shtml,

I cannot use this tool; I use Linux as my system.

I don't have a VM at the moment because I don't have any free space and I can't buy a new SSD at the moment.

Somewhere in my archive I also have a self-written C# program that extracts all possible URLs.
I can't use that right now, and a solution that works directly on Linux from the command line would be best.
I think the best solution would be to use PyFunceble to extract domains and subdomains from completely mixed text.

@funilrys What do you think about this idea? Would it be possible?

@keczuppp

keczuppp commented Apr 13, 2021

And what about Wine?

ZeroDot1 : I think the best solution would be to use PyFunceble to extract domains and subdomains from completely mixed text.

It can be done, but the result will contain many false hits / rubbish, you will have to waste time by:

  • cleaning the output on your own
  • or waiting until PyFunceble checks their availability / syntax validity

@spirillen
Contributor

@ZeroDot1 wrote:

Try Linux

I have used Linux since 1999.

By the way, Linux was the first system I used. I might use Linux until one day I can look at the radishes from below :D

I know, I must have been tired, as I confused you with someone else 😃 😪 You are on Arch, I know...

@ZeroDot1 wrote:
With PyFunceble it is already possible to read RAW files directly from the internet. If the function could be modified so that any file or, e.g., a website can be given as the source and only the URLs/domains are extracted, it would be very useful and helpful.

Sounds like an integration of BeautifulSoup could come in handy!!!
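For the website-as-source case, a rough sketch of what link extraction from HTML could look like. BeautifulSoup is a third-party package, so to stay dependency-free this example swaps in `html.parser` from the Python standard library as a stand-in; the names `LinkCollector` and `links_and_hosts` are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes while parsing HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def links_and_hosts(html: str) -> Tuple[List[str], List[str]]:
    """Return all links in document order plus the sorted unique hostnames."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    hosts = set()
    for link in collector.links:
        try:
            host = urlsplit(link).hostname
        except ValueError:
            # Malformed URLs (e.g. a broken IPv6 literal) are skipped.
            continue
        if host:
            hosts.add(host)
    return collector.links, sorted(hosts)
```

This only sees links that are present as attributes in the markup; URLs that appear as plain text or are built by scripts would still need a text-based extractor.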

This said, I do understand why you (@ZeroDot1) would like to integrate it into @PyFunceble directly.

This is not an objection, but a thought about the big picture: would it be handier to write this as a standalone program that can extract all URLs/domains from any source?
Why: I could use such a tool to extract URLs/domains when I'm working on the Adult Contents project, saving me a bunch of time compared to extracting sources through @gorhill's uBlock Origin logger. This tool could then use the @PyFunceble API to test on the fly.

What do you others think? @ZeroDot1 @keczuppp @mitchellkrogza @funilrys

@keczuppp wrote: It can be done, but the result will contain many false hits / rubbish, you will have to waste time by:

Not true if you are using proper code bases

@keczuppp

keczuppp commented Apr 16, 2021

spirillen : Sounds like an integration of BeautifulSoup could come in handy!!!

I saw it before, but I didn't mention it because it supports only HTML and XML, which is not what the OP requested; he asked for extraction from any text.

keczuppp wrote: It can be done, but the result will contain many false hits / rubbish, you will have to waste time by:

spirillen : Not true if you are using proper code bases

1

I already mentioned this before (where I wrote "might / can" instead of "will"):

keczuppp : #234 (comment) : There is a risk the same might happen with a 2GB mixed file, even if it doesn't contain Adblock Filter lists, or if you are lucky, it might not, but it depends on the content.

keczuppp : #234 (comment) : Can have false hits as well

2

You quoted my statement out of context. In my comment #234 (comment), my statement sits directly below the quote it refers to, which means I was referring to that quote and not speaking generally. The quote says "ZeroDot1 : completely mixed text.", where "mixed" most likely means "random", which is the opposite of "spirillen: use proper (prepared/custom) code base". Hence I can't agree with you saying "not true" in this case. But yes, speaking generally rather than about the quote, it is possible to avoid false hits if you cherry-pick the input content (as I already mentioned before).

spirillen : What do you others think?

Maybe it can be like Adblock Decoder and exist both ways: integrated into PyFunceble and as a standalone tool.

@spirillen
Contributor

spirillen commented Jun 11, 2021

Hey @ZeroDot1

As I re-read your suggestion, I'm sitting here thinking:
What would you prefer?

  1. A separate tool that can extract a random list and then test its entries through the PyFunceble API
  2. A tool integrated into PyFunceble
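A sketch of how option 1 could be wired together, under stated assumptions: the extractor hands over candidates, duplicates are dropped, and each unique subject goes to a checker. The `check` parameter is a hypothetical callable standing in for whatever the PyFunceble API exposes; the real API call is deliberately not shown here, and `check_all` is a made-up name.

```python
from typing import Callable, Dict, Iterable


def check_all(candidates: Iterable[str], check: Callable[[str], str]) -> Dict[str, str]:
    """Normalise and deduplicate candidates, then check each subject once.

    `check` is a placeholder for the real test (for example a thin wrapper
    around the PyFunceble API); it receives a subject and returns a status.
    """
    results: Dict[str, str] = {}
    for candidate in candidates:
        subject = candidate.strip().lower()
        if subject and subject not in results:
            results[subject] = check(subject)
    return results
```

Keeping the checker injectable means the same extraction code would work as a standalone tool and as an integrated one; only the callable passed in changes.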

Related to: Adult Contents submit program #python, New url location: https://mypdns.org/my-privacy-dns/porn-records/-/issues/59

@funilrys
Owner

The search "function" won't be my priority. Other tools should be able to handle it better.

But some dedicated tools which proxy our internal decoders (like the adblock-decoder) may be provided in the future.

Let's keep this open.

@funilrys
Owner

btw, the PyFunceble Web Worker project provides some endpoints for the decoding/conversion of inputs.

It basically exposes the (internal) converter of PyFunceble behind a web server / API. I can't and don't want to host such a service (yet), but it can be a good alternative for some people ... I'm still ready to fix the issues reported there though.

