Why there are no descriptions? #1109
Comments
I agree. For technologies it is fine as it is, and for companies too, but for the individual categories at least a short description should be added. I do not know 100% of them, and checking each site one by one is a nightmare. I am not sure whether it is legal, but something like this should do the job:

```python
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

# Initialize a summarization pipeline
summarizer = pipeline("summarization")

def crawl_and_summarize(url):
    # Crawl the webpage
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract main content; could be more specific based on site structure
    text = soup.get_text()
    # Summarize the text (truncate input to the model's max length)
    summary = summarizer(text, max_length=130, min_length=30,
                         do_sample=False, truncation=True)
    return summary[0]['summary_text']

def read_urls_from_file(file_path):
    with open(file_path, 'r') as file:
        return [line.strip() for line in file if line.strip()]

# File containing URLs, one per line
file_path = 'urls.txt'

# Read URLs from the file
urls = read_urls_from_file(file_path)

# Crawl and summarize each URL
for url in urls:
    try:
        summary = crawl_and_summarize(url)
        print(f"URL: {url}\nSummary: {summary}\n")
    except Exception as e:
        print(f"Error processing {url}: {e}")
```

Remember to place the `urls.txt` file next to the script.
Can descriptions be added? Otherwise, this collection of URLs (albeit alphabetized) has little use. Maybe a script that goes through them and retrieves `document.title` would do?
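A minimal sketch of that idea, kept in Python for consistency with the script above: instead of running a summarizer, fetch each page and pull the text of its `<title>` tag (the server-side equivalent of `document.title`). The `urls.txt` file name and the `extract_title`/`fetch_title` helpers are assumptions for illustration, not part of the repository.

```python
import requests
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the <title> text from an HTML string, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None

def fetch_title(url):
    """Fetch a page and return its title."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_title(response.text)

if __name__ == "__main__":
    # Hypothetical input file: one URL per line, as in the script above
    with open("urls.txt") as file:
        urls = [line.strip() for line in file if line.strip()]
    for url in urls:
        try:
            print(f"{url}: {fetch_title(url)}")
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
```

This is far cheaper than summarization and needs no ML dependencies, though many sites have uninformative titles, so a one-line manual description per entry may still be the better outcome.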