added readme.md
Signed-off-by: Maroun Touma <[email protected]>
touma-I committed Nov 15, 2024
1 parent b77bbe9 commit 8e71177
Showing 2 changed files with 64 additions and 5 deletions.
37 changes: 37 additions & 0 deletions transforms/universal/web2parquet/README.md
@@ -0,0 +1,37 @@
# Web Crawler to Parquet

This transform crawls the web and downloads files in real time.

This first release of the transform accepts only the following four parameters. Future releases will extend the functionality to allow the user to specify additional constraints such as mime type, domain focus, etc.


## Parameters

To configure the crawl, users need to specify the following parameters:

| parameter:type | Description |
| --- | --- |
| urls:list | list of seed URLs (e.g. ['https://thealliance.ai'] or ['https://www.apache.org/projects','https://www.apache.org/foundation']). The list can include any number of valid URLs that are not configured to block web crawlers |
| depth:int | controls the crawling depth |
| downloads:int | maximum number of downloaded files stored in the download folder. Since the crawler operates asynchronously, the retrieved files can be any subset of the visited URLs up to this count (i.e. consecutive runs can result in different files being downloaded) |
| folder:str | folder where downloaded files are stored. If the folder is not empty, new files are added, or they replace existing files with the same URL |
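
As a rough illustration of how these parameters combine (the values below are examples only; the seed URLs are taken from the table above), a shallow single-seed crawl and a broader multi-seed crawl might be configured as follows. See the next section for the additional steps needed to run the transform from a notebook.

```python
from dpk_web2parquet.transform import Web2Parquet

# Shallow crawl: one seed URL, crawl depth of 1, keep at most 10 downloaded files.
shallow_crawl = Web2Parquet(urls=['https://thealliance.ai'],
                            depth=1,
                            downloads=10,
                            folder='downloads')

# Broader crawl: two seed URLs, crawl depth of 2, keep at most 50 downloaded files,
# written to a separate folder so the two runs do not overwrite each other.
broad_crawl = Web2Parquet(urls=['https://www.apache.org/projects',
                                'https://www.apache.org/foundation'],
                          depth=2,
                          downloads=50,
                          folder='apache_downloads')
```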


## Invoking the transform from a notebook

In order to invoke the transform from a notebook, users must enable nested asynchronous I/O as follows:

    import nest_asyncio
    nest_asyncio.apply()

To invoke the transform, users need to import the transform class and call the transform() function.

Example:
```python
import nest_asyncio
nest_asyncio.apply()
from dpk_web2parquet.transform import Web2Parquet
Web2Parquet(urls=['https://thealliance.ai/'],
            depth=2,
            downloads=10,
            folder='downloads').transform()
```
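
Once the transform() call returns, the contents of the download folder can be inspected, for example with glob, which is what the accompanying notebook does:

```python
import glob

# List whatever the crawl managed to store in the 'downloads' folder.
# Because the crawler runs asynchronously, consecutive runs may produce
# a different set of files for the same seed URLs.
files = glob.glob("downloads/*")
print(f"{len(files)} files downloaded:")
for name in files:
    print(name)
```
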
32 changes: 27 additions & 5 deletions transforms/universal/web2parquet/web2parquet.ipynb
@@ -17,7 +17,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
"metadata": {},
"outputs": [],
@@ -39,7 +39,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "b6c89ac7-6824-4d99-8120-7d5b150bd683",
"metadata": {},
"outputs": [],
@@ -65,7 +65,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
"metadata": {},
"outputs": [],
@@ -88,10 +88,32 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"['downloads/thealliance_ai_core-projects-ntia_request_text.html',\n",
" 'downloads/thealliance_ai_focus-areas-advocacy_text.html',\n",
" 'downloads/thealliance_ai_blog-open-source-ai-demo-night-sf-2024_text.html',\n",
" 'downloads/thealliance_ai_contact_text.html',\n",
" 'downloads/thealliance_ai_core-projects-sb1047_text.html',\n",
" 'downloads/thealliance_ai_focus-areas-foundation-models-datasets_text.html',\n",
" 'downloads/thealliance_ai_focus-areas-hardware-enablement_text.html',\n",
" 'downloads/thealliance_ai_core-projects-trusted-evals_text.html',\n",
" 'downloads/thealliance_ai__text.html',\n",
" 'downloads/thealliance_ai_contribute_text.html',\n",
" 'downloads/thealliance_ai_community_text.html',\n",
" 'downloads/thealliance_ai_become-a-collaborator_text.html']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import glob\n",
"glob.glob(\"downloads/*\")"
