Web page parsing by archival crawlers

What follows is a summary of the types of situations that make web page parsing and archiving challenging, together with approaches that have been proposed or attempted to address those challenges.

Root causes of page parsing difficulties

Parsing and archiving the content of static HTML is a straightforward and well-understood process, supported today by many application development libraries as well as a great many off-the-shelf web crawlers and archivers. Beyond static HTML, however, lie many challenges to automated page archiving.
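
As a point of reference, the following is a minimal sketch of the static case in Python, assuming the widely used requests and Beautiful Soup libraries are installed; the URL is a placeholder. A single HTTP request returns the complete page, which can then be parsed directly.

```python
# Minimal sketch of harvesting a static HTML page (assumes the third-party
# requests and beautifulsoup4 packages; the URL is a placeholder).
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.org/static-page.html")
soup = BeautifulSoup(response.text, "html.parser")

# Everything the page contains is already in the fetched HTML:
# the title, the text, and the outgoing links can be extracted directly.
title = soup.title.string if soup.title else ""
links = [a["href"] for a in soup.find_all("a", href=True)]
print(title, len(links))
```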

Some of the cases that cause difficulties include the following:

  • Client-side scripting: pages that contain code executed in the user's browser to generate content dynamically. This may happen unconditionally, or in response to mouse or keyboard events, timing events, or other conditions. Technologies used to achieve this include JavaScript, ActionScript, Flash, and AJAX. A page whose content is generated this way is also said to have a deferred representation (see the sketch after this list).
  • Server-side scripting: pages whose content is generated dynamically by the web server based on events or user actions, such that repeated accesses to the same URL may produce different results each time. Technologies used to achieve this include JavaServer Pages, PHP, Perl, Ruby, Python, ASP, and ColdFusion, among others.
  • Web forms: pages that provide a way for users to enter information into fields, which is sent to a server that returns new pages based on that information. Forms can be written in HTML, JavaScript, and other languages, and can even be generated by client-side code.
  • Image maps: pages that contain images that react to user events (e.g., mouse clicks) differently depending on where within the image they occur.
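
To illustrate the client-side scripting case, the following is a rough sketch of capturing a deferred representation by executing the page's scripts in a headless browser. It assumes Selenium with a headless Chrome driver is available; the URL and the fixed wait are placeholders, since a real crawler must detect when script execution has settled.

```python
# Sketch of capturing a page whose content is generated by client-side
# scripting (a deferred representation). Assumes Selenium with a headless
# Chrome driver; the URL and the fixed wait are placeholders.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.org/dynamic-page")
    time.sleep(5)                       # crude: give scripts time to run
    rendered_html = driver.page_source  # DOM after JavaScript execution
finally:
    driver.quit()

# rendered_html now contains content that a plain HTTP fetch (as in the
# static sketch above) would never see.
```

An archiver could store the rendered DOM alongside the raw HTTP response; the difficulty, as noted above, is that the appropriate rendering strategy depends on which client-side technologies the page uses.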

All of the technologies above can also interact and be used together. For example, web forms are often used in conjunction with server-side scripting to produce content in response to the user's input. An important reason why archival preservation of websites is difficult is that so many approaches and technologies are available for implementing the features above, and they can be combined in a nearly infinite number of ways. There is no reason to expect that a crawler designed to handle one scenario or technology can handle any other.

The ability to handle web forms is sometimes called deep-web crawling, because web forms are often used to provide access to the contents of databases of unknown size. It is an extremely difficult problem and remains an active topic of research today.
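
As a rough illustration of why this is hard, the sketch below repeatedly submits a hypothetical search form with different query terms and saves each result page. The form URL, the field name "q", and the query list are all assumptions; choosing input values that actually surface a database's contents is precisely the open research problem.

```python
# Sketch of deep-web harvesting through a search form. The form URL, the
# field name "q", and the query terms are hypothetical; a real deep-web
# crawler must discover forms and generate productive input values itself.
import requests

FORM_URL = "https://example.org/search"
QUERY_TERMS = ["archive", "crawler", "preservation"]

for term in QUERY_TERMS:
    # Assume the form submits a single GET parameter named "q".
    response = requests.get(FORM_URL, params={"q": term})
    with open(f"result_{term}.html", "w", encoding="utf-8") as fh:
        fh.write(response.text)
```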

Software systems that support dynamic page content

Deep-web crawling systems