Crawling
The Crawling component extracts HTML documents from a set of defined Web sources (forums, portals) in order to gather documents that potentially contain threats.
The component's activity begins with an initial list of addresses pointing to the different sources. This list is then iteratively extended with hyperlinks found on previously extracted pages, and the newly added addresses are processed in the same way.
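The sketch below illustrates this iterative expansion of the address list in a simplified form. It assumes a plain breadth-first traversal using the requests library and a naive href pattern, and it omits the class-based filtering, revisit scheduling and politeness policy described in the rest of this section; the function and variable names are illustrative, not part of the actual implementation.

```python
# Simplified sketch of the iterative frontier expansion (illustrative only).
import re
from collections import deque
from urllib.parse import urljoin

import requests

HREF_RE = re.compile(r'href="([^"#]+)"')  # naive hyperlink pattern

def crawl(seed_addresses, max_pages=100):
    frontier = deque(seed_addresses)      # initial list of source addresses
    seen = set(seed_addresses)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = requests.get(url, timeout=10).text
        pages[url] = html                 # keep the extracted document
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)     # resolve relative links
            if link not in seen:          # extend the list with new addresses only
                seen.add(link)
                frontier.append(link)
    return pages
```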
In order to restrict extraction to documents relevant to the system user, a class is assigned to each extracted web page. Links are extracted from a page according to its class: for each class there are rules defining which part of the page contains relevant links and what their structure is (we use regular expressions and XPath for this purpose). Examples of page classes in a forum-like web source are "thread" or "list threads".
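The following sketch shows how such class-specific rules could be expressed, combining an XPath expression that selects the relevant page region with a regular expression constraining the link structure. The class names are taken from the example above, but the concrete XPath and regex patterns are assumptions; the real rule set depends on the structure of each source.

```python
# Class-specific link-extraction rules (patterns are illustrative assumptions).
import re
from lxml import html as lxml_html

RULES = {
    "list threads": {
        "xpath": '//div[@class="threadlist"]//a/@href',     # region with thread links
        "pattern": re.compile(r"/thread/\d+"),               # expected link structure
    },
    "thread": {
        "xpath": '//div[@class="pagination"]//a/@href',      # links to further pages
        "pattern": re.compile(r"/thread/\d+\?page=\d+"),
    },
}

def extract_links(page_source, page_class):
    rule = RULES[page_class]
    tree = lxml_html.fromstring(page_source)
    hrefs = tree.xpath(rule["xpath"])                         # restrict to relevant region
    return [h for h in hrefs if rule["pattern"].search(h)]    # keep matching links only
```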
Pages are extracted incrementally, i.e. once downloaded, they are revisited repeatedly to extract new content. The frequency of visits to a page depends on its class and on the likelihood that its content has changed. For each page, the parameters of the probability distribution of content changes are estimated from the history of observed changes. When this probability (recalculated after a certain time) exceeds a threshold, the address is added to the queue and the page is visited at the earliest possible time, after which the distribution parameters are updated.
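A minimal sketch of this revisit scheduling is given below. It adopts one common modelling assumption that is not stated in the text: page changes follow a Poisson process, so the probability of a change after time dt is 1 - exp(-lambda * dt), with the rate lambda estimated from the observed change history; the smoothing factor and threshold are likewise illustrative.

```python
# Revisit scheduling sketch under an assumed Poisson change model.
import math
import time

class RevisitScheduler:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.last_visit = {}       # url -> timestamp of the last extraction
        self.change_rate = {}      # url -> estimated changes per second

    def record_visit(self, url, changed, now=None):
        now = now or time.time()
        prev = self.last_visit.get(url)
        if prev is not None:
            elapsed = max(now - prev, 1.0)
            observed = (1.0 if changed else 0.0) / elapsed
            old = self.change_rate.get(url, observed)
            self.change_rate[url] = 0.8 * old + 0.2 * observed  # update from history
        self.last_visit[url] = now

    def due_for_visit(self, url, now=None):
        now = now or time.time()
        rate = self.change_rate.get(url)
        if rate is None:
            return True                               # no history yet: visit the page
        dt = now - self.last_visit[url]
        p_changed = 1.0 - math.exp(-rate * dt)        # probability of changed content
        return p_changed >= self.threshold            # add to the queue when exceeded
```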
When accessing the source pages we apply a politeness policy: a minimum interval between retrievals from the same domain is enforced in order to avoid overloading the Web servers.
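The politeness policy can be realised as a per-domain delay, as in the sketch below; the 2-second interval is an illustrative value, not one taken from the system.

```python
# Per-domain politeness delay (interval value is an assumption).
import time
from urllib.parse import urlparse

class PolitenessPolicy:
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.last_request = {}            # domain -> timestamp of last retrieval

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_request.get(domain)
        if last is not None:
            remaining = self.min_interval - (time.time() - last)
            if remaining > 0:
                time.sleep(remaining)     # delay to avoid overloading the server
        self.last_request[domain] = time.time()
```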
The extracted HTML documents are stored in a suitable form in a file repository. During the crawling process, a graph of pages and links reflecting the source structure is built and stored in a database, together with the current probability distribution of content changes for each known address.
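One possible shape of this page-and-link graph is sketched below using an SQLite database and a deliberately simplified schema; the actual repository layout and database schema are not specified in the text.

```python
# Sketch of a page/link graph store (SQLite and schema are assumptions).
import sqlite3

def init_store(path="crawl_graph.db"):
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS pages (
            url         TEXT PRIMARY KEY,
            class       TEXT,               -- page class, e.g. "thread"
            file_path   TEXT,               -- location of the stored HTML document
            last_visit  REAL,               -- timestamp of the last extraction
            change_rate REAL                -- parameter of the change distribution
        );
        CREATE TABLE IF NOT EXISTS links (
            src TEXT REFERENCES pages(url), -- page containing the hyperlink
            dst TEXT REFERENCES pages(url), -- page the hyperlink points to
            PRIMARY KEY (src, dst)
        );
    """)
    return conn
```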