Structured Extraction

The goal of structure-based extraction is to transform documents gathered from Web sources into a semi-structured form. The document is divided into parts, and its main element (e.g. the content of a forum post) is extracted together with additional information contained in the document, including its publication date, topic, and details about the post author: pseudonym, contact data, and address data.

This goal is achieved with extraction methods that are applicable to semi-structured documents with a stable template. In particular, we apply XPath expressions, which split the document into parts, and HTML transformations, which produce an XML document with clean code (free of the errors typical of Web pages).
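
As an illustration only, the XPath-based splitting described above could be sketched as follows. This is not the project's actual implementation: the page markup, the class names, the TEMPLATE mapping and the extract function are hypothetical, and the page is assumed to have already been cleaned into well-formed XML. The sketch uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical forum page, assumed to be already cleaned into well-formed XML.
PAGE = """<html><body>
<div class="nav">Home | Forum</div>
<div class="post">
  <span class="author">jkowalski</span>
  <span class="date">2012-03-14</span>
  <h2 class="topic">Selling used laptop</h2>
  <div class="content">Contact me at jk@example.com</div>
</div>
<div class="ads">Buy now!</div>
</body></html>"""

# Hypothetical template for one stable page layout:
# one XPath expression per field of the semi-structured record.
TEMPLATE = {
    "author": ".//div[@class='post']/span[@class='author']",
    "date": ".//div[@class='post']/span[@class='date']",
    "topic": ".//div[@class='post']/h2[@class='topic']",
    "content": ".//div[@class='post']/div[@class='content']",
}


def extract(page_xml, template):
    """Build a <document> record holding only the fields named in the template."""
    root = ET.fromstring(page_xml)
    record = ET.Element("document")
    for field, xpath in template.items():
        node = root.find(xpath)
        if node is not None:
            child = ET.SubElement(record, field)
            # Join all text inside the matched node and trim whitespace.
            child.text = "".join(node.itertext()).strip()
    return record


record = extract(PAGE, TEMPLATE)
print(ET.tostring(record, encoding="unicode"))
```

Because the record is built only from the template's fields, the navigation and advertisement blocks of the page never reach the output. A page that does not follow the template yields a record with missing fields, which is why the approach depends on a stable template.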

The structured extraction phase is a preprocessing step for the document content that increases the effectiveness of later stages. For example, extracting information about the post author makes the acquisition of contact and address data more precise than applying raw natural language processing techniques to the unprocessed textual content of the HTML page. Structured extraction also increases the overall effectiveness of the system because it filters out content that is uninteresting from the system users' point of view. Users monitoring for potential threats are not interested in comments, advertisements, or navigation elements of Web documents. Filtering these out improves the user experience and lets users focus on the actual content of the documents.

SMC - Semantyczny Monitoring Cyberprzestrzeni
