ClusTex: Using clustering techniques for information extraction from HTML pages containing semi-structured data

Posted on:2007-01-01

Degree:M.Sc

Type:Thesis

University:University of Calgary (Canada)

Candidate:Ashraf, Fatima

GTID:2458390005490812

Subject:Computer Science

Abstract/Summary:

In the past few years, the amount of information available on the World Wide Web has grown exponentially. This wealth of information can be extremely beneficial for users, but making use of it currently requires an inconvenient amount of human intervention. Information extraction systems try to solve this problem by making the task as automatic as possible. All existing approaches, however, require user feedback in one form or another during extraction, so none of them is completely automatic. This thesis proposes the use of clustering techniques for automatic information extraction from HTML documents containing semi-structured data, and presents ClusTex, a system based on this idea. Using domain-specific information provided by the user, ClusTex parses and tokenizes the data in an HTML document, partitions it into clusters of similar elements, and estimates an extraction rule from the pattern of occurrence of the data tokens. The extraction rule is then used to refine the clusters, and the final output is reported. The approach is evaluated through experiments on seven websites from three domains. To demonstrate its effectiveness, the experimental results are compared with those reported in the literature and prove comparable.
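
The pipeline outlined in the abstract (tokenize, cluster, estimate a rule from the order in which clusters occur) can be illustrated with a small Python sketch. This is not the ClusTex implementation, which the abstract does not detail. Everything here is an assumption made for illustration: the names TokenCollector, shape, cluster, estimate_rule and extract are hypothetical, the clustering is a simple leader-style pass over tag paths and character shapes, and the "rule" is just the shortest repeating sequence of cluster labels. The user-supplied domain information and the cluster refinement step are omitted.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TokenCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def shape(text):
    """Collapse text to a character-class shape, e.g. '$12.50' -> '$9.9'."""
    out = []
    for ch in text:
        if ch.isspace():
            continue
        cls = "9" if ch.isdigit() else "a" if ch.isalpha() else ch
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)


def cluster(tokens, threshold=0.5):
    """Leader-style clustering (assumed stand-in): a token joins the first
    cluster whose leader has the same tag path and a similar shape."""
    leaders = []  # (tag_path, shape) of the first member of each cluster
    labels = []
    for path, text in tokens:
        sig = shape(text)
        for idx, (lpath, lsig) in enumerate(leaders):
            if lpath == path and SequenceMatcher(None, lsig, sig).ratio() >= threshold:
                labels.append(idx)
                break
        else:
            leaders.append((path, sig))
            labels.append(len(leaders) - 1)
    return labels


def estimate_rule(labels):
    """Return the shortest label sequence whose repetition yields `labels`."""
    n = len(labels)
    for period in range(1, n + 1):
        if n % period == 0 and labels == labels[:period] * (n // period):
            return labels[:period]
    return labels


def extract(html):
    """Tokenize, cluster, estimate the rule, and cut tokens into records."""
    parser = TokenCollector()
    parser.feed(html)
    parser.close()
    rule = estimate_rule(cluster(parser.tokens))
    if not rule:
        return [], []
    texts = [text for _, text in parser.tokens]
    step = len(rule)
    return rule, [texts[i:i + step] for i in range(0, len(texts), step)]


SAMPLE = """
<ul>
  <li><b>Dune</b><i>$9.99</i></li>
  <li><b>Emma</b><i>$12.50</i></li>
  <li><b>Ulysses</b><i>$7.00</i></li>
</ul>
"""

if __name__ == "__main__":
    print(extract(SAMPLE))
```

On the sample list, the titles and prices fall into two clusters, the estimated rule is the alternating pattern [0, 1], and the tokens are cut into three title/price records. A real page with missing or optional fields would break the strict periodicity assumed by estimate_rule, which is presumably where a refinement step such as the one the abstract mentions becomes necessary.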

Keywords/Search Tags:

Information, HTML, ClusTex, Containing, Data

Related items

1	Data Extraction And Integration In HTML Tables
2	Research And Application On The Technology Of Web Information Extraction Based On The HTML
3	Information Hiding Technology Application In HTML Tags
4	Based On The HTML Pages Of Web Information Extraction
5	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
6	The Technology Of Web Information Extraction Based On HTML Parser
7	Research On The HTML And PDF Information Extraction Technology Based XML
8	Design And Implementation Of Cross-platform Mobile Service System Of Tutor Information Based On HTML 5
9	Semantic hierarchies of HTML documents and their applications
10	The Research On Web Information Extraction Based On HMM