
ClusTex: Using clustering techniques for information extraction from HTML pages containing semi-structured data

Posted on: 2007-01-01
Degree: M.Sc
Type: Thesis
University: University of Calgary (Canada)
Candidate: Ashraf, Fatima
Full Text: PDF
GTID: 2458390005490812
Subject: Computer Science
Abstract/Summary:
In the past few years, there has been an exponential increase in the amount of information available on the World Wide Web. This plethora of information can be extremely beneficial for users. However, locating and extracting the relevant information currently requires an inconvenient amount of human intervention. Information extraction systems try to solve this problem by making the task as automatic as possible. All existing approaches, however, require user feedback in one form or another during extraction, so none of them is completely automatic. This thesis proposes the use of clustering techniques for automatic information extraction from HTML documents containing semi-structured data. The system ClusTex is based on this idea. Using domain-specific information provided by the user, ClusTex parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine the clusters, and finally the output is reported. The proposed approach is tested by conducting experiments on seven websites from three domains. To demonstrate the effectiveness of the approach, the results of the experiments are compared with those reported in the literature and prove comparable.
Keywords/Search Tags: Information, HTML, ClusTex, Containing, Data
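
The pipeline outlined in the abstract (tokenize the HTML, group similar tokens, then estimate a rule from the pattern in which tokens occur) can be illustrated with a small sketch. This is not the ClusTex implementation, which the abstract does not specify. It rests on simplifying assumptions: tokens are text nodes labelled with their tag path, "clustering" is approximated by grouping tokens that share a tag path, and the "extraction rule" is taken to be the shortest repeating sequence of tag paths. All names here (`TokenCollector`, `tokenize`, `cluster_by_path`, `estimate_rule`) are hypothetical.

```python
from collections import OrderedDict
from html.parser import HTMLParser

# Elements that never have a closing tag; they are ignored when building paths.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TokenCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def tokenize(html):
    """Return the list of (tag_path, text) tokens found in an HTML string."""
    parser = TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


def cluster_by_path(tokens):
    """Group token texts by tag path (an assumed stand-in for real clustering)."""
    clusters = OrderedDict()
    for path, text in tokens:
        clusters.setdefault(path, []).append(text)
    return clusters


def estimate_rule(tokens):
    """Return the shortest sequence of tag paths that repeats to cover all tokens."""
    paths = [path for path, _ in tokens]
    n = len(paths)
    for size in range(1, n + 1):
        if n % size == 0 and paths == paths[:size] * (n // size):
            return paths[:size]
    return paths


if __name__ == "__main__":
    sample = (
        "<ul>"
        "<li><b>Dune</b><i>Frank Herbert</i></li>"
        "<li><b>Emma</b><i>Jane Austen</i></li>"
        "</ul>"
    )
    tokens = tokenize(sample)
    print(dict(cluster_by_path(tokens)))
    print(estimate_rule(tokens))
```

In this toy version each record is one repetition of the estimated rule, so a title/author pair could be read off by slicing the token list in steps of the rule length. A real system would need a similarity measure richer than exact tag-path equality and a rule estimator that tolerates missing or optional fields.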