Web Free Text Information Extraction Based On TABLE Layout And Hidden Markov Model

Posted on: 2008-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: F Chen
GTID: 2178360212984927
Subject: Computer system architecture
Abstract/Summary:
The exponential growth of information on the World Wide Web makes automatic Web information extraction a practical necessity for reducing the human effort needed to read and understand it. Text on Web pages can generally be classified into three categories: structured text, semi-structured text, and free text. The system presented in this paper extracts Web page information in two steps: extracting free text from Web pages, and then extracting information from that free text. The process starts with a focused topical crawler that collects Web pages on a given topic. The collected pages are then examined for free text, and a trained model extracts information from it and stores the structured results in a database for later use. Because most Web sites use the TABLE tag to manage page content and layout, this paper proposes an algorithm that detects and removes noisy information from a Web page in the following steps: extracting table elements from the HTML text, constructing a new simplified tree structure and partitioning it into blocks, and then clustering pages and extracting page templates according to the similarity between pages. In addition, free text in Web pages consists of continuous character sequences, whereas semi-structured text sits inside tags such as TABLE, OL, and UL, so this paper proposes a second algorithm that decides whether the content of a Web page is free text or semi-structured text according to the degree of dispersion of the text segments in the page content. Finally, this paper proposes an approach to extracting information from free text using a part-of-speech (POS) based Hidden Markov Model.
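The abstract does not give the states, observation alphabet, or parameters of the POS-based Hidden Markov Model, so the following is only a minimal sketch of how such a model could label tokens once it has been trained. It assumes the observations are POS tags, that there are two hypothetical states (TARGET for tokens to extract, BACKGROUND for everything else), and that the probabilities are hand-picked toy values rather than anything learned from data. Decoding uses the standard Viterbi algorithm in log space.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a list of observations."""
    if not observations:
        return []

    def log(p):
        # Floor tiny or missing probabilities so log() is always defined.
        return math.log(max(p, 1e-12))

    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    first = observations[0]
    best = [{s: (log(start_p.get(s, 0)) + log(emit_p[s].get(first, 0)), None)
             for s in states}]

    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + log(trans_p[p].get(s, 0)), p) for p in states
            )
            column[s] = (score + log(emit_p[s].get(obs, 0)), prev)
        best.append(column)

    # Backtrack from the best final state to recover the full path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path

# Hypothetical toy model: all names and numbers below are illustrative assumptions.
STATES = ["BACKGROUND", "TARGET"]
START_P = {"BACKGROUND": 0.8, "TARGET": 0.2}
TRANS_P = {
    "BACKGROUND": {"BACKGROUND": 0.7, "TARGET": 0.3},
    "TARGET": {"BACKGROUND": 0.4, "TARGET": 0.6},
}
EMIT_P = {
    "BACKGROUND": {"DT": 0.3, "VBD": 0.3, "IN": 0.3, "NNP": 0.05, "CD": 0.05},
    "TARGET": {"NNP": 0.5, "CD": 0.4, "DT": 0.04, "VBD": 0.03, "IN": 0.03},
}

# POS tags for a sentence such as "IBM rose by 5".
tags = ["NNP", "VBD", "IN", "CD"]
labels = viterbi(tags, STATES, START_P, TRANS_P, EMIT_P)
print(list(zip(tags, labels)))
```

With these toy parameters the proper noun and the number are labeled TARGET and the verb and preposition BACKGROUND. In a real system the probabilities would be estimated from annotated free text, and the state set would reflect the fields to be extracted.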
Keywords/Search Tags: Information Extraction, free Web text extraction, template detection, table blocks, Hidden Markov Model