Web Free Text Information Extraction Based On TABLE Layout And Hidden Markov Model

Posted on:2008-10-08

Degree:Master

Type:Thesis

Country:China

Candidate:F Chen

Full Text:PDF

GTID:2178360212984927

Subject:Computer system architecture

Abstract/Summary:

The exponential growth of information on the World Wide Web has made automatic Web information extraction a practical necessity for reducing the human effort needed to read and understand it. Text on Web pages can generally be classified into three categories: structured text, semi-structured text and free text. The system presented in this paper extracts Web page information in two steps: it first extracts free text from Web pages and then extracts information from that free text. The process starts with a focused topical crawler that collects Web pages on a given topic. The collected pages are then inspected for free text, and a trained model extracts information from it and stores the structured results in a database for later use. Because most Web sites use the TABLE tag to manage the content and layout of their pages, this paper proposes an algorithm that detects and removes noisy information from a Web page in the following steps: extracting table elements from the HTML text; constructing a new, simplified tree structure and dividing it into blocks; and clustering pages and extracting page templates according to the similarity between pages. In addition, free text in a Web page is a continuous sequence of characters, whereas semi-structured text is enclosed in tags such as TABLE, OL and UL, so this paper proposes a second algorithm that identifies whether the content of a page is free text or semi-structured text according to the degree of dispersion of the text segments in that content. Finally, this paper proposes an approach to extracting information from free text using a part-of-speech (POS) based Hidden Markov Model.
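
The abstract does not describe the model's states, observations or training procedure, so the following is only a minimal sketch of how a POS-based Hidden Markov Model could label free text. It assumes that the observations are POS tags, that the hidden states are the fields to extract, and that decoding uses the standard Viterbi algorithm. The state names (PERSON, ORG, OTHER), the tag set and every probability below are hypothetical values chosen for illustration; they are not taken from the thesis.

```python
import math

# Hypothetical toy model: hidden states are target fields, observations are POS tags.
STATES = ["PERSON", "OTHER", "ORG"]
START_P = {"PERSON": 0.5, "OTHER": 0.4, "ORG": 0.1}
TRANS_P = {
    "PERSON": {"PERSON": 0.3, "OTHER": 0.6, "ORG": 0.1},
    "OTHER": {"PERSON": 0.1, "OTHER": 0.5, "ORG": 0.4},
    "ORG": {"PERSON": 0.1, "OTHER": 0.6, "ORG": 0.3},
}
EMIT_P = {
    "PERSON": {"NNP": 0.9, "NN": 0.1},
    "OTHER": {"VBD": 0.4, "IN": 0.3, "DT": 0.2, "NN": 0.1},
    "ORG": {"NNP": 0.8, "NN": 0.2},
}


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely hidden-state sequence for a sequence of POS tags."""
    if not observations:
        return []

    def log_p(p):
        # Work in log space; floor zero probabilities so log() stays defined.
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at step t.
    best = [{
        s: log_p(start_p.get(s, 0.0)) + log_p(emit_p[s].get(observations[0], 0.0))
        for s in states
    }]
    # back[t][s]: predecessor of state s on that best path.
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_p(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_p(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state to recover the path.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# POS tags for a sentence such as "Smith joined Acme" (tagging assumed done upstream).
labels = viterbi(["NNP", "VBD", "NNP"], STATES, START_P, TRANS_P, EMIT_P)
print(labels)  # ['PERSON', 'OTHER', 'ORG']
```

In this toy model both proper nouns carry the same POS tag, and it is the transition probabilities that separate the first one (PERSON) from the second (ORG). In practice the probabilities would be estimated from labeled training text rather than written by hand.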

Keywords/Search Tags:

Information Extraction, free Web text extraction, template detection, table blocks, Hidden Markov Model

Related items

1	Algorithm Research For Text Information Extraction Based On Hidden Markov Model
2	Web Text Information Extraction And Classification
3	Research And Implementation Of Web Information Extraction Based On Improved Hidden Markov Model
4	Research On Spatial And Temporal Information Extraction In Unstructured Text
5	The Algorithm Research Of Chinese Information Extraction Based On The Hidden Markov Model
6	Application Research Of Hidden Markov Model In Information Extraction
7	Research On Heterogeneous Academic Information Extraction And Aggregation Based On Web
8	Based On The HMM Education News Extraction And Classification
9	Template Recognition And Extraction Of Complex Table Document Images
10	Research On Domain-oriented Deep Web Information Extraction