Study On Tables Information Extraction Based On Web

Posted on:2011-12-02

Degree:Master

Type:Thesis

Country:China

Candidate:Z H Qin

Full Text:PDF

GTID:2178360305973038

Subject:Circuits and Systems

Abstract/Summary:

With the explosive growth of information on the Web, it has become increasingly difficult for users to obtain the information they want from the Internet, and "information overload" has become a serious problem. Tables present information effectively and concisely, so they are widely used in Web pages. According to statistics, about 52% of Web pages contain tables. Table information extraction is therefore of considerable significance for data mining and related fields, and this paper puts forward a method for Web information extraction based on tables.

Web table information extraction techniques were first proposed in the 1990s. At present there are two main approaches. The first is wrapper-based; it generalizes poorly, and whenever the page structure changes the wrapper has to be rebuilt. The second is based on table structure recognition. This paper focuses on the second approach and designs and implements a Web table information extraction system that can automatically understand the structure of tables and extract their information effectively.

The system first fetches the HTML page and cleans it to remove useless information. HTML documents are often not strictly well-formed, which can cause information extraction to fail, so they are converted into well-formed XHTML documents (XHTML being an application of XML). The resulting XML documents contain not only the genuine tables the user is interested in but also non-genuine tables used only for page layout. From a large number of observations, we derive heuristic rules that distinguish genuine tables from non-genuine ones and locate the position of the genuine tables. We then study table structure recognition in depth by analyzing the internal structure of Web tables: heuristic rules are derived from the characteristics of table titles and are used to identify the table's expansion method.

This paper also considers layout features such as cells that span several rows or columns, which break the correspondence between each data cell and its attribute. Tables are therefore standardized so that every row (or column) is aligned and contains the same number of cells. Finally, this paper presents extraction methods for some special kinds of Web tables and implements the algorithm. The experimental results were measured and indicate that the work is significant for further study.
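
The standardization of cross-row and cross-column cells mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm implemented in the thesis, which the abstract does not specify; it only shows one common way to expand `rowspan` and `colspan` so that every row ends up with the same number of cells. The names `TableCellParser` and `normalize_table` are hypothetical, the sketch assumes a single non-nested table with numeric span attributes, and it copies the text of a spanning cell into every grid position the cell covers.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect [text, rowspan, colspan] for each cell of one non-nested table."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cells per <tr>
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            # Missing or empty span attributes default to 1.
            rowspan = int(a.get("rowspan") or 1)
            colspan = int(a.get("colspan") or 1)
            self.rows[-1].append(["", rowspan, colspan])
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1][0] += data


def normalize_table(html):
    """Return a rectangular list of rows with spanning cells expanded."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()

    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every position the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text.strip()
            c += colspan

    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell are padded with empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Made-up sample table, used only to exercise the sketch.
sample = """
<table>
  <tr><th rowspan="2">City</th><th colspan="2">Temperature</th></tr>
  <tr><th>Min</th><th>Max</th></tr>
  <tr><td>Beijing</td><td>3</td><td>12</td></tr>
</table>
"""

for row in normalize_table(sample):
    print(row)
```

After expansion, each data cell can be paired with the header text in the same column, which is the correspondence that spanning cells otherwise break.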

Keywords/Search Tags:

information extraction, HTML, table location, XML, structure recognition, DOM

Related items

1	Research And Implementation Of The Web Page Table Structure Recognition
2	Research Of Web Information Extraction Based On Table Structure
3	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
4	Based On The Html Pages Of Web Information Extraction
5	Data Extraction And Integration In HTML Tables
6	Research And Application On The Technology Of Web Information Extraction Based On The HTML
7	The Research On Web Information Extraction Based On HMM
8	Research On The HTML And PDF Information Extraction Technology Based XML
9	Method Of Entity Table Information Extraction In Web Page
10	The Technology Of Web Information Extraction Based On HTML Parser