Study On Tables Information Extraction Based On Web

Posted on: 2011-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Qin
GTID: 2178360305973038
Subject: Circuits and Systems
Abstract/Summary:
With the explosive growth of Web information, it has become more and more difficult for users to obtain the information they want from the Internet, and "information overload" has become a serious problem. Tables present information effectively and concisely, so they are widely used in Web pages. According to statistics, about 52% of Web pages contain tables. Table information extraction is therefore of great significance in data mining and other fields, and we put forward a method for Web information extraction that makes use of tables.

Web table information extraction techniques were first proposed in the 1990s. At present there are two methodologies. One is based on wrappers; this approach has poor generality, and once the page structure changes the wrapper has to be rebuilt. The other is based on table structure recognition. This paper focuses on the latter kind of extraction technology, and then designs and implements a Web table information extraction system that can automatically understand the structure of tables and effectively extract their information.

This paper first fetches the HTML page, which is cleaned to remove useless information. HTML documents are not strictly well-formed, which can cause information extraction to fail, so we transform the HTML documents into well-formed XHTML documents (XHTML being an XML-based reformulation of HTML). The resulting XML documents contain not only the genuine tables the user is interested in, but also non-genuine tables used for page layout. Through a large number of observations, we derive heuristic rules that distinguish genuine tables from non-genuine tables and locate the positions of the genuine tables. We then carry out in-depth research on table structure recognition, analysing the inner structure of Web tables in detail. We derive heuristic rules from the characteristics of table titles (headers) and identify the expansion method of the table.
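The abstract does not list the heuristic rules used to separate genuine tables from layout tables, so the following Python sketch only illustrates the general idea with assumed rules: a table is treated as genuine if it has at least two rows, at least two columns in some row, and no nested table. The class `TableStats`, the functions `is_genuine` and `classify_tables`, and the thresholds are all hypothetical and are not taken from the thesis.

```python
from html.parser import HTMLParser


class TableStats(HTMLParser):
    """Gathers simple per-table statistics, tracking nesting with a stack."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, in the order their </table> is seen
        self._stack = []   # tables that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                # The enclosing table contains another table.
                self._stack[-1]["has_nested"] = True
            self._stack.append(
                {"rows": 0, "row_cols": 0, "max_cols": 0, "has_nested": False}
            )
        elif self._stack:
            table = self._stack[-1]
            if tag == "tr":
                table["rows"] += 1
                table["row_cols"] = 0
            elif tag in ("td", "th"):
                table["row_cols"] += 1
                table["max_cols"] = max(table["max_cols"], table["row_cols"])

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.tables.append(self._stack.pop())


def is_genuine(table):
    """Assumed heuristic: at least 2 rows, at least 2 columns, no nested table."""
    return table["rows"] >= 2 and table["max_cols"] >= 2 and not table["has_nested"]


def classify_tables(html):
    """Returns one boolean per table, in the order the tables are closed."""
    parser = TableStats()
    parser.feed(html)
    return [is_genuine(t) for t in parser.tables]


page = """
<table><tr><td>
  <table>
    <tr><th>City</th><th>Population</th></tr>
    <tr><td>Oslo</td><td>700k</td></tr>
  </table>
</td></tr></table>
"""
print(classify_tables(page))
```

In this example the inner table closes first and satisfies the assumed rules, while the outer single-cell table that wraps it is classified as a layout table. A real system would likely need more features (for example cell content types or consistency between rows) than these three.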
This paper considers layout features such as cells that span multiple rows or columns, which break the one-to-one correspondence between each data cell and its attribute. Tables are therefore normalized so that every row (column) is aligned and has the same number of cells. Finally, this paper presents extraction methods for some special Web tables and implements the algorithms. Experimental results are measured, and they indicate that the work is of significance for further study.
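The normalization step can be illustrated with a minimal Python sketch. It assumes a single, non-nested table and expands every cell with a `rowspan` or `colspan` attribute by copying its text into each grid position it covers, so that all rows end up with the same number of cells. The names `TableGridParser` and `normalize_table` are hypothetical, and copying the text is only one possible convention; the thesis may treat spanned cells differently.

```python
from html.parser import HTMLParser


def _span(value):
    """Parses a rowspan/colspan attribute, falling back to 1 if missing or invalid."""
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return 1


class TableGridParser(HTMLParser):
    """Collects the cells of one non-nested table as [text, rowspan, colspan]."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = ["", _span(a.get("rowspan")), _span(a.get("colspan"))]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data


def normalize_table(html):
    """Returns a rectangular list of rows with spanned cells duplicated."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text.strip()
            c += colspan
    if not grid:
        return []
    n_rows = len(parser.rows)  # spans running past the last row are cut off
    n_cols = max(col for _, col in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


sample = """
<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
  <tr><th>Math</th><th>Art</th></tr>
  <tr><td>Ann</td><td>90</td><td>85</td></tr>
</table>
"""
for row in normalize_table(sample):
    print(row)
```

After normalization each data cell can be paired with the header cells in the same column, which is the alignment property the paragraph above describes.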
Keywords/Search Tags: information extraction, HTML, table location, XML, structure recognition, DOM