
Research On Technology Of Table Information Extraction In Semi-Structured Texts

Posted on: 2008-07-23; Degree: Master; Type: Thesis
Country: China; Candidate: X Y Pan
GTID: 2178360245997915; Subject: Computer Science and Technology
Abstract/Summary:
As part of web documents, the table is a simple, effective and commonly used presentation scheme. Because tables carry rich information, they are a valuable knowledge source and an attractive target for information extraction, data mining and related tasks, which makes them worth studying.

By analyzing the layout and content-type characteristics of tables in semi-structured text, this paper identifies table features and proposes training a maximum entropy model on them. To improve training precision, the selected features include layout features such as cells that span multiple rows or columns. The research on table information extraction from semi-structured text is divided into two parts: the first is table recognition, and the second is information extraction from genuine tables.

For table recognition, methods based on heuristic rules do not reach a very high F-measure, and methods based on decision trees reach a higher one, but so far no one has used the maximum entropy model for this task. For these two reasons, the innovation of this paper is to propose a maximum entropy model for table identification and to compare it with a decision tree on the same corpora and the same features. While selecting features and training the maximum entropy model, this paper analyzes the layout features and content-type features of tables and finds that content-type features reflect a table's characteristics better than layout features do. Both kinds of features are used as table features, and web pages from a number of professional fields serve as the training corpus.
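The recognition setup described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the four features, the two toy tables and all names (`TableStats`, `extract_features`, `train_maxent`) are assumptions made for illustration. With two classes, a maximum entropy model is equivalent to logistic regression, which is what the training loop below fits.

```python
import math
from html.parser import HTMLParser


def _span(value):
    """Parse a rowspan/colspan attribute; fall back to 1 when malformed."""
    return int(value) if value and value.isdigit() else 1


class TableStats(HTMLParser):
    """Counts rows, cells, spanning cells and numeric cells in one table."""

    def __init__(self):
        super().__init__()
        self.rows = self.cells = self.spanning = self.numeric = 0
        self._buf = None  # text of the cell being read, None outside cells

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1
        elif tag in ("td", "th"):
            self.cells += 1
            self._buf = []
            a = dict(attrs)
            if _span(a.get("rowspan")) > 1 or _span(a.get("colspan")) > 1:
                self.spanning += 1

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buf is not None:
            text = "".join(self._buf).strip()
            self._buf = None
            if text.replace(",", "").replace(".", "", 1).isdigit():
                self.numeric += 1


def extract_features(table_html):
    """Hypothetical feature vector: bias, two layout features, one content feature."""
    stats = TableStats()
    stats.feed(table_html)
    cells = max(stats.cells, 1)
    return [
        1.0,                                # bias term
        stats.cells / max(stats.rows, 1),   # layout: average cells per row
        stats.spanning / cells,             # layout: share of cross-row/column cells
        stats.numeric / cells,              # content type: share of numeric cells
    ]


def probability(weights, x):
    """P(genuine | x) under the two-class maximum entropy (logistic) model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(samples, labels, epochs=500, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - probability(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Toy corpus (invented): one genuine data table and one layout-only table.
GENUINE = ("<table><tr><th>Year</th><th>Sales</th></tr>"
           "<tr><td>2007</td><td>120</td></tr>"
           "<tr><td>2008</td><td>150</td></tr></table>")
LAYOUT = "<table><tr><td><p>Welcome to our site.</p></td></tr></table>"

WEIGHTS = train_maxent(
    [extract_features(GENUINE), extract_features(LAYOUT)], [1, 0]
)
```

A table is then labelled genuine when `probability(WEIGHTS, extract_features(html))` exceeds 0.5. A real experiment would need a labelled corpus and a far richer feature set than this toy example.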
Repeated experiments show that the maximum entropy model solves the table recognition problem well, achieving an F-measure of 91.31% and exceeding the decision tree (F-measure 87.87%). These experiments indicate that the maximum entropy model handles a large training corpus better than the decision tree model. The reason is that the decision tree model lacks scalability: memory size restricts its depth-first search.

For the second part, table information extraction, this paper uses the HTML Tidy tool to repair nonstandard web page source code, such as missing or confused tags. For extracting table content, this paper briefly introduces the Wrapper-based and DOM-based methods. The DOM-based method is better suited to extracting structured information, and table content has exactly this structure in the form of attribute-value pairs, so our information extraction system uses the DOM-based method. In the system, each table is represented as a DOM tree, and information is extracted by traversing from the top nodes down to the leaf nodes.

Table information extraction is divided into two steps: the first detects the table and extracts its frame; the second extracts the table's attribute information and content and displays them.
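The two-step DOM-based extraction can be sketched as follows. This is an illustrative sketch under stated assumptions, not the system described in the thesis: it assumes the page has already been cleaned into well-formed XHTML (for example by HTML Tidy), treats the first row of each table as the attribute row, and ignores nested tables and spanning cells. The function names and the sample page are hypothetical; the sample figures are the F-measures reported above.

```python
from xml.dom import minidom


def cell_text(node):
    """Walk from a cell node down to its leaf text nodes and join their text."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(cell_text(child) for child in node.childNodes)


def extract_frame(table):
    """Step 1: recover the table frame as a list of rows of cell strings."""
    frame = []
    for tr in table.getElementsByTagName("tr"):
        cells = [
            child for child in tr.childNodes
            if child.nodeType == child.ELEMENT_NODE
            and child.tagName in ("th", "td")
        ]
        # Collapse runs of whitespace inside each cell.
        frame.append([" ".join(cell_text(c).split()) for c in cells])
    return frame


def extract_records(xhtml):
    """Step 2: pair attribute names (first row, by assumption) with values."""
    dom = minidom.parseString(xhtml)
    records = []
    for table in dom.getElementsByTagName("table"):
        frame = extract_frame(table)
        if len(frame) < 2:
            continue  # no data rows below the attribute row
        attributes, body = frame[0], frame[1:]
        records.extend(dict(zip(attributes, row)) for row in body)
    return records


# Hypothetical, already well-formed input page.
SAMPLE = """<html><body><table>
<tr><th>Model</th><th>F-measure</th></tr>
<tr><td>Maximum entropy</td><td>91.31%</td></tr>
<tr><td>Decision tree</td><td>87.87%</td></tr>
</table></body></html>"""

RECORDS = extract_records(SAMPLE)
for record in RECORDS:
    print(record)
```

`xml.dom.minidom` only parses well-formed XML, which is why a repair pass such as HTML Tidy has to come first; raw web pages with unclosed tags or undeclared entities such as `&nbsp;` would raise a parse error here.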
Keywords/Search Tags: information extraction, table recognition, maximum entropy, decision tree, DOM