Method Of Entity Table Information Extraction In Web Page

Posted on: 2017-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2348330503992905
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, web information is growing exponentially. Browsing pages one by one can no longer meet users' needs, which has given rise to information extraction technology. Information extraction lets users pick out useful content without further manual screening and obtain valuable information directly from huge amounts of network data.

Web information extraction research mainly follows two directions: wrappers and structure identification. The shortcoming of wrappers is their strong dependence on specific pages, which makes their reusability and generality weak. The method in this thesis belongs to structure identification: it locates and identifies semi-structured information in web pages and generalizes to most web pages. Its results can be applied directly to ontologies, which gives it high practical value.

The crawler studied in this thesis is an incremental, depth-first focused crawler. Each crawler is generated from a configuration file, and one configuration file corresponds to one crawl task. The configuration file has a specific format and set of fields and is edited manually; configuring approximately 10 fields is enough to set up a focused crawler for a specific site, domain, or subject.

After web page cleaning, this thesis proposes two entity location algorithms for canonical tables: one based on heuristic rules and one based on URL classification. From tag features, table structure features, and table content features, the thesis summarizes six rules. A string is generated from the six rules, a finite automaton recognizes the string, and the state in which the automaton stops determines whether the table is a genuine table.
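The rule-string and finite-automaton step can be illustrated with a minimal sketch. The abstract does not list the six rules or the automaton's states, so everything here is an assumption made for illustration: each rule is taken to yield pass or fail, the results are encoded as a six-character string of `1` and `0`, the first two rules are treated as mandatory, and at least two of the remaining four must pass. The names `encode_rules`, `final_state`, and `is_genuine_table` are hypothetical.

```python
# Illustrative sketch only: the states and transitions below are assumptions,
# not the automaton defined in the thesis.
REJECT = "reject"

# (current state, input symbol) -> next state
TRANSITIONS = {
    ("m0", "1"): "m1", ("m0", "0"): REJECT,   # first mandatory rule
    ("m1", "1"): "s0", ("m1", "0"): REJECT,   # second mandatory rule
    ("s0", "1"): "s1", ("s0", "0"): "s0",     # optional rules: count passes
    ("s1", "1"): "s2", ("s1", "0"): "s1",
    ("s2", "1"): "s2", ("s2", "0"): "s2",     # two or more passes reached
    (REJECT, "1"): REJECT, (REJECT, "0"): REJECT,
}
ACCEPTING = {"s2"}


def encode_rules(results):
    """Turn six boolean rule results into a string such as '110110'."""
    return "".join("1" if passed else "0" for passed in results)


def final_state(rule_string):
    """Run the automaton over the rule string and return the stopping state."""
    if len(rule_string) != 6 or set(rule_string) - {"0", "1"}:
        raise ValueError("expected a string of six '0'/'1' characters")
    state = "m0"
    for symbol in rule_string:
        state = TRANSITIONS[(state, symbol)]
    return state


def is_genuine_table(rule_string):
    """The table is judged genuine if the automaton stops in an accepting state."""
    return final_state(rule_string) in ACCEPTING
```

Under these assumptions, the stopping state carries the verdict: a string that fails a mandatory rule ends in `reject`, and one with too few optional passes ends in `s0` or `s1`.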
To improve location accuracy, this thesis also proposes a URL-based entity classification method: URLs are classified, and pages that cannot contain a table entity are removed. Combining the two methods yields highly accurate table location. In addition, the thesis proposes a heuristic rule for non-canonical tables laid out with special symbols, and a location method based on the DOM tree and heuristic rules for non-canonical tables organized with tags.

In the table recognition stage, the thesis constructs different type trees for different types of attribute names and attribute values, and judges the direction of the table by calculating the type differences between table cells. It also converts the table into numeric form and judges how the table expands from the differences in cell length. The two results are given different weights, and together they determine whether the table expands horizontally or vertically. Type differences and structure differences are also used to determine the number of rows and columns that the table spans.
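The weighted combination of type differences and cell-length differences can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's algorithm: a three-way `cell_type` classifier stands in for the type trees, the weights `w_type` and `w_len` are arbitrary, the grid is assumed rectangular, and "vertical" is taken to mean that cells are more alike down a column than along a row, so new records are added as rows. All function names are hypothetical.

```python
import re


def cell_type(text):
    """Coarse stand-in for the thesis's type trees (assumption)."""
    text = text.strip()
    if not text:
        return "empty"
    if re.fullmatch(r"[-+]?\d+(\.\d+)?", text):
        return "number"
    return "text"


def adjacent_pairs(grid, axis):
    """Pairs of neighbouring cells along rows or along columns."""
    rows, cols = len(grid), len(grid[0])
    if axis == "row":
        return [(grid[r][c], grid[r][c + 1])
                for r in range(rows) for c in range(cols - 1)]
    return [(grid[r][c], grid[r + 1][c])
            for r in range(rows - 1) for c in range(cols)]


def type_difference(pairs):
    """Fraction of neighbouring pairs whose cell types differ."""
    if not pairs:
        return 0.0
    return sum(cell_type(a) != cell_type(b) for a, b in pairs) / len(pairs)


def length_difference(pairs):
    """Mean relative difference in cell text length, in [0, 1]."""
    if not pairs:
        return 0.0
    return sum(abs(len(a) - len(b)) / max(len(a), len(b), 1)
               for a, b in pairs) / len(pairs)


def table_direction(grid, w_type=0.7, w_len=0.3):
    """Combine both signals with (assumed) weights into one decision."""
    row_pairs = adjacent_pairs(grid, "row")
    col_pairs = adjacent_pairs(grid, "column")
    score = (w_type * (type_difference(row_pairs) - type_difference(col_pairs))
             + w_len * (length_difference(row_pairs) - length_difference(col_pairs)))
    # Positive score: cells vary more along rows than down columns.
    return "vertical" if score > 0 else "horizontal"


grid = [["Name", "Age", "City"],
        ["Alice", "30", "Paris"],
        ["Bob", "25", "Rome"]]
transposed = [list(column) for column in zip(*grid)]
```

For the sample `grid`, values in the same column share a type and similar lengths, so `table_direction(grid)` returns `"vertical"`; the transposed grid returns `"horizontal"`.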
Keywords/Search Tags: Ontology generation, information extraction, web table, entity position, structure recognition