Research On Information Extraction Based On Vertical Search Engine

Posted on: 2010-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: J X Li
GTID: 2178360272980200
Subject: Computer software and theory
Abstract/Summary:
There are many ways to obtain information today. As an information medium, the Internet stands out for both the speed of transmission and the volume of information it carries. As that volume grows, users find it increasingly difficult to locate the information they actually need. General-purpose search engines have improved the situation, but they are still inconvenient for obtaining specialized, domain-specific information. Vertical search engines address this problem.

In a vertical search engine, structured data extraction is one of the key technologies, and wrapper-based extraction is among its most important forms. Before a wrapper can be generated, the web page must be analyzed and data extraction rules derived from it. Non-theme data (content unrelated to the page's main topic) that enters the extraction rules during this analysis seriously degrades both the efficiency of extraction and the accuracy of the results.

We propose an improved wrapper that extracts the theme region directly and analyzes the web page with an improved MDR algorithm. The pages being processed should be data-intensive, because the extraction procedure extracts the subtree of each item in the theme region. By comparing the nodes on the same layer of the page's DOM tree, generalized nodes that are similar to one another are grouped into data regions, which together form a simplified tree for later processing. Structured data extraction rules are then generated on these data regions in the usual way. The non-theme data, which is useless and makes processing inefficient, is removed from the page entirely. Experiments show that the method generates wrappers efficiently and improves their precision and recall to some degree.
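The same-layer comparison described above can be illustrated with a minimal sketch. It is not the thesis's improved MDR algorithm: it assumes generalized nodes of length one (a single subtree each), uses the standard library's `difflib.SequenceMatcher` on flattened tag sequences as a stand-in similarity measure, and assumes well-formed markup without unclosed void tags. The names `Node`, `TreeBuilder`, `tag_sequence`, `find_data_regions`, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Node:
    """A bare-bones DOM node that records only the tag name and tree links."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes every start tag has a matching end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent


def tag_sequence(node):
    """Flatten a subtree into its pre-order list of tag names."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def similar(a, b, threshold):
    """Stand-in similarity test: ratio of matching tags in the two subtrees."""
    ratio = SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()
    return ratio >= threshold


def find_data_regions(node, threshold=0.8, min_items=2):
    """Return runs of adjacent, structurally similar siblings (candidate data regions)."""
    regions, run = [], []
    for child in node.children:
        if run and similar(run[-1], child, threshold):
            run.append(child)
        else:
            if len(run) >= min_items:
                regions.append(run)
            run = [child]
    if len(run) >= min_items:
        regions.append(run)
    # Descend only into children that are not already part of a region.
    covered = {id(n) for region in regions for n in region}
    for child in node.children:
        if id(child) not in covered:
            regions.extend(find_data_regions(child, threshold, min_items))
    return regions


page = (
    "<div><h1></h1><ul>"
    "<li><a></a><span></span></li>"
    "<li><a></a><span></span></li>"
    "<li><a></a><span></span></li>"
    "</ul><p></p></div>"
)
builder = TreeBuilder()
builder.feed(page)
for region in find_data_regions(builder.root):
    print([n.tag for n in region])
```

In this toy page the three `li` subtrees are grouped into one region, while the heading and trailing paragraph, which resemble none of their siblings, fall outside it. A fuller treatment would also compare combinations of several adjacent subtrees, as the term "generalized node" implies.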
Keywords/Search Tags: vertical search engine, data extraction, theme data, generalized node, data region