
Research On WEB Entity Information Extraction Algorithm And Its Application

Posted on: 2019-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: F Gao
Full Text: PDF
GTID: 2348330569495570
Subject: Engineering

Abstract/Summary:
With the rapid development and popularity of the Internet, it has become a very important source of information. Internet users increasingly want to find target topic pages effectively and accurately within the vast Internet and to perform customized entity information extraction on those pages. In the traditional search engine domain, topic crawlers and vertical crawlers are popular methods for obtaining data on specific topics and from specific websites. However, topic crawlers pay more attention to searching for topic pages and often neglect in-depth study of accurate page information extraction. Vertical crawlers can extract accurate information from a website, but their main disadvantages are poor portability, the inability to crawl different websites with a common program, and a low degree of automation. Classical WEB information extraction methods have made some achievements in the fields to which they are suited, but they also have problems, including a limited scope of adaptation and low extraction efficiency. Moreover, these methods focus only on entity information extraction from the target WEB page, while the topic crawler concentrates more on searching for and locating the target topic page. The existing classical WEB entity information extraction methods are therefore limited in both scope of application and scope of research.

To address the fact that a vertical crawler cannot be directly ported to other websites and that its program design requires a lot of manual intervention, as well as the limitations of classical WEB entity information extraction methods, this thesis proposes a highly efficient and highly portable WEB entity information extraction algorithm. The algorithm covers both the search for and location of topic pages and the extraction of page information.

(1) In the search and location part, a supervised, weighted breadth-first web page search strategy is proposed. It automatically identifies the URLs of topic target pages and directory pages and generates URL regular expression filters by URL clustering. The regular expression filters are used to search for relevant pages breadth-first over a wide range, and a best-first effect is achieved through value calculation based on tunnel technology. Experiments demonstrate that the search strategy designed in this thesis enables the crawler to locate and download topic-related pages thoroughly, quickly, and accurately.

(2) In the page information extraction part, the crawler uses a wrapper, driven by configuration information, to extract customized WEB entity information accurately and completely. Automatic generation of data parsing path templates ensures the efficiency and accuracy of information extraction.

Based on the principle of the proposed WEB entity information extraction algorithm, this thesis designs and implements a general-purpose vertical crawler system. The system is a concrete application of a WEB data collector: after simple configuration, it performs efficient, fast, and accurate customized data crawling on different websites, with high portability and strong generality. It also demonstrates the rationality and effectiveness of the proposed algorithm, which has high application value and enriches the theoretical and applied research on WEB information extraction.
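As an illustration of the idea in part (1) of generating URL regular expression filters by URL clustering, the following is a minimal sketch and not the thesis's actual algorithm. It assumes a very simple clustering rule (URLs that differ only in their digit runs fall into the same cluster) and ignores query strings. The function names url_to_pattern, build_filters, and matches_any, the min_support parameter, and the example.com URLs are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def url_to_pattern(url):
    """Generalize one URL into a regex string by replacing digit runs with \\d+.

    Assumption: only scheme, host, and path are considered; the query string
    and fragment are ignored.
    """
    parts = urlsplit(url)
    # re.split with a capturing group alternates non-digit and digit tokens,
    # so tokens at odd indices are always the digit runs.
    tokens = re.split(r"(\d+)", parts.path)
    body = "".join(
        r"\d+" if i % 2 == 1 else re.escape(tok)
        for i, tok in enumerate(tokens)
    )
    prefix = re.escape(parts.scheme + "://" + parts.netloc)
    return "^" + prefix + body + "$"


def build_filters(urls, min_support=2):
    """Cluster URLs by generalized pattern and compile the frequent patterns.

    A pattern is kept only if at least min_support sample URLs share it.
    """
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_to_pattern(url)].append(url)
    return [
        re.compile(pattern)
        for pattern, members in clusters.items()
        if len(members) >= min_support
    ]


def matches_any(url, filters):
    """Return True if the URL matches at least one compiled filter."""
    return any(f.match(url) for f in filters)


# Hypothetical sample of labeled topic and directory page URLs.
sample_urls = [
    "http://example.com/item/1001.html",
    "http://example.com/item/1002.html",
    "http://example.com/list/page3",
    "http://example.com/list/page4",
    "http://example.com/about",
]
filters = build_filters(sample_urls)
```

With these sample URLs, two filters are produced (one for item pages, one for list pages), so a crawler could call matches_any on each newly discovered link and enqueue only the links that pass. A real system would need a richer similarity measure than digit masking, and the weighting and tunnelling described in the abstract are not modeled here.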
Keywords/Search Tags: Entity information extraction, Vertical crawler, URL clustering, Regular expression filters, Page weight