
Research and Implementation of Active Learning Focused Crawling Based on Ontology

Posted on: 2011-12-29
Degree: Master
Type: Thesis
Country: China
Candidate: B Ren
Full Text: PDF
GTID: 2178360305455271
Subject: Software engineering
Abstract/Summary:
General-purpose search engine technology has made great progress in helping users find resources on the Internet. General search engines such as Baidu, Google and Yahoo help people find useful information based on the keywords they type. Most of them are built on full-text search, which offers strong search capability and fast response to user requests. They also have drawbacks: they return a large volume of results, much of it redundant, which lowers the precision of the information actually being sought. Another problem is that their crawlers lack focus and cannot crawl according to a specific subject domain, which makes the search results less specialized.

To address these problems, researchers proposed a new kind of search engine that concentrates on one specific subject area: the focused search engine. The biggest difference between focused crawling and ordinary web crawling is that a focused crawler applies a page analysis algorithm to filter out irrelevant pages and their URLs. As a result, focused crawling covers a relatively small search space, retrieves less information with higher accuracy, and stores less data.

A central issue in focused crawling is how to describe the content of the domain. The simplest way is to describe the domain with a keyword list, but this runs into polysemy (one word with many meanings) and synonymy (one meaning expressed by many words).
A judgment strategy based on keywords therefore lacks accuracy. Another way to describe the domain in focused search is to build an ontology and use its semantics to analyze and judge how relevant a crawled page's content is to the domain, which improves the accuracy of the crawler's relevance judgments.

The traditional way to build an ontology is for domain experts to describe it fully by hand in a formal language and in text. Fully manual construction has drawbacks: it costs labor and time, the concepts and relations it describes are incomplete, and the correctness of the ontology cannot be guaranteed. The alternatives are automatic and semi-automatic construction. With current techniques and methods, fully automatic construction is unrealistic. Hence an initial ontology is built from existing resources, concepts and the relations between them are then acquired through the ontology's own learning, and the ontology is further improved manually.

To sum up, drawing on the current state of research in focused crawling and ontology learning, this thesis designs and implements a self-learning crawling system based on an ontology. By refining the crawling process and dividing the system into functionally independent modules, it improves the crawler's overall efficiency and its accuracy in capturing relevant pages.

This thesis covers the following aspects of the research and implementation of active learning focused crawling based on ontology:

1. Constructing the tourism domain ontology. Building on the fundamentals of ontology construction, this thesis introduces a new method for constructing an ontology for one specific domain. It adds an iterative approach to the earlier construction mechanism, so that the ontology is elaborated step by step and becomes progressively more refined and complete.
To achieve this, we use the ontology construction tool Protégé: through its visual interface we build the ontology's concept classes, subclasses and the relations between concept classes, and save the resulting ontology in XML format. We then use Jena to process the ontology and insert its concepts into the database system.

2. Designing the overall system architecture. Based on a requirements analysis of ontology-based crawling and ontology learning, this thesis divides the system into four functional subsystems: the web page crawling module, the correlation calculation module, the correlated web page handling module and the self-learning module. Each module's function is independent, and the modules cooperate to deliver the functions of the whole system.

3. Web page crawling module. This module takes URLs from the seed URL list, sends requests, downloads the pages, cleans and preprocesses them with an HTML parser, and finally saves the extracted page features to the web page library. In implementing the module we mainly address basic functions such as DNS caching, page cleaning and page preprocessing.

4. Web page correlation calculation module. This module builds an ontology concept tree from the concept relations in the tourism domain ontology.
Each concept's weight is computed from the depth of the tree and the relations between its levels, and each concept and its weight are stored as a concept-weight pair in the ontology library. The module then retrieves page content from the web page library by feature type, computes the proportion of occurrences of each ontology concept, derives the page's weight for each concept, and forms the feature vector of the page content. An ontology vector is built from the ontology concepts and their weights, and the page's correlation is computed from the two vectors. If the correlation exceeds a preset threshold, the page is judged relevant; otherwise it is judged unrelated to the target topic.

5. Correlated web page handling module. This module has two main tasks. The first concerns the page's hyperlinks: it computes a weight for each link from the relevance of the page and of the link text, orders the links by the result, and passes the formatted URLs to the library of URLs waiting to be crawled. The second is to merge the preprocessed page content by feature, convert it to plain text, and filter out vocabulary that already appears as ontology concepts, so that the remaining content becomes the source material for ontology learning.

6. Ontology learning module. Taking the page text output by the correlated web page handling module, this module applies suitable methods to extract ontology concepts and their relations. Concept extraction uses statistical and hybrid methods, and relation extraction relies on the descriptive rules of WordNet or HowNet to capture the relations between ontology concepts.

7. Systematic experiments and data analysis.
Two experiments were carried out to verify the whole system. The results show that the system greatly improves the crawler's ability to capture relevant pages, while the accumulated ontology learning captures the domain's concepts and relations.

Finally, the thesis summarizes its contributions. Owing to the limits of the author's research ability and knowledge, the system still has shortcomings. First, little consideration is given to storing crawled pages; page storage is another key area, and it affects overall crawling efficiency. Second, the system only crawls static pages and cannot handle dynamic pages generated by JSP, PHP and similar technologies; crawling dynamic pages remains an important and difficult problem. Third, the correlation calculation module uses only the text features of a page to judge relevance and ignores other factors, such as the page's hyperlinks.
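The correlation calculation (item 4) and the link ordering (item 5) can be illustrated with a small sketch. It is not the thesis's implementation: the abstract gives neither the weight formula nor the vector comparison, so the sketch assumes a weight that halves at each tree level and cosine similarity between the page vector and the ontology vector. The toy ontology, the `decay` parameter, the threshold of 0.5 and all function names are hypothetical, and links are scored by anchor text alone as a simplification.

```python
import heapq
import math
import re
from collections import Counter

# Hypothetical toy ontology (child -> parent); the root concept has no parent.
PARENT = {"travel": None, "hotel": "travel", "flight": "travel", "hostel": "hotel"}


def depth(concept):
    """Number of nodes on the path from the root down to `concept`."""
    d = 1
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        d += 1
    return d


def concept_weights(decay=0.5):
    """Assumed weighting: 1.0 at the root, multiplied by `decay` per level."""
    return {c: decay ** (depth(c) - 1) for c in PARENT}


def page_vector(text, weights):
    """Per-concept page weight: concept weight times occurrence ratio."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in weights)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {c: weights[c] * counts[c] / total for c in weights}


def cosine(u, v):
    """Cosine similarity of two vectors stored as dicts with the same keys."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a text with no ontology concepts has no relevance
    return dot / (norm_u * norm_v)


def relevance(text, weights):
    """Similarity between the text's concept vector and the ontology vector."""
    return cosine(page_vector(text, weights), weights)


def is_relevant(text, weights, threshold=0.5):
    """A page counts as relevant when its score exceeds the preset threshold."""
    return relevance(text, weights) > threshold


def rank_links(links, weights):
    """Order (url, anchor_text) pairs by descending anchor-text relevance."""
    heap = []
    for order, (url, anchor_text) in enumerate(links):
        # Negate the score because heapq is a min-heap; `order` breaks ties.
        heapq.heappush(heap, (-relevance(anchor_text, weights), order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

Under these assumptions, a page would be scored with `is_relevant(page_text, concept_weights())`, and the links extracted from relevant pages would be passed through `rank_links` before being added to the queue of URLs waiting to be crawled.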
Keywords/Search Tags: Focused Crawling, Ontology Learning, Correlation Calculation, Web Page Pretreatment, Ontology