Research On The Focused Crawling Combining Synthetic Web-Page Information And Domain Ontology

Posted on: 2011-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Guan
Full Text: PDF
GTID: 2178360305454671
Subject: Computer software and theory
Abstract/Summary:
Since 1994, web search engines have developed significantly. They solve the problem of indexing and quickly locating the massive volume of resources on the web, and they play an increasingly important role in people's lives. However, as data grows ever more rapidly, traditional search engines can no longer meet user needs: a search engine with poor semantic processing ability cannot satisfy users' demand for accuracy.

Focused crawling is a technique for improving search engines and an intelligent application within the search domain. Its aim is to find web pages on a previously defined topic. It classifies web pages using text categorization and predicts the relevance of hyperlinks in order to obtain good search results.

If focused crawling is integrated with semantic techniques, the crawler behaves during the search as if it were guided by a domain specialist. The search engine then returns not only search results but also resources related to the topic. A focused crawling strategy is designed on top of a general search engine; in effect, it is an extension of the traditional search engine. Guided by background knowledge, the crawler retrieves as many relevant web pages as possible. The range of a focused crawler is smaller than that of a general crawler, but its results are more precise. A focused crawler filters out irrelevant web pages so that it can fetch and store many relevant pages with limited web resources. The main questions in focused crawling are how to filter off-topic web pages and how to obtain more on-topic pages.

Marc Ehrig proposes an approach to document discovery built on a framework for ontology-focused crawling of web documents. An ontology is a description of concepts and their properties. It can describe background knowledge precisely, and it is a tool for knowledge representation. Ontology-focused crawling yields more satisfactory search results.
From our study of how to compute web-page relevance, we find that combining the location of terms within a web page with an ontology gives a more precise relevance measure. Traditional methods pay little attention to links. Our approach predicts the topic relevance of a link using extended anchor text and the relationships between links. The whole algorithm centres on these two points.

The main work of this paper builds on ontology-focused crawling. First, we extend that approach by analyzing the text of the web page, and we point out that information at specific locations in a page plays an important role in determining the page's topic. Second, the approach analyzes the topic relevance of the links contained in the page.

Anchor text is the text of a hyperlink. It summarizes the information about the hyperlink. Because anchor text usually appears on other web pages, it represents the intention of web authors, who want to tell users the subject of the target page and guide them to visit the URL using brief information. Compared with a randomly selected web page, anchor text has a stronger ability to describe the target page. Predicting the topic relevance of web pages from anchor text is therefore a hot topic for researchers.

The idea of the algorithm is as follows. When a web page is fetched, the system deletes unimportant tags. It then extracts the text from the page, counts high-frequency terms, and converts the text to a vector. When computing the topic score of the page, it judges whether each term of the vector belongs to a concept of the ontology and maps it to the concepts, properties, and instances of the ontology. It assigns the vector a topic score by combining location weights with the ontology. If the topic score of the page is higher than the threshold, all hyperlinks on the page are extracted, and each hyperlink is checked to see whether it has already been crawled. For each hyperlink that has not been crawled, the algorithm predicts its topic score.
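
The page-scoring step can be illustrated with a minimal sketch. The abstract gives no formula, so everything concrete here is an assumption made for illustration: the location names and weights in `LOCATION_WEIGHTS`, the tiny finance term table `ONTOLOGY_TERMS` (a flat stand-in for a real ontology of concepts, properties, and instances), the `THRESHOLD` value, and the normalization by total weighted term count.

```python
import re
from collections import Counter

# Hypothetical location weights: the abstract does not state actual values.
LOCATION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

# Hypothetical stand-in for a finance ontology: term -> relevance weight.
# A real ontology would map terms to concepts, properties, and instances.
ONTOLOGY_TERMS = {"stock": 1.0, "bond": 1.0, "bank": 0.8, "interest": 0.6}

# Hypothetical relevance threshold.
THRESHOLD = 0.3


def tokenize(text):
    """Split text into lowercase alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def topic_score(page_fields):
    """Score a page given a mapping of location name -> extracted text.

    Each term occurrence counts with the weight of the location where it
    appears; the score is the ontology-weighted share of that total.
    """
    weighted = Counter()
    total = 0.0
    for location, text in page_fields.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        for term in tokenize(text):
            weighted[term] += weight
            total += weight
    if total == 0:
        return 0.0
    matched = sum(
        freq * ONTOLOGY_TERMS[term]
        for term, freq in weighted.items()
        if term in ONTOLOGY_TERMS
    )
    return matched / total


def is_relevant(page_fields, threshold=THRESHOLD):
    """Return True if the page's topic score exceeds the threshold."""
    return topic_score(page_fields) > threshold
```

In this sketch a term in the title counts three times as much as the same term in the body, which is one simple way to express the idea that specific locations in a page say more about its topic.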
Hyperlinks are placed into different queues according to their predicted scores.

We study this problem in depth and propose a strategy based on ontology background knowledge and anchor text information, which improves accuracy. This paper integrates the search engine with semantic web technologies such as the Resource Description Framework, ontologies, and reasoning techniques. I construct a finance domain ontology to realize the search strategy. To test the advantage of the ontology-based search strategy, I run experiments on finance information.

The most important criterion for measuring the effect of focused crawling is how well it selects relevant web pages and filters off-topic ones. Harvest rate is the fraction of crawled web pages that satisfy the target.

The paper designs three groups of experiments. The first computes the topic score of a web page by combining term location weights and the ontology. The second predicts the relevance scores of hyperlinks. The third combines the first two and compares the four strategies.

From the experimental results we conclude that our approach has higher efficiency and a higher harvest rate. The strategy uses a domain ontology as background knowledge and combines it with term location weights to compute the topic score of web pages. It also uses anchor text and the dependencies in the HTML to predict the relevance of hyperlinks. The strategy can be put to effective use in focused crawling research.

Research on focused crawling has not only theoretical value but also broad application prospects. The paper discusses some open issues in focused crawling. The future of the web is promising, and our work is the beginning of research that should continue. How to turn focused crawling research into web applications and how to support services that meet the demands of different users are the directions of our future research.
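
As a rough illustration of two ideas from the abstract, score-based link queues and harvest rate, the sketch below shows one possible shape. It is not the thesis's implementation: the queue names and cut-offs in `QUEUE_THRESHOLDS`, the `Frontier` class, and the assumption that scores lie in [0, 1] are all hypothetical.

```python
from collections import deque

# Hypothetical queue names and minimum scores, highest priority first.
QUEUE_THRESHOLDS = [("high", 0.6), ("medium", 0.3), ("low", 0.0)]


def harvest_rate(page_scores, threshold):
    """Fraction of crawled pages whose topic score meets the threshold."""
    if not page_scores:
        return 0.0
    relevant = sum(1 for score in page_scores if score >= threshold)
    return relevant / len(page_scores)


class Frontier:
    """Crawl frontier that routes links to queues by predicted score."""

    def __init__(self):
        self.queues = {name: deque() for name, _ in QUEUE_THRESHOLDS}
        self.seen = set()

    def add(self, url, predicted_score):
        """Enqueue a link once; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        # Default to the lowest-priority queue, then look for a better fit.
        target = QUEUE_THRESHOLDS[-1][0]
        for name, minimum in QUEUE_THRESHOLDS:
            if predicted_score >= minimum:
                target = name
                break
        self.queues[target].append(url)
        return True

    def next_url(self):
        """Pop from the highest-priority non-empty queue, or return None."""
        for name, _ in QUEUE_THRESHOLDS:
            if self.queues[name]:
                return self.queues[name].popleft()
        return None
```

A crawler built this way visits links from the high-score queue first, so with a fixed page budget more of the fetched pages should be on topic, which is what the harvest rate measures.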
Keywords/Search Tags: Focused Crawling, Ontology, Anchor Text, Term Location