Research Of Ontology-based Focused Crawling Technique

Posted on: 2010-01-14 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: N Luo | Full Text: PDF
GTID: 1118360272995631 | Subject: Computer application technology
Abstract/Summary:
With the rapid growth of web page information on the World Wide Web, it is becoming harder to retrieve the information and knowledge relevant to a specific domain. Therefore, focused crawling techniques for retrieving domain-specific information have received increasing attention and development in recent years. While crawling the World Wide Web, a focused web crawler aims to collect as many web pages relevant to a predefined topic, and as few irrelevant ones, as possible. The fundamental technical difficulty of focused crawling lies in the need to predict a web page's topical relevance before downloading it.

Ontology, as a new concept for describing the semantic hierarchy of knowledge, has been widely used in fields such as computer information processing, artificial intelligence and knowledge engineering. Information retrieval methods combined with ontology can not only exploit the advantages of knowledge-based retrieval but also handle the relationships between concepts. Although ontology research is still at an early stage, with no uniform standard and no established applications, ontology applied to the Semantic Web will certainly become a hot spot, and the application of ontology in information retrieval and the Semantic Web will be the focus of this field. Ontology can represent the meaning of information through a hierarchical structure and supports reasoning, which makes ontology-based information retrieval a promising method. An ontology includes the definitions of concepts, so that a machine can understand the concepts of a domain and the relationships between them in a unified framework. The system can comprehend a user's query by analyzing the query expression and mapping it to information resources, and retrieval performance is much higher than that of traditional methods.

The main contributions and results of this dissertation are as follows:

1.
This dissertation gives a general summary of research on web information retrieval and related techniques, and analyzes their background and course of development. After introducing and analyzing the development of search engines and ontology, the advantages and necessity of a topic-specific search engine are presented. The future of search engines is also discussed. The basic theory and strategies of topical web crawling and text classification are also introduced and analyzed; these form the groundwork for the further research.

2. A focused crawling algorithm loads a page and extracts its links. By rating the links based on keywords, the crawler decides which page to retrieve next, and so traverses the Web link by link. Our crawling framework builds on and extends existing work in the area of focused document crawling. We do not use only keywords for the crawl, but rely on high-level background knowledge with concepts and relations, which are compared with the text of the crawled page. With this ontology-based focused crawling method we can easily achieve a direct focus. The method provides the following main contributions: an ontology structure extended for the purposes of the focused crawler; several new approaches to relevance computation based on conceptual and linguistic means that reflect the underlying ontology structures; management of both the focused crawling process and the ontology; and an empirical evaluation which shows that ontology-based crawling clearly outperforms standard focused-crawling techniques.

3. Evaluating the relevance of a target web page from web page information is an effective topical web crawling approach. However, a common problem in constructing the classifier is that a large number of training examples must be labeled manually. It is easier to obtain positive examples than negative examples.
On the other hand, the negative examples we find are biased by our subjective factors, so they degrade the performance of the classifier. Researchers have therefore proposed building a classifier from a few positive and many unlabeled examples, which is called the PU problem. This dissertation puts forward ontology-based feature selection for PU classification, which scans the documents twice. In the first pass, we obtain the semantic meanings of the documents with WordNet. In the second pass, we filter out terms without synsets. We then reduce the dimensionality and obtain the text vector. Combining this with Co-Training and Affinity Propagation, we show that ontology-based feature selection can greatly improve classifier performance when positive examples are few. An empirical evaluation shows that, compared with the document frequency method, our algorithm increases the F1 of the One-Class classifier by 10.183% in the case of fewer positive examples and by 1.941% in the case of more positive examples, and increases the F1 of the PEBL classifier by 2.781%.

4. Due to the complexity of the web environment and the multiplicity of topics in web page content, it is quite difficult to get all the web pages relevant to a specific topic. An irrelevant web page may link to a relevant one, so we need to traverse irrelevant pages to reach more relevant pages. This procedure is called tunneling. There are two types of tunneling: grey tunneling and black tunneling. Our main work is to propose a new page segmentation method and to complete a grey tunneling system based on page segmentation. The method makes use of the visual information, tag information, link information and ontology information in web pages. The visual information includes background color, font size, font color, etc.; the tag information uses an ordered tag collection to segment the page recursively; the link information makes use of the "pagelet" concept and anchor text; and the ontology information provides hierarchical concepts. Finally, we put forward a number of heuristic rules to control the accuracy and granularity of the blocks when segmenting a page. For black tunneling, we use association rules to solve these problems.

5. Respect for users and the study of users' behavior and interests are fundamental to user-oriented personalized service, which provides a better guarantee that users can make use of resources. User-oriented personalized service aims to satisfy users' requests and starts from users' requirements. Users can not only customize their interface, but also freely select the content of the services they need and define their own preference profiles. Information services delivered over the network are personalized according to each user's interests, habits, etc., to meet individual requirements. Personalized service has become an inevitable trend in the development of search engines. Based on the focused crawling approach proposed above, we built a focused crawling model for a specific user's interests; the model draws on cognitive psychology, information dissemination and the law of forgetting. We follow users' search habits and track their behavior patterns to provide user-specific recommendation, filtering and other personalized services through machine learning and the training of user-specific models. At the same time, we note that users with similar behavior can be grouped into user groups, within which information can be shared and disseminated. We can also identify typical users and field experts.
The research is characterized by semantics, personalization, intelligence and decision support.

To sum up, research on semantic information retrieval has important theoretical value and wide application in the search engine area. This dissertation has done some research on its modeling and application. Our further research will emphasize the application, evaluation and deployment of ontology-based focused crawling in web search engines.
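
The crawling and tunneling ideas summarized in points 2 and 4 can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the ontology weights, the in-memory `WEB` graph, the `relevance` and `focused_crawl` functions, and the `threshold` and `max_tunnel` parameters are all hypothetical, chosen only to show a best-first crawl in which links inherit the score of the page that cites them and the crawler may pass through a limited number of irrelevant pages.

```python
import heapq

# Hypothetical ontology: concept term -> weight (closer to the topic = higher).
ONTOLOGY = {"crawler": 1.0, "ontology": 0.8, "search": 0.5}

# Hypothetical in-memory "web": url -> (page text, outgoing links).
WEB = {
    "a": ("ontology based crawler for search", ["b", "c"]),
    "b": ("cooking recipes and kitchen tips", ["d"]),
    "c": ("search engine basics", []),
    "d": ("focused crawler design with ontology", []),
}


def relevance(text, ontology=ONTOLOGY):
    """Share of total ontology weight whose concept terms occur in the text."""
    words = set(text.lower().split())
    hit = sum(w for term, w in ontology.items() if term in words)
    return hit / sum(ontology.values())


def focused_crawl(seed, web=WEB, threshold=0.3, max_tunnel=1):
    """Best-first crawl; links inherit the score of the page that cites them.

    max_tunnel is the number of consecutive irrelevant pages the crawler
    may pass through before it abandons a path (a simple tunneling budget).
    """
    frontier = [(-1.0, seed, 0)]  # (negated priority, url, irrelevant streak)
    seen = {seed}
    relevant = []
    while frontier:
        _, url, streak = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text)
        if score >= threshold:
            relevant.append(url)
            streak = 0
        else:
            streak += 1
            if streak > max_tunnel:
                continue  # tunneling budget exhausted on this path
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link, streak))
    return relevant
```

In this toy graph, page "d" is reachable only through the irrelevant page "b", so calling `focused_crawl("a")` with the default budget reaches it, while `focused_crawl("a", max_tunnel=0)` stops at "b". A real crawler would replace the term-matching score with a relevance measure that uses the ontology's concepts and relations.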
Keywords/Search Tags: Focused Crawling, Ontology, PU Classification, Cross Tunneling, User Interest