
Research On Customized Web Crawling

Posted on: 2006-01-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L H Wu
Full Text: PDF
GTID: 1118360185995719
Subject: Computer software and theory
Abstract/Summary:
With the explosion of Web information, quickly and accurately finding the information each user needs has become a hard problem. Although traditional search engine technology meets some user demands, it cannot satisfy the personalized requirements of users with different backgrounds, different intentions, and at different times. Customized Web crawling has been proposed to address this issue. By making full use of a user's personalized information, research on customized Web crawling aims to provide better service to users and to gather information under the supervision of, or in interaction with, the user's interests.

Built around the customized Web crawling system PSearch, the main contributions of this dissertation can be summarized as follows:

(1) Acquisition of the user's personalized interests. After analyzing how user interests are collected and updated, this dissertation experimentally studies user requirement expansion, feature selection methods, and document clustering analysis for acquiring personalized interests. By capturing the user's current browsing actions, customized Web crawling can use the browsed content to select the words most similar to the keywords representing the user's needs and expand the requirement with them. The experimental results indicate that such requirement expansion does capture the user's current personalized interests. Automatic collection of user interests is essentially similar to feature selection in text categorization, so user interests can be obtained with feature selection methods. Four methods are evaluated: term selection based on document frequency (DF), mutual information (MI), information gain (IG), and the χ² test (CHI). We find IG most effective in our experiments. User interests can also be obtained by clustering the pages the user has browsed, that is, by document clustering analysis. Four methods are evaluated: K-means, K-Medoids, MaxDist Sampling Clustering, and Bisecting K-means. We find Bisecting K-means most effective in our experiments.

(2) Selection of the order in which a crawler should visit the URLs it has seen. According to the distribution characteristics of Web pages, this dissertation analyzes the setup of seed URLs, the process of page retrieval, similarity evaluation between retrieved pages and...
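
To make the information gain (IG) criterion from contribution (1) concrete, the following is a minimal sketch of the usual text-categorization definition, IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], computed over binary term presence. It is not taken from PSearch: the function names, the toy documents, and the two class labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document' w.r.t. the labels.

    docs is a list of token sets and labels a parallel list of class names
    (both hypothetical inputs for this sketch).
    """
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    prior = entropy(Counter(labels).values())
    conditional = (n_with / n) * entropy(with_term.values()) \
        + (n_without / n) * entropy(without_term.values())
    return prior - conditional


def rank_terms(docs, labels):
    """All vocabulary terms, highest IG first (ties broken alphabetically)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))


# Toy browsing history: two pages the user cared about, two they did not.
docs = [{"crawler", "url"}, {"crawler", "seed"},
        {"football", "match"}, {"football", "goal"}]
labels = ["interesting", "interesting", "other", "other"]
print(rank_terms(docs, labels)[:2])
```

Terms that split the browsed pages cleanly by class score highest, which is why such a ranking can serve as a list of candidate interest keywords.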
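
Bisecting K-means, which the abstract reports as the most effective clustering method, can be sketched as follows: start with one cluster and repeatedly split one cluster in two with plain 2-means until k clusters exist. Several details here are assumptions rather than the dissertation's choices: the cluster to split is the one with the largest sum of squared errors, seeding is deterministic, distance is Euclidean, and the inputs are toy 2-D points standing in for document vectors.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(dist2(p, c) for p in points)


def two_means(points, iters=20):
    """Split points into two groups with 2-means.

    Seeds (an assumption, chosen for reproducibility): the point farthest
    from the overall mean, then the point farthest from that seed.
    """
    mean = centroid(points)
    c0 = max(points, key=lambda p: dist2(p, mean))
    c1 = max(points, key=lambda p: dist2(p, c0))
    left, right = list(points), []
    for _ in range(iters):
        new_left = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        new_right = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not new_left or not new_right:
            break  # keep the last valid split
        left, right = new_left, new_right
        new_c0, new_c1 = centroid(left), centroid(right)
        if (new_c0, new_c1) == (c0, c1):
            break  # assignments have converged
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points, k):
    """Return up to k clusters (lists of points) by repeated bisection."""
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        target = max(splittable, key=sse)  # split the least compact cluster
        left, right = two_means(target)
        if not right:
            break  # all points identical, nothing left to split
        clusters.remove(target)
        clusters += [left, right]
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
for cluster in bisecting_kmeans(points, 3):
    print(cluster)
```

For real pages, each point would be a term-weight vector for one browsed document, and cosine similarity is a common substitute for Euclidean distance.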
Keywords/Search Tags: personalized services, customized Web crawling, user's interests, PSearch, search engine