Research On Customized Web Crawling

Posted on:2006-01-15

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L H Wu

Full Text:PDF

GTID:1118360185995719

Subject:Computer software and theory

Abstract/Summary:

With the explosion of Web information, finding the information each user needs quickly and accurately has become a hard problem. Although traditional search engine technologies meet some user demands, they cannot satisfy the personalized requirements of users with different backgrounds, different intentions, and at different times. Customized Web crawling has been proposed to address this issue. By taking full advantage of a user's personalized information, research on customized Web crawling aims to provide better services to users and to gather information under the supervision of, or in interaction with, the user's interests.

Centered on the customized Web crawling system PSearch, the main contributions of this dissertation can be summarized as follows:

(1) Acquisition of the user's personalized interests. After analyzing how user interests are collected and updated, this dissertation experimentally studies user requirement expansion, feature selection methods, and document clustering analysis for acquiring the user's personalized interests. By capturing the user's current browsing actions, customized Web crawling can select the words most similar to the keywords representing the user's needs, computed from the browsed content, and expand the requirement with them. The experimental results indicate that such requirement expansion does capture the user's current personalized interests. Automatic collection of user interests is in essence similar to feature selection in text categorization, so user interests can be obtained by feature selection methods. Four methods are evaluated: term selection based on document frequency (DF), mutual information (MI), information gain (IG), and the χ²-test (CHI). We find IG most effective in our experiments. User interests can also be obtained by clustering the pages the user has browsed, that is, by document clustering analysis. Four methods are evaluated: K-means, K-Medoids, MaxDist Sampling Clustering, and Bisecting K-means. We find Bisecting K-means most effective in our experiments.

(2) Selection of the order in which a crawler should visit the URLs it has seen. According to the distribution characteristics of Web pages, this dissertation analyzes the setup of seed URLs, the process of page retrieval, similarity evaluation between retrieved pages and...
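As a rough illustration of the information gain (IG) term selection named in contribution (1), the sketch below scores each term by how much knowing its presence or absence reduces the entropy of the class labels. This is not the dissertation's implementation: the function names (`entropy`, `information_gain`, `top_terms`), the set-of-terms document representation, and the toy pages and labels are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Return {term: IG} where IG = H(class) - H(class | term present/absent).

    docs   -- list of sets of terms (one set per browsed page; hypothetical format)
    labels -- list of class labels, same length as docs
    """
    n = len(docs)
    base = entropy(Counter(labels).values())
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        with_term = Counter(l for d, l in zip(docs, labels) if term in d)
        without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
        n_with = sum(with_term.values())
        n_without = n - n_with
        conditional = (
            (n_with / n) * entropy(with_term.values())
            + (n_without / n) * entropy(without_term.values())
        )
        scores[term] = base - conditional
    return scores


def top_terms(scores, k):
    """Pick the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data (assumed): pages the user found interesting vs. other pages.
docs = [
    {"crawler", "web"},
    {"crawler", "search"},
    {"football", "web"},
    {"football", "match"},
]
labels = ["interesting", "interesting", "other", "other"]

scores = information_gain(docs, labels)
selected = top_terms(scores, 2)
```

In this toy set, "crawler" and "football" each split the two classes perfectly and so receive the highest score, while "web" appears equally often in both classes and receives a score of zero. A real interest profile would presumably keep only the high-scoring terms associated with the pages the user found interesting.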

Keywords/Search Tags:

personalized services, customized Web crawling, user's interests, PSearch, search engine

Related items

1	Design And Implementation Of User-customized Desktop Search Engine
2	Analysis Of Personalized Search Engine Based On User Interest
3	A Study Of Personalized Search Based On Blog Content
4	Research And Implementation Of A Personalized Service System Based On User Interests
5	The Research And Implementation Of Personalized Search Engine
6	Algorithms Research And System Design On User-oriented Personalized Search Engine
7	Research And Implementation Of Book Search Engine Based On User Personalization
8	The Research Of Personalized Search Based On Improved Pagerank Algorithm And User Interest
9	Research And Achievement Of Personalized Search Engine
10	The Research Of Personalized Search Engine Technology Based On User Interest