The Research And Implementation Of The Website Evaluation Model Based On Clustering Algorithm

Posted on: 2010-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Xu
GTID: 2178360272997169
Subject: Computer application technology
Abstract/Summary:
The internet holds a vast amount of resources that let people obtain information regardless of location. However, with the rapid growth of information online, it has become increasingly difficult for users to extract useful information from so many web pages. People use search engines to find information more quickly and more accurately. First, a search engine stores and indexes the web pages on the internet. Then it analyzes the queries submitted by users and returns the web pages related to those queries in some order. Through this interactive process, the search engine completes the search service for users. Because a large number of web pages are related to any given query, a search engine must rank pages by authority and popularity so that it can present them in a more reasonable order, with the important ones at the top. To measure the authority of a web page, the website that hosts the page must be measured. The core task of a website evaluation model is to rank websites: a website with a higher rank is more authoritative and more favored.
All existing website rating methods are based on hyperlink analysis, which studies the referral relations formed by hyperlinks between pages. By analyzing the number of linking pages, the quality of the linked pages is estimated and the recommendation between pages is reflected.
Because hyperlink analysis considers only the hyperlinks between pages, not the pages' topic and content, website ratings produced by hyperlink analysis are independent of the queries entered by users, and a website's authority cannot be related to queries, which can result in topic drift.
In addition, hyperlinks are an attribute that webmasters can fake, so the internet is filled with junk and cheating websites that severely distort the results of website rating models based on hyperlink analysis.
To resolve these problems, this thesis proposes ranking websites based on user behavior data instead of hyperlink data. When users search for information with a search engine, they also give the search engine guidance: a user's selective browsing of the pages in the search results is itself an evaluation of the pages the search engine returned. By analyzing user search behavior, we can relate the queries entered by users to websites and construct a complex network with websites as nodes and queries as edges. This network can then be clustered with a complex network clustering algorithm. As for the choice of clustering algorithm, existing complex network clustering algorithms have high time complexity and are not suitable for very large networks; they cannot simultaneously deliver fast clustering, high precision, and independence from prior knowledge. Clustering methods with high accuracy usually have very high time complexity, while fast clustering methods are not precise enough and need more prior knowledge to decide when to stop clustering, so it is difficult to find a good balance between the two.
This thesis proposes a clustering algorithm called the cluster density limitation network clustering algorithm. The algorithm judges whether merging nodes is reasonable by checking whether the new cluster has enough edges and whether it remains dense after the merge. The algorithm can cluster rapidly even on a complex network with 3,000,000 nodes and 20,000,000 edges. After the complex network is clustered, websites of the same type gather automatically and form a cluster.
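The construction of the website network and the density check on merges can be illustrated with a small sketch. This is not the thesis's implementation, which the abstract does not specify. It assumes a hypothetical click log of (query, website) pairs, links two websites when the same query led to clicks on both, and accepts a merge only if the fraction of connected node pairs in the merged cluster stays above a hypothetical threshold `min_density`. The function names, the threshold value, and the example domains are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_site_graph(click_log):
    """Build {frozenset({site_a, site_b}): shared_query_count} from (query, site) pairs.

    Assumption: two websites share an edge when the same query led to clicks on both.
    """
    sites_by_query = defaultdict(set)
    for query, site in click_log:
        sites_by_query[query].add(site)
    weights = defaultdict(int)
    for sites in sites_by_query.values():
        for a, b in combinations(sorted(sites), 2):
            weights[frozenset((a, b))] += 1
    return dict(weights)


def density(cluster, edges):
    """Fraction of node pairs inside `cluster` that are joined by an edge."""
    n = len(cluster)
    if n < 2:
        return 1.0
    internal = sum(
        1 for a, b in combinations(sorted(cluster), 2) if frozenset((a, b)) in edges
    )
    return internal / (n * (n - 1) / 2)


def cluster_sites(edges, min_density=0.6):
    """Greedily merge clusters along edges, keeping a merge only if it stays dense.

    The quadratic density check is for illustration only; it would not scale
    to the millions of nodes mentioned in the abstract.
    """
    cluster_of = {}
    for edge in edges:
        for site in edge:
            cluster_of.setdefault(site, frozenset([site]))
    # Visit edges with the most shared queries first (ties broken by name).
    for edge in sorted(edges, key=lambda e: (-edges[e], sorted(e))):
        a, b = tuple(edge)
        cluster_a, cluster_b = cluster_of[a], cluster_of[b]
        if cluster_a == cluster_b:
            continue
        merged = cluster_a | cluster_b
        if density(merged, edges) >= min_density:
            for site in merged:
                cluster_of[site] = merged
    return set(cluster_of.values())


# Hypothetical click log: (query, clicked website)
click_log = [
    ("python tutorial", "py-docs.example"), ("python tutorial", "py-course.example"),
    ("learn python", "py-docs.example"), ("learn python", "py-course.example"),
    ("cheap flights", "fly-a.example"), ("cheap flights", "fly-b.example"),
    ("flight deals", "fly-a.example"), ("flight deals", "fly-b.example"),
    ("news", "fly-a.example"), ("news", "py-docs.example"),
]
clusters = cluster_sites(build_site_graph(click_log))
```

In this toy log, the single shared query "news" is not enough to merge the two groups, because the merged four-node cluster would have only three of six possible edges.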
Nodes in the same cluster are closely connected, while nodes in different clusters are loosely connected.
After clustering, we rank the websites within each cluster separately, so that websites in the same cluster are differentiated by authority. The rationality of the website rating is verified by comparing the rating curve of websites within the same cluster with that of PageRank. In the overall rating, we take into account both a website's authority within its cluster and the scale of the cluster. Unpopular websites receive a higher rating, and popular websites do not receive excessive preferential treatment from the rating model. In this way a reasonable rating system is formed. The rating system has the following advantages: (1) Website ratings are based on classification, so websites of the same type have similar content and belong to the same field. Pre-classifying queries by type makes it possible to search within the corresponding type of website for each type of query, which can improve the accuracy of search results. (2) It takes full advantage of user data, minimizing the effect of hyperlink cheating on website rating. (3) Clustering on user behavior data requires much less computation than clustering on hyperlink data. This shortens the cycle for computing website ratings, so that new ratings take effect sooner and improve the search engine's results more quickly.
Based on the website rating model built on user behavior analysis, this thesis implements a user-directed website rating system that takes user data as input and classifies websites with the website clustering algorithm. The accuracy of website classification is as high as 96%. In addition, websites of different ranks within the same cluster are distributed reasonably and evenly, and there is a distinct line between high rank and low rank.
Compared with PageRank, junk and cheating websites are effectively controlled and given a low rank, which avoids the effect of hyperlink cheating on the search engine. Because the classification of websites is based on the queries entered by users, the classified data contain a large amount of query information that can be used to match users' search requests. On the one hand, searching for a user's query only in the websites of the cluster that contains similar searches, instead of in all websites, can improve search precision. On the other hand, using the website classification information, websites from different clusters can be displayed appropriately, which not only relieves the uniformity of the results returned by search engines based on hyperlink analysis, but also diversifies the results and meets the needs of different users.
Website rating models based on user behavior analysis can overcome the shortcomings of existing search engines that rely on hyperlink analysis and effectively improve the results of existing website rating models.
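The abstract does not give the formula that combines a website's authority within its cluster with the scale of the cluster. The sketch below is one possible reading, not the thesis's model. It assumes that within-cluster authority is a site's share of the clicks in its cluster and that cluster scale enters only logarithmically, so that large, popular clusters gain only a damped advantage. The function name `rate_sites`, the click counts, and the scoring formula are all hypothetical.

```python
import math


def rate_sites(clusters, clicks):
    """Return {site: score} under an assumed scoring rule.

    score = (site's share of clicks within its cluster) * log(1 + cluster size)
    Both factors are illustrative assumptions, not the thesis's formula.
    """
    scores = {}
    for cluster in clusters:
        total_clicks = sum(clicks.get(site, 0) for site in cluster)
        scale = math.log1p(len(cluster))  # damped effect of cluster scale
        for site in cluster:
            share = clicks.get(site, 0) / total_clicks if total_clicks else 0.0
            scores[site] = share * scale
    return scores


# Hypothetical clusters and click counts
example_clusters = [{"big-a.example", "big-b.example"}, {"niche.example"}]
example_clicks = {"big-a.example": 30, "big-b.example": 10, "niche.example": 5}
scores = rate_sites(example_clusters, example_clicks)
```

Under these assumptions, the leading site of the small cluster outranks the weaker site of the large cluster even though it has fewer clicks in absolute terms, which is the kind of behavior the abstract describes for unpopular websites.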
Keywords/Search Tags: Website rating, Clustering algorithm, Users' behavior analysis