Research On The Algorithms Of Web Structure Mining

Posted on: 2010-05-09; Degree: Master; Type: Thesis
Country: China; Candidate: Y L Ji; Full Text: PDF
GTID: 2178360275979602; Subject: Computer software and theory
Abstract/Summary:
Web Mining is the application of classical Data Mining theory to collections of Web pages. It draws on artificial intelligence, information science, computer science, mathematics, cognitive science and many other areas of science and technology. Web Structure Mining (WSM) is an important branch of Web Mining research. WSM is the process of analyzing different kinds of page structure in order to discover potentially valuable information that lies outside the Web content itself. The main kinds of page structure are: the hyperlink structure between different pages; the tree structure within a Web page, as expressed in HTML and XML; and the directory path structure in document URLs.

In this thesis, we first analyze the classical structure mining algorithms PageRank and HITS, together with their improved variants, and systematically set out the problems in these algorithms. PageRank responds quickly because it is computed offline, but it ignores the relationship between a page and the query, which reduces the topical relevance of its results. HITS is computed according to the query text, but the whole computation has to be carried out online, so it responds more slowly than PageRank. To address these problems, this thesis proposes an improved algorithm, B-PH (Algorithm based on PageRank and HITS). B-PH combines document content with hyperlink structure, making the result pages more authoritative and more relevant. Its feasibility and effectiveness are verified through simulation experiments and comparison with the classical algorithms.

The major work done in this thesis is as follows:
1. Analyzed the classical algorithms and systematically set out their existing problems.
2. Proposed new ways to remove noise links from the Web graph, which greatly improves the efficiency of the algorithm.
3. Proposed the B-PH algorithm. This algorithm is based on the framework of the HITS algorithm and is combined with PageRank. It can greatly reduce topic drift and improve query efficiency and quality.
4. Proposed experiment models to verify the B-PH algorithm, and developed a Web application experiment system on the .NET platform based on a B/S (browser/server) architecture. Experiments on real data verify the feasibility and effectiveness of the algorithm, and the results are compared and analyzed.
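
The contrast between the two classical algorithms discussed above can be illustrated with a minimal Python sketch. It shows textbook PageRank (power iteration with a damping factor) and textbook HITS (mutually reinforcing hub and authority scores), not the B-PH algorithm, whose details are not given in this abstract. The four-page graph, the function names, the damping value of 0.85 and the iteration count are all assumptions made for illustration. In practice PageRank would be run once over the whole crawled graph offline, while HITS would be run at query time on a small query-focused subgraph.

```python
from math import sqrt
from typing import Dict, List, Tuple

# Hypothetical toy link graph: page -> pages it links to.
# Assumption: every link target also appears as a key.
Graph = Dict[str, List[str]]


def pagerank(graph: Graph, damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Textbook PageRank by power iteration (query-independent)."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not graph[p])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for p in pages:
            for q in graph[p]:
                # Each page shares its rank equally among its out-links.
                new_rank[q] += damping * rank[p] / len(graph[p])
        rank = new_rank
    return rank


def _normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to unit Euclidean length (no-op for all zeros)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(graph: Graph,
         iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Textbook HITS; returns (hub scores, authority scores)."""
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in graph[p]:
                auth[q] += hub[p]
        auth = _normalize(auth)
        # Hub score of a page: sum of authority scores of pages it links to.
        hub = _normalize({p: sum(auth[q] for q in graph[p]) for p in pages})
    return hub, auth


# Illustrative graph only; not data from the thesis.
graph: Graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(graph)
hub, auth = hits(graph)
print(sorted(rank, key=rank.get, reverse=True))
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

On this toy graph both methods rank page c highest, because three of the four pages link to it. On real Web data the two can diverge, which is the trade-off between offline speed and query relevance that the abstract describes.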
Keywords/Search Tags: Web structure mining, Link structure, algorithm, PageRank, HITS