
Web Mining Algorithm Based On Anchor Text Similarity And Time Factor

Posted on: 2014-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Li
GTID: 2268330425466230
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of computer network technology and the increasingly wide application of data storage technologies such as databases and data warehouses in management information systems, the volume of information on the Internet is growing rapidly and massive amounts of data are constantly produced, making the Internet an important platform for resource acquisition and information sharing. At the same time, the structure of the Internet has become extremely large. Data on the Internet is dynamic, distributed, and heterogeneous, and it lacks efficient unified management. Retrieving the most relevant information quickly and accurately from this volume of data has therefore become an urgent problem, one that brings unprecedented opportunities and challenges to search engines. Against this background, Web data mining has emerged and is gradually moving from research into a wide range of applications.

This thesis first studies the background and theory of Web data mining and analyzes and summarizes the state of research in the field. On that basis, it analyzes the fundamental principle, calculation method, advantages, and disadvantages of the PageRank algorithm. Then, to address two disadvantages of PageRank, namely topic drift and a bias toward old webpages, it proposes an improved algorithm called ATSTF-PageRank, based on anchor text similarity and a time factor. Using the vector space model, the algorithm incorporates the similarity between an anchor text and the theme of the webpage it points to, together with a feedback factor for webpage generation time. Finally, a practical verification scheme is worked out, and an experimental system based on the Nutch open-source search engine is designed and implemented accordingly. Comparative experiments between ATSTF-PageRank and the original PageRank algorithm are carried out on the SinaData and TencentData datasets.

Experimental results show that ATSTF-PageRank effectively suppresses topic drift and improves the accuracy and timeliness of query results. The precision of the search engine and user satisfaction also improve.
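
The abstract does not give the ATSTF-PageRank formula, so the following Python sketch only illustrates the general idea under stated assumptions. It weights each link by the cosine similarity (vector space model) between the anchor text and the target page's text, multiplied by a time factor that favors newer pages. The function names, the exponential half-life decay, the whitespace tokenization, the small `eps` floor, and the page and link data structures are all hypothetical choices made for illustration and are not taken from the thesis.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def time_factor(age_days, half_life_days=365.0):
    """Assumed decay: a page loses half its time weight every half-life."""
    return 0.5 ** (age_days / half_life_days)


def weighted_pagerank(pages, links, damping=0.85, iterations=50, eps=1e-6):
    """PageRank with link weights from anchor similarity and page age.

    pages: dict page_id -> {"text": str, "age_days": float}
    links: list of (source_id, target_id, anchor_text)
    """
    # Raw weight of each link: (similarity + eps) * time factor of the target.
    # eps keeps zero-similarity links from vanishing entirely (an assumption).
    weights = {}
    for src, dst, anchor in links:
        sim = cosine_similarity(anchor, pages[dst]["text"])
        w = (sim + eps) * time_factor(pages[dst]["age_days"])
        weights[(src, dst)] = weights.get((src, dst), 0.0) + w

    # Total outgoing weight per source, used to normalize each link's share.
    out_total = {}
    for (src, _dst), w in weights.items():
        out_total[src] = out_total.get(src, 0.0) + w

    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if p not in out_total)
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for (src, dst), w in weights.items():
            new_rank[dst] += damping * rank[src] * w / out_total[src]
        rank = new_rank
    return rank
```

In the original PageRank, a page divides its rank evenly among its outlinks. Here the share is proportional to the link weight, so a link whose anchor text matches the target's content passes on more rank than an off-topic link, and links to older pages are discounted.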
Keywords/Search Tags: Web mining, link analysis, PageRank, time factor, anchor text similarity