Research On Link Analysis And Topic Detection On Web Mining

Posted on: 2013-05-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Liu
Full Text: PDF
GTID: 1228330395498952
Subject: Computer software and theory
Abstract/Summary:
The Web has become the main platform on which people store and share information. Finding and using useful information across such varied sources is a serious challenge. Several problems specific to the Web, such as the huge number of semi-structured and unstructured documents, the uneven quality of web pages, the large amount of multimedia content, and fuzzy or non-standard user queries, mean that traditional information retrieval and database techniques cannot be applied effectively. Web information retrieval has therefore become an independent discipline covering a broad range of research topics. Among the active topics in this field, this thesis studies the following in depth.

First, this thesis studies the most important component of a modern search engine, the web page ranking algorithm. To address the shortcomings of mainstream topic-dependent ranking algorithms, it proposes G-HITS, a ranking algorithm based on an attractive-force model. The model treats each web page as a particle and the other ranking-related factors as mass or distance, and uses gravity to measure the relationship between pairs of web pages. In this way, the shortcomings of purely link-based ranking algorithms are overcome.

Second, in response to increasingly rampant web spam, this thesis studies link-based anti-spam algorithms. Building on an analysis of the well-known TrustRank and Anti-TrustRank algorithms, each of which propagates only trust or only distrust, it proposes a framework that propagates both trust and distrust. The proposed algorithm overcomes the disadvantages of TrustRank and Anti-TrustRank and improves the effectiveness of spam detection.

Third, this thesis studies the web community identification problem. Communities are an important phenomenon of the Web that reflects its topic distribution. Web community identification finds this topic distribution by discovering dense subgraphs of the web graph.
However, existing algorithms operate on whole web pages, whereas a single page may contain multiple topics. This thesis proposes a block-based web community identification algorithm, which solves the multiple-topic problem and improves the precision of web community identification.

Finally, this thesis studies the topic detection problem. To detect topics better, it studies spectral clustering, improves existing algorithms, and applies the improved spectral clustering algorithm to topic detection. It then proposes a topic detection algorithm based on a hypergraph partitioning algorithm. The algorithm first applies a two-stage web feature extraction method and then uses hypergraph partitioning to detect topics. In this way, the precision of topic detection is improved.
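
As an illustration of the anti-spam contribution summarized above, the sketch below shows one simple way trust and distrust could both be propagated over a link graph. It is not the framework proposed in the thesis, whose details the abstract does not give. It assumes a TrustRank-style biased PageRank from a trusted seed set, an Anti-TrustRank-style run on the reversed graph from a spam seed set, and a plain subtraction to combine the two. The function names, the damping value, and the toy graph are all hypothetical.

```python
def propagate(out_links, seeds, damping=0.85, iterations=50):
    """Biased PageRank: score flows along out_links and restarts at the seeds."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if not targets:
                continue  # dangling page: its score is dropped in this sketch
            share = damping * score[u] / len(targets)
            for v in targets:
                nxt[v] += share
        score = nxt
    return score


def reverse(out_links):
    """Return the graph with every link direction flipped."""
    rev = {}
    for u, targets in out_links.items():
        rev.setdefault(u, [])
        for v in targets:
            rev.setdefault(v, []).append(u)
    return rev


def trust_minus_distrust(out_links, good_seeds, spam_seeds):
    """Assumed combination rule: trust score minus distrust score."""
    # Trust flows forward: pages linked from trusted pages gain trust.
    trust = propagate(out_links, good_seeds)
    # Distrust flows backward: pages that link to spam gain distrust.
    distrust = propagate(reverse(out_links), spam_seeds)
    return {n: trust[n] - distrust[n] for n in trust}


# Toy graph (hypothetical): "g" and "a" link to each other, as do "s" and "t".
web = {"g": ["a"], "a": ["g"], "t": ["s"], "s": ["t"]}
scores = trust_minus_distrust(web, good_seeds={"g"}, spam_seeds={"s"})
```

In this toy graph, page "a" ends up with a positive score because it is reachable from the trusted seed, while page "t" ends up negative because it links to the spam seed. A real system would need a better combination rule than subtraction, along with handling for dangling pages.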
Keywords/Search Tags: Web Information Retrieval, Web Mining, Link Analysis, Community Identification, Topic Detection