Research On Link Analysis And Topic Detection On Web Mining

Posted on:2013-05-17

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Y Liu

Full Text:PDF

GTID:1228330395498952

Subject:Computer software and theory

Abstract/Summary:

The Web has become the main platform on which people store and share information, and finding and using the useful information among its varied sources is a serious challenge. Several problems specific to the Web make traditional information retrieval and database techniques ineffective there: the huge number of semi-structured and unstructured documents, the uneven quality of web pages, the large amount of multimedia content, and fuzzy or non-standard user queries. Web information retrieval has therefore become an independent discipline covering a broad range of research topics. This thesis studies the following topics in that field.

First, this thesis studies the web page ranking algorithm, the most important component of a modern search engine. To address the shortcomings of mainstream topic-dependent ranking algorithms, it proposes G-HITS, a ranking algorithm based on an attractive-force model. The model treats each web page as a particle, treats the other factors relevant to ranking as mass or distance, and uses gravity to measure the relationship between pairs of web pages. In this way, the shortcomings of purely link-based ranking algorithms are overcome.

Second, in response to increasingly rampant web spam, this thesis studies link-based anti-spam algorithms. The well-known TrustRank and Anti-TrustRank algorithms can each propagate only trust or only distrust. Based on an analysis of this limitation, the thesis proposes a framework that propagates both trust and distrust. The proposed algorithm overcomes the disadvantages of TrustRank and Anti-TrustRank and improves the effectiveness of spam detection.

Third, this thesis studies the web community identification problem. Communities are an important phenomenon of the Web and reflect its topic distribution. Web community identification uncovers this topic distribution by discovering dense subgraphs of the web graph. However, existing algorithms operate on whole web pages, whereas a single page can contain multiple topics. This thesis proposes a block-based web community identification algorithm, which solves the multiple-topic problem and improves the precision of web community identification.

Finally, this thesis studies the topic detection problem. To detect topics more accurately, it studies spectral clustering, improves on existing algorithms, and applies the improved spectral clustering algorithm to topic detection. It then proposes a topic detection algorithm based on hypergraph partitioning: the algorithm first applies a two-stage web feature extraction method and then uses hypergraph partitioning to detect topics. In this way, the precision of topic detection is improved.
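
As a rough illustration of the trust and distrust propagation mentioned in the anti-spam contribution above, the sketch below runs a seed-biased PageRank twice: once forward from a trusted seed (in the spirit of TrustRank) and once on the reversed graph from a known spam seed (in the spirit of Anti-TrustRank). It is not the framework proposed in the thesis, whose details the abstract does not give. The toy graph, the function names `biased_pagerank` and `reverse_graph`, the damping factor of 0.85, and the plain difference used to combine the two scores are all assumptions made for illustration.

```python
def biased_pagerank(out_links, seeds, damping=0.85, iterations=50):
    """Seed-biased PageRank: teleport mass goes only to the seed pages.

    out_links maps every page to the list of pages it links to; every
    link target must itself be a key of out_links.
    """
    nodes = sorted(out_links)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                # Split the damped score evenly among the linked pages.
                share = damping * score[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Dangling page: hand its damped score back to the seeds.
                for n in nodes:
                    nxt[n] += damping * score[src] * teleport[n]
        score = nxt
    return score


def reverse_graph(out_links):
    """Flip every edge, so that scores flow from a page to its in-linkers."""
    reversed_links = {n: [] for n in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            reversed_links[dst].append(src)
    return reversed_links


# Hypothetical toy web graph: two good pages and two spam pages.
graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],  # spam pages often link to good pages
    "spam2": ["spam1"],
}

# Trust flows forward along links from a hand-picked trusted seed.
trust = biased_pagerank(graph, {"good1"})
# Distrust flows backward: pages that link to spam become suspicious.
distrust = biased_pagerank(reverse_graph(graph), {"spam1"})

# Assumed combination rule: positive means trusted, negative means suspect.
combined = {page: trust[page] - distrust[page] for page in graph}

for page in sorted(combined, key=combined.get, reverse=True):
    print(f"{page}: {combined[page]:+.3f}")
```

On this toy graph the good pages end up with positive combined scores and the spam pages with negative ones. A real framework would need a more careful combination rule than a plain difference, since trust and distrust scores computed this way are not on a shared scale.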

Keywords/Search Tags:

Web Information Retrieval, Web Mining, Link Analysis, Community Identification, Topic Detection

Related items

1	Research On The Representation Model And Technologies Of Link Detection And Tracking On News Topic
2	Searching Topic-specific Authoritative Information Sources On The Web With Content And Link Analysis
3	Research On The Key Technologies Of Information Mining Oriented To Network Content Security
4	Research On Short Text Topic Information Mining Technology
5	Research And Application Of Topic Oriented Text Mining
6	Research On Subtopic Mining For Diversified Information Retrieval
7	Research On The Recognition Of Link's Topic Drift With Short Text
8	Research On The Key Techniques Of Web Information Intelligent Acquisition
9	Research On Information Retrieval Discipline Topic Based On Comparative Study On SIGIR Mailing List And Scholar Papers
10	Knowledge Mining For Web Information Retrieval