Topic Web Mining Algorithms Research And Application

Posted on: 2010-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhou
GTID: 2178360275963020
Subject: Computer software and theory
Abstract/Summary:
The World Wide Web (the Web for short) has become closely tied to many aspects of our lives. We use it to obtain information, communicate with people, work, and conduct various social activities. How to obtain the required information from the Web quickly and accurately has become a serious problem. To address this problem, topic web mining has been proposed in the field of information retrieval. Its basic idea can be summarized as follows: in accordance with a user-defined topic, topical crawlers crawl the Web and collect topic-related pages; the collected pages are then analyzed intelligently; finally, user-friendly retrieval methods are used to serve a particular search request. Topic web mining draws on multiple disciplines, including machine learning, information retrieval, natural language analysis, statistics and computer networks. It can be applied to domain-specific knowledge bases, enterprise decision support, customer loss analysis, potential customer analysis, enterprise management optimization, business trend analysis and so on. It is a good complement to current search engines.

Based on a discussion and analysis of current topic web mining, this thesis focuses on three critical research issues. The first is how to improve the classification accuracy of web text. The second is how to improve the performance of topic crawlers, especially with regard to web spam detection. The third, building on the first two, is the design of a topic web mining prototype system, Gsearch, to evaluate the performance of the proposed algorithms. The experimental results confirm the effectiveness of our models and system.

The main contributions of the thesis can be summarized as follows:

1. Current topic crawlers lack the ability to detect topic web spam, which is the primary limitation on further improvement of their performance. We address this problem on the basis of a topical crawling model and propose an antiSpam topic crawler algorithm. This gives topic crawlers an anti-spam function, improves the topical relevance of the pages they download, and enhances the adaptability of the crawlers.

2. We transform the web text filtering problem into a web text classification problem and propose two web text classification methods: PSK-means, based on a clustering algorithm, and correlation-FCM, based on the Fuzzy Cognitive Map. PSK-means is a modification of the traditional k-means algorithm that first combines similar data and then applies cluster analysis. Correlation-FCM is a reasoning algorithm for text categorization based on the fuzzy cognitive map; it uses text term weights and the correlation degrees between terms and classes in the map. Experiments show that this method is effective.

3. A topic web mining prototype system, Gsearch, is designed and implemented to verify the validity of the models and algorithms. Gsearch comprises a Gcrawler topic crawler module, a word segmentation and index module, a page evaluation module, a Gminer data mining module, a query analysis module and a user interface. It is platform-independent, distributed and highly scalable. Its functions include Web information download, information storage, document archiving, information analysis, and a convenient retrieval interface. The system can be used in various applications, including enterprise decision support, business market analysis, enterprise management optimization, customer analysis, construction of domain-specific knowledge bases and so on.
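
The crawl-and-filter idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's antiSpam algorithm, whose details the abstract does not give. It assumes a best-first crawl over a small in-memory link graph, a cosine-style relevance score against a set of topic terms, and a crude keyword-stuffing check standing in for spam detection. All names (`topical_crawl`, `looks_like_spam`, `cosine_similarity`), the thresholds and the sample pages are hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text, topic_terms):
    """Cosine similarity between a page's term counts and a binary topic vector."""
    counts = Counter(text.lower().split())
    if not counts or not topic_terms:
        return 0.0
    dot = sum(counts[t] for t in topic_terms)
    norm_page = math.sqrt(sum(c * c for c in counts.values()))
    norm_topic = math.sqrt(len(topic_terms))
    return dot / (norm_page * norm_topic)


def looks_like_spam(text, max_term_share=0.5):
    """Assumed heuristic: a page is spam if a single term dominates it."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return True
    return counts.most_common(1)[0][1] / total > max_term_share


def topical_crawl(pages, links, seeds, topic_terms, threshold=0.2):
    """Best-first crawl; returns accepted page ids in visit order."""
    frontier = [(-1.0, s) for s in seeds]  # negated priority gives a max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    accepted = []
    while frontier:
        _, page_id = heapq.heappop(frontier)
        text = pages.get(page_id, "")
        if looks_like_spam(text):
            continue  # drop the page and do not follow its links
        score = cosine_similarity(text, topic_terms)
        if score < threshold:
            continue  # off-topic page: do not expand it
        accepted.append(page_id)
        for target in links.get(page_id, []):
            if target not in seen:
                seen.add(target)
                # Children inherit the parent's relevance as their priority.
                heapq.heappush(frontier, (-score, target))
    return accepted


# Toy data standing in for downloaded pages and their outgoing links.
pages = {
    "a": "web mining crawler topic pages",
    "b": "topic crawler collects web pages about mining",
    "c": "cheap cheap cheap cheap mining",
    "d": "cooking recipes for pasta",
}
links = {"a": ["b", "c", "d"], "b": ["d"]}
topic_terms = {"web", "mining", "crawler", "topic"}

result = topical_crawl(pages, links, ["a"], topic_terms)
print(result)
```

In this toy run, page `c` is rejected by the spam check and page `d` by the relevance threshold, so only the on-topic pages are kept. A real crawler would fetch pages over the network and would likely use a trained classifier instead of these fixed heuristics.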
Keywords/Search Tags: topic web mining, topic crawler, web text categorization, web spam detection, data mining, full-text index