
Research Of Chinese Web Text Mining Techniques And Its Implementation

Posted on: 2007-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: F Z Su
GTID: 2178360185481181
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, information on the Web grows ever richer. This brings great convenience, but it also creates a new problem: the Internet is full of unstructured information, which forces us to adopt efficient information processing techniques so that we can find the information we need without being submerged in a mass of useless information. Against this background, Web text mining has emerged and become an active research area. This thesis presents our research on Chinese Web text mining. It is organized as follows.

First, it introduces the research background, the research purpose, and the basic theory of Web text mining.

Second, it studies word segmentation for Chinese Web text. It adopts an algorithm called binary-seek by character for rough word segmentation. It also designs strategies for handling ambiguities and unknown words. For combinational ambiguity in particular, it designs a new case-based learning disambiguation algorithm based on the structural similarity of Chinese sentences.

Third, it discusses feature representation and feature selection. It uses the common Vector Space Model (VSM) to represent text features and the χ² statistic as the evaluation function for selecting features.

Last, it focuses on Chinese text clustering. It designs a concept clustering algorithm based on HowNet. It uses HowNet to build a Chinese concept dictionary and concept hierarchy, maps text features to the concept space through concept disambiguation and concept mapping, and applies an improved K-medoid method to cluster the texts.

Both theoretical and experimental analyses were carried out, mainly testing the word segmentation efficiency and the clustering results of the Web text mining system. The test results indicate that the system is effective and promising for practical use.
Keywords/Search Tags: Chinese Web Text Mining, Case-based Learning, χ² statistic, Concept Mapping, Text Clustering
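
The abstract names the χ² statistic as the feature-selection criterion but does not give the formula or any implementation details. As a rough illustration only, the sketch below uses the standard term-versus-class form, χ²(t, c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), where A is the number of documents of class c that contain term t, B the documents of other classes that contain t, C the documents of c without t, and D the documents of other classes without t. The function names `chi_square` and `top_terms`, the max-over-classes scoring, and the toy documents are assumptions made for this example, not the thesis's actual method or data.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score of a term for one class from a 2x2 contingency table.

    a: docs in the class containing the term
    b: docs outside the class containing the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - c * b) ** 2 / denom


def top_terms(docs, k):
    """Return the k terms with the highest chi-square score.

    docs is a list of (set_of_terms, label) pairs. Each term is scored by
    its maximum chi-square over all classes (an assumed design choice);
    ties are broken alphabetically so the result is deterministic.
    """
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    term_counts = Counter()
    term_class_counts = Counter()
    for terms, label in docs:
        for t in set(terms):
            term_counts[t] += 1
            term_class_counts[(t, label)] += 1

    scores = {}
    for t, tc in term_counts.items():
        best = 0.0
        for label, cc in class_counts.items():
            a = term_class_counts[(t, label)]
            b = tc - a
            c = cc - a
            d = n - a - b - c
            best = max(best, chi_square(a, b, c, d))
        scores[t] = best
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy, already-segmented documents (hypothetical data).
docs = [
    ({"ball", "team"}, "sport"),
    ({"ball", "goal"}, "sport"),
    ({"stock", "team"}, "finance"),
    ({"stock", "bank"}, "finance"),
]
selected = top_terms(docs, 2)
```

In this toy set, "team" appears equally often in both classes and scores zero, so it would be dropped, while terms confined to a single class rank highest. The selected terms would then serve as the dimensions of the VSM document vectors.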