
Research on Web Text Clustering Based on Semantics

Posted on: 2015-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: X Chen
GTID: 2268330428966818
Subject: Computer Science and Technology
Abstract/Summary:
With the development and maturing of information technology, especially Internet technology, more and more information is available to people. Faced with such vast amounts of information, people demand fast, accurate, and comprehensive access to it, yet the information itself is redundant and chaotic. Acquiring, analyzing, and managing information effectively has therefore become one of the most pressing issues in information processing and is receiving growing attention from researchers. As a result, Web text clustering has become an important research direction in information retrieval. Traditional text clustering methods based on the vector space model produce high-dimensional, sparse feature vectors, which makes breakthroughs and innovation in this direction difficult. Existing semantics-based text clustering methods mostly focus on traditional text and rarely analyze Chinese Web text, so they give poor results when applied to Web text in Chinese.
This paper analyzes the current state of research on Chinese text clustering methods. On that basis, and to handle the characteristics of Web text, such as fast updates, short length, and non-standard words, it uses HowNet-based semantic analysis to study Web text clustering. First, building on an understanding of the structure of HowNet, the paper improves the word similarity calculation method so that it conforms better to semantic conventions. Then, after analyzing the difficulties of Web text clustering algorithms, it introduces HowNet semantic similarity calculation into the traditional Fuzzy C-Means algorithm. The result is an improved version of the K-Means algorithm that uses a semantic similarity threshold to control the number of clustering iterations.
Based on this algorithm, a microblog topic discovery system was designed and implemented. The system automatically fetches each day's newly posted microblogs from Sina Weibo. Microblogs that fall into the same cluster are considered to discuss the same topic, which is how the system discovers Weibo topics.
Finally, an evaluation of the algorithm and an experimental analysis of the system's functions show that the algorithm clearly improves on traditional Web text clustering, and that the system built on it meets the expected requirements well.
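The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea: fuzzy clustering driven by a pairwise similarity function, with a similarity threshold deciding when to stop iterating. Everything specific here is an assumption made for illustration. The `jaccard` function is a toy word-overlap stand-in for a HowNet-based similarity, the medoid-style cluster centers are used because texts are not vectors, and the names `fuzzy_c_medoids` and `stop_sim` are hypothetical and not taken from the thesis.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a HowNet-based similarity: word-set overlap in [0, 1]."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def _memberships(dists: Sequence[float], m: float) -> List[float]:
    """Fuzzy C-Means style memberships from distances to each cluster center."""
    zeros = [k for k, dist in enumerate(dists) if dist == 0.0]
    if zeros:  # the text coincides with one or more centers
        return [1.0 / len(zeros) if k in zeros else 0.0 for k in range(len(dists))]
    weights = [dist ** (-2.0 / (m - 1.0)) for dist in dists]
    total = sum(weights)
    return [w / total for w in weights]


def fuzzy_c_medoids(
    texts: Sequence[str],
    c: int,
    sim: Callable[[str, str], float] = jaccard,
    m: float = 2.0,
    stop_sim: float = 0.99,
    max_iter: int = 50,
) -> Tuple[List[int], List[List[float]]]:
    """Return (hard labels, fuzzy memberships) for `texts` grouped into `c` clusters."""
    n = len(texts)
    # Pairwise dissimilarity derived from the similarity function.
    d = [[1.0 - sim(texts[i], texts[j]) for j in range(n)] for i in range(n)]

    # Deterministic farthest-first initialisation of the cluster centers.
    medoids = [0]
    while len(medoids) < c:
        medoids.append(max(range(n), key=lambda i: min(d[i][k] for k in medoids)))

    u: List[List[float]] = []
    for _ in range(max_iter):
        # Membership of every text in every cluster.
        u = [_memberships([d[i][k] for k in medoids], m) for i in range(n)]
        # Each new center is the text with the lowest membership-weighted cost.
        new_medoids = [
            min(range(n), key=lambda j: sum((u[i][k] ** m) * d[i][j] for i in range(n)))
            for k in range(c)
        ]
        # Assumed stopping rule: stop once every center is at least
        # `stop_sim` similar to the center it replaces.
        converged = all(1.0 - d[old][new] >= stop_sim
                        for old, new in zip(medoids, new_medoids))
        medoids = new_medoids
        if converged:
            break

    # Recompute memberships against the final centers before labelling.
    u = [_memberships([d[i][k] for k in medoids], m) for i in range(n)]
    labels = [max(range(c), key=lambda k: row[k]) for row in u]
    return labels, u
```

Calling `fuzzy_c_medoids(posts, c=2)` on a list of short, whitespace-tokenized posts returns one hard label per post plus the fuzzy membership rows. Lowering `stop_sim` ends the loop earlier, which is one plausible reading of how a similarity threshold could control the number of iterations. Chinese text would need a word segmentation step before any word-level similarity can be computed.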
Keywords/Search Tags:HowNet, Fuzzy C-Means, text clustering, semantic similarity