The Research On Chinese Word Sense Induction

Posted on: 2014-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Sun
GTID: 2268330401969445
Subject: Computer application technology
Abstract/Summary:
Word Sense Induction (WSI) uses clustering techniques to automatically derive the senses of polysemous words from a corpus, which can greatly improve the efficiency of linguists’ work. WSI has therefore become one of the most important problems in computational linguistics. This thesis studies Chinese Word Sense Induction (CWSI) from three angles: a method based on feature vectors, a method based on the co-occurrence graph, and an integration method. All experiments use the corpus provided by CLP2010. The specific contents are as follows:

(1) The method based on feature vectors. This part focuses on feature selection and the clustering algorithm. The features include words, Chinese characters, 2-grams of Chinese characters, and so on. Chinese characters strongly affect the result. Character 2-grams improve performance substantially for single-character targets but only slightly for multi-character targets. K-means and Rb are the better clustering algorithms. Compared with the systems that participated in CLP2010, our system ranks first with an F-score of 79.34%. For single-character targets in particular, our result is much better than the others’, with a score of 69.50%.

(2) The method based on the co-occurrence graph. This approach can achieve very good results but is suited to large-scale corpora. This thesis therefore uses four years of People’s Daily to extend the original corpus provided by CLP2010, and then uses the extended corpus to describe the distribution of nodes in the original corpus. Threshold filtering greatly improves the result, raising Part_Purity by 20% and bringing the distribution of cluster numbers much closer to the standard result. When collocations are also added to the nodes, Part_Purity reaches 90.46%, the number of useful polysemous words is 93, and the average number of examples is 27, compared with 88 and 25 without collocations. Part_Purity on the extended corpus is 8% higher than on the original corpus.

(3) The integration of different systems’ results. This part uses three integration methods: secondary clustering, voting, and a method that obtains the optimal clustering result through iterative steps. Two better systems and two worse systems are chosen as the base systems; the best scores 79.34% and the worst 68.68%. The three integration methods achieve 79.28%, 78.52%, and 79.05%, respectively. Although integration brings little improvement, the integrated system is stable and effectively avoids the influence of poor systems.
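
The feature-vector method in (1) can be illustrated with a minimal sketch. It is not the thesis’s implementation: the function names (`char_features`, `cosine`, `kmeans`), the toy contexts for the target character 打, the use of cosine similarity, and the deterministic seeding are all assumptions made for illustration, and only two of the feature types named in the abstract (single characters and character 2-grams) are used.

```python
import math
from collections import Counter


def char_features(context):
    """Count single characters and character 2-grams in one context."""
    counts = Counter(context)  # single Chinese characters
    counts.update(context[i:i + 2] for i in range(len(context) - 1))  # 2-grams
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(key, 0) for key, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans(vectors, k, iterations=10):
    """Cluster sparse vectors with a cosine-based K-means (illustrative only)."""
    # Assumed seeding: start from the first vector, then repeatedly add the
    # vector that is least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assign every vector to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Recompute each centroid as the sum of its members; the scale does
        # not matter for cosine similarity.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = sum(members, Counter())
    return labels


# Hypothetical contexts for the polysemous target 打 ("play" vs. "make a call").
contexts = ["他打篮球", "我们打篮球", "她打电话", "我打电话"]
labels = kmeans([char_features(c) for c in contexts], k=2)
print(labels)
```

On these four toy contexts, the two basketball sentences and the two telephone sentences fall into separate clusters, because each pair shares characters and 2-grams such as 篮球 or 电话 that the other pair lacks.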
Keywords/Search Tags:Chinese Word Sense Induction, feature vector, graph method, integration