Building A Large Scale Chinese Semantic Dictionary

Posted on: 2012-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhao
GTID: 2218330362450410
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, understanding and processing information has received more and more attention. Understanding information requires natural language semantic analysis, which in turn depends on the support of a corresponding semantic dictionary, so constructing a semantic dictionary has become a basic task in natural language processing. As a foundational resource, a semantic dictionary is helpful not only to the underlying technologies of natural language processing, such as word segmentation, named entity recognition, and word sense disambiguation, but also to upper-layer applications, such as question answering systems, information retrieval, and text classification.

To build a practical semantic dictionary that can play an important role in Chinese information processing, we propose WordMap, a Chinese semantic dictionary with a simple structure and a sufficiently large vocabulary. WordMap integrates existing semantic dictionaries such as HowNet and Tongyici Cilin (extended version), and also draws on Internet resources such as Sogou Cell Dictionary and Baidu Encyclopedia, which expands the scale of the dictionary and adds new words. WordMap uses a five-level classification system to describe word senses, reflecting the hierarchical relationships between words. Each word sense is followed by a synonym set, in which the words are synonymous or similar in meaning.

First, we integrate HowNet and Tongyici Cilin (extended version) to build the general part of WordMap. To integrate HowNet into Tongyici Cilin (extended version), we first apply an algorithm based on synonyms, and then an algorithm based on similar words for the remaining words. We then manually proofread the automatic results and label a small number of words by hand. The result constitutes the general-domain vocabulary of WordMap.

Second, we construct the domain dictionaries in WordMap. Based on the features of Baidu Encyclopedia, we use an integration method based on the open category tags of its entries, which adds 83 domain dictionaries containing 1,751,756 words. Based on the characteristics of Sogou Cell Dictionary, we use an integration method that manually labels the mapping from the classification scheme of Sogou Cell Dictionary to the word sense scheme of WordMap, which adds 26 domain dictionaries containing 4,417,937 words.

Third, we normalize WordMap. We use a support vector machine (SVM) to recognize correct names in the Name domain dictionary and rule out wrongly classified words. The F-value reaches 99.926% on the test sets, 7% higher than the baseline methods, which satisfies the application requirements.

Finally, to help users better understand the data in WordMap, we develop the WordMap online system.
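
The structure described in the abstract (a five-level sense classification with a synonym set under each sense) and the idea of synonym-based integration can be illustrated with a minimal Python sketch. The abstract does not give the actual algorithm, so the voting rule below is an assumption: a new word is attached to the sense whose synonym set shares the most members with the word's known synonyms. The class `WordMapSketch`, its methods, and the sense labels are all hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict


class WordMapSketch:
    """Toy model: a sense is a path of five level labels; each sense owns a synonym set."""

    DEPTH = 5  # the abstract states that WordMap uses five classification levels

    def __init__(self):
        self.synsets = defaultdict(set)    # sense path (5-tuple) -> words in its synonym set
        self.senses_of = defaultdict(set)  # word -> sense paths that contain it

    def add(self, path, words):
        """Register words under a five-level sense path."""
        path = tuple(path)
        if len(path) != self.DEPTH:
            raise ValueError("a sense path needs exactly five levels")
        for word in words:
            self.synsets[path].add(word)
            self.senses_of[word].add(path)

    def attach_by_synonyms(self, word, synonyms):
        """Attach `word` to the sense that most of its synonyms already belong to.

        Assumed rule (not taken from the thesis): simple majority vote,
        with ties broken by the lexicographically largest path.
        Returns the chosen path, or None if no synonym is known.
        """
        votes = Counter()
        for synonym in synonyms:
            for path in self.senses_of.get(synonym, ()):
                votes[path] += 1
        if not votes:
            return None  # such words would go to a fallback step or manual labeling
        best_path, _ = max(votes.items(), key=lambda item: (item[1], item[0]))
        self.add(best_path, [word])
        return best_path


# Hypothetical sense labels and English words, used only for illustration.
wm = WordMapSketch()
wm.add(("A", "a", "01", "A", "01"), ["person", "human"])
wm.add(("B", "a", "01", "A", "01"), ["object", "thing"])

# "individual" comes from another dictionary with these listed synonyms.
target = wm.attach_by_synonyms("individual", ["person", "human", "thing"])
print(target)
```

Words for which `attach_by_synonyms` returns `None` correspond to the "remaining words" that the abstract says are handled by a second, similarity-based algorithm and by manual labeling; that step is not sketched here.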
Keywords/Search Tags: WordMap, semantic dictionary, HowNet, Tongyici Cilin, Sogou Cell Dictionary, Baidu Encyclopedia