An Algorithm For Discovering New Chinese Words Based On Combination Frequency

Posted on: 2019-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Y Ou
Full Text: PDF
GTID: 2348330542498868
Subject: Information and Communication Engineering
Abstract/Summary:
With the arrival of the big data era, massive amounts of information circulate over the Internet as text. Chinese, one of the most important carriers of this information, lets users quickly obtain what they are looking for. Vocabulary, the most active component of Chinese sentences, is closely tied to social development. When social change brings the demise of old things and the appearance of new ones, people revise their ideas and develop a deeper understanding of the new things. As a result, new Chinese words emerge and change rapidly, and many of them are absent from the existing Chinese vocabulary. Targeting these constantly emerging words, and drawing on unsupervised learning theory from natural language processing, this thesis proposes a new Chinese word discovery algorithm based on combination frequency. Its three main contributions are as follows.

(1) A new Chinese word discovery framework is proposed. The semantic structure of modern Chinese is analyzed at the three levels of syntax, semantics, and pragmatics. Taking into account the reconstructed and colloquial character of new words, a definition of new Chinese words is given. Based on the characteristics of new Chinese words, and on an analysis of supervised and unsupervised learning algorithms in natural language processing, the thesis constructs the discovery framework.

(2) A key threshold index system is established for the new Chinese word discovery algorithm. Drawing on jieba-based Chinese word segmentation and the sliding scan window of the N-gram model, a string preprocessing procedure for the original corpus is given. Three quantitative indices are then proposed to implement the threshold index system: intra-word aggregation degree based on combination frequency, inter-word combination degree based on information entropy, and inverse document frequency.

(3) An experiment for discovering Chinese neologisms is set up on the Scrapy framework, together with a comparative experiment using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. The proposed algorithm is compared with TF-IDF, itself an unsupervised learning algorithm, in terms of precision, recall, and F-measure. The thesis also analyzes the improvement in accuracy achieved by the new algorithm.

By establishing a complete processing framework for new Chinese word discovery and drawing on probability theory and statistics, the thesis constructs a key threshold index system for the algorithm, and provides a theoretical basis and practical guidance for computer recognition of new Chinese words. Experimental results show that, for a specific corpus, the algorithm improves the accuracy of new Chinese word recognition to some degree and has practical value in related fields.
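The abstract names the three quantitative indices in (2) but gives no formulas. As a rough illustration only, the Python sketch below uses formulations that are common in unsupervised new-word discovery. These are assumptions, not the thesis's own definitions: intra-word aggregation is taken as the minimum, over all two-part splits, of the ratio between a candidate's observed probability and the product of its parts' probabilities; inter-word combination is taken as the smaller of the left and right neighbor-character entropies; and inverse document frequency uses a smoothed logarithm. All function names are hypothetical, and probabilities are approximated by dividing counts by the corpus length.

```python
import math
from collections import Counter


def ngram_counts(text, max_len):
    """Count every substring of length 1..max_len with a sliding window."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cohesion(word, counts, total):
    """Assumed intra-word aggregation score for a word of length >= 2.

    Returns the minimum, over all two-part splits, of
    P(word) / (P(left part) * P(right part)), with P(x) = counts[x] / total.
    """
    p_word = counts[word] / total
    return min(
        p_word / ((counts[word[:k]] / total) * (counts[word[k:]] / total))
        for k in range(1, len(word))
    )


def entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    n = sum(neighbor_counts.values())
    return -sum((c / n) * math.log(c / n) for c in neighbor_counts.values())


def boundary_entropy(word, text):
    """Assumed inter-word combination score: min(left entropy, right entropy)."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    if not left or not right:
        return 0.0  # no neighbors observed on one side
    return min(entropy(left), entropy(right))


def idf(word, documents):
    """Inverse document frequency with add-one smoothing in the denominator."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + df))


# Usage on a toy string; a real run would scan a crawled Chinese corpus.
toy = "abxabyabz"
toy_counts = ngram_counts(toy, 2)
scores = (cohesion("ab", toy_counts, len(toy)), boundary_entropy("ab", toy))
```

In a pipeline of this kind, a candidate string would be kept as a new word only if each score clears its threshold. The threshold values themselves are not stated in the abstract, so none are suggested here.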
Keywords/Search Tags: new Chinese word discovery algorithm, combination frequency, intra-word aggregation degree, inter-word combination degree