An Algorithm For Discovering New Chinese Words Based On Combination Frequency

Posted on:2019-03-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y G Y Ou

GTID:2348330542498868

Subject:Information and Communication Engineering

Abstract/Summary:

With the arrival of the big data era, massive information resources circulate on the Internet in the form of text. Chinese, one of the most important carriers of these resources, allows users to quickly obtain the information they want. Vocabulary, the most active component of Chinese sentences, is closely tied to the development of society. When social upheaval brings the demise of old things and the appearance of new ones, people change their ideas and form a deeper understanding of the new things. As a result, new Chinese words emerge and change rapidly, and many of them did not previously exist in the Chinese vocabulary. To address these constantly updated new words, and drawing on unsupervised learning theory in natural language processing, this thesis proposes a new Chinese word discovery algorithm based on combination frequency. The three main contributions are as follows.

(1) A new Chinese word discovery framework is proposed. The semantic structure of modern Chinese is analyzed at three levels: syntax, semantics, and pragmatics. Taking into account the reconstructed and colloquial character of new words, a definition of new Chinese words is given. Based on the characteristics of new Chinese words, the framework is constructed by analyzing supervised and unsupervised learning algorithms in natural language processing.

(2) A key threshold index system is established for the new Chinese word discovery algorithm. By studying jieba-based Chinese word segmentation and the sliding scan window of the N-gram model, a string preprocessing procedure for the original corpus is given. Three specific quantitative indexes are then proposed to implement the key threshold index system: intra-word aggregation based on combination frequency, inter-word combination based on information entropy, and inverse document frequency.

(3) An experiment on discovering Chinese neologisms is set up on the Scrapy framework, together with a comparative experiment based on the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. Using precision, recall, and F-measure as criteria, the proposed algorithm is compared against TF-IDF, which is also an unsupervised method. An analysis of why the proposed algorithm improves the accuracy of new word discovery is also given.

By establishing a complete processing framework around the new Chinese word discovery algorithm, and by combining probability theory with statistics, this thesis constructs the key threshold index system of the algorithm and provides a theoretical basis and practical guidance for the computer recognition of new Chinese words. Experimental results show that, for a specific corpus, the algorithm improves the accuracy of new Chinese word recognition to a certain degree and has practical value in related fields.
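
The abstract names three threshold indexes but does not give their formulas. The sketch below shows one common way such indexes are computed over character N-grams, and should not be read as the thesis's own definitions. It makes three assumptions: intra-word aggregation is approximated by a pointwise-mutual-information-style ratio over the weakest binary split, inter-word combination by the smaller of the left and right neighbor entropies, and inverse document frequency by the variant log(N / (1 + df)). All function names and default thresholds are hypothetical.

```python
import math
from collections import Counter, defaultdict


def candidate_stats(documents, max_len=3):
    """Slide a window of 1..max_len characters over each document,
    counting every substring and the characters adjacent to it."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in documents:
        n = len(text)
        for i in range(n):
            for k in range(1, max_len + 1):
                if i + k > n:
                    break
                gram = text[i:i + k]
                freq[gram] += 1
                if i > 0:
                    left[gram][text[i - 1]] += 1
                if i + k < n:
                    right[gram][text[i + k]] += 1
    return freq, left, right


def entropy(counter):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def cohesion(gram, freq, total_chars):
    """Assumed aggregation measure: P(gram) / (P(a) * P(b)),
    minimized over every binary split gram = a + b."""
    p = freq[gram] / total_chars
    return min(
        p / ((freq[gram[:i]] / total_chars) * (freq[gram[i:]] / total_chars))
        for i in range(1, len(gram))
    )


def idf(gram, documents):
    """Assumed IDF variant: log(N / (1 + df))."""
    df = sum(1 for d in documents if gram in d)
    return math.log(len(documents) / (1 + df))


def discover(documents, max_len=3, min_freq=2, min_cohesion=2.0, min_entropy=0.5):
    """Return (gram, cohesion, neighbor entropy, idf) for candidates of
    length >= 2 that pass all three illustrative thresholds."""
    freq, left, right = candidate_stats(documents, max_len)
    total_chars = sum(len(d) for d in documents)
    results = []
    for gram, count in freq.items():
        if len(gram) < 2 or count < min_freq:
            continue
        c = cohesion(gram, freq, total_chars)
        h = min(entropy(left[gram]), entropy(right[gram]))
        if c >= min_cohesion and h >= min_entropy:
            results.append((gram, c, h, idf(gram, documents)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

In practice the corpus would first be cleaned and segmented (the abstract mentions jieba and a crawler built on Scrapy), candidates already present in a dictionary would be filtered out, and the thresholds would be tuned on the corpus at hand. A call such as `discover(list_of_documents, max_len=4)` returns the surviving candidates ranked by the assumed aggregation score.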

Keywords/Search Tags:

new Chinese word discovery algorithm, combination frequency, intra word polymerization degree, inter word combination degree

Related items

1	Word Segmentation And Pos Tagging In Chinese
2	Research On Cross-domain Chinese Word Segmentation Method Based On New Word Discovery
3	The Core Word Extraction Based On Important Degree And Affinity Degree
4	Research On Chinese New Word Discovery Technology Based On Large Scale Network Corpus
5	Research On Chinese Word Segmentation Strategies For Statistical Machine Translation
6	Keyword Spotting Based On Sub-word Decoding And System Combination
7	The Research Of Unknown Chinese Work Recognition And Its Application To Chinese Input Method
8	The Research On Chinese Word Segmentation System Based On SVM
9	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
10	Context Computing Applications, Word Disambiguation