
Research And System Implementation Of Chinese New Words Discovery Based On Mutual Information

Posted on: 2021-05-06 | Degree: Master | Type: Thesis
Country: China | Candidate: G H Shang | Full Text: PDF
GTID: 2428330614471742 | Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, people have become accustomed to commenting online. Public opinion spreads faster and has a growing impact, and products for public opinion monitoring have emerged in the market as a result. Language also changes over time, and new words and special terms keep appearing on the network. These frequently occurring new words challenge existing automatic word segmentation technology and also affect public opinion hot spot analysis, web text mining, and sentiment analysis. Research on Chinese new word discovery has therefore received more and more attention. In recent years, many scholars and research institutions have done a lot of work on new word recognition, but its accuracy is still not very high. The key problem is recognizing high-frequency garbage word strings: existing statistical methods cannot distinguish them from real words, so semantic information is a promising way to address the problem.

This paper studies the extraction of Chinese new words. A web crawler collects micro-blogs, news, and similar text to build a corpus, and the text is cut from left to right into substrings, which are used for model training. The substrings are then filtered to form candidate words, and finally new words are identified according to the corresponding indicators. The method is an unsupervised new word discovery algorithm based on statistical information such as mutual information, information entropy, and word frequency, combined with semantic information to further improve it and refine a new index for selecting new words. Experiments show that the proposed algorithm improves accuracy, yielding an unsupervised, high-accuracy algorithm that can be applied to smaller-scale data and meets actual production needs. Building on the improved new word discovery algorithm, a public opinion system is also designed and implemented, and its system architecture has been improved to support real-time retrieval and analysis of massive data. This paper has 22 pictures, 6 tables, and 32 references.
Keywords/Search Tags:Mutual Information, Branch Entropy, Internal Cohesion, Boundary Freedom
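
The statistical part of the pipeline described in the abstract (cut the text into substrings, then score candidates by internal cohesion and boundary freedom) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract does not give its formulas, so the sketch assumes a common formulation in which cohesion is the minimum pointwise mutual information over all two-way splits of a candidate and boundary freedom is the smaller of the left and right branch entropies. All function names and the threshold values are hypothetical, probabilities are normalized simply by corpus length, and the semantic filtering step is not modeled.

```python
import math
from collections import Counter


def ngram_counts(text, max_len):
    """Count every substring of length 1..max_len, scanning left to right."""
    counts = Counter()
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            if i + n <= len(text):
                counts[text[i:i + n]] += 1
    return counts


def cohesion(word, counts, total):
    """Minimum pointwise mutual information (bits) over all two-way splits.

    Assumption: probabilities are estimated as count / total characters.
    """
    p_word = counts[word] / total
    scores = []
    for k in range(1, len(word)):
        p_left = counts[word[:k]] / total
        p_right = counts[word[k:]] / total
        scores.append(math.log2(p_word / (p_left * p_right)))
    return min(scores)


def entropy(neighbor_counts):
    """Shannon entropy (bits) of a neighbor-character distribution."""
    total = sum(neighbor_counts.values())
    return sum(c / total * math.log2(total / c) for c in neighbor_counts.values())


def boundary_freedom(word, text):
    """Smaller of the left and right branch entropies of word in text."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)  # allow overlapping matches
    if not left or not right:
        return 0.0
    return min(entropy(left), entropy(right))


def discover(text, max_len=4, min_freq=2, min_cohesion=1.0, min_freedom=0.5):
    """Return candidates passing frequency, cohesion, and freedom thresholds.

    The thresholds are illustrative placeholders, not values from the thesis.
    """
    counts = ngram_counts(text, max_len)
    total = len(text)
    found = []
    for word, freq in counts.items():
        if len(word) < 2 or freq < min_freq:
            continue
        if (cohesion(word, counts, total) >= min_cohesion
                and boundary_freedom(word, text) >= min_freedom):
            found.append(word)
    return found
```

A call such as `discover(corpus_text)` returns the substrings that pass all three filters. High-frequency garbage strings can still pass purely statistical thresholds like these, which is the gap the abstract proposes to close with semantic information.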