Network Neologism Recognition Based On Social Media Text

Posted on: 2019-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: J Shi
Full Text: PDF
GTID: 2428330548467232
Subject: Computer technology
Abstract/Summary:
As network media continues to develop, human language, the medium through which information travels, keeps evolving. Hot topics in the Internet media era readily give rise to neologisms. These words often name new things or phenomena, and some carry strong emotional coloring, which makes them significant for public opinion analysis, for tracking social trends, and for the study of language development itself. As the country develops rapidly and social conditions differ from one stage to the next, popular trends shift, the focus of public attention is constantly updated, and network neologisms replace one another quickly. Research on acquiring network neologisms therefore has to keep pace with the times. At present, manual identification is the most accurate way to recognize new words, but it also requires the most human labor. Although many computer-based methods have appeared, few studies address network neologisms specifically, and extraction performance remains unsatisfactory. In view of this, this thesis carries out work in the following two aspects to improve the performance of the network neologism recognition model.

First, a candidate word extraction algorithm for network neologisms is proposed, based on the N-gram model together with improved mutual information and adjacency entropy, which can effectively identify potential network neologisms. Current word segmentation systems can already identify a small number of new-word candidates, but they often adopt relatively simple strategies and do not take full account of the characteristics of network neologisms, so their performance is limited. In this thesis, the text is preprocessed, the N-gram model is used to split it into word strings, and garbage strings are filtered out using rules, including per-word rules. A weighted measure combining the improved mutual information with adjacency entropy is then applied as a further denoising step to obtain the set of candidate network neologisms. Comparative experiments show that the algorithm performs well and effectively reduces the influence of high-frequency and low-frequency words in the corpus. The benefit of the candidate extraction algorithm is especially pronounced when the corpus is large.

Second, based on the characteristics of network corpora, three new features are proposed for neologism recognition. First, because a new word's time span and vitality bear on whether it may be included in dictionaries and adopted by the public, a time feature (TC) is introduced into the recognition task, and a stationarity analysis method for time series is used to estimate the stability of the data sequence. Next, a word vector feature (VC) is introduced: the CBOW model is used to train word vectors that reflect the content relevance and string similarity found in the network corpus. Finally, the media characteristics of the corpus text are quantified as a network influence feature (IC) that describes the new word. Experiments show that recognition accuracy improves after the new features are added to the recognition model, which demonstrates that the new features are effective for network neologism recognition.
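
The candidate extraction step described in the abstract (N-gram enumeration followed by mutual information and adjacency entropy filtering) can be illustrated with a minimal sketch. It uses the standard textbook definitions of pointwise mutual information and adjacency (branching) entropy, not the thesis's improved variants or its weighted combination, which the abstract does not specify. The function names and the thresholds `min_count`, `min_pmi`, and `min_entropy` are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def pmi(word, counts, total):
    """Lowest pointwise mutual information over all two-way splits of word.

    Probabilities are approximated as count / total for every n-gram length,
    which is a simplification assumed for this sketch.
    """
    p_word = counts[word] / total
    lowest = float("inf")
    for k in range(1, len(word)):
        p_left = counts[word[:k]] / total
        p_right = counts[word[k:]] / total
        lowest = min(lowest, math.log2(p_word / (p_left * p_right)))
    return lowest


def entropy(neighbors):
    """Shannon entropy (in bits) of a Counter of neighboring characters."""
    total = sum(neighbors.values())
    return sum((c / total) * math.log2(total / c) for c in neighbors.values())


def adjacency_entropy(word, text):
    """Smaller of the left and right neighbor entropies of word in text."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(entropy(left), entropy(right))


def extract_candidates(text, max_n=4, min_count=2, min_pmi=1.0, min_entropy=0.5):
    """Keep n-grams that are frequent, internally cohesive, and free at both ends.

    All thresholds are hypothetical defaults, not values from the thesis.
    """
    counts = ngram_counts(text, max_n)
    total = len(text)
    candidates = {}
    for word, count in counts.items():
        if len(word) < 2 or count < min_count:
            continue
        cohesion = pmi(word, counts, total)
        freedom = adjacency_entropy(word, text)
        if cohesion >= min_pmi and freedom >= min_entropy:
            candidates[word] = (cohesion, freedom)
    return candidates
```

On the toy string `"cabdeabfgabhiabj"`, the only repeated multi-character n-gram is `"ab"`, which appears four times with four distinct characters on each side, so `extract_candidates` keeps it with a mutual information of 2.0 bits and an adjacency entropy of 2.0 bits. Taking the minimum over splits and over the two sides is a conservative design choice: a string is kept only if its weakest internal boundary is still cohesive and its less varied side still occurs in diverse contexts.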
Keywords/Search Tags:N-gram, Neologism recognition, Word embedding, CBOW model, time series