
Research On Words Segmentation Algorithm And Word Variant Extraction Method Of Message Variety Based

Posted on: 2018-08-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Wei
Full Text: PDF
GTID: 2348330518496495
Subject: Electronic Science and Technology
Abstract/Summary:
With the development of mobile communication networks, the growth and spread of spam messages plagues mobile phone users and poses a great challenge to spam governance. In spam message governance, reasonable segmentation of SMS text is the prerequisite for recognition, classification and interception. Illegal SMS messages are often ungrammatical and highly variable, which degrades segmentation accuracy. This paper combines a revised pointwise mutual information (PMI) measure with a proposed cross-skip-bi-grams model to address the problem. In addition, optimal segmentation, segmentation merging, incremental training and feedback training are proposed to improve the usability and robustness of the segmentation method. Experimental results show that the method improves segmentation accuracy on illegal SMS messages.

To address the difficulty of extracting characteristic words from documents, this paper introduces word embedding and the semantic vector space as a way of converting words into a form that a computer can process without losing their semantics. It then introduces the word vector representation used in neural network language models, the curse of dimensionality that such models face, and how word2vec obtains continuous-space word vectors efficiently. In addition, this paper describes how to use the improved vector space, together with an incremental method of constructing it, to identify SMS variant words.
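As a rough illustration of how a PMI score can drive segmentation, the sketch below computes standard PMI, log2(p(xy) / (p(x) p(y))), for adjacent characters and cuts wherever the score falls below a threshold. It is not the thesis's revised PMI or its cross-skip-bi-grams model, neither of which is specified in the abstract. The function names `pmi_table` and `segment` and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter

def pmi_table(corpus):
    """Standard PMI for adjacent character pairs (assumption: not the revised PMI)."""
    unigrams = Counter()
    bigrams = Counter()
    for text in corpus:
        unigrams.update(text)                # count single characters
        bigrams.update(zip(text, text[1:]))  # count adjacent character pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (a, b): math.log2(
            (count / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni))
        )
        for (a, b), count in bigrams.items()
    }

def segment(text, pmi, threshold=0.0):
    """Join adjacent characters whose PMI reaches `threshold`; cut otherwise.

    Pairs never seen in the training corpus are treated as cut points.
    """
    if not text:
        return []
    words = [text[0]]
    for a, b in zip(text, text[1:]):
        if pmi.get((a, b), float("-inf")) >= threshold:
            words[-1] += b    # strong association: extend the current word
        else:
            words.append(b)   # weak or unseen association: start a new word
    return words
```

Recomputing `pmi_table` over an enlarged corpus is the simplest stand-in for incremental training. The thesis's own incremental and feedback procedures are not described here and may differ.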
Finally, experiments locate the word vectors clustered in the vicinity of a prototype word, and word variants are then obtained by applying filtering rules. This paper also designs the overall architecture, and the architecture of the main functional modules, of a management and intelligence analysis platform for massive SMS data, and introduces the key techniques used in the major function modules. The SMS storage and timed storage module, statistical word segmentation module, index module, vector space module and similar variant query module are designed and implemented.
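The variant search summarized above, which finds vectors near a prototype word and then filters them, might look roughly like the following sketch. It assumes word vectors are already available as a plain dictionary, for example exported from a trained word2vec model. The names `cosine` and `variant_candidates`, the `min_sim` cutoff, and the `keep` predicate are hypothetical. The abstract does not state the actual filtering rules, so `keep` is left as a caller-supplied placeholder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def variant_candidates(prototype, vectors, min_sim=0.8,
                       keep=lambda proto, word: True):
    """Return (word, similarity) pairs near `prototype`, most similar first.

    `keep` stands in for the thesis's unspecified filtering rules
    (assumption: any predicate over the prototype and a candidate word).
    """
    target = vectors[prototype]
    scored = (
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word != prototype
    )
    return sorted(
        ((w, s) for w, s in scored if s >= min_sim and keep(prototype, w)),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

A caller could pass, for instance, `keep=lambda proto, word: len(word) == len(proto)` to retain only candidates of the same length as the prototype. That rule is purely illustrative.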
Keywords/Search Tags: natural language processing, Chinese word segmentation, pointwise mutual information, word variant