
Combining Rules and Statistics for New Word Detection in Microblog Text

Posted on: 2018-06-27  Degree: Master  Type: Thesis
Country: China  Candidate: S S Zhou  Full Text: PDF
GTID: 2348330512479317  Subject: Computer Science and Technology
Abstract/Summary:
The emergence of new words in microblogs poses a major challenge for word segmentation of short texts. This paper analyzes new word detection from several angles. The formation rules of microblog new words are highly complex and widely dispersed, and the traditional C/NC-value method has several problems when applied to them, including relatively low accuracy in identifying new word boundaries and low detection accuracy for low-frequency new words. To solve these problems, a method is proposed that integrates heuristic rules, a modified C/NC-value method, and two statistical models: Conditional Random Field (CRF) and Support Vector Machine (SVM). The main contribution of this paper is a new method for new word detection in microblog texts, designed and implemented as a two-stage method that integrates rules and statistics. Because the integrated algorithms complement each other, the recognition accuracy and adaptability of microblog new word detection are improved and the labor cost is reduced.

The innovation of this paper is mainly embodied in the following three aspects:
(1) Manually designed heuristic rules are used to classify and summarize microblog new words; the rules draw on part-of-speech (POS) tags, character types, and symbols.
(2) The traditional C/NC-value method is modified by merging word frequency, branch entropy, mutual information, and other statistical features to reconstruct the objective function, which improves the ability of statistical methods to detect new words.
(3) The modified C/NC-value algorithm is integrated with the CRF and SVM statistical models respectively, which improves the accuracy of identified new word boundaries and the detection accuracy of low-frequency new words.

The experimental results show that the proposed method effectively improves the accuracy of microblog new word detection compared with traditional methods. Furthermore, it also improves the accuracy of word segmentation.
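As an illustration of two of the statistical features named in aspect (2), the sketch below computes mutual information and branch entropy for a candidate string using their common textbook definitions. It is not the objective function of the thesis, which the abstract does not give. The function names, the character-level n-gram counting, and the normalization by total character count are all assumptions made for this example.

```python
import math
from collections import Counter


def ngram_counts(texts, max_len):
    """Count every character n-gram of length 1..max_len in the corpus."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def pmi(candidate, counts, total_chars):
    """Minimum pointwise mutual information over all two-way splits.

    Probabilities are approximated as count / total_chars (an assumption).
    `counts` must cover n-grams up to len(candidate).
    """
    if len(candidate) < 2:
        raise ValueError("candidate must have at least two characters")
    if counts[candidate] == 0:
        return float("-inf")  # unseen candidate: no cohesion evidence
    p_whole = counts[candidate] / total_chars
    scores = []
    for k in range(1, len(candidate)):
        p_left = counts[candidate[:k]] / total_chars
        p_right = counts[candidate[k:]] / total_chars
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return min(scores)


def entropy(counter):
    """Shannon entropy (in bits) of a frequency table."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def branch_entropy(candidate, texts):
    """Return (left, right) entropy of the characters adjacent to the candidate."""
    left, right = Counter(), Counter()
    n = len(candidate)
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1
            if start + n < len(text):
                right[text[start + n]] += 1
            start = text.find(candidate, start + 1)
    return entropy(left), entropy(right)


if __name__ == "__main__":
    corpus = ["xaby", "zabw", "xabw"]  # toy corpus, hypothetical data
    table = ngram_counts(corpus, max_len=2)
    total = sum(len(t) for t in corpus)
    print(pmi("ab", table, total), branch_entropy("ab", corpus))
```

High mutual information suggests the characters of a candidate cohere as a unit, while high branch entropy on both sides suggests the candidate appears in varied contexts and so has plausible word boundaries. A combined score would have to weight these against word frequency; how the thesis does so is not stated in the abstract.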
Keywords/Search Tags: Microblog new words, formation rules, statistical features, C/NC-value method, Conditional Random Field (CRF) model, Support Vector Machine (SVM)