Research On New Word Detection From Microblog Data

Posted on: 2014-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Su
Full Text: PDF
GTID: 2268330422450630
Subject: Computer Science and Technology
Abstract/Summary:
The Internet is profoundly changing the way people live, learn, and work, and above all the way they communicate and express themselves. The continual emergence of new words is evidence of this. Microblogs, among the most popular social network applications of the Web 2.0 era, have become the main place where new words are created and spread. New Word Detection is a fundamental task in Chinese information processing, and its results directly affect the performance of Chinese word segmentation and other Chinese information processing tasks. Although much research has been carried out and some results have been achieved, several problems remain. First, New Word Detection does not perform well enough in practical applications and often requires human intervention. Second, few studies of New Word Detection are based on Microblog and Internet corpora. Third, there is a lack of studies that analyze new words in order to guide their application. Motivated by these problems, we conducted research on New Word Detection from Microblog data, comprising the following work.

Firstly, rules and statistical methods were combined for New Word Detection. We analyzed five classical statistical measures for Microblog new word extraction and pointed out the problems of existing methods. On this basis, we proposed a new statistical measure, weighted relative Branch entropy, which is based on Branch entropy. Experimental results show that the proposed measure outperforms the five classical statistical measures. We then classified the new words extracted from Microblog into seven categories according to their sources and explored the reasons why new words arise.

Secondly, we combined New Word Detection with Microblog word segmentation. In word segmentation, we employed auxiliary rules designed around the characteristics of Microblog text. Because labeled training data for Microblog text is lacking, we used Kullback-Leibler divergence to select labeled out-of-domain data as training data. A self-training method was then used to exploit unlabeled Microblog text efficiently. Because Microblog text contains so many new words, we added the proposed statistical measure to the feature set used to train the word segmentation model. In New Word Detection, we treated fragments of high and low confidence level as candidate strings and detected new words from these strings. The detected new words were then added to the dictionary and formed additional dictionary features for training the word segmentation model. Experimental results demonstrate that combining New Word Detection with word segmentation improves the performance of both.

Finally, we analyzed the life cycle of new words on Microblog. We first fitted the frequency of new words with the probability distribution function of a logarithmic function and analyzed the temporal distribution patterns of new words. Most new words disappeared soon after being created; only a small fraction survived and gradually developed into common words. We then applied a frequent itemset mining algorithm to extract frequent words and analyzed the spatial distribution patterns of new words.
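
The abstract does not define weighted relative Branch entropy, so it is not reproduced here. As background, the plain Branch entropy it builds on is commonly computed as the entropy of the characters that appear immediately to the left and right of a candidate string: a candidate that occurs in many different contexts is more likely to be a word. The following is a minimal sketch under that assumption. The function name `branch_entropy`, the `max_len` parameter, the boundary sentinels, and the toy corpus are illustrative and do not come from the thesis.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(corpus, max_len=4):
    """Return {candidate: (left_entropy, right_entropy)} in bits for every
    character n-gram of length 2..max_len in corpus (a list of strings)."""
    left = defaultdict(Counter)   # candidate -> counts of the preceding character
    right = defaultdict(Counter)  # candidate -> counts of the following character
    for text in corpus:
        for n in range(2, max_len + 1):
            for i in range(len(text) - n + 1):
                cand = text[i:i + n]
                # Sentinels (an assumption) stand in for neighbors at text boundaries.
                left[cand][text[i - 1] if i > 0 else "<s>"] += 1
                right[cand][text[i + n] if i + n < len(text) else "</s>"] += 1

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return {cand: (entropy(left[cand]), entropy(right[cand])) for cand in left}


# Toy corpus (hypothetical): the candidate "给力" appears in three different contexts.
scores = branch_entropy(["真给力啊", "很给力吧", "太给力了"], max_len=2)
print(scores["给力"])  # both entropies are log2(3), about 1.58 bits
print(scores["力啊"])  # a single left and right context, so both entropies are zero
```

In this toy corpus the candidate "给力" has three distinct neighbors on each side and scores high on both sides, while "力啊" always follows the same character and scores zero. A practical extractor would typically also apply a frequency threshold and combine the two values, for example by taking their minimum.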
Keywords: new word detection, statistical measure, word segmentation, life cycle