
Research And Implementation Of Microblog Data Oriented Named Entity Recognition

Posted on: 2014-04-19; Degree: Master; Type: Thesis
Country: China; Candidate: J Xun; Full Text: PDF
GTID: 2268330425991536; Subject: Computer software and theory
Abstract/Summary:
With the growing popularity of microblogs, microblogging has become a new social medium for releasing and propagating information. By the end of December 2012, Sina microblog had more than 500 million registered users. The volume of microblog data has grown accordingly, and it contains much information that is valuable to organizations and individuals. As a result, information extraction, analysis, and natural language processing on microblog data have become a research hotspot. Named entity recognition is particularly important as a basic task underlying this research, but named entity recognition on microblogs is not yet mature. Traditional named entity recognition methods cannot obtain satisfactory results on microblog data, which hinders follow-up work.
This paper mainly studies named entity recognition on microblog data. The characteristics of microblog data cause traditional methods to fail, for four fundamental reasons. First, each microblog post is very short and carries limited information, which makes it difficult to bring much relevant context together. Second, the data is very noisy while the noise immunity of models is low, so models tend to overfit during training. Third, there are no sufficient, complete training resources for microblog data, which may leave models under-trained; building such resources also requires much manual work. Fourth, microblog information is updated quickly and models adapt poorly, so models tend to underfit during training. Experiments show that the F1 measure of named entity recognition by conventional methods drops by almost 20 percent on microblog data.
To solve the problems listed above, the paper implements named entity recognition on microblog data by combining several techniques. On microblog data, the precision, recall, and F1 measure of the results reach 83.7%, 79.8%, and 81.8% respectively, a substantial improvement over conventional methods.
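For readers unfamiliar with the metric, the F1 measure is conventionally the harmonic mean of precision and recall. The short sketch below applies that standard formula to the reported precision and recall; the function name `f1_score` is an illustrative choice, not something taken from the thesis. The result comes out near 81.7, close to the reported 81.8%. The abstract does not say how its figure was computed, so the small gap may simply reflect rounding of the reported precision and recall or averaging over entity types, which is an assumption here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f1_score(83.7, 79.8)
print(round(f1, 1))  # prints 81.7
```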
The paper overcomes the disadvantages of conventional methods in the following ways. First, it builds a semi-supervised named entity recognition framework, which addresses the lack of training data by repeatedly retraining the model on the training data together with its own new predictions. The framework can therefore also adapt to a frequently updated environment. Second, the framework combines a KNN classifier with a CRFs model, making the best use of information both in the global microblog sequence and in each individual microblog post, which improves the precision and recall of the results. Third, a data normalization module is added to the framework to remove noise and standardize informal text. In addition, an entity uniformization module is added to optimize the named entity recognition results. This module not only revises the recognition results but also provides a coreference set for further research.
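The retraining loop described above can be illustrated with a minimal self-training sketch. This is not the thesis's implementation: it uses only a toy KNN over numeric feature vectors and leaves out the CRFs model, the normalization module, and the uniformization module. The function names, the confidence threshold, and the round limit are all hypothetical choices made for illustration.

```python
from collections import Counter


def knn_predict(labelled, x, k=3):
    """Label x by majority vote of its k nearest labelled points.

    Returns (label, confidence), where confidence is the vote share.
    """
    nearest = sorted(
        labelled,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


def self_train(labelled, unlabelled, threshold=0.9, max_rounds=5, k=3):
    """Grow the labelled set with confident predictions on unlabelled data.

    threshold and max_rounds are assumed values, not taken from the thesis.
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            label, confidence = knn_predict(labelled, x, k)
            if confidence >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from; stop early
        labelled.extend(confident)  # "retrain" on the enlarged set
        pool = remaining
    return labelled, pool
```

In this sketch, points that the current model labels with high agreement are added to the training set, and ambiguous points stay in the unlabelled pool for later rounds. A real system along the lines of the abstract would replace the toy KNN with the combined KNN and CRFs labeller and would operate on token sequences rather than on fixed vectors.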
Keywords/Search Tags: Named entity recognition, Microblog, Semi-supervised learning, KNN classifier, CRFs model