
Two-phase Strategy Of Chinese Named Entity Recognition Based On CRFs In Micro-blog

Posted on: 2016-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: F Li
GTID: 2308330470973140
Subject: Computer system architecture
Abstract/Summary:
In recent years, with the rapid development of micro-blog services, micro-blogs have become an important platform for people to exchange information and share resources. Meanwhile, the volume of micro-blog text has grown explosively, which makes extracting useful information from it a major challenge. This mass of text also provides new material for research in information processing, and several active research areas have emerged around it, such as Information Retrieval (IR), Information Extraction (IE), public opinion analysis, and knowledge graph construction. As a basic problem common to these areas, Named Entity Recognition (NER) is attracting growing interest from researchers. Unlike normal text (e.g. news and articles), micro-blog text is short, informal, and written in mixed language studded with emotional expressions, so traditional NER methods designed for normal text do not adapt well to it. Text noise and inappropriate feature selection, in particular, degrade recognition quality and add system overhead.

To bring micro-blog text closer to standard written language, this paper first analyzes micro-blog text in detail and proposes a series of targeted regularization (normalization) methods for preprocessing. Second, to make feature selection more relevant while reducing the time cost of redundant features, it proposes a two-stage NER strategy for the preprocessed text, in which the NER task is divided into two sub-tasks: Named Entity Identification (NEI) and Named Entity Classification (NEC). In the first stage, a Conditional Random Fields (CRFs) model identifies the named entities (NEs) in the micro-blog text without assigning categories. Then, after post-processing, a second CRFs model determines the type of each identified entity, taking the boundary identification result of the first sub-task as an input feature. Different features are selected in each stage, including location features, word features, Part Of Speech (POS) features, spelling features, head/tail word features, boundary word features, and word dictionary features, and the influence on NER of the various features selected in the first stage is evaluated. Because each stage uses a more focused feature set and fewer tags, the two-stage strategy based on CRFs can improve NER performance while effectively reducing training time. In addition, feature expansion is used in post-processing to further improve performance.

Comparison experiments verify the effect of different feature selections on NER. Among them, adding a common NE dictionary to the word-based baseline model increases the F-score by 11.43%. The experimental results of the two-stage strategy show that the method is feasible for Chinese NER, achieving an F-score of 81.53% with less training time.
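The data flow of the two-stage strategy described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the BIO tag scheme, the `PER`/`LOC` type names, the function names, and the exact feature keys are all assumptions made for illustration, and the CRFs training itself is left out. The sketch only shows how typed tags reduce to category-free boundary tags for the first stage (NEI), and how the first-stage boundary result can be fed to the second stage (NEC) as an input feature.

```python
from typing import Dict, List, Tuple


def to_boundary_tags(typed_tags: List[str]) -> List[str]:
    """Stage-1 view: drop the category from typed BIO tags ('B-PER' -> 'B')."""
    return [tag.split("-", 1)[0] for tag in typed_tags]


def extract_spans(boundary_tags: List[str]) -> List[Tuple[int, int]]:
    """Collect entity spans as (start, end) pairs, end exclusive."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(boundary_tags):
        if tag == "B":
            if start is not None:      # a new entity closes the previous one
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # stray 'I': treat it as an entity start
                start = i
        else:                          # 'O' closes any open entity
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(boundary_tags)))
    return spans


def stage2_features(tokens: List[str], pos_tags: List[str],
                    boundary_tags: List[str], i: int) -> Dict[str, str]:
    """Stage-2 features for token i; the stage-1 boundary tag is one of them."""
    last = len(tokens) - 1
    return {
        "word": tokens[i],
        "pos": pos_tags[i],
        "boundary": boundary_tags[i],  # stage-1 output reused as a feature
        "prev_boundary": boundary_tags[i - 1] if i > 0 else "BOS",
        "next_boundary": boundary_tags[i + 1] if i < last else "EOS",
    }


# Hypothetical example sentence, segmented by character.
tokens = ["李", "明", "在", "北", "京"]
pos_tags = ["nr", "nr", "p", "ns", "ns"]           # assumed POS labels
typed_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]

boundary_tags = to_boundary_tags(typed_tags)       # what stage 1 would predict
spans = extract_spans(boundary_tags)               # entities passed to stage 2
features = [stage2_features(tokens, pos_tags, boundary_tags, i)
            for i in range(len(tokens))]
```

In this arrangement the first model only has to choose among three tags, and the second model only has to assign a type to spans that have already been found, which is consistent with the abstract's point that fewer tags per stage reduce training time.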
Keywords/Search Tags:NER, two-phase, Micro-blog, CRFs