
Research On A Two-Stage Method For Chinese Named Entity Recognition

Posted on: 2009-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: N He
Full Text: PDF
GTID: 2178360245970223
Subject: Signal and Information Processing
Abstract/Summary:
Named Entity Recognition (NER) is a basic and important task in Information Extraction, and it has long been one of the central issues in natural language processing. The Message Understanding Conference (MUC), sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA), included NER as one of its sub-tasks from 1998. At the same time, Named Entities (NEs) were officially divided into three groups for the first time: 1. entities (organization names, person names, and location names); 2. temporal expressions (date and time); 3. numeric expressions (monetary values and percentages). The subsequent Automatic Content Extraction (ACE) evaluation added new elements to NE, such as entity mentions and relations between entities.

Since 2003, the Special Interest Group on Chinese Language Processing (SIGHAN) of the Association for Computational Linguistics (ACL) has organized bakeoffs on Chinese word segmentation and named entity recognition. Four bakeoffs had been held by 2007. The first two covered only Chinese word segmentation, while the last two added Chinese named entity recognition. Under the SIGHAN definition, NEs comprise person names, location names, and organization names, plus geopolitical names for some corpora. Participants are required to tag the extent and category of each NE in unsegmented text.

Following the NE definition and annotation guidelines of the SIGHAN bakeoff, this thesis presents a two-stage method for Chinese NER, consisting of boundary detection followed by category identification. Because the two stages have different characteristics, a different machine learning algorithm is used for each: Conditional Random Fields (CRFs) for boundary detection and the Maximum Entropy Model (MaxEnt) for category identification. Compared with the traditional one-stage method, the two-stage method greatly reduces the cost of training the CRF model, while the overall performance remains almost the same.
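To make the claimed saving more concrete, the sketch below counts the labels a CRF would have to handle in each setting. It is only an illustration under stated assumptions: the abstract does not give the one-stage label set, so the code assumes B/I/E boundary labels crossed with three categories (PER, LOC, ORG), and the function names are hypothetical. A first-order linear-chain CRF scores every pair of adjacent labels at each position, so a smaller label set generally means cheaper training.

```python
# Hypothetical illustration: label-set sizes for one-stage vs. two-stage tagging.
# Assumption: B/I/E boundary labels and three categories; the thesis's actual
# one-stage label set is not stated in the abstract.

BOUNDARY_LABELS = ["B", "I", "E"]  # positions inside an entity
OUTSIDE = "O"                      # position outside any entity


def one_stage_labels(categories):
    """Cross every boundary label with every category, plus a single O."""
    return [f"{b}-{c}" for c in categories for b in BOUNDARY_LABELS] + [OUTSIDE]


def two_stage_labels():
    """Stage one tags boundaries only; categories are assigned in stage two."""
    return BOUNDARY_LABELS + [OUTSIDE]


categories = ["PER", "LOC", "ORG"]
one = one_stage_labels(categories)
two = two_stage_labels()

# Number of labels, and number of adjacent-label pairs scored per position.
print(len(one), len(one) ** 2)
print(len(two), len(two) ** 2)
```

Under these assumptions the one-stage tagger has 10 labels (100 label pairs) and the boundary-only tagger has 4 labels (16 label pairs). The real gap in the thesis may differ, since the feature templates of the two systems are not described here.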
This reduction in training cost is especially meaningful for CRFs, which are very expensive to train.

The two-stage procedure for Chinese NER is as follows. First, boundary detection is performed. Because this is a sequence tagging problem, CRFs are well suited to it: they can integrate a large number of features, and they do not suffer from the label bias problem that affects directed graphical models. Second, MaxEnt is used to identify the NE category, because it follows the principle that, when only partial information about the possible outcomes is available, the probabilities should be chosen so as to maximize the uncertainty about the missing information.

The boundary detection experiments have two highlights: 1. six label sets are compared comprehensively, and the results show that the BIOE label set, which marks both the beginning and the end of an NE, performs best; 2. different window sizes in the feature templates are compared, and the conclusion is that the window should be neither too large nor too small. A larger window brings in more features, but it also increases computational complexity and causes data sparseness. A smaller window loses important context information.

For category identification, the features are divided into two groups: local features and global features. Local features relate only to the entity itself, while global features take the context of the NE into account. The experimental results show that good performance can be reached using local features alone. The reason is that confusion between different kinds of NE is rare, so the information in the NE itself is sufficient to identify its category.

Once the two-stage results are obtained, the one-stage and two-stage methods are compared.
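The boundary-detection setup described above (BIOE labels and a context window in the feature templates) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names, the `<PAD>` symbol, the example sentence, and the choice to tag a single-character entity as "B" are all assumptions, and the thesis's actual feature templates are not given in the abstract.

```python
from typing import List, Tuple


def bioe_tags(n_chars: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Tag each character position with B, I, E or O (boundaries only).

    spans are (start, end) character offsets, end exclusive, non-overlapping.
    Assumption: a single-character entity is tagged "B".
    """
    tags = ["O"] * n_chars
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end - 1):
            tags[i] = "I"
        if end - start > 1:
            tags[end - 1] = "E"
    return tags


def window_features(chars: str, i: int, window: int) -> List[str]:
    """Character unigram features at offsets -window..+window around position i."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        ch = chars[j] if 0 <= j < len(chars) else "<PAD>"
        feats.append(f"C[{offset}]={ch}")
    return feats


# Hypothetical example: "Zhang San works at Peking University".
sentence = "张三在北京大学工作"
spans = [(0, 2), (3, 7)]  # 张三 (person), 北京大学 (organization)

tags = bioe_tags(len(sentence), spans)
print(list(zip(sentence, tags)))
print(window_features(sentence, 0, 1))
print(len(window_features(sentence, 0, 3)))
```

Each position yields `2 * window + 1` unigram features in this sketch, so widening the window adds features at every position, which is the trade-off between context and sparseness that the abstract describes.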
Compared with the one-stage method, the two-stage method reduces time and memory consumption by roughly 80%, while the overall performance remains almost the same. Both methods achieve a competitive overall F-measure, almost as good as the top result in the bakeoff.

One-stage training takes more than 20 hours, whereas 3.5 hours is enough for the two-stage method. The one-stage model has about 100 million features and requires 12 GB of memory, while the two-stage model involves only 6 million features and reduces memory usage to 3.2 GB.

Finally, the advantage of the two-stage method is proved theoretically, and some comments on future work are made.
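The "roughly 80%" figure can be checked against the numbers quoted above. The short calculation below uses only those reported values; since training time is given as "more than 20 hours" and the feature count as "about 100 million", the results are approximate, and the `reduction` helper is a hypothetical name.

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# (one-stage, two-stage) values as reported in the abstract.
figures = {
    "training time (hours)": (20.0, 3.5),
    "memory (GB)": (12.0, 3.2),
    "features (millions)": (100.0, 6.0),
}

for name, (one_stage, two_stage) in figures.items():
    print(f"{name}: {reduction(one_stage, two_stage):.1%} reduction")
```

With these inputs, training time falls by about 82.5% and memory by about 73.3%, which is consistent with the rough 80% quoted, while the feature count falls by about 94%.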
Keywords/Search Tags: Chinese named entity recognition, Conditional Random Fields, Maximum Entropy Model, two-stage