
Research on Chinese Named Entity Recognition

Posted on: 2013-12-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H X Jiang
Full Text: PDF
GTID: 1228330374999505
Subject: Signal and Information Processing
Abstract/Summary:
Named Entity Recognition (NER) is the task of recognizing proper entities such as person names, location names, and organization names in natural-language text. It is a fundamental research task in Natural Language Processing (NLP). As an extension of the Chinese word segmentation task, Chinese NER is widely used in information extraction, information retrieval, information recommendation, machine translation, and other NLP applications, and it plays an increasingly important role in improving their performance. New requirements currently pose three main challenges for NER research: (1) NER is now deployed in diverse settings, from internet servers and PCs to mobile devices with limited hardware capabilities, where it must meet performance requirements while keeping model complexity low; (2) with the rapid growth of network data, new named entities (NEs) are created rapidly, so NER needs to exploit large-scale data sets in order to handle new NEs effectively; (3) NEs include not only person names, location names, and organization names, but also publishing entities (film names, book names, music names), mercantile entities (brands, product names, product versions), and so on.

Focusing on these challenges, our work makes the following major contributions to Chinese NER:

(1) To overcome the hardware limitations of mobile devices while meeting performance requirements, we present a knowledge-combined Second-order Hidden Markov Model (So-HMM) and an efficient decoding algorithm for NER on mobile devices. We then build a mobile application recommendation system based on NER over short messages. The experimental results show that NER performance is significantly improved by expanding the language information and exploiting external knowledge, and that model complexity is significantly reduced by a novel second-order backward A* decoding algorithm. The model achieves satisfactory performance on hardware-limited mobile devices.

(2) We build an NE resource database covering multiple entity types from a large-scale Web data set. Starting from a small labeled corpus, an active learning (AL) strategy is used to train CRF-based NE taggers, and the taggers are then used to extract more named entities from real-time Web data to populate the NE resource database. Because different entity types are distributed differently on the internet, we divide the entity types into two categories and build a separate NER model, based on the NE resource database, for each category. The experimental results show that a high-quality NE resource database can effectively compensate for the insufficient NE patterns available in statistical model training. At the same time, the improved AL utility function significantly reduces the workload of manual data annotation.

(3) We use the NER system based on the NE resource database to assist the analysis of web intentions in an intention analysis system built on the learning-to-rank method. The experimental results show that NEs have stronger meaning integrity and specificity than keywords, and can therefore describe the core content of a web page better. The NER system we built contributes positively to the intention analysis system.
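For readers unfamiliar with second-order HMM tagging, the following is a minimal sketch of what decoding under such a model involves. It uses plain Viterbi search over tag pairs, not the second-order backward A* decoder described in the abstract, whose details are not given here. The function name, the probability tables, the floor value, and the toy tag set are all hypothetical and chosen only for illustration.

```python
import math

def viterbi_second_order(obs, tags, trans, emit, start="<s>"):
    """Most probable tag sequence under a second-order (trigram) HMM.

    trans[(t2, t1, t)] = P(t | t2, t1)   (assumed table layout)
    emit[(t, w)]       = P(w | t)        (assumed table layout)
    Missing entries fall back to a small floor probability.
    """
    floor = 1e-12

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[(t_prev, t_cur)] = best log-probability of a path ending in that tag pair
    best = {(start, start): 0.0}
    back = []  # one backpointer table per observation
    for w in obs:
        new_best, ptr = {}, {}
        for (t2, t1), score in best.items():
            for t in tags:
                s = score + logp(trans, (t2, t1, t)) + logp(emit, (t, w))
                if s > new_best.get((t1, t), -math.inf):
                    new_best[(t1, t)] = s
                    ptr[(t1, t)] = t2  # remember the tag two steps back
        best = new_best
        back.append(ptr)

    # Follow backpointers from the best final tag pair.
    state = max(best, key=best.get)
    path = []
    for ptr in reversed(back):
        t1, t = state
        path.append(t)
        state = (ptr[state], t1)
    return path[::-1]

# Toy example (hypothetical probabilities): a two-character person name plus a verb.
toy_tags = ["B-PER", "I-PER", "O"]
toy_trans = {
    ("<s>", "<s>", "B-PER"): 0.5,
    ("<s>", "<s>", "O"): 0.5,
    ("<s>", "B-PER", "I-PER"): 0.9,
    ("B-PER", "I-PER", "O"): 0.9,
}
toy_emit = {("B-PER", "张"): 0.9, ("I-PER", "伟"): 0.8, ("O", "说"): 0.9}
toy_result = viterbi_second_order(["张", "伟", "说"], toy_tags, toy_trans, toy_emit)
print(toy_result)  # ['B-PER', 'I-PER', 'O']
```

Tracking tag pairs rather than single tags makes each step cost grow with the cube of the tag-set size, which is one reason a more economical decoder matters on hardware-limited devices.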
Keywords/Search Tags: Named Entity Recognition, Second-order Hidden Markov Model, Conditional Random Field, Active Learning, Named Entity resource database, Intention Analysis