
Chinese Named Entity Recognition Based on Statistical Machine Learning

Posted on: 2005-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Meng
Full Text: PDF
GTID: 2168360125951384
Subject: Control theory and control engineering
Abstract/Summary:
Named Entity Recognition (NER) classifies every word in a document into one of a set of predefined categories. In the taxonomy of computational linguistics tasks, it falls under "information extraction", which extracts specific kinds of information from documents. In addition, the result of NER is decisive for the precision of subsequent segmentation, tagging, and parsing. In short, the research and application of NER are of great theoretical and practical significance.

The named entity (NE) task is to identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages in text. Two approaches have typically been used for this task. One is statistics-based NER, which automatically extracts NE information from real context and uses it to train the system. The other is rule-based NER, which uses general regular expressions written by linguistics experts. This thesis adopts a hybrid strategy based on both statistics and rules: NE word information is extracted from corpora, and substantial corpora are used to train the model. The method has been implemented successfully.

The thesis is arranged as follows:

1. It provides a method for handling two kinds of Chinese character encoding, covering conversion between the two encoding systems and text in which several Chinese encodings are mixed. This is the foundation for the subsequent pre-processing and NER.
2. Through a thorough analysis of all sorts of numerals, it puts forward a solution that identifies non-Chinese symbols and numerals. After sentence breaking, it first identifies non-Chinese symbols in the text such as monetary amounts and percentages, then the remaining non-Chinese symbols, and finally Chinese numerals.
3. It provides a method based on evaluation functions. Statistical methods are used to gather context information from a large corpus, and evaluation functions are then applied to mark all possible Chinese personal names, place names, and translated foreign names. After this, dynamic programming is used to find the most likely positions of these names.
4. It provides a decision tree approach, based on a Chinese treebank, for identifying NEs. A self-learning mechanism is integrated into the model and consists of two steps: automatic extraction of POS tag sequences and their context information from the corpora, and tree training based on the ID3 algorithm.
5. It provides a method for identifying Chinese organization names by template matching, based on a detailed analysis of how organization names are composed.
6. It presents two systems that use the hybrid strategy described in the thesis and gives real examples to explain how they work.

The experimental results indicate that the method achieves satisfactory accuracy. The work covers all of the predefined NE categories and also addresses the problems that arise in pre-processing Chinese text. Its research and application are of considerable theoretical and practical significance.
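Item 4 of the abstract names ID3 as the tree-training algorithm but gives no further detail. As a rough illustration only, the sketch below shows the standard ID3 split criterion (information gain over categorical features). The feature names `prev_pos` and `next_pos`, the labels `NE` and `O`, and the four toy rows are hypothetical and are not taken from the thesis; the actual features and corpus used by the author are not described here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """ID3 picks the feature with the highest information gain at each node."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: POS tags of the neighbouring words for four candidates.
rows = [
    {"prev_pos": "v", "next_pos": "n"},
    {"prev_pos": "v", "next_pos": "v"},
    {"prev_pos": "p", "next_pos": "n"},
    {"prev_pos": "p", "next_pos": "v"},
]
labels = ["NE", "NE", "O", "O"]  # "NE" = named entity, "O" = other

print(best_feature(rows, labels, ["prev_pos", "next_pos"]))
```

In this toy set, splitting on `prev_pos` separates the two classes completely while `next_pos` does not separate them at all, so ID3 would choose `prev_pos` for the root node. A full ID3 implementation would recurse on each subset until the labels are pure or no features remain.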
Keywords/Search Tags: named entity recognition, statistics, machine learning, rules, text pre-processing