
Research On Chinese Named Entity Recognition And New Word Detection

Posted on: 2008-06-09 | Degree: Master | Type: Thesis
Country: China | Candidate: L G Liu | Full Text: PDF
GTID: 2178360245997860 | Subject: Computer Science and Technology
Abstract/Summary:
Named entities and new words are basic information units of text and are essential to understanding it correctly. They are widely used in information retrieval, machine translation, text classification, automatic summarization, and other natural language processing applications, so solving these problems will advance research in the related fields. This thesis focuses on Chinese named entity recognition and new word detection, and addresses the following aspects:

1. To address poor efficiency, low practicality, and weak performance on complex named entities, a cascaded hidden Markov model is adopted that exploits the nesting of Chinese named entities. The model first identifies simple named entities, then abbreviated named entities, and finally complex named entities. A word segmentation method and a tag set for named entity recognition are designed for this process. An N-best method passes the N best results of each stage to the next stage, which searches them for the best result.

2. To address the data sparseness and poor generalization of the cascaded hidden Markov model, transformation-based learning is used as a post-processing step. Because the learned rules are optimized, transformation-based learning performs well. The named entity recognition results are tested on the formal test data of the 2004 863 Evaluation, where the F-measure reaches 83%.

3. To address the length and domain limitations on detected new words and the loss of correct new words, a strategy combining statistics and rules is used for new word detection. The candidate set is constructed by searching for repeated strings. A stop word list, a stop word tag list, a head stop word tag list, a tail stop word tag list, and a fixed window are then used to filter out junk strings. The frequency ratio method and the TF/IDF method are then used to re-rank the detected new words. The experiments indicate that the frequency ratio method is better suited to general new word detection and TF/IDF to term detection. In the final evaluation, precision reaches 60% and recall reaches 90%.
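
The new word detection steps in item 3 can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the length bounds, and the use of a single stop-character set in place of the thesis's stop word and stop tag lists are all assumptions made for illustration, and the TF/IDF score uses the common tf * log(N / df) form, which the abstract does not specify.

```python
import math
from collections import Counter


def repeated_strings(text, min_len=2, max_len=4, min_count=2):
    """Collect every substring of length min_len..max_len that occurs
    at least min_count times (a simple form of repeated-string search)."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def filter_candidates(candidates, stop_chars):
    """Drop candidates containing any stop character.
    Hypothetical stand-in for the stop word / stop tag lists."""
    return {
        s: c for s, c in candidates.items()
        if not any(ch in stop_chars for ch in s)
    }


def tf_idf_rank(candidates, documents):
    """Rank candidates by tf * log(N / df) over the given documents,
    highest score first. Assumed scoring form, not taken from the thesis."""
    n_docs = len(documents)
    scored = []
    for s, tf in candidates.items():
        df = sum(1 for d in documents if s in d)
        idf = math.log(n_docs / df) if df else 0.0
        scored.append((s, tf * idf))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In use, `repeated_strings` would be run over the target text, its output passed through `filter_candidates`, and the survivors ranked with `tf_idf_rank` against a background document collection. A candidate that appears in every background document scores zero, which matches the intuition that TF/IDF favors domain-specific terms.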
Keywords/Search Tags: named entity recognition, new word detection, hidden Markov model, frequency ratio, TF/IDF