Research On Chinese Named Entity Recognition And New Word Detection

Posted on:2008-06-09

Degree:Master

Type:Thesis

Country:China

Candidate:L G Liu

Full Text:PDF

GTID:2178360245997860

Subject:Computer Science and Technology

Abstract/Summary:

Named entities and new words are basic information units of text and are essential to understanding it correctly. Their recognition is widely used in information retrieval, machine translation, text classification, automatic summarization, and other natural language processing applications, so progress on it benefits research in those fields. This thesis focuses on Chinese named entity recognition and new word detection. It covers the following aspects:

1. To address the poor efficiency, low practicality, and weak performance of existing methods on complex named entities, a cascaded hidden Markov model is adopted that exploits the nesting of Chinese named entities. The thesis first identifies simple named entities, then abbreviated named entities, and finally complex named entities. A word segmentation method and a tag set for named entity recognition are designed for this process. An N-best method passes the top N results of each stage to the next stage, which then searches for the best result.

2. To address data sparseness and poor generalization in the cascaded hidden Markov model, the thesis uses transformation-based learning as a post-processing step. Because the learned rules are optimized, transformation-based learning performs well in this setting. Tested on the formal test data of the 2004 863 Evaluation, named entity recognition achieves an F-measure of 83%.

3. To address the length and domain limitations on detected new words and the loss of correct new words, the thesis combines statistics and rules for new word detection. The new word candidate set is built by searching for repeated strings. A stop word list, a stop word tag list, a head stop word tag list, a tail stop word tag list, and a fixed window are then used to filter out garbage strings. The frequency ratio method and the TF/IDF method are used to re-rank the detected new words. The experiments show that the frequency ratio method is better suited to general new word detection, while TF/IDF is better suited to term detection. In the final evaluation, precision reaches 60% and recall reaches 90%.
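
The new word detection pipeline described in the abstract (repeated-string candidates, stop-list filtering, frequency-ratio ranking) might be sketched roughly as follows. This is an illustrative simplification, not the thesis's implementation: the function names, the length bounds, the character-level stop list (the thesis filters on stop words and part-of-speech tags), and the add-one smoothed ratio against a background count table are all assumptions made for the example.

```python
from collections import Counter


def repeated_strings(text, min_len=2, max_len=4, min_count=2):
    """Collect every substring of length min_len..max_len that repeats in text."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def filter_candidates(candidates, stop_chars):
    """Drop candidates that begin or end with a stop character.

    Assumption: a simplified stand-in for the head/tail stop tag lists,
    which would need a part-of-speech tagger.
    """
    return {
        s: c for s, c in candidates.items()
        if s[0] not in stop_chars and s[-1] not in stop_chars
    }


def rank_by_frequency_ratio(candidates, text_len, background, background_len):
    """Rank candidates by foreground frequency over background frequency.

    Assumption: the ratio is defined here with add-one smoothing so that
    strings unseen in the background corpus do not divide by zero.
    """
    scores = {}
    for s, count in candidates.items():
        foreground = count / text_len
        backgr = (background.get(s, 0) + 1) / (background_len + 1)
        scores[s] = foreground / backgr
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    text = "的博客的博客"
    candidates = filter_candidates(repeated_strings(text), stop_chars={"的"})
    print(rank_by_frequency_ratio(candidates, len(text), {}, 1000))
```

A string that is frequent in the document but rare in the background counts scores high, which matches the abstract's observation that a frequency ratio favors general new words; a TF/IDF score would instead need counts across several documents.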

Keywords/Search Tags:

named entity recognition, new word detection, hidden markov model, frequency ratio, TF/IDF

Related items

1	Design And Implementation Of A Hidden Markov Model Based Model For Legal Named Entity Recognition
2	Study On Chinese Named Entity Recognition Based On Hidden Markov Model
3	The Field Of Music, A Combination Of Rules And Statistical Named Entity Recognition
4	Research On Named Entity Recognition Based On KL-HMM
5	The Research On Named Entity Recognition In Chinese Information Processing
6	Chinese Lexical Analysis And Named Entity Identification Using Hierarchical Hidden Markov Model
7	A Study On The Recognition Of Biomedical Named Entity Based On Statistic
8	Cross-Domain And Cross-Style Chinese Named Entity Recognition
9	Research On Chinese Named Entity Recognition
10	Research And Application Of Named Entity Recognition Method For The Bidding Data