
Research Of Named Entity Recognition And Automatic Pattern Acquisition In Information Extraction

Posted on: 2006-09-17
Degree: Master
Type: Thesis
Country: China
Candidate: X J Wu
Full Text: PDF
GTID: 2168360155971720
Subject: Computer software and theory
Abstract/Summary:
With the advent of the information era and the development of the Internet, the information explosion has become the bottleneck of information processing, and there is an urgent need to acquire this information quickly and accurately. Information extraction is one of the most powerful ways to address this problem. However, named entity recognition, automatic pattern acquisition, and coreference resolution all remain open problems. This thesis studies named entity recognition and automatic pattern acquisition and presents a series of solutions.

For named entity recognition, this thesis focuses on recognizing Chinese person names and Chinese organization names. Based on statistics gathered from a large-scale corpus, it builds a knowledge base for Chinese person name identification and presents a person name recognition method that combines statistics and rules. The method balances recall and precision: in testing, it achieves 91.35% recall and 92.23% precision.

For Chinese organization name recognition, this thesis uses the Co-Training machine learning method to build six knowledge bases. It then presents a statistics- and rule-based identification algorithm that draws on the composition probability of organization names, the co-occurrence probability of organization name words and suffixes, the characters inside organization names, and the introductory words that precede and follow organization names. In experiments, the algorithm achieves 90.2% precision and 81.7% recall in the closed test, and 88.5% precision and 75.5% recall in the open test.

This thesis also studies automatic pattern acquisition for information extraction. It proposes a new automatic pattern acquisition method based on similarity computation (APAMBSC).
Given a seed pattern, relevant patterns can be learned automatically from a large unlabeled training corpus, and the generated patterns can be put to use after a little manual correction. Compared with other algorithms, APAMBSC requires much less human intervention and avoids the need to hand-tag a training corpus. Experimental results show that the patterns APAMBSC learns achieve 79.45% precision and 66.51% recall in the open test.

Finally, this thesis makes a first attempt at designing a Chinese information extraction system, combining the techniques studied here with those developed in the author's laboratory.
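
The seed-driven pattern acquisition summarized above can be illustrated with a minimal sketch. The abstract does not say how patterns are represented or which similarity measure APAMBSC uses, so everything here is an assumption made for illustration: patterns are tuples of tokens, similarity is token-set Jaccard overlap, and the names `jaccard`, `acquire_patterns`, `threshold`, and `max_rounds` are hypothetical. It is not the thesis's algorithm, only one plausible shape of "learn patterns similar to a seed".

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a pattern is a sequence of words and slot tags.
Pattern = Tuple[str, ...]


def jaccard(a: Pattern, b: Pattern) -> float:
    """Token-set Jaccard similarity (a stand-in; the thesis's measure is not given)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def acquire_patterns(
    seed: Pattern,
    candidates: Sequence[Pattern],
    threshold: float = 0.5,
    max_rounds: int = 3,
) -> List[Pattern]:
    """Grow a pattern set from one seed.

    In each round, a candidate is accepted if it is similar enough to any
    pattern accepted so far. Iteration stops when a round adds nothing or
    after max_rounds rounds.
    """
    accepted: List[Pattern] = [seed]
    remaining = [c for c in candidates if c != seed]
    for _ in range(max_rounds):
        newly_accepted = [
            c for c in remaining
            if any(jaccard(c, p) >= threshold for p in accepted)
        ]
        if not newly_accepted:
            break
        accepted.extend(newly_accepted)
        remaining = [c for c in remaining if c not in newly_accepted]
    return accepted


# Toy data, invented for this example (not taken from the thesis corpus).
seed = ("<PERSON>", "was", "appointed", "<POSITION>", "of", "<ORG>")
candidates = [
    ("<PERSON>", "was", "named", "<POSITION>", "of", "<ORG>"),
    ("<ORG>", "named", "<PERSON>", "as", "<POSITION>"),
    ("the", "weather", "was", "sunny"),
]

for pattern in acquire_patterns(seed, candidates):
    print(" ".join(pattern))
```

In this toy run the second candidate is not similar enough to the seed itself and is only accepted in a later round, through the first candidate. Chained acceptance of this kind is one reason learned patterns may drift from the seed, which fits the abstract's remark that the generated patterns still need a little manual correction before use.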
Keywords/Search Tags: information extraction, named entity recognition, Chinese name recognition, organization name recognition, Co-Training, automatic pattern acquisition, similarity computation