Research On Chinese Time Expression Recognition

Posted on: 2011-01-30  Degree: Master  Type: Thesis
Country: China  Candidate: T Wu  Full Text: PDF
GTID: 2178360305497848  Subject: Computer application technology
Abstract/Summary:
Named entity recognition has received increasing attention as information processing technology has developed. This dissertation focuses on time expression recognition, one of the most important directions within named entity recognition research.

Time expression recognition has many useful applications in natural language processing. It can be used to determine the sequence of events in topic detection and tracking; to answer time-related questions such as "when" and "how long" in automatic question answering systems; to make translated text easier to understand in machine translation; and to improve the precision of web page structure analysis in some special tasks.

There are two main approaches to time expression recognition: sequence labeling methods based on machine learning, and rule-based methods. This dissertation studies both in depth.

For sequence labeling, the dissertation introduces two main supervised learning models, the conditional maximum entropy (CME) model and the conditional random field (CRF) model, and implements a complete time expression recognition system based on each. Experimental results show that, although sequence labeling methods dominate named entity recognition, CME reaches only a 79.1% F-score and CRF only a 79.5% F-score on the time expression recognition task. The dissertation therefore concludes that machine learning methods are not well suited to this specific task.

At present, the most popular technique for recognizing time expressions is still the traditional rule-based method, and the dissertation explores this direction in depth. First, it uses manually written rules to match time expressions in a large-scale text corpus. Second, to improve recall and save labor cost, it designs an automatic rule learning algorithm that makes full use of the labeled information in the training corpus provided by the organizers. Third, to achieve higher precision, it adds an error-driven method to the system to prune the rule corpus; the results show that this effectively reduces the "noise" introduced by automatic rule learning. Finally, to improve the F-score, which represents the overall performance of the system, it proposes the concept of a "basic time unit" and generates basic-time-unit rules using word segmentation techniques from natural language processing. The experimental results show that the algorithm clearly improves the overall performance of the time expression recognition system.

The main contribution of this dissertation is the proposed "auto-generating basic time unit rule corpus" algorithm. The algorithm generates rules based on basic time units, which improves recall, and prunes the rule corpus with an error-driven method, which yields high precision. Together, these two features clearly improve overall performance compared with the baseline system. The experimental results on the ACE07 Chinese Corpus surpass the best previously reported performance, with an 89.8% F-score.

Furthermore, the proposed algorithm is widely applicable and extensible: it can be used to generate high-performance domain-specific rule corpora for recognizing specific entities. Finally, a practical time expression recognition system is implemented based on this research.
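
The idea of building rules from basic time units and pruning them with an error-driven pass can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm: the unit patterns in `BASIC_UNITS`, the helper names, exact-span matching against gold labels, and the `min_precision` threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical basic time units, each written as a small regular expression.
BASIC_UNITS = {
    "YEAR": r"\d{4}年",
    "MONTH": r"\d{1,2}月",
    "DAY": r"\d{1,2}日",
}

def build_rules(unit_sequences):
    """Concatenate basic-unit patterns into one compiled rule per sequence."""
    return {
        "+".join(seq): re.compile("".join(BASIC_UNITS[u] for u in seq))
        for seq in unit_sequences
    }

def match_spans(rule, text):
    """Return the set of (start, end) character spans matched by one rule."""
    return {(m.start(), m.end()) for m in rule.finditer(text)}

def prune_rules(rules, corpus, min_precision=0.5):
    """Error-driven pruning (assumed form): drop rules whose matches on the
    labeled corpus fall below a precision threshold."""
    kept = {}
    for name, rule in rules.items():
        correct = total = 0
        for text, gold in corpus:
            spans = match_spans(rule, text)
            total += len(spans)
            correct += len(spans & gold)
        # Rules that never fire are kept; they produce no errors here.
        if total == 0 or correct / total >= min_precision:
            kept[name] = rule
    return kept

def f_score(rules, corpus):
    """Balanced F-score (harmonic mean of precision and recall) over exact spans."""
    tp = n_pred = n_gold = 0
    for text, gold in corpus:
        pred = set()
        for rule in rules.values():
            pred |= match_spans(rule, text)
        tp += len(pred & gold)
        n_pred += len(pred)
        n_gold += len(gold)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy labeled corpus: one sentence whose only gold expression is the full date.
corpus = [("会议于2010年5月3日举行", {(3, 12)})]
rules = build_rules([("YEAR", "MONTH", "DAY"), ("DAY",)])
f_before = f_score(rules, corpus)
pruned = prune_rules(rules, corpus)
f_after = f_score(pruned, corpus)
print(sorted(pruned), round(f_before, 3), f_after)
```

On this toy corpus the standalone `DAY` rule only produces a partial match inside the full date, so the pruning pass removes it and the F-score rises. A real system would need a much larger labeled corpus and a less strict matching criterion than the one assumed here.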
Keywords/Search Tags: time expression recognition, basic time unit, TIMEX2, error-driven theory, regular expression, named entity recognition, conditional maximum entropy, conditional random field