
POS Tagging System On Hierarchical Classification Labels

Posted on: 2010-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: W Pan
GTID: 2178360275991509
Subject: Computer application technology
Abstract/Summary:
As a basic operation in natural language processing (NLP), POS tagging provides such useful information about a word and its neighbors that it has become a common component of many complex applications. The POS tagging task underlies document understanding, document generation, and much other NLP-related research. After years of development, it is considered a relatively mature field of study. However, many unconventional datasets, such as hierarchical classification labels, have emerged with the rise of the Internet and the information explosion. Current POS tagging tools are all built on full-fledged sentences, and as a result their performance on these datasets is very poor. Based on this observation, this thesis investigates POS tagging algorithms for hierarchical classification labels.

The thesis first gives a brief introduction to the core technologies and research methods of current POS tagging research, including four classic models and algorithms. In the process of manual annotation, we identify six major differences between full-fledged sentences and hierarchical classification labels, which explain the performance drop of traditional tools. We also point out two key problems that must be addressed: path information and proper nouns.

We then propose a POS tagging algorithm for hierarchical classification labels based on the Maximum Entropy Model (MEM). To integrate path information into the input, a new tag, PATH, is introduced, along with three new features that make use of this information. To identify the many proper nouns in classification labels, we build a dictionary from WordNet and a database from Wikipedia. Both are then encoded into the MEM as binary features. These modifications lead to significant improvement on the Dmoz dataset and thus demonstrate the effectiveness of our approach.

This POS tagging algorithm can be applied to webpage classification systems. Current systems depend heavily on human-annotated data as the training corpus. We propose a way to automatically generate a training corpus with the help of a POS tagger for classification labels and a search engine. The overall performance of this system is quite acceptable.
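
To make the feature design more concrete, the following is a minimal sketch of how path information and lexicon lookups could be turned into binary features for a maximum entropy tagger. It is not the thesis's implementation: the abstract does not list the three path features, so the separator token `<path>`, the feature names (`depth=`, `first_in_level`, `last_in_level`), and the function names are all hypothetical, and the two lexicons are small hand-written sets standing in for the WordNet dictionary and the Wikipedia database.

```python
from typing import Dict, List, Set

# Hypothetical separator inserted between hierarchy levels; in a tagger of this
# kind it would be the token that receives the PATH tag.
PATH_TOKEN = "<path>"


def flatten_label(levels: List[str]) -> List[str]:
    """Turn a category path into one token sequence with level separators."""
    tokens: List[str] = []
    for i, level in enumerate(levels):
        if i > 0:
            tokens.append(PATH_TOKEN)
        tokens.extend(level.split())
    return tokens


def token_features(
    tokens: List[str],
    i: int,
    dictionary: Set[str],    # stand-in for a WordNet-derived word list (lowercase)
    proper_nouns: Set[str],  # stand-in for a Wikipedia-derived name list
) -> Dict[str, int]:
    """Binary (0/1) features for token i, in the form a MaxEnt model consumes."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    # Depth in the hierarchy = number of separators seen before this token.
    depth = tokens[:i].count(PATH_TOKEN)
    return {
        f"word={word.lower()}": 1,
        f"prev={prev_word.lower()}": 1,
        f"next={next_word.lower()}": 1,
        # Illustrative path-aware features (assumed, not taken from the thesis).
        f"depth={depth}": 1,
        "first_in_level": int(i == 0 or prev_word == PATH_TOKEN),
        "last_in_level": int(i + 1 == len(tokens) or next_word == PATH_TOKEN),
        # Lexicon lookups encoded as binary features.
        "in_dictionary": int(word.lower() in dictionary),
        "in_proper_noun_db": int(word in proper_nouns),
    }
```

For an illustrative category path such as `["Arts", "Music", "Bands and Artists", "Beatles"]`, `flatten_label` yields nine tokens, and calling `token_features` on the last one produces a feature dictionary that could be passed to any maximum entropy (multinomial logistic regression) trainer that accepts sparse binary features.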
Keywords/Search Tags:Hierarchical classification labels, POS tagging, MEM, Wikipedia, WordNet