
Knowledge Acquisition From Text

Posted on: 2009-01-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J H Wang
Full Text: PDF
GTID: 1118360245470119
Subject: Signal and Information Processing
Abstract/Summary:
Text is one of the most important media through which people describe the world, express their thoughts and spread knowledge. With the rise of the knowledge economy, researchers and engineers have paid more and more attention to text knowledge management. Text knowledge management systems still face several open problems: How can the subject of a text be identified? How can its topic words be extracted? How can the information that matters to a particular reader be highlighted? How can users be given exactly the information they need? Keyword extraction and information extraction, two important text processing technologies, may help to solve these problems. This dissertation focuses on keyword extraction from a single document and on rule generation for information extraction. The main achievements are as follows:

1) Word sense disambiguation based on semantic networks and UW-PageRank

This dissertation proposes a word sense disambiguation method based on semantic networks and UW-PageRank. It disambiguates all the words in a whole text at once and needs neither a corpus nor training.

For Chinese, we use HowNet as the knowledge base and build an undirected weighted graph with sememes as vertices and the relatedness between sememes as edge weights. UW-PageRank is then applied to the graph to score the importance of each sememe. The score of each definition of a word is computed from the scores of the sememes it contains, and the highest-scoring definition is assigned to the word. This algorithm is tested with a text indexing experiment and on SENSEVAL-3.

For English, we use WordNet as the knowledge base and build an undirected weighted graph with synsets as vertices and the relatedness between synsets as edge weights. UW-PageRank is then applied to score the importance of each synset, and the highest-scoring synset is assigned to the word. This algorithm is tested on the SemCor corpus.

2) Keyword extraction based on semantic networks and UW-PageRank

This dissertation proposes a keyword extraction method based on semantic networks and UW-PageRank. After word sense disambiguation, each word has exactly one sense, so the semantic graph can be pruned to keep only the "right" senses. UW-PageRank is then applied to the pruned graph to find the most important senses, i.e. the keywords. We test the algorithm on manually tagged Chinese and English papers; compared with the Tf algorithm, it performs better.

3) Heuristic rule generation algorithm for Chinese information extraction: RGA-CIE

This dissertation proposes RGA-CIE, a heuristic rule generation algorithm for Chinese information extraction that is domain independent and works on free Chinese text. RGA-CIE applies supervised learning with a bottom-up strategy, i.e. a rule generalization process, using a heuristic method to decide the generalization path and the Laplacian formula to evaluate the performance of rules. Semantic extension is also applied to make the rules more flexible. The learned rules have been tested on the Commercial News Information Extraction System, where they achieve a precision of 0.84 and a recall of 0.82, which is better than the manually written rules. We also applied information extraction technology to ontology instance learning, which contributed to the Traveling in Beijing System.
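
The graph-scoring step described above can be illustrated with a short sketch. The abstract does not give the UW-PageRank update rule or say how a definition's score is derived from its sememes, so the code below assumes a common formulation: each vertex passes its score to its neighbours in proportion to edge weight, and a sense is scored by the mean of its vertices. The function names, the damping value of 0.85 and the toy graph are hypothetical and are not taken from the dissertation.

```python
from collections import defaultdict


def uw_pagerank(edges, damping=0.85, tol=1e-8, max_iter=200):
    """Score vertices of an undirected weighted graph (assumed update rule).

    edges: iterable of (u, v, weight); each edge is used in both directions.
    """
    weights = defaultdict(dict)
    for u, v, w in edges:
        if u == v or w <= 0:
            continue  # ignore self-loops and non-positive relatedness
        weights[u][v] = w
        weights[v][u] = w

    nodes = list(weights)
    if not nodes:
        return {}

    # Total edge weight at each vertex, used to normalise outgoing score.
    strength = {n: sum(weights[n].values()) for n in nodes}
    scores = {n: 1.0 / len(nodes) for n in nodes}
    base = (1.0 - damping) / len(nodes)

    for _ in range(max_iter):
        new_scores = {
            n: base + damping * sum(
                scores[m] * w / strength[m] for m, w in weights[n].items()
            )
            for n in nodes
        }
        delta = sum(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


def best_sense(candidates, scores):
    """Pick the candidate sense whose vertices have the highest mean score.

    candidates: dict mapping a sense label to the vertices it contains.
    Averaging is an assumption; the abstract does not name the aggregation.
    """
    def mean_score(vertices):
        known = [scores[v] for v in vertices if v in scores]
        return sum(known) / len(known) if known else 0.0

    return max(candidates, key=lambda sense: mean_score(candidates[sense]))


if __name__ == "__main__":
    # Toy relatedness graph with made-up vertices and weights.
    toy_edges = [
        ("money", "institution", 0.9),
        ("money", "loan", 0.8),
        ("institution", "loan", 0.7),
        ("river", "land", 0.6),
    ]
    toy_scores = uw_pagerank(toy_edges)
    senses = {"bank#finance": ["money", "institution"], "bank#river": ["river", "land"]}
    print(best_sense(senses, toy_scores))
```

In a keyword extraction setting, the same scoring function would be run on the pruned graph and the top-ranked vertices reported, again under the assumptions stated above.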
Keywords/Search Tags: Keyword Extraction, Information Extraction, Word Sense Disambiguation, WordNet, HowNet, PageRank