
Mining Semantic Knowledge From Chinese Wikipedia

Posted on: 2010-06-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 1118360278965458
Subject: Signal and Information Processing
Abstract/Summary:
For natural language processing based on semantic information, computers need access to a great deal of background knowledge. Large-scale semantic networks and dictionaries are mainly constructed manually, at considerable cost in manpower and material resources for both construction and maintenance. Mining semantic information from existing corpora to build common-sense semantic resources has become an active research topic in recent years. Wikipedia, an open online encyclopedia, can be used not only as a corpus but also as a knowledge resource rich in semantic information. To some extent, its quality is comparable to that of well-known manually constructed resources. This dissertation presents work on semantic mining from the Chinese Wikipedia for natural language processing and semantic resource construction.

For semantic relatedness calculation, this dissertation presents a new "multi-path searching" algorithm over Wikipedia's hyperlinked networks, including the category graph and the document graph. Web pages are downloaded from the Chinese Wikipedia, and the hyperlinks between them are extracted for semantic mining. Paths are searched, and path length is combined with the weights of nodes or edges to calculate relatedness. Related word pairs are collected from hyperlink references, and a subset is annotated for semantic relatedness by human judges to form a test set. In experiments, the measured relatedness is compared against classical algorithms and analyzed in detail.

Wikipedia can also be used for semantic expansion and document relatedness calculation. A matrix of direct links is constructed from the redirect pages, the category graph, the document graph, and other sources. Through matrix multiplication, the contributions of direct and indirect paths to relatedness are combined into a new matrix used for semantic transformation. Given two vectors built from term frequencies, this matrix transforms them into new vectors expanded with background information from Wikipedia.
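The expansion step can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the dissertation: the three-term vocabulary, the direct-link matrix `LINKS`, the weight `DECAY` for two-step paths, and the specific combination T = I + A + DECAY * A^2. The sketch combines direct links and two-step paths into one transformation matrix, applies it to two term-frequency vectors, and compares the vectors with cosine similarity before and after expansion.

```python
import math

# Hypothetical toy vocabulary and direct-link matrix (assumptions for illustration).
# LINKS[i][j] = 1.0 means term i links directly to term j.
VOCAB = ["car", "automobile", "engine"]
LINKS = [
    [0.0, 1.0, 0.0],  # car -> automobile
    [1.0, 0.0, 1.0],  # automobile -> car, engine
    [0.0, 1.0, 0.0],  # engine -> automobile
]
DECAY = 0.5  # assumed weight given to two-step (indirect) paths


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def build_transform(links, decay):
    """Return T = I + A + decay * A^2, mixing direct and two-step paths."""
    n = len(links)
    two_step = matmul(links, links)
    return [
        [(1.0 if i == j else 0.0) + links[i][j] + decay * two_step[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def expand(vec, transform):
    """Map a term-frequency row vector into the expanded space (vec * T)."""
    return [
        sum(vec[i] * transform[i][j] for i in range(len(vec)))
        for j in range(len(transform[0]))
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


TRANSFORM = build_transform(LINKS, DECAY)
doc1 = [1.0, 0.0, 0.0]  # a document that mentions only "car"
doc2 = [0.0, 0.0, 1.0]  # a document that mentions only "engine"

raw_score = cosine(doc1, doc2)
expanded_score = cosine(expand(doc1, TRANSFORM), expand(doc2, TRANSFORM))
print(raw_score, expanded_score)
```

In this toy setup the two documents share no terms, so their raw cosine similarity is zero, while the expanded vectors overlap through the linked term "automobile" and score about 0.71.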
Traditional vector-based document relatedness algorithms can then be applied to the new vectors in the semantic space. This algorithm also suggests a way to collect semantically related word pairs and groups.

For semantic knowledge resource construction from Wikipedia, this dissertation takes two approaches. First, different relation types are extracted by learning from the phrase structure of category nodes, the relations between them, and the syntactic patterns of sentences in documents. Adding these relation types to the category graph yields a hierarchical semantic network. Second, thousands of core words that have a single meaning and are not phrases are selected from the Wikipedia word list, and the remaining words are described by their most closely related groups of words, which capture the basic concept and valuable properties. From these descriptions, a linear word semantic dictionary is created. To further study the maintenance and expansion of existing semantic knowledge resources, the Wikipedia category graph is mapped to HowNet. Additional work adds new words and named entities to HowNet, together with semantic interpretations in a similar form, derived by learning the existing patterns between the two resources.
Keywords/Search Tags:Wikipedia, Semantic Knowledge, Information Extraction, Semantic Dictionary, Natural Language Processing