Mining Semantic Knowledge From Chinese Wikipedia

Posted on:2010-06-26

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Li

Full Text:PDF

GTID:1118360278965458

Subject:Signal and Information Processing

Abstract/Summary:

Natural language processing based on semantic information requires computers to have access to a great deal of background knowledge. Large-scale semantic networks and dictionaries are mostly built by hand, at considerable cost in labor and material resources for both construction and maintenance. Mining semantic information from existing corpora to build common-sense semantic resources has therefore become an active research topic in recent years. Wikipedia, an open online encyclopedia, can be used not only as a corpus but also as a knowledge resource rich in semantic information. To some extent, its quality is comparable to that of well-known manually constructed resources. This work presents semantic mining from the Chinese Wikipedia for natural language processing and semantic resource construction.

For semantic relatedness calculation, this work presents a new "multi-path searching" algorithm on Wikipedia's hyperlink networks, including the category graph and the document graph. Web pages are downloaded from the Chinese Wikipedia, and the hyperlinks between them are extracted for semantic mining. The path search combines path length with the weights of nodes or edges to calculate relatedness. Related word pairs are collected from hyperlink references, and a subset of them is annotated with human judgments of semantic relatedness to form a test set. In experiments, relatedness is measured and compared with classical algorithms for detailed analysis.

Wikipedia can also be used for semantic expansion and document relatedness calculation. A matrix of direct links is constructed by extracting the redirect pages, the category graph, the document graph, and similar link structures. Through matrix multiplication, the contributions of direct and indirect paths to relatedness are integrated into a new matrix for semantic transformation. Given two vectors extracted by term frequency, this matrix transforms them into new vectors expanded with background information from Wikipedia.
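The matrix-based expansion can be illustrated with a small sketch. The abstract does not say how direct and indirect paths are weighted, so the decay-weighted sum of matrix powers below, the parameters `max_hops` and `decay`, the function names, and the three-concept toy link matrix are all assumptions made for illustration, not the dissertation's actual formulation.

```python
import math

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def expansion_matrix(direct, max_hops=2, decay=0.5):
    """Assumed combination: I + decay*A + decay^2*A^2 + ... up to max_hops.

    A^k counts paths of k links, so each further hop adds indirect
    contributions with a smaller weight.
    """
    n = len(direct)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in total]
    for hop in range(1, max_hops + 1):
        power = mat_mul(power, direct)
        weight = decay ** hop
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
    return total

def transform(vec, matrix):
    """Project a term-frequency vector into the expanded semantic space."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical concepts, in order: "apple", "fruit", "banana".
# "apple" and "banana" each link to "fruit" but not to each other.
direct_links = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
doc_a = [1.0, 0.0, 0.0]  # mentions only "apple"
doc_b = [0.0, 0.0, 1.0]  # mentions only "banana"

semantic = expansion_matrix(direct_links)
raw_similarity = cosine(doc_a, doc_b)
expanded_similarity = cosine(transform(doc_a, semantic), transform(doc_b, semantic))
print(raw_similarity, expanded_similarity)
```

In this toy case the two documents share no terms, so their raw cosine similarity is zero, while the expanded vectors overlap through the shared "fruit" neighbor.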
Traditional vector-based relatedness algorithms for documents can also be applied to the new vectors in the semantic space. This algorithm also suggests a way to collect semantically related word pairs and groups.

For the construction of semantic knowledge resources from Wikipedia, this work takes two different approaches. In the first, different relation types are extracted by learning from the phrase structure of category nodes, the relations between them, and the syntactic patterns of sentences in documents. By adding these relation types to the category graph, a semantic hierarchy network is constructed. In the second, thousands of core words with a single meaning and a non-phrasal form are selected from the Wikipedia word list, while each of the other words is described by its most related groups of words, covering its basic concept and valuable properties. From these descriptions, a linear word semantic dictionary is created. To further study the maintenance and expansion of existing semantic knowledge resources, the Wikipedia category graph is mapped to HowNet. Further work adds new words and named entities to HowNet, with semantic interpretations of a similar form derived by learning existing patterns between the two resources.
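The "multi-path searching" idea from the relatedness part of the abstract can be sketched as follows. The abstract only states that path length and node or edge weights are combined, so the scoring rule here (product of edge weights, multiplied by a `decay` factor per link, summed over all simple paths of at most `max_len` links) is an assumption, as are the function name, the parameters, and the toy graph with its weights.

```python
def multi_path_relatedness(graph, source, target, max_len=3, decay=0.5):
    """Sum the scores of all simple paths from source to target.

    Assumed scoring: a path's score is the product of its edge weights,
    multiplied by `decay` once per link, so longer paths count for less.
    Only paths with at most `max_len` links are considered.
    """
    if source == target:
        return 1.0
    total = 0.0
    # Each stack entry: (current node, score so far, nodes visited on this path).
    stack = [(source, 1.0, (source,))]
    while stack:
        node, score, path = stack.pop()
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor in path:
                continue  # keep paths simple (no repeated nodes)
            new_score = score * weight * decay
            if neighbor == target:
                total += new_score
            elif len(path) < max_len:
                stack.append((neighbor, new_score, path + (neighbor,)))
    return total

# Hypothetical hyperlink graph; node names and edge weights are made up.
graph = {
    "Beijing": {"China": 0.9, "Municipality": 0.5, "Capital city": 0.6},
    "China": {"Beijing": 0.9, "Shanghai": 0.8},
    "Municipality": {"Beijing": 0.5, "Shanghai": 0.5},
    "Capital city": {"Beijing": 0.6},
    "Shanghai": {"China": 0.8, "Municipality": 0.5},
}

for pair in [("Beijing", "Shanghai"), ("Beijing", "China"), ("Beijing", "Guitar")]:
    print(pair, multi_path_relatedness(graph, *pair))
```

Here "Beijing" and "Shanghai" have no direct link but are connected through two intermediate pages, and both paths contribute to the score. A word absent from the graph receives a score of zero.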

Keywords/Search Tags:

Wikipedia, Semantic Knowledge, Information Extraction, Semantic Dictionary, Natural Language Processing

Related items

1	Research And Implementation On Computing Semantic Relatedness Using Chinese Wikipedia
2	The Representation Of Chinese Semantic Knowledge And Its Application In The Chinese-English MT System
3	Investigation Of Categorical Semantic Information Processing In The Brain And Natural Language Processing Models
4	A Study On Neural Network-based Natural Language Semantic Representation
5	Design And Implementation Of Knowledge Extraction System For Overlapping Relations In Complex Semantic Context
6	Automatic Knowledge Extraction From The Chinese Natural Language Web Documents And Knowledge Consolidation
7	The Research Of Vietnamese Language News Build Lexical Chain Based On Converged Network Semantic Knowledge
8	Semantic Annotation For Documents In Professional Domain Based On NLP
9	Building Semantic Knowledge-Bank Based On The Binary Combinatorial Grammar
10	Crowdsourcing For Synonyms Proofreading And Acquisition In Chinese Large-scale Semantic Knowledge Base