Extracting Structured Information From The Chinese Wikipedia And Measuring Relatedness Between Words

Posted on:2012-07-21

Degree:Master

Type:Thesis

Country:China

Candidate:H C Zhang

GTID:2178330335969238

Subject:Computer application technology

Abstract/Summary:

To make computers more intelligent, it is important to incorporate semantic knowledge into natural language processing (NLP). As the demand for information processing grows rapidly, extracting semantic knowledge from all kinds of freely available corpora has become an important research field.

Wikipedia is the largest free, web-based encyclopedia, written collaboratively by volunteers around the world. It offers wide knowledge coverage, a high degree of structure, and rapid updates, so more and more researchers treat it as a semantic knowledge resource. However, the official Wikipedia website provides only basic backup data files, so much of the structured semantic knowledge cannot be used directly.

Therefore, in this thesis, we first extract structured information from these backup data files; then we abstract the objects that Wikipedia uses to organize knowledge and implement an open-source framework; finally, we propose a new method for computing the relatedness between words. The main research work is as follows:

First, extracting the structured information. We download the backup data from the official Wikipedia website and convert the traditional Chinese text into simplified Chinese. After a series of processing steps, we obtain several kinds of structured information, such as internal links, the category taxonomy, and anchors. To make the statistical data convenient to read and use, we store it in a MySQL database and create indexes on the important fields.

Second, abstracting the structure of Wikipedia. After analyzing the different roles of Wikipedia terms, we divide all terms into six classes. For each class, we provide users with open APIs, which make the structured semantic knowledge easier to use.

Third, proposing a new method for computing the relatedness between words. We compare the Chinese Wikipedia with traditional knowledge bases and identify their similarities and differences. We then combine the advantages of previously proposed methods with the features of the Chinese Wikipedia data to design a new method that uses all three kinds of semantic knowledge mentioned above. To evaluate the method, we apply it to different data sets and compare its results with those of other methods. The results show that our method is effective.
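
As a rough illustration of the kind of structured information mentioned above, the following minimal sketch pulls internal links, categories, and anchors out of a single page's wikitext, then scores two pages by the overlap of their outgoing links. It is not the framework or the relatedness method of the thesis, whose details are not given in this abstract. The function names `extract_links` and `link_overlap`, the regular expression, and the category prefixes are assumptions made for the example, and Jaccard overlap is used only as a generic stand-in baseline.

```python
import re

# Matches [[Target]] and [[Target|anchor text]]; a "#section" suffix is ignored.
# Simplification: file/image links and templates are not treated specially.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")

# Assumed category namespace prefixes (English name plus Chinese aliases).
CATEGORY_PREFIXES = ("Category:", "分类:", "分類:")


def extract_links(wikitext):
    """Return (internal_links, categories, anchors) for one page's wikitext.

    anchors is a list of (anchor_text, target_title) pairs.
    """
    links, categories, anchors = [], [], []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        label = match.group(2)
        prefix = next((p for p in CATEGORY_PREFIXES if target.startswith(p)), None)
        if prefix is not None:
            categories.append(target[len(prefix):])
        else:
            links.append(target)
            # With no explicit label, the link text is the target title itself.
            anchors.append((label.strip() if label else target, target))
    return links, categories, anchors


def link_overlap(links_a, links_b):
    """Jaccard overlap of two link lists (a stand-in baseline, not the thesis method)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page = "[[北京|首都]]是[[中国]]的城市。[[Category:城市]]"
    print(extract_links(page))
    print(link_overlap(["a", "b"], ["b", "c"]))
```

In a fuller pipeline, the extracted tuples would be written to database tables with indexes on the title columns, so that links and anchors can be looked up by page title.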

Related items

1	Research And Implementation On Computing Semantic Relatedness Using Chinese Wikipedia
2	Research Of Semantic Relatedness Measure Based On Wikipedia Structure
3	Research On Concept And Short Text Semantic Relatedness Calculation Method
4	A Study On The Analytical Method Of Chinese And Vietnamese Bilingual News
5	Mining Semantic Knowledge From Chinese Wikipedia
6	Term Relatedness from Wiki-Based Resources Using Sourced PageRank
7	Wikipedia Based Conceptual Graph Model And Its Application
8	Research And Implementation Of The Knowledge Search System Based On Wikipedia
9	The Full-Text Semantic Annotation System Based-on Chinese Wikipedia
10	Research Of Extracting Structured Ontology From Wikipedia Infoboxes