
Extracting Structured Information From The Chinese Wikipedia And Measuring Relatedness Between Words

Posted on: 2012-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: H C Zhang
GTID: 2178330335969238
Subject: Computer application technology
Abstract/Summary:
To make computers more intelligent, it is important to incorporate semantic knowledge into natural language processing (NLP). As the demand for information processing grows rapidly, extracting semantic knowledge from freely available corpora has become an important research field.

Wikipedia is the largest free, web-based encyclopedia, written collaboratively by volunteers around the world. It offers broad knowledge coverage, a high degree of structure, and rapid updates, so more and more researchers regard it as a semantic knowledge resource. However, the official Wikipedia website offers only basic backup data files, and much of the structured semantic knowledge in them cannot be used directly.

Therefore, in this thesis, we first extract the structured information from these backup data files; then we abstract the objects that Wikipedia uses to organize knowledge and implement an open source framework; finally, we propose a new method for computing the relatedness between words. The main research work is as follows:

Firstly, extracting the structured information. We download the backup data from the official Wikipedia website and convert the traditional Chinese text into simplified Chinese. After a series of processing steps, we obtain several kinds of structured information, such as the internal links, the category taxonomy, and the anchor text. To make the statistical data convenient to read and use, we store it in a MySQL database and create indexes on the important fields.

Secondly, abstracting the structure of Wikipedia. After analyzing the different roles that Wikipedia terms play, we divide the terms into six classes. For each class, we provide a set of open APIs, which reduces the difficulty of using the structured semantic knowledge.

Thirdly, computing the relatedness between words. We compare the Chinese Wikipedia with traditional knowledge bases and identify their similarities and differences. We then combine the advantages of previously proposed methods with the features of the Chinese Wikipedia data and propose a new method that uses all three kinds of structured information mentioned above (internal links, the category taxonomy, and anchor text). To evaluate our method, we run it on different data sets and compare its results with those of other methods. The results show that our method is effective.
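The extraction step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `extract_links`, the regular expression, and the list of category prefixes are assumptions made for illustration, and the sketch only handles plain `[[target]]` and `[[target|anchor]]` wiki markup, ignoring templates and nested brackets.

```python
import re

# Matches [[target]] or [[target|anchor]]; nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

# Assumed namespace prefixes for category links (English, simplified and
# traditional Chinese). A real dump may use further aliases.
CATEGORY_PREFIXES = ("Category:", "分类:", "分類:")


def extract_links(wikitext):
    """Return (links, categories) found in one page of wiki markup.

    links is a list of (target, anchor) pairs; when no anchor text is
    given, the target title itself serves as the anchor.
    categories is a list of category names without the namespace prefix.
    """
    links, categories = [], []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        prefix = next((p for p in CATEGORY_PREFIXES if target.startswith(p)), None)
        if prefix is not None:
            categories.append(target[len(prefix):])
        else:
            links.append((target, anchor))
    return links, categories


sample = "[[北京|首都]]是[[中国]]的城市。[[Category:城市]]"
links, categories = extract_links(sample)
```

Pairs like these could then be bulk-loaded into indexed database tables, as the abstract describes; the table layout is not given in the source and is left out here.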
Keywords/Search Tags: semantic relatedness, Chinese Wikipedia, structured information