
The Construction Of Knowledge Base Based On Chinese Encyclopedia

Posted on: 2016-01-30; Degree: Master; Type: Thesis
Country: China; Candidate: L F Wang
GTID: 2308330470467665; Subject: Computer application technology
Abstract/Summary:
In recent years, the mobile Internet, the Internet of Things, cloud computing and related technologies have developed rapidly, new network applications keep emerging, and the volume of network data has grown explosively. Deriving valuable knowledge from such large amounts of data, and making full use of it through deep computation and analysis, has become a hot research topic. As many as 50 knowledge bases have been built in different countries, most of them based on English Wikipedia or other English resources. Yet the Chinese encyclopedias (Baidu Encyclopedia, Hudong Encyclopedia and Chinese Wikipedia) contain a large number of high-quality entries. This thesis builds a knowledge base from Chinese encyclopedias, and its main work is as follows.
(1) A multi-threaded web spider is designed and implemented to download encyclopedia pages. It collects the URLs of pages and categories breadth-first and then downloads the pages. After analyzing the structural features of the web pages, we use heuristics and other methods to extract semantic information from them.
(2) A method is presented for constructing a concept hierarchy from the classification system of Hudong Encyclopedia. It extracts linguistic and semantic features of categories to train an AdaBoost model that identifies hyponymy relations between categories, and these relations are used to construct the concept hierarchy automatically. The same method is used to extract the relation between a category and an entry.
(3) Conditional Random Fields (CRFs) are used to extract attribute values from the unstructured text of the encyclopedia. First, we identify attribute-value pairs in Hudong Encyclopedia pages that have Infoboxes; these pairs indicate which attributes matter for different kinds of entries. We then use keyword matching to identify candidate sentences for each attribute in a plain Hudong Encyclopedia article. Finally, we train a CRF model to extract the corresponding values from these candidate sentences.
In this thesis, we construct a concept hierarchy from the category system of Hudong Encyclopedia and run the attribute-extraction experiments on Hudong Encyclopedia articles in the "People" category, achieving excellent performance.
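The breadth-first URL collection in step (1) could look roughly like the sketch below. It is not the thesis's spider: it is single-threaded, it does no network access, and the names `bfs_collect_urls`, `get_links`, `TOY_GRAPH` and `toy_get_links` are hypothetical. The toy link graph only stands in for real category and entry pages.

```python
from collections import deque


def bfs_collect_urls(seed, get_links, max_urls=1000):
    """Return the URLs reachable from `seed` in breadth-first order.

    `get_links(url)` is an assumed callback that returns the outgoing
    links of a page; a real spider would fetch and parse the page here.
    """
    seen = {seed}          # URLs already queued, to avoid revisiting
    order = []             # URLs in the order they are visited
    queue = deque([seed])  # FIFO frontier gives breadth-first order
    while queue and len(order) < max_urls:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Toy in-memory link graph (made up for illustration only).
TOY_GRAPH = {
    "category/root": ["category/people", "category/places"],
    "category/people": ["entry/a", "entry/b"],
    "category/places": ["entry/c", "entry/a"],
}


def toy_get_links(url):
    """Look up outgoing links in the toy graph; unknown pages have none."""
    return TOY_GRAPH.get(url, [])


if __name__ == "__main__":
    for url in bfs_collect_urls("category/root", toy_get_links):
        print(url)
```

In a multi-threaded version, several workers would presumably share the frontier and the `seen` set, which would then need a thread-safe queue and a lock; that part is left out here.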
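The keyword-matching step in (3), which picks candidate sentences before the CRF is applied, might be sketched as follows. The attribute names, the trigger keywords and the sample text are all invented for illustration; the abstract does not list the actual keywords, and the CRF training itself is not shown.

```python
import re

# Hypothetical trigger keywords per attribute. In the thesis the attributes
# are learned from Infobox pages; these values are made up.
ATTRIBUTE_KEYWORDS = {
    "birth_date": ["born"],
    "occupation": ["worked as", "profession"],
}

# A run of non-terminator characters followed by an optional terminator.
# Both Western and Chinese sentence-ending punctuation are covered.
_SENTENCE_RE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(text):
    """Split text into sentences on common end-of-sentence punctuation."""
    return [s.strip() for s in _SENTENCE_RE.findall(text) if s.strip()]


def candidate_sentences(text, keywords_by_attribute):
    """Map each attribute to the sentences that contain one of its keywords."""
    candidates = {attr: [] for attr in keywords_by_attribute}
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for attr, keywords in keywords_by_attribute.items():
            if any(keyword in lowered for keyword in keywords):
                candidates[attr].append(sentence)
    return candidates


if __name__ == "__main__":
    sample = "The person was born in 1950. She worked as a teacher. She liked tea."
    for attr, sentences in candidate_sentences(sample, ATTRIBUTE_KEYWORDS).items():
        print(attr, sentences)
```

Only the sentences selected this way would be passed on to the sequence labeller, which keeps the CRF from having to scan every sentence of an article for every attribute.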
Keywords/Search Tags: Hyponymy, CRFs, property values, knowledge base, Chinese online encyclopedia