Research And Implementation Of The Construction Of Chinese RDF Knowledge Base

Posted on: 2017-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: J F Huang
Full Text: PDF
GTID: 2308330485977480
Subject: Software engineering
Abstract/Summary:
People can obtain abundant information from the Big Data on the Internet: entering keywords into a search engine returns relevant news and data links. However, as the volume of data keeps growing, acquiring knowledge this way becomes inefficient. Information on the Internet is currently stored and published as documents connected by hyperlinks. People can understand the content of such documents, but it is hard for a computer to understand their meaning. To make better use of Big Data, some foreign research institutions have built knowledge bases from the English Wikipedia, such as Freebase and DBpedia. There are also knowledge bases in China, such as the Baidu knowledge base, Sogou Knowledge Cube and Tsinghua XLore. Knowledge bases are valuable in the fields of knowledge graphs, data fusion and intelligent question answering. Foreign knowledge bases such as Freebase publish Resource Description Framework (RDF) data, but they contain little Chinese entity information. Building a high-quality Chinese RDF knowledge base has therefore become an active research topic.

Against this background, this thesis studies methods for constructing a high-quality Chinese RDF knowledge base. The work covers the following aspects:

(1) Web crawling for large-scale online encyclopedias is studied, and the specific problems and challenges of crawling are analyzed. An online encyclopedia crawling system is built by combining the Scrapy framework with the Spring MVC framework. The crawling system performs stably and has a good user interface. In addition, an algorithm for automatically extracting proxy IP addresses is proposed, which extracts proxy IP addresses effectively and addresses the anti-crawling problem.

(2) Entity information extraction from online encyclopedias is studied, and a method for semantically annotating the extracted information is proposed, based on RDFS information and RDF data standardization. RDF data storage based on a graph database is then studied, and an RDF data storage system based on Neo4j is developed. Experimental results, compared against traditional relational database storage, show that the system meets the requirements for large-scale RDF data storage and SPARQL query.

(3) The entity alignment problem that arises when constructing the Chinese RDF knowledge base from two heterogeneous data sources, Baidu encyclopedia and Hudong encyclopedia, is studied. An entity alignment method based on entities' attributes and the features of context topics is proposed. A comparison with several traditional entity alignment methods shows that the proposed approach outperforms them.

(4) An automatic building system for the Chinese RDF knowledge base is designed and implemented. It combines large-scale online encyclopedia crawling; RDF data transformation, storage and SPARQL query of entity information; and entity alignment across heterogeneous data sources. By configuring a web crawling task, the system automatically downloads online encyclopedia data, extracts the entity information, standardizes the RDF data and stores it in a graph database. The system provides entity information retrieval and SPARQL query functions to external applications.
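The abstract names the two signals used for entity alignment in aspect (3), entity attributes and context topic features, but not how they are combined. The following is only an illustrative Python sketch under assumed choices, not the thesis's method: Jaccard similarity over attribute pairs, cosine similarity over topic weights, and a weighted sum with a hypothetical weight `alpha` and hypothetical `threshold`. The entity dictionary layout and all function names are likewise assumptions.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse topic-weight vectors."""
    dot = sum(w * v.get(topic, 0.0) for topic, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_score(e1: dict, e2: dict, alpha: float = 0.5) -> float:
    """Weighted sum of attribute and topic similarity (assumed combination)."""
    attr_sim = jaccard(set(e1["attributes"].items()),
                       set(e2["attributes"].items()))
    topic_sim = cosine(e1["topics"], e2["topics"])
    return alpha * attr_sim + (1.0 - alpha) * topic_sim


def is_same_entity(e1: dict, e2: dict,
                   threshold: float = 0.6, alpha: float = 0.5) -> bool:
    """Treat two entries as the same entity when the score reaches the threshold."""
    return alignment_score(e1, e2, alpha) >= threshold


# Hypothetical entries for one entity as it might appear in two encyclopedias.
baidu_entry = {
    "attributes": {"birth_year": "1881", "nationality": "China"},
    "topics": {"literature": 0.8, "history": 0.2},
}
hudong_entry = {
    "attributes": {"birth_year": "1881", "occupation": "writer"},
    "topics": {"literature": 0.7, "education": 0.3},
}
print(is_same_entity(baidu_entry, hudong_entry))
```

In practice the weight and threshold would have to be tuned on labeled entity pairs, and the topic vectors would come from a topic model run over the article text.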
Keywords/Search Tags: knowledge base, resource description framework, web crawling, information extraction, graph database, topic feature, entity alignment