Information Extraction Research And Application From Network Data

Posted on: 2016-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Bai
Full Text: PDF
GTID: 2308330461470248
Subject: Computer software and theory
Abstract/Summary:
With the development of the Internet, artificial intelligence technology has come into wide use in human social activities, but knowledge bases of various sizes must be built to support its application and development. Constructing a knowledge base usually requires structured data extracted from vast amounts of unstructured and semi-structured data. This research focuses on information extraction from massive network data. It covers large-scale web data collection and collation, semi-structured data extraction, unstructured data extraction, and the construction of an RDF knowledge base.

Douban, Dianping, and the Chinese online encyclopedias contain large amounts of semi-structured and unstructured data and are good sources for information extraction. However, these sites protect their data and block proxy IP addresses, which makes it difficult for a crawler to collect as much information as possible. Moreover, because the attributes in the Chinese online encyclopedias are defined by numerous Internet users, there are thousands of them. Previous researchers extracted information only for high-frequency attributes and gave up on most of the others.

To solve these problems and build a structured knowledge base, this thesis carries out the following steps and proposes a corresponding solution for each problem. The details are as follows.

First, the thesis studies how an HttpClient-based downloader can fetch data through proxy servers with dynamically changing IP addresses.
To get around data protection and IP blocking on Douban, Dianping, Baidu encyclopedia, and Hudong encyclopedia, HttpClient downloads data through proxy servers, cycling through multiple free proxy IP addresses and using multiple threads.

Second, to extract the semi-structured data, the thesis studies a semi-automated extraction approach based on regular expressions. Based on the characteristics of the semi-structured data in Douban, Dianping, and the Chinese encyclopedias, it presents a semi-automatic information extraction method that combines regular expressions with string matching.

Third, to extract information from unstructured data, the thesis studies attribute hierarchy construction and attribute normalization. In the unstructured encyclopedia texts, the same attribute is described with different attribute description words, which makes it difficult to establish a single knowledge base schema across data sets from different encyclopedias. The thesis therefore proposes an approach to attribute hierarchy construction and attribute normalization.

Finally, the knowledge base is constructed using the Resource Description Framework (RDF). Once obtained, the structured data are organized into a uniform format, and a separate RDF knowledge base is built for each data source.

In this thesis, semi-structured data are extracted from the Douban, Dianping, Baidu encyclopedia, and Hudong encyclopedia data sets. Attribute hierarchy construction and attribute normalization are carried out for different categories in the Baidu encyclopedia and Hudong encyclopedia data sets. The unstructured information extraction experiment is then carried out on the "character" category of the Hudong encyclopedia data set.
According to the experimental results, the proposed method not only solves the problem of information extraction covering too few attributes, but also provides ideas and methods for constructing knowledge bases that contain more attributes.
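
The extraction, normalization, and RDF steps summarized above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `<dt>`/`<dd>` markup, the `CANONICAL` synonym table, the `http://example.org/kb/` namespace, and all function names are assumptions made for illustration, and the flat synonym table only stands in for the attribute hierarchy and normalization approach, whose details the abstract does not give. Attributes missing from the table are kept under a slug of their own name rather than dropped, in line with the goal of covering more attributes.

```python
import re
from urllib.parse import quote

# Assumed markup: attribute names in <dt>, values in <dd>. Real pages differ.
PAIR_RE = re.compile(r"<dt>\s*(.*?)\s*</dt>\s*<dd>\s*(.*?)\s*</dd>", re.S)

# Hypothetical synonym table: several attribute description words map to
# one canonical attribute name.
CANONICAL = {
    "date of birth": "birthDate",
    "birthday": "birthDate",
    "born": "birthDate",
    "occupation": "occupation",
    "profession": "occupation",
}

BASE = "http://example.org/kb/"  # placeholder namespace, not from the thesis


def extract_pairs(html):
    """Return the (attribute, value) pairs matched by the regular expression."""
    return PAIR_RE.findall(html)


def normalize(name):
    """Map an attribute word to its canonical name.

    Unknown attributes fall back to a slug of the original word, so that
    low-frequency attributes are kept instead of discarded.
    """
    key = name.strip().lower()
    return CANONICAL.get(key, re.sub(r"\W+", "_", key).strip("_"))


def escape_literal(value):
    """Escape backslashes, quotes, and newlines for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def to_ntriples(entity, html):
    """Turn one page fragment into N-Triples lines for the given entity."""
    subject = f"<{BASE}resource/{quote(entity)}>"
    lines = []
    for name, value in extract_pairs(html):
        predicate = f"<{BASE}property/{normalize(name)}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    sample = ("<dl><dt>Birthday</dt><dd>1990-01-01</dd>"
              "<dt>Profession</dt><dd>Teacher</dd></dl>")
    for line in to_ntriples("Example Person", sample):
        print(line)
```

In this sketch "Birthday" and "Date of birth" both end up under the same `birthDate` property, which is the effect attribute normalization is meant to have when data from different encyclopedias share one schema.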
Keywords/Search Tags: Network data, Information extraction, Semi-structured data, Unstructured data, Resource Description Framework (RDF)