
Research On The Application Of Named Entity Recognition In Content Mining Of Chinese Local Chronicles

Posted on: 2012-07-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S L Zhu
GTID: 1115330368985530
Subject: History of science and technology
Abstract/Summary:
Ancient books such as Chinese local chronicles originated very early, were compiled over a long period, and survive in many types and in large numbers. According to the Union Catalog of Chinese Local Chronicles, more than 110,000 volumes of 8,264 local chronicles are still preserved, accounting for about one-tenth of Chinese ancient books, and these are only the ones compiled from the Song Dynasty to the Republic of China. Collecting and using local chronicles is a long-standing Chinese tradition. In the 1950s, Wan Guoding, the well-known historian of agriculture and one of the principal founders of the discipline of Chinese agricultural history, led dozens of people in extracting and compiling the thematic collection Local Chronicle: Produce. This material is of great value to agricultural science and technology and to economics, because it records in detail the names, characteristics, uses and distributions of products. In today's information age, how to use information technology to collect material from local chronicles while making that material easier to exploit has become a practical question. Based on Local Chronicle: Produce, this paper attempts to explore a new method for collecting ancient books such as local chronicles.

First, the author reviews the main content of local chronicle collection, the various collection methods, and the existing research. The paper then describes the origin of Local Chronicle: Produce and recounts how it was collected, both by hand and digitally. After this, the problems in collecting local chronicles are analyzed, and the purpose and significance of the present research are set out.
The paper then introduces the basics of named entity recognition: its concept, role and tasks, as well as its characteristics and difficulties. The author also summarizes current related research in China and abroad and discusses methods of named entity recognition. Finally, the author formulates a method for recognizing location names in Local Chronicle: Produce, according to the characteristics of Chinese local chronicles and of the location names in Local Chronicle: Produce.

Based on the Local Chronicle: Produce of Guangdong, Fujian and Taiwan, this paper focuses on constructing a system for recognizing location names in Local Chronicle: Produce and on exploring a method for content mining of Chinese local chronicles. Then, using statistics on the recognition results, the author studies products, location names and rules. The main contents are as follows:

(1) The location name recognition system for Local Chronicle: Produce comprises two functional modules: a full-text database and a location name recognition subsystem.

Construction of the full-text database: Based on the statement format of the Local Chronicle: Produce of Guangdong, Fujian and Taiwan, this paper defines a standard text format and designs the database structure, drawing on the previous analysis. The full-text database supports full-text retrieval, keyword retrieval, cluster retrieval and data analysis.

Location name recognition subsystem: It combines rule-based and statistics-based methods with the particular features of local chronicles to automatically recognize the location names associated with products. The subsystem provides rule management, location name recognition, a location name database and statistical reporting.
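As an illustration of the rule-based half of such a subsystem, here is a minimal sketch in Python. The abstract does not give the actual rules, so the trigger words, the place-name suffixes and the names `PLACE`, `RULES` and `recognize_locations` are all assumptions made for this example; the statistics-based component is not covered.

```python
import re

# Hypothetical rules: each pairs a label with a regex whose "place" group
# captures a candidate location name. The patterns are illustrative only
# and are not taken from the dissertation's rule base.
PLACE = r"(?P<place>[\u4e00-\u9fff]{1,3}[州縣府])"
RULES = [
    ("produced_in", re.compile(r"出" + PLACE)),        # e.g. "出潮州"
    ("introduced_from", re.compile(r"來自" + PLACE)),  # e.g. "來自泉州"
]


def recognize_locations(text):
    """Return (rule_label, place_name) pairs for every rule match in text."""
    hits = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            hits.append((label, match.group("place")))
    return hits
```

Keeping the rules in a plain list mirrors the idea of a managed rule base: adding or refining a rule changes the data, not the matching code.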
Tests show that the system can meet the needs of researchers in ancient book retrieval and knowledge discovery, and that recognition performance can be improved by gradually refining the rules.

(2) Analysis of the products in Local Chronicle: Produce: This paper compiles and analyzes statistics on all products recorded in the Local Chronicle: Produce of Guangdong, Fujian and Taiwan by historical period, by type of local chronicle, and by region. By historical period, the average number of products recorded per local chronicle increases progressively from the Ming Dynasty to the Qing Dynasty and then to the Republic of China. By type of chronicle, the average number of products recorded per chronicle decreases gradually from the province to the district and then to the county. By region, the production areas covered by the Local Chronicle: Produce of Guangdong, Fujian and Taiwan include not only the three provinces but also all of Hainan Province and parts of Guangxi Province.

(3) Content mining of Local Chronicle: Produce based on location names covers statistics and analysis of all correctly recognized location names, the distribution of products across the provinces, the dissemination of products, and the introduction of products from other places.

All of these statistics and analyses are based on the 7179 valid location name recognition records. For each province, the records are classified and analyzed according to whether the name lies within the province, outside the province, or abroad, or covers a wide area.
Statistical analysis shows that, compared with the other two provinces, Taiwan Province's exchange and communication with the outside world is relatively wider.

Based on the relevant statistical data, the study of product distribution analyzes the specific distribution of products in Guangdong, Fujian and Taiwan and uses ArcGIS software to draw thematic maps, so that the content can be shown comprehensively and intuitively. The results show that two main factors determine the diversity of local products. The first is the region's natural factors, including its geographical location, natural environment and climatic conditions. The second is the human factors in the region, including the development and use of natural resources and the introduction of products from other places.

Based on the relevant statistical data, the study of the dissemination of provincial products analyzes in detail the spread of products in Guangdong, Fujian and Taiwan, again using ArcGIS to draw thematic maps. The results show that the extent of inter-regional exchange and dissemination of products decreases gradually as the distance between regions increases: the greater the distance, the less exchange and dissemination takes place.

Based on the relevant statistical data, the study of the introduction of products from other places compares the introduction of products in Guangdong, Fujian and Taiwan provinces. The results show that two factors promote the introduction and spread of products. The first is trade between regions.
The second is colonial aggression and war.

(4) Content mining of Local Chronicle: Produce based on the recognition rules covers statistical analysis of all the recognition rules, comparison of product distribution across the provinces, and study of the ways in which products are disseminated and introduced.

All of these statistics and analyses are based on the recognition rules of the 7179 valid recognition records. According to the meaning the rules express, the system classifies the recognition rules into two types: rules that identify the places where products are distributed, and rules that identify the places from which products are introduced.

Based on the statistical data on the recognition rules, this paper discusses the distribution of products, detailing their places of origin, the places where they are distributed, their merits, and the descriptions given in the local records. It also summarizes the places of origin and high-yield places of some products.

Based on the statistical data on the recognition rules, this paper also explores how products are introduced from other places and how they spread to others. It summarizes three main routes by which products were introduced and spread in the Ming and Qing Dynasties: foreign trade, tribute, and transmission by monks.

In short, this paper takes Local Chronicle: Produce as its corpus and uses named entity recognition technology to recognize the location names in it. Based on bibliometric analysis of the recognition results, the paper studies content mining of Local Chronicle: Produce in order to explore a new, content-based method of collecting ancient books.
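The two rule types described above (distribution rules and introduction rules) suggest a simple way of organizing recognition records for analysis. The following Python sketch assumes each record is a `(product, place, rule_type)` tuple. That record layout, the function names and the placeholder rows are hypothetical, and none of them comes from the dissertation's data.

```python
from collections import Counter

# Hypothetical recognition records: (product, place, rule_type).
# rule_type is "distribution" or "introduction", the two rule classes
# named in the abstract; the rows themselves are invented placeholders.
records = [
    ("product_a", "place_x", "distribution"),
    ("product_a", "place_y", "introduction"),
    ("product_b", "place_x", "distribution"),
]


def count_by_rule_type(rows):
    """Count how many records were produced by each type of rule."""
    return Counter(rule_type for _, _, rule_type in rows)


def places_by_product(rows, rule_type):
    """Map each product to the sorted places recorded under one rule type."""
    result = {}
    for product, place, kind in rows:
        if kind == rule_type:
            result.setdefault(product, set()).add(place)
    return {product: sorted(places) for product, places in result.items()}
```

Calling `places_by_product(records, "introduction")` would list, for each product, the places it is recorded as having been introduced from, which is the kind of grouping a distribution or introduction analysis could start from.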
The innovations of this paper are:

(1) It applies the theories and methods of named entity recognition to ancient books such as Chinese local chronicles in order to recognize location names in them.

(2) It uses bibliometric methods to analyze the product names, location names and recognition rules in the recognition results for Local Chronicle: Produce. Knowledge about the distribution, dissemination and introduction of products is acquired, achieving content-based digital collection of ancient books.

(3) It uses GIS thematic maps so that the distribution and introduction of the products in Local Chronicle: Produce can be shown more intuitively. This breaks with the traditional mode of purely written expression, so that the spatial features of the chronicles can be fully revealed.

Named entities include person names, location names, organization names and so on. This paper recognizes only the location names in the Guangdong, Fujian and Taiwan portions of Local Chronicle: Produce. In the future, other entities such as person names and organization names could be recognized by modifying, re-entering or reorganizing the rules, so that the information in ancient books can be mined and used from multiple perspectives, providing historical reference evidence for industrial and agricultural production and for scientific research.
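The grouped averages mentioned in the abstract (the average number of products per chronicle by period or by chronicle type) amount to a group-and-average computation. Here is a minimal Python sketch. The field names `period`, `level` and `products`, the function name and the sample numbers are assumptions chosen for illustration and do not reproduce any figures from the dissertation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: one row per chronicle. The counts are arbitrary
# placeholders, not results from the dissertation.
chronicles = [
    {"title": "chronicle_1", "period": "Ming", "level": "county", "products": 10},
    {"title": "chronicle_2", "period": "Ming", "level": "province", "products": 20},
    {"title": "chronicle_3", "period": "Qing", "level": "county", "products": 30},
]


def average_products_by(rows, key):
    """Average the 'products' count of the rows in each group defined by key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["products"])
    return {group: mean(values) for group, values in groups.items()}
```

The same function serves both breakdowns: pass `"period"` to group by historical period or `"level"` to group by type of chronicle.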
Keywords/Search Tags: Local chronicle, Local Chronicle: Produce, Location name recognition, Content mining, Ancient books collection