Research Of Web Chinese Information Intelligent Extraction And Classification

Posted on: 2006-08-29  Degree: Doctor  Type: Dissertation
Country: China  Candidate: M Hu  Full Text: PDF
GTID: 1118360155953719  Subject: Computer application technology
Abstract/Summary:
The rapid development of the World Wide Web has driven the growth of the Internet. As Internet technology advances, the volume of online information keeps expanding and the demand for exploiting it keeps rising, while the techniques of data warehousing, data mining and intelligent computation are in the ascendant. These trends indicate that data resources are increasingly plentiful and that the quantity of information will continue to explode. The study of information retrieval, information acquisition and information classification is therefore becoming more and more important. Nevertheless, the methods and tools currently in wide use for information searching, extraction and classification are far from satisfactory when dealing with abundant information, and the techniques of information extraction and classification lack intelligence. More and more people find that the World Wide Web cannot satisfy their growing needs. It has at least the following limitations in extraction and classification: 1. The data that the World Wide Web provides are abundant, but they lack description, that is, metadata. 2. The links provided by HTML, the basis of the World Wide Web, lack semantic meaning. 3. The quality and effectiveness of keyword-based web search engines are far from satisfactory, mainly in that the relevance of search results is not ideal and the results are displayed in only one form. The defects of current information searching tools are obvious. First, they cannot express the semantic relationships between keywords. Second, they cannot define indexing words and related words well. Third, the retrieved texts are not very accurate.
To solve the above problems, an effective approach is to apply the techniques of data mining, artificial intelligence and natural language understanding to information searching, extraction and classification. To support effective searching, classification and selection of Chinese texts, this dissertation makes a further study of indexing descriptors and of the classification of Chinese texts, using data mining, statistical theory and the author's findings in the field of Chinese text. Furthermore, the global dynamic shape of the genetic algorithm is analyzed on a simplified 2-bit problem, and the global convergence of the algorithm is proved. Based on the genetic algorithm, this dissertation puts forward a heuristic mutation strategy and a Chinese text summarizing and extracting algorithm that makes use of the domain background knowledge put forward in this dissertation. Lattice machine theory is extended and applied to the classification of Chinese texts, and the result is comparatively satisfactory. An experimental framework for Agent-based web information mining is established on the basis of BDI logic research. This dissertation has mainly accomplished the following tasks: 1. Put forward and implement an automatic selection algorithm for Chinese text indexing descriptors. The algorithm is based on the method of automatically indexing descriptors. The method also fuses keyword segmentation with descriptor indexing by combining statistical theory with text background knowledge. It also puts forward a finite weight for indexing descriptors. The algorithm is applied in the "Documentary Files Automatic Indexing Full Text Searching System" (a project of the State Archives Administration of China), and the result achieved is very satisfactory. 2. Currently, analysis of the operational mechanism of the genetic algorithm mainly focuses on limit behavior, while the algorithm's global dynamic shape has been studied comparatively little.
Starting from a typical simple 2-bit problem, the global dynamic shape of the GA is analyzed comprehensively. Four mathematical models are established for different choices of parameters. By analyzing the attraction of every fixed point of these models, the influence of different evolution operators on the dynamic shape is revealed. Global convergence is proved for this problem. Put forward and implement a Chinese text descriptor indexing algorithm based on the genetic algorithm. An encoding scheme, a fitness function, a selection strategy and genetic operators are devised to solve the descriptor indexing problem. A heuristic mutation probability formula is devised by introducing heuristic information, drawn from acquired domain knowledge, into the design of the mutation operator. These measures produce a comparatively satisfactory result. 3. Extend the "equilabeled" concept of the lattice machine and put forward the concept of "intersection labeled" so as to solve the problem of multi-decision attribute...
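The fixed-point view of the 2-bit genetic algorithm mentioned in task 2 can be illustrated with a small numerical sketch. It is not one of the dissertation's four models, whose equations the abstract does not give. It assumes a standard infinite-population model with fitness-proportional selection followed by bitwise mutation. The fitness values, the mutation rate and every function name are hypothetical and chosen only for illustration.

```python
# Sketch: infinite-population GA dynamics on a 2-bit problem.
# Assumptions (not from the dissertation): proportional selection,
# independent bitwise mutation with rate mu, hypothetical fitness values.

def hamming(a, b):
    """Number of differing bits between two integer-coded genotypes."""
    return bin(a ^ b).count("1")

def mutation_matrix(mu, bits=2):
    """M[i][j] = probability that genotype i mutates into genotype j."""
    n = 1 << bits
    return [
        [(mu ** hamming(i, j)) * ((1 - mu) ** (bits - hamming(i, j)))
         for j in range(n)]
        for i in range(n)
    ]

def step(p, fitness, M):
    """One generation: proportional selection, then mutation."""
    mean_fit = sum(f * x for f, x in zip(fitness, p))
    selected = [f * x / mean_fit for f, x in zip(fitness, p)]
    n = len(p)
    return [sum(selected[i] * M[i][j] for i in range(n)) for j in range(n)]

def iterate_to_fixed_point(p, fitness, mu, tol=1e-12, max_iter=10000):
    """Iterate the generation map until the proportions stop changing."""
    M = mutation_matrix(mu)
    for _ in range(max_iter):
        q = step(p, fitness, M)
        if max(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    return p

# Hypothetical fitness for genotypes 00, 01, 10, 11.
FITNESS = [1.0, 2.0, 2.0, 4.0]
fixed = iterate_to_fixed_point([0.25] * 4, FITNESS, mu=0.01)
```

In this sketch, `fixed` holds the genotype proportions at which one further generation leaves the population unchanged. Repeating the run with different starting proportions or operators (for example `mu=0.0`, selection only) is one way to explore how an operator changes which fixed point attracts the population.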
Keywords/Search Tags: data mining, indexing classification, genetic algorithm, system dynamics, fixed point, attraction, information classification, lattice machine, multi-agent communication, speech act