
Research On Web Information Retrieval Based On Data Mining

Posted on: 2007-12-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Xu
Full Text: PDF
GTID: 1118360215496996
Subject: Computer application technology
Abstract/Summary:
The Internet gives people easy access to information, but the amount of publicly available information on the web is growing explosively, and only a small portion of it is truly relevant or useful to any one person. Helping people find what they need on the Internet has therefore become a problem. Search engines help to some extent, but they cannot completely satisfy users' needs. Against this background, and based on an analysis of the structural models of present search engines, this dissertation proposes a novel search engine model (the hybrid model) and applies methods and theories from data mining to several problems in information retrieval. The work covers how to construct a rational search engine, how to organize Internet resources sensibly, how to find implicit resources on the Internet, and how to maintain the data obtained. The main contributions of this dissertation are summarized as follows:

Firstly, based on an analysis of the structural models of present search engines, a novel search engine model (the hybrid model) is proposed that can find what users need quickly and accurately, and some key techniques for implementing the hybrid model are analyzed.

Secondly, a hierarchical document categorization algorithm based on the Fisher linear discriminant (HDCF) is proposed. The algorithm categorizes documents hierarchically according to their topics: it uses the idea of the Fisher linear discriminant to obtain positive and negative feature words for each category, and then categorizes the given document. The algorithm does not rely on the assumption that feature words appear independently in documents, and it handles documents that belong to more than one category.
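The abstract does not give the HDCF scoring formula, so the following is only a minimal sketch of how a Fisher-style criterion could separate positive from negative feature words for one category. The function name `fisher_scores`, the term-frequency dictionaries, and the signed score (squared difference of class means over the sum of class variances) are assumptions made for illustration, not the dissertation's definitions.

```python
from statistics import mean, pvariance


def fisher_scores(docs, labels, category):
    """Score every term for one category with a Fisher-style criterion.

    docs:   list of dicts mapping term -> frequency (hypothetical input format)
    labels: list of category names, one per document
    Returns {term: signed score}. A positive score marks a candidate
    positive feature word, a negative score a candidate negative one.
    """
    inside = [d for d, y in zip(docs, labels) if y == category]
    outside = [d for d, y in zip(docs, labels) if y != category]
    if not inside or not outside:
        raise ValueError("need documents both inside and outside the category")

    vocabulary = {term for d in docs for term in d}
    scores = {}
    for term in vocabulary:
        x_in = [d.get(term, 0) for d in inside]
        x_out = [d.get(term, 0) for d in outside]
        diff = mean(x_in) - mean(x_out)
        spread = pvariance(x_in) + pvariance(x_out)
        # Between-class separation over within-class spread, keeping the sign
        # of the difference; the small constant avoids division by zero.
        scores[term] = diff * abs(diff) / (spread + 1e-9)
    return scores
```

Terms with large positive scores would be kept as positive feature words and terms with large negative scores as negative feature words; where to set the cut-offs is left open here.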
In experiments that compare HDCF with other algorithms using recall and precision, the results show that HDCF is more effective.

Thirdly, to meet the need for online document classification, a semi-supervised learning system based on ART (adaptive resonance theory) is proposed. It removes the assumption, made by other semi-supervised learning algorithms, that the probability distribution of the data is known, and thanks to the stability and plasticity of adaptive resonance theory it has a strong ability to learn new patterns and correct errors. Setting the vigilance parameters dynamically further improves the adaptability of the system. Experimental results show that the proposed system performs better than the discriminant CEM (classification expectation maximization) algorithm, particularly when noisy data and new patterns are present.

Fourthly, to overcome the limitations of existing models, a new model of cyclic association rules is proposed that can divide a cycle into several time segments of possibly different lengths through clustering analysis; the corresponding algorithm is also given. Experiments show that it discovers cyclic association rules more precisely. In addition, cyclic generalized association rules are difficult to find among data items at low or primitive levels of abstraction, because multidimensional data is sparse and much data has multiple levels of abstraction. An algorithm for mining Cyclic Generalized Itemsets (CGI) is therefore proposed, since such rules may be discovered at higher levels.
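To make the notion of a cyclic itemset concrete, here is a minimal sketch that checks whether an itemset recurs with a fixed period across time segments. It assumes equal-length segments and a simple (length, offset) cycle, which is simpler than the clustered segments of varying length described above. The function name `find_cycles` and the tolerance parameter `max_noise_ratio` (the fraction of expected segments in which the itemset is missing) are hypothetical and are not taken from the dissertation.

```python
def find_cycles(frequent, max_length, max_noise_ratio=0.0):
    """Return (length, offset, noise_ratio) for each cycle the itemset follows.

    frequent: list of booleans; frequent[i] is True when the itemset is
              frequent in time segment i.
    A cycle (length, offset) expects the itemset in segments
    offset, offset + length, offset + 2 * length, ...
    """
    cycles = []
    for length in range(1, max_length + 1):
        for offset in range(length):
            expected = frequent[offset::length]
            if not expected:
                continue
            # Share of expected segments where the itemset was not frequent.
            noise = expected.count(False) / len(expected)
            if noise <= max_noise_ratio:
                cycles.append((length, offset, noise))
    return cycles


# The itemset is frequent in every third segment, starting at segment 0.
history = [True, False, False, True, False, False, True, False, False]
print(find_cycles(history, max_length=4))  # [(3, 0, 0.0)]
```

With `max_noise_ratio=0.0` a single missing segment disqualifies a cycle; raising the tolerance keeps cycles that are interrupted occasionally.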
Because cyclic rules may be quite sensitive to even a little noise, the dissertation uses the noise ratio as the criterion for identifying cyclic itemsets. It also applies a cycle-pruning technique, which exploits the relationship between cycles and generalized frequent itemsets, to reduce the computing time of the mining process, and it analyzes the meaning and redundancy of the discovered rules. Experiments show that the CGI algorithm yields significant results efficiently.

Fifthly, association rules reveal regularities in large amounts of data, but there are sometimes too many of them to understand, which creates a new knowledge-management problem. To solve it, a novel algorithm is presented that uses a hierarchical clustering method whose distance definition (called RadioD) is based on the attributes of association rules, so that the rules can be grouped effectively and understood better. Experimental results show that the algorithm is effective.

Sixthly, there are two main kinds of methods for maintaining discovered sequential patterns. One simply reapplies a sequential pattern mining algorithm to the updated database, but this scans the unchanged data in the original, very large database as well as the changed data, so it takes a long time if the database is updated frequently. The other uses the number of changed records to decide when to reprocess the whole database, but the number of changed sequential patterns is not proportional to the number of changed records. The dissertation therefore uses sampling techniques to estimate the degree to which the sequential patterns have changed, and uses that estimate to decide whether to update the mined patterns by reprocessing the whole database. This handles the maintenance of sequential patterns better.
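The sampling idea in the sixth contribution can be sketched as follows. This is only an illustration under simplifying assumptions: sequences are plain lists of items rather than lists of itemsets, the estimate counts only previously mined patterns that fall below the support threshold in the sample (newly emerging patterns are ignored), and the names `estimated_change`, `should_remine` and `change_threshold` are hypothetical.

```python
import random


def is_subsequence(pattern, sequence):
    """True when the items of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, sequences):
    """Fraction of sequences that contain pattern as a subsequence."""
    if not sequences:
        return 0.0
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)


def estimated_change(mined_patterns, sample, min_support):
    """Share of previously frequent patterns that are infrequent in the sample."""
    if not mined_patterns:
        return 0.0
    lost = sum(1 for p in mined_patterns if support(p, sample) < min_support)
    return lost / len(mined_patterns)


def should_remine(mined_patterns, updated_db, sample_size,
                  min_support, change_threshold, seed=0):
    """Decide from a random sample whether a full re-mining pass looks worthwhile."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(updated_db, min(sample_size, len(updated_db)))
    return estimated_change(mined_patterns, sample, min_support) > change_threshold
```

Only the sample is scanned to make the decision, so the full database is reprocessed only when the estimated change exceeds the chosen threshold.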
Keywords/Search Tags: search engine, hierarchical document categorization, adaptive resonance theory, semi-supervised learning, cyclic association rules, cyclic generalized association rules, grouping, sequential patterns