Research On Several Association Rule Mining Problems For Web Information Retrieval System

Posted on: 2010-07-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Shen
Full Text: PDF
GTID: 1118360278965446
Subject: Computer Science and Technology
Abstract/Summary:
In the current century, the information explosion has become remarkable, with content updated at high speed, and users' requirements for search results continue to rise, so how to obtain useful information from the huge amount of web information resources is a vital problem. On the one hand, in some situations searching by keywords is not the most efficient way to find web pages that contain the required information. Mining the association relationships among web pages can guide users from one useful page to further useful pages. On the other hand, many web novices are not good at using a few simple words to describe their complex search targets correctly. Because of the many abbreviations, synonyms and associated words, language is inherently ambiguous. Accordingly, the same word can represent different search demands; likewise, the same search demand can be described by different words. It is therefore helpful to mine association relationships in order to construct effective search terms and find the desired information. Since the quality of the search results of Chinese web information retrieval systems is still not very good, this dissertation focuses on solving several association rule mining problems in web information retrieval systems. The contributions are as follows:

1. Based on an analysis of the link relationships between web pages, a new algorithm for mining related pages is proposed in this dissertation. An HTML segmentation step is first introduced into the process of mining related pages. Combined with other techniques, such as page template filtering and anchor text similarity boosting, the algorithm improves the precision of related pages. To handle large corpora in practical engineering projects, a detailed flowchart of how to implement the algorithm in parallel is also given.

2. Chinese abbreviations are widely used in Chinese texts for convenience or to save space. Since abbreviations and their original definitions can be substituted for each other freely without changing the meaning of an article, they pose a considerable challenge for web information retrieval. For this reason, an effective and novel approach is proposed to identify Chinese abbreviations and their definitions automatically. First, the longest common subsequence algorithm is used to extract candidate abbreviation-definition pairs from anchor texts. Then, a support vector machine model is trained to filter the genuine abbreviation-definition pairs from the candidates. Experimental results show encouraging performance.

3. Mining the association relationships between Chinese words and clustering the words by topic can help a web information system provide diverse search results and generate related queries. In this dissertation, a simple but powerful algorithm for clustering Chinese words is proposed that exploits the characteristics of Chinese punctuation. The algorithm can efficiently cluster paratactic words in a large Chinese corpus by applying an approximation of the dense subgraph mining algorithm to a bipartite graph. Two further algorithms are proposed to improve the precision and recall of the word clusters. Many Chinese words within the same topic can be obtained with these algorithms. Experimental results indicate that the algorithm is well suited to clustering Chinese terms and to application in practical engineering.

4. How to help users construct precise queries that describe their search targets is an important research area in web information retrieval. In this dissertation, a composite framework is proposed to suggest related queries for the original queries submitted by users. The framework suggests related queries according to several factors, such as relevance, popularity and effectiveness, in order to narrow users' targets and obtain search results with higher precision. In addition, the framework uses click information in query logs, Chinese abbreviation-definition pairs, Chinese word clusters and Chinese synonyms to modify the original query without changing its meaning, which can help users get more results relevant to their search targets. Experiments show that the framework can suggest related queries for users with high efficiency. The quality of the search results of a web information system may be improved by this framework.
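
The candidate-extraction step for abbreviation-definition pairs can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes the input is a set of anchor texts that all point to the same page, and that a shorter text is a candidate abbreviation of a longer one when the longest common subsequence of the two equals the shorter text. The function names `lcs_length` and `candidate_pairs` and the sample strings are hypothetical, chosen only for illustration. The subsequent support vector machine filtering step is not shown.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def candidate_pairs(anchor_texts):
    """Return (abbreviation, definition) candidates from anchor texts.

    Assumption (not from the source): all texts are anchors of one page, and
    a shorter text is a candidate when each of its characters appears, in
    order, in the longer text.
    """
    # Sort by length, then alphabetically, so the output order is deterministic.
    texts = sorted(set(anchor_texts), key=lambda t: (len(t), t))
    pairs = []
    for i, short in enumerate(texts):
        for long_text in texts[i + 1:]:
            if len(short) < len(long_text) and lcs_length(short, long_text) == len(short):
                pairs.append((short, long_text))
    return pairs


# Hypothetical anchor texts for a single page.
pairs = candidate_pairs(["北大", "北京大学", "清华大学"])
print(pairs)
```

Such a filter is deliberately permissive: it only proposes candidates, and a trained classifier, as described in contribution 2, would be needed to reject pairs that match by coincidence.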
Keywords/Search Tags: Web information retrieval, Distributed computing framework, Related pages, Chinese abbreviation-definition pairs, Chinese words clustering, Chinese related queries, Chinese language model, Chinese synonyms mining