
Research On The Key Techniques Of Web Information Intelligent Acquisition

Posted on: 2005-09-29  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Z Y Jia  Full Text: PDF
GTID: 1118360185495665  Subject: Computer software and theory
Abstract/Summary:
The Internet has opened a whole new world for people. In this virtual world, people share information and communicate with each other in a totally new manner. Anyone can publish any information anywhere at any time, which makes the Internet the most important source of information. Yet people still find themselves lost in a sea of electronic documents because it is difficult to find what they really want. Studying and developing intelligent tools for web information acquisition, which help users locate useful information effectively, has proved widely applicable and practically valuable, and has therefore attracted much attention in recent years. This dissertation studies the models, algorithms, and applications of several key research topics in information acquisition, namely information crawling, information extraction, news event detection and tracking, and the cause-and-effect organization of news events. The major contributions of this dissertation are as follows: (1) By analyzing the process and examples of information gathering, a multi-agent model of web information gathering is proposed based on agent theories and techniques, and the model is implemented and applied on MAGE. At the same time, to satisfy users' personalized demands for topic-specific information, we analyze the characteristics of topic web pages and the techniques of topic-focused crawling, and put forward three topic-related computing models: a model based on hyperlink relations, a model based on the HTML metadata of URLs, and a model based on page content.
A topic-specific tracking mode of information gathering is implemented on the basis of the three topic-related computing models described above, and the experiments show the flexibility, extensibility, and effectiveness of the multi-agent model of information gathering and the feasibility of the three topic-related computing models. (2) By analyzing the characteristics of noise data, we put forward three noise detection models: a model based on the HTML metadata of URLs, a model based on the redundancy of noise data, and a model based on the information entropy of the text of noise URLs. Using a noise elimination algorithm built on these three models, practical experiments have demonstrated the feasibility and validity of the models and that the algorithm can provide high-quality data for subsequent processing, such as text classification, clustering, news event detection and tracking, and the organization of cause-and-effect relations among news events. (3) An algorithm for discovering new words and phrases in a large corpus is proposed that combines statistical selection with rule-guided filtering. First, the algorithm performs segmentation and part-of-speech (POS) tagging of the corpus, followed by co-occurrence analysis of the preprocessed corpus based on a bigram statistical model. The bigram statistics are then used to filter candidate words/phrases with statistical methods. To eliminate noise in word/phrase selection, many kinds of rules, such as rules based on POS, on length of words, and on...
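
The combination of bigram statistics and rule-guided filtering in contribution (3) might look roughly like the sketch below. It is an illustration only, not the dissertation's algorithm: the abstract does not name its statistical measure, so pointwise mutual information (PMI) is swapped in here as an assumed stand-in. The function name candidate_phrases, the thresholds min_count and min_pmi, and the tag names in blocked_tags are all hypothetical. The input is assumed to be already segmented and POS-tagged, with each token given as a (word, tag) pair.

```python
import math
from collections import Counter


def candidate_phrases(tagged_sentences, min_count=2, min_pmi=1.0,
                      blocked_tags=frozenset({"PUNCT", "PART"})):
    """Score adjacent token pairs by PMI, then apply a simple POS rule.

    PMI, the thresholds, and the blocked tag set are assumptions made for
    this sketch; the source abstract does not specify them.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in tagged_sentences:
        unigrams.update(sentence)
        # Adjacent (token, token) pairs: the bigram co-occurrence counts.
        bigrams.update(zip(sentence, sentence[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    results = []
    for (left, right), count in bigrams.items():
        # Statistical selection: ignore pairs that are too rare.
        if count < min_count:
            continue
        # Rule-guided filtering: reject pairs containing a blocked POS tag.
        if left[1] in blocked_tags or right[1] in blocked_tags:
            continue
        p_pair = count / total_bigrams
        p_left = unigrams[left] / total_unigrams
        p_right = unigrams[right] / total_unigrams
        pmi = math.log2(p_pair / (p_left * p_right))
        if pmi >= min_pmi:
            results.append(((left[0], right[0]), count, pmi))

    # Strongest associations first.
    results.sort(key=lambda item: item[2], reverse=True)
    return results


corpus = [
    [("web", "NOUN"), ("mining", "NOUN"), ("is", "VERB"), ("useful", "ADJ")],
    [("web", "NOUN"), ("mining", "NOUN"), ("finds", "VERB"), ("patterns", "NOUN")],
    [("the", "DET"), ("web", "NOUN"), (",", "PUNCT"), ("the", "DET"), ("data", "NOUN")],
]
for words, count, pmi in candidate_phrases(corpus):
    print(words, count, round(pmi, 2))
```

In this toy corpus only the pair ("web", "mining") occurs often enough to survive the frequency threshold. A real system would add further rules of the kind the abstract mentions, such as limits on word length.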
Keywords/Search Tags:information acquisition, information extraction, knowledge discovery, data mining, text mining, web mining, information gathering, topic-focused crawling, noise eliminating, information retrieval, text classification, clustering, text summarization