Research And Implementation Of Internet Information Gather & Process System

Posted on: 2006-10-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Liang
GTID: 2178360182477324
Subject: Computer technology
Abstract/Summary:
With the spread of Internet applications, the focus of data mining has shifted from database mining to Web mining. This thesis summarizes Internet information gathering and Web mining. Using a Web page model, spider (crawler) search technology, and content parsing technology, we develop a flexible, visual Internet Information Gather & Process System. The system automatically crawls the Web to gather information, filters and extracts information according to the user's choices, and finally saves it to a database.

Support Vector Machine (SVM) is a pattern recognition method developed in recent years on the basis of statistical learning theory. It has particular advantages in nonlinear and high-dimensional pattern recognition. The Web data mining component of the proposed system uses the automatic classification function provided by the SVM-light software to classify Web information automatically. The experimental results show that the system performs well in both precision and speed, and that it can effectively discover "noticeable" information.

Features specify which information is relevant to a classification task, and the number of features affects the classifier's speed: a large number of features can lead to long training and classification times. Feature selection is therefore closely related to input dimensionality reduction and is an important preprocessing step in automatic text classification. It reduces the size of the vocabulary used to represent text documents and thus makes classification more efficient. Moreover, careful feature selection often improves the classifier's accuracy. Feature selection for text classification is based on a greedy filtering approach: each word is evaluated independently with a statistical function and assigned a feature score.
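The filtering approach described above can be sketched as follows: every word is scored independently, and only the highest-scoring words are kept as the vocabulary. This is a minimal illustration rather than the thesis's implementation. It uses document frequency as the score, and the function names `document_frequency` and `select_top_k` as well as the toy corpus are assumptions made for the example.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each word once per document
    return df


def select_top_k(scores, k):
    """Keep the k highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["web", "mining", "spider"],
    ["web", "svm", "classifier"],
    ["web", "mining", "svm"],
]
vocabulary = select_top_k(document_frequency(docs), k=3)
```

Because each word is scored without reference to the others, any other per-word scoring function can be passed to `select_top_k` in place of document frequency.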
This thesis analyzes Document Frequency, Information Gain, Mutual Information, and the CHI statistic in detail and compares them through experiments. The results show that feature selection methods that perform well in ordinary text categorization are unsuitable for Chinese Web texts. The reasons for this difference are analyzed, and better feature selection methods are proposed. These methods help improve classification efficiency and speed up the classifiers.
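As a rough illustration of two of the scores named above, the sketch below computes the CHI statistic and a Mutual Information estimate for one term and one category from a 2x2 contingency table of document counts. It follows the common textbook definitions; whether the thesis uses exactly these variants is an assumption, and the function names and counts are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """CHI statistic of a term and a category from a 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence of association
    return n * (a * d - c * b) ** 2 / denominator


def mutual_information(a, b, c, d):
    """Mutual information estimate log(a * n / ((a + c) * (a + b)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + c) * (a + b)))


# Hypothetical counts: the term appears in 8 of 10 in-category documents
# and in 2 of 10 out-of-category documents.
chi = chi_square(8, 2, 2, 8)
mi = mutual_information(8, 2, 2, 8)
```

With these counts both scores indicate a positive association between the term and the category. When the term is distributed independently of the category (for example, all four counts equal), both scores are 0.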
Keywords/Search Tags:Internet Information Gather, Web Textual Mining, Support Vector Machine, Feature Selection