Research And Implementation Of Internet Information Gather & Process System

Posted on:2006-10-05

Degree:Master

Type:Thesis

Country:China

Candidate:Z J Liang

Full Text:PDF

GTID:2178360182477324

Subject:Computer technology

Abstract/Summary:

With the spread of Internet applications, the focus of data mining has shifted from database mining to Web mining. This thesis surveys Internet information gathering and Web mining. Using a Web page model, spider (crawler) search technology, and content parsing technology, we develop a flexible, visual Internet Information Gather & Process System. The system automatically crawls the Web to gather information, filters and extracts information according to the user's choices, and finally saves the results to a database.

Support Vector Machine (SVM) is a pattern recognition learning method developed in recent years on the basis of statistical learning theory. It offers particular advantages in nonlinear and high-dimensional pattern recognition. The Web data mining component of the proposed system uses the automatic classification function provided by the SVM-light software to classify Web information automatically. The experimental results show that the system performs well in both precision and speed, and that it can effectively discover "noticeable" information.

Features specify which information is relevant to a classification task, and the number of features affects the classifier's speed: a large feature set can lead to long training and classification times. Feature selection is closely related to input dimensionality reduction and is an important preprocessing step in automatic text classification. It reduces the size of the vocabulary used to represent text documents and therefore makes classification more efficient. Moreover, careful feature selection often improves the classifier's accuracy. Feature selection for text classification is based on a greedy filtering approach: each feature is evaluated independently with a statistical function, and a score is assigned to each word.

In this thesis, Document Frequency, Information Gain, Mutual Information, and the CHI statistic are analyzed in detail and compared experimentally. The experiments show that feature selection methods that behave well in ordinary text categorization are unsuitable for Chinese Web texts. The reasons for this difference are analyzed, and better feature selection methods are proposed. These methods help improve classification efficiency and speed up the classifiers.
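
The four scoring functions named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `feature_scores`, the two-class (target versus rest) setup, and the toy documents are assumptions made for illustration, and the formulas are the standard textbook definitions computed from a term/class contingency table.

```python
import math
from collections import Counter


def entropy(*counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def feature_scores(docs, labels, target):
    """Score every term against one class (hypothetical helper).

    docs   : list of token lists, one per document
    labels : list of class labels, same length as docs
    target : the class the terms are scored against
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    df_all, df_pos = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):      # document-level presence, not raw counts
            df_all[term] += 1
            if y == target:
                df_pos[term] += 1

    scores = {}
    for term, df in df_all.items():
        a = df_pos[term]              # term present, class is target
        b = df - a                    # term present, other class
        c = n_pos - a                 # term absent, class is target
        d = n - df - c                # term absent, other class
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        chi2 = n * (a * d - c * b) ** 2 / denom if denom else 0.0
        # Pointwise mutual information between term and class.
        mi = math.log2(a * n / ((a + c) * (a + b))) if a else float("-inf")
        # Information gain: class entropy minus entropy conditioned on the term.
        ig = (entropy(a + c, b + d)
              - (df / n) * entropy(a, b)
              - ((n - df) / n) * entropy(c, d))
        scores[term] = {"df": df, "chi2": chi2, "mi": mi, "ig": ig}
    return scores


# Toy corpus (made up for illustration only).
docs = [["svm", "web"], ["svm", "kernel"], ["web", "spider"], ["spider", "crawl"]]
labels = ["ml", "ml", "crawl", "crawl"]
scores = feature_scores(docs, labels, target="ml")
top = sorted(scores, key=lambda t: scores[t]["chi2"], reverse=True)[:2]
```

Because each term is scored independently, selecting a vocabulary reduces to sorting by the chosen score and keeping the top k terms, which is the greedy filtering approach the abstract describes.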

Keywords/Search Tags:

Internet Information Gather, Web Textual Mining, Support Vector Machine, Feature Selection

Related items

1	A Study On Feature Selection Algorithms Based On Support Vector Machine And Its Application
2	L_p Regularization In Support Vector Machine For Features Selection
3	Research On Support Vector Machine Based Text Classification
4	Research On Intrusion Detection Methods For Multimedia Internet Of Things Based On Machine Learning
5	Feature Selection Research Based On Maximum Relevance Minimum Redundancy
6	Feature Selection Based On Linear Twin Support Vector Machine And Application
7	Incorporating K-means, Triangle Area Support Vector Machine And Feature Selection Algorithms For Intrusion Detection System
8	Support Vector Classifier Machine Based On Feature Analysis
9	The Study Of Classification Methods And Its Applications In Web Mining Based On Statistical Learning
10	Research On MicroRNA Recognition Based On Support Vector Machine