
Research Of Network Information Collection And Intelligent Processing Technology

Posted on: 2013-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: L N Zou
Full Text: PDF
GTID: 2248330371981317
Subject: Computer software and theory
Abstract/Summary:
Whether for scientific research or for study, people rely on the Internet to find the latest professional information, news, and trends, yet the explosion of information makes it increasingly difficult to find what they need. On the one hand, information on the Internet grows every day and is updated quickly, so searching takes a great deal of time. On the other hand, much of that information is duplicated and its format is not standardized, which makes searching harder for users. Technology for network information collection and intelligent processing has emerged in response.

Users can retrieve a large amount of information through search engines, but the results come without information extraction, organization, or processing. As users demand more from information acquisition, information search has moved from "general" toward "personalized and intelligent". Many information collection tools now on the market satisfy users' acquisition needs to a certain extent, but their information processing is poor. Because text accounts for a large part of the information on the Internet, automatically classifying that text has become a key technology in information processing.

Building on an analysis of existing information collection and processing technology, this paper first introduces the web crawler and analyzes the principles of web information acquisition, duplicate web page removal, and information extraction. It then studies in depth the key technology of intelligent information processing, text classification, and improves both the existing feature selection method and the text classification algorithm.
An automatic text classifier is built with the improved KNN algorithm, using the Sogou corpus as the training corpus for the classification model. Experiments determine the best K value and feature dimension for this corpus and verify that the improved KNN algorithm achieves better classification results.

The innovations of this paper are as follows:
(1) The feature selection method for text information processing is improved. The paper proposes merging synonyms by introducing TongYiCi CiLin: synonyms are replaced and counted together before feature selection, which reduces the dimension of the feature space.
(2) An improved KNN algorithm is presented. Using the cluster center vector, the distance between the text to be classified and each text category is incorporated into the similarity formula, and the ratio of the number of features common to two texts to the larger of the two texts' feature counts is used as an adjustment factor in the formula.
(3) An automatic text classifier is built with the improved KNN algorithm. In the classification stage, the relationship between the text to be classified and each category is considered first; when that relationship is ambiguous, the text is compared with all training texts, and its category is determined from the result of that comparison.
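The three ideas above can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so several parts here are assumptions rather than the thesis's method: the tiny `SYNONYMS` table is a hypothetical stand-in for TongYiCi CiLin, the adjustment factor is simply multiplied into a cosine similarity, cluster centers are plain term-frequency averages, and the `margin` threshold used to decide when the category relationship is "ambiguous" is invented for illustration.

```python
import math
from collections import Counter

# Hypothetical synonym table standing in for TongYiCi CiLin (assumption).
SYNONYMS = {"automobile": "car", "auto": "car", "film": "movie"}

def merge_synonyms(tokens, table=SYNONYMS):
    """Replace each token by its canonical synonym before feature counting."""
    return [table.get(t, t) for t in tokens]

def vectorize(tokens):
    """Term-frequency vector built after synonym merging."""
    return Counter(merge_synonyms(tokens))

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def adjust_factor(a, b):
    """Common features divided by the larger feature count of the two texts."""
    largest = max(len(a), len(b))
    return len(a.keys() & b.keys()) / largest if largest else 0.0

def adjusted_similarity(a, b):
    # Assumption: the factor scales the cosine similarity multiplicatively.
    return adjust_factor(a, b) * cosine(a, b)

def centroid(vectors):
    """Cluster center of a category: the mean term-frequency vector."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}

def classify(tokens, training, k=3, margin=0.1):
    """training is a non-empty list of (tokens, label) pairs."""
    if not training:
        raise ValueError("training set must not be empty")
    doc = vectorize(tokens)
    by_label = {}
    for toks, label in training:
        by_label.setdefault(label, []).append(vectorize(toks))
    # Stage 1: compare the text with each category's cluster center.
    centres = sorted(
        ((adjusted_similarity(doc, centroid(vs)), label)
         for label, vs in by_label.items()),
        reverse=True,
    )
    if len(centres) == 1 or centres[0][0] - centres[1][0] >= margin:
        return centres[0][1]
    # Stage 2: ambiguous, so fall back to KNN over all training texts.
    neighbours = sorted(
        ((adjusted_similarity(doc, v), label)
         for label, vs in by_label.items() for v in vs),
        reverse=True,
    )[:k]
    votes = Counter()
    for score, label in neighbours:
        votes[label] += score  # similarity-weighted vote
    return votes.most_common(1)[0][0]
```

With a toy training set such as `[(["car", "engine", "wheel"], "auto"), (["film", "director", "scene"], "film")]`, calling `classify(["auto", "engine", "road"], training)` resolves at the cluster-center stage; raising `margin` forces the fallback comparison against every training text.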
Keywords/Search Tags: Network information collection, KNN algorithm, Feature selection, Vector space model, Text classification