Data Feature Extraction Of Blogs And Filtering Of Splogs Based On Classification

Posted on:2010-07-22

Degree:Master

Type:Thesis

Country:China

Candidate:R Yan

Full Text:PDF

GTID:2178360302459878

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, blogs have become a new form of network communication, following email, BBS, and QQ/ICQ, and have quickly entered people's daily lives as a basic Internet service. Meanwhile, splogs (spam blogs) have spread rapidly to every corner of the blogosphere. The large number of splogs seriously reduces the accuracy of information retrieval and degrades the user experience, so identifying splogs precisely has become an urgent problem in the field of information retrieval. In the information security field, opinion analysis of blog content has drawn increasing attention, but splogs seriously distort its results and greatly reduce their accuracy and credibility. It is therefore necessary to filter out splogs before further analysis and retrieval.

In this thesis, we propose a part-of-speech analysis method that builds on existing splog feature extraction. It rests on three observations. First, in Chinese grammar a sentence is composed of a subject, a predicate, and an object, and spoken-style writing in particular contains many elliptical sentences made up of a subject and a predicate, or of a predicate alone. Second, most blog authors record what interests them, or their own feelings and circumstances, so genuine blogs are rich in adjectives and mood words. Third, splogs are usually written to raise click-through rates, or to raise a page's importance in search engines by adding links and keywords, so they contain many terms, especially industry-related terminology. Analyzing the part of speech of blog text and extracting part-of-speech-related features therefore greatly increases the complementarity between features and improves the effectiveness of classifiers.

We also design a dynamic assembly classification algorithm for filtering splogs. The algorithm first constructs a tree-like assembly classifier to support classification, and then applies a dynamic adjustment strategy to train it. Compared with traditional classifiers, such as a single classifier or a simple ensemble classifier, this algorithm also adjusts the combinational structure of the classifier adaptively, which reduces the impact of the sparse features and unbalanced data of splogs. Experiments show that the algorithm achieves better precision and recall for splog filtering.

Finally, we design and implement a prototype information retrieval system based on blog content with splog filtering, and it achieves good performance.
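
The part-of-speech observations above can be illustrated with a minimal sketch. It is not the feature set used in the thesis, which the abstract does not specify. It assumes each post has already been segmented and tagged by some part-of-speech tagger, and it simply computes the share of a few coarse word classes. The function name `pos_ratio_features`, the tag names, and the two toy posts are all hypothetical.

```python
from collections import Counter

# Hypothetical coarse tag names; a real tagger's tag set will differ.
TAGS = ("adj", "mood", "noun", "verb")


def pos_ratio_features(tagged_tokens):
    """Return the share of each coarse part of speech in one post.

    tagged_tokens: list of (word, tag) pairs from any POS tagger.
    """
    total = len(tagged_tokens)
    if total == 0:
        # An empty post has no part-of-speech signal at all.
        return {tag: 0.0 for tag in TAGS}
    counts = Counter(tag for _, tag in tagged_tokens)
    return {tag: counts[tag] / total for tag in TAGS}


# Toy, hand-tagged inputs invented for illustration (not thesis data).
personal_post = [("今天", "noun"), ("真", "adv"), ("开心", "adj"), ("啊", "mood")]
keyword_post = [("保险", "noun"), ("贷款", "noun"), ("理财", "noun"), ("推荐", "verb")]

personal_features = pos_ratio_features(personal_post)
keyword_features = pos_ratio_features(keyword_post)
```

In this toy case the personal post has non-zero adjective and mood-word ratios, while the keyword-stuffed post is dominated by nouns. Ratios like these could be appended to existing splog features before training a classifier.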

Keywords/Search Tags:

splog classification, assembly classifier, AdaBoost algorithm, ensemble learning, text clustering

Related items

1	Integrated Classifier Learning Algorithm
2	Research On Key Problems In Text Classification And Clustering
3	Research On Adaptive Boosting Algorithm And Ensemble Classifier
4	Research On English Text Classification Algorithm Based On Ensemble Learning
5	Research On A Combined Classification Based On Algorithm Based On AdaBoost
6	The Research On Classifier Ensemble Learning For Data Mining
7	The Improvement Of The Weighting Method In AdaBoost
8	Short Text Classification Method Based On Ensemble Learning
9	A Text Sentiment Classification Model Based On Multiple Multi-classifier Systems
10	Research On Key Issues Of Image Analysis With Ensemble Learning