Data Feature Extraction Of Blogs And Filtering Of Splogs Based On Classification

Posted on: 2010-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: R Yan
Full Text: PDF
GTID: 2178360302459878
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, blogs have become a new form of online communication, following email, BBS, and QQ/ICQ, and have quickly entered people's daily lives as a basic Internet service. Meanwhile, splogs (spam blogs) have spread rapidly throughout the blogosphere. The large number of splogs seriously degrades the accuracy of information retrieval and steadily worsens the user experience, so identifying splogs precisely has become an urgent problem in information retrieval. In the information security field, opinion analysis of blog content has drawn increasing attention, but splogs seriously distort its results and greatly reduce their accuracy and credibility. It is therefore necessary to filter out splogs before further analysis and retrieval.

In this thesis, we propose a part-of-speech analysis method that builds on existing splog feature extraction. First, in Chinese grammar a sentence is composed of a subject, a predicate, and an object, and spoken-style text in particular contains many elliptical sentences consisting of a subject and a predicate, or of a predicate alone. Second, most blog authors record their interests, feelings, and circumstances, so genuine blogs are rich in adjectives and mood words. Third, splogs are usually written to increase click-through rates or to raise a page's importance in search engines by adding links and keywords, so they contain many terms, especially industry-related terminology. Analyzing the parts of speech in blogs and extracting part-of-speech features therefore greatly increases the complementarity between features and improves the effectiveness of classifiers.

We also designed a dynamic assembly classification algorithm for filtering splogs. The algorithm first constructs a tree-like assembly classifier to support classification, and then applies a dynamic adjustment strategy to train it. Compared with traditional classifiers, such as a single classifier or a simple ensemble classifier, this algorithm adaptively adjusts the combinational structure of the classifier, which reduces the impact of the sparse features and unbalanced data typical of splogs. Experiments show that the algorithm achieves better precision and recall in splog filtering.

Finally, we designed and implemented a prototype information retrieval system for blog content with splog filtering, and it achieves good performance.
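
To make the part-of-speech idea concrete, here is a minimal sketch of how part-of-speech ratio features might be computed for one post. It is an illustration only, not the thesis's actual feature set: the function name `pos_ratio_features`, the tag groups in `TAG_GROUPS`, and the PKU-style tag codes are all assumptions, and the input is assumed to be already segmented and tagged by some external tool.

```python
from collections import Counter

# Hypothetical grouping of PKU-style tags into coarse classes.
# Both the tag codes and the grouping are assumptions for illustration.
TAG_GROUPS = {
    "noun": {"n", "nr", "ns", "nt", "nz"},
    "verb": {"v", "vd", "vn"},
    "adjective": {"a", "ad", "an"},
    "mood_word": {"y", "e"},
}


def pos_ratio_features(tagged_tokens):
    """Return, for each tag group, the fraction of tokens in that group.

    tagged_tokens is a list of (word, tag) pairs from an external tagger.
    """
    total = len(tagged_tokens)
    if total == 0:
        return {group: 0.0 for group in TAG_GROUPS}

    counts = Counter()
    for _word, tag in tagged_tokens:
        for group, tags in TAG_GROUPS.items():
            if tag in tags:
                counts[group] += 1
    return {group: counts[group] / total for group in TAG_GROUPS}


# A made-up personal post: adjectives and mood words are present.
personal = [("今天", "t"), ("天气", "n"), ("真", "d"), ("好", "a"), ("啊", "y")]
# A made-up keyword-stuffed post: dominated by nouns.
stuffed = [("手机", "n"), ("电脑", "n"), ("相机", "n"), ("优惠", "vn"), ("购买", "v")]

personal_features = pos_ratio_features(personal)
stuffed_features = pos_ratio_features(stuffed)
```

Ratios like these could be appended to the existing splog features before training a classifier. Which tags to group, and whether to use ratios or raw counts, are design choices the abstract does not specify.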
Keywords/Search Tags: splog classification, assembly classifier, AdaBoost algorithm, ensemble learning, text clustering