Research On Gender Classification Of Blog Authors

Posted on: 2013-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: F Wang
Full Text: PDF
GTID: 2248330371459430
Subject: Computer Science and Technology
Abstract/Summary:
Blogs are websites that bloggers typically manage themselves and update with new posts. As blogs have grown rapidly, so has their value as a source of information. Blogs are widely studied in natural language processing and related fields, and many commercial companies use the information in blogs to provide value-added services such as blog search, blog topic tracking, and sentiment analysis of people's opinions on products and services.

Gender classification of blog authors is a research task with considerable commercial value. For example, it can help users find which topics and products men and women talk about most, and which products and services each group likes or dislikes. This information can be used for targeted advertising and for developing targeted products. Research on gender classification of blog authors is therefore of great significance.

This thesis implements gender classification of blog authors and focuses on how to improve its accuracy. For a given blog, the relevant features are collected, and the blog is then classified using a naive Bayes classifier together with the best feature set obtained in this research. The gender classification accuracy reaches 74.49%. The work consists of four parts. First, features for gender classification of blog authors are extracted, namely common features and part-of-speech features. Second, feature selection methods are implemented, namely methods based on a single feature selection criterion and an ensemble feature selection method. Third, the naive Bayes classifier is combined with candidate feature sets and 10-fold cross-validation to select the best feature set. Fourth, a candidate feature set that merges features with high classification ability is designed and implemented to improve the classification accuracy.

The thesis uses the naive Bayes classifier with candidate feature sets and 10-fold cross-validation to classify the gender of blog authors and to validate the classification results. The final experimental results are as follows. The experiment using features with part-of-speech sequences achieves an accuracy of 62.99%, which is 2.4 percentage points higher than the 60.59% of the experiment using features without part-of-speech information. The experiment using the ensemble feature selection method achieves 72.89%, which is higher than each of the experiments using a method based on a single feature selection criterion (67.57%, 68.19%, 70.49%, 67.26%, and 66.97%) and 12.3 percentage points higher than the 60.59% of the experiment using no feature selection. The experiment using the candidate feature set that merges features with high classification ability achieves 74.49%, which is 1.6 percentage points higher than the 72.89% of the experiment using the candidate feature set without this improvement. As a result, the classification accuracy of the experiment based on the candidate feature set...
Keywords/Search Tags: gender classification of blog authors, classification features, feature selection method, Bayes classifier, 10-fold cross-validation
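
The pipeline described in the abstract (feature selection, a naive Bayes classifier, and 10-fold cross-validation) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation. The abstract does not say which selection criteria are used or how the ensemble combines them, so the two scoring criteria and the "union of top-k" rule below are assumptions, as are all function names. Features are re-selected inside each training fold so that the held-out blogs do not influence the choice.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model (add-one smoothing applied at prediction)."""
    class_counts = Counter(labels)
    token_counts = {c: Counter() for c in class_counts}
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
    vocab = {t for counts in token_counts.values() for t in counts}
    return class_counts, token_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest smoothed log-posterior."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        counts = token_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            if t in vocab:  # tokens outside the selected vocabulary are ignored
                score += math.log((counts[t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def ensemble_select(docs, labels, k):
    """Union of the top-k tokens under two stand-in criteria (assumed, two classes only)."""
    n = Counter(labels)
    df = {c: Counter() for c in n}  # per-class document frequency
    for tokens, label in zip(docs, labels):
        df[label].update(set(tokens))
    a, b = sorted(df)
    vocab = set(df[a]) | set(df[b])

    def rate(c, t):
        # Smoothed share of class-c documents that contain token t.
        return (df[c][t] + 1) / (n[c] + 2)

    gap = {t: abs(rate(a, t) - rate(b, t)) for t in vocab}
    ratio = {t: abs(math.log(rate(a, t) / rate(b, t))) for t in vocab}

    def top(scores):
        # Highest score first; ties broken alphabetically for determinism.
        return set(sorted(vocab, key=lambda t: (-scores[t], t))[:k])

    return top(gap) | top(ratio)


def cross_validate(docs, labels, folds=10, k=50):
    """Accuracy under k-fold cross-validation with per-fold feature selection."""
    correct = 0
    for f in range(folds):
        train = [i for i in range(len(docs)) if i % folds != f]
        test = [i for i in range(len(docs)) if i % folds == f]
        tr_docs = [docs[i] for i in train]
        tr_labels = [labels[i] for i in train]
        keep = ensemble_select(tr_docs, tr_labels, k)
        filtered = [[t for t in d if t in keep] for d in tr_docs]
        model = train_nb(filtered, tr_labels)
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test)
    return correct / len(docs)
```

Here `docs` is a list of token lists (one per blog) and `labels` is the matching list of two gender labels; `cross_validate(docs, labels, folds=10)` returns the fraction of blogs classified correctly across the ten held-out folds.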