
The Extraction System Of The Medical Post Bar Advertising

Posted on: 2017-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Zhang
GTID: 2348330488950947
Subject: Engineering
Abstract/Summary:
The post bar is a relatively large Chinese social networking platform. At present, a large number of advertisements appear on it; as many as half of the posts on a post bar web page can be advertisements. These ads seriously reduce the quality of the post bar: users waste a great deal of time browsing useless information, and the ads hinder them from communicating with each other and obtaining useful information through the platform. In addition, false advertisements can lead people to mistake valuable information for rubbish, blurring the line between helpful and harmful information. Medical advertising is especially damaging because it caters to patients and their families, who may come to believe false claims and delay regular treatment. The large volume of advertisements on the post bar is still handled manually by the post bar masters, which is inefficient. To recognize advertising automatically, this thesis develops an advertising extraction system that tells users which posts are advertisements while they browse, reminds them which information should not be viewed, and helps them avoid online fraud caused by false advertising.

Ad extraction is a branch of information extraction, which filters the information people are interested in out of a specific information stream. In this thesis, the extraction task is cast as text classification. The core module of the ad extraction system is the extraction of advertising text, so the thesis focuses on the design and implementation of the text classification module. A text classification system generally consists of four stages: text preprocessing, text representation, classifier training, and classifier testing.
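The four stages of a text classification system can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation: the stop word list, the toy English posts, and all function names are hypothetical, whitespace splitting stands in for Chinese word segmentation, and a small multinomial naive Bayes model with Laplace smoothing is used purely as an example classifier.

```python
import math
from collections import Counter

# Hypothetical stop word list and toy data (assumptions for illustration only).
STOP_WORDS = {"the", "a", "is", "to", "for"}

def preprocess(text):
    """Stage 1, preprocessing: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def represent(tokens):
    """Stage 2, representation: bag-of-words term counts."""
    return Counter(tokens)

def train(samples):
    """Stage 3, training: collect the counts a multinomial naive Bayes model needs."""
    doc_counts = Counter()   # number of training documents per label
    word_counts = {}         # label -> Counter of word occurrences
    vocab = set()
    for text, label in samples:
        bag = represent(preprocess(text))
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(bag)
        vocab.update(bag)
    return doc_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    bag = represent(preprocess(text))
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word, n in bag.items():
            if word not in vocab:
                continue  # ignore words never seen during training
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += n * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def evaluate(model, samples):
    """Stage 4, testing: fraction of test samples classified correctly."""
    return sum(predict(model, t) == y for t, y in samples) / len(samples)

TRAIN = [
    ("miracle cure for diabetes call now", "ad"),
    ("cheap medicine guaranteed cure call today", "ad"),
    ("my doctor changed the dosage last week", "non_ad"),
    ("is the hospital open on sunday", "non_ad"),
]
TEST = [
    ("guaranteed miracle medicine call now", "ad"),
    ("which hospital did your doctor recommend", "non_ad"),
]

model = train(TRAIN)
accuracy = evaluate(model, TEST)
```

Keeping each stage in its own function means one stage can be replaced (for example, a different representation or classifier) without touching the others.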
The main achievements are as follows:
(1) Access to text data in the post bar. A crawler program was written to grab text data from the post bar.
(2) Acquisition of training and test samples. Both come from post bar text: 200 training samples and 40 test samples, divided into two categories, advertising text and non-advertising text.
(3) Text segmentation and stop word elimination. Text segmentation is performed with the jieba segmentation tool, and an open source stop word list is modified to suit the characteristics of post bar text.
(4) Feature selection for the training samples. A combination of information gain and logistic regression is proposed for feature selection and implemented in Python. The information gain method first pre-selects features, recursive elimination based on logistic regression then refines the selection, and the number of features to retain is finally determined by the classification results.
(5) Text representation with the vector space model. The feature words selected from the 200 training texts form a word set. Using this word set, the document set is transformed into a matrix in which the number of rows equals the number of articles in the document set, the number of columns equals the number of feature words in the word set, and each entry is the weight of a feature word in an article. The weights are computed with the TF-IDF algorithm. The samples are stored in separate folders according to their category.
(6) Classifier training. Decision tree and naive Bayes classifiers are trained. After comparing the classification performance of the two algorithms, the decision tree is chosen as the classification algorithm of the ad extraction system.
(7) Classification results. Tested on the 40 test samples, the advertising extraction system achieves an accuracy of 97.4%. All advertising samples are identified correctly, but some non-advertising samples are still misjudged as advertisements.
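Steps (4) and (5) can be illustrated with a small, self-contained Python sketch. It is written under several assumptions and is not the thesis code: the documents are hypothetical, already segmented toy examples; only the information gain pre-selection is shown (the logistic regression recursive elimination step is omitted); and the TF-IDF variant used here, term frequency times log(N / document frequency), is one common choice, since the abstract does not say which variant was used.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """Information gain of the binary feature 'word occurs in the document'."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part)
        for part in (with_word, without_word)
        if part  # skip an empty split
    )
    return entropy(labels) - conditional

def select_features(docs, labels, k):
    """Keep the k words with the highest information gain."""
    vocab = sorted({w for d in docs for w in d})
    ranked = sorted(
        vocab,
        key=lambda w: information_gain(docs, labels, w),
        reverse=True,
    )
    return ranked[:k]

def tfidf_matrix(docs, features):
    """Rows are documents, columns are feature words, entries are TF-IDF weights."""
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in features}
    matrix = []
    for d in docs:
        counts = Counter(d)
        row = []
        for w in features:
            if not d or df[w] == 0:
                row.append(0.0)  # empty document or word absent from the corpus
            else:
                row.append((counts[w] / len(d)) * math.log(n / df[w]))
        matrix.append(row)
    return matrix

# Hypothetical, already segmented documents (assumption for illustration).
DOCS = [
    ["miracle", "cure", "call", "now"],
    ["cheap", "cure", "call", "today"],
    ["doctor", "changed", "dosage"],
    ["hospital", "open", "today"],
]
LABELS = ["ad", "ad", "non_ad", "non_ad"]

features = select_features(DOCS, LABELS, k=2)
matrix = tfidf_matrix(DOCS, features)
```

In this toy corpus, the words that appear in every advertising document and in no other document separate the two classes perfectly, so they receive the highest information gain, while a word such as "today" that occurs equally in both classes receives none.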
Keywords/Search Tags: Post Bar, Advertising, Feature Selection, Machine Learning