
Research on Stacking Integration Methods Based on Four Classifiers

Posted on: 2021-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: R Chen
Full Text: PDF
GTID: 2428330623972757
Subject: Statistics
Abstract/Summary:
With the advent of the big data era, massive amounts of data have emerged, and the volume of unstructured data such as voice, image, and text has grown much faster than that of structured data. Short texts such as product descriptions contain a wealth of information, and extracting it has important research value for search engines and news topic classification. In e-commerce platforms and physical-store merchandise management, it is often necessary to establish a three-level category classification system for products in order to gain insight into consumer preferences. However, because of non-standard data entry and other reasons, products are frequently bound to the wrong category in practice, so an automatic product category identification model is needed. Relatively mature solutions exist for long text classification, but product titles are very short and very general texts: a title usually does not exceed 20 words and often suffers from semantic ambiguity and feature sparsity. Simply applying long text classification methods to title classification does not usually yield satisfactory results.

This paper discusses integration methods for product title classification based on three traditional classification algorithms (Bayesian, nearest neighbor, and support vector machine) and the more recent text classification algorithm fastText, and establishes a stacking integration (ensemble) model that effectively combines the four base classifiers, making automatic identification of product categories feasible. First, to address the corpus imbalance in traditional Chinese word segmentation, four Taobao keyword dictionaries are introduced as external reference vocabularies to verify the validity of the product title corpus, and a hybrid model is then trained on the product title text to obtain the corresponding word vector representations. Second, the four algorithm models (Bayesian, nearest neighbor, support vector machine, and fastText) are studied and optimized. Because brand feature words and the other feature words in a product title are not class-conditionally independent, the feature items are further divided and a two-layer Bayesian classification model is established. Finally, the four algorithm models are fused into a stacking integration model. The experimental results show that, for short text (product title) classification, the system stability and classification accuracy of the integrated model are higher than those of the four optimized base classifiers, indicating that the stacking integration algorithm proposed in this paper is an effective and accurate short text classification algorithm.
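As a rough illustration of the stacking idea described in the abstract, the following minimal sketch combines four base classifiers with scikit-learn's StackingClassifier. It is not the thesis's implementation. The titles and category labels are hypothetical toy data invented for the example, plain TF-IDF features replace the word vectors and dictionary-based segmentation described above, a logistic regression stands in for fastText (which scikit-learn does not provide), and the choice of logistic regression as the meta-learner is also an assumption.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy titles and categories, invented for illustration only.
titles = [
    "wireless bluetooth headphones noise cancelling",
    "usb c fast charging cable 2m",
    "bluetooth speaker portable waterproof",
    "phone charger usb wall adapter",
    "men cotton t shirt short sleeve",
    "women summer dress floral print",
    "men slim fit denim jeans",
    "women knit sweater long sleeve",
    "stainless steel frying pan nonstick",
    "ceramic coffee mug set of four",
    "kitchen knife set stainless steel",
    "glass food storage containers with lids",
]
labels = ["electronics"] * 4 + ["clothing"] * 4 + ["kitchen"] * 4

# Four base classifiers. The last one is only a stand-in for fastText
# (assumption): a linear model over bag-of-words features.
base_learners = [
    ("bayes", MultinomialNB()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("svm", LinearSVC()),
    ("fasttext_stand_in", LogisticRegression(max_iter=1000)),
]

# Stacking: each base learner produces out-of-fold predictions (cv=2 here
# because the toy set is tiny), and the meta-learner is trained on them.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=2,
)

# Simplification: TF-IDF is fitted once on all titles, outside the
# cross-validation folds used by the stacking step.
model = make_pipeline(TfidfVectorizer(), stack)
model.fit(titles, labels)

prediction = model.predict(["cotton t shirt for men"])
print(prediction)
```

The meta-learner sees only the base classifiers' out-of-fold scores, so it learns how much to trust each base model rather than relearning the text features themselves.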
Keywords/Search Tags: short text classification, Bayesian hierarchical model, fastText, stacking algorithm