Multi-label Text Categorization Of ZhiHu Title Based On Deep Learning

Posted on:2019-05-29

Degree:Master

Type:Thesis

Country:China

Candidate:C Zhang

Full Text:PDF

GTID:2428330545954557

Subject:Electronic and communication engineering

Abstract/Summary:

Zhihu is the most popular knowledge-based question-and-answer community on the Chinese Internet, with about 70 million users sharing or looking for knowledge on the site. Its basic function is to let some users post questions and other users answer them. The user who posts a question assigns a few labels to it, and users who want to answer find the question through those labels and respond. At present, the topic labels on Zhihu are annotated by users, which leads to a poor user experience. Specifically, because user-assigned labels may be inaccurate, the site cannot recommend appropriate answers to users in a timely and effective way. Furthermore, given Zhihu's massive volume of text data, manual labeling requires a huge amount of human labor. Designing a high-performance, high-precision automatic multi-label labeling system is therefore significant for improving the Zhihu user experience and reducing the site's operating cost.

This thesis designs an automatic multi-label labeling model based on deep learning. The main work includes the following aspects: (1) A Python web crawler is designed and implemented to obtain a large amount of data from the Zhihu website, and the acquired data is preprocessed, including data cleaning, text segmentation, and word vector training with the Word2Vec tool. (2) Multi-label text classification models based on deep learning are implemented, including models based on CNN, LSTM, and CNN-LSTM. The optimal parameter settings for these models were explored through a large number of experiments. The classification accuracies of the three models were 96.39%, 96.45%, and 96.99%, respectively. The hybrid CNN-LSTM model reduces the classification error rate of the CNN and LSTM models by 16.62% and 15.2%, respectively.
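The error-rate reductions quoted in the abstract can be checked from the three reported accuracies. The sketch below assumes that the error rate is 100% minus the accuracy and that the reduction is measured relative to each baseline model's error rate; the abstract does not state its formula, so this is an interpretation, and the function name is illustrative.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative reduction in error rate (percent), given accuracies in percent.

    Assumption: error rate = 100 - accuracy, and the reduction is
    expressed as a fraction of the baseline model's error rate.
    """
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Accuracies reported in the abstract (percent).
cnn_acc, lstm_acc, hybrid_acc = 96.39, 96.45, 96.99

vs_cnn = relative_error_reduction(cnn_acc, hybrid_acc)
vs_lstm = relative_error_reduction(lstm_acc, hybrid_acc)

print(f"CNN-LSTM vs CNN:  {vs_cnn:.2f}%")   # about 16.62%
print(f"CNN-LSTM vs LSTM: {vs_lstm:.2f}%")  # about 15.21%
```

Under this reading, both computed values agree with the figures given in the abstract (16.62% and 15.2%).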

Keywords/Search Tags:

Deep learning, Text classification, Web crawler

Related items

1	Research And Implementation Of Topic Crawler In The Field Of Inspection And Quarantine
2	Research On Key Technologies Of Chinese Text Classification Based On Deep Learning
3	Research On Application Of News Text Classification Based On Deep Learning
4	Research And Application Of Text Classification Technology Based On Deep Learning
5	Research On Text Classification Of Deep Learning Mixing Model Based On Map Reduce
6	Research On Method Of Chinese Text Sentiment Classification Based On Deep Learning
7	Multitask Text Classification Based On Deep Learning
8	Design And Implementation Of Long Text Classification Algorithm Based On Deep Neural Network
9	Research On Text Classification Based On Deep Neural Network
10	The Research Of Text Classification Based On Deep Learning