
Research On Literature Mining Algorithm Based On Machine Learning

Posted on: 2020-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: H L Ge
Full Text: PDF
GTID: 2428330596476075
Subject: Communication and Information System
Abstract/Summary:
As the number of researchers continues to grow, the volume of academic literature grows with it, which makes literature classification increasingly important. Academic paper classification is an effective method in text data mining that helps researchers explore information on the basis of text clustering. Many classification algorithms have been applied widely to multi-domain document sets. Classifying literature within a narrow field such as indoor positioning or medicine, however, is extremely difficult: documents in a narrow field share a large amount of overlapping vocabulary that does not help distinguish categories. Although this task plays an important role in scientific research, it has not been well studied. This thesis focuses on processing narrow-field literature data from SpringerLink, extracting effective text features, and achieving good classification performance. The main work is as follows:

(1) Building on LDA topic modeling, this thesis introduces word vector representations and fuses word vector semantics with the topic features. It proposes a text representation method based on LDA and Word2Vec extended features, which learns features from topics and from word context semantics respectively. Experimental results on the "indoor location" and "computer science" literature corpora show that the LDA and Word2Vec extended features perform significantly better than the two baseline models, LDA and LDA-w2v. The representation expresses the semantic information of short texts more precisely and, to a certain extent, overcomes the sparsity and poor topic focus of short texts. Classification based on the LDA and Word2Vec extended features is essentially a semi-supervised learning method and does not require a large labeled corpus.

(2) To better address poor topic focus, this thesis introduces an abstract extraction algorithm and combines it with feature extension to propose the EWLDA-EF text representation model. Classification experiments on the "indoor location" and "computer science" literature corpora show improved classification performance, indicating that the EWLDA-EF text representation model can overcome the problem of poor topic focus.

(3) This thesis analyzes the confidence distribution of correctly and incorrectly predicted samples in the corpus and uses ensemble learning to determine the final category of each test sample by sub-model voting. On this basis, it proposes a literature corpus classification method based on the EWLDA-EF combined classification model. Experiments show that on the "indoor location" literature data, with 60 topics, the MicroF1 of the EWLDA-EF combined classification model reaches 0.8355, which is 1.02% higher than that of the EWLDA-EF model. On the "computer science" literature corpus, with 40 topics, its MicroF1 reaches 0.8579, an improvement of 0.99% over the EWLDA-EF model. The combined classification model is thus more effective than the single model, which suggests that combining sub-models is a promising way to improve classification performance.
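The abstract does not describe how sub-model voting or MicroF1 are implemented, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes plain majority voting over the labels predicted by several sub-models, with ties going to the earliest sub-model, and computes a micro-averaged F1 for the single-label case. The function names `majority_vote` and `micro_f1`, the labels, and the toy predictions are all hypothetical and chosen for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample.

    Assumption: simple majority voting. On a tie, the label that was
    voted for first (by the earliest sub-model in the list) wins.
    """
    n_samples = len(predictions_per_model[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        combined.append(votes.most_common(1)[0][0])
    return combined


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): three sub-models, five test samples.
y_true = ["A", "A", "B", "B", "C"]
model_1 = ["A", "B", "B", "B", "C"]
model_2 = ["A", "A", "B", "C", "A"]
model_3 = ["B", "A", "A", "B", "C"]

combined = majority_vote([model_1, model_2, model_3])
score_single = micro_f1(y_true, model_1)
score_combined = micro_f1(y_true, combined)
```

In this toy case each sub-model makes at least one error on a different sample, so the vote corrects them. Real sub-models tend to make correlated errors, so the gain from voting is usually much smaller, in line with the roughly 1% improvements reported above.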
Keywords/Search Tags: LDA, Word2Vec, text classification, abstract extraction, combined classification model