Title Classification Research Of Collected Documents Based On Subject Matching

Posted on: 2021-01-01 | Degree: Master | Type: Thesis
Country: China | Candidate: S M Yu | Full Text: PDF
GTID: 2428330614961095 | Subject: Software engineering
Abstract/Summary:
Against the background of the information explosion, the problems of information flooding, information overload, and information waste are increasingly serious. Standardized management and automatic classification of high-value information is of great significance for improving a team's document management system and building a personal knowledge system. To classify the short title text of collected (favorited) documents, this thesis proposed an unsupervised subject word extraction algorithm, defined a subject word representation, and then tagged document titles based on that representation to achieve automatic classification.

To eliminate the fuzziness of classification targets and the differences between users, a selection specification for subject words was defined, the selection scope and granularity of subject words were reasonably limited, and a document classification specification based on custom rich tags was proposed. In addition, the concepts of co-occurrence item sets and co-occurrence item relationship types were defined, and the discrimination conditions for candidate subject words were used as the basis of the subject word extraction algorithm.

The subject word extraction algorithm was divided into four steps: document set preprocessing, candidate subject word selection, subject word set simplification and optimization, and subject word representation. In the preprocessing stage, a multi-phrase extraction algorithm was designed that extracts binary phrases and high-gram phrases efficiently. A candidate subject word selection algorithm was then designed to obtain the candidate subject word set and the co-occurrence item sets of the subject words. In the simplification and optimization stage, four strategies were applied in succession: simplifying equivalent feature items, eliminating redundant component items, eliminating bidirectional component relationships, and eliminating phrase component words. These simplified the subject word set and optimized the co-occurrence item sets, screening out high-quality subject words and removing redundancy from the co-occurrence item sets. In the subject word representation stage, the co-occurrence item set was decomposed into 4 sets that serve as the features of a subject word, and the subject words were divided into 4 types that distinguish their importance.

Finally, a title text classification algorithm was designed based on the subject word representation. The classification algorithm tagged each document with <subject word, constituent word> binary tags, which reflect the hierarchical relationship of the document topics and offer better interpretability.

In the experimental stage, the manual labeling of phrases, subject words, and document classification labels was strongly affected by the data set and by subjective factors, so the rationality and accuracy of the labels could not be guaranteed and evaluation indicators such as accuracy were not applicable. The experiments therefore verified the effect of each algorithm step and performed a qualitative comparison with traditional algorithms. The verification results showed that, on the short text data set of web favorites titles crawled for this thesis, the subject word extraction algorithm extracted 253 type 1 subject words from 3493 feature items; the number was moderate and the meaning of the subject words was reasonable. The classification algorithm created a label index for the documents and obtained a total of 4174 pairs of binary labels. The comparative analysis showed that, compared with traditional algorithms, this algorithm compared favorably in ease of use, interpretability, stability, and performance. This thesis contains 6 figures, 14 tables, and 65 references.
Keywords/Search Tags:short text classification, co-occurrence item set, keyword extraction, subject word representation, high-gram phrases extraction, rich tag, classification specification
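
The pipeline described in the abstract (phrase extraction, candidate subject word selection, simplification, and <subject word, constituent word> tagging) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names, the whitespace tokenizer, the frequency thresholds (`min_count`, `min_docs`), and the simplification rule (drop a word that occurs in exactly as many titles as a selected phrase containing it) are all assumptions chosen only to make the idea concrete, and only binary phrases are handled.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(title):
    # Assumption: lowercase whitespace tokenization is enough for this toy data.
    return title.lower().split()


def frequent_bigrams(titles, min_count=2):
    # Keep adjacent word pairs seen at least min_count times (hypothetical rule).
    counts = Counter()
    for title in titles:
        tokens = tokenize(title)
        counts.update(zip(tokens, tokens[1:]))
    return {" ".join(pair) for pair, n in counts.items() if n >= min_count}


def feature_items(title, phrases):
    # Feature items of a title: its words plus any known phrase it contains.
    tokens = tokenize(title)
    items = set(tokens)
    for a, b in zip(tokens, tokens[1:]):
        if f"{a} {b}" in phrases:
            items.add(f"{a} {b}")
    return items


def cooccurrence_sets(titles, phrases):
    # Document frequency per item, and how often two items share a title.
    doc_freq = Counter()
    cooc = defaultdict(Counter)
    for title in titles:
        items = feature_items(title, phrases)
        doc_freq.update(items)
        for a, b in combinations(sorted(items), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return doc_freq, cooc


def select_subject_words(doc_freq, min_docs=2):
    # Candidates: items occurring in at least min_docs titles (assumed threshold).
    candidates = {item for item, n in doc_freq.items() if n >= min_docs}
    # Simplification (assumed rule): drop a word that occurs in exactly as many
    # titles as a selected phrase containing it, since it adds no information.
    redundant = set()
    for phrase in (c for c in candidates if " " in c):
        for word in phrase.split():
            if doc_freq[word] == doc_freq[phrase]:
                redundant.add(word)
    return candidates - redundant


def tag_title(title, phrases, subjects, cooc):
    # Emit (subject word, constituent word) pairs for items that co-occurred
    # with the subject word in the reference titles.
    items = feature_items(title, phrases)
    tags = set()
    for subject in items & subjects:
        for other in items - {subject}:
            if cooc[subject][other] > 0:
                tags.add((subject, other))
    return sorted(tags)


# Toy titles, invented for illustration only.
titles = [
    "python text classification tutorial",
    "short text classification with python",
    "keyword extraction in python",
    "text classification survey",
]
phrases = frequent_bigrams(titles)
doc_freq, cooc = cooccurrence_sets(titles, phrases)
subjects = select_subject_words(doc_freq)
tags = tag_title(titles[1], phrases, subjects, cooc)
print(sorted(subjects))
print(tags)
```

Because each tag pairs a subject word with an item it actually co-occurred with, the label itself shows why a title was placed under a topic, which is the interpretability property the abstract emphasizes.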