
Research on Massive Chinese Document De-duplication Based on Topic

Posted on: 2018-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2348330563452441
Subject: Computer technology
Abstract/Summary:
With the development of the Internet, the cost of transmitting information has fallen steadily, and most documents can be reproduced, disseminated, modified, reformatted, and commented on, which results in large numbers of similar or even duplicate documents. These similar documents not only consume considerable computing resources during information retrieval and storage, but also provide users with little or even no effective information, directly degrading the quality of Internet data and the efficiency of information dissemination. It is therefore imperative to devise an efficient method for handling duplicate documents in massive document collections.

Traditional clustering-based algorithms cannot cope with large-scale document collections, while simhash-based solutions scale to large collections but achieve low precision. This paper therefore proposes a topic-based de-duplication algorithm for massive Chinese documents: it first builds a unified document vector model and assigns each feature word a precise weight, then hashes the document vector to a binary code, and finally detects similar documents by comparing the similarity of the hash codes. The contributions of this paper are as follows.

First, the traditional document vector suffers from high dimensionality and sparsity, so this paper proposes a dimensionality reduction method based on word2vec. It builds the bag of feature words by Document Frequency (DF) to retain the important features of a document as far as possible, and then exploits the semantic analysis capability of word2vec to shrink the bag of feature words, replacing semantically related words with the product of a topic word and suitable parameters (see the first sketch below).

Second, the classical TF-IDF algorithm weighs only term frequency and inverse document frequency, ignoring other features of a word. After analyzing Chinese expression habits, this paper proposes an adaptive position and part-of-speech weighting algorithm based on TF-IDF, which dynamically determines a word's position weight from its position in the document and its part-of-speech weight from its part of speech (see the second sketch below).

Third, to address the sheer scale of de-duplicating massive document collections, this paper proposes an LSH-based de-duplication scheme for Chinese documents, which takes the low-dimensional, precise document vector as input, hashes it to a binary code, and detects similar documents by comparing the similarity of the hash codes (see the third sketch below).
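To make the word2vec-based reduction concrete, here is a minimal sketch assuming a tokenized corpus and the gensim (>= 4) Word2Vec API. The df_ratio and sim_threshold values are illustrative assumptions, not the thesis's parameters, and the greedy merge is one plausible reading of "replacing semantically related words with a topic word".

```python
from collections import Counter
from gensim.models import Word2Vec

def build_feature_bag(docs, df_ratio=0.3):
    """Keep words whose document frequency (DF) reaches df_ratio of the corpus."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    min_df = df_ratio * len(docs)
    return [w for w, c in df.items() if c >= min_df]

def merge_by_topic(feature_bag, model, sim_threshold=0.7):
    """Greedily fold each word into the first existing 'topic word' it is
    semantically close to; otherwise it becomes a topic word itself."""
    topics = {}        # word -> (topic word, similarity used for the merge)
    reps = []          # current topic words
    for w in feature_bag:
        if w not in model.wv:
            continue
        for rep in reps:
            sim = model.wv.similarity(w, rep)
            if sim >= sim_threshold:
                topics[w] = (rep, float(sim))
                break
        else:
            reps.append(w)
            topics[w] = (w, 1.0)
    return topics

docs = [["document", "dedup", "hash"],
        ["document", "vector", "hash"],
        ["dedup", "hash", "detect"]]
model = Word2Vec(docs, vector_size=50, min_count=1, seed=1)
bag = build_feature_bag(docs)
print(merge_by_topic(bag, model))
```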
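The adaptive weighting scheme can be sketched in the same spirit. The position rule, the part-of-speech table, and the multiplicative combination below are assumptions for illustration; the thesis determines these weights adaptively rather than from fixed constants.

```python
import math
from collections import Counter

# Illustrative part-of-speech weights: content words above function words.
POS_WEIGHT = {"n": 1.4, "v": 1.2, "a": 1.0, "other": 0.8}

def position_weight(index, doc_len):
    """Chinese expository writing tends to front-load and conclude with key
    content, so words near either end get a higher (assumed) weight."""
    rel = index / max(doc_len - 1, 1)
    return 1.5 if rel < 0.1 or rel > 0.9 else 1.0

def weighted_tfidf(doc, corpus_df, n_docs):
    """doc is a list of (word, pos_tag) pairs; returns word -> adjusted weight,
    combining TF-IDF with the position and part-of-speech factors."""
    tf = Counter(w for w, _ in doc)
    weights = {}
    for i, (w, pos) in enumerate(doc):
        idf = math.log(n_docs / (1 + corpus_df.get(w, 0)))
        weights[w] = (tf[w] / len(doc)) * idf \
            * position_weight(i, len(doc)) \
            * POS_WEIGHT.get(pos, POS_WEIGHT["other"])
    return weights

doc = [("algorithm", "n"), ("propose", "v"), ("efficient", "a"), ("algorithm", "n")]
print(weighted_tfidf(doc, corpus_df={"algorithm": 10, "propose": 40}, n_docs=100))
```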
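Finally, the LSH step can be illustrated with sign-random-projection hashing, a standard LSH family for cosine similarity. The abstract does not spell out the exact hash family used, so the family, the code length, and the vectors below are assumptions: each bit of the binary code records on which side of a random hyperplane the document vector falls, and near-duplicates then differ in few bits.

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_signature(vec, planes):
    """One bit per random hyperplane: the sign of the projection onto it."""
    return (planes @ vec > 0).astype(np.uint8)

def hamming(a, b):
    """Similarity test: near-duplicates should have a small Hamming distance."""
    return int(np.count_nonzero(a != b))

dim, n_bits = 64, 32                      # assumed code length
planes = rng.standard_normal((n_bits, dim))

doc_a = rng.standard_normal(dim)                   # a low-dimensional document vector
doc_b = doc_a + 0.05 * rng.standard_normal(dim)    # a near-duplicate of doc_a
doc_c = rng.standard_normal(dim)                   # an unrelated document

sig_a, sig_b, sig_c = (lsh_signature(v, planes) for v in (doc_a, doc_b, doc_c))
print(hamming(sig_a, sig_b), hamming(sig_a, sig_c))  # small vs. large distance
```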
Keywords/Search Tags:document feature, document vector, document de-duplication, TF-IDF, word2vec