
Research on Massive Chinese Document De-duplication Based on Topic

Posted on: 2018-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2348330563452441
Subject: Computer technology
Abstract/Summary:
With the development of the Internet, the cost of transmitting information has fallen steadily, and most documents can be reproduced, disseminated, modified, reformatted, and commented on, which results in large numbers of similar or even duplicate documents. These similar documents not only consume considerable computing resources during information retrieval and storage, but also provide users with little or even no effective information, directly degrading the quality of Internet data and the efficiency of information dissemination. It is therefore imperative to devise an efficient method for handling duplicate documents in massive document collections.

Traditional clustering-based algorithms cannot cope with large-scale document collections, while simhash-based solutions scale to large collections but achieve low precision. This paper therefore proposes a topic-based de-duplication algorithm for massive Chinese documents: it first builds a unified document vector model and assigns each feature word a precise weight, then hashes the document vector to a binary code, and finally detects similar documents by comparing the similarity of the hash codes. The contributions of this paper are as follows.

First, the traditional document vector suffers from high dimensionality and sparsity, so this paper proposes a dimensionality reduction method based on word2vec. It builds the bag of feature words by Document Frequency (DF) to retain the important features of a document as far as possible, and then exploits the semantic analysis capability of word2vec to shrink the bag of feature words, replacing semantically related words with the product of a topic word and suitable parameters (see the first sketch below).

Second, the classical TF-IDF algorithm weighs only term frequency and inverse document frequency, ignoring other features of a word. After analyzing Chinese expression habits, this paper proposes an adaptive position and part-of-speech weighting algorithm based on TF-IDF, which dynamically determines a word's position weight from its position in the document and its part-of-speech weight from its part of speech (see the second sketch below).

Third, to address the sheer scale of de-duplicating massive document collections, this paper proposes an LSH-based de-duplication scheme for Chinese documents, which takes the low-dimensional, precise document vector as input, hashes it to a binary code, and detects similar documents by comparing the similarity of the hash codes (see the third sketch below).
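To make the word2vec-based reduction concrete, here is a minimal sketch assuming a tokenized corpus and the gensim (>= 4) Word2Vec API. The df_ratio and sim_threshold values are illustrative assumptions, not the thesis's parameters, and the greedy merge is one plausible reading of "replacing semantically related words with a topic word".

```python
from collections import Counter
from gensim.models import Word2Vec

def build_feature_bag(docs, df_ratio=0.3):
    """Keep words whose document frequency (DF) reaches df_ratio of the corpus."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    min_df = df_ratio * len(docs)
    return [w for w, c in df.items() if c >= min_df]

def merge_by_topic(feature_bag, model, sim_threshold=0.7):
    """Greedily fold each word into the first existing 'topic word' it is
    semantically close to; otherwise it becomes a topic word itself."""
    topics = {}        # word -> (topic word, similarity used for the merge)
    reps = []          # current topic words
    for w in feature_bag:
        if w not in model.wv:
            continue
        for rep in reps:
            sim = model.wv.similarity(w, rep)
            if sim >= sim_threshold:
                topics[w] = (rep, float(sim))
                break
        else:
            reps.append(w)
            topics[w] = (w, 1.0)
    return topics

docs = [["document", "dedup", "hash"],
        ["document", "vector", "hash"],
        ["dedup", "hash", "detect"]]
model = Word2Vec(docs, vector_size=50, min_count=1, seed=1)
bag = build_feature_bag(docs)
print(merge_by_topic(bag, model))
```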
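The adaptive weighting scheme can be sketched in the same spirit. The position rule, the part-of-speech table, and the multiplicative combination below are assumptions for illustration; the thesis determines these weights adaptively rather than from fixed constants.

```python
import math
from collections import Counter

# Illustrative part-of-speech weights: content words above function words.
POS_WEIGHT = {"n": 1.4, "v": 1.2, "a": 1.0, "other": 0.8}

def position_weight(index, doc_len):
    """Chinese expository writing tends to front-load and conclude with key
    content, so words near either end get a higher (assumed) weight."""
    rel = index / max(doc_len - 1, 1)
    return 1.5 if rel < 0.1 or rel > 0.9 else 1.0

def weighted_tfidf(doc, corpus_df, n_docs):
    """doc is a list of (word, pos_tag) pairs; returns word -> adjusted weight,
    combining TF-IDF with the position and part-of-speech factors."""
    tf = Counter(w for w, _ in doc)
    weights = {}
    for i, (w, pos) in enumerate(doc):
        idf = math.log(n_docs / (1 + corpus_df.get(w, 0)))
        weights[w] = (tf[w] / len(doc)) * idf \
            * position_weight(i, len(doc)) \
            * POS_WEIGHT.get(pos, POS_WEIGHT["other"])
    return weights

doc = [("algorithm", "n"), ("propose", "v"), ("efficient", "a"), ("algorithm", "n")]
print(weighted_tfidf(doc, corpus_df={"algorithm": 10, "propose": 40}, n_docs=100))
```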
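Finally, the LSH step can be illustrated with sign-random-projection hashing, a standard LSH family for cosine similarity. The abstract does not spell out the exact hash family used, so the family, the code length, and the vectors below are assumptions: each bit of the binary code records on which side of a random hyperplane the document vector falls, and near-duplicates then differ in few bits.

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_signature(vec, planes):
    """One bit per random hyperplane: the sign of the projection onto it."""
    return (planes @ vec > 0).astype(np.uint8)

def hamming(a, b):
    """Similarity test: near-duplicates should have a small Hamming distance."""
    return int(np.count_nonzero(a != b))

dim, n_bits = 64, 32                      # assumed code length
planes = rng.standard_normal((n_bits, dim))

doc_a = rng.standard_normal(dim)                   # a low-dimensional document vector
doc_b = doc_a + 0.05 * rng.standard_normal(dim)    # a near-duplicate of doc_a
doc_c = rng.standard_normal(dim)                   # an unrelated document

sig_a, sig_b, sig_c = (lsh_signature(v, planes) for v in (doc_a, doc_b, doc_c))
print(hamming(sig_a, sig_b), hamming(sig_a, sig_c))  # small vs. large distance
```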
Keywords/Search Tags:document feature, document vector, document de-duplication, TF-IDF, word2vec