
The Design of a Similar Document Detection System Based on Text Segmentation

Posted on: 2011-03-17    Degree: Master    Type: Thesis
Country: China    Candidate: B Li    Full Text: PDF
GTID: 2178360305954985    Subject: Software engineering

Abstract/Summary:
With the rapid development of Internet technology, the diversity of available resources is growing at an alarming rate. Under these circumstances, search engines have become the main way to look for information: Internet users can employ them to find what they need with great ease. However, as the Internet grows and online resources multiply, the content of web pages inevitably becomes redundant in places. Some sites copy most of their information from others for their own purposes, and some even reprint original material directly. As a result, the documents returned by a search engine contain a great deal of similar or even duplicate content. First, this wastes cyberspace; second, it causes much inconvenience to users trying to extract useful information, reducing efficiency and harming the reliability of the information. Removing duplicate information has therefore become an imperative step in obtaining valid information from the Internet. This thesis focuses on extracting information from web pages and performing similarity detection on a currently practicable platform, with the help and contributions of my teachers and senior colleagues. By detecting documents in this way we can learn which pieces of information are approximate or repetitive, and removing the duplicates brings great convenience to our daily work and improves the efficiency of information processing.

At the beginning of the thesis, I summarize the main steps and the various methods for removing duplicate information, such as removal of similar information based on approximate eigenvectors, removal of duplicate information based on fingerprint algorithms, detection of similar documents based on keywords, the sub-signature algorithm, the stochastic mapping algorithm, and so on. Each of these strategies and algorithms has its own characteristics and applications. In the end, we adopt a block-based (segmentation) strategy for similarity detection. To keep the work close to real-life application, we do not treat similar-document detection as an abstract exercise, but start from the everyday scenario of gathering information from the Internet for our own purposes. The thesis detects whether prepared documents are similar by comparing their hash values against a threshold fixed before testing, using the following steps: information search, web page collection, page segmentation, noise processing, text extraction, hash calculation, and hash table mapping. Together these steps provide a relatively complete solution to this class of problems.

Searching for information on the web and collecting web pages are technologies in routine use, so we do not dwell on them here; we use the Google search engine to carry out the searching and collecting. Next comes the page segmentation step. First we parse the collected web pages to obtain their DOM trees, from which we can extract a great deal of information, and then we cut off parts such as advertisements, page navigation, and other link blocks.
On one hand, a filter removes <img>, <script>, and <style> elements from the pages; on the other hand, we compute the size of each node in the DOM tree. The threshold here depends on the block's size and position. For each node we divide the number of characters inside links by the number of characters that are not links; when this ratio exceeds the threshold, we judge the node to be a link block and remove it. A sketch of this noise filter is given below. After the web-page noise has been deleted, we simply make use of the open java.util.regex regular-expression package included with Sun's Java JDK 1.4: a small Java program obtains the source code of the web page, applies regular expressions from java.util.regex, and finally keeps the remaining words in a document (see the extraction sketch below).

The next step is to divide the documents into segments. Ordered from small to large, the block sizes are word, shingle, and document. There is a trade-off to bear in mind: when segments are small, the results are accurate, but the computation takes too long to be acceptable; conversely, when blocks are too big, even a small difference changes the hash code, so many documents that are similar but not exact copies are missed. Weighing computation speed, efficiency, and result quality, we choose the shingle method to divide documents into segments, use the Java language to convert the segments into hash codes, and then map those hash codes into a hash table. By traversing the table's elements we compare documents through the hash-table mapping. Finally, we can easily count how many hash codes the two documents share and compare this count with the threshold; if the count is greater than the threshold, we consider the two documents similar. The shingling and comparison step is sketched at the end of this section.
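As an illustration of the noise-processing stage, here is a minimal Java sketch of the two filters just described: it drops <img>, <script>, and <style> elements and removes any node whose link-character to non-link-character ratio exceeds a threshold. It assumes the page has been normalized to well-formed XHTML so the JDK's built-in DOM parser can read it; the class name, the fixed LINK_RATIO_THRESHOLD value, and the use of a modern JDK rather than the thesis's JDK 1.4 are illustrative assumptions, not the thesis's actual implementation.

    import java.io.ByteArrayInputStream;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class NoiseFilter {
        // Ratio of link text to non-link text above which a block is treated
        // as noise. The value 1.0 is an illustrative choice, not a tuned one.
        static final double LINK_RATIO_THRESHOLD = 1.0;

        public static void main(String[] args) throws Exception {
            String xhtml = "<html><body>"
                + "<div><a href='#'>nav</a><a href='#'>links</a></div>"
                + "<p>Real article text long enough to keep.</p>"
                + "<script>window.status = 'ad';</script></body></html>";
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
            clean(doc.getDocumentElement());
            System.out.println(textOf(doc.getDocumentElement()).trim());
        }

        static void clean(Element e) {
            NodeList children = e.getChildNodes();
            // Collect removals first: a NodeList is live, so we must not
            // mutate the tree while iterating over it.
            List<Element> toRemove = new ArrayList<Element>();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                if (!(n instanceof Element)) continue;
                Element child = (Element) n;
                String tag = child.getTagName().toLowerCase();
                if (tag.equals("img") || tag.equals("script")
                        || tag.equals("style") || isLinkHeavy(child)) {
                    toRemove.add(child);
                } else {
                    clean(child);
                }
            }
            for (Element dead : toRemove) e.removeChild(dead);
        }

        // A block is "link-heavy" when its anchor text dominates its plain text.
        static boolean isLinkHeavy(Element e) {
            int linkChars = 0;
            NodeList anchors = e.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++)
                linkChars += textOf(anchors.item(i)).length();
            int plainChars = textOf(e).length() - linkChars;
            return plainChars == 0 ? linkChars > 0
                : (double) linkChars / plainChars > LINK_RATIO_THRESHOLD;
        }

        static String textOf(Node n) {
            String t = n.getTextContent();
            return t == null ? "" : t;
        }
    }

In the thesis the threshold additionally depends on block size and position; a fixed ratio is used here only to keep the sketch short.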
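The text-extraction step that follows noise removal can be sketched with the same java.util.regex package the thesis names. The patterns below are an assumed, simplified tag-stripping scheme; entity decoding and other refinements are omitted for brevity.

    import java.util.regex.Pattern;

    public class TextExtractor {
        // Strip <script>/<style> blocks first, then any remaining tags,
        // then collapse runs of whitespace into single spaces.
        private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
        private static final Pattern TAG = Pattern.compile("<[^>]+>");
        private static final Pattern SPACE = Pattern.compile("\\s+");

        public static String extract(String html) {
            String s = SCRIPT_STYLE.matcher(html).replaceAll(" ");
            s = TAG.matcher(s).replaceAll(" ");
            return SPACE.matcher(s).replaceAll(" ").trim();
        }

        public static void main(String[] args) {
            String page = "<html><body><h1>Title</h1><script>var a=1;</script>"
                        + "<p>Body text to keep.</p></body></html>";
            System.out.println(extract(page)); // Title Body text to keep.
        }
    }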
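Finally, a minimal sketch of the shingle segmentation, hash calculation, and hash-table comparison described above. The shingle width K = 4, the use of String.hashCode() as the hash function, and the absolute threshold in the example are illustrative choices; the abstract does not fix these parameters.

    import java.util.HashSet;
    import java.util.Set;

    public class ShingleComparer {
        // Shingle width in words; 4 is an illustrative choice.
        static final int K = 4;

        // Slide a K-word window over the text and hash each window.
        // String.hashCode() stands in for the thesis's hash function,
        // which the abstract does not name.
        static Set<Integer> shingleHashes(String text) {
            String[] words = text.toLowerCase().split("\\s+");
            Set<Integer> hashes = new HashSet<Integer>();
            for (int i = 0; i + K <= words.length; i++) {
                StringBuilder sb = new StringBuilder();
                for (int j = i; j < i + K; j++) sb.append(words[j]).append(' ');
                hashes.add(sb.toString().hashCode());
            }
            return hashes;
        }

        // Map one document's shingle hashes into a hash table (HashSet),
        // traverse the other's, count collisions, and compare the count
        // with the preset threshold.
        static boolean similar(String a, String b, int threshold) {
            Set<Integer> table = shingleHashes(a);
            int shared = 0;
            for (int h : shingleHashes(b)) {
                if (table.contains(h)) shared++;
            }
            return shared > threshold;
        }

        public static void main(String[] args) {
            String a = "removal of duplicate information will improve the efficiency of processing information";
            String b = "removal of duplicate information can improve the efficiency of processing information";
            System.out.println(similar(a, b, 3)); // 4 shared shingles -> true
        }
    }

An absolute count works for documents of comparable length; the ratio of shared shingles to total shingles would be the natural refinement when lengths differ.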
Keywords/Search Tags: Segmentation technology, Shingle, Threshold, Hash