Text Mining Method Based On Content And Structure And Its Distributed Application Research

Posted on: 2017-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: M J Xu
GTID: 2348330482491348
Subject: Computer software and theory
Abstract/Summary:
In today's era of information explosion, big data is emerging at an ever-increasing rate. According to incomplete statistics, the volume of this data doubles every three months. Without effective information retrieval methods, this heterogeneous big data will eventually become neglected "data garbage". Full-text retrieval techniques can store and manage such data efficiently. Although full-text search engines have an excellent architecture, their search capability still has shortcomings. Because the default similarity scoring algorithm of a full-text search engine considers only word frequency features, its retrieval precision is low. This paper therefore improves the existing full-text search engine from two aspects: content and structure. On the content side, the paper improves the default similarity scoring algorithm by taking into account the distance between query terms in the document, in order to improve precision and recall in full-text retrieval. On the structure side, the paper exploits a structural property of documents: a document generally has only one main theme, and its author tends to develop that theme from the viewpoint of multiple child themes. Based on the logical structure of the document, the paper identifies the units that have local significance and the relations among them. As a result, the improved full-text retrieval results are more pertinent and give a better user experience.
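The idea of adding a query-term distance feature to a frequency-only score can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formula, so everything here is an assumption made for illustration: the function names `min_pair_distance` and `proximity_score`, the raw occurrence count used as the frequency part, and the `1 / distance` boost are all hypothetical, and none of this is Lucene's API.

```python
from itertools import product


def min_pair_distance(positions_a, positions_b):
    """Smallest absolute gap between any position in a and any position in b."""
    return min(abs(i - j) for i, j in product(positions_a, positions_b))


def proximity_score(query_terms, doc_tokens):
    """Frequency score plus a boost for query terms that occur close together.

    Hypothetical scoring rule, not the formula used in the thesis.
    """
    # Record the token positions of every query term found in the document.
    positions = {}
    for index, token in enumerate(doc_tokens):
        if token in query_terms:
            positions.setdefault(token, []).append(index)

    # Frequency part: total number of query-term occurrences.
    tf_part = sum(len(found) for found in positions.values())

    matched = sorted(positions)
    if len(matched) < 2:
        return float(tf_part)

    # Distance part: each pair of distinct matched terms adds 1 / (closest gap).
    boost = 0.0
    for a_index, term_a in enumerate(matched):
        for term_b in matched[a_index + 1:]:
            boost += 1.0 / min_pair_distance(positions[term_a], positions[term_b])
    return tf_part + boost


near = "full text search engine".split()
far = "full a b c d search".split()
print(proximity_score({"full", "search"}, near))
print(proximity_score({"full", "search"}, far))
```

Both documents contain each query term once, so a frequency-only score cannot separate them. Under the assumed rule, the document in which the terms sit closer together receives the higher score.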
It is therefore meaningful to study the distance features and physical structure of text, and to use this as an opportunity to study the application of a distributed full-text retrieval platform to text data. The paper carries out a series of studies on text mining methods based on content and structure and on their distributed application. The specific research work is as follows:
1. The paper proposes a new sentence similarity calculation model based on segmentation distance. The model first preprocesses the query string and the document. It then identifies the keywords and query terms in the document and calculates the segmentation distance between the query string and the keywords extracted from the document, which yields the similarity score between the query string and the document. Finally, the improved algorithm is applied to Lucene's similarity ranking algorithm, and its effectiveness is verified with indicators such as MAP and P@n.
2. The paper optimizes the text segmentation algorithm and applies the theme shards it produces to retrieval in order to improve precision and recall. First, the paper optimizes and improves TextTiling, a text segmentation algorithm built around lexical cohesion. Next, it uses the improved algorithm to segment texts into a collection of theme shards. Finally, it takes the child-theme structure into account during full-text retrieval to improve retrieval performance and user experience.
3. The paper applies the improved algorithms to the distributed full-text search platform Solr Cloud. First, it sets up a complete, fully distributed full-text search platform. Second, it integrates the improved algorithm based on segmentation distance into the core component of the full-text search engine. Third, from the perspective of text structure, it applies theme shards to concrete full-text retrieval operations. Experiments show that the improved algorithm not only improves the precision and recall of full-text retrieval but also greatly improves the user experience.
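Segmentation by lexical cohesion can be sketched roughly as follows. This is a deliberately simplified illustration, not the optimized algorithm from the thesis and not the full TextTiling procedure, which also smooths the similarity curve and places boundaries by depth score. The function names, the `block_size` window, and the fixed `threshold` are assumptions made for this example.

```python
import math
from collections import Counter


def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment(sentences, block_size=2, threshold=0.1):
    """Split tokenized sentences into segments at gaps with low lexical cohesion.

    Simplified stand-in: a gap becomes a boundary when the blocks of sentences
    on either side of it share almost no vocabulary.
    """
    if not sentences:
        return []

    boundaries = []
    for gap in range(1, len(sentences)):
        # Word counts for the sentence blocks just before and just after the gap.
        left = Counter(w for s in sentences[max(0, gap - block_size):gap] for w in s)
        right = Counter(w for s in sentences[gap:gap + block_size] for w in s)
        if cosine(left, right) < threshold:
            boundaries.append(gap)

    # Cut the sentence list at every boundary.
    segments, start = [], 0
    for gap in boundaries + [len(sentences)]:
        segments.append(sentences[start:gap])
        start = gap
    return segments


sentences = [
    ["solr", "index", "shard"],
    ["solr", "index", "query"],
    ["cat", "dog", "pet"],
    ["dog", "pet", "food"],
]
for part in segment(sentences):
    print(part)
```

In this toy input the first two sentences and the last two share no vocabulary, so the sketch places a single boundary between them. Each resulting segment could then be indexed as its own retrievable unit, which is one plausible reading of how theme shards are used in retrieval.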
Keywords/Search Tags: Text mining, full-text search, segmentation distance, text structure, Solr Cloud