Design And Implementation On Large-scale Patent Literatures Translation And Cross-language Retrieval System Based On Hadoop

Posted on: 2016-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhang
Full Text: PDF
GTID: 2308330476955009
Subject: Software engineering
Abstract/Summary:
Patent data plays an important role in the development of computer science and technology, and many research inventions are documented in the patent literature. With the arrival of the big data era, however, patent data is growing rapidly, covers a wide range of types, and is written in complex languages, so it needs a well-designed storage structure that can keep pace with this growth. Research on efficient translation and retrieval methods for patents therefore has both research significance and practical value.

This thesis studies storage and translation technology for patent data on the Hadoop distributed framework and, on that basis, implements a variety of retrieval methods. The main research work is as follows:

1) A dynamically scalable and efficient three-layer data storage structure is proposed. The bottom layer manages the different types of patent files separately to avoid data conflicts; the middle layer stores the patent directory information in a well-designed HTable using a MapReduce program; the top layer is an index over the catalog built with Lucene. This structure supports the storage of large-scale patent data and quick search.

2) Because a large number of patents are small files, input splits are combined into appropriately sized splits in MapReduce to reduce the share of time consumed by the system, thereby improving the overall efficiency of translation. The results show that this method improves efficiency by 20 percent.

3) Multi-function patent data retrieval is implemented, including cross-language information retrieval, advanced retrieval, and IPC classification retrieval. The thesis proposes using the co-occurrence relations between words to score and rank candidate translations for disambiguation, which performs well on the polysemy problem.
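
The co-occurrence scoring mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the scoring formula, so everything here is an assumption: co-occurrence is counted at document level in a toy target-language corpus, each candidate is scored by its summed co-occurrence with the candidates of the other query terms, and the function names `build_cooccurrence` and `disambiguate` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(documents):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence(counts, a, b):
    """Look up a pair regardless of argument order."""
    key = (a, b) if a <= b else (b, a)
    return counts[key]


def disambiguate(candidates_per_term, counts):
    """Pick one translation per query term.

    Assumed scoring rule (not taken from the thesis): a candidate's score is
    the sum of its co-occurrence counts with every candidate translation of
    the other query terms; the highest-scoring candidate wins.
    """
    chosen = []
    for i, candidates in enumerate(candidates_per_term):
        context = [
            c
            for j, other in enumerate(candidates_per_term)
            if j != i
            for c in other
        ]
        ranked = sorted(
            candidates,
            key=lambda c: sum(cooccurrence(counts, c, ctx) for ctx in context),
            reverse=True,
        )
        chosen.append(ranked[0])
    return chosen


# Toy target-language corpus and hypothetical candidate translations
# for a two-term query whose first term is ambiguous.
corpus = [
    "the bank approved the loan",
    "loan interest at the bank",
    "river shore erosion",
    "the shore of the river",
]
counts = build_cooccurrence(corpus)
result = disambiguate([["shore", "bank"], ["loan"]], counts)
print(result)
```

In this toy setup the ambiguous first term resolves to the candidate that appears alongside "loan" in the corpus. A real system would likely use a larger corpus and a normalized association measure rather than raw counts.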
Keywords/Search Tags:Hadoop, patent translation, small files, cross-language retrieval