Design And Implementation Of A Large-scale Patent Literature Translation And Cross-language Retrieval System Based On Hadoop

Posted on:2016-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:D Zhang

GTID:2308330476955009

Subject:Software engineering

Abstract/Summary:

Patent data plays an important role in the development of computer science and technology, and many research inventions can be found in the patent literature. With the arrival of the big data era, however, patent data is growing rapidly, covers a wide range of document types, and is written in many languages, so it needs a well-designed storage structure that can keep pace with this growth. Research on efficient translation and retrieval methods for patents therefore has both research significance and practical value.

This thesis studies storage and translation technology for patent data on the Hadoop distributed framework and, on that basis, implements a variety of retrieval methods. The main research work is as follows:

1) A dynamically scalable and efficient three-layer data storage structure is proposed. The bottom layer manages the different types of patent files separately to avoid data conflicts; the middle layer stores the patent directory information in a well-designed HTable populated by a MapReduce program; the top layer is an index over the catalog built with Lucene. This structure supports the storage of large-scale patent data and fast search.

2) Because a large number of patents are small files, input splits are combined into appropriately sized splits under MapReduce to reduce the proportion of time consumed by the system itself, thereby improving the overall efficiency of translation. The results show that this method improves efficiency by 20 percent.

3) Multi-function patent data retrieval is implemented, including cross-language information retrieval, advanced retrieval, and IPC classification retrieval. The thesis proposes using the co-occurrence relations between words to score and rank candidate translations for disambiguation, which performs well on the polysemy problem.
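
The co-occurrence idea for choosing among candidate translations can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: co-occurrence is counted per document, a candidate's score is the sum over the other query terms of its best co-occurrence count with any of their candidates, and the function names (`build_cooccurrence`, `rank_candidates`, `disambiguate`) and the toy corpus are hypothetical rather than taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(documents):
    """Count, for each unordered word pair, the number of documents containing both."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


def cooccur(counts, a, b):
    """Look up the document co-occurrence count of two words (0 if never seen together)."""
    if a == b:
        return 0
    return counts[tuple(sorted((a, b)))]


def rank_candidates(term_index, candidates_per_term, counts):
    """Score and sort the candidate translations of one query term.

    Assumed scoring rule (not from the thesis): for every other query term,
    take the best co-occurrence between this candidate and any candidate of
    that term, then sum these values.
    """
    scored = []
    for cand in candidates_per_term[term_index]:
        score = 0
        for j, others in enumerate(candidates_per_term):
            if j == term_index:
                continue
            score += max((cooccur(counts, cand, o) for o in others), default=0)
        scored.append((cand, score))
    # Highest score first; the stable sort keeps the input order on ties.
    scored.sort(key=lambda pair: -pair[1])
    return scored


def disambiguate(candidates_per_term, counts):
    """Pick the top-ranked candidate for each query term (None if a term has no candidates)."""
    result = []
    for i in range(len(candidates_per_term)):
        ranked = rank_candidates(i, candidates_per_term, counts)
        result.append(ranked[0][0] if ranked else None)
    return result


if __name__ == "__main__":
    # Toy target-language corpus (hypothetical data).
    corpus = [
        "memory chip storage controller",
        "storage chip with flash memory",
        "bank deposit account",
        "silicon wafer polishing",
    ]
    counts = build_cooccurrence(corpus)
    # Two source-language query terms, each with two dictionary candidates.
    query_candidates = [["deposit", "storage"], ["wafer", "chip"]]
    print(disambiguate(query_candidates, counts))  # prints ['storage', 'chip']
```

In this toy run "storage" and "chip" win because they appear together in two documents, while the competing candidates never co-occur with the other term's candidates. A real system would likely normalize the raw counts (for example with mutual information) and gather statistics from a large patent corpus; those choices are outside what the abstract states.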

Keywords/Search Tags:

Hadoop, patent translation, small files, cross-language retrieval

Related items

1	Design And Implementation Of Cross-language Parallel Retrieval System Based On Hadoop For Patent
2	Design And Implementation Of The Key Techniques For Storing And Retrieving Massive Small Files In Hadoop
3	Research On Techniques Of Query Translation For Cross-language Information Retrieval
4	Research On Access Optimization Of Small Files In Hadoop Cluster
5	Research And Optimization Of Small Files Processing Techniques In Hadoop
6	Cross-Language Information Retrieval Based On Statistical Language Modeling
7	Research On Processing Techniques Of Massive Small Files Based On Hadoop
8	Research And Implementation Of Small Files Storage Management Based On Hadoop
9	Study On Processing Of Massive Small Files Based On Hadoop
10	Research And Implementation Of Design Patent Images Retrieval System Based On Hadoop