Research On Extraction Method Of Investment Points In Company Research Report

Posted on: 2021-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Miao
GTID: 2428330629488947
Subject: Engineering
Abstract/Summary:
Investment points are essential for making investment decisions. They usually appear on the front page of a company research report, alongside the stock name, stock code, title, analyst information, and chart data. Company research reports issued by financial institutions are mostly distributed as PDF, and the text content of PDF documents is not easy to process. Moreover, extracting investment points directly with rules or models alone yields low accuracy. To address these problems, this paper analyzes the layout structure of company research reports and draws on the VIPS algorithm to propose a method for extracting their investment points. The method consists of two parts.

The first part is the design and implementation of a visual cues-based PDF page segmentation algorithm. After studying the logical structure, physical structure, and basic objects of PDF documents, this paper parses PDF documents with PDFBox and wraps the relevant information in a purpose-built data structure. Then, by analyzing the layout of the report front page and the similarity between PDF page segmentation and web page segmentation, it designs a visual cues-based PDF page segmentation algorithm on the premise that content with the same semantics looks the same or similar on the page. The algorithm has three main steps: separator detection, separator scoring, and block reconstruction. The separator scoring strategy uses 23 rules, and block reconstruction uses 5 parameters to adjust block size and depth.

The second part is the extraction of investment points on top of the page segmentation. This paper designs labels for the semantic blocks and uses the display, position, and text features of the blocks as training data for an SVM. The kernel function is chosen through comparative experiments, and the model is tuned with grid search and cross-validation. Finally, investment points are extracted by combining the SVM with rules.

Experiments on 1,000 company research reports show that the visual cues-based PDF page segmentation algorithm is effective for extracting investment points: F1 increases by 18.6%. In addition, combining the SVM with rules performs better than using either method alone, with precision, recall, and F1 all reaching 93.8%.
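The separator detection step named in the abstract can be illustrated with a small sketch. This is not the thesis's algorithm: the abstract gives neither the 23 scoring rules nor the 5 reconstruction parameters, so the code below only finds horizontal whitespace bands between text bounding boxes and groups the boxes around them, in the general spirit of VIPS-style segmentation. The names `Box`, `horizontal_separators`, `split_blocks`, and the `min_gap` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box of a text run (hypothetical type); y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float


def horizontal_separators(boxes: List[Box], min_gap: float = 5.0) -> List[Tuple[float, float]]:
    """Return (top, bottom) whitespace bands that no box crosses.

    min_gap is an assumed threshold, not a value from the thesis.
    """
    if not boxes:
        return []
    # Merge the vertical extents of all boxes into disjoint occupied intervals.
    spans = sorted((b.y0, b.y1) for b in boxes)
    occupied = [list(spans[0])]
    for top, bottom in spans[1:]:
        if top <= occupied[-1][1]:
            occupied[-1][1] = max(occupied[-1][1], bottom)
        else:
            occupied.append([top, bottom])
    # Gaps between consecutive occupied intervals are candidate separators.
    return [
        (a[1], b[0])
        for a, b in zip(occupied, occupied[1:])
        if b[0] - a[1] >= min_gap
    ]


def split_blocks(boxes: List[Box], min_gap: float = 5.0) -> List[List[Box]]:
    """Group boxes into blocks delimited by the detected separators."""
    seps = horizontal_separators(boxes, min_gap)
    groups: List[List[Box]] = [[] for _ in range(len(seps) + 1)]
    for b in boxes:
        # A box belongs to the block after every separator that ends above it.
        idx = sum(1 for _, bottom in seps if b.y0 >= bottom)
        groups[idx].append(b)
    return groups
```

For example, three boxes with vertical extents 0 to 10, 12 to 22, and 40 to 50 yield a single separator between y = 22 and y = 40, so the first two boxes form one block and the third forms another. A fuller implementation would also look for vertical separators, score each separator, and recurse into the resulting blocks.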
Keywords/Search Tags: Company Research Report, Investment Points, PDF, Block, SVM