Large Scale Data Based Enterprise Address Recognition System

Posted on:2019-01-26

Degree:Master

Type:Thesis

Country:China

Candidate:M Wu

Full Text:PDF

GTID:2428330542482341

Subject:Computer technology

Abstract/Summary:

Since the reform and opening-up, China has vigorously developed its economy. With the reform of the commercial system in recent years, the cost of entrepreneurship for domestic SMEs has been greatly reduced. In the past, company registration and changes had to be handled by the State Administration of Industry and Commerce; the process has now been simplified so that the relevant local administrative unit where the company is located can handle them. In the first quarter of 2017, there were 1.255 million enterprise registrations nationwide, an average of 14 thousand per day. By the end of March of the same year, the total number of registered enterprises across the country had reached 89.357 million. With the number of companies growing this quickly, it is difficult to ensure legality and standardization, which calls for real-time supervision by the relevant departments. Departments such as the Industrial and Commercial Bureau hold a large amount of business registration information, of which the business address is one of the most critical items. One of the most effective forms of supervision is to check whether an enterprise's registered business address is its actual day-to-day operating address. An address is a kind of geographic location information that is closely tied to people's daily lives. However, because of history, region, customs, and other factors, many addresses cannot be compared directly. Address matching technology can effectively handle the description and comparison of address information.

This thesis describes in detail database and table sharding, big data processing, and natural language processing technology, and summarizes the characteristics of Chinese addresses. Starting from existing company names, a big data crawler obtains each company's possible operating address from the internet and stores it, together with the original industrial and commercial registered address, in sharded databases and tables. To cope with tens of millions of enterprise records, a real-time big data stream computing system built with Flume, Kafka, and Spark Streaming matches the original address against the crawled address. The matching module has two parts: dictionary-based matching for administrative divisions, and NLP word-vector matching for the non-administrative part of the address. The overall system structure decouples the functional modules to make iteration and management easier. Finally, a large amount of company information crawled from Yellow Pages websites was used as a data set for experiments on system stability, efficiency, and matching accuracy, and the results are analyzed.
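The two-part matching idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the gazetteer `ADMIN_DIVISIONS`, the function names, and the threshold are all hypothetical, and character-bigram count vectors stand in for the trained word vectors the abstract mentions, so that the example stays self-contained.

```python
import math
from collections import Counter

# Hypothetical toy gazetteer; a real dictionary would cover every
# province, city, and district.
ADMIN_DIVISIONS = {
    "province": ["浙江省", "江苏省"],
    "city": ["杭州市", "南京市"],
    "district": ["西湖区", "滨江区", "玄武区"],
}

def split_admin(address):
    """Strip known administrative divisions from the front of the address,
    level by level. Returns (divisions found, remaining text)."""
    found = {}
    rest = address
    for level in ("province", "city", "district"):
        # Try longer names first so a short name cannot shadow a longer one.
        for name in sorted(ADMIN_DIVISIONS[level], key=len, reverse=True):
            if rest.startswith(name):
                found[level] = name
                rest = rest[len(name):]
                break
    return found, rest

def bigram_vector(text):
    """Character-bigram counts: an assumed stand-in for word vectors."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_sq = sum(v * v for v in a.values()) * sum(v * v for v in b.values())
    return dot / math.sqrt(norm_sq) if norm_sq else 0.0

def match_addresses(registered, crawled, threshold=0.5):
    """Return (is_match, score). Administrative levels present in both
    addresses must agree exactly; the remainder is compared by similarity."""
    reg_admin, reg_rest = split_admin(registered)
    crawl_admin, crawl_rest = split_admin(crawled)
    for level in reg_admin.keys() & crawl_admin.keys():
        if reg_admin[level] != crawl_admin[level]:
            return False, 0.0
    score = cosine(bigram_vector(reg_rest), bigram_vector(crawl_rest))
    return score >= threshold, score

if __name__ == "__main__":
    print(match_addresses("浙江省杭州市西湖区文三路100号", "杭州市西湖区文三路100号"))
    print(match_addresses("浙江省杭州市西湖区文三路100号", "江苏省南京市玄武区文三路100号"))
```

In this sketch a crawled address that omits the province can still match, because only the administrative levels present in both addresses are compared; whether the thesis treats missing levels this way is not stated in the abstract.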

Keywords/Search Tags:

Sharding, Big data, Word vectors

Related items

1	Sentiment Word Vectors Generation Model Research Based On Deep Learning
2	Relevant Words Based On Word To Vectors And The Application In Topic Crawler System
3	Design And Implementation Of Indoor 3D Reconstruction Based On Dynamic Sequential Visual-word-vectors
4	Research On Data Sharding Problem Based On Relational Database
5	Research On Short Text Topic Model Based On Word Network And Word Vectors
6	Exploring Dialogue Text Classification Based On Word Mixture Vectors
7	Improving Word Vector Model With Part-of-Speech And Dependency Grammar Information
8	Automatic Topic Labelling Based On Word Vectors
9	An Optimization Scheme For Sharding Blockchain Based Heterogenous IoT System
10	Research On Fusion Sorting Model Of Domainized Word Vectors