Large Scale Data Based Enterprise Address Recognition System

Posted on: 2019-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: M Wu
GTID: 2428330542482341
Subject: Computer technology
Abstract/Summary:
Since the reform and opening-up, China has vigorously developed its economy. With the reform of the commercial system in recent years, the cost of entrepreneurship for domestic SMEs has been greatly reduced. In the past, company registration and changes had to be handled by the State Administration of Industry and Commerce; the process is now simplified, and the relevant local administrative unit where the company is located can handle it. In the first quarter of 2017 there were 1.255 million enterprise registrations nationwide, an average of about 14 thousand per day. By the end of March of the same year, the total number of registered enterprises across the country had reached 89.357 million. With the number of companies growing this quickly, it is difficult to ensure legality and standardization, so real-time supervision by the relevant departments is required. Departments such as the Industrial and Commercial Bureau hold a large amount of business registration information, of which the business address is one of the most critical items. One of the most effective checks is to compare an enterprise's registered business address with the address at which it actually operates day to day. An address is a kind of geographic location information that is closely tied to people's daily lives. However, because of history, region, customs and other factors, many addresses cannot be compared directly. Address matching technology can effectively describe and compare address information.

This thesis describes in detail the database and table sharding, big data processing, and natural language processing technologies involved, and summarizes the characteristics of Chinese addresses. Starting from the existing company name, a big data crawler obtains the company's possible operational address from the internet and stores it, together with the original industrial and commercial registered address, in sharded databases and tables. To handle tens of millions of enterprise records, a real-time big data stream computing system built with Flume, Kafka, and Spark Streaming matches the original address against the crawled address. The matching module has two parts: the administrative divisions are matched against a dictionary, and the non-administrative remainder is matched using NLP word vectors. The overall system architecture decouples the functional modules to make the system easier to iterate on and manage. Finally, a large volume of company information crawled from Yellow Pages websites was used as a data set for experiments on system stability, efficiency, and matching accuracy, and the results are analyzed.
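The two-part matching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The dictionary `ADMIN_DIVISIONS`, the function names, and the `threshold` value are all hypothetical, and character-bigram count vectors stand in for the trained word vectors the thesis uses, so the scores only show the shape of the computation.

```python
import math
from collections import Counter

# Hypothetical toy dictionary; a real system would load the full list of
# provinces, cities, and districts.
ADMIN_DIVISIONS = ("浙江省", "杭州市", "西湖区", "滨江区")


def split_admin(address):
    """Strip leading administrative divisions found in the dictionary."""
    found = []
    rest = address
    matched = True
    while matched:
        matched = False
        for name in ADMIN_DIVISIONS:
            if rest.startswith(name):
                found.append(name)
                rest = rest[len(name):]
                matched = True
                break
    return found, rest


def bigram_vector(text):
    """Character-bigram counts: a stand-in for trained word vectors."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_addresses(registered, crawled, threshold=0.5):
    """Return (is_match, score) for a registered and a crawled address."""
    admin_r, rest_r = split_admin(registered)
    admin_c, rest_c = split_admin(crawled)
    # Simplification: the administrative parts must agree exactly.
    if admin_r != admin_c:
        return False, 0.0
    score = cosine(bigram_vector(rest_r), bigram_vector(rest_c))
    return score >= threshold, score
```

With these toy inputs, `match_addresses("浙江省杭州市西湖区文三路100号", "浙江省杭州市西湖区文三路100号A座")` treats the two addresses as a match, while the same street under a different district is rejected at the dictionary step. A production system would also need to handle omitted or abbreviated divisions, which this sketch does not.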
Keywords/Search Tags:Sharding, Big data, Word vectors