Research And Application Of Address Identification Based On Text Mining

Posted on: 2020-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: X F Ma
Full Text: PDF
GTID: 2428330596476509
Subject: Engineering
Abstract/Summary:
Accurately assigning each package to its delivery area based on the address on the package is a significant part of the logistics chain. Logistics companies have long relied on manual sorting for this step, but heavy manual intervention raises parcel delivery costs and reduces delivery efficiency. With the growing demand for smart logistics in recent years, automated address classification has become the trend. This thesis studies automatic address classification based on machine learning, including deep learning algorithms, and builds classification models that perform well, providing a new strategy for automatic sorting. The specific research contents are as follows:

1) An address classification model based on ensemble learning is proposed. Based on the observed syntactic structure of the address dataset, a text cleaning method is developed and an address segmentation method based on regular expressions and words is proposed, so that address segmentation and bag-of-words feature construction are completed without the aid of an address-element dictionary. The bag-of-words features are then filtered based on information entropy, and principal component analysis is used to reduce the feature dimension. A random forest base classifier is trained on the bag-of-words features. The semantic vector of an address is obtained by combining TF-IDF weights with word vectors, and a softmax base classifier is constructed. Finally, the ensemble classifier is trained with the Stacking method. Compared with other traditional machine learning algorithms, the model achieves better classification results.

2) A deep learning method is used to build an address classification model. To address the problem that models cannot effectively classify samples containing typos, a multichannel address classification model (MCC) based on the self-attention mechanism is proposed to improve performance. Building on an improved Transformer, the model introduces several encoding methods to mine the semantic information implicit in the text, so that it can use other channels to represent the text when it encounters typos. Experiments show that the deep learning model proposed in chapter four achieves better recognition on address text, with a precision of 0.9227, a recall of 0.9264, and an F1 value of 0.9242. Experiments on a dataset composed of typo samples also show that the proposed model classifies noisy samples better.

3) To accelerate the adoption of automatic address classification in real production environments, an address classification web service based on Flask is built. The service integrates modules for model training, model evaluation and model application. It provides address classification, and users can also upload their own datasets and train models on them.
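
As an illustration of the step in 1) that combines TF-IDF weights with word vectors, the following is a minimal Python sketch, not the thesis's implementation. It assumes the combination is a TF-IDF-weighted average of token vectors, which is one common choice; the abstract does not specify the formula. The function names, the smoothed IDF variant, the toy corpus and the two-dimensional vectors are all hypothetical.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed inverse document frequency for each token in a tokenized corpus."""
    n_docs = len(corpus)
    # Count in how many documents each token appears (once per document).
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def address_vector(tokens, idf, word_vectors):
    """TF-IDF-weighted average of the word vectors of one tokenized address."""
    dim = len(next(iter(word_vectors.values())))
    # Keep only tokens that have both an IDF value and a word vector.
    counts = Counter(t for t in tokens if t in word_vectors and t in idf)
    total = sum(counts.values())
    vec = [0.0] * dim
    weight_sum = 0.0
    for tok, count in counts.items():
        weight = (count / total) * idf[tok]  # term frequency * IDF
        weight_sum += weight
        for i, x in enumerate(word_vectors[tok]):
            vec[i] += weight * x
    if weight_sum == 0.0:
        return vec  # no known tokens: return the zero vector
    return [v / weight_sum for v in vec]


# Hypothetical toy data: tokenized addresses and 2-d word vectors.
corpus = [
    ["chengdu", "tianfu", "avenue"],
    ["chengdu", "renmin", "road"],
    ["chengdu", "jincheng", "avenue"],
]
word_vectors = {"chengdu": [1.0, 0.0], "tianfu": [0.0, 1.0]}

idf = idf_table(corpus)
vec = address_vector(["chengdu", "tianfu"], idf, word_vectors)
print(vec)
```

In this toy corpus "chengdu" occurs in every address, so it receives a lower IDF weight than "tianfu" and the resulting vector leans toward the rarer, more discriminative token. A vector of this kind could then be fed to a softmax classifier.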
Keywords/Search Tags:address classification, machine learning algorithm, text classification