The rapid increase in the performance of machine learning algorithms, driven by recent progress in deep learning, and the still untapped potential of these improved algorithms are changing the world we live in and the products we build. Deep learning models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have considerably changed the landscape of computer vision, speech recognition, natural language processing and related fields. Training the deeper networks needed to take advantage of big data, however, remains a major challenge. The difficulty of training deep neural networks was not substantially reduced until unsupervised pre-training was proposed, and even then two problems still limit the power of very deep networks: (1) Vanishing and exploding gradients. These difficulties arise when deep networks are trained with gradient-based methods (e.g. the back-propagation algorithm); they make it hard to tune the parameters of the earlier layers and grow worse as the number of layers increases. (2) Overfitting. A model overfits when it fits both the latent distribution and the noise in the training data and consequently performs poorly on new data, which degrades its ability to generalize. In this paper, these problems are alleviated from three directions: the non-linear activation function, the initialization method and the regularization method.

The main contributions of this paper can be summarized as follows:

1. A multi-layer maxout network (MMN) is proposed as a trainable non-linear activation function. It inherits the advantages of both a non-saturating activation function and a trainable activation-function approximator, and it is expressive enough to approximate any activation function. One major benefit of the MMN activation is a reduced likelihood of vanishing gradients; the other is an improved feature representation in convolutional neural networks. (A sketch of a maxout-style trainable activation is given after this list.)

2. A robust initialization method, designed specifically for the MMN activation function and supported by a theoretical proof, is proposed; it is valid for the plain maxout activation as well. Experimental results on CIFAR-10, CIFAR-100 and ImageNet demonstrate that this initialization reduces the internal covariate shift as the signal propagates through the layers and alleviates the vanishing and exploding gradient problems. (A generic variance-scaling sketch also follows the list.)

3. A novel companion objective function is proposed for regularizing deep convolutional neural networks. Regularization is an essential technique for addressing overfitting in deep CNNs, and the proposed companion objective has three benefits (a loss-construction sketch is given after this list):
(1) Two kinds of auxiliary supervision are introduced, applied to the convolutional filters and to the non-linear activations respectively. Both alleviate overfitting and improve performance, and the supervision on the non-linear activations is the more efficient of the two.
(2) Auxiliary supervision as regularization in the pre-training phase is discussed. With its assistance, CNNs obtain a more favorable initialization for the end-to-end supervised fine-tuning.
(3) The companion objective is verified to be compatible with other regularization strategies such as dropout and data augmentation.
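To make contribution 1 concrete, the following is a minimal PyTorch sketch of a maxout-style trainable activation: a maxout unit takes the maximum over k learned affine pieces, and two such units are stacked to form a small trainable activation approximator. The class names, the choice of k, and the element-wise (per-scalar) parameter sharing are illustrative assumptions; the MMN proposed in this paper may organize its pieces differently.

import torch
import torch.nn as nn

class MaxoutUnit(nn.Module):
    # Element-wise maxout: the maximum over k learned affine pieces a_j * x + b_j.
    def __init__(self, k=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(k))   # one slope per piece
        self.bias = nn.Parameter(torch.zeros(k))     # one intercept per piece

    def forward(self, x):
        pieces = x.unsqueeze(-1) * self.weight + self.bias   # shape (..., k)
        return pieces.max(dim=-1).values

class MultiLayerMaxout(nn.Module):
    # Two stacked maxout units acting as a trainable piecewise-linear activation.
    def __init__(self, k=4):
        super().__init__()
        self.inner = MaxoutUnit(k)
        self.outer = MaxoutUnit(k)

    def forward(self, x):
        return self.outer(self.inner(x))

Stacking two units matters because a single maxout is convex in its input; negative slopes in the outer unit can bend the inner unit's output into non-convex piecewise-linear shapes, which is what gives the construction its approximation power.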
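For contribution 2, the paper derives its own variance-scaling rule for MMN/maxout layers; that derivation is not reproduced here. The sketch below shows only the generic fan-in variance-scaling pattern in the spirit of He et al., with the gain of 2.0 (the ReLU value) as an explicit stand-in rather than the paper's constant.

import math
import torch
import torch.nn as nn

def fan_in_init(conv: nn.Conv2d, gain: float = 2.0) -> None:
    # Zero-mean Gaussian weights with variance gain / fan_in. gain=2.0 is the
    # ReLU value from He et al.; it stands in for the MMN-specific constant.
    fan_in = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
    std = math.sqrt(gain / fan_in)
    with torch.no_grad():
        conv.weight.normal_(0.0, std)
        if conv.bias is not None:
            conv.bias.zero_()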
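For contribution 3, the companion objective adds auxiliary supervision terms to the main classification loss. The sketch below shows only the generic deeply-supervised form, i.e. the main loss plus weighted auxiliary losses; the split into filter-level versus activation-level supervision and the actual weighting scheme are specific to this paper and are not reproduced here.

import torch.nn.functional as F

def companion_loss(main_logits, aux_logits_list, targets, aux_weight=0.3):
    # Main cross-entropy loss plus one weighted cross-entropy term per
    # auxiliary head; aux_weight=0.3 is an illustrative placeholder.
    loss = F.cross_entropy(main_logits, targets)
    for aux_logits in aux_logits_list:
        loss = loss + aux_weight * F.cross_entropy(aux_logits, targets)
    return loss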