We implemented a Convolutional Neural Network (CNN) with the PyTorch library to recognize real-world digits in the Street View House Numbers (SVHN) dataset. SVHN contains over 600,000 labeled digit images, an order of magnitude more labeled data than the MNIST dataset.
The data in the SVHN dataset is obtained from house numbers in Google Street View images. SVHN contains 73,257 digits for training, 26,032 digits for testing, and 531,131 additional samples to use as extra training data. There are 10 classes, one for each digit: digit '1' has label 1, digit '9' has label 9, and digit '0' has label 10. SVHN provides two formats for the data points, and we used Format 2, an MNIST-like format, to train and test our models. All images have a fixed resolution of 32-by-32 pixels. The following image shows some training examples; note that there are distracting digits to the sides of the digit of interest.
The following table shows the number of samples for each label in the training and testing datasets.
We will use softmax loss, accuracy, and error rate to evaluate our models on the SVHN dataset.
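For reference, with logits $z_1, \dots, z_{10}$ produced by the network for one sample with true label $y$, and with $\hat{y}_i$ denoting the predicted label for sample $i$, these are the standard definitions:

```latex
\mathcal{L}(z, y) = -\log \frac{e^{z_y}}{\sum_{j=1}^{10} e^{z_j}}
\qquad
\text{error rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}_i \neq y_i] = 1 - \text{accuracy}
```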
We will apply convolutional neural networks, implemented with the PyTorch library, to investigate the effects of different CNN hyperparameters on prediction accuracy.
CNNs are most commonly applied to process and classify multi-dimensional data such as images. During training, 3×32×32 images from the SVHN dataset are fed into the CNN model in batches of a chosen size. Ideally, the model should learn characteristic features and patterns for each class label without being misled by the distracting digits surrounding the digit of interest.
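As a concrete illustration, the sketch below loads SVHN Format 2 with torchvision and batches the 3×32×32 images. The normalization statistics and the batch size of 64 here are illustrative choices; also note that torchvision remaps the original SVHN label 10 (digit '0') to 0.

```python
# Minimal sketch: loading SVHN (Format 2) and batching 3x32x32 images.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                                   # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # illustrative statistics
])

# Note: torchvision remaps the original SVHN label 10 (digit '0') to 0.
train_set = datasets.SVHN(root="./data", split="train", download=True, transform=transform)
test_set = datasets.SVHN(root="./data", split="test", download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False)

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([64, 3, 32, 32])
```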
The figure on the left outlines the design and architecture of our CNN model. Training images are fed to the CNN model and passed through successive layers of convolution, batch normalization, ReLU, and max pooling. Batch normalization is used to improve the training speed and performance of the model: it normalizes the output of the previous layer before passing it to the next layer, with the goal of reducing internal covariate shift. After these layers, we flatten the data and pass it through fully connected layers with ReLU and dropout to finally classify the digits in the input images.
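A minimal PyTorch sketch of this architecture is shown below, using the kernel size of 3 and padding of 1 from the configuration reported later in this section. The channel widths (32/64/128) and the hidden size of 256 are assumptions for illustration; the report does not state them.

```python
# Sketch of the CNN described above: three conv blocks (Conv -> BatchNorm ->
# ReLU -> MaxPool), then flatten into fully connected layers with ReLU and
# dropout. Channel widths and hidden size are illustrative assumptions.
import torch.nn as nn

class SvhnCnn(nn.Module):
    def __init__(self, dropout=0.25):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 16x16 -> 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 10),              # one logit per digit class
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```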
A CNN has many hyperparameters that can be tuned to try to maximize the accuracy of the model. The following table lists the different hyperparameters that we tried, along with the resulting accuracy on the training and validation sets at epochs 5, 15, 25, and 50.
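To make the tuning concrete, the sketch below shows how two of these hyperparameters enter model and optimizer construction. The candidate values beyond those appearing in the table are illustrative, the choice of plain SGD is our assumption, and the training call itself is elided.

```python
# Sketch: wiring varied hyperparameters into the model and optimizer.
import torch
from itertools import product

def build(lr, dropout):
    model = SvhnCnn(dropout=dropout)  # model sketch from the previous section
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    return model, optimizer

for lr, dropout in product([0.01, 0.05, 0.1], [0.25, 0.5]):
    model, optimizer = build(lr, dropout)
    # ...train for 50 epochs and record accuracy at epochs 5, 15, 25, and 50...
```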
The configuration in row 2 achieved the highest testing accuracy, 95%. The accuracy results in row 3 are interesting: with dropout=0.5, some testing accuracies are higher than the training accuracies. This can happen because dropout is active during training but disabled at evaluation time, so the network used for testing is effectively stronger than the one measured during training. Since a single training run takes a long time, it was not feasible to test the CNN on many more configurations.
The graph on the left below shows the error rates on the training and testing datasets at each epoch. Since this is a classification problem, we use 0/1 loss to compute our model's accuracies and error rates. The graph on the right displays more concrete information. The parameters used to produce these graphs are: learning rate=0.05, padding=1, kernel size=3, 3 convolutional layers, batch normalization after each convolutional layer, pooling=MaxPooling, dropout=0.25, batch size=64, and total epochs=50.
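A sketch of the corresponding training and evaluation loop is shown below, reusing the loaders and model sketch from earlier. The choice of plain SGD as the optimizer is our assumption, since the optimizer is not named above.

```python
# Sketch of the training loop behind the graphs, with the parameters listed
# above: lr=0.05, dropout=0.25, batch size=64, 50 epochs. SGD is an assumption.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SvhnCnn(dropout=0.25).to(device)
criterion = nn.CrossEntropyLoss()                 # softmax (cross-entropy) loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

def error_rate(loader):
    """0/1 error rate: fraction of samples whose argmax prediction is wrong."""
    model.eval()
    wrong, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            wrong += (preds != labels).sum().item()
            total += labels.size(0)
    return wrong / total

for epoch in range(50):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(epoch, error_rate(train_loader), error_rate(test_loader))
```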
Looking at the graph in 4.2.1, we can see that the testing error rate reaches its minimum around epoch 14. Before epoch 14, both the training and testing error rates are decreasing, so the model is still underfitting. After epoch 14, the testing error rate is generally stable, with local fluctuations, while the training error rate keeps decreasing; the model is therefore overfitting. To reduce overfitting, we can change the pooling function, remove noise from the training dataset, or reduce the complexity of the convolutional layers.
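A further remedy suggested by the curves themselves, though not among the changes listed above, is early stopping: keep the model checkpoint with the lowest testing error, which here falls around epoch 14. A minimal variant of the loop from the previous sketch:

```python
# Early-stopping variant: track the best test error and keep that checkpoint,
# so the final model corresponds to the pre-overfitting minimum (~epoch 14).
best_err = float("inf")
for epoch in range(50):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    test_err = error_rate(test_loader)
    if test_err < best_err:
        best_err = test_err
        torch.save(model.state_dict(), "best_model.pt")  # best checkpoint so far
```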