I. Ozkan
Spring 2025
Book Chapter
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Chapter 10
Others
STAT 365/665: Data Mining and Machine Learning course at Yale University
Hands-On Machine Learning with R, Bradley Boehmke & Brandon Greenwell, Chapter 13
Deep Neural Networks (DNN), also called Deep Learning Networks, are powerful machine learning (ML) algorithms
A very active area of research
Used in both supervised and unsupervised learning with multi-layered models
Requires lots of data (hence, cheaper computation and the availability of larger data sets have made these algorithms feasible)
The cornerstone of the deep neural network is the [Artificial] Neural Network
It is inspired by the structure and functions of biological neural networks
For the history, read the related Wikipedia section
Feedforward Networks (FNN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), …, are network structures composed of artificial neurons that have been suggested and successfully applied to specific problems
Successfully applied in:
We will loosely discuss Feedforward Networks:
Perceptron (a binary classifier)
Artificial Neurons (generalized version of perceptron)
[Deep] Neural Networks (Network of Artificial Neurons, also called Artificial Neural Networks, ANN)
Layers and Nodes
Activation Function
Back-propagation
Batching, mini-batching
Regularization
Dropout
Learning Rate
Example (1)
A decision to attend an outdoor activity based on these three factors (let's assume binary factors)
Based on your utility function \(U(\cdot)\), you decide whether to attend (for the sake of simplicity once more, it is an additive function)
If the utility level exceeds some threshold, then the decision is yes (attend); otherwise no
\(\text{outcome, } y = \begin{cases} 1 & \text{if } w_1 x_1 + w_2 x_2 + w_3 x_3 \geq \text{threshold} \\ 0 & \text{otherwise} \end{cases}\)
Or
\(y = \begin{cases} 1 & \text{if } \boldsymbol{x} \cdot \boldsymbol{w} \geq \text{threshold} \\ 0 & \text{otherwise} \end{cases}\)
 - \(y\) is a step function: it takes the value \(1\) if a linear combination of the \(x\)'s exceeds the threshold value
*: Image is from https://stats.stackexchange.com/questions/419716/whats-the-difference-between-artificial-neuron-and-perceptron
The perceptron is at the center of artificial neural networks and acts similarly to a biological neuron: it activates other neurons based on the values it receives from its input terminals
It is a mathematical node and the basic processing element
In this example, the cell body simply takes the weighted sum of the inputs, and the activation function converts this value into zero or one (a minimal sketch of such a perceptron follows). The step function is just one example; several other activation functions have been suggested
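A minimal R sketch of this perceptron for the outdoor-activity decision; the weights and threshold are made-up values for illustration, not taken from the example above:

```r
# Perceptron with a step activation: returns 1 (attend) if the
# weighted sum of the binary inputs reaches the threshold, else 0.
perceptron <- function(x, w, threshold) {
  as.integer(sum(w * x) >= threshold)
}

w <- c(0.6, 0.3, 0.1)                 # assumed weights for the three factors
threshold <- 0.5                      # assumed threshold
perceptron(c(1, 0, 1), w, threshold)  # 0.7 >= 0.5, so returns 1 (attend)
```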
In this example there are an input layer, \(L_1\), a hidden layer, \(L_2\), and an output layer, \(L_3\)
In more complex settings there may be more than one hidden layer; such a network is sometimes called a deep neural network
An example of a deep Feedforward Network:
*: Source http://euler.stat.yale.edu/~tba3/stat665/lectures/lec12/lecture12.pdf
Activation functions in hidden layers are typically nonlinear
In most cases the same activation function is used throughout the network
The commonly used activation functions \(f(\cdot)\) are the following (as notation, \(z = \boldsymbol{w \cdot x + b}\)):
The activations are nonlinear transformations of linear combinations of the features (inputs); a sketch of common choices follows
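The slide's list of activation functions is not reproduced here; as a hedged sketch, three commonly used choices written out in R (the names sigmoid, tanh_act, and relu are mine):

```r
# Three commonly used activation functions, applied to z = w.x + b
sigmoid  <- function(z) 1 / (1 + exp(-z))  # logistic: output in (0, 1)
tanh_act <- function(z) tanh(z)            # hyperbolic tangent: output in (-1, 1)
relu     <- function(z) pmax(0, z)         # rectified linear unit: max(0, z)

# Compare the logistic and tanh curves on a common grid
z <- seq(-4, 4, by = 0.1)
plot(z, sigmoid(z), type = "l", ylim = c(-1, 1), ylab = "f(z)")
lines(z, tanh_act(z), lty = 2)
```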
The network is trained (the weights are found) iteratively
The backward propagation (backpropagation) algorithm was the first popular one:
The Gradient Descent algorithm applied for optimization (for example, minimizing a loss function) requires an initialization, a stopping condition, a step size (learning rate), and the gradient of the function
A simple linear regression is an example:
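A minimal sketch of gradient descent for simple linear regression; the data, learning rate, and iteration count are illustrative assumptions:

```r
# Gradient descent for y = b0 + b1*x, minimizing mean squared error
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.2)  # true intercept 2, slope 3

b0 <- 0; b1 <- 0          # initialization
eta <- 0.1                # step size (learning rate)
for (i in 1:5000) {       # stopping condition: fixed number of steps
  e  <- y - (b0 + b1 * x) # residuals
  b0 <- b0 + eta * mean(e)      # move along the negative gradient
  b1 <- b1 + eta * mean(e * x)
}
c(b0, b1)  # should be close to (2, 3)
```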
Neural networks can be used to approximate a function
One hidden layer is enough to model any piecewise continuous function (Hornik et al., 1989)
This is an example of the \(y = x^2\) function modeled with two hidden layers, each with three neurons, and the logistic activation function (using the neuralnet package of R)
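A hedged sketch of how this fit might be reproduced with neuralnet; the sample size, input range, and seed are my assumptions, as the author's exact code is not shown:

```r
library(neuralnet)

set.seed(1)
x  <- runif(200, -2, 2)
df <- data.frame(x = x, y = x^2)

# Two hidden layers with three neurons each, logistic activation
nn <- neuralnet(y ~ x, data = df, hidden = c(3, 3),
                act.fct = "logistic", linear.output = TRUE)

# Compare the fitted curve with the true function on a grid
grid <- data.frame(x = seq(-2, 2, length.out = 100))
pred <- compute(nn, grid)$net.result
plot(grid$x, pred, type = "l", xlab = "x", ylab = "y")
curve(x^2, add = TRUE, lty = 2)
```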
 
The number of hidden neurons is a task-specific problem
Using too many neurons increases the risk of overfitting
It is a model selection problem
There is no standard, accepted way of choosing a better network structure
Neural networks have an input layer, an output layer, and a number of hidden layers
The structure depends on the number of inputs, the number of outputs, and the complexity of the problem
Using more hidden layers may create optimization problems
For many problems a single hidden layer with enough nodes is sufficient
The example below shows a single hidden layer with six neurons (a sketch comparing candidate sizes follows)
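One hedged way to treat this as model selection is a simple train/test comparison across candidate hidden-layer sizes; the sizes and data here are illustrative, not the author's:

```r
library(neuralnet)

set.seed(1)
x  <- runif(300, -2, 2)
df <- data.frame(x = x, y = x^2)
train <- df[1:200, ]
test  <- df[201:300, ]

# Fit one single-hidden-layer network per candidate size
# and report the held-out mean squared error
for (h in c(2, 4, 6, 8)) {
  nn   <- neuralnet(y ~ x, data = train, hidden = h)
  pred <- compute(nn, test[, "x", drop = FALSE])$net.result
  cat("hidden =", h, " test MSE =", mean((test$y - pred)^2), "\n")
}
```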
Neural network power increases if neurons operate independently
In practice, some neurons begin to detect the same features of the data (co-adaptation)
Dropout is a solution to this co-adaptation
When hidden neurons are randomly dropped, a weaker learning model is obtained during each training epoch; the combination of these weak learners results in stronger predictive power (see the sketch below)
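A minimal sketch of (inverted) dropout applied to one hidden layer's activations; the drop probability and activation values are made up for illustration:

```r
set.seed(1)
h <- runif(6)    # activations of six hidden neurons
p <- 0.5         # probability of dropping a neuron

mask   <- rbinom(length(h), 1, 1 - p)  # keep each neuron with prob 1 - p
h_drop <- h * mask / (1 - p)           # rescale so the expected value is unchanged
rbind(h, h_drop)                       # dropped neurons contribute zero this epoch
```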
As a summary:
*Image Source: http://neuralnetworksanddeeplearning.com/chap3.html
*Image Source: https://www.geeksforgeeks.org/machine-learning/backpropagation-in-neural-network/
*Epoch: One complete forward pass and one backward pass of the error for all training instances
 
 
 
 Next Week: R Examples