Programming Exercise 4: Neural Networks Learning


This exercise implements the backpropagation algorithm for neural networks.
The application is hand-written digit recognition.

  • ex4data1.mat: training set of hand-written digits
  • ex4weights.mat: neural network parameters for exercise 4
  • computeNumericalGradient.m: numerically compute gradients.

Functions to implement

  • sigmoidGradient.m: compute the gradient of the sigmoid function.
  • randInitializeWeights.m: randomly initialize weights.
  • nnCostFunction.m: Neural network cost function.

Neural Networks

In the previous exercise we implemented feedforward propagation in MATLAB.
This time we implement backpropagation in Python to find the optimal parameters.

1.1 Visualizing the data

We read the data and display it as a 2-dimensional plot.
The data set consists of 5000 training examples.
Each handwritten-digit image is a 20 pixel by 20 pixel grayscale image,
and each pixel value is a floating point number.
Unrolled, each image becomes a 400-dimensional row vector; with 5000 training examples, the final input X is a 5000 by 400 matrix.

y holds the digit labels 1 through 9; because MATLAB indexing starts at 1, the digit 0 is labeled as 10.

Python implementation
Loading the required libraries

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.io #Used to load the OCTAVE *.mat files
import scipy.misc #Used to show matrix as an image
import matplotlib.cm as cm #Used to display images in a specific colormap
import random #To pick random images to display
import scipy.optimize #fmin_cg to train neural network
import itertools
from scipy.special import expit #Vectorized sigmoid function
datafile = 'data/ex4data1.mat'

mat = scipy.io.loadmat( datafile )

X, y = mat['X'], mat['y']

#Insert a column of 1's to X as usual
X = np.insert(X,0,1,axis=1)

print "'y' shape: %s. Unique elements in y: %s"%(y.shape,np.unique(y))
print "'X' shape: %s. X[0] shape: %s"%(X.shape,X[0].shape)
#X is 5000 images. Each image is a row. Each image has 400 pixels unrolled (20x20)
#y is a classification for each image. 1-10, where "10" is the handwritten "0"

Output

'y' shape: (5000, 1). Unique elements in y: [ 1  2  3  4  5  6  7  8  9 10]
'X' shape: (5000, 401). X[0] shape: (401,)

Image display functions

def getDatumImg(row):
    """
    Function that is handed a single np array with shape 1x400,
    creates an image object from it, and returns it
    """
    width, height = 20, 20
    square = row[1:].reshape(width,height)
    return square.T
    
def displayData(indices_to_display = None):
    """
    Function that picks 100 random rows from X, creates a 20x20 image from each,
    then stitches them together into a 10x10 grid of images, and shows it.
    """
    width, height = 20, 20
    nrows, ncols = 10, 10
    if not indices_to_display:
        indices_to_display = random.sample(range(X.shape[0]), 100)
        
    big_picture = np.zeros((height*nrows,width*ncols))
    
    irow, icol = 0, 0
    for idx in indices_to_display:
        if icol == ncols:
            irow += 1
            icol  = 0
        iimg = getDatumImg(X[idx])
        big_picture[irow*height:irow*height+iimg.shape[0],icol*width:icol*width+iimg.shape[1]] = iimg
        icol += 1
    fig = plt.figure(figsize=(6,6))
    img = scipy.misc.toimage( big_picture )
    plt.imshow(img,cmap = cm.Greys_r)
displayData()

Finally, 100 randomly chosen digits are stitched into a 10x10 grid and displayed.

1.2 Model representation

The neural network consists of:

  • 3 layers: an input layer ((20x20)+1 = 401 units including the bias), a hidden layer, and an output layer.
  • a set of network parameters ($\theta^{(1)}$, $\theta^{(2)}$) that has already been fully trained and is stored in ex4weights.mat.
  • $\theta^{(1)}$ maps the inputs to the 25 hidden units and $\theta^{(2)}$ maps the hidden units to the 10 outputs.

  • theta1 has size 25 x 401
  • theta2 has size 10 x 26
  • X has size 5000 x 401
  • Y has size 5000 x 1

The code needed for the model representation is as follows.

#You have been provided with a set of network parameters (Θ(1),Θ(2)) 
#already trained by us. These are stored in ex4weights.mat
datafile = 'data/ex4weights.mat'
mat = scipy.io.loadmat( datafile )
Theta1, Theta2 = mat['Theta1'], mat['Theta2']
# The matrices Theta1 and Theta2 will now be in your workspace
# Theta1 has size 25 x 401
# Theta2 has size 10 x 26

# These are some global variables I'm using to ensure the sizes
# of various matrices are correct
#these are NOT including bias units
input_layer_size = 400
hidden_layer_size = 25
output_layer_size = 10 
n_training_samples = X.shape[0]

#Some utility functions. There is a lot of flattening and
#reshaping of theta matrices, the input X matrix, etc...
#Nicely shaped matrices make the linear algebra easier when developing,
#but the minimization routine (fmin_cg) requires that all inputs be flattened 1-D arrays

def flattenParams(thetas_list):
    """
    Hand this function a list of theta matrices, and it will flatten it
    into one long (n,1) shaped numpy array
    """
    flattened_list = [ mytheta.flatten() for mytheta in thetas_list ]
    combined = list(itertools.chain.from_iterable(flattened_list))
    assert len(combined) == (input_layer_size+1)*hidden_layer_size + \
                            (hidden_layer_size+1)*output_layer_size
    return np.array(combined).reshape((len(combined),1))

def reshapeParams(flattened_array):
    theta1 = flattened_array[:(input_layer_size+1)*hidden_layer_size] \
            .reshape((hidden_layer_size,input_layer_size+1))
    theta2 = flattened_array[(input_layer_size+1)*hidden_layer_size:] \
            .reshape((output_layer_size,hidden_layer_size+1))
    
    return [ theta1, theta2 ]

def flattenX(myX):
    return np.array(myX.flatten()).reshape((n_training_samples*(input_layer_size+1),1))

def reshapeX(flattenedX):
    return np.array(flattenedX).reshape((n_training_samples,input_layer_size+1))

  • flattenParams: flattens a list of theta matrices into a single (n,1) array.
  • reshapeParams: reshapes a flattened array back into theta1 and theta2.
  • flattenX: flattens the input X into an (n,1) array.
  • reshapeX: reshapes a flattened X back to its original (5000, 401) shape (a quick round-trip check follows below).
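
As a quick sanity check (my own addition, not part of the exercise), flattening and then reshaping should round-trip without changing any values; this assumes Theta1, Theta2, and X are already loaded as above.

# Round-trip sanity check for the helpers above (illustrative only)
thetas_back = reshapeParams(flattenParams([Theta1, Theta2]))
assert np.allclose(thetas_back[0], Theta1)
assert np.allclose(thetas_back[1], Theta2)
assert np.allclose(reshapeX(flattenX(X)), X)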

1.3 Feedforward and cost function

Now you will implement the cost function and gradient for the neural network.
First, complete the code in nnCostFunction.m to return the cost.

nnCostFunction (computeCost in the Python code below) takes the following four arguments:

  • mythetas_flattened
  • myX_flattened
  • myy
  • mylambda=0.

We also need code that converts each original label into a one-hot vector representation, as done inside the loop below.
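
For reference, the cost accumulated by the loop below is the standard cross-entropy cost from the exercise (with $K = 10$ output classes and $m = 5000$ examples):

$$ J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ -y_k^{(i)}\log\left(\left(h_\Theta(x^{(i)})\right)_k\right) - \left(1-y_k^{(i)}\right)\log\left(1-\left(h_\Theta(x^{(i)})\right)_k\right) \right] $$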

def computeCost(mythetas_flattened,myX_flattened,myy,mylambda=0.):
    """
    This function takes in:
        1) a flattened vector of theta parameters (each theta would go from one
           NN layer to the next), the thetas include the bias unit.
        2) the flattened training set matrix X, which contains the bias unit first column
        3) the label vector y, which has one column
    It loops over training points (recommended by the professor, as the linear
    algebra version is "quite complicated") and:
        1) constructs a new "y" vector, with 10 rows and 1 column, 
            with one non-zero entry corresponding to that iteration
        2) computes the cost given that y- vector and that training point
        3) accumulates all of the costs
        4) computes a regularization term (after the loop over training points)
    """
    
    # First unroll the parameters
    mythetas = reshapeParams(mythetas_flattened)
    
    # Now unroll X
    myX = reshapeX(myX_flattened)
    
    #This is what will accumulate the total cost
    total_cost = 0.
    
    m = n_training_samples

    # Loop over the training points (rows in myX, already contain bias unit)
    for irow in xrange(m):
        myrow = myX[irow]
                
        # First compute the hypothesis (this is a (10,1) vector
        # of the hypothesis for each possible y-value)
        # propagateForward returns (zs, activations) for each layer
        # so propagateforward[-1][1] means "activation for -1st (last) layer"
        myhs = propagateForward(myrow,mythetas)[-1][1]

        # Construct a 10x1 "y" vector with all zeros and only one "1" entry
        # note here if the hand-written digit is "0", then that corresponds
        # to a y- vector with 1 in the 10th spot (different from what the
        # homework suggests)
        tmpy  = np.zeros((10,1))
        tmpy[myy[irow]-1] = 1
        
        # Compute the cost for this point and y-vector
        mycost = -tmpy.T.dot(np.log(myhs))-(1-tmpy.T).dot(np.log(1-myhs))
     
        # Accumulate the total cost
        total_cost += mycost
  
    # Normalize the total_cost, cast as float
    total_cost = float(total_cost) / m
    
    # Compute the regularization term
    total_reg = 0.
    for mytheta in mythetas:
        total_reg += np.sum(mytheta*mytheta) #element-wise multiplication
    total_reg *= float(mylambda)/(2*m)
        
    return total_cost + total_reg
       

def propagateForward(row,Thetas):
    """
    Function that given a list of Thetas (NOT flattened), propagates the
    row of features forwards, assuming the features ALREADY
    include the bias unit in the input layer, and the 
    Thetas also include the bias unit

    The output is a vector with element [0] for the hidden layer,
    and element [1] for the output layer
        -- Each element is a tuple of (zs, as)
        -- where "zs" and "as" have shape (# of units in that layer, 1)
    
    ***The 'activations' are the same as "h", but this works for many layers
    (hence a vector of thetas, not just one theta)
    Also, "h" is vectorized to do all rows at once...
    this function takes in one row at a time***
    """
   
    features = row
    zs_as_per_layer = []
    for i in xrange(len(Thetas)):  
        Theta = Thetas[i]
        #Theta is (25,401), features are (401, 1)
        #so "z" comes out to be (25, 1)
        #this is one "z" value for each unit in the hidden layer
        #not counting the bias unit
        z = Theta.dot(features).reshape((Theta.shape[0],1))
        a = expit(z)
        zs_as_per_layer.append( (z, a) )
        if i == len(Thetas)-1:
            return np.array(zs_as_per_layer)
        a = np.insert(a,0,1) #Add the bias unit
        features = a

Execution code

#Once you are done, using the loaded set of parameters Theta1 and Theta2,
#you should see that the cost is about 0.287629
myThetas = [ Theta1, Theta2 ]

#Note I flatten the thetas vector before handing it to the computeCost routine,
#as per the input format of the computeCost function.
#It does the unrolling/reshaping itself
#I also flatten the X vector, similarly
print computeCost(flattenParams(myThetas),flattenX(X),y)

Output

0.287629165161

Notes on the function calls

  • flattenParams(myThetas) -> (10285, 1), i.e. (25*401) + (10*26).
  • flattenX(X) -> (2005000, 1), i.e. 5000*401.
  • propagateForward takes a single input row together with the list of thetas [theta1, theta2] and, for each layer, returns a tuple (z, a), where z = Theta·x and a = activation(z) (a quick shape check follows below).
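
A quick shape check of a single forward pass (my own illustrative snippet, assuming X, Theta1, and Theta2 are loaded as above):

layers = propagateForward(X[0], [Theta1, Theta2])
print layers[0][0].shape   # z2: (25, 1)
print layers[0][1].shape   # a2: (25, 1)
print layers[1][1].shape   # a3, the (10, 1) hypothesis vector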

1.4 Regularized cost function

To prevent overfitting, we add a regularization term to the cost function.
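
For reference, the regularization term added here (with $\lambda = 1$) is $\frac{\lambda}{2m}$ times the sum of squared parameters; note that the loop in computeCost above sums over every entry of each $\Theta$, bias column included:

$$ \text{reg} = \frac{\lambda}{2m}\left[\sum_{j,k}\left(\Theta^{(1)}_{j,k}\right)^{2} + \sum_{j,k}\left(\Theta^{(2)}_{j,k}\right)^{2}\right] $$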

#Once you are done, using the loaded set of parameters Theta1 and Theta2,
#and lambda = 1, you should see that the cost is about 0.383770
myThetas = [ Theta1, Theta2 ]
print computeCost(flattenParams(myThetas),flattenX(X),y,mylambda=1.)

2. Backpropagation

We now implement error backpropagation.
computeCost gives the cost; backPropagate, implemented below, returns the gradient.
With the computed gradient, the cost function $J(\theta)$ can be minimized using an advanced optimizer such as fmincg.

First, we compute the gradient of the cost function with respect to each parameter using the backpropagation algorithm.

2.1 Sigmoid Gradient

The gradient of the sigmoid is obtained by differentiating it:
$g'(z) = g(z)(1 - g(z))$
where
$g(z) = \frac{1}{1+e^{-z}}$.
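
The derivative follows in one line from the chain rule:

$$ g'(z) = \frac{d}{dz}\left(1+e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = g(z)\left(1-g(z)\right) $$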

def sigmoidGradient(z):
    dummy = expit(z)
    return dummy*(1-dummy)

To verify the implementation, evaluate it at a large positive or negative z: the result should be close to 0.
At z = 0, the gradient should be exactly 0.25.

The results are as follows.

print(sigmoidGradient(5))
print(sigmoidGradient(0))
print(sigmoidGradient(-5))
0.00664805667079
0.25
0.00664805667079

This property is what later causes the vanishing gradient problem.

The reason is as follows:

  • If the weight matrix W is initialized with large values,
  • then $Z = W^{T}X$ is also large,
  • and sigmoid(Z) saturates at 0 or 1.
  • Since the derivative of the sigmoid is $g(z)(1 - g(z))$,
  • the local gradient is close to 0 whenever the sigmoid saturates, which is most of the time with large weights.
  • Once a local gradient is near zero, the backward pass, which is chained together by the chain rule, becomes near zero for everything behind it.
  • The maximum of the local gradient is 0.25, attained at z = 0 (where g(z) = 0.5); so each time the signal passes through a sigmoid its magnitude shrinks to at most one quarter, and usually much more.
  • This is why the choice of initial values matters (a small numerical illustration follows).
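
A small numerical illustration of the last two points (my own sketch, not part of the exercise): even in the best case, where every z is exactly 0, a gradient signal passing through k sigmoid layers is scaled by at most $0.25^{k}$.

# Best-case shrink factor after k sigmoid layers (each local gradient <= 0.25)
for k in [1, 2, 5, 10]:
    print "after %2d sigmoid layers: %e" % (k, 0.25 ** k)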

2.2 Random Initialization

The weight matrices $\theta$ must be initialized randomly.
If all weights start with the same value, every unit produces the same error and the weights change identically, so the symmetry is never broken.

We initialize uniformly in the range $[-\epsilon_{init}, \epsilon_{init}]$.
Small initial values suit the shape of the sigmoid derivative: with large weights the activations saturate, the gradients go to 0, and, because backpropagation chains these gradients together, learning stalls.

One common rule for choosing the range is:
$$ \varepsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}+L_{out}}} $$
where $L_{in}=s_l$ and $L_{out}=s_{l+1}$.
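
As a quick check (my own), plugging this network's first-layer sizes ($L_{in}=400$, $L_{out}=25$) into the formula gives roughly the 0.12 hard-coded in genRandThetas below:

import numpy as np
print np.sqrt(6) / np.sqrt(400 + 25)   # ~0.1188, close to the 0.12 used below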

def genRandThetas():
    epsilon_init = 0.12
    theta1_shape = (hidden_layer_size, input_layer_size+1)
    theta2_shape = (output_layer_size, hidden_layer_size+1)
    rand_thetas = [ np.random.rand( *theta1_shape ) * 2 * epsilon_init - epsilon_init, \
                    np.random.rand( *theta2_shape ) * 2 * epsilon_init - epsilon_init]
    return rand_thetas

2.3 Backpropagation

def backPropagate(mythetas_flattened,myX_flattened,myy,mylambda=0.):
    
    # First unroll the parameters
    mythetas = reshapeParams(mythetas_flattened)
    
    # Now unroll X
    myX = reshapeX(myX_flattened)

    #Note: the Delta matrices should include the bias unit
    #The Delta matrices have the same shape as the theta matrices
    Delta1 = np.zeros((hidden_layer_size,input_layer_size+1))
    Delta2 = np.zeros((output_layer_size,hidden_layer_size+1))

    # Loop over the training points (rows in myX, already contain bias unit)
    m = n_training_samples
    for irow in xrange(m):
        myrow = myX[irow]
        a1 = myrow.reshape((input_layer_size+1,1))
        # propagateForward returns (zs, activations) for each layer excluding the input layer
        temp = propagateForward(myrow,mythetas)
        z2 = temp[0][0]
        a2 = temp[0][1]
        z3 = temp[1][0]
        a3 = temp[1][1]
        tmpy = np.zeros((10,1))
        tmpy[myy[irow]-1] = 1
        delta3 = a3 - tmpy 
        delta2 = mythetas[1].T[1:,:].dot(delta3)*sigmoidGradient(z2) #remove 0th element
        a2 = np.insert(a2,0,1,axis=0)
        Delta1 += delta2.dot(a1.T) #(25,1)x(1,401) = (25,401) (correct)
        Delta2 += delta3.dot(a2.T) #(10,1)x(1,26) = (10,26), since a2 now includes the bias unit
        
    D1 = Delta1/float(m)
    D2 = Delta2/float(m)
    
    #Regularization:
    D1[:,1:] = D1[:,1:] + (float(mylambda)/m)*mythetas[0][:,1:]
    D2[:,1:] = D2[:,1:] + (float(mylambda)/m)*mythetas[1][:,1:]
    
    return flattenParams([D1, D2]).flatten()

Running backpropagation on the provided thetas to obtain the gradient matrices D1 and D2:

#Actually compute D matrices for the Thetas provided
flattenedD1D2 = backPropagate(flattenParams(myThetas),flattenX(X),y,mylambda=0.)
D1, D2 = reshapeParams(flattenedD1D2)

The D matrices below have exactly the same shape as the two $\theta$ matrices we want to learn,
because each D is the gradient used to adjust the corresponding $\theta$:

  • D1.shape is (25, 401), matching $\theta_{1}$: 25 x 401
  • D2.shape is (10, 26), matching $\theta_{2}$: 10 x 26

backPropagate takes four arguments:

  • mythetas_flattened <- flattenParams(myThetas): the two $\theta$ matrices to be learned, flattened.
  • myX_flattened <- flattenX(X): the input images, each 20x20 pixels plus 1 bias entry, flattened.
  • myy <- y: the ground-truth labels.
  • mylambda=0.: the regularization parameter.

The error terms are computed with the formula
$$ \delta^{(l)} = \left((\Theta^{(l)})^T \delta^{(l+1)}\right)\ .*\ g'(z^{(l)}) $$
and each delta in the code above follows it.
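
Written out in full, the steps implemented in backPropagate above are (the notation $\Theta^{(l)}_{\,:,2:}$ for dropping the bias column is mine):

$$ \delta^{(3)} = a^{(3)} - y $$
$$ \delta^{(2)} = \left(\Theta^{(2)}_{\,:,2:}\right)^{T}\delta^{(3)}\ .*\ g'(z^{(2)}) $$
$$ \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}\left(a^{(l)}\right)^{T} $$
$$ D^{(l)} = \frac{1}{m}\Delta^{(l)} + \frac{\lambda}{m}\Theta^{(l)}\ \ \text{(regularization applied to the non-bias columns only)} $$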

2.4 Gradient checking

def checkGradient(mythetas,myDs,myX,myy,mylambda=0.):
    myeps = 0.0001
    flattened = flattenParams(mythetas)
    print(flattened.shape)
    
    flattenedDs = flattenParams(myDs)
    myX_flattened = flattenX(myX)
    n_elems = len(flattened) 
    #Pick ten random elements, compute numerical gradient, compare to respective D's
    import time
    for i in xrange(10):
        start_time = time.time() #start timing one numerical-gradient estimate
        x = int(np.random.rand()*n_elems)
        epsvec = np.zeros((n_elems,1))
        epsvec[x] = myeps
        cost_high = computeCost(flattened + epsvec,myX_flattened,myy,mylambda)
        cost_low  = computeCost(flattened - epsvec,myX_flattened,myy,mylambda)
        mygrad = (cost_high - cost_low) / float(2*myeps)
        elapsed_time=time.time()-start_time #again taking current time - starting time 
        print (elapsed_time)
        print "Element: %d. Numerical Gradient = %f. BackProp Gradient = %f."%(x,mygrad,flattenedDs[x])

Execution

checkGradient(myThetas,[D1, D2],X,y)
(10285, 1)
0.372140884399
Element: 8125. Numerical Gradient = 0.000022. BackProp Gradient = 0.000022.
0.366824150085
Element: 6642. Numerical Gradient = -0.000070. BackProp Gradient = -0.000070.
0.368250131607
Element: 7657. Numerical Gradient = -0.000001. BackProp Gradient = -0.000001.
0.362774133682
Element: 5270. Numerical Gradient = -0.000031. BackProp Gradient = -0.000031.
0.356532096863
Element: 5933. Numerical Gradient = 0.000057. BackProp Gradient = 0.000057.
0.350827932358
Element: 5394. Numerical Gradient = 0.000001. BackProp Gradient = 0.000001.
0.391266107559
Element: 2606. Numerical Gradient = -0.000006. BackProp Gradient = -0.000006.
0.363577127457
Element: 6160. Numerical Gradient = 0.000075. BackProp Gradient = 0.000075.
0.379949092865
Element: 8722. Numerical Gradient = 0.000001. BackProp Gradient = 0.000001.
0.373861789703
Element: 931. Numerical Gradient = 0.000100. BackProp Gradient = 0.000100.

The ~0.37 seconds printed for each element above is the time to numerically estimate a single gradient entry (two full cost evaluations), so checking all 10,285 parameters this way would be prohibitively slow.
Backpropagation, in contrast, produces the entire gradient in roughly the cost of one forward and one backward pass, which is why the numerical method is used only to spot-check it.
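
For reference, the two-sided estimate used in checkGradient is

$$ \frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}, \qquad \epsilon = 10^{-4}, $$

where $e_i$ is the unit vector for the randomly chosen element $i$ (epsvec in the code).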

2.5 Learning parameters using fmincg

Training is performed with scipy's fmin_cg function.

  • Minimize a function using a nonlinear conjugate gradient algorithm.

In short:

  • Advantages
    • No need to manually pick $\alpha$.
      • An inner loop chooses an appropriate step size anew on every iteration.
    • Often faster than gradient descent.
  • Disadvantages
    • More complex.

#Here I will use scipy.optimize.fmin_cg

def trainNN(mylambda=0.):
    """
    Function that generates random initial theta matrices, optimizes them,
    and returns a list of two re-shaped theta matrices
    """

    randomThetas_unrolled = flattenParams(genRandThetas())
    result = scipy.optimize.fmin_cg(computeCost, x0=randomThetas_unrolled, fprime=backPropagate, \
                               args=(flattenX(X),y,mylambda),maxiter=50,disp=True,full_output=True)
    return reshapeParams(result[0])

The signature of fmin_cg is:
scipy.optimize.fmin_cg(f, x0, fprime=None, args=(), gtol=1e-05, norm=inf, epsilon=1.4901161193847656e-08, maxiter=None, full_output=0, disp=1, retall=0, callback=None)

  • f(x, *args): Objective function to be minimized. Here x must be a 1-D array of the variables that are to be changed in the search for a minimum, and args are the other (fixed) parameters of f.
  • fprime: callable, fprime(x, *args), optional
    A function that returns the gradient of f at x. Here x and args are as described above for f. The returned value must be a 1-D array. Defaults to None, in which case the gradient is approximated numerically (see epsilon, below).
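
A minimal toy example (my own, not part of the exercise) showing how f, x0, fprime, and args fit together; it minimizes $(x - c)^{2}$ with c fixed at 3:

import numpy as np
import scipy.optimize

def toy_cost(x, c):
    # simple quadratic bowl centered at c
    return np.sum((x - c) ** 2)

def toy_grad(x, c):
    # the gradient must come back as a 1-D array, just like backPropagate's output
    return 2.0 * (x - c)

x_opt = scipy.optimize.fmin_cg(toy_cost, x0=np.zeros(1), fprime=toy_grad,
                               args=(3.0,), disp=False)
print x_opt   # ~[ 3.]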

Running the training

#Training the NN takes about ~70-80 seconds on my machine
learned_Thetas = trainNN()

Output

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.279615
         Iterations: 50
         Function evaluations: 112
         Gradient evaluations: 112

Accuracy check

def predictNN(row,Thetas):
    """
    Function that takes a row of features, propagates them through the
    NN, and returns the predicted integer that was hand written
    """
    classes = range(1,10) + [10]
    output = propagateForward(row,Thetas)
    #-1 means last layer, 1 means "a" instead of "z"
    return classes[np.argmax(output[-1][1])] 

def computeAccuracy(myX,myThetas,myy):
    """
    Function that loops over all of the rows in X (all of the handwritten images)
    and predicts what digit was written, given the thetas. It checks whether each
    prediction is correct and computes the overall accuracy.
    """
    n_correct, n_total = 0, myX.shape[0]
    for irow in xrange(n_total):
        if int(predictNN(myX[irow],myThetas)) == int(myy[irow]): 
            n_correct += 1
    print "Training set accuracy: %0.1f%%"%(100*(float(n_correct)/n_total))

Accuracy result

computeAccuracy(X,learned_Thetas,y)
Training set accuracy: 96.2%

