06_Logistic_Regression

# 06: Logistic Regression

Classification
• Where y is a discrete value
• Develop the logistic regression algorithm to determine what class a new input should fall into
• Classification problems
• Email -> spam/not spam?
• Online transactions -> fraudulent?
• Tumor -> Malignant/benign
• Variable in these problems is Y
• Y is either 0 or 1
• 0 = negative class (absence of something)
• 1 = positive class (presence of something)
• Later look at multiclass classification problem, although this is just an extension of binary classification
• How do we develop a classification algorithm?
• Tumour size vs malignancy (0 or 1)
• We could use linear regression
• Then threshold the classifier output (i.e. anything over some value is yes, else no)
• In our example below linear regression with thresholding seems to work

• We can see above this does a reasonable job of stratifying the data points into one of two classes
• But what if we had a single Yes with a very small tumour
• This would lead to classifying all the existing yeses as nos
• Another issues with linear regression
• We know Y is 0 or 1
• Hypothesis can give values large than 1 or less than 0
• So, logistic regression generates a value where is always either 0 or 1
• Logistic regression is a classification algorithm - don't be confused
Hypothesis representation
• What function is used to represent our hypothesis in classification
• We want our classifier to output values between 0 and 1
• When using linear regression we did hθ(x) = (θT x)
• For classification hypothesis representation we do hθ(x) = g((θT x))
• Where we define g(z)
• z is a real number
• g(z) = 1/(1 + e-z)
• This is the sigmoid function, or the logistic function
• If we combine these equations we can write out the hypothesis as
• What does the sigmoid function look like
• Crosses 0.5 at the origin, then flattens out]
• Asymptotes at 0 and 1

• Given this we need to fit θ to our data
Interpreting hypothesis output
• When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that y=1 on input x
• Example
• If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
• hθ(x) = 0.7
• Tells a patient they have a 70% chance of a tumor being malignant
• We can write this using the following notation
• hθ(x) = P(y=1|x ; θ)
• What does this mean?
• Probability that y=1, given x, parameterized by θ
• Since this is a binary classification task we know y = 0 or 1
• So the following must be true
• P(y=1|x ; θ) + P(y=0|x ; θ) = 1
• P(y=0|x ; θ) = 1 - P(y=1|x ; θ)

Decision boundary
• Gives a better sense of what the hypothesis function is computing
• Better understand of what the hypothesis function looks like
• One way of using the sigmoid function is;
• When the probability of y being 1 is greater than 0.5 then we can predict y = 1
• Else we predict y = 0
• When is it exactly that hθ(x) is greater than 0.5?
• Look at sigmoid function
• g(z) is greater than or equal to 0.5 when z is greater than or equal to 0

• So if z is positive, g(z) is greater than 0.5
• z = (θT x)
• So when
• θT x >= 0
• 결국 위와 같을 때, 즉 Z가 0보다 크다면 0.5 이상의 값이 되어서 logistic regression은 이 때 1로 prediction을 하게 된다.
• Then hθ >= 0.5
• So what we've shown is that the hypothesis predicts y = 1 when θT x >= 0
• The corollary of that when θT x <= 0 then the hypothesis predicts y = 0
• Let's use this to better understand how the hypothesis makes its predictions
Decision boundary
• 좀 더 디테일 하게 들어가 보자.
• hθ(x) = g(θ0θ1xθ2x2)

• 최종 적으로 학습을 통해서 hypethosis는 아래와 같은 weight vector를 가진다고 가정해 보자.
• So, for example
• θ0 = -3
• θ1 = 1
• θ2 = 1
• So our parameter vector is a column vector with the above values
• So, θT is a row vector = [-3,1,1]
• What does this mean?
• 결국 위의 예제와 같이 0을 기점으로 1과 0의 prediction이 결정 된다.
• The z here becomes θT x
• We predict "y = 1" if
• -3x0 + 1x1 + 1x2 >= 0
• -3 + x1 + x2 >= 0
• We can also re-write this as
• If (x1 + x2 >= 3) then we predict y = 1
• If we plot
• x1 + x2 = 3 we graphically plot our decision boundary

• Means we have these two regions on the graph
• Blue = false
• Magenta = true
• Line = decision boundary
• Concretely, the straight line is the set of points where hθ(x) = 0.5 exactly
• The decision boundary is a property of the hypothesis
• Means we can create the boundary with the hypothesis and parameters without any data
• Later, we use the data to determine the parameter values
• i.e. y = 1 if
• 5 - x1 > 0
• 5 > x1
Non-linear decision boundaries
• 비선 형에 대해서도 고차원 다항식을 통해서 decision boundary를 얻을 수 있다.
• Get logistic regression to fit a complex non-linear data set
• Like polynomial regress add higher order terms
• So say we have
• hθ(x) = g(θ0θ1x1θ3x12θ4x22)
• We take the transpose of the θ vector times the input vector
• Say θT was [-1,0,0,1,1] then we say;
• Predict that "y = 1" if
• -1 + x12x22 >= 0
or
• x12x22 >= 1
• If we plot x12x22 = 1 (원의 방정식)
• This gives us a circle with a radius of 1 around 0

• Mean we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
• More complex decision boundaries?
• By using higher order polynomial terms, we can get even more complex decision boundaries

모든 다 할 수 있다. 복잡한 다항식을 이용 한다면,

Cost function for logistic regression
• Fit θ parameters
• Define the optimization object for the cost function we use the fit the parameters
• Training set of m training examples
• Each example has is n+1 length column vector

• This is the situation
• Set of m training examples
• Each example is a feature vector which is n+1 dimensional
• x0 = 1
• y ∈ {0,1}
• Hypothesis is based on parameters (θ)
• Given the training set how to we chose/fit θ?
• Linear regression uses the following function to determine θ

• Instead of writing the squared error term, we can write
• If we define "cost()" as;
• cost(hθ(xi), y) = 1/2(hθ(xi) - yi)2
• Which evaluates to the cost for an individual example using the same measure as used in linear regression
• We can redefine J(θ) as

• Which, appropriately, is the sum of all the individual costs over the training data (i.e. the same as linear regression)

• To further simplify it we can get rid of the superscripts
• So
• 위에서 정의한 Cost Function은 linear regression에서는 optimization 알고리즘인 gradient descent로 잘 동작 했지만, hypothesis function이 sigmoid로 바뀐다면 이것은 optimize가 잘 되지 않는 문제점이 있다.
• 결국 다른 cost function의 구조를 찾아야 한다.
• What does this actually mean?
• This is the cost you want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
• If we use this function for logistic regression this is a non-convex function for parameter optimization
• Could work....
• What do we mean by non convex?
• We have some function - J(θ) - for determining the parameters
• Our hypothesis function has a non-linearity (sigmoid function of hθ(x) )
• This is a complicated non-linear function
• If you take hθ(x) and plug it into the Cost() function, and them plug the Cost() function into J(θ) and plot J(θ) we find many local optimum -> non convex function
• 만약 이상태로 gradient descent algorithm을 실행 한다면, global minimum으로 converge 되지 않는다.

• Why is this a problem
• Lots of local minima mean gradient descent may not find the global optimum - may get stuck in a global minimum
• We would like a convex function so if you run gradient descent you converge to a global minimum
A convex logistic regression cost function
• To get around this we need a different, convex Cost() function which means we can apply gradient descent

• This is our logistic regression cost function
• This is the penalty the algorithm pays
• Plot the function
• Plot y = 1
• So hθ(x) evaluates as -log(hθ(x))

위 그래프가 우리가 원하는 cost function이다.
즉, y=1이고 prediction이 1이라면, -log (1)의 값은 0이다. 즉 penalty가 없다.
하지만 prediction이 0에 가까울 수록 -log(x)의 값은 커지므로 penalty 값이 증가하게 된다.

• So when we're right, cost function is 0
• Else it slowly increases cost function as we become "more" wrong
• X axis is what we predict
• Y axis is the cost associated with that prediction
• This cost functions has some interesting properties
• If y = 1 and hθ(x) = 1
• If hypothesis predicts exactly 1 and thats exactly correct then that corresponds to 0 (exactly, not nearly 0)
• As hθ(x) goes to 0
• Cost goes to infinity
• This captures the intuition that if hθ(x) = 0 (predict (y=1|x; θ) = 0) but y = 1 this will penalize the learning algorithm with a massive cost
• What about if y = 0
• then cost is evaluated as -log(1- hθ( x ))
• Just get inverse of the other function

• Now it goes to plus infinity as hθ(x) goes to 1
• With our particular cost functions J(θ) is going to be convex and avoid local minimum
Simplified cost function and gradient descent
• Define a simpler way to write the cost function and apply gradient descent to the logistic regression
• By the end should be able to implement a fully functional logistic regression function
• Logistic regression cost function is as follows

• This is the cost for a single example
• For binary classification problems y is always 0 or 1
• Because of this, we can have a simpler way to write the cost function
• Rather than writing cost function on two lines/two cases
• Can compress them into one equation - more efficient
• Can write cost function is
• cost(hθ, (x),y) = -ylog( hθ(x) ) - (1-y)log( 1- hθ(x) )
• This equation is a more compact of the two cases above
• We know that there are only two possible cases
• y = 1
• Then our equation simplifies to
• -log(hθ(x)) - (0)log(1 - hθ(x))
• -log(hθ(x))
• Which is what we had before when y = 1
• y = 0
• Then our equation simplifies to
• -(0)log(hθ(x)) - (1)log(1 - hθ(x))
• = -log(1- hθ(x))
• Which is what we had before when y = 0
• Clever!
• So, in summary, our cost function for the θ parameters can be defined as

이러한 convexity analysis는 수업의 scope를 넘어선다. 일단 위와 같은 cost function을 cross entropy라고도 부른다.

• Why do we chose this function when other cost functions exist?
• This cost function can be derived from statistics using the principle of maximum likelihood estimation
• Note this does mean there's an underlying Gaussian assumption relating to the distribution of features
• Also has the nice property that it's convex
• To fit parameters θ:
• Find parameters θ which minimize J(θ)
• This means we have a set of parameters to use in our model for future predictions
• Then, if we're given some new example with set of features x, we can take the θ which we generated, and output our prediction using

• This result is
• p(y=1 | x ; θ)
• Probability y = 1, given x, parameterized by θ
How to minimize the logistic regression cost function
• Now we need to figure out how to minimize J(θ)
• Use gradient descent as before
• Repeatedly update each parameter using a learning rate

편미분에 의해서 구해진 식이다.

이해가 안된다면 이전 포스트 참조

• If you had features, you would have an n+1 column vector for θ
• This equation is the same as the linear regression rule
• The only difference is that our definition for the hypothesis has changed
• Previously, we spoke about how to monitor gradient descent to check it's working
• Can do the same thing here for logistic regression
• When implementing logistic regression with gradient descent, we have to update all the θ values (θ0 to θn) simultaneously
• Could use a for loop
• Better would be a vectorized implementation
• Feature scaling for gradient descent for logistic regression also applies here

• Previously we looked at gradient descent for minimizing the cost function
• Here look at advanced concepts for minimizing the cost function for logistic regression
• Good for large machine learning problems (e.g. huge feature set)
• What is gradient descent actually doing?
• We have some cost function J(θ), and we want to minimize it
• We need to write code which can take θ as input and compute the following
• J(θ)
• Partial derivative if J(θ) with respect to j (where j=0 to j = n)

• Given code that can do these two things
• Gradient descent repeatedly does the following update

• So update each j in θ sequentially
• So, we must;
• Supply code to compute J(θ) and the derivatives
• Then plug these values into gradient descent
• Alternatively, instead of gradient descent to minimize the cost function we could use
• BFGS (Broyden-Fletcher-Goldfarb-Shanno)
• L-BFGS (Limited memory - BFGS)
• These are more optimized algorithms which take that same input and minimize the cost function
• These are very complicated algorithms
• Some properties
• No need to manually pick alpha (learning rate)
• Have a clever inner loop (line search algorithm) which tries a bunch of alpha values and picks a good one
• Often faster than gradient descent
• Do more than just pick a good learning rate
• Can be used successfully without understanding their complexity
• Could make debugging more difficult
• Should not be implemented themselves
• Different libraries may use different implementations - may hit performance
• How to use algorithms
• Say we have the following example

• Example above
• θ1 and θ2 (two parameters)
• Cost function here is J(θ) = (θ1 - 5)2 + θ2 - 5)2
• The derivatives of the J(θ) with respect to either θ1 and θ2 turns out to be the 2(θi - 5)
• First we need to define our cost function, which should have the following signature
• Input for the cost function is THETA, which is a vector of the θ parameters
• Two return values from costFunction are
• jval
• How we compute the cost function θ (the underived cost function)
• In this case = (θ1 - 5)2 + (θ2 - 5)2
• 2 by 1 vector
• 2 elements are the two partial derivative terms
• i.e. this is an n-dimensional vector
• Each indexed value gives the partial derivatives for the partial derivative of J(θ) with respect to θi
• Where i is the index position in the gradient vector
• With the cost function implemented, we can call the advanced algorithm using
options= optimset('GradObj', 'on', 'MaxIter', '100'); % define the options data structure
initialTheta= zeros(2,1); # set the initial dimensions for theta % initialize the theta values
[optTheta, funtionVal, exitFlag]= fminunc(@costFunction, initialTheta, options); % run the algorithm
• Here
• options is a data structure giving options for the algorithm
• fminunc
• function minimize the cost function (find minimum of unconstrained multivariable function)
• @costFunction is a pointer to the costFunction function to be used
• For the octave implementation
• initialTheta must be a matrix of at least two dimensions
• How do we apply this to logistic regression?
• Here we have a vector

• Here
• theta is a n+1 dimensional column vector
• Octave indexes from 1, not 0
• Write a cost function which captures the cost function for logistic regression

Multiclass classification problems
• Getting logistic regression for multiclass classification using one vs. all
• Multiclass - more than yes or no (1 or 0)
• Classification with multiple classes for assignment

• Given a dataset with three classes, how do we get a learning algorithm to work?
• Use one vs. all classification make binary classification work for multiclass classification
• One vs. all classification
• Split the training set into three separate binary classification problems
• i.e. create a new fake training set
• Triangle (1) vs crosses and squares (0) hθ1(x)
• P(y=1 | x1θ)
• Crosses (1) vs triangle and square (0) hθ2(x)
• P(y=1 | x2θ)
• Square (1) vs crosses and square (0) hθ3(x)
• P(y=1 | x3θ)
Train a logistic regression classifier h_theta^(i) (x) for each class i
to predict the probability that you = i .

On a new input x, to make a prediction, pick the class i that maximizes:

max h_\theta ^(i) (x)
i

• Overall
• Train a logistic regression classifier hθ(i)(x) for each class i to predict the probability that y = i
• On a new input, x to make a prediction, pick the class i that maximizes the probability that hθ(i)(x) = 1

#### 'MOOC > Machine Learning (python)' 카테고리의 다른 글

 Week 03: Regularization and Quiz  (0) 2016.07.04 2016.06.20 2016.06.20 2016.04.17 2016.01.06 2015.12.20