Problem 1

Represent the derivative of each of the following scalar functions with respect to $X \in \mathbb{R}^{D \times D}$.

(a) $f(X) = \mathrm{tr}(X^2)$. Here, $\mathrm{tr}(A)$ is the trace of a square matrix $A$.
(b) $g(X) = \mathrm{tr}(X^3)$.
(c) $h(X) = \mathrm{tr}(X^k)$ for $k \in \mathbb{N}$.

Problem 2

To alleviate overfitting in logistic regression, a regularization technique can be used with the squared $L_2$-norm of the weight parameter, $\|w\|^2 = \sum_{i=1}^{D} w_i^2$. Given a training set $x_1, \ldots, x_N \in \mathbb{R}^D$ and $y_1, \ldots, y_N \in \{0, 1\}$, we want to derive the update rule for $w \in \mathbb{R}^D$ that minimizes the following $L_2$-regularized loss function $L(w)$:

$L(w) = -\sum_{i=1}^{N} \Big( y_i \ln f(x_i; w) + (1 - y_i) \ln\big(1 - f(x_i; w)\big) \Big) + \|w\|^2$    (1)

$f(x; w) = \dfrac{1}{1 + \exp(-w^\top x)}$    (2)

(1) Derive the update rule for $w$ when we use gradient descent.
(2) Discuss the effect of $L_2$-norm regularization.

Problem 3

Consider the following function,

$f(x; w) = \dfrac{1 - \exp(-w^\top x)}{1 + \exp(-w^\top x)}$,    (3)

having the shape shown in Fig. 1.

Figure 1: $f(x; w)$

Find the gradient descent update rule for $w$ that minimizes the loss

$L = \dfrac{1}{2} \sum_{i=1}^{N} \big( f(x_i; w) - y_i \big)^2$    (4)

Here, we use $N$ samples $x_i \in \mathbb{R}^D$ with labels $y_i \in \{-1, 1\}$, $i \in \{1, \ldots, N\}$, which are given in advance.

Problem 4

For a given data set $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$, derive the closed-form solution for $w$ that maximizes the following probability $P$,

$P = \prod_{i=1}^{N} \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\dfrac{1}{2\sigma^2} \big( y_i - f(x_i; w) \big)^2 \right)$,    (5)

with $f(x; w) = w^\top x$ for $x, w \in \mathbb{R}^D$.

Problem 5

The Frobenius dot product $\langle A, B \rangle$ is defined as

$\langle A, B \rangle = \mathrm{tr}(A^\top B)$,    (6)

for $A, B \in \mathbb{R}^{N \times N}$, using the scalar function $\mathrm{tr}(\cdot)$ for the trace of a matrix,

$\mathrm{tr}(M) = \sum_{i=1}^{N} M_{ii}$, for $M \in \mathbb{R}^{N \times N}$.    (7)

Find the derivative of $\langle A, B \rangle$ with respect to $A$ using the definition of the matrix derivative:

$\left[ \dfrac{d}{dA} \langle A, B \rangle \right]_{ij} = \dfrac{\partial}{\partial A_{ij}} \langle A, B \rangle$    (8)

Problem 6

We are given a scalar function $f(X) = w^\top X w$ for $w \in \mathbb{R}^D$, $X \in \mathbb{R}^{D \times D}$. Find the derivative of $f(X)$ with respect to the vector $w$. Use the definition of the vector derivative $\left[ \dfrac{df}{dw} \right]_k = \dfrac{\partial f}{\partial w_k}$, where $w_k$ is the $k$-th element of $w$.

Problem 7

For a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^D$ and $y_i \in \{0, 1\}$, suppose that we have a two-layer neural network with residual connections, shown in Figure 2. Each component is given as follows:

$L(W) = \dfrac{1}{2} \sum_{i=1}^{N} \big( g(x_i) - y_i \big)^2$    (9)

$g(x) = \sigma\!\left( \sum_{d=1}^{D} w_{2,d} \cdot h_d(x) \right)$    (10)

$h_i(x) = x_i + \sigma\!\left( \sum_{m=1}^{M} w_{1,i,m} \cdot z_m(x) \right)$    (11)

$z_i(x) = \sigma\!\left( \sum_{d=1}^{D} w_{0,i,d} \cdot x_d \right)$    (12)

$\sigma(x) = \dfrac{1}{1 + \exp(-x)}$    (13)

Here, the residual connection from each $x_i$ to the node $h_i$ in the second layer is represented in Eq. (11). The weight $w_{i,j,k}$ connects the $k$-th node of the $i$-th layer to the $j$-th node of the $(i+1)$-th layer, as used in Eqs. (10)-(12).

Figure 2: A two-layer network

(a) Calculate the derivative of $L$ with respect to $h_j$.
(b) Calculate the derivative of $L$ with respect to $w_{1,d,m}$.
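For Problem 1, one way to sanity-check whatever closed form is derived for $\frac{d}{dX}\mathrm{tr}(X^k)$ is a central-difference estimate built entry by entry from the definition of the matrix derivative. This is a minimal NumPy sketch; the helper name `numerical_grad`, the dimension $D = 4$, and the power $k = 3$ are arbitrary illustrative choices.

```python
import numpy as np

def numerical_grad(f, X, eps=1e-6):
    """Central-difference estimate of d f(X) / d X_ij for a scalar function f."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

D, k = 4, 3
X = np.random.randn(D, D)
trace_power = lambda X: np.trace(np.linalg.matrix_power(X, k))  # h(X) = tr(X^k)
G = numerical_grad(trace_power, X)   # compare against the derived expression
print(G.round(4))
```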
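For Problem 2, the gradient descent update $w \leftarrow w - \eta\, \frac{dL}{dw}$ can be prototyped before the analytic gradient is derived by substituting a finite-difference estimate of $\frac{dL}{dw}$. A minimal NumPy sketch, assuming a synthetic dataset, a step size `eta = 0.01`, and 200 iterations, all of which are illustrative choices:

```python
import numpy as np

def loss(w, X, y):
    # L(w) from Eq. (1) with the sigmoid model f of Eq. (2)
    p = 1.0 / (1.0 + np.exp(-X @ w))          # f(x_i; w) for every row of X
    p = np.clip(p, 1e-12, 1 - 1e-12)          # guard the logarithms
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + w @ w

def numerical_grad(w, X, y, eps=1e-6):
    # finite-difference stand-in for the gradient derived in part (1)
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
N, D, eta = 100, 5, 0.01                       # eta: assumed step size
X = rng.normal(size=(N, D))
y = (X @ rng.normal(size=D) > 0).astype(float) # synthetic labels in {0, 1}
w = np.zeros(D)
for _ in range(200):                           # w <- w - eta * dL/dw
    w -= eta * numerical_grad(w, X, y)
print(loss(w, X, y))
```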
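For Problem 3, it may help to note that, with the signs of Eq. (3) as reconstructed above, $f(x; w) = \tanh(w^\top x / 2)$, which simplifies the derivative. A quick numerical check of this identity (the random $w$, $x$ of dimension 4 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=4)
x = rng.normal(size=4)
a = w @ x
f = (1 - np.exp(-a)) / (1 + np.exp(-a))    # Eq. (3)
print(np.isclose(f, np.tanh(a / 2)))       # f(x; w) = tanh(w^T x / 2)
```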
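For Problem 4, taking the logarithm of Eq. (5) shows that maximizing $P$ is the same as minimizing $\sum_i \big(y_i - w^\top x_i\big)^2$, so the closed form derived on paper should agree with a numerical least-squares fit. A sketch under assumed synthetic data (`sigma = 0.5`, $N = 200$, $D = 3$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, sigma = 200, 3, 0.5
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + sigma * rng.normal(size=N)

def neg_log_P(w):
    # negative logarithm of the product in Eq. (5), dropping w-independent constants
    return np.sum((y - X @ w) ** 2) / (2 * sigma**2)

w_num, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerical maximizer of P
print(neg_log_P(w_num) <= neg_log_P(w_true))    # sanity check: should print True
```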
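For Problems 5 and 6, the element-wise definitions in Eq. (8) and in $\left[\frac{df}{dw}\right]_k = \frac{\partial f}{\partial w_k}$ translate directly into central-difference estimates, which the derived expressions can be compared against. A NumPy sketch with arbitrary sizes $N = 4$, $D = 5$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 4, 5
A, B = rng.normal(size=(N, N)), rng.normal(size=(N, N))
X, w = rng.normal(size=(D, D)), rng.normal(size=D)
eps = 1e-6

# Problem 5: [d/dA <A,B>]_ij = d/dA_ij tr(A^T B), estimated entry by entry
dA = np.zeros_like(A)
for i in range(N):
    for j in range(N):
        E = np.zeros_like(A)
        E[i, j] = eps
        dA[i, j] = (np.trace((A + E).T @ B) - np.trace((A - E).T @ B)) / (2 * eps)

# Problem 6: [df/dw]_k = d/dw_k (w^T X w), estimated element by element
dw = np.zeros_like(w)
for k in range(D):
    e = np.zeros(D)
    e[k] = eps
    dw[k] = ((w + e) @ X @ (w + e) - (w - e) @ X @ (w - e)) / (2 * eps)

print(dA.round(4))
print(dw.round(4))
```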
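For Problem 7, the derivatives asked for in (a) and (b) can be checked against finite differences of $L$ once the forward pass of Eqs. (9)-(13) is written out. The sketch below assumes layer shapes `W0` of size $(M, D)$, `W1` of size $(D, M)$, and `w2` of size $D$, which follow from the equations as reconstructed above; the data sizes $N = 20$, $D = 3$, $M = 4$ are arbitrary.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))    # Eq. (13)

def forward(x, W0, W1, w2):
    # z, h, g as in Eqs. (10)-(12); W0 is (M, D), W1 is (D, M), w2 is (D,)
    z = sigma(W0 @ x)                  # z_i(x) = sigma(sum_d w_{0,i,d} x_d)
    h = x + sigma(W1 @ z)              # h_i(x) = x_i + sigma(sum_m w_{1,i,m} z_m(x))
    return sigma(w2 @ h)               # g(x)  = sigma(sum_d w_{2,d} h_d(x))

def loss(X, y, W0, W1, w2):
    # L(W) from Eq. (9)
    preds = np.array([forward(x, W0, W1, w2) for x in X])
    return 0.5 * np.sum((preds - y) ** 2)

rng = np.random.default_rng(4)
N, D, M = 20, 3, 4
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N).astype(float)
W0, W1, w2 = rng.normal(size=(M, D)), rng.normal(size=(D, M)), rng.normal(size=D)
print(loss(X, y, W0, W1, w2))   # perturb W1[d, m] to compare (b) with finite differences
```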