1. I would prefer a Naive Bayes classifier over an SVM if my features are mostly independent of each other. A Bayes classifier can also perform well on a small training dataset, provided the independence assumption holds. An SVM would be preferred for high-dimensional data with many interactions between features, but it can be inconvenient on a large dataset, because training time grows quickly with the number of training examples.

2. Multinomial Naive Bayes is used when the frequency with which words occur in a document is also important. In this case, we eliminate certain commonly occurring words and represent the ad as a subset of the remaining words together with their number of occurrences in the document.

Consider the following ad:

Description: This gorgeous two bedroom one bathroom apartment is located in Phoenix, AZ. It is centrally located near shopping and dining. This apartment is over 1000 square feet and offers a newly renovated kitchen and bathroom.

Building Amenities: Washer/Dryer, Spacious Backyard, Garage Parking, Close to Public Transportation

Here, we will consider only a few words, for example the words highlighted in the above text, and their frequencies:

bedroom 1
bathroom 2
located 2
parking 1
backyard 1
washer 1
transportation 1
kitchen 1
apartment 2

3. Training data for POS tagging would be sentences with their POS tag sequences. A Hidden Markov Model is a Markov model in which the states are not visible; we only observe state-dependent results. In our case the states are the POS tags and the results are the words. The state transitions themselves are hidden. In the POS example, we find the tag sequence with the highest probability given an input sequence of n words:

P(t1,t2..tn | w1,w2..wn) = P(w1,w2..wn | t1,t2..tn) * P(t1,t2..tn) / P(w1,w2..wn)

where wi is the ith word and ti is the ith tag.
P(w1,w2..wn | t1,t2..tn) can be written as P(w1|t1) * P(w2|t2) * .. * P(wn|tn), since each word depends only on its current tag ti. Similarly, the current tag depends only on the previous tag and on no other values, so only t2 has a say in what t3 can be. Hence P(t1,t2..tn) can be written as P(t1) * P(t2|t1) * .. * P(tn|t(n-1)). Since there are many possible tag sequences and it is expensive to compute the probability of each of them, we use the Viterbi algorithm to find the one with the highest probability.

4. Suppose we have the word “rose”. From the training data we can see that it has been used as a verb, a noun or an adjective. With only this knowledge, we assign P(Noun) = 1/3, P(Verb) = 1/3, P(Adjective) = 1/3 for the word rose. We initialize the probabilities to equal values so as not to make any assumption about data not given to us. If we further find from the training data that rose is a verb 50% of the time, then P(Verb) = 1/2, P(Noun) = 1/4, P(Adjective) = 1/4. Since we have no information about the probability of rose being a noun or an adjective, we divide the remaining probability equally between the two, which keeps the estimate unbiased.

Maximum entropy reduces overfitting and generalization error. The advantage of combining maximum entropy with a Hidden Markov Model is that the resulting model does not assume independence between features, which a Hidden Markov Model without maximum entropy does.

5. Backpropagation: Once an input is processed, the output goes to the loss function, which measures the difference between the desired and the actual output. This difference is then used to adjust the weights of the neurons so that the actual output moves closer to the desired one, by an amount that depends on the learning rate. Since the error is calculated after seeing the output and is then propagated back through the network layers, it is called backpropagation. Calculating the error associated with each neuron and updating the weights to reduce it are the two important steps of this process.
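The two steps of backpropagation described above (compute the error, then adjust the weights against it) can be illustrated with a minimal sketch for a single sigmoid neuron. The function name `train_step`, the squared-error loss, and the values of the input, target and learning rate are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, learning_rate):
    """One forward pass and one backward pass for a single sigmoid neuron."""
    # Forward pass: compute the actual output.
    z = w * x + b
    output = sigmoid(z)
    # Loss: squared difference between the actual and the desired output.
    loss = 0.5 * (output - target) ** 2
    # Backward pass (chain rule): dLoss/dw = dLoss/dOutput * dOutput/dz * dz/dw.
    d_loss_d_output = output - target
    d_output_d_z = output * (1.0 - output)
    grad_w = d_loss_d_output * d_output_d_z * x
    grad_b = d_loss_d_output * d_output_d_z
    # Update: move the weight and bias against the gradient.
    new_w = w - learning_rate * grad_w
    new_b = b - learning_rate * grad_b
    return new_w, new_b, loss

# Hypothetical starting weights and a single training example.
w, b = 0.5, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.0, target=0.9, learning_rate=1.0)
    losses.append(loss)
```

In this sketch the recorded loss shrinks from one iteration to the next, which is the behaviour the learning rate controls: a smaller rate takes smaller steps toward the desired output.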
Backpropagation is generally used while training neural networks.

The Jacobian matrix is composed of the partial derivatives of a function that takes a vector as input and gives a vector as output. In backpropagation it is used to calculate the partial derivative of each output with respect to each weight in the graph. The size of the matrix would be number_of_output x number_of_input. The matrix helps keep track of all the derivatives.

FeedForward: FeedForward is when the current output depends only on the current input, with no dependency on earlier or future inputs. Data also flows only in the forward direction, i.e. no data goes back from the output to earlier layers. On its own, this step is what we use once we are done training the CNN and want to test it.

6. The basic difference between a sigmoid neuron and a ReLU neuron is that the former gives output values between 0 and 1, while the latter gives values in the range 0 to infinity (all negative values are replaced by zero). A sigmoid neuron would be preferable when we want a probability as output or a smoother learning curve, but it also causes gradients to vanish during backpropagation. ReLU generally works better because it is far less prone to vanishing gradients, so neurons can be trained better. Gradients vanish when many small gradient values are multiplied together across layers: the derivative of the sigmoid is at most 0.25, whereas the derivative of ReLU is 1 for every positive input. Hence for a network with a large number of layers, ReLU is preferable.

7. In the convolution operation, we try to extract specific information from the original image using a specific filter, or kernel. The kernel is much smaller than the image, and the activation map is obtained by sliding the kernel over the image and calculating the dot product at each position.
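The sliding-kernel computation described above can be sketched in plain Python as follows. The function name `convolve2d`, the 4x4 image and the 2x2 vertical-edge kernel are assumptions made up for illustration; the sketch uses no padding and a stride of 1, and, as is usual in CNNs, it does not flip the kernel.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the activation map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    activation_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Dot product between the kernel and the image patch under it.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        activation_map.append(row)
    return activation_map

# A 4x4 image whose left half is dark (0) and right half is bright (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A hypothetical 2x2 filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
activation_map = convolve2d(image, kernel)
```

For this input every row of the activation map is [0, 2, 0]: the filter responds only at the middle position, where the patch straddles the edge.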
Each filter gives us different information about the original image, e.g. a curve or an edge. We get this information after sliding the filter over the image.

I think the principle of convolution was transformational for computer vision tasks because it allowed us to stack different filters on top of each other and to extract patterns in an image, e.g. faces or cars, rather than just a single feature.

8. These methods are used for optimization and regularization of a neural network, which prevents overfitting. Since the test data will not be the same as the training data, regularization gives better results on test data because the network generalizes instead of tailoring itself to the training data. These methods have been shown to reduce the difference between observed and predicted values on test data, even though they might increase the error on training data.

DropOut: In this method, during the training phase, we randomly deactivate a fraction of the neurons (typically twenty to fifty percent) on particular layers, forcing the remaining neurons to adapt. Different neurons are dropped at each iteration. At prediction time all neurons are kept.

Batch Normalization: Batch normalization is carried out on a subpart of the training set (a batch). It helps more with optimization than with regularization. Here we normalize the input x and then scale and shift it if the network requires it, while leaving a provision for the original, unnormalized mapping to be recovered. The network can thus learn whether to use the scaled or the original input. Thanks to batch normalization, higher learning rates can be used.

9. RNNs:
- Outputs of an RNN depend on the current input as well as the previous hidden state.
- The size of the input and output can vary.
- In an RNN, there is no particular pattern we are looking for; instead we predict the next output across time.
- Generally used for language translation and modeling.

CNN:
- It is a feed-forward network, i.e. its output depends only on the current input and not on previous or future states.
- Each filter in a CNN searches for a specific feature.
Combining all the features, we can analyze the overall pattern, for example a face or a car.
- The size of the input and output is fixed.
- We are looking for the same pattern all over the image.
- Generally used for problems like image recognition.

10. The first step in calculating tf-idf values for a document is to find out how many distinct words it has and their frequencies. For example, if the document has the line “It was a cold, cold night”, then its representation would be v = (2,1), as the frequency of the word cold is 2 and that of night is 1. We do not consider stop words while representing the document. Since relevance does not grow in proportion to the raw term frequency tf(t,d), we use the log frequency weight instead:

w(t,d) = 1 + log(tf(t,d))

The inverse document frequency weight is calculated using the formula:

idf(t) = log(D / (1 + d))

where D is the total number of documents in the corpus and d is the number of documents in which term t appears.

tf-idf(t,d) = w(t,d) * idf(t)

This helps increase the weight of words which occur frequently in a particular document but are rare in other documents. Each document is now represented as a list of tf-idf scores over the list of all words. The list of words remains the same across all documents to be compared.

In singular value decomposition, documents are represented as vectors in a common space. Queries are represented in the same space too, and the documents closest to the query are our answer, with the closest ones ranked highest.

11. A collaborative filtering based recommendation system tries to find similar users and recommends the products that those similar users have liked or rated highly. Unlike content-based recommendation, here we take into account the ratings given by the users themselves rather than the features of the products that have been rated.
Suppose two users have both rated product A. Whether or not to suggest to each of them the products liked by the other depends on the ratings they gave to A: a similar rating increases the chances of such a suggestion the most, followed by a missing rating, and then a contradictory rating. Since this depends on how similar or dissimilar the choices of two users are, missing values have to be taken into account in collaborative filtering.

Predictions are based on the assumption that if two users have liked similar things before, then they will do so in the future too. So if users A and B have both rated movies 1 and 2 highly, we can see that they have similar tastes. Hence when movie 3 comes along, which is similar to 1 and 2 and has only been rated by user A, we can predict what the rating from user B will be.

12. A latent factor model helps us identify the factors which affect the ratings but are not easily visible. It helps deal with the problem of sparse data because, instead of focusing on a large number of items, we focus on factors which are combinations of those items.

Matrix Factorization: In matrix factorization, we represent each user and each item as a vector in such a way that the dot product of a user vector and an item vector is as close as possible to the rating. So R is factorized into two matrices U and P with a new dimension k, which is the rank of the factorization.

R = U x P'

where R is m x n, U is m x k and P is n x k.

Singular Value Decomposition: SVD(A) = U Σ V'

where A is an m x n matrix, U and V are matrices with orthonormal columns of dimensions m x r and n x r, and Σ is an r x r diagonal matrix of singular values.

A low-rank approximation is obtained by reducing r to a smaller value k, which maps all users and items into a k-dimensional space.

Even though SVD gives better results than matrix factorization, its cost of computation is high. Also, all missing values should be replaced by the mean in SVD.
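As an illustration of the matrix factorization R = U x P' described above, here is a minimal sketch that learns U and P from the known ratings only, using stochastic gradient descent. The training method, the function names (`factorize`, `predict`, `rmse`), the hyperparameters and the tiny ratings dictionary are all assumptions made for illustration; the answer itself does not say how the factors are learned.

```python
import random

def predict(U, P, u, i):
    """Predicted rating: dot product of user vector u and item vector i."""
    return sum(a * b for a, b in zip(U[u], P[i]))

def rmse(ratings, U, P):
    """Root mean squared error over the known ratings only."""
    errors = [(r - predict(U, P, u, i)) ** 2 for (u, i), r in ratings.items()]
    return (sum(errors) / len(errors)) ** 0.5

def factorize(ratings, num_users, num_items, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn U (num_users x k) and P (num_items x k) by gradient descent on known ratings."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(num_users)]
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(num_items)]
    for _ in range(steps):
        for (u, i), r in ratings.items():
            err = r - predict(U, P, u, i)
            for f in range(k):
                u_f, p_f = U[u][f], P[i][f]
                # Gradient step with a small regularization term.
                U[u][f] += lr * (err * p_f - reg * u_f)
                P[i][f] += lr * (err * u_f - reg * p_f)
    return U, P

# Hypothetical sparse ratings: (user, item) -> rating; missing pairs are unknown.
ratings = {(0, 0): 5, (0, 1): 4, (1, 0): 5, (1, 2): 1, (2, 1): 1, (2, 2): 5}

U0, P0 = factorize(ratings, num_users=3, num_items=3, steps=0)  # untrained factors
U, P = factorize(ratings, num_users=3, num_items=3)             # trained factors
error_before = rmse(ratings, U0, P0)
error_after = rmse(ratings, U, P)
missing_prediction = predict(U, P, 0, 2)  # estimate for a pair that was never rated
```

Because only the known ratings enter the loss, this approach does not need the missing entries to be filled in beforehand, which is the contrast with SVD drawn in the answer.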
Singular value decomposition factorizes the matrix into three matrices, while matrix factorization factorizes it into two.

13. The betweenness of an edge (a,b) is the number of pairs of vertices x and y such that (a,b) lies on the shortest path from x to y. If there are multiple shortest paths from x to y, the edge is credited with the fraction of those paths that pass through it; for example, if there are two shortest paths and (a,b) occurs on only one of them, the contribution of the pair x, y to the betweenness of (a,b) is 0.5. We calculate this in the same way for the rest of the pairs of vertices in the graph. Once we have the betweenness of all edges in the graph, communities can be found by first considering the edges with the lowest betweenness and then gradually adding edges with increasing betweenness. The size of the clusters depends on the betweenness which we allow in our community: the higher the betweenness allowed, the larger the cluster. The other method is to remove edges from the original graph, starting with the highest betweenness and moving on in decreasing order of betweenness, and to stop when we are satisfied with the clustering that remains. (Reference: MMDS Chapter 10)

14. An ideal clustering should have clusters of roughly equal size, and partitioning solely on betweenness may produce two partitions where one consists of a single point and the other holds all the remaining ones. Spectral clustering is used to identify communities of roughly the same size using Laplacian matrices.

The Laplacian matrix is L = D - A, where D is the degree matrix and A is the adjacency matrix. The smallest eigenvalue of L is 0, and its eigenvector is the all-ones vector 1. The eigenvector of the second-smallest eigenvalue is used to define the two communities. This eigenvector has a set of positive components and a set of negative components, since its components cannot all be zero and it is orthogonal to 1. Thus, we can partition the graph so that the nodes whose component is positive are in one partition while those whose component is negative are in the other.

15.
A combiner is optionally used between the map and reduce phases to collect all the values by key and pass them to the reducer for processing. So, if there are 5 keys, then the data passed to the reducer would be

rddOne = sc.parallelize(listOne).map(lambda element :(element,1))

#Output of above line would be a new RDD containing elements (1,1),(2,1),(3,1),(4,1),(5,1),(5,1). For each input element, one element is returned

rddTwo = sc.parallelize(listOne).map(lambda element: [x + element for x in [1, 2, 3]])

# Output of above line would be a new RDD having the following elements : [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7], [6, 7, 8], [6, 7, 8]. For each element it returns a list and not a key-value pair

RDD Actions:

listOne = [1, 2, 3, 4, 5, 5]

rddOne = sc.parallelize(listOne).count()

#Will return the number of elements in rddOne (integer) which in this case is 6

result = sc.parallelize(listOne).map(lambda element :(element,1)).collect()

#The value of result here is a list containing all elements in the RDD: (1,1),(2,1),(3,1),(4,1),(5,1),(5,1). collect() transfers the data from the executors to the driver.

17.

listOne = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

dictionary = {1:0, 2:1}

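def nthValue(i):
    # Return the memoized value if it has already been computed
    if i in dictionary:
        return dictionary[i]
    else:
        # Otherwise compute it from the two previous values and store it
        value = nthValue(i-1) + nthValue(i-2)
        dictionary[i] = value
        return value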

rdd = sc.parallelize(listOne).map(lambda num : (num, nthValue(num)))

print(rdd.values().collect())

18. Sparse vectors are generally used when there are a lot of zeros, while dense vectors are used when most of the values in the vector are non-zero. A vector is an array of numbers; a 1000-dimensional vector is represented by its 1st, 2nd, 3rd, …, 1000th values. If many of them are zero, storing every value wastes storage, so instead we can store only the positions and values of the non-zero entries, for example 1 3 4 6, meaning that the value at the 1st position is 3, the value at the 4th position is 6, and so on. We are representing the vector as position and value pairs.

from pyspark.mllib.linalg import Vectors

#For representing vector 1,0,4,0,8

denseVector = Vectors.dense([1, 0, 4, 0, 8])

#Arguments : Length of vector, positions at which a non-zero value is present, values in the order of the positions mentioned before

sparseVector = Vectors.sparse(5, [0, 2, 4], [1, 4, 8])
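To make the position and value representation concrete without needing Spark, the following plain-Python sketch converts between the two forms. The helper names `to_sparse` and `to_dense` are hypothetical and are not part of the PySpark API; positions are 0-based here, matching the `Vectors.sparse` call above.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Rebuild the full list, filling every unlisted position with zero."""
    dense = [0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

size, indices, values = to_sparse([1, 0, 4, 0, 8])
restored = to_dense(size, indices, values)
```

For the vector 1,0,4,0,8 this yields size 5, positions 0, 2, 4 and values 1, 4, 8, the same three arguments passed to `Vectors.sparse`. The saving grows with the share of zeros: a 1000-dimensional vector with ten non-zero entries needs only ten positions and ten values.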

19. No, print(a) will not print the contents of a. It will just give information about the tensor a. c represents the tensor obtained by matrix multiplication of a and b and hence will have dimensions 3×1. The contents of a tensor are printed when we use the eval() function. However, to use eval() we need a session. So the contents of c can be printed in the following way:

import tensorflow as tf

sess = tf.InteractiveSession()

a = tf.constant([[3, 5], [10, 11], [-1, -2]], dtype=tf.float32)

b = tf.constant([[4], [6]], dtype=tf.float32)

c = tf.matmul(a,b)

c.eval()

sess.close()

#array([[ 42.],

#       [106.],

#       [-16.]], dtype=float32)

20. Dataflow graphs are used by TensorFlow to represent the computation so as to identify the dependencies of operations on each other. Following are two advantages of computational graphs:

Parallelism: It helps us identify which operations in the computation can be run in parallel.

Portability: It is possible to have a dataflow graph generated and saved in one language and later opened in another language.

Reference: https://www.tensorflow.org/