Questions should be asked frequently! Please write them down and give them to me or email them, so that I can add more detailed explanations to my lecture notes.

Name: ???

Background: ???

Interest (science only): ???

Expectations: ???

Courses related to this one: ???

Questions, anything else ???

Unfortunately there is not much time during lectures for discussion and for answering your questions, so please send me your questions; this will help me to improve the lectures. Lecture time is short and there is always so much more to say ...

Thank you for sending me the questions presented below; they are answered below.
Please let me know if you would like me to add your name to your question; by default I will copy only the question.

Giving these lectures I try to find a tradeoff between giving you a solid foundation and an overview of many methods that may be useful in your work, showing their basis and their practical applications, without being too shallow in the analysis of these methods. Unfortunately we cannot spend enough time on a single topic to explore it in depth. For undergraduates it may be better to focus on a few topics and make a very detailed study; some books do this well. For graduate students it may be more important to get an overview and follow up on your own in the direction that is useful to you; this is usually much harder to get from books, because it would require reading a number of books and papers rather carefully. Perhaps this approach will be more beneficial for you, but please let me know if you have any thoughts on this issue.

I should probably organize the questions by lecture, but right now the newest questions are at the top and older questions are below.

- Q: Do P+ and P+(M) have the same meaning?
I notice P+ is used for the computation of sensitivity while P+(M) is used for precision.
A: P+ is the a priori probability of class +, calculated from the data as N+/N. P+(M) is the predicted fraction of samples in class +, calculated from the predictions of the decision system, not directly from the data. If the real class proportions are P+=0.8, P-=0.2, but your system is a majority classifier that always says "choose C+", then P+(M)=1, P-(M)=0. So they are very different.
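A small numerical illustration of this difference — a hedged sketch with made-up labels, not from the slides: with hypothetical class labels and a majority classifier we can compute P+, P+(M), sensitivity and precision directly.

```python
import numpy as np

# Hypothetical data: 4 samples of class +1, 1 of class -1 (so P+ = 0.8),
# and a majority classifier that always predicts +1.
y_true = np.array([+1, +1, +1, +1, -1])
y_pred = np.array([+1, +1, +1, +1, +1])

P_plus   = np.mean(y_true == +1)                   # a priori class fraction, from the data
P_plus_M = np.mean(y_pred == +1)                   # predicted class fraction, from the model
sensitivity = np.mean(y_pred[y_true == +1] == +1)  # S+ = P++ / P+
precision   = np.mean(y_true[y_pred == +1] == +1)  # fraction of predicted + that are real +
print(P_plus, P_plus_M, sensitivity, precision)    # 0.8 1.0 1.0 0.8
```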

- Q: In the last slide of Lec 16, AUC = (S_{+} + S_{-})/2 and it is identical along the S_{+} + (1 - S_{-}) = Const line.

But if S_{+} + (1 - S_{-}) = Const, then S_{+} = Const - (1 - S_{-}), hence AUC = (S_{+} + S_{-})/2 = (Const - (1 - S_{-}) + S_{-})/2 = (Const - 1)/2 + S_{-}. This should not be identical for all cases, right?

Or should this be the "S_{+} - (1 - S_{-}) = Const" line?

A: Not quite so ... the equality AUC = (S_{+} + S_{-})/2 is true for this one point; it is not a definition of AUC! Calculate the area under the curve as a sum of two triangles and a rectangle to get AUC, as seen in the slide, and then look at the line where it is constant: this line goes through points A and B and is a shifted y = x line, that is S_{+} = S_{-} + constant.

- Q: In Lecture 13, slide number 4, the fourth line which says k = arg min .....

In my understanding, when determining which class to predict, we should do the following: say there are 3 choices of prediction A, B, C; calculate Ra, Rb, Rc and choose the minimum:

Ra: the expected risk if we choose to predict A

Rb: the expected risk if we choose to predict B

Rc: the expected risk if we choose to predict C

Based on this, I think the formula should be k = arg min_l of sum_{j=1}^{K} lambda(jl) ..., not lambda(lj) as written in the lecture note (this is because the prediction is l, not j).

A: Indeed ... but also in the formula above the indices were not correct; please see the corrected slide. Thank you!
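A minimal sketch of this minimum-risk rule; the cost matrix and posterior probabilities below are made-up numbers, just for illustration:

```python
import numpy as np

# lam[j, l] = cost of choosing class l when the true class is j (hypothetical values)
lam = np.array([[0.0, 1.0, 2.0],
                [1.0, 0.0, 1.0],
                [5.0, 1.0, 0.0]])
post = np.array([0.1, 0.3, 0.6])   # P(w_j | X)

# conditional risk of each decision l: R_l = sum_j lam[j, l] * P(w_j | X)
risk = post @ lam
k = int(np.argmin(risk))           # choose the lowest-risk class
print(risk, k)                     # [3.3 0.7 0.5] 2
```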

- Q: Another question about SOM, Olive Oil example. In my understanding, SOM is unsupervised learning (for clustering), the training data comes without any class label. Then how to use SOM for classification? As shown in Olive Oil example, how to determine the label 1,2,3,..? The nodes will have 8 dimensional weights (thus they are 8D), how they could be visualized in 2D?
A: Although this is unsupervised, we can at the end label each node with the class label of the vector that activates the node most strongly (i.e. is most similar to the weights of this node). Then the map will have labels, as shown in this example; given a new query vector X we find the winner node and use its label for classification.
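A tiny sketch of this labeling step, assuming the SOM weights are already trained; all numbers here are made up:

```python
import numpy as np

# Weights of two (already trained) SOM nodes in a 2-D data space.
node_weights = np.array([[0.1, 0.1],
                         [0.9, 0.8]])
X = np.array([[0.0, 0.2],      # training vectors ...
              [0.2, 0.0],
              [1.0, 0.9]])
y = np.array(["A", "A", "B"])  # ... and their class labels

labels = []
for w in node_weights:
    # label the node with the class of the training vector most similar to its weights
    i = int(np.argmin(np.linalg.norm(X - w, axis=1)))
    labels.append(str(y[i]))
print(labels)   # ['A', 'B']
```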

- Q: In my understanding, SOM's output is a list of nodes each with d-dimensional weight (so they are only points). In lecture 9, slides with title 2D=>2D, the result is a set of points connected with lines, how to determine the lines?
A: Lines connect neighbors on the grid, that is, in the space of grid coordinates, not in the data space. In the 1D case, for example, each node has two neighbors; initially they may have d-dimensional weights that are quite different, but after training they point to nearby areas in the data space, so the points are shown close to each other. In the 2D case at the end we see that points have 4 immediate neighbors (or 2 at the corners); after training, neighbors on the grid, connected with the lines, also point to neighboring parts of the space.

- Q: Is it correct to assume that the term Sum_i P(Xi|Yj)= 1?
A: This is obviously true: for any fixed value Yj the sum of probabilities over Xi is the probability that Xi has some value, and this is 1.

- Q: In lecture 8 slide 2, does it mean that if the value of the 4th moment is large, then Y is not a gaussian distribution? For e.g, for a super-gaussian distribution, the larger K4(Y)>0, the peak of the distribution is more concentrated, thus it is not a gaussian distribution?
A: Yes; note that for a Gaussian all higher cumulants are 0, which is why it is fully characterized by the mean and dispersion!
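This is easy to check numerically; a sketch using scipy's excess kurtosis K4 (assuming scipy is available):

```python
import numpy as np
from scipy.stats import kurtosis   # returns excess kurtosis: ~0 for a Gaussian

rng = np.random.default_rng(0)
g = rng.normal(size=200_000)       # Gaussian sample: K4 close to 0
s = rng.laplace(size=200_000)      # super-Gaussian (sharply peaked) sample: K4 > 0

print(kurtosis(g), kurtosis(s))    # roughly 0 and roughly 3
```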

- Q: In lecture 8 slide 6, why is the variance E{(W^T.X)^2)} ? What about the second term for variance E{(W^T.X))}^2?
A: For standardized data the mean E{W^T.X} is zero, so the second term vanishes.

- Q: I'm confused about lecture 7 slide 6, may I know what is the gist of this slide?
A: It is hard to answer such general questions; try to listen to my comments on the audio track and ask something more specific, otherwise I'll have to repeat a part of the lecture in writing ...

- Q: In Lecture 34, slide 9, you mentioned the use of feature filters to select features. However, I am still unsure how these feature filters can be used. Do they use "information theory" to evaluate/estimate the usefulness of a feature?
A: Very simple: compute the mutual information index MI between the class and the feature values, and order (rank) the features Xi according to the index value; the most informative features will have high MI(Xi), and those that have nothing to do with your task (class label assignment) will be at the end, with low MI(Xi). It is easy to create more such indices, for example taking the linear correlation coefficient as a ranking index. I'll return to this in the next lecture.
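A sketch of such MI ranking with scikit-learn; the two synthetic features below are invented for illustration:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                 # binary class labels
noise = rng.normal(size=500)                     # feature unrelated to the class
informative = y + 0.3 * rng.normal(size=500)     # feature strongly tied to the class
X = np.column_stack([noise, informative])

mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]                   # most informative feature first
print(mi, ranking)                               # feature 1 should rank first
```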

- Q: In Lecture 29 Slide 6, last paragraph, you wrote that
"If ||X|| = 1, replace W.X by ||W-X||^2 and decision borders will still be linear".
The requirement that ||X|| = 1 indicates that after standardizing the data vector X, we can replace the scalar dot product by Euclidean distance, and still get the linear decision borders right?
A: ||X||=1 is not the same as standardization: here for each vector the length is 1, while in standardization for each feature the std is 1. The first sums over all features of a single vector, the second sums over all vectors for a single feature. To get ||X||=1 one can simply add an extra component xe to each vector X, so that ||(X,xe)||=1, or xe^2=1-||X||^2; here I assume that all X have been rescaled to have initially ||X||<1.
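The augmentation in the answer can be sketched as follows, assuming the rows were already rescaled to ||x|| < 1:

```python
import numpy as np

def augment_to_unit_norm(X):
    # append component xe = sqrt(1 - ||x||^2) so that ||(x, xe)|| = 1 for each row;
    # assumes every row already satisfies ||x|| < 1
    X = np.asarray(X, dtype=float)
    xe = np.sqrt(1.0 - np.sum(X**2, axis=1))
    return np.column_stack([X, xe])

X = np.array([[0.3, 0.4],
              [0.0, 0.5]])
Xa = augment_to_unit_norm(X)
print(np.linalg.norm(Xa, axis=1))   # [1. 1.]
```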

- Q: In Slide 12 of Lecture 29, for the last equation, you summed over n(c). What is n(c)? Is n(c) the number of data vectors that belongs to class c?

A: Yes.

- Q: If this is the case, then can we interpret the network as follows: we submit all the data vectors belonging to class 1 to the network. We seek that most data vectors belonging to class 1 will have a strong output in F1(X) of the network. To achieve this, the node parameters must be modified; they are modified by training with the data vectors belonging to class 1. Then in the final layer of the network we sum over all the results (hence summing over n(c)). If we use rectangular functions, we will derive crisp logical rules in F1(X) that can be used to identify the data vectors belonging to class 1. Simplification of these rules will thus give the decision rule to identify data vectors of class 1. We continue the above steps for class 2, ..., class k to derive their decision rules. Is the above correct?

A: Yes.

- Q: In Slide 8 of Lecture 30, you commented that for PRISM "Rules frequently contain single condition, sometimes two, rarely more conditions".

I am unsure why this is so. Consider the fact that at each iteration of PRISM, we search in class c "all" features/attributes that can be used to discriminate c from other classes; shouldn't it then be the case that each rule derived by PRISM contains many conditions (limited only by the number of dimensions of the data vectors)? Furthermore, you mentioned that PRISM results in a large number of rules. However, based on the algorithm in slide 7, it appears that there will be only one decision rule per class, although there will be many antecedents in each rule.

A: Please run some examples with PRISM using Yale or Weka; it is the best way to see what happens. The results depend very much on the data character and on discretization, but typically many rules arise because A=v may have many values, so this is the case even though a small number of rule conditions is created; also, optimizing only precision leads to rules covering a small number of examples, and thus to many rules.

- Q: Lecture 26, slide 8, how can an expert tell expressions about Theta while Theta is the one we want to find?
A: It is quite common that we know some functional dependence but do not know the parameters; for example, if you see a catapult throwing stones from point A and falling at distances X you may try to calculate the angle and the speed, because physics tells you that the trajectory is a parabola. So someone may have measured how the number of different species depends on parameters, but given particular data we do not know the actual parameters.

- Q: What is the difference between Linear Regression and Linear Discrimination? To me both look the same while they are different. And then what is classification via regression?
A: In regression we have y(X)=W*X, and y is a real-valued function that is used to fit the data (for example a line going through some data); in classification all we care about is the class, so y is a two-valued function, for example +1 and -1, so the regression function is binarized. We can use regression methods and then take decisions using some threshold: y(X)>T => C=+1, otherwise -1.
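A minimal sketch of classification via regression on made-up 2-D data: fit y(X) = w·x + b by least squares to ±1 targets, then binarize with the threshold T = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(50, 2)),   # class +1 cloud
               rng.normal(-2.0, 1.0, size=(50, 2))])  # class -1 cloud
y = np.array([+1.0] * 50 + [-1.0] * 50)

A = np.column_stack([X, np.ones(len(X))])   # add a bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # least mean square regression fit
pred = np.where(A @ w > 0.0, +1.0, -1.0)    # y(X) > T  =>  C = +1, otherwise -1
acc = np.mean(pred == y)
print(acc)                                  # training accuracy of the binarized fit
```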

So the two approaches look similar, but there is a difference in the conditions that are set. For example, in regression it could be the least mean square error, and for classification the SVM conditions, which are not suitable for regression.

- Q: Lecture 22, slide 2, WEKA example: what do those weights assigned to the alcohol attribute mean? Here there is only 1 feature selected, alcohol. On slide 3 there is a similar example, but with 3 features and different weights. What does that mean? Maybe it is something simple, but I am missing an important point, so I do not understand it.
A: I wrote only the largest coefficient, but the LDA solution contains more, and some are set to zero. If only one non-zero coefficient, corresponding to a feature f1, is left, then the decision is made using a threshold on this feature, for example the alcohol level. If there is one large and a few small coefficients, the value of the feature corresponding to the large coefficient is the most important: when it changes a bit, the whole function changes significantly, so it will largely decide whether y(X)>T or not.

- Q: The scalar product of vectors, Lecture 23, slide 9, which you said will be replaced by a kernel, is between 2 samples. How is it replaced with a kernel, while for the kernel we have different dimensions, as products of different features in the original space, not vectors?
A: We simply define a kernel function K(X,Y)=X*Y in this linear case; all is in d dimensions. The point is that if a transformation X->Phi(X) from d to m dimensions is defined, we will still need scalar products in the Phi space, and the kernel function will be Kp(X,Y)=K(Phi(X),Phi(Y))=Phi(X)*Phi(Y). Here Kp(X,Y) will be a kernel function in the d-dimensional space, giving the same values as the simple scalar product in the Phi space.

- Q: Which of our textbooks explains EM best? I have checked all 3 textbooks for this topic but could not find a satisfying explanation.
A: Chapter 3 of Duda, Hart and Stork has some detailed examples, so I'd recommend that.

- Q: Lecture 25, slide 3: when we write K(X,Y), what do we mean by Y? The example on slide 3 describes a kernel as 1 + 2X1Y1 + 2X2Y2 ..., which later is written as Phi(x), which I understand are new dimensions, but what is the function of Y? Shouldn't it be K(X1, X2)? And why do we take the square root of 2 in Phi(x)?
A: In this 2D example X=(X1,X2) and Y=(Y1,Y2) are two vectors; K(X,Y) has the form shown in the slide. Phi(X) and Phi(Y) are defined in the 5D space, and if you compute their scalar product, Phi(X)*Phi(Y) = K(X,Y).
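A quick numerical check, assuming the quadratic kernel K(X,Y) = (1 + X·Y)^2 (here the constant 1 is kept as an explicit component of Phi): the sqrt(2) factors are exactly what makes the cross terms of Phi(X)·Phi(Y) reproduce the 2X1Y1, 2X2Y2 and 2X1X2Y1Y2 terms of the kernel.

```python
import numpy as np

def phi(v):
    # feature map for K(X,Y) = (1 + X.Y)^2 in 2-D; e.g. 2*X1*Y1 = (sqrt(2)X1)(sqrt(2)Y1)
    return np.array([1.0,
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1],
                     v[0]**2, v[1]**2,
                     np.sqrt(2) * v[0] * v[1]])

X = np.array([1.0, 2.0])
Y = np.array([0.5, -1.0])
print((1 + X @ Y)**2, phi(X) @ phi(Y))   # both approximately 0.25
```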

- Q: In Lecture 28, slide 7 of the lecture notes,
You wrote that P(X) can be a linear combinations of some smooth functions. These smooth functions that you mention here are different forms of Parzen Window functions, each defined with different shapes, for example, Gaussian shapes, right?
A: Yes; it is not quite the non-parametric approach, where a numerical estimation of the density is generated, and not quite parametric; but more on that next time.

- Q: In Lecture 25, slide 7 of the lecture notes, you write "Right: support vectors removed, margin is clear, all vectors inside are SV..." However, I thought in your previous lectures you mentioned that SV are those vectors that lie on the margins. So technically, those vectors that lie within the margin are not SV, right?
A: SV are all those vectors that have non-zero alpha coefficients and are needed to "support classification" in the SVM algorithm; in the separable case they are indeed only on the border. If the problem is not separable (in the kernel space), those vectors between the margins are also called "bounded support vectors".

- Q: In Lecture 25, slide 12, you mentioned that if two Xs have the same PC_k(X), then they will lie on the same line in the original X-space. Am I right to say that this is because, if the two Xs have the same PC_k(X), they lie on the same point of the kth eigenvector of the Phi space, and thus on the same line in the original X-space?
A: This is not quite as simple, and I could not go into details of different types of kernels, positive and conditionally positive kernels (see the Scholkopf and Smola book "Learning with Kernels", 2001, on that). Roughly it is as follows: principal components PC_k(X) are linear combinations of dot products in the kernel space. If the kernel is the square of the Euclidean distance (b=2 in the figure on this slide), we have a combination of ||X^i-X||^2 factors. For which points X will this combination PC_k(X) be constant? The alpha coefficients and X^i vectors are fixed, and ||X^i-X||^2 = ||X^i||^2 + ||X||^2 - 2X*X^i; in the kernel PCA procedure for distance-based kernels the data X is additionally normalized so that ||X||=1, and the ||X^i||^2 are all constants; therefore PC_k(X) is constant when X*(linear combination of X^i) = X*Z is constant, and that is true for points X on a hyperplane (or a line in 2D) orthogonal to the vector Z.

- Q: In Lecture 25, you mention some kernels that can be used to map into a high/infinite dimensional space. Given that no two datasets are identical, how do we decide on the kernel to use? Is there a systematic framework in which one can narrow down the search for a "good" kernel?
A: Most programs have very few kernels, linear + polynomial + Gaussian, and on most real problems Gaussians are best; C and the dispersion of the Gaussian have to be optimized. Usually one can estimate using crossvalidation which kernel has a good chance to be the best. In special problems it is worthwhile to design special kernels, for example for text mining or bioinformatics problems.
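A sketch of such a crossvalidation search with scikit-learn; the dataset and parameter grids below are arbitrary choices for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# compare the usual few kernels, optimizing C and the Gaussian width gamma by 5-fold CV
grid = GridSearchCV(
    SVC(),
    param_grid=[{"kernel": ["linear"], "C": [0.1, 1, 10]},
                {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 10]}],
    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

On these two-moon data the Gaussian (RBF) kernel should come out on top, as the answer suggests it usually does.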

- Q: In SVM, we minimize W_i subject to the constraints that /alpha_i for support vectors is maximized, right? We also set the /alpha_i for data points which are far from the decision surface to zero. (This reduces the computational complexity.)
In that case, the Lagrange equation in slide 10 of lecture 23 can be seen as L(W, alpha) = 1/2 ||W||^2 - sum^{n}_{1}(....), in which the summation in the second term can be understood as summing over the support vectors only. Am I right?
If the above is true for the separable case, then it should be true for the non-separable case also. This ties in nicely with slide 7 of lecture notes 24, in which for those X_i far from the decision surface we have set their corresponding /alpha_i to zero.
A: Yes, the summation is finally over the SV only, but we first have to find which vectors belong to this category, so we need an algorithm that does optimization and selection. The formulation for the non-separable case is the same, except that now some vectors are going to violate the conditions for correct classification, so we add to the minimization the sum of the xi variables that measure the extent of this violation.

- Q: In slide 10 of lecture 23, you wrote the Lagrange equation L(w,alpha) = ... . Following that, you derived the alpha and W that yield the maximum over alpha and the minimum over W. Thus the minimization-maximization problem can be solved analytically, without the use of an iterative procedure, right? If that is true, training the SVM (to derive the W vector) is just a simple matter of plugging into the formula, and SVM training should be very fast, right?
A: No, there is no closed-form expression; W is expressed through alpha and the known Y, X. The dual form L(alpha) is quadratic in alpha, but there are two external constraints that should also be kept. The only training method I have mentioned, SMO, is iterative. But it is still very fast.

- Q: For those data points which are highly overlapping, it appears to me that a linear SVM may not be able to separate the data well. In this case, we can transform the data into a higher dimensional space, e.g. using a kernel. You mentioned that briefly in the last slide of lecture notes 24. However, you never mentioned a technique, if any, for choosing a kernel, i.e. is there a turn-the-crank procedure for deriving the best kernel? Or does it depend on the designer's knowledge of the data?
A: In the next lecture I will say more about it; the design of kernels (or similarity functions) is now an interesting field.

- Q: In Slide 16 of lecture notes 17, the last line writes “Missing value? Is treated as any other nominal value”. I am not sure what you mean.
A: If the value is missing, x=?, or if x=small or any other nominal value, we still use the same 1R algorithm. Sometimes this is a reasonable approach, if for example the value in some cases is not measured and all these cases are from the same class; sometimes it may lead to errors.

- Q: In slide 18 of the lecture notes 17, you explain that for 1R Continuous, we first have to discretize the continuous attribute X into intervals. Then, for each class w_c we find the interval that has the most w_c, e.g. interval I(X1) has the most occurrences of w_c. Then if a new vector X arrives and X is within I(X1), we assign it to w_c.

However, this may not work for certain cases. For example, it is likely that we poorly discretize the attribute such that an interval I(Xp) is not the winner for any class. In this case, if a new vector arrives such that X is within I(Xp), there will be an ambiguity in the classification. I feel that it would be better to do the following: for each interval I(Xk), we find which class occurs most in I(Xk), e.g. w_k. Then if a new vector X arrives and X is within I(Xk), we assign it to w_k. Incidentally, this ties in with the example given in slide 19 of the notes, in which we find that the interval "Temp < 77.5" has the most occurrences of class Yes and the interval "Temp >= 77.5" has the most occurrences of class No (we find the class that occurs most in every interval).

A: I have reformulated this slide a bit to stress that discretization is one idea, and it may be done in many interesting ways, but 1R uses its own simple discretization, just as you write and as is illustrated on the following slide; I have changed it to show discretization for humidity. 1R has a rather naive discretization, adding to the interval (bucket) with at least B elements until a non-majority class sample is found.

- Q:
In slide 11 of lecture notes 18, you wrote in point 1 that "The number of simpler hypothesis that may accidentally fit the data is small ..."
I am unsure why the number of simple hypotheses that may accidentally fit the data is small. Is this due to the fact that a simple tree will generalize well? Hence, since it generalizes well, it is unlikely that the fit of the data to the hypothesis is pure coincidence.
A: True, but also the number of simple hypotheses is small, while the number of complex ones grows exponentially fast. For example, if a simple hypothesis uses 3 features and they do not repeat themselves in a tree, there are 3! = 6 binary trees that one may construct; but if we have 10 features there will be about 3.6 million possible trees, so some of them may fit the data by chance.

This is well recognized in hypothesis testing, where some simple tests are used, like the t-test. If we try the test millions of times, finally something will agree, so the usual confidence that the test should correctly identify chance agreement at something like the 95% level is wrong. A 95% confidence level means that the test may be wrong 1 time in 20, but if we try a million times then certainly it will be wrong many times. Search for "Bonferroni correction" to see discussions of this topic and adjustments to confidence levels.

- Q: In slide 5 of lecture 14, you wrote:
"Since all training vectors are used, the number of errors for the training data is 0, unless the test vector X is excluded".

However, my understanding is that for this example there are no cases in which a training vector belonging to class Ci has its nearest neighbor in another class. Hence the error on the training data is 0. In this case, if we include the test vector, the test vector may lie next to another class, and then the error will no longer be 0, i.e. the error will no longer be 0 if the test vector X is included. Is my understanding wrong?

A: To make it more clear I've changed this sentence to: "Since all training vectors are used, testing this rule on the same training data gives 0 errors (vector X = X^{(i)} is always nearest)." All vectors from little patches in the middle of larger areas that belong to the opposite class, used as reference vectors by the nearest neighbor rule, will indeed be classified correctly, giving decision borders as in the picture.

If testing on the training data we exclude reference vectors identical with the test vectors X = X^{(i)}, the error will be non-zero.

Suppose now that only a few prototypes in the centers of clusters are used; then the borders will be smoother and the little patches will disappear.

- Q: In slide 13 of lecture 14, you wrote that
Bias^2(f, Y_m) = E[ (f_i - E[Y^i_m]) ^2 ].

I think it should instead be Bias^2(f, Y_m) = E[ (f - E[Y^m])^2 ], right? This is because we are now considering the bias of the entire model and not the bias for a particular training data point Y_i.

A: The E, or Expectation, operation averages over points X^{(i)}, therefore f^{(i)} is used.

- Q: In slide 5 of lecture 15, last sentence, you wrote "Final model is built using the whole training data with parameters determined from CV..."

I thought that during training we tune the model parameters with the training sets and test the current model parameters on the validation sets. Then, after training with all training data, we use the model parameters which gave the lowest error (during the validation stage).

In this case, if we build the final model with the whole training data, then wouldn't the previously tuned model parameters be reset? (What would be the point then of training with the previous training data sets?)

A: This is a bit tricky ... in each CV we train a model M(W|D_{i}) with some parameters W on a subset D_{i} of the training data. After CV we select the best model, with parameters W* that on average give the lowest error on our validation partitions. This may be the end for some approaches, but in many methods the final model may be created using all data. For example, if we optimize the number of neighbors taken into account in the nearest neighbor procedure (kNN) and find that k=7 is best, our model in each CV is kNN(k=7|D_{i}), but the final model kNN(k=7|D) will use all training data; if in CV we find that the decision tree should be pruned to some degree d, we need to create the final tree on the whole data with such pruning.

- Q: In slide 7 of lecture 15, you wrote "the average radius covered by a single training point rapidly grows with the number of dimensions...."

However, I thought that the average radius covered by a single training point should "shrink" rapidly. For example, suppose the training point covers a circle of radius 0.5, with volume 0.5^2 units, in the 2-dimensional feature space. Then in d dimensions it will cover 0.5^d. Therefore, shouldn't it be interpreted that the average radius shrinks? (I set the radius to be <1 so that the circle in 2D can fit into a unit square.)

A: Precisely because of this shrinking volume the radius has to increase! If you have a fixed number of N points in d dimensions, then the radius of the Voronoi cell that each covers has to grow. In d=10 dimensions only about 10% of the volume is in the subcube [0,0.8]^{d}. So, if you have 10 randomly placed points and one is at (0,0,...,0), in the corner of this subcube, the others are probably outside; the average distance in each dimension is thus around 0.9, so the total distance is sqrt(10*0.9^2)=2.85, while in 1D the average distance is 0.1.

- Q: In slide 8 of lecture 15, you wrote "The MSE also grows due to the growing bias term..."

Why does this bias term increase with increasing dimensions? Is it because the current model is too simple to deal with the increasing complexity brought about by the increased dimensionality? That is, the model is unable to follow the target function well.

A: Yes, the assumption that we learn something from the nearest neighbors is a bad bias if the dimensionality is high.

- Q: In slide 4 of lecture 16, you wrote "Accuracy for class k and balanced accuracy used for unbalanced data:..."
What is meant by "unbalanced data"?
A: Some classes have a much smaller number of vectors than the others; "balanced" data has roughly the same number of samples per class. I have added an explanation to this slide.

- Q: In slide 13 (now 14 as one has been added) of lecture 16, you wrote that a large threshold will reduce recall/sensitivity.
However, I thought that with a large threshold P++ will increase. Given that P+ is constant, shouldn't it then be the case that
S+ = P++/P+ will increase and not decrease?
A: If the threshold is so large that no data is left, then P++ has to be zero, doesn't it? P++, the proportion of true positives, is not the probability of correct classification, which grows with the threshold; it is the fraction of samples from class + classified as class +. Look at the figure in this slide: if the threshold moves to the right, only a few vectors are left, so P++ = n++/n decreases at the expense of the false negatives P+-.

- Q: In PCA, we need to compute the mean and variance of features, and the covariances between features. There is no issue computing this for discrete and continuous data. However, when it comes to nominal or ordinal data, the mean and variance do not mean anything. For example, we may represent various chest pain types as follows: 1 = typical angina, 2 = atypical angina, 3 = non-anginal, 4 = asymptomatic. One way is to separate the chest pain type into four features, and represent the presence or absence of each type of chest pain by 1 or 0, respectively. For example, if someone has typical angina, the four features will have the values 1, 0, 0, 0, respectively. (This will result in a mean which represents the probability of the sample having, say, typical angina.) Is this the correct approach to represent the data in PCA? If not, how do we represent such data in PCA? Are there any considerations for representing the data in this way?
A: This is not only about PCA, but more generally about the representation of nominal data, which most methods cannot handle directly. Of course one way is to use binary encoding for each feature value, as you have noticed. An interesting idea is to use distance measures that are defined by probabilities rather than feature values. Methods that use distances can then handle symbolic features directly. I have tried to define numerical vectors whose Euclidean distances reproduce such probabilistic distances and to use them to analyze data; see: "Symbolic features in neural networks". Going along these lines one can invent more sophisticated distance measures for symbolic features, or for objects that have complex structure, and try to represent them in some metric space. For example, MDS will work with any similarity matrix, also a probabilistic one, and produce numerical features.
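A sketch of the binary encoding mentioned above; the chest pain coding 1..4 follows the question, and the data values are invented:

```python
import numpy as np

# chest pain type: 1=typical angina, 2=atypical, 3=non-anginal, 4=asymptomatic
pain = np.array([1, 3, 4, 1, 2])

# one 0/1 column per nominal value ("one-hot" / binary encoding)
onehot = (pain[:, None] == np.arange(1, 5)[None, :]).astype(float)
print(onehot)
print(onehot.mean(axis=0))   # column means = fraction of samples with each type
```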

- Q: In L16, how is the lift chart constructed?
A: Perhaps it was not clear that in our example the reply rate is Y0=0.2; I have added this to the slide. So if there are 1000 respondents, N=5000 offers have been sent. If all top predictions P(w_{+}|X) were true, we would have a line going from (0,0) to the (20%,1000) point, but instead we have a curve, as shown in the slide. How is this used? My return from the 10%*5000=500 offers sent to those with the highest predicted chances of return P(w_{+}|X) is 400! But to get the next 400 I need to send 30% of all offers, that is 1500, and then the remaining 3000 offers will bring only 200 returns; so, knowing my costs, I may stop at the 2000 most promising cases.

- Q: In L16, how to go from observations N(w,x) to the confusion matrix P(w1,w2)?
A: I have added a new slide 7 and updated slide 8 to detail the decision procedure; we have to apply the MAP Bayesian decision rule.

- Q: Given that there is a set of data points and we used it to generate a SOM, how do we place the data points' labels on the SOM?
Is a data point's label placed on the neuron that is the best matching unit of the data point? If so, then there is a possibility that not all data points' labels are placed on the SOM, i.e. the SOM does not represent all the data points?
A: Yes, we simply check which node is the winner; each node is labeled with the class to which the majority of the training data in its Voronoi cell (the area where this node is the winner) belongs. SOM is an unsupervised algorithm, so there is nothing that will enforce pure nodes, that is, Voronoi sets from a single class. So indeed it may happen that a given class with a small number of samples will never be used. If the training data represents the distribution of the data well and there are enough nodes, then it usually works well.

- Q: question to the decision procedure slide 4 of lecture 13.
Can I infer that the decision procedure is that we choose the action wl, given the observation, that has the lowest risk? If that is the case, I can rewrite this decision procedure as

arg min (k=l) sum_{j=1}^{j=n}(/lamda_jl * Prob(wj|X). This ties in with equation (11) of Duda, Hart and Stork pg 24.A: Yes, we should chooses action that carries with it the lowest risk. We called the number of class K so j=K should be used and then this formula gives conditional risk for decision wl; the lowest risk k is selected.
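The rule can be sketched in a few lines (the loss matrix lam and the posteriors below are illustrative values, not from the slides):

```python
import numpy as np

def min_risk_action(lam, posteriors):
    """Choose the action with minimal conditional risk.

    lam        : (K, K) loss matrix, lam[j, l] = cost of action l
                 when the true class is w_j
    posteriors : (K,) vector of P(w_j | X)
    Returns (best action index, its conditional risk).
    """
    # R(a_l | X) = sum_j lam[j, l] * P(w_j | X)  -> one risk per action
    risks = posteriors @ lam
    l = int(np.argmin(risks))
    return l, risks[l]

# Example: with the 0/1 loss the rule reduces to the MAP decision
lam01 = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
action, risk = min_risk_action(lam01, np.array([0.3, 0.7]))
# action -> 1 (the class with the highest posterior), risk -> 0.3
```

With a non-symmetric loss matrix the selected action can differ from the MAP choice, which is the whole point of the risk formulation.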

I have re-written the decision procedure introducing explicitly the conditional risk, and added some bibliographical notes.

- Q: In slide 12 of lecture 12,
Does R(hat(C),w_k) refer to the risk of using the classifier when the true state of nature is w_k?
A: Yes, this is indicated by |C=wk) condition.

- Q:
In Slide 4 of Lecture 13,
Does g_k(X) refer to the cost of assignment to other classes given that the true state of nature is class w_k?
A: Yes, this is the risk or cost that we had before, but now we call it a g(X) function and use it to discriminate between classes, like the other functions on the next page.

- Q: In Slide 4 of lecture 13,
The decision procedure should be
w_k for k = argmin sum( lambda_jl ... ) instead of argmin sum( lambda_lj ... ), right?
This is because the risk of doing action l = cost of doing action l when the true state is k * Prob(wk|X) + cost of doing action l when the true state is m * Prob(wm|X) + ...
And so, following your convention that lambda_jl = cost of doing action l when the true state is class j, we should thus sum over lambda_jl instead.
A: Thank you for careful reading! In slide 12 of lecture 12 I had left P(Wl|Wk) instead of P(Wk|Wl), although I commented there that rows correspond to true classes, and kept correct notation later; this confusion arose from a different convention used in the GhostMiner package. So here the risk of action l is summed over the true states j, and the action with the lowest risk is selected. So this slide is correct, and slide 12 in L12 is corrected.

- Q: In slide 6 of lecture 13, you wrote
E = sum sum ||P(wi|X) - Ci(X)||, where Ci(X) = 1 if X is from class i and 0 otherwise. Should it be Ci(X) = P(wi|X) if X is from class i?
In this way, we will not sum over P(wi|X) when i is the correct class.
A: Here P(wi|X) is our prediction, while Ci(X) is the class label, usually 0 or 1; as I have mentioned, sometimes we may have probabilities or degrees that we try to learn, and then Ci(X) may be a fraction, with the fractions summing to 1. In any case, if the P(wi|X) predictions match the Ci(X) labels the error is zero, otherwise there is an error that is used to adjust parameters in the models generating the P(wi|X) probabilities. So for example for 2 classes we may have C1(X)=0 and C2(X)=1, and if P(w1|X)=0 and P(w2|X)=1 they match.
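A small sketch of this error measure with one-hot labels (using the squared norm; the function name is mine):

```python
import numpy as np

def prediction_error(P, C):
    """Sum of squared differences between predicted posteriors P(w_i|X)
    and target class indicators C_i(X) (one-hot, or soft labels).

    P, C : (n_samples, n_classes) arrays; rows of C are usually 0/1.
    """
    return np.sum((P - C) ** 2)

# Perfect 2-class prediction: C1(X)=0, C2(X)=1 matched by P=(0,1) -> error 0
C = np.array([[0.0, 1.0]])
assert prediction_error(np.array([[0.0, 1.0]]), C) == 0.0
# A completely wrong prediction P=(1,0) gives the maximal error 2
assert prediction_error(np.array([[1.0, 0.0]]), C) == 2.0
```

Minimizing this sum over the training set drives the predicted posteriors towards the class indicators, which is how the error adjusts the model parameters.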

- Q: L6, slide 10, May I ask what it means when it says it is the 'matrix of orthonormal eigenvectors'? Where is it transformed from? From X?
A: C_{x}Z=ZL is an eigenequation involving the covariance matrix C_{x}; Z is thus the matrix of eigenvectors of C_{x}.

As we do not have time to review matrix algebra during this course (it is after all a graduate course), please consult the Numerical Recipes book (www.nr.com), chapter on diagonalization methods.

- Q: Lecture 6, slide 8, I can follow the derivation of the covariance matrix elements Cij, but where does the derivation of Cx=1/(n-1) XX^{T} come from?
A: This is just a notation where bold X is a matrix of all data vectors X shifted to their means. Assume that the means are already subtracted from the data; we can do it before the calculation of the C matrix. In each column we have the X^{(k)}, k=1..n vectors; the first row is for feature 1, the last for feature N, so the matrix is N by n. The transposed matrix has rows corresponding to vectors and columns to features; it is n by N, so now column 1 contains the values of feature 1 for all X^{(k)} vectors. Multiplying the first row by the first column we get C_{11}, or in general, the elements of the XX^{T} matrix contain the Cij elements.

- Q: In slide 7 of lecture 12,
You wrote P(e) = sum_x P(e|x).
However, shouldn't there be a P(x) term, i.e. P(e) = sum_x P(e|x)P(x)?
A: Because P(e|X) = min_i P(w_i|X), the minimum of the posterior probabilities, we assume that if there are several identical X they all appear in the data, and therefore in the sum. Therefore there is no P(X) in the sum.

- Q: I tried the software for SOM that you gave, but most of it either costs money or is not user friendly. Are there any tools that allow us to input our data and run the results? Thanks!
A: I keep Software and Computational Intelligence links, where all interesting links to software and papers are stored. There are quite a few links to SOM programs, including a Matlab version. Such links are unstable, programs vanish, and I can't keep track of all of them - I hope to learn from you about more programs! So please search first in these links and then in the Internet.

- Q: I have a problem in Lecture 7. Are there other methods to generate the second FDA vector? Since the rank of the S_{B} matrix for 2 classes is 1, I am confused about that.
A: Unfortunately in this situation higher FDA vectors cannot be obtained easily if the FDA criterion is strictly observed. I have mentioned two methods, based on perturbation and pseudoinverse, but had no time to explain them. They both spoil the FDA criterion a bit. The best summary is in the A. Webb book, chap. 4.3.3. However, a better method is described in this paper: Cheng, Y.-Q., Zhuang, Y.-M. and Yang, J.-Y. (1992) Optimal Fisher discriminant analysis using rank decomposition. Pattern Recognition, 25(1):101-111.

Perhaps the best paper that discusses numerical problems in FDA is: Efficient Implementation of Optimal Linear Discriminant Analysis. Methods based on Generalized Singular Value Decomposition and other approaches are described. This goes rather far beyond our course and is a good research topic, as different FDA variants give different results and one may go to kernel versions with new variants rather easily, the existing kernel FDA does not work too well.

A good bibliography on discriminant analysis is here.

- Q: In Slide 5 of the lecture notes, you wrote "Various normalization factors may be introduced, but those that do not depend on the rij distances have no influence on the MDS map."

Do you mean that a normalizing factor that depends only on rij will change the topological configuration of the target space? I believe that any factor that depends on either Rij or rij will have an influence. This is because the equation used to measure stress/topological distortion has both Rij and rij components. Since we seek to find new {Y} of the MDS map that will minimize this stress function, any factor that depends on either Rij or rij should have an effect on the MDS map, right?
A: It means that if you divide the whole distortion measure by a function f(Rij) only, then this is a constant for the given data; only the rij distances change when the {Y} positions in the target space are optimized. Of course if Rij is used in combination with rij then it has an influence.

- Q: In slide 8 of the lecture notes, you wrote "SOM does classification and visualization, but separating these two functions may improve accuracy of both."

Do you imply that MDS has a clear separation of the classification and visualization steps? It seems to me that MDS does not explicitly separate these two steps. Instead, by the use of the topological distortion measurement, MDS seeks a map which has lower topological distortion (and hence more accurate visualization). Also, since it creates a map of lower topological distortion, it implicitly means that classification of the data is better. In this aspect, MDS is superior to SOM simply because it has a meaningful cost function to minimize. Am I right?
A: Yes, MDS does not do any classification, just visualization. One can use the data in the target space Y for classification; then MDS is used for dimensionality reduction and may be followed by any classification method. In fact the results of such a procedure are quite good, although no one seems to be using it.

- Q: How to identify that a particular method is apt for an application? Is it a trial and error method of applying various methods and finding out which method brings out the distinctness? Or is there any quantitative factor which can be used to select a visualization method?
A: Unfortunately there is no easy way to identify a good method for given data unless we already know something about the type of data we are going to analyze; this is true not only for visualization but for all data modeling techniques. The METAL project tried to do it for data mining, defining similarity to previously analyzed cases.

- Q: Will it be effective to apply SOM after MDS of data? (thinking about first assignment)
A: What good could it do? MDS already gives the best representation preserving distances; SOM may only spoil it.

- Q: Can you please comment on using S-Plus instead of Matlab? I would like to take this first assignment as an opportunity to learn S-Plus (practical applications of S-Plus are fascinating). Since I am a newbie to S-Plus and computational intelligence, I am wondering whether S-Plus will be the right tool to finish my assignment. Please advise.
A: It is always good to learn; there are many packages in the S or R languages, but few people know them, as they are taught only in statistics courses.

- Q: PCA is used to represent data without knowledge of class structure while FDA is used to discriminate data. Hence, in this aspect, is it right to state that PCA is unsupervised, while FDA is supervised (since it requires clustering of data as a preprocessing step)?
A: True, I thought I'd stressed this point: FDA needs some labeling of the data, from clustering or knowledge of classes.

- Q: In slide 4 of lecture 8 (last paragraph), you wrote the following "...any non-Gaussian distribution after PCA transformation will still have correlated features".
I thought that after PCA, the transformed data will be uncorrelated since the covariance matrix is diagonal?
A: I should be more careful on this slide, distinguishing dependence, second-order statistical correlations, and higher-order correlations, which is what was meant. Although the slide stresses that uncorrelated features may have higher-order dependencies, it is better to say here that any non-Gaussian distribution after PCA transformation will still have some feature dependencies, rather than correlations, although dependencies are in fact higher-order correlations. I've changed it accordingly.
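A quick numerical illustration of uncorrelated-but-dependent features (y = x^2 with a symmetric x, a standard example; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x ** 2           # fully determined by x, hence strongly dependent

# The second-order correlation is (near) zero because E[x^3] = 0 for a
# symmetric distribution, yet y is a deterministic function of x.
r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 0.05)   # True: uncorrelated, but clearly not independent
```

PCA removes only such second-order correlations; the higher-order dependence between x and x^2 would survive any linear decorrelation.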

- Q: In slide 6 of lecture 8, you wrote "...Other components are found in the space orthogonal to W1T X".
I thought that after finding an interesting projection, we perform an orthogonal projection on the data to keep components that are orthogonal to the initial projection. Hence, strictly speaking, shouldn't it be the case that the other components are found in the space orthogonal to "W1" and not "W1TX"?
A: Indeed that is better; Y_{1}=W_{1}^{T}X represents the new transformed feature, and we commonly speak of feature space thinking about its dimensions as orthogonal vectors defining directions on which feature values are placed, so sometimes we say that Y_{2} is orthogonal to Y_{1}, and we really mean that the unit vector on the Y_{1} axis is orthogonal.

- Q: In slide 10 of lecture 9 (last bullet), you state that "W(i) point to the center of local clusters in the X feature space".
It seems to me from the mechanics of SOM that W(i) only approximates the center of the local cluster in the X feature space. This is because each winning node will influence other nodes in its vicinity. Thus theoretically I'm not sure if it can be proven that indeed W(i) points to the center of the local clusters.
A: Certainly this is quite an approximate algorithm, and in fact it is hard to prove anything exactly about it in more than one dimension; I have added an approximation sign there.

- Q: Do you know any software (like weka or yale) that can use genetic algorithm as a classifier (rule discovery)?
A: GA is sometimes used for optimization of parameters in various classifiers, rather than directly. Many papers have been published on this subject (e.g. EDRL-MD, Evolutionary Decision Rule Learner), but the software is not so easy to find. I have a few links to GA systems on my software page; some packages, like SPSS, use GA for optimization of neural parameters. Yale has GA and PSO (Particle Swarm Optimization) versions of SVM that probably work better for classification (although I've not seen any comparisons with standard SVMs), but it does not generate rules.

- Q: Any GA classifier that can readily be used? I know several PERL/C library for GA optimization.
A: Several projects have tried to build rule-based classifiers with GA optimization; for example the AI Lab in Strasbourg (J. Korczak) has a number of software packages on their pages. NeuCom (A Neuro-computing Decision Support) evolves solutions.

Note however that evolutionary approaches have been around for a long time (20 years) and it is hard to find good comparisons with computationally less expensive multistart gradient techniques, not to mention standard numerical analysis optimization methods (branch and bound etc.). My experience with PSO and GA in neural parameter optimization has been rather negative; there are better and faster methods for this application. For some reason many people use GA for optimization of simple problems, and for problems where standard numerical techniques work better; the simplest systematic sequential search techniques have been totally neglected, although there is no evidence that they fail; indeed they seem to work quite well, see here.

- Q: Are you going to touch upon data discretization (supervised/unsupervised) in your lecture?
A: Only briefly, but I'll try to add some pointers.

- Q: In 4 Gaussian in 8D, what are X1 and X2? Are these 2 of the 8 dimensions of the data? Why are there 4 Gaussian Distributions while each in 4D?
A: In this example 4 Gaussian functions, each defined in 4 dimensions, have been used. Their formula is:

G(X;X0,s) = exp(-[||X-X0||^{2}/2s^{2}]) for X=[X_{1},X_{2},X_{3},X_{4}]

Here ||X-X0|| is the Euclidean distance between the X and X0 vectors in 4-dimensional space, with X0, the center of the first Gaussian, at [0,0,0,0]. For the next Gaussians the X0 center is shifted, as given in the slide.

Additional dimensions X5 to X8 were added, made from X_{i+4}= X_{i}+ noise. So together these functions have 8 features.
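A sketch of how such data can be generated (the class centers, widths and noise level below are illustrative choices, not the slide's exact values):

```python
import numpy as np

def make_gaussians_8d(n_per_class=500, sigma=1.0, noise=0.1, seed=0):
    """Sample a 4-class data set: each class is a spherical Gaussian in 4D
    with a shifted center; features X5-X8 are noisy copies of X1-X4
    (X_{i+4} = X_i + noise), so together there are 8 features.
    """
    rng = np.random.default_rng(seed)
    X, y = [], []
    for k in range(4):
        center = np.full(4, k * 1.0)           # shifted center per class
        base = rng.normal(center, sigma, size=(n_per_class, 4))
        extra = base + rng.normal(0.0, noise, size=base.shape)
        X.append(np.hstack([base, extra]))     # 8 features per sample
        y.append(np.full(n_per_class, k))
    return np.vstack(X), np.concatenate(y)

X, y = make_gaussians_8d()
# X.shape -> (2000, 8), four classes of 500 samples each
```

Features X5-X8 are by construction redundant, which is exactly what makes this data useful later for illustrating feature selection.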

Why? This is artificial data to show how the scatterograms will look for such data. It will also be used later to show how selection of important features works: X1 has more information about the classes than X2-X4, and features X5-X8 are redundant, carrying the same information as X1-X4.

- Q: What is Gaussian noise and what is it used for?
A: The multidimensional Gaussian function G(X;X0,s) is used as a probability density function to generate points. This simply means that the probability of selecting a given point X is proportional to the value of the Gaussian at this point; therefore many points come from regions around X0 and few from regions where the Gaussian is small.

Gaussian distributions are very common and noisy measurements frequently follow such distributions. See the entry on Gaussian noise in Wikipedia.

- Q: Lect-5, Slide 8, Invariance requires Standardization + Covariance Matrix for what?
A: Indeed this was too short; I have expanded the sentence now:

To achieve full invariance requires therefore standardization of data (scaling invariance) and should use covariance matrix.

I have also mentioned that Mahalanobis is still alive; evidently this is not true, as Prasanta Chandra Mahalanobis was born in 1893 and died in 1972. I got him mixed up with another great Indian statistician, Calyampudi R. Rao, who was born in 1920.

- Q: Lec-6, Slide 7, How did you get [5 2] for the variance of Y? There is only one vector Y after scaling, so the mean would be the same too. If so, the variance should be [1 1].
A: The assumption in the first line is that there are many X vectors, with mean=1 and variance=1; this is the same example as in slide 2. So if the X means are [1,1] and the transformation A is given, we can compute both the Y means and variances. There is no scaling here, just the A transformation. I have changed this slide to make it more explicit.
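The variance claim can be checked numerically; a minimal sketch (the matrix A below is an example chosen so that Diag(AA^T)=[5,2]; it is my assumption, not necessarily the slide's exact matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
# Example transformation, chosen so that Diag(A A^T) = [5, 2];
# an illustrative assumption, not taken from the slides.
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

# Many X vectors with mean 1 and variance 1 in each feature (C_X = I)
X = rng.normal(loc=1.0, scale=1.0, size=(200_000, 2))
Y = X @ A.T                     # Y = A X applied to every sample

# With C_X = I the covariance of Y is A A^T, so the feature variances
# are the diagonal elements of A A^T, here [5, 2].
print(np.allclose(Y.var(axis=0), np.diag(A @ A.T), rtol=0.05))            # True
print(np.allclose(Y.mean(axis=0), A @ np.array([1.0, 1.0]), atol=0.05))   # True
```

The empirical variances and means of Y agree with the analytical values A A^T and A times the mean vector, up to sampling noise.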

An elegant way of proving it is to use the covariance matrix:

C_{Y}=A C_{X}A^{T}. If C_{X} is diagonal, as for spherical data with unit variance, then C_{Y}=AA^{T}, and therefore the variance of each feature is on the diagonal, Diag(AA^{T}).

- Q: Lec 6, Slide 9, How did you get the equation ||X||^{2}_{CX} = X^{T}C_{X}^{-1}X? I did not get that...
A: This is a definition of the Mahalanobis distance, a norm with C_{X} in the subscript, so there is nothing to derive here!
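The definition can be computed directly; a minimal sketch (the function name is mine):

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis norm ||x - mean||^2_C = (x-mean)^T C^{-1} (x-mean)."""
    d = x - mean
    # solve(cov, d) computes C^{-1} d without forming the explicit inverse
    return float(d @ np.linalg.solve(cov, d))

# With C = I it reduces to the squared Euclidean distance: 3^2 + 4^2 = 25
x = np.array([3.0, 4.0])
print(mahalanobis_sq(x, np.zeros(2), np.eye(2)))   # 25.0
```

With a non-identity covariance the norm rescales each direction by its variance, which is what makes the distance scale-invariant after standardization.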

- Q: During yesterday's lecture (Lect 5), you mention, on slide 11 of Lecture 5 notes, that standardization of data

Z = [X - mean(X)] / stddev(X)

will make the data invariant to scaling.

I am unsure what you mean by that. Can you illustrate with an example, say using slide 12 of the Lecture 5 notes?
A: I've added some explanation there.

- Q: You have used to Ghostminer software to demonstrate several statistical concepts during your CI lectures. This approach has been especially helpful in helping me to understand your teaching. May I request for a scaled-down version of Ghostminer to repeat some of these presentations for my revision?
A: Just follow the link on my page and get the GhostMiner package from FQS, who will send you a license key valid for a few months.

- Q: In lecture 6 notes slide 9, where does the formula C_{Y}=AC_{X}A^{T} come from?
A: Directly from the linear transformation Y=AX and the definition of C_{X} given on the previous slide:

C_{X}=1/(n-1) XX^{T}, and (AX)^{T}= X^{T}A^{T}, so

C_{Y}= 1/(n-1) YY^{T} = 1/(n-1) AXX^{T}A^{T}= AC_{X}A^{T}.

- Q: Mahalanobis distance is just one value for two vectors, but is it a symmetric matrix for two matrices?
A: Distance functions are defined for vectors, we shall not need distances between matrices, although such concept is useful in mathematics.

- Q: In slide 10 of lecture 6, to avoid correlated features we have to diagonalize the covariance matrix by solving C_{X}Z=ZL; why is it so? Why does it make the off-diagonal entries become zero?
A: This you should know from a course on linear algebra that unfortunately we do not have time to review; as this is a graduate course, I have to assume that you had a linear algebra course. Any textbook on linear algebra should explain it; I can recommend the Numerical Recipes book that may be downloaded from http://www.nr.com/, including numerical routines for diagonalization. Matlab also has built-in diagonalization routines. A short explanation is this:

Solving the eigenequation CZ=ZL for C symmetric, positive definite, and L a diagonal matrix containing the eigenvalues, leads to an orthogonal matrix Z that contains the eigenvectors of the matrix C. Orthogonal means that Z^{T}Z=I; therefore multiplying the eigenequation from the left by Z^{T} gives Z^{T}CZ=L. This type of transformation, Z^{T}CZ, is called a similarity transformation, and it changes C into a diagonal matrix. Therefore if new vectors Y=Z^{T}X are defined, the covariance matrix will be diagonal, as shown in slide 10.
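A numerical sketch of this similarity transformation (using numpy's eigh for the symmetric eigenproblem; the covariance values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data (illustrative covariance matrix)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=50_000)

C = np.cov(X, rowvar=False)     # sample covariance C_X
L, Z = np.linalg.eigh(C)        # C Z = Z L; Z is orthogonal for symmetric C

Y = X @ Z                       # Y = Z^T X applied to each row vector
C_Y = np.cov(Y, rowvar=False)   # Z^T C Z = L, so this is (nearly) diagonal

print(abs(C_Y[0, 1]) < 1e-10)            # True: features decorrelated
print(np.allclose(np.diag(C_Y), L))      # diagonal holds the eigenvalues
```

After the rotation the off-diagonal covariance vanishes (up to rounding), and the variances of the new features are exactly the eigenvalues in L.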

The orthogonal matrix Z defines a rotation of the coordinate system; each new coordinate is a linear combination of the old ones. In this new rotated system all features are decorrelated, because C_{Y} is diagonal.

- Q: What is the process of identifying significant features?
A: We shall talk about that a bit, but unfortunately there is no general methodology, as it very much depends on the application. One way is to collect or measure whatever we can and then use such techniques as feature ranking, selection, aggregation or dimensionality reduction to find or construct relevant features.

- Q: Presently, I am using hypothesis testing to derive type I and II errors. I use these errors to identify key features. Is this methodology right? Are there any other efficient methods, besides Neyman-Pearson and Bayes risk analysis for reducing errors and maximizing probability of deduction?
A: I'll talk about this in Lecture 15/16.

- Q: Differentiation between AI and CI

You have mentioned in slide 4 of lecture 2 that CI techniques are meant for lower cognitive functions while AI techniques are primarily focused on achieving higher cognitive functions. Lower cognitive function: pattern recognition, motion control, sensory signal processing Higher cognitive function: reasoning, planning, problem solving and learning.

Also, I become rather confused when you mention AI is based on symbolic representation of knowledge. To my knowledge there are two schools of thought in AI: the connectionist approach and the symbolic approach. The connectionist approach focuses on non-tractable and mathematical techniques such as ANN and EA, while the symbolic approach uses logic and behavior-based methods.
A: Indeed some people have called connectionism "a new AI", but traditional "good old-fashioned AI" (GOFAI) has been focused on symbolic representation of knowledge, as is quite evident in AI textbooks and courses. It is still true that AI people do not come to neural or connectionist conferences, and vice versa, even though some AI people would like to see their field expanding to cover also what CI people do. It is enough to look at the societies, conferences and journals to see this distinction, with very little mixture between the two groups.

Unfortunately there are no commonly accepted divisions and definitions. Some questions addressed below have already touched on this problem. I am giving you my own definitions based on the problems addressed, rather than the methods used. From this point of view CI is broad and covers all subdisciplines that help to solve problems for which effective algorithms do not exist. This includes AI, which has traditionally been focused on a subset of these problems that could be solved with symbolic representation.

Ultimately it does not matter how we label the field; what matters is the type of problems we may solve.

- Q: Applicability of CI

You have mentioned that the classification of a technique is based on the problem being addressed. For example, techniques that solve lower cognitive functions are primarily considered as CI. However, from slides 9 to 12, you have included various higher cognitive functions in which CI may participate. In this case, does the CI definition in slide 4 still hold?
A: Some of these problems indeed require both high- and low-level processes. I have changed slide 4 in Lecture 2 to make it clearer: in my broad-sense definition CI includes AI, but is not restricted to problem solving, going also in the direction of optimization and pattern recognition.

- Comment: I felt overwhelmed by the amount of things you said last night. It was also not clear, and I could not get the gist of what you were trying to emphasize. I understand that perhaps there is a lot of materials and historical information in CI, but I hope that in the lectures ahead, we can be more focused.
A: The first lecture is to see what the problems requiring CI are and what the sources of inspiration to look for solutions are. I will come back to these sources of inspiration when talking about different methods. This is quite a broad field, so indeed you may be overwhelmed, but the other lectures are focused on the methods.

- Q: B.T.W, after the exam, can I also write email to you about some questions?
A: Sure.

- Q: There is a question: 'Define information gain and prove that it is equal to K-L divergence.' How to answer it?
I can only find the relation of MI and KL divergence, but couldn't see the relation of IG and KL.
A: The question in L34 is: prove that Mutual Information = Information Gain when binary feature is used to split the data. This should not be difficult.
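A sketch of the identity, writing IG for the information gain of a split on feature A: the gain is exactly the mutual information, which is itself a Kullback-Leibler divergence between the joint distribution and the product of the marginals:

```latex
\begin{align*}
IG(C,A) &= H(C) - \sum_{v} P(A{=}v)\, H(C \mid A{=}v) = H(C) - H(C \mid A) \\
        &= \sum_{v,i} P(c_i, v)\,\log\frac{P(c_i, v)}{P(c_i)\,P(v)} = MI(C,A) \\
        &= D_{KL}\big(P(C,A)\,\|\,P(C)P(A)\big)
\end{align*}
```

The middle step just expands both entropies over the joint distribution P(c_i, v); nothing in it requires A to be binary, so the binary-split case asked for in L34 is a special case.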

- Q: For Parzen window. You mention in the slides that
"Use as H(u) a smooth function, such as Gaussian, normalizing final density: P(x)=\int{H(u)du}". what do you mean by this. If H(u) is a kernel, this should be always 1.
A: I have updated the slide to make it clearer; if this integral is 1 then the final density is also normalized to 1. Thank you for pointing it out.
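A minimal Parzen-window sketch with a Gaussian kernel H(u) (the bandwidth h and the grid are illustrative choices):

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen window estimate with a Gaussian kernel H(u).
    The kernel integrates to 1, so the averaged estimate P(x) is a
    properly normalized density."""
    u = (x - samples[:, None]) / h
    H = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return H.mean(axis=0) / h   # average of kernels, scaled by 1/h

# The density estimated from N(0,1) samples integrates to ~1 on a grid
rng = np.random.default_rng(0)
data = rng.normal(size=2000)
grid = np.linspace(-6, 6, 1201)
p = parzen_density(grid, data, h=0.3)
print(abs((p * (grid[1] - grid[0])).sum() - 1.0) < 0.02)   # True
```

Because each kernel integrates to 1, their average does too, which is exactly the normalization the slide refers to.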

- Q: Standardized data: the reason we use 1/(n-1) and not 1/n is because 1/(n-1) makes it unbiased? If so, unbiased means its expectation value is equal to the true variance. How to prove it mathematically?
A: I showed the explanation briefly during the lecture, but it belongs to elementary statistics, not to this course; see for example "Population variance" on this MathWorld page: http://mathworld.wolfram.com/Variance.html
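The key step of the standard proof, using E[x_i^2]=\sigma^2+\mu^2 and E[\bar{x}^2]=\sigma^2/n+\mu^2 for i.i.d. samples:

```latex
E\Big[\sum_{i=1}^{n}(x_i-\bar{x})^2\Big]
 = \sum_{i=1}^{n} E[x_i^2] - n\,E[\bar{x}^2]
 = n(\sigma^2+\mu^2) - n\Big(\frac{\sigma^2}{n}+\mu^2\Big)
 = (n-1)\,\sigma^2
```

Dividing the sum by n-1 therefore gives an estimator whose expectation is exactly the true variance \sigma^2, which is what "unbiased" means.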

- Q: For the question "How to generate second and higher discriminant components", is the following answer correct?

Because the within-class scatter matrix and between-class scatter matrix may be ill-conditioned, we need to either find the pseudoinverse of the within-class scatter matrix or find perturbations of the within-class scatter matrix. For this part, I can't understand the details (why the rank is K-1 for K classes) and how it is actually calculated (where the pseudoinverse/perturbation should be put in).
A: For 2 classes recall the definition of S_{B} in CIS-07 s.11: it is made from one vector, the difference of the two means, call it D12. Therefore the rank of this matrix, by definition equal to the number of non-zero eigenvalues, will be 1, because if we transform the coordinates to D12 and its orthogonal complement, the only non-zero component will be along D12. If we have K classes there will be K-1 independent vectors Dij obtained as differences of the means, for example {D12, D23}, but D13=D12+D23 (in a vector sense). Another way of understanding this is that K mean vectors may be represented in a K-1 dimensional space. If we have 3 classes, the 3 means Xk lie on a triangle, and thus a 2D space exists, so a matrix built from these vectors will have rank 2.

- Q: For the notes CIS-13, page 6, you say "minimization of error may be done in a number of ways". For the second error function
E=\sum_{X}\sum_{i=1}^{K}||P(\omega_i|X)-C_i(X)||, I don't understand how this measures the error.
A: The norm of the difference ||P(\omega_i|X)-C_i(X)|| measures how far the predictions P(\omega_i|X)\in [0,1] are from the class labels 0 or 1, so for it to be a good measure of error we have to assume that the class label C_i(X)=0 for X that do not belong to the class \omega_i, and C_i(X)=1 if it belongs to this class. I have changed C(X) to C_i(X); then it is easy to see that C_i() is an indicator function of the class, 1 if X is in class \omega_i, 0 otherwise.

- Q: Often we say that if we want to estimate the density P(X_i|\omega), we can fit a single Gaussian or a combination of Gaussians to the histogram. How is this done in practice? My current understanding is as follows:
one way is to assume that the distribution underlying the histogram has some parametric form; then we can use EM to estimate its parameters and fit it.
The other way is to use a non-parametric method like the Parzen window, which estimates the density as the sum of the kernel values within the window.
Is it right?
A: Yes; if it is a single Gaussian then it is of course quite simple, but there are several numerical methods to fit parametric functions to given points from the histogram; see Numerical Recipes for example.
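For a single Gaussian the fit is indeed simple; a sketch using the maximum-likelihood estimates, which are just the sample mean and standard deviation (the sample parameters below are illustrative):

```python
import numpy as np

# ML fit of a single Gaussian: the estimates are simply the sample
# mean and the sample standard deviation (with 1/n normalization).
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=10_000)

mu_hat = samples.mean()
sigma_hat = samples.std()       # ML estimate uses 1/n, not 1/(n-1)

def gauss_pdf(x, mu, sigma):
    """The fitted density: gauss_pdf(x, mu_hat, sigma_hat)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(abs(mu_hat - 2.0) < 0.02, abs(sigma_hat - 0.5) < 0.02)   # True True
```

For a mixture of Gaussians there is no such closed form, which is where the EM iterations mentioned in the question come in.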

- Q: For logistic DA, the training phase is usually an iterative procedure based on maximization of the likelihood of generating the observed data. But I don't understand the formula for L(W,W_0). How does it represent the maximum likelihood of the observed data? The posterior probability is used, not the likelihood, so from my point of view it is MAP, not ML. But if we consider the assumptions of logistic DA, in the case of equal priors and identical covariance matrices, does the formula represent both MAP and ML?
A:

- Q: For the SVM derivation, the reason why we need to maximize \alpha is not clear to me. I understand a large \alpha will reduce L more on misclassified samples.
But why does \alpha need to be maximized because of this?
A: Just like in the boosting procedure, if there is a misclassification for X it will put more weight on this sample X, trying to influence the decision borders to handle it correctly.

- Q: For the non-linear case of SVM, why are the Lagrange multipliers \alpha_i bounded in [0,C]? How is this constraint derived? Can you explain more about this?
A: We had no time to go through the details here as this would require a few extra slides, but you may find the derivation in chapter 4.2.5 of the Webb book, for example.

- Q: About the mechanical analogy, why is the force F_i defined as
\alpha_i*Y^{(i)}*W/||W||?
Can I understand the two zero conditions
(\sum_{i=1}^{n}\alpha_i*Y^{(i)}=0 and the sum of torques = 0)
as corresponding to the constraint
\sum_{i=1}^{n}\alpha_i*Y^{(i)}=0 and the minimization of ||W||?
A: If we assume that each support vector exerts a force on the membrane then its direction has to be along Y^{(i)}*W/||W||, the unit vector perpendicular to the membrane, directed in the positive or negative way depending on Y^{(i)}. Now if \alpha_i is taken as the magnitude of the force exerted by X^{(i)}, then we can say that the sum of all forces in equilibrium is zero, and this is the same condition as the auxiliary condition that we got from the SVM derivation. The second condition, for the torques, shows that if W is expressed as a sum of \alpha_i*Y^{(i)}*X^{(i)}, as we have found, then all torques are also zero.

- Q: When we use cross-validation to auto-select an algorithm parameter, does it mean we use each possible parameter value, build a model and evaluate the model on the validation set, and finally the parameter value that achieves the highest accuracy on the validation set is selected? Is it so?
A: Yes. Of course in some cases we may try to compute gradients to find good parameters faster, but for a large number of parameters this is a difficult job.
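A brute-force sketch of this selection loop, here with leave-one-out as the validation scheme and k-NN as the model (the function names are mine, not from any package):

```python
import numpy as np

def loo_knn_accuracy(X, y, k):
    """Leave-one-out accuracy of a brute-force k-NN classifier."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # exclude the left-out vector itself
    correct = 0
    for i in range(len(X)):
        nn = np.argsort(d[i])[:k]        # k nearest among the other vectors
        votes = np.bincount(y[nn])
        correct += (np.argmax(votes) == y[i])
    return correct / len(X)

def select_k(X, y, candidates=(1, 3, 5, 7, 9)):
    """Try each candidate value, score it by validation accuracy,
    and keep the best one."""
    scores = {k: loo_knn_accuracy(X, y, k) for k in candidates}
    return max(scores, key=scores.get), scores
```

The same loop works for any parameter and any validation scheme; only the inner model evaluation changes.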

- Q: For maximum likelihood density estimation, on page 7 of the slides CIS-26, the last paragraph is not clear. It says "Likelihood estimation can be carried out for samples from a given class P(\omega|X;\theta)". But that is the posterior probability, not the likelihood. What about P(X|\omega,\theta)? Or can we here use the posterior probability as another form of likelihood, "assuming that the probability of generating n such samples is equal to P(\omega)^n with a prior class distribution estimated from the frequencies"?
A: Indeed you are right; the posterior probability is always harder to compute. I have changed that sentence, thank you for careful reading.

- Q: How to find the optimal parameter k for kNN? This part was not clear when I followed your lecture. The efficient way is to take one vector out and try different values of k on the remaining n-1 vectors, selecting the optimal value of k which minimizes the error on the (n-1) vectors? After that, what next? Evaluate this k on the vector that was taken out before? It's a leave-one-out test. The final value of k should be the one that is optimal in validation.
If so, then finding the optimal k can be taken as a two-step process. First, for each set of (n-1) vectors, select the optimal k. Then, by evaluating on the validation set (one vector), select again the optimal one from the set of optimal k values from the training data. Is that right? And how to explain why k will not change much? Because the effect of removing one vector is local?
A: In practice we simply run the leave-one-out procedure on the training set for different values of k, and select the best k; then we test with this k on the test set. If we want to find the best k this is what we should do.

If there is no test set, the procedure that you have described above should be used to estimate the expected accuracy in an unbiased way. The test set is then either the validation part from the crossvalidation or just a single vector from the leave-one-out. This is obviously expensive, as we now need N-1 runs for each of the N vectors, or a total of N(N-1) evaluations. But after the estimation of expected accuracy is done we still need a good k for new samples, and this should be derived from the whole training data. So the fact that a different k may be optimal in different runs does not matter.

- Q: You mentioned in the notes of chapter 32 that 1-NN results on the training set are 100%, and you also showed a 2-Gaussian 1-NN figure in chapter 14 to illustrate this.
But I am a bit confused on how can you achieve this 100% training accuracy, especially for some isolated points surrounded by points in other classes.
In chapter 32 IB1 can only achieve 80.9±3.6% after normalization.
So can you show me the reason?
A: If we take all the training set as reference vectors, and then use the same set for testing, then 1NN will give 100% becuase for the test vector Xi the training set will contain as the nearest the same vector Xi which is at the distance 0. The only exception is when two or more vectors from diferent classes have exectly the same features, then the result will depend on how we handle ties, but will be in general below 100%.

But if we take crossvalidation, as we did with IB1, or if we use the leave-one-out procedure, then of course accuracy will not be 100%, because the test and the training sets are different. - Q: We have a list of sample questions. Will you take some from this list as questions in the final exam?
A: Do not expect me to put the examination questions on the web! This list has been shown already in previous years, so please look at last year's examinations and judge for yourself.

- Q: May I know the meaning of notation in your Lecture notes for CIS-13, slide 6:
P(X|w1)=Gauss(0,1), P(X|w2)=Gauss(2,2).
May I know what the meaning for Gauss(0,1),? Is it for mean=0, var=1?

If so, since the two var are different, how can to get linear discriminants?Q: Slide 6 is an excerice, asking you to write down the risk for 2-class 1D problem, and the meaning here is Gauss(x0,sigma), as you have noticed.

Slide 7 is not a continuation of 6, it explicitely assumes the same variance for both classes, so the two variances are identical. - Q: I am starting to do the assignment 2. My initial thought is to do the bayesian data analysis for some data. But after two days investigation and thinking, I have some questions about bayesian method.
Please give me some instructions.
Q1. I will say bayesian approaches include, naive bayesian classifier, maximum likelihood classifier, maximum a posterior classifier and bayesian network. Can I categorized these methods as bayesian approaches?

A: Yes; general philosopy in all these approaches is Bayesian: look for priors, class conditional distributions, evidence, and compute posteriors. Probability experts have some other philosophical approaches, see for example the discussion of Bayesian perspective in an on-line book by ET Jaynes http://bayes.wustl.edu/etj/etj.htmlQ2. The learning phase of bayesian approaches is to learn the probability (prior and conditional) from training data. The simplest way to estimate these probabilities is to calculate the frequency of the training data. But I don't know any other methods? (If we assume the probabilities is in some parametric form, then parameter estimation methods like ML and EM can be used. But for my data has 13 attributes, how to model it in some parametric form?) any non-parametric method can be used for density estimation (prior and conditional) for bayesian approach? Very confused about this.

All these approaches for probability estimation are fine. For some data we have parametric description, especially in engineering, physics, chemistry etc. In your applciation you may not have it so you need to use non-parametric models.Q3 My plan is to analysis a data set of nba player to estimate his free throw level (high and low). What kind of analysis can I do by using bayesian methods?

A: Looking for factors that have influence on this level using Bayesian belief nets for example. See my software links, section on belief networks. - Q: For using cross-validation to decide the stop splitting criterion, how to know the error on the validation data is minimized and stop the splitting? What I mean is unless you get the 0 error, you have to split the successive layers. Am I right?
A: Splitting the tree nodes or adding more nodes to a basis expansion network is done using the training part of the data, but testing is done on the validation part, unseen by the training algorithm. When the complexity will be too high and the training algorithm will start to assign small regions of feature space to a given class the test results are going to degrade, as showing in the last slide in CIS-14. Adding additional split we will notice that although the training results improve the results on the validation part will degrade, so we will not make this split. In decision trees backward pruning is more common, first trees are trained to separate all data (this is possible only if there are no identical vectors with different labels), and then pruned to minimize error on the validation set.

- Q: For tree pruning method based on rule. In Duda's book, he said "Eliminating the irrelevant precondition rules simplifies the description, but has no influence on the classifier function, including its generalization ability. .... In this case we therefore eliminate rules so as to improve accuracy on a validation set.". How to understand this? Why it can improve accuracy on validation set?
A: If the condition is irrelevant for some training data it can be removed without changing the validity of the rule. On the new, validation or test data, simpler rules should work better, as the irrelevant conditions may only decrease rule accuracy, in a similar way as random conditions.

- Q: For regression and model tree. From your slides, my understanding is that regression generate a tree to approximate a function. It's piecewise constant such that the approximation is somehow rugged. And model tree also can approximate, but it is piecewise linear because at each node it approximate the data by fitting them on a line. Is this understanding about regression and model tree correct? Does model tree use residual (error) as the criterion to select split and stop split?
A: Yes, this is a good summary. There are many approaches to model trees, some based on residual errors. See here for some software and papers on regression trees with linear and non-linear nodes. See also:

DTREG, Classification and Regression Trees And Support Vector Machines (SVM) For Predictive Modeling and Forecasting, providing nice demo for regression trees.

M5 implementation in WEKA is described here. - Q: How to understand linear discrimination WX=0 is a special case of Bayesian approach (with identical covariance matrices)?
A: We have derived it in CIS13, slide 10; if probability distributions for two classes are given by multivariate Gaussians and covariance matrices are identical comparing two discriminant functions quadratic terms on both sides cancel and linear discriminants are obtained.

- Q: Although intuitively, I can understand when \alpha = 0 for Regularized Discrimination, RDA will be reduced to LDA, but how to prove it in mathematical way. How to write RDA function in the form of wx when all \sigma_i = \sigma?
A: The assumption behind this is that probabilities have normal distribution so if there are two with the same sigma we are back at the same problem as above. RDA is discussed in details in Webb, 2.2.2, and chapter 4.3.1 of Friedman, Hastie and Tibshirani book.

- Q: You mentioned in the last slide of CIS-15 two example of prediction error (both 0.5). So no criterion to say which one is better. It's up to the application and other additional information? Am I right?
A: If accuracy is identical one of the confusion matrices may still be preferred. The worse situation is when we do not learn anything, and this is the first case. Although a random guess is true in 50% of cases it is just random guess. The second situation is also true 50% of the times, but if class 2 is predicted (in 1/4 of all cases) we are sure it was really class 2, while if class 1 is predicted (in 3/4 of the cases) there is probability 2:1 that it is in fact wrong, so in this case we know much more. You may play yourself setting probabilities in another way but keeping the accuracy at 1/2. Please note that the class distribution may change from 1/2, 1/2 to something else, for example to (1/4,3/4) in the second case.

- Q: For the problem of "Curse and Bias". What do you mean for "bias term" in "MSE will grows due to the growing bias term"? And why the curse problem can be solved when selecting model with appropriate bias for the data?
A: I have added more comments here. Inputs are randomly distributed in [-1,+1]

^{d}, for d=1 ..10, the task is to approximate the target functions f(X)=exp(-8||X||^{2}). MSE is the error of approximating f(0), and the Bias^{2}show on the right side is calculated using the formula from Lecture 14. - Q: How to understand the figure in the page 6 of CIS-15. How does it show the distance of 1 neighbor in 2-D is larger than that in 1-D?
A: To cover 10% of volume in 1 D we need a line of 0.1 length, in 2D we need a square with side 0.31 = 0.1

^{1/2}, and in 3D it is 0.47 = 0.1^{1/3}, in 10 D it is 0.79 ... That means that in 10D the sub-cube with the side of 0.8 contains approximately 10% of data if it is uniformly distributed. If there are just 10 points uniformly distributed the average distance of neighbors will therefore grow from 0.1 to approximately the length of the diagonal of the hypercube, in 10D it is 2.5.

Projecting points on a single line we get relatively large density, but the distance between these points in 10D is large and thus highly-dimensional spaces are almost empty. - Q: In the page 5 of CIS-15 (crossvalidation), what do you mean for "If a number of numerical experiments is large this may still overfit the validation data"?
How to understand "CV selects the validation partitions randomizing the data"?
A: Parameters of the model are set to optimize the results on the validation data. So, if a number of numerical experiments using validation set is large the resulting model may overfit the validation data, even though this data is not directly used for training. If 1000 models are generated, with slightly different sets of parameters, by pure chance one may have a correct bias for this particular validation set, but that does not mean that it will generalize well.

For that reason "Crossvalidation selects the validation partitions randomizing the data", that is the data set is first permuted to have a random order of the vectors, and then partition into k parts, each of which is used for testing in the k-fold CV procedure. This gives a better estimation of generalization error than just validation on a single partition, because now it is impossible to find parameters that will by chance lead to good results on a selected validations set; the model has to perform well on k different sets that it has not seen. - Q: I am a bit lost for the equation in CIS-16, slide 7:
May I know whether there is some mistakes for that? As from my understanding, it should be

A: Going from contingency tables N(w,x) to confusion matrices is not so straightforward ... top row in N(w,x) is class w+, bottom w-.

Element*P*_{11}of the confusion matrix is the fraction of times we had class w+ cases and they have really been classified as w+; we have 10 w+ cases for x=0 and 40 for x=1; if x=1 the MAP model selects always class w+, so all 40 cases are assigned to w+, and*P*_{11}=0.40; the other 10 cases for x=0 that belong to w+ class are going to be assigned by the MAP model erroneously to class w- because all cases with x=0 are assigned to w-, therefore*P*_{12}=0.1.

So, this confusion matrix is correct as it is.

To generate it first find the MAP classification rule. Here it is: if x=0 then w-, if x=1 then w+.

Then use these rules to predict how many cases are correctly identified. For class w+ and x=0 there are 10 cases assigned to w-, and for x=1 there are 40 cases assigned to w+. So,*P*_{12}=*P*(w+,w-)=0.1 and*P*_{11}=*P*(w+,w+)=0.4. - Q: For generalized linear discriminant function, what I am considering is that it is somehow related to the kernel-based method. Although in the kernel-based methods, the y(x) should be the kernel functions that satisfy some properties for inner product, they both transform original data x into a high-dimension space and try to separate data using hyperplane in this transformed space. For example, is using polynormal discriminant function equal to using polynormal kernel?

A: In the example in 2D with quadratic kernel we saw that this should be like LDA in the space of {1, x1, x2, x1^2, x2^2, x1*x2}. So roughly for polynomials this is true (although I have not seen more detailed analysis of equivalence). Since the results depend on the numerical methods chosen for the solution, criterion that is used to define the problem, results may differ.

For other kernels, such as Gaussian, things are not that simple, Taylor expansions into polynomial component series is infinite, and SVM with such kernels is not the same as RBF with some Gaussian functions. so the relation here is non-obvious. As I have mentioned, I have not seen explicit transformations that would do the same as non-polynomial kernels. - Q: If we write discriminant function g(x) in homogeneous form a'y=sum(w_i * x_i), in the y space, the distance from y to the decision hyperplane H is given by |a'y|/||a||. Because ||a||>=||w||, does it mean in this transformed space, the margin will decrease?

A: I understand that y=phi(x) is the transformed variable and a'y is a scalar product of the hyperplane coefficients a and the transformed vectors y (Duda and Hart notation). If so, what is the meaning of a'y=sum(w_i * x_i)? Should it be a'y=sum(a_i * phi(x_i))? Then why should ||a||>=||w|| be in general true? Have you assumed some special transformation here?

The solution region in a space with larger number of dimensions is going to be larger, and thus if we write w'.x > b1 and a'y > b2 then the margins are mx=b1/||w|| and my=b2/||a||, and we are searching for smallest ||w|| and smallest ||a|| that still fulfils the inequalities. Let's assume that b1=b2=1, so only the length of w and a matters.

It should be easy to see that in higher number of dimensions ||a|| may be smaller, as we have more components at our disposal, for example if x and y are binary vectors then in 2D we need wi=1/2 to reach w.x=1, and the length will be ||w||=Sqrt(1/2), but in 10D components of a will be ai=1/10 and the length a=Sqrt(1/10), so a larger margin is guaranteed. - Q: Because for general linear discriminant function, the solution vector (weight vector) is not unique, additional requirement can be added to constraint the solution vector. One possibility is to seek a unit-length weight vector that maximizes the minimum distance from the samples to the separating hyperplane. The other is to find the minimum-length vector that satisfy the specific margin condition. What's the difference between these two requirements? From my point of view. The first one considers the distance to the decision hyperplane of all sample points. In other words, the first one consider the sum of distance to the decision hyperplane of all sample data. So it doesn't require all sample data to have a margin larger than b. For some samples, the margin can be smaller if the sum of distance is maximized. But for the second one, the requirement of minimum margin is ensured. Am I correct?

A: I think so.

But I cannot prove what is said in Duda's book: The solution region of the second one lies in that of the first one and being insulated from the old boundaries by the distance of b/||y_i||. Can you explain this for me?

A: This is explained in the chapter 5.4.1, and fig. 5.9 in Duda/Hart textbook, and is intuitively rather obvious. When b =0 any vector with length >0 that is in the solution regions is sufficient, but when b>0 for all vectors a'y > b has to be true, so this will not allow very small values of a elements, as shown in Fig. 5.9. - Q: When I read notes CIS-13, pp. 9-10, I am a bit puzzled for the equations for g_i(X).
May I know how to get linear from the quadratic form?

A: We need discriminant functions to compare them; using formula shown in slide 9 two quadratic functions g_i(X) and g_j(X) for identical covariance matrices have identical quadratic terms X^{T}Sigma^{-1}X, and these terms may be dropped because in any comparison, like g_i(X) < g_j(X), they are identical on both sides; thus what we need for comparison are discrimination functions with linear terms given on slide 10. - Q: I found a bit difficulty to understand the PCA and FDA , may I know some recommended book or papers which related to those mentioned in your notes?

A: Unfortunately Duda, Hart and Stork do not describe it well.

The best, very detailed description of PCA is in the Chapter 9.3.1 of Andrew Webb book. Fisher's criterion (FDA) is described in chapter 4.2.3, and 4.3.3, and some recent development in the use of this criterion are in 4.3.9.

K. Fukunaga, Introduction to Statistical Pattern Recognition, Chapter 9 of the book has detailed PCA exposition. Although visualization is not mentioned in these

S. Theodoridis, K. Khoutroumbas, Pattern Recognition (4th ed), chapter 6.3, which is followed by singular value decomposition and numerical considerations.Some tutorials on PCA are available in the net, for example:

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

http://www.fon.hum.uva.nl/praat/manual/Principal_component_analysis.html

For more visualization methods based on discriminant analysis see:

http://www.blackwellpublishing.com/content/BPL_Images/Journal_Samples/ANZS1369-1473~43~2~158/158.pdfFor more, just write "Principal Component Analysis tutorial", and "Fisher Discriminant Analysis tutorial" in Google.

- Q: What linear transformation preserves distances between data points in the best possible way?

A: Strange, but there does not seem to be a general answer to this question. It is related to a procedure called "classical scaling", described in: Jan de Leeuw and Willem Heiser. Handbook of statistics, volume 2, chapter: Theory of multidimensional scaling, pp. 285–316. North-Holland, 1982.

A newer book discussing classical scaling and MDS is: T.F. Cox, M.A. Cox, Multidimensional Scaling, CRC Press 2001 which you can read on-line here. Also Chapter 9.4.1 of Andrew Webb book has a nice short introduction to classical scaling. Algebraic equations that should be solved are similar to PCA and the approach indeed is reduced to PCA is original distances are Euclidean.

Strictly speaking this method is used to find the space of such dimension that points will have distances exactly equal to the original distances, as is for example the case from a distance matrix between the cities. Unfortunately software for classical scaling is hard to find.

"Preservation of distances" assumes some kind of measure; if we use Euclidean measure the stress function becomes non-linear function. I have not seen any solution to this problem but it is worth trying yourself, it should not be so difficult. - Q: For the first lecture, here is my first question:

The problems of CI involve not only pattern recognition, data mining ... but also philosophy, psychology and neurosciences. The guys in computer society often design better algorithms to solve problem based on some assumptions of the problem. it is often the fact that the problems has not been defined well and the assumptions about the problem are still being argued. Without the definition and assumption of the problem, how can we design computational solution for this problem? You may design your method based on one assumption but it maybe For example, because what is the mechanism of visual attention is still not clear in neuroscience and psychology, I always confused what reasonable assumption I should based on to design better computational algorithms How to solve or partially solve this problem?A: Indeed a major problem is formulation of good questions; we can learn methods to solve problems once we are able to formulate them. It took Summerians hundreds of years to go from a concept of "10 jars of oil", to an abstract concept of "10 of something". This is one of the absolutely non-algorithmic problems and the progress in formulation of new points of view is rather slow, most people just repeat the same thing slightly improving it, without questioning the basics. So there cannot be an easy solution to this problem.

Personally I find cognitive inspirations, studying how brains solve problems, how to simplify complex ways of describing the neurodynamics, very useful. I will go back to such inspirations during lectures many times, even though a formal theory may exist now that seems to be purely statistical. We should try different assumptions and see what our models are able to explain. In visual system there are many assumptions that people make, and grossly simplified versions of models based on such assumptions find their ways to autofocus in the electronic cameras. Divide the vision field into many regions and then focus on the one with the big variance, for example fast movement or strong contrasts, exploring all such areas. - Q: I have a question about the four Gaussians in 8D example. If the data is generated by Gaussian distribution in 8D, how can two dimensions show some co-linear characteristics? Is it because the variance along X5 is very small? I think we can only control the mean to have a relation like X5=a*X1, but not the variance becasue the data are randomly generated according to the distribution.
You are of course right, sorry I have not been too clear here: 8D means that there are 8 features; 4 Gaussian distributions, each in 4D, have been generated, the first centered at (0,0,0,0), the second at (1,1/2,1/3,1/4), the third at 2(1,1/2,1/3,1/4) and the fourth at 3(1,1/2,1/3,1/4); so the first 4 features x1-x4 are made from overlapping Gaussians, presented in different colors; the remaining x5-x8 features are made from these first four, x5=2*x1+eps, where eps are random numbers evenly distributed in [-1,+1] interval. I have added some remarks to the lecture presentation.

- Q: Dear Sir, indeed I have learned much from this course, and from you. The most important gain for me may be a perspective about the CI problems. But as you told us, there are a lot of more things that you cannot teach us in such a short-term course. So my question is: may I be able to continue asking some CI questions after the end of this course? I know you are very busy here...
A: I am glad to hear that, showing you some perspectives was important part of my plan and of course I feel that I have not done enough ... but then it is always very subjective, I can tell you what in my opinion is worthwhile to pursue, and of course others will have different ideas. I will try to continue adding some answer and advices to this page.

- Q: There are several fundamental kernels, such as polynoimal kernel, gaussian kernel, sigmoidal kernel, distance kernel, and you listed the formular of K(X,Y) of those kernels in the lecture.
However, the fomular of Phi(X), which is the vector in the high space corresponding to those kernels, are not mentioned. (except for that of ploynomial kernel). Would you please give the form of those Phi(X). Thanks.
A: The beauty of the kernel-based approaches is that we do not need to write this transformation explicitly, all we need are scalar products, and to be sure that kernels have properties that scalar products in some target spaces have. I do not know about any book or paper where this transformation is directly written and it may be difficult to find it. Although this may be an interesting excercise it is not clear if it could be useful for anything.

- Q: It seems that there’s no relation between the Naive Bayesian estimation (in your lecture notes 13) and the Bayesian estimation approach (in the book of Duda):

- NB: assume all features conditionally independent, estimate the posterior probability

- Bayesian estimation (in the Duda book): assume the form of the density, some prior probabilities about its parameters, then estimate its parameters.

These two things are different, I guess? And in our lectures we learn only about the NB and maximum likelihood for density estimation, not the above Bayesian estimation?A: Indeed, Bayes name is attached to many things, and NB is one of many of them and is a method of probability estimations useful in classification, while Bayesian estimation is commonly used to estimate parameter distribution functions more than single parameters.

- Q: Maximum likelihood estimation:
I’m not sure why we have multiplicative constants in slide 8, lecture notes 26, and why they are not important in the example?
In solving the derivative equation in slide 9, we use the given (real) values of n1, n2, etc., for solving quadratic equation to get , is this correct? So where are the expected (not real) values of n1, n2, etc. come from?
A: Taking the log of likelihood will make from these constants an additive term. We were given n1 - n4, so they do not change, and the derivative will be zero. After optimal value of theta is found P(Class1)=(2+theta)/4 is the probability of Class1, times n=125 samples =

=129

I have made it a bit more explicit updating the slides, but I am afraid that this type of problems come from not making your own notes ... of course the lecturer will finally spell out all the details on the slides, but making notes and thinking about them actually facilitated understanding and forced the motor cortex of the brain (and it is quite a bit of the total cortex) to participate in thinking. Having everything spoon-fed is not good ...

- Q: could we bring our own summary of computational intelligence course instead of book to the exam hall?
A: I talked about this quite a bit: only one book, no notes, self-produced or self-annoted material. It is final to highlight sentences and equations, or even add small comments in derviations, but please do not write other things there. On the second thought I understand why there are almost no open book exams ...

- Q: sample question for L8: if several unknown signal sources are linearly mixed usign unknown coefficients and only the results are given, how can the sources and the mixing coefficients be calculated?

so, may i know, how detail should we go about answering this question? is it enough if we just say Y = WX, find local maxima of non gaussian distribution .....? or we have to come up with the full procedure of ICA?A: I did not talked much about ICA, so I cannot expect the full algorithm for ICA to be given as an answer. The main idea what to do is sufficient, but of course a few more variants will be better.

- Q: In Lecture 18 (decision trees), slide 13 (random data example):
In this example, we use a randomly generated data label (in which the data features are meaningless), only the prior probability is known.
So, I think, in this case, any classifier will also perform worse than the majority classifier, not only the overfitted classifier?

So, this example will not be able to demonstrate the badness of the overfitted tree?A: It should not be difficult to construct other classifiers that get results close to the optimum, that is the prior probabilities. For example, density estimation methods such as kNN with larger k on average will get optimum result. A tree with a single level (tree-stump) should also be close to optimum, it may be slightly lower or higher in accuracy than the majority classifier due to random fluctuations in data dsitribution. The full overfitted tree has clearly the worst performance.

- Q: I’m still not sure how SOM can reduce the dimensionality of data vectors. As in the lecture notes, the winner neuron will adjust its parameters W (there are d parameters for each neuron, corresponding to the dimension of data vectors) to be more similar to the input vector. How that change of parameters W affects the shape of the grid of neuron (assume 2D grid of neuron)? I mean, is there any formulation to calculate the change in the location of the neuron in the gird, according to the change of its parameters W? We cannot plot the parameters of all neurons in a 2D space (since they are d-dimension vectors), so the visualization result is the shape of the 2D neuron grid, is this correct?
A: Not the shape of the grid, becuase in SOM it is rectangular, fixed grid. We do plot node paramters if d=1 or 2, as in the Peano curves example. You can make simplest example doing hand calculations and see what will happen. Let the grid be just 2 nodes, with 1 parameter each, initially w1=0.1 and w2=0.2. Now start showing them data X=+1 and X=-1, move your w1, w2 using SOM update rule once in the direction of +1 and another time -1. Because w2 is closer to +1 it will be the winner and will eventually move to +1, moving back to larger and smaller values, while w1 will be a winner for -1 and will finally become -1. Nothing is changed in physical space, but node parameters may start to point to different areas of the input space.

So what is visualized if d=10? Only the activity of nodes. Seeing that a node is a winner we know that in 10D space there is a cluster somewhere, and the nearby cluster should excite nearby node in 2D space. Similarity between clusters in the input space and winning nodes in on the gris is preserved, this is why SOM is called topographical mapping, although it is not a perfect mapping.

- Q: In slide 16 of lecture note 9 (the Peano curve), is the 2D input data generated from the inside of a triangle? How about the shape of the resulting curve if the 2D data is generated from the inside of a rectangle, or a hexagon, using 1D neuron array?
A: This is not hard to imagine: Peano curve again! Try it with SOM software, like the one I showed during the lecture.

- Q: What is the use of the Voronoi diagram and Delaunay triangulation in GCS algorithm? It seems that they are just for aiding representation? The Voronoi set in the lecture notes contains all the input vectors which have the node in the corresponding Voronoi cell as the winner, is this correct?
A: In all compettive learning algorithms I showed you Voronoi diagram showed which data vectors are assigned to a given prototype in the center of the cell. This is useful in real applications and may be analyzed to understand what this prototype means. Delaunay triangulation helps to see how this space is partitioned, aiding presentation.

- Q: Could you please explain some more about the gradient iterative minimization process for MDS, in the easy case in slide 7, lecture note 11.
A: This is a longer story, I did not tell you how to implement MDS in practice, but if you are interested read about it and other visualization methods in this PhD thesis.

- Q: Lecture 9, Sample Question 5: Is only when reducing the neighbor radius too quickly leading to the map distortion? or is there any other reason contributed to the map distortion?
A: What other reasons can you see for distortions in such algorithm? Is it always possible to represent N-D relations in 2-D? What type of distortions may arise? Does "the best" 2D representation always exist?

- Q: Lecture 20 Why there is a "2" in the test condition in SSV?
And in the Lecture Notes, it reads SSV need to be minimized to find the best split? However, I'm still not clear why it is to be minimized but not maximized. To be frank, I think it should be maximized to find the best split. So could you please explain more about the SSV criterion?
A: Factor 2 is just to be sure that the first term measuring separability dominates over the second term. Several alternative formulations of this criterion exist and indeed in this form obviously the bigger SSV is the better, so you are completely right. I have added some remarks to the slides.

- Q: Lecture 26:
Sample Question 2: How many PDF be used to discuss psychological phenomena?
A: I am not sure that I understand that question. Probability density functions P(C|X) are used to categorize some phenomena; if you want to apply it to cube with faces in Slide 3 you have a choice in modeling: P(face|X) may have local maximia corresponding to X in corners of the cube and may vanish for some radiculous combinations; you may also mdoel just one face expression, ex. P(smiling_face|X).

- Q: Lecture 29, Sample Question 3: What type of contours may be created using mixed activation functions? I think the shape of the contours should be the same as for the distance-based activation, with a shift determined by the linear combination. Is this correct?
A: Yes, but can you characterize them a bit better? Can it be a hyperplane, a circle, a hyperboloidal surface?

- Q: Lecture 29, Sample Question 6: What is meant by a linear parameter? And what is the simplest solution for the linear parameters?
A: Linear parameters influence the result in a linear way, so if y(x;w,b) = w*x+b then w is a linear parameter. If RBF networks are used then the Gaussian widths of the nodes are non-linear parameters, but the output weights are linear. So, what is the simplest way of finding linear parameters in this case? It is stated in the lecture notes.
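
The least-squares idea behind this can be sketched in a few lines of Python; the 1-D data and the model y = w*x + b below are made up for illustration, but the same closed-form solution applies to the output weights of an RBF network, with x replaced by the vector of hidden node outputs:

```python
# Closed-form least-squares fit of the linear parameters w, b in y = w*x + b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated from y = 2x + 1, so w=2, b=1

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
# w = cov(x, y) / var(x), b = mean(y) - w * mean(x)
w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - w * mx
print(w, b)                         # -> 2.0 1.0
```

No iteration is needed: because the parameters enter linearly, one matrix (here scalar) computation gives the optimum directly.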

- Q: Lecture 31, Sample Question 4: Which variable should be considered first? I think we should use the variable that with less possible value (i.e. with more restricted constraint) first. So does it mean that we should use variable with lower density first?
A: Density may be high but focused in a small area, so that it is non-zero only around +1, say. Density may be low but non-zero for all values, and then the variable is less useful.

- Q: Lecture 32. Sample Question 3: Why is it easy to calculate leave-one-out estimate using kNN and difficult using other methods?
Is it because other methods need to take n-1 vectors, while kNN only look at 1 vector (if k=1)?
A: Just try it: in most methods you will need to repeat the training procedure n times, and this is usually slow. What do you have to do in k-NN?
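
A minimal Python sketch (toy 1-D data invented here) shows why no retraining is needed for the leave-one-out estimate with 1-NN: the training set itself is the model, so each point is simply classified by its nearest neighbor among the remaining points.

```python
# Toy 1-D dataset: (feature value, class label)
data = [(0.0, 'A'), (0.2, 'A'), (0.4, 'A'), (5.0, 'B'), (5.2, 'B'), (2.6, 'B')]

def nearest_label(i):
    """Label of the nearest neighbor of point i among all OTHER points."""
    xi, _ = data[i]
    j = min((k for k in range(len(data)) if k != i),
            key=lambda k: abs(data[k][0] - xi))
    return data[j][1]

# Leave-one-out error: each point is predicted from the rest of the data,
# with no training step to repeat.
errors = sum(nearest_label(i) != data[i][1] for i in range(len(data)))
print(errors, "errors out of", len(data))   # -> 1 errors out of 6
```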

- Q: Sample Question 5: Why is k-NN stable while LDA or decision trees are not? Is it due to the greedy nature of LDA or decision trees, i.e. they always try to find the best partition (or border) to split the data, even when there is only a small perturbation of the data?
A: What happens to the borders of a decision tree when one of the training vectors is slightly changed? This has been clearly illustrated. Now what happens to the kNN borders?

- Q: Lecture 35: Sample Question 8: For mutual information of discretized features, why is it difficult to select continuous features?
A: I am not sure what you mean? How would you calculate MI for continuous features?

- Q: Lecture 36: According to the Bagging algorithm in the notes, it uses the SAME model trained on all subsets Di from the bootstrap, while the book I read says that different component classifiers are trained on the Di. So which one shall I follow?
A: I am not sure which book you have; of course you can use bagging with a committee of different classifiers, but it is commonly used with decision trees, creating a slightly different tree for each bootstrap subset. In this case different data is the source of model variability, but it may also be different types of models and stochastic learning that create the different models.

- Q: Lecture 6: Sample Question 3: What is the most important property of principal component?
A: So far I have answered all specific questions I've got, but for a few questions such as this one it looks like you want to get easy answers to learn. If you read Lecture 6 you will find a description of principal components, and you should be able to answer this question easily, giving me your opinion. The point is not to learn what I want you to answer, but to understand what the topic is about and make your own judgements. I am sure that you have a good memory, but this is not a poetry or spelling contest. The questions below are more or less specific, and if you speculate on important PCA properties and send me a ranking list I may point out a few more.

Asking questions like:

Sample Question 7: Describe how a semantic map may be constructed, and why do SOM and MDS semantic maps preserve natural classification?

Lecture 27, Sample Question 2: What alternatives for parameter optimization exist?

Lecture 28, What are the difficulties in non-parametric density estimation?

What may I say? Repeat the lecture?

- Q: Is there a mistake in the formulas on page 12 in lecture 6 (PCA):

C_{Y}=ZC_{X}Z^{T} should be C_{Y}=Z^{T}C_{X}Z ?

ZAZ^{T}=C_{X} should be Z^{T}AZ=C_{X} ?

A: Indeed, two slides earlier it was still fine ... please watch for such errors; there are different formulations depending on whether vectors are defined as rows or as columns. I have added explicit dimensions now to make clear which is which.
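
The corrected formula C_Y = Z^T C_X Z is easy to verify numerically; here is a small Python sketch with a hand-picked 2x2 symmetric matrix (my own toy example, with Z holding the normalized eigenvectors as columns), showing that the transformed covariance becomes diagonal with the eigenvalues on the diagonal:

```python
import math

# C_X = [[2, 1], [1, 2]] has eigenvectors (1,1)/sqrt(2) and (1,-1)/sqrt(2)
# with eigenvalues 3 and 1.
s = 1.0 / math.sqrt(2.0)
Z = [[s, s], [s, -s]]            # eigenvectors as columns
CX = [[2.0, 1.0], [1.0, 2.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

# C_Y = Z^T C_X Z should come out diagonal: approximately [[3, 0], [0, 1]].
CY = matmul(transpose(Z), matmul(CX, Z))
print(CY)
```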

- Q: Can we reduce both bias and variance of a model by reducing its MSE (such as by some ensemble methods)?
If yes, why do we say that there is a “tradeoff” between bias and variance?
A: This is the goal: good models should generalize well, and for that we need a small bias, that is, a very flexible model, and a small variance, to follow only real trends in the data, not the noise. If we take one model we face the tradeoff, but ensembles are not single models; they are a series of models that converge to a final model that is better in both respects. The tradeoff does not prevent the invention of new, better models; it just says that if you take a fixed model with small bias and overtrain it to get 100% correct answers, it will have a high variance and will generalize poorly on novel data.
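
The variance-reduction part of this answer can be illustrated numerically. The sketch below is a toy simulation (all numbers invented): each "model" predicts the true value 1.0 plus noise that stands in for training-set variability, and averaging an ensemble of such models shrinks the variance (for independent models, by roughly a factor equal to the ensemble size) without changing the bias.

```python
import random

random.seed(1)
SIGMA = 0.5          # spread of individual model predictions around the truth

def noisy_model():
    """One trained model's prediction at a fixed point: truth 1.0 plus noise.
    The noise stands in for variability caused by different training samples."""
    return 1.0 + random.gauss(0.0, SIGMA)

def ensemble(n):
    """Average of n independently varying models at the same point."""
    return sum(noisy_model() for _ in range(n)) / n

def empirical_variance(f, trials=20000):
    vals = [f() for _ in range(trials)]
    m = sum(vals) / trials
    return sum((v - m) ** 2 for v in vals) / trials

v1 = empirical_variance(noisy_model)
v10 = empirical_variance(lambda: ensemble(10))
print(v1, v10)       # v10 is close to v1 / 10; the bias (truth 1.0) is unchanged
```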

- Q: You mentioned that "2D histograms in 3D are still useful, but hard to analyze." Why is that? Is it because the number of data points in each bin will be very small? Why will more than one class be difficult? Is it because visualization is hard?
A: Anyone who has tried to view 3D histograms will know it; in rare cases the situation may be clear, but usually one has to view the histogram from different angles, trying to rotate it to see anything. It is mainly a problem of visualization; for a small amount of data the number of points per bin may also be an issue.

- Q: What are i and j in the error function in Lecture 16? Does it consider multiple models or a single model?
A: Minimization of the error function means changing model parameters (and thus generating multiple models) in such a way that the error function is minimized. I have added a slide with an explicit example to show how confidence in the decisions may be increased and errors reduced at the expense of the rejection rate and decrease in accuracy.

- Q: What is a Lift chart? What is the uses of it?
A: In some books cumulative gains are called lift charts, but marketing experts are careful to distinguish them; I have updated lecture 16 slides. You may also look at detailed examples on the page:

http://www2.cs.uregina.ca/~hamilton/courses/831/notes/lift_chart/lift_chart.html

- Q: About the Manhattan distance.

The line that is equidistant from the two prototype points in the corners (0,1) and (1,0) of the square seems to be non-unique in the Manhattan metric?

A: Amazing, but true! For a symmetric arrangement of prototype points on a y=x+b or y=-x+b line, the Manhattan metric leads not to a single line, but gives rectangles that contain points equidistant from the two prototypes (for a restricted data range; otherwise it gives quarter-planes in the upper and lower corners)!

Q2: How to draw the lines that are at the same distance from two points (x1,y1) and (x2,y2) using Manhattan metric? We need to solve the equation |x1-x| + |y1-y| = |x2-x| + |y2-y| to find a set of lines from the result of x and y. Is this correct?

A: Yes, but it may be easier to use geometrical symmetry arguments when drawing these points than to solve such equations, as is evident from the example given above. Try to see what happens ...
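
A small Python check of this geometry (prototype positions as in the question above; the test points are my own, chosen to be exactly representable): whole quarter-planes beyond the square are equidistant from the two prototypes, while inside the square only the diagonal y=x is.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

A, B = (0.0, 1.0), (1.0, 0.0)

# Any point in the quarter-plane x >= 1, y >= 1 is equidistant from A and B:
for p in [(1.0, 1.0), (2.0, 5.0), (7.0, 2.0)]:
    assert manhattan(p, A) == manhattan(p, B)

# ... and so is the lower-left quarter-plane x <= 0, y <= 0:
for p in [(0.0, 0.0), (-2.0, -0.5)]:
    assert manhattan(p, A) == manhattan(p, B)

# Inside the unit square the equidistant set is just the line y = x:
assert manhattan((0.25, 0.25), A) == manhattan((0.25, 0.25), B)
assert manhattan((0.25, 0.5), A) != manhattan((0.25, 0.5), B)
print("whole quarter-planes are equidistant in the Manhattan metric")
```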

- Q: When the covariance matrices are identical, we obtain the linear classifier

When the covariance matrices are identical and diagonal, and the prior probabilities are equal, we obtain the distance-based classifier.

Is this correct?

A: Yes, but there are two points to note: 1) this is true only if we take Gaussian distributions; 2) if the covariance matrices are diagonal then Euclidean distances are obtained; for a non-diagonal covariance matrix decisions are still based on the distance to the Gaussian center, but the Mahalanobis distance should be used, with the inverse of the covariance matrix.
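
A toy Python sketch of point 2 (the covariance matrix and test points are invented here; the 2x2 inverse is written out by hand): with a non-diagonal covariance, the Mahalanobis distance d^2 = (x-mu)^T Sigma^{-1} (x-mu) grows slowly along the correlated direction and fast against it, unlike the Euclidean distance.

```python
# Mahalanobis distance in 2-D: d^2 = (x - mu)^T Sigma^{-1} (x - mu).
mu = (0.0, 0.0)
Sigma = [[2.0, 1.0], [1.0, 2.0]]       # non-diagonal covariance

# Inverse of a 2x2 matrix [[a, b], [c, d]] is [[d, -b], [-c, a]] / (ad - bc).
a, b = Sigma[0]
c, d = Sigma[1]
det = a * d - b * c
Sinv = [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis2(x):
    dx = (x[0] - mu[0], x[1] - mu[1])
    return (dx[0] * (Sinv[0][0] * dx[0] + Sinv[0][1] * dx[1])
            + dx[1] * (Sinv[1][0] * dx[0] + Sinv[1][1] * dx[1]))

# Both points are at the same Euclidean distance from mu, but not in the
# Mahalanobis metric: (1,1) lies along the correlation, (1,-1) against it.
print(mahalanobis2((1.0, 1.0)), mahalanobis2((1.0, -1.0)))   # -> 0.666... 2.0
```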

- Q: In Lecture 28, slide 9 on convergence properties: the rate of convergence is given with n functions; here n refers to the m used in slide 2 for the basis set functions.
A: Indeed, it is better to use the same letter m as a few slides before; thank you, corrected.

- Q: Why are we not satisfied with the conditional probability, but want the posterior probability to make a decision? My answer is: this is because the conditional probability alone is not enough; we need the prior probability in order to make a decision. Is this correct?
A: True, but note also that in slide 6 we show that the posterior probabilities minimize average errors of the decision procedure.

- Q: Write all normalization conditions involving a priori, conditional and posterior probabilities. What do you mean by normalization conditions?
A: What they should sum to; for example, the sum of all prior probabilities should be 1. Sometimes we do not have proper estimates of all probabilities, but know, for example, that class 2 is observed twice as frequently as class 1; then the estimates p1=1, p2=2 sum to 3, and we divide (normalize) each estimate by this sum so that p1+p2=1, as it should be. Therefore such sum rules are called "normalization conditions".
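
The same normalization written as code (toy counts, class 2 observed twice as often as class 1):

```python
# Unnormalized estimates: class 2 is observed twice as often as class 1.
p = {'c1': 1.0, 'c2': 2.0}

# Normalization condition: the priors must sum to 1, so divide by the total.
total = sum(p.values())
priors = {c: v / total for c, v in p.items()}
print(priors)          # -> {'c1': 0.333..., 'c2': 0.666...}
```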

- Q: What is the risk matrix in this question? I could not find it anywhere in the lecture notes.
A: Indeed I did not write it clearly; it is the same as the cost or loss matrix lambda.

- Q: Bias-variance trade-off: From the lecture notes, I’m not sure how linear models are trained on data. Could you please give me some more explanation?
A: I understand that this question comes from reading the lectures sequentially, because this topic is explained only in later lectures, mainly 25 and 26 on linear discrimination.

- Q: Can you explain a bit more about the curse of dimensionality?
A: See the Duda and Hart book, the chapter on non-parametric techniques, where more details may be found, including two problems that illustrate nicely how a growing number of dimensions leads to difficulties in the estimation of probability densities.

I gave some examples in the lecture notes, showing that to keep the same density - and thus roughly the same accuracy of approximation of the probability density functions - the number of points should increase exponentially with growing dimensionality. High-dimensional geometry is very non-intuitive and surprising, but we had no time to discuss it. For example, take binary strings of dimension 1000 and randomly create 1 million strings; searching for the neighbors of a string will find 98% of all strings differing by 411 to 430 bits, surprisingly far, and only 1 string in 10,000 closer than 400 bits or further than 432 bits. Thus using a nearest neighbor strategy in highly dimensional spaces leads to quite a wrong bias of the algorithm and hence larger errors, as shown in slide 8.
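
This concentration of distances is easy to reproduce. The sketch below uses only 200 random 1000-bit strings (far fewer than the 1 million above, so the extreme tails are not sampled), but the effect is already clear: all pairwise Hamming distances cluster in a narrow band around 500.

```python
import random

random.seed(0)
DIM = 1000

# A random binary string represented as a Python integer with DIM random bits.
strings = [random.getrandbits(DIM) for _ in range(200)]

def hamming(a, b):
    """Number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")

# All pairwise distances between distinct random strings.
dists = [hamming(strings[i], strings[j])
         for i in range(len(strings)) for j in range(i + 1, len(strings))]

lo, hi, mean = min(dists), max(dists), sum(dists) / len(dists)
print(lo, mean, hi)
# The mean is close to DIM/2 = 500 and the whole range is narrow:
# almost every string is "equally far" from every other one.
```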

- Q: There are 4 types of classifiers: majority, maximum likelihood, MAP and minimum risk. You told us that the minimum risk classifier is the best, since it takes into account the cost of different types of error, correct?

How can we compare the majority (the worst), likelihood and MAP classifiers: which one is better, and when are they used?

A: It all depends on whether we have enough reliable information. If we do not know anything we frequently go by the majority rule, implicitly using a majority classifier; for example, this leads to the stereotypes that many people use in thinking about other people. If we have sufficient data to calculate class-conditional probability densities then a maximum likelihood predictor may be used, or even better MAP; if the risks of different decisions are known or may be estimated then an extension to minimum risk is possible.

Note that in the usual case of comparing two decisions, maximum likelihood and MAP are equivalent because the evidence p(X) cancels. There are more complex cases where rejection of low-confidence decisions and additional constraints on decisions may introduce a difference between these two formulations, but we have not discussed them.

- Q: As we have learned various classification models in CI, it seems that most of them assume the form of "model + parameters + learning method": they first assume the model (statistical, linear discriminant, neural network, decision rules, etc.), then find some parameters to represent this model, and then use a learning algorithm to get the optimal values of these parameters...

So my question is: Is there any guideline to determine how many free parameters are needed for a certain kind of problem? This may be equivalent to asking 1) how many individual PDFs should we mix together to model our dataset, or 2) what should be the number of neurons in the hidden layer of a NN?

For an analogy in physics, in the case of uniform linear motion, the velocity and time give us everything, while for uniformly accelerated motion we need another parameter, the acceleration... Can we also categorize classification problems in this way? What is the optimal number of parameters for a problem? Is there any theory about that?

A: Physics uses parametric models in the description of simple phenomena, and if it is possible to use guidance from physics our model should incorporate it; it would be an appropriate bias. Unfortunately many phenomena are too complex to describe by parametric models. Take protein folding as an example. The problem is simple: a chain of amino acids folds into a 3-dimensional structure that has minimum energy (free energy, to be precise). We can write the equations and they are useless, with too many parameters to optimize. Assume instead that a group of molecules is described by a single potential and introduce some parameters to model it. Now the number of parameters describing the amino acid chain is much smaller, although it still may be of the order of a million. Try to fit these parameters using the known protein structures - Nature has done some computations for you - and use them to predict new structures.

What is the optimum complexity of your model? I have made some remarks, very briefly, about the application of information theory to determine optimal model complexity: AIC (Akaike Information Criterion), BIC, Minimum Description Length, etc. These methods try to estimate the generalization error by analyzing the training error, frequently using some form of bias-variance decomposition. In computational learning theory this is an important branch. A good introduction is in the Hastie et al. book, Chapter 7, that I recommended at the beginning. In practice, however, crossvalidation is the easiest if you can afford the computational costs. It is a good topic for theoreticians, but I would not recommend it if engineering applications are the main goal, because it is not our biggest problem and we should not expect a great breakthrough that will allow us to easily set up much better models than we already have.

COLT (Conference on Learning Theory) conferences, http://learningtheory.org/colt2005 are devoted to learning theory; the first 3 topics they mention are: analysis of learning algorithms and their generalization ability; computational complexity of learning; Bayesian analysis.

- Q: As you mentioned in the lecture, Sir, there are several sources of a model's variability... Would you please give us some advice on how you think about this kind of variance? Do you think it is worthwhile to find a model that will always generate the same result independent of the presentation order of the inputs, given a fixed dataset? Do you think this kind of input order-independence is possible? Since there is a tradeoff between the result's accuracy and variance, will this kind of order-independent model finally have to sacrifice its accuracy for the order-independence? As a more essential question, is it possible for a model to always generate a constant knowledge representation for a given dataset, and should it?
A: Stability of the models is very important, and it is reflected in the variance of the model during crossvalidation training or bootstrap sampling. If the model gives more or less the same answers it has a low variance. Decision trees will give the same answer independently of the order of presentation, but they are not stable and have large variance. Averaging the answers of many trees in bootstrap procedure gives you quite stable models.

This is an important issue: if you want to have a robust, simple description of the data you should find a method that has a good bias for the data analyzed (unfortunately different data require different biases, and thus methods) and find the simplest model that is not significantly worse than others; this may have a small variance. We have tried to pick the rules from trees that are the most common in the CV procedure to achieve this, giving a very simple logical description that does not change when we sample the data many times creating new training sets. Sometimes it really works, but it is a challenge to find such rules.

One way we have tried to solve it was to introduce heterogeneous systems, based on functions of different types, for example different types of tests for decision trees in each node, different neural node functions, or different local distance functions in similarity-based methods. This type of model may find the best bias for the data, but such models are rather hard to train and the issue of how to create them is still open. Some papers that attempt to create heterogeneous neural, decision tree and similarity-based models are linked to the page:

http://www.fizyka.umk.pl/kmk/publications.html
- Jankowski N, Duch W (2001). Optimal transfer function neural networks.
- Duch W, Jankowski N (2000). Transfer functions: hidden possibilities for better neural networks.
- Duch W, Adamczak R, Diercksen GHF (2000) Constructive density estimation network based on several different separable transfer functions.
- Grabczewski K, Duch W (2002) Heterogeneous Forests of Decision Trees.
- Duch W, Grabczewski K (2002) Heterogeneous adaptive systems.
- Duch W (2000) Similarity based methods: a general framework for classification, approximation and association.

This topic is not yet popular but I consider it to be one of the most important directions in computational intelligence: finding the simplest models that can solve the problem. Consider the parity problem: a string of n bits B=000110...1 has an even or odd number of 1 bits. XOR is just parity with n=2 and it is already not easy for some models, but parity with 10 bits is terribly difficult for RBF or MLP neural networks, similarity models and decision trees. It may be solved in a trivial way by taking as the class c = cos(pi*Sum_i B_i), that is, a single-node network with all weights equal to one, if an appropriate function is found.
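
The single-node solution can be checked directly (a small Python sketch; the bit strings are my own examples): summing the bits with unit weights and passing the sum through cos(pi*sum) gives +1 for even parity and -1 for odd parity.

```python
import math

def parity_node(bits):
    """Single 'neuron': all weights equal 1, activation cos(pi * sum of inputs).
    Returns +1 for an even number of 1 bits, -1 for an odd number."""
    return round(math.cos(math.pi * sum(bits)))

assert parity_node([0, 0]) == 1          # XOR-like cases: n = 2
assert parity_node([1, 0]) == -1
assert parity_node([1, 1]) == 1
assert parity_node([1, 1, 1, 0, 1, 0, 1, 1, 0, 1]) == -1   # 7 ones in 10 bits
print("parity solved by a single cos node")
```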

- Q: Many filter methods employ some information distance between two distributions, such as MI for the KL distance. Are there other applications in which this kind of distance can be used? Is there any principle for selecting which kind of distance for which problem? Can you recommend some study materials for that?
A: I have many references to filters and distances between distributions here, in a paper still in draft form; it should be corrected soon. This is an interesting topic that has applications in many adaptive systems; people compare, for example, histograms of images, trying to find something that looks similar.

- Q: You showed an example of LDA using WEKA in lecture notes 22, but I cannot find the LDA classifier in WEKA. Does WEKA include LDA or not?
A: This has changed in the latest version. LDA is accessible now only by choosing from the Metalearning models "Classification via Regression" and then choosing from "Functions" Linear Regression instead of M5P; to get standard LDA switch off feature selection and elimination of colinear features, and set the ridge regression coefficient to 0. These parameters allow for selection of features, and the ridge regression coefficient C controls the margin by adding C||W||^{2}, like in SVMs.

- We have learned much about the CI fundamental ideas in your class, and would you please give us some perspectives about the CI developments in the future? I mean, what are the currently popular directions, what problems are promising to consider?

Through your class, we have a feeling that almost all existing methods share some basic things--as you answered me in a previous question:

For example, for the classification problem, Bayesian theory provides a theoretical framework for almost all the methods. And the other existing methods seem to do the following things: (1) they first assume a model, or an interpretation of the data, and (2) then convert these interpretations/models into some quantifiable parameters; after that, (3) they "learn" these parameters by some search method (such as iterative gradient descent, or backpropagation, etc.), and finally get the whole model and interpretation of the classification problem.

For example, (1) ML, Bayesian estimation, Parzen windows and k-NN assume their parameters have statistical meanings and in fact learn the parameters as PDFs. (2) LDA assumes a simple geometrical interpretation: linearly separable data; (3) decision trees represent their parameters as rules; (4) fuzzy neural networks also assume some parameters that can describe the "if-then" rules and learn them through the neural network's backpropagation ability... (5) a 3-layer neural network learns both a nonlinear feature extraction process and the linear classification, as mentioned in Duda & Hart's great book! Moreover, many similar things exist in these methods: for k-NN, infinite-NN may approximate the Bayesian classification directly, while 1-NN with only one reference vector for each class may obtain the LDA result, as you mentioned in class.

So dear Sir, what else can we do in this area? I think of some ideas: 1) how to evaluate the tradeoff between accuracy and generalization ability, I mean the tradeoff between "bias" and "variance" of the result - SVM seems to have done much on this, but I don't know if there are any theoretical results about this topic; 2) how to find associations not only between the inputs & outputs but also among the inputs themselves - as you mentioned in the answer to my last question... 3) is there any problem other than classification that has a strong connection with learning and is worthwhile to consider?

I cannot find more... and those I have thought of may also have been studied well, I just don't know about them. So please give me some directions... What should a freshman in this field focus on first?

A: This is an important question and it deserves more time. I will try to answer it in the last lecture, summarizing and outlining new directions.

- Q: Can we say the mutual information of two features is a measure of the independent rate of them?
Since MI(x,y) = D_{KL}( P(x,y) || P(x)P(y) ). Is there any other measure of the independent RATE like this?
A: I am not sure what you mean by the "independent rate", because this term is not commonly used. MI is a measure of mutual information, or of how much we can predict the distribution of values of one variable by knowing the second. There are many ways to measure this: by correlation coefficients, by distances between distributions, and in other ways; a review of some methods may be found at:

http://www.fizyka.umk.pl/publications/kmk/04-filters.html

- Q: Sorry, I haven't found a principle to calculate the number of consistent cases in a general way; I tried some methods and the most difficult thing is dealing with the "0 cases" (non-changing cases) in the counting. Can you give me some references about this problem? How do others calculate this? Is there some closed form for this calculation?
A: I have tried to do it once but have no notes around; it depends on the assumptions about the way your laws are interrelated. For example if A1=F(A2,A3), A2=F(A3,A4), ..., An-2=F(An-1,An), then only 4n+1 solutions are possible.

- 2) When we have "may-conflicting" laws, such as:
(1) A=B+C; A=B-C; or
(2) A=B; A=-B.
For the second case, the only consistent solution may be (0,0), but my way of calculating cannot deal with this (it will give (0,0), (+,+), (+,-), (-,+), (-,-) as results). How can we deal with this problem correctly?
A: It may be difficult to find general relations; this is possible only in special cases. Conflicting laws are unlikely to appear. From observations of situations (1) or (2) we will infer that there is no relation between A and (B,C). The density describing associative relations should reflect this.

It is an interesting direction for research that - as far as I know - has not yet been explored. There are some methods to extract associative rules from data (WEKA has some programs for that, but I have no time to talk about it), looking at correlations between different features, but this has not been extended to systematic reasoning in more complex situations.

- As you mentioned in the introduction of SVM, parameter C is determined as a tradeoff coefficient between Accuracy and Generalization Ability. Would you please give me some introduction or references about how Ghost Miner determines this parameter C optimally? For me to guess, is the criteria chosen as the combination (e.g. sum) of both the recall and test accuracy? Can parameter C be determined also in an online learning mode? I am interested in it since it seems to achieve a balance between the “bias” and “variance” of a classifier—can I say so?
A: Both C and the Gaussian dispersion parameter b (called "bias/slope" in the GM configuration for SVM) are determined by crossvalidation. A mesh of C and b parameters is created, crossvalidation is run for each (C,b) value, and maps are constructed showing the accuracy and the number of support vectors. This may be done in the off-line case only, but the C/b parameters should not be so critical; once they are roughly determined on some subset of data, SVM should work well unless the new data is very different. In the on-line case we would first gather some data, determine reasonable C/b parameters and then proceed on-line. There is no direct attempt to balance bias and variance of SVM (no formulas that measure them are optimized), but indeed, a high C allows for small bias at the expense of variance, and a small C sets a high bias (requiring large margins) that reduces the variance, so implicitly bias/variance is optimized.
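
The mesh search over (C,b) can be sketched generically. Note that cv_accuracy below is a hypothetical stand-in for running crossvalidation at a given parameter pair - it is not a real SVM, just a smooth made-up score used to show the scan-and-select logic:

```python
import math

def cv_accuracy(C, b):
    """Hypothetical stand-in: in practice this would run k-fold
    crossvalidation of an SVM with parameters (C, b). The made-up
    score below peaks near C = 10, b = 1."""
    return 1.0 - 0.1 * ((math.log10(C) - 1.0) ** 2 + math.log10(b) ** 2)

# Logarithmic meshes, as typically used for such scans.
C_mesh = [10.0 ** k for k in range(-2, 4)]     # 0.01 ... 1000
b_mesh = [10.0 ** k for k in range(-2, 3)]     # 0.01 ... 100

# Evaluate every (C, b) pair and keep the one with the best score.
best = max(((C, b) for C in C_mesh for b in b_mesh),
           key=lambda p: cv_accuracy(*p))
print("best (C, b):", best)                    # -> best (C, b): (10.0, 1.0)
```

In real use each mesh point costs a full crossvalidation run, which is why this is done off-line.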

- Q: As you introduced to us, Sir, the main power of SVM seems to be that it employs kernels or nonlinear transformations to map the original problem into a roughly linearly separable one. So what is the principle for choosing these nonlinear transformations? Can we always find transformations that make the problem linearly separable? It seems to me that the set of polynomials may be a good choice, because every discriminant function can be expanded into a Taylor series. Can we always choose polynomials as the transformation?
A: Gaussian kernels usually work better; they can also separate complex distributions, while polynomials oscillate between known data clusters and are therefore rather dangerous to use. This is a trial-and-error game, and although data that do not contain identical vectors with different labels may always be separated (for example by putting Gaussians with small dispersions at each data point), this is not what we want, because generalization will be poor. The "Cleveland heart" example showed this clearly; it is better to aim at similar accuracy on the training and test partitions in crossvalidation.

- Q: Following your recommendation, I am reading the great book of Duda and Hart now. Fortunately I have just come to the part introducing density estimation, coinciding with our class progress. I have a question:
As introduced in Duda & Hart's book, there are mainly two kinds of estimation methods: maximum likelihood and Bayesian estimation. According to the book, the main difference between the two is that the first method assumes the parameters are fixed numbers while the second assumes a distribution of the parameters (just like a "fuzzy" number?). But I still wonder why ML selects the likelihood criterion while Bayesian estimation selects the posterior probability. Intuitively both criteria seem good; are there other reasons for selecting the different criteria of the two methods?
A: Both are good and one can come up with more approaches ... in practice the differences are frequently small, so one should go with the approach that is simpler, computationally less demanding and converges better for a given task.

Conceptually the differences are significant. In maximum likelihood approaches we assume fixed values of the parameters that are estimated; in Bayesian approaches some prior distributions of the parameters are assumed, and these distributions are changed ("learned") as a result of new observations. There is a simple relation between maximum likelihood (ML) and maximum a posteriori (MAP) estimation of parameters: MAP reduces to ML if a uniform distribution is assumed for the estimated parameters theta.

- For the three kinds of classification methods introduced in our class--"Bayesian Classification", "Minimum distance classification using Mahalanobis distance" and "Linear-decision boundary classification", could I say that:

For pairwise classification problems:

1. "Bayesian classification" is the most general form, the other 2 methods could be viewed as the special cases of it, it can obtain any kinds of decision boundaries;

2. "Minimum Mahalanobis distance classification" can be viewed as a special case by assuming P(x|wi) are “normal” distributions, generally it will result in the quadratic decision boundaries; here the minimum distance from x to a prototype of wi determines x’s class label;

3. “Linear decision boundary classification” is a special case of the “minimum Mahalanobis distance classification”, since the former assumes that the two classes have the same covariances; here the linear decision boundary may be the line derived by Fisher discriminant analysis.

Dear Sir, can you give me some general ideas about all the existing classification methods? Such as, what is the hidden criterion in all of them? If there are too many things, can you recommend some books or papers on these kinds of topics?

A: Yes, I am trying to show you how all types of classifiers may be derived from the Bayesian approach as special cases, and that sometimes it may be worthwhile to optimize parameters of simplified classifiers directly, because in the Bayesian formulation the number of parameters may be too large and thus the estimations would not be reliable.

There are very many approaches to adaptive systems that may be used for approximation, classification, discovery of rules, correlations or associations. The books I mentioned in the first lecture - Duda, Hart and Stork, and Webb - contain excellent theoretical introductions to the problem, but no book is complete, and because there are so many good methods it is probably not worthwhile to spend much time on their improvement; just select what you need and apply it to interesting problems. Solving classification problems may be done in 1000 different ways: with 10 pre-processing methods and 10 classification methods, each with 10 parameter settings, there are 10x10x10 combinations! To make progress we should know what types of methods exist, what type of information they provide and when they will be useful, and learn how to analyze the data, as some of you have done quite well in assignment 1.

A good book with little theory comparing different classification methods may be found at this link.

- Q: I would like to study in detail C4.5 algorithm to find some ideas for my project. Could you introduce me some useful links to study it (such as any source codes and tutorials)? Is it easy to build it by myself or I should use the source code from the others? I already browsed the internet but there were not many information about it.
A: Please check my links first: http://www.ntu.edu.sg/home/aswduch/software.html. A link to the C4.5 tutorial is there.

- Note: some equations did not print correctly even after asking the printer to make black/white pages. I have changed them now to yellow and they should print fine, so please let me know if there is still a problem.

- Q: There are two kinds of data preprocessing: normalization and standardization. Under what conditions should I use normalization, and when standardization?
A: Normalization will simply shift and rescale all data to the [0,1] interval; as a result, if there are a few very large or very small values the variance becomes very small, with most data concentrated around the mean and a few samples near 0 or 1. Standardization will make the mean zero and keep most of the data within a single standard deviation from the mean, with the unusual values far from zero. In methods that use similarities or distances - for example the nearest neighbor methods - standardization usually works better, but in some methods - for example most neural networks - inputs should be rescaled to [0,1] or even [0.1,0.9], so normalization should be used.
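To make the difference concrete, here is a small Python sketch (the numbers are made up, including one artificial outlier) contrasting the two rescalings:

```python
import numpy as np

# Hypothetical feature with one unusually large value (an outlier).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Normalization: shift and rescale to the [0, 1] interval.
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean and unit standard deviation.
x_std = (x - x.mean()) / x.std()

# The outlier squeezes most normalized values close to 0,
# while standardization keeps zero mean and unit variance.
print(x_norm)                    # first four values are all below 0.04
print(x_std.mean(), x_std.var())
```

Note how a single outlier concentrates almost all normalized values near 0, exactly the effect described above.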

In general optimal scaling of features is neither normalization nor standardization, but should be a part of data model that includes data transformation and classification. For example, some features may be removed (scaled by zero factor) in feature selection. Some methods - decision trees or linear discrimination methods, for example - do not require any data transformations, while other methods are very sensitive to such transformations.

- Is Fig. 2.3 in Duda and Hart correct? It is. If θ_a = (λ_12 - λ_22)P(ω_2) / [(λ_21 - λ_11)P(ω_1)], then for λ_12 = λ_21 increasing λ_12 will make the threshold θ_a larger, and this will make the R_1 area smaller.

- The first question is about the decision function derived from Bayesian Theory by assuming the Gaussian distribution of the conditional probability. As you told us, Sir, it seems that the assumption of the “identical covariance” plays a key role in simplifying the quadratic decision function into the linear one. So my question is: is there any existing linear or nonlinear transformation that can make the covariance of two or several distributions the same? Or is it possible to do this?—I mean will this kind of transformation lose too much information (the covariance information) so that it will finally become trivial? Is it true that we can always do some preprocessing (feature selection/extraction) so that we can use only the linear decision function?
A: If several Gaussians with different covariance matrices form your data, there will be no way to linearly transform this data to get identical covariance matrices. Finding a non-linear transformation that could achieve this is non-trivial and would require first identifying these clusters, estimating their parameters and then making non-linear transformations - a rather hopeless case. But, as we shall see soon when talking about SVMs, adding non-linear combinations of features to the inputs may create transformations extending the feature space in such a way that linear discrimination will be sufficient.
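A minimal Python sketch of the last point, using made-up circular data: no straight line separates the classes in the original two features, but adding the quadratic combination x1^2 + x2^2 as an extra input makes a linear (planar) decision boundary sufficient in the extended space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: class 0 inside a circle of radius 1, class 1 outside.
n = 200
X = rng.uniform(-2, 2, size=(n, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1.0).astype(int)

# Extend the feature space with the quadratic combination r2 = x1^2 + x2^2.
# The plane r2 = 1 is now a linear decision boundary separating the classes.
X_ext = np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])
y_pred = (X_ext[:, 2] > 1.0).astype(int)

print((y_pred == y).mean())  # 1.0 - perfect separation in the extended space
```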

- My second question is also about the decision function of the Bayesian classification. As you told us in the class, Sir, by assuming the Gaussian distribution, the general form of decision function is quadratic. My question is: is there any guarantee of how many orders of the function are enough for a general assumption of the distribution? I mean, do we know some upper bound (and even lower bound) of the orders of the function that we need to do the Bayesian classification?
A: Quadratic decision functions come from the fact that we are taking the log of the ratio of two such distributions; decision functions in d-dimensional space are therefore described by quadratic forms. If more general distributions are taken, or even if the probability is modeled as a mixture (a sum) of several Gaussians, the decision borders will not be polynomial at all.

- The third question is about the Naive Bayesian theory. You told us that the assumption of NB is the “INDEPENDENCY” of the features. And in practice, we use NB by naively assuming this independency or do some transformation to UN-CORRELATE the features. And you also told us before that INDEPENDENCY means for any functions E{f1(yi)f2(yj)}=E{f1(yi)}E{f2(yj)}; while UNCORRELATION means E{yiyj}=E{yi}E{yj}. So can we say UNCORRELATION is just a special case of INDEPENDENCY and the latter implies the former? If yes, how can we guarantee the independency only by un-correlating the features?
A: Yes, independence implies uncorrelated features in the sense of a diagonal covariance matrix; we cannot guarantee independence just by uncorrelating features. That is why PCA, creating uncorrelated features by diagonalizing the covariance matrix (which is based on second-order statistics), is not the same as ICA, which seeks independent features, for example by using higher-order statistics, like kurtosis. Using features from ICA we should get much better NB results if the features were initially correlated.
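A quick numerical illustration (synthetic data): a feature and its square are maximally dependent, yet nearly uncorrelated, so a diagonal covariance matrix does not imply independence:

```python
import numpy as np

rng = np.random.default_rng(1)

# x is symmetric around zero; y = x**2 is completely determined by x
# (maximal dependence), yet their covariance is close to zero because
# E[x * x**2] = E[x**3] = 0 for a symmetric distribution.
x = rng.normal(size=100_000)
y = x**2

r = np.corrcoef(x, y)[0, 1]
print(r)  # close to zero despite y being a deterministic function of x
```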

- Q: May I know why the risk of an action (e.g. assigning sample X to class Wk) is defined as the average cost or loss of performing all possible actions (e.g. the average cost of assigning X to any class)? In my opinion, shouldn't the risk of doing something be defined as the average gain you sacrifice by doing it?
A: We are trying to estimate the risk of some decision procedure, such as the risk of using a Bayesian classifier (including the way data is preprocessed and probabilities are estimated). If there is no loss, accuracy is perfect and there is no risk. The risk of an individual decision is therefore connected with the loss or cost we shall pay if we make an error of a particular type. Accuracy = 1 - error; losing accuracy means making errors. Let's say that making the correct decision you could gain 1, but your system gives you only accuracy A; then 1 - A is the error, and multiplied by the cost (or loss) of this particular decision it will give you the risk, so it does not matter if we talk about gain, accuracy or error.

We could calculate the risk for an individual decision: if the probability of a correct decision for vectors similar to X is close to 1, then the risk is small. Although such local estimation is useful, if we are trying to compare different decision-making procedures we have to estimate the total risk by averaging over all possible actions.
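The averaging over actions can be sketched as follows, for a hypothetical two-class loss matrix; note that the minimum-risk action need not be the class with the highest posterior probability:

```python
import numpy as np

# Hypothetical loss matrix lam[i, j]: cost of choosing class i when the
# true class is j (zero loss on the diagonal - correct decisions).
lam = np.array([[0.0, 10.0],   # choosing class 0 when truth is 1 costs 10
                [1.0,  0.0]])  # choosing class 1 when truth is 0 costs 1

# Posterior probabilities P(w_j|x) for one sample x.
posterior = np.array([0.8, 0.2])

# Conditional risk of each action: R(a_i|x) = sum_j lam[i, j] * P(w_j|x).
risk = lam @ posterior
print(risk)              # risk of choosing class 0 is 2.0, of class 1 is 0.8
best = int(np.argmin(risk))
print(best)              # despite P(w_0|x) = 0.8, the cheaper action is class 1
```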

- Q: My question is about quantization errors used in GCS algorithm.

By comparing SOM, GCS and K-means, I find there are some interesting characteristics that SOM and GCS shared with other unsupervised clustering algorithms. The following are some of my thoughts and they will introduce the questions.

(1) It seems that the goals of SOM, GCS and other clustering algorithms are the same: all of them try to use some representative prototypes to represent the original data points. Since the number of these prototypes is usually less than the number of original points, a compressed representation of the data set is usually obtained.

A: This is true, but in addition the goal is to visualize data; clustering algorithms reduce the number of data vectors but do not show the relations in visual form.

(2) And the optimization indices of these compressed representations seem the same, too. All of these methods try to reduce the inner scatter distances or the quantization errors: the total dissimilarity between each original point and its corresponding prototype.

A: Of course, although many variants of such indices have been invented in clustering.

(3) In K-means clustering, it seems the algorithm uses a fixed number (k clusters) of cluster centers as the prototypes, and the following iterative optimization step tries to reduce the total inner dissimilarities in each cluster. SOM seems better than k-means. Although it also uses a fixed number of prototypes to represent the original data, it is more flexible, since it will allocate more prototypes (the nodes) in the more complex (more diverse, as you showed us in the Italian Olive Example, Sir) regions. SOM obtains this flexibility by connecting each node with its neighbors, so the more diverse (complex) region will update its representative node more frequently, and thus pull more points into this region to get a more detailed representation. Moreover, SOM is more powerful than k-means since it can also reduce the dimensionality at the same time. On the other hand, GCS seems more powerful and flexible than SOM. It not only allocates the number of prototypes dynamically according to the diversity/complexity of the region, but also creates more prototypes dynamically. So the number of its prototypes is not fixed, which is more flexible than both k-means and SOM.

A: As a clustering method the K-means algorithm is not bad; SOM may not be so great because of the constraint to form a topographical representation. The LVQ (Learning Vector Quantization) method (also introduced by Kohonen) works like SOM but does not try to achieve a topographical representation, and therefore it is more accurate.

(4) So my question comes here :-). (Sorry for such a long introduction.) In the dynamics of the GCS algorithm, Sir, you told us there are two alternative criteria for finding the “worst” node - the node that most probably needs new neighbors added to help it represent a diverse/complex region. The first criterion is the standard quantization error as you mentioned in the slides; it seems more like the classical inner dissimilarity standard. By this criterion, the goal of GCS is the same as that of SOM and other clustering algorithms, as mentioned above. The other criterion is the (normalized) frequency counter used in the GCS algorithm. Sir, my question is, the second criterion seems very different from the first one, since a large inner dissimilarity (first criterion) need not be implied by a high updating frequency of the node (second criterion). A high updating frequency may also happen when the density of the points is very high while their diversity (such as covariance) is not very large. So, Sir, why can we use these two criteria interchangeably? Do we choose the second criterion just because of its computational simplicity? Is there any other interesting thing behind the idea we discussed here?

A: The two criteria may give slightly different results. The original algorithm used the frequency counter but soon other versions were tried. GCS algorithm guarantees that either quantization errors are similar for each node, or the frequency of winning is roughly the same. The total quantization error is a sum over all vectors in the Voronoi cell of the node. High quantization error is achieved when the cell is large and has a few vectors in it; then it is worth splitting using the quantization error criterion. By the frequency counter criterion the prototype in the cell is selected rarely, so it will not be split. The maps for strongly non-uniform density of data points will be rather different, but if the points have similar density in most areas differences should not be large.
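A sketch of the first criterion, computing the total quantization error of each node over its Voronoi cell (toy 1-D data, illustrative numbers only):

```python
import numpy as np

def node_quantization_errors(X, W):
    """Total quantization error per node: for each data vector find the
    winner (nearest prototype) and accumulate the squared distance in
    that node's Voronoi cell."""
    d2 = ((X[:, None, :] - W[None, :, :])**2).sum(axis=2)  # (n, k) squared distances
    winners = d2.argmin(axis=1)
    errors = np.zeros(len(W))
    counts = np.zeros(len(W), dtype=int)
    for i, w in enumerate(winners):
        errors[w] += d2[i, w]
        counts[w] += 1
    return errors, counts

# Toy example: prototypes at 0 and 10, data clustered around them.
X = np.array([[0.1], [-0.1], [9.0], [11.0], [13.0]])
W = np.array([[0.0], [10.0]])
err, cnt = node_quantization_errors(X, W)
print(err, cnt)  # node 1 has the larger error -> candidate for splitting in GCS
```

Under the frequency-counter criterion one would instead compare `cnt` (how often each node wins), which can pick a different node when the density is high but the spread is small, as the question points out.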

Many variants of GCS exist, including a supervised version (which happens to be a constructive RBF network). It is a good method of visualization and classification, but unfortunately there are no good programs to experiment with, except for the demo I've shown, and the GCSVIS package (see my software links).

- Q: My first question is about the form of the neighborhood function h(r, r_c, t) = h_0(t) exp(-||r - r_c||^2 / σ_c^2(t)). It seems that if we replace this definition of the neighborhood with the simplest one (the Euclidean distance ||r - r_c||), we will get a direct map from the original space to the space we are interested in. So, Sir, can we say this complex neighborhood function maps the neighborhood relations in the original space (||r - r_c||) to the neighborhood relations in another space (h_0(t) exp(-||r - r_c||^2 / σ_c^2(t)))? So SOM is a kind of nonlinear transformation of the neighborhood relations in different spaces? And is it the exp function that implements the nonlinear part of this transformation? Why do we use such a special form of the neighborhood function (or nonlinear transformation of the neighborhood) rather than others (such as ||r - r_c||^3 instead)? Can this form be related to some Gaussian distribution assumption?

A: It is much simpler than that! There is no assumption about Gaussianity of the d-dimensional data in the feature space. We simply need a function that falls gradually with distance, otherwise all nodes in the mesh will be pulled in the direction of the presented data vector X, and we want only those that are close to the winner to be affected. A Gaussian function simply falls to zero with distance, but a function h_0(t)/||r - r_c||^2 or anything similar would also be good; functions that grow with the distance are bad, because the influence on the weight vectors of neurons that are far from the winner grows in an unbounded way. We could also invent a simple rule saying that only the winner and its nearest neighbors are pulled in the direction of X. In practice it does not matter much: if the speed at which h_0(t) decreases is sufficiently slow, the map has a chance to organize itself; otherwise it may not reflect the data distribution.

- Q: The second question is about the dynamics of SOM, that is, its updating rule. As indicated in your slides, Sir, SOM seems a fully unsupervised algorithm in its updates: W^(i)(t+1) = W^(i)(t) + h(r_i, r_c, t)[X(t) - W^(i)(t)]. And you also told us that the main advantage of DCA over PCA is that DCA uses the supervised information of class labels and thus can find projections that are more interesting. So, Sir, can we also convert the SOM algorithm into a supervised learning method to improve it (will it improve?)? Maybe one of the possible ways is adding a new factor to the neighborhood function h(r_i, r_c, t). This factor would specify whether or not W^(i)(t) and X(t) have the same class label - such as |Class(W^(i)(t)) - Class(X(t))| for two-class problems. Then the winner node will update itself according to X(t) only when they have the same class label; otherwise the winner will give up its updating priority to the runner-up... until we find a sub-winner node that has the same class as X(t), and then we will update it. This method sounds very naïve anyway; can you give me some suggestions about this idea, or is it worth considering?

A: It is an interesting idea and I am glad to hear that you are thinking in this direction. The literature on SOM is now really extensive, with more than 5000 papers published up to 2000 (the list is at http://www.cis.hut.fi/research/refs/). Many variants have been considered, and supervised SOM among them. In the Kohonen book (1995 version) and in the SOM Toolbox there is a very simple approach to supervised SOM, based on adding class information to the data. After training this class information has to be removed, otherwise new vectors of an unknown class would have a different number of components.

The idea of updating the winner only if it is from the correct class is used in some algorithms in different variants. Supervised SOM is not very popular, although sometimes it could improve the quality of maps. Most people think that it should be used to visualize distance relations in the feature space, and adding class labels changes the maps in a way that is hard to understand. DCA (or FDA, as it is usually called) simply gives another angle of view on the data, but SOM creates non-linear mappings that are not so easy to understand if manipulation of the original data is allowed. We already have a problem with data preprocessing, such as normalization or standardization, that changes the SOM maps.
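For reference, a minimal sketch of the standard (unsupervised) SOM update rule discussed above, with made-up sizes and parameters:

```python
import numpy as np

def som_update(W, grid, x, winner, h0, sigma):
    """One SOM update step: every node i moves toward x with strength given
    by the Gaussian neighborhood of the winner on the 2-D grid.
    W: (k, d) weight vectors; grid: (k, 2) node positions on the mesh."""
    d2 = ((grid - grid[winner])**2).sum(axis=1)   # grid distances to the winner
    h = h0 * np.exp(-d2 / sigma**2)               # neighborhood function
    return W + h[:, None] * (x - W)

# Tiny 2x2 map in a 3-D feature space (illustrative numbers only).
grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
W = np.zeros((4, 3))
x = np.array([1.0, 1.0, 1.0])
winner = int(((W - x)**2).sum(axis=1).argmin())
W_new = som_update(W, grid, x, winner, h0=0.5, sigma=1.0)
print(W_new[winner])  # the winner moves halfway toward x; far nodes move less
```

The supervised variant from the question would simply gate this update on the class labels of the winner and of X(t).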

- Can you put a voice recording or a phonetic transcription of how to pronounce your first name on your webpage? I didn't quite catch your pronunciation in the first lesson.
My first name - Wlodzislaw - is unfortunately quite difficult to pronounce: V-wo-gee-s-wa-v, where V is like in V-incent, wo like in wo-man, then gee, s, and wa like in wa-nt. My last name - Duch - is easy: like Bach, but with D.

- My first question is about the "distance invariance under linear transformation". Sir, as you told us in the class, there may be two main ways to make the distance invariant under linear transformations: 1) standardize original variables into reduced variables (zero mean and unit deviation), or 2) use the Mahalanobis distance instead.

And it seems to me that the first method just discards the VARIANCE information (it unifies all variances as unit ones) to get the DISTANCE INVARIANCE, and the second method, on the other hand, tries to keep the information about VARIANCE in the distance metric itself (the Mahalanobis metric includes covariance information). So both methods guarantee DISTANCE INVARIANCE by doing something to the VARIANCE (or the covariance matrix). My question is: Is there some relation between the VARIANCE (or covariance) and DISTANCE INVARIANCE? Is their relation just as simple as their names indicate? :-) Why must we discard the variance information (rather than something else, such as the mean value), or keep it together with the linear transformation, so that we can get distance invariance? Does variance mean something special?

A: There is much more to these transformations than I was able to tell you. Variance of a variable X tells us only how much we can expect it to deviate from the mean E(X). If X has a Gaussian distribution then we can expect about 2/3 of all data in E(X) ± STD(X), and 95% in E(X) ± 2 STD(X), where STD is the square root of the variance. Standardization is needed in various situations, for example if the data is transformed by changing the units (so X is measured not in meters but in centimeters). If we have two variables, and one has large values because it is measured in small units, and the other small values because the units are large, it may be hard to compare their influence on the distances, because the first will dominate and the second will change little. If any clustering of the data is done and similarity is measured using distances, only the large values will determine how data is grouped, but the second variable may also be useful to distinguish between different groups of data. Therefore it may be better to standardize the data first, making it invariant to scaling transformations.

There may be other linear rescalings of features, for example normalization, shifting all data to the [0,1] interval, so Z = (X - min(X))/(max(X) - min(X)). Unfortunately the results of visualization, clustering and classification depend on how we transform the data, and there is no best way to define the scaling factors. The goal is to balance the contributions of different features to the distance evaluation. Normalization and standardization are not ideal methods; they may unfortunately decrease the influence of some useful features. This is a vicious circle: without knowing what we want to see in the visualization we cannot set our priorities (scaling factors) correctly; to see data structures we need to know how to scale the features, but to scale the features properly we need to know what is interesting in the data ...

In addition, if different features are correlated then rescaling of features may not be sufficient. Suppose that Y = 2X; if we take both X and Y into account we are going to take the same information twice, and other independent features will have a relatively lower contribution. Therefore it is good to use Mahalanobis distances, which make the distances invariant against rotations, but not rescalings. We may perform the transformation to the principal component space and then standardize the transformed features. Again, it may not always be a good thing to do: we may want to reduce the dimensionality of the space, but we may rather want to reject some features first. Note that for a group of linearly dependent features only one will have a non-zero eigenvalue. Rejecting features corresponding to small eigenvalues (that is, small PCA variance of the transformed features) may also be a bad idea, because they may have high discriminative value; I will talk about discriminant feature analysis next time.

If the data cluster has approximately a multivariate Gaussian shape, the Mahalanobis distance provides a natural metric to measure the distance from the mean of the cluster to the data: the unit in each direction is the variance (for uncorrelated features C^(-1) is diagonal, with the inverses of the variances), and the measurements are along the principal axes.

In signal processing a common practice is to transform the data to principal components and then standardize them; this is called "the whitening transformation". After such a transformation we have full scaling and rotation distance invariance, that is, if the data has been linearly transformed it will be brought to a standard form after whitening.
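A sketch of the whitening transformation (synthetic correlated data): after rotating to principal components and standardizing them, the covariance matrix becomes the identity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2-D data (a hypothetical linear mixture of independent sources).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.normal(size=(5000, 2)) @ A.T

# Whitening: rotate to principal components, then standardize each of them.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)
evals, evecs = np.linalg.eigh(C)             # diagonalize the covariance matrix
X_white = (Xc @ evecs) / np.sqrt(evals)      # unit variance along each axis

# After whitening the covariance matrix is (numerically) the identity.
print(np.round(np.cov(X_white, rowvar=False), 2))
```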

- Second question is about the difference between covariance and the correlation coefficient. Is the correlation coefficient just a normalization of covariance? And what is the meaning of these two things (you told us that a -1 or 1 value of correlation implies a pure linear relation)? I mean, if two features are highly correlated, what can we say about them? Does correlation imply dependence? What is the difference between correlation and dependence?
A: The correlation coefficient is simply standardized covariance; if we standardize the data first and then compute the covariance matrix, then all diagonal elements will be 1 and the off-diagonal elements will be equal to the correlation coefficients. If features are highly positively correlated, then if one grows we should expect that the other will also grow; correlation equal to 1 implies linear dependence (please check this).
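This is easy to verify numerically (synthetic data): the covariance matrix of standardized data equals the correlation matrix of the original data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(size=1000)      # two correlated features

data = np.column_stack([x, y])
std_data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Covariance of standardized data = correlation matrix of the original data:
# ones on the diagonal, correlation coefficients off the diagonal.
C = np.cov(std_data, rowvar=False)
R = np.corrcoef(data, rowvar=False)
print(np.allclose(C, R))  # True
```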

Surprisingly, independence of features is not the same as zero correlation! I will talk about it next week discussing Independent Component Analysis (ICA).

- My last question is about the diagonalization of a matrix, which you talked about in introduction of PCA. So, Sir, is the diagonalization of a matrix unique or not? If the result is unique, maybe it is more interesting. Because in my opinion, if the diagonalizing result of a covariance matrix is unique, it implies that we can have a unique measure of information in the data structure? I have this idea because I see you use fraction of variance as an estimation of the rough description of data set, on the slide about PCA. Can you tell me more about the idea of unique diagonalization (if it is) of a matrix? Or maybe you can recommend to me some useful books about this interesting topic? I mean, what it intuitively implies that the diagonalization of covariance is unique.
A: This is just the point: because diagonalization of the covariance matrix is unique (it is well behaved, positive definite), it gives us uniquely defined eigenvectors. The eigenvector that corresponds to the largest eigenvalue transforms our data in such a way that the variance in this direction is largest.
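A small sketch (synthetic data): diagonalizing the covariance matrix gives the eigenvalues, and the fraction of variance mentioned on the PCA slide is each eigenvalue divided by their sum:

```python
import numpy as np

rng = np.random.default_rng(4)

# Data with most variance along one direction (hypothetical example).
X = rng.normal(size=(2000, 3)) * np.array([5.0, 1.0, 0.5])

C = np.cov(X - X.mean(axis=0), rowvar=False)
evals, evecs = np.linalg.eigh(C)     # unique decomposition for distinct eigenvalues
evals = evals[::-1]                  # sort largest first

# Fraction of the total variance explained by each principal component.
frac = evals / evals.sum()
print(np.round(frac, 3))  # the first component carries most of the variance
```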

Please read about matrix diagonalization in Chapter 11 of the Numerical Recipes book:

http://www.library.cornell.edu/nr/cbookcpdf.html

- Q: In parallel representation of high-dimensional data, the order of dimensions
seems arbitrary, but does it affect the cognitive experience of a human observer?
Do some particular sequences of parallel dimensions produce perceivably
simpler patterns (measured by number of intersections of line segments etc.)?
A: Yes, it does, but unfortunately the best ordering depends on the data; in some cases it may be worthwhile to permute the order of the parallel axes to avoid line crossings and see the structure better. You can find some examples on the pages linked to my slides, including some other techniques, like clustering to make the whole bundle of lines fuzzy. The parallel coordinates technique is used frequently in bioinformatics, with each coordinate representing a gene activity level, where the order of genes is arbitrary.

Q: On slide #21, the picture of a 3d sphere does not look symmetrical, which is contrary to my intuition. Is it because of the artefacts introduced by image scaling?

A: This may be one factor, but the symmetry is also affected by the choice of the points; it is not easy to select points from a sphere preserving perfect symmetry.

- Q: Is bioinformatics = techniques from other fields + applications to genomic datasets? I ask since I find that many people doing research in bioinformatics try to implement current techniques and find applications for them in biological datasets. And what is a PhD student in bioinformatics expected to fulfil?
A: This is indeed what many people do. Development of computational tools is important and it will for some time be sufficient to make your PhD and write a few papers in technical journals. However, to really make progress in bioinformatics, knowledge of molecular biology and of the biological functions of genes and proteins is most important, and that goes way beyond technical knowledge. If you try to read papers in Nature and Science you will find that the technical part of bioinformatics is usually hidden, and to make important discoveries one needs to know the medical and biological importance of various processes.

- The problem is about how CI differs from AI.
Although you have referred to it, I may need more information about it. For instance, should neural networks belong to CI or AI?
A: AI = higher-level cognitive functions: problem solving, thinking, reasoning, planning; therefore symbolic methods, expert systems, knowledge representation, natural language processing. Search-based algorithms are used to find the best solutions by looking at combinations of different moves, like in board games (chess, checkers).

CI = lower-level cognitive functions: signal analysis, object recognition, discovery of interesting patterns in data. Neural networks, statistical methods and pattern recognition methods are all part of CI. Optimization methods are used to adapt the parameters of these systems to account for data structures.

- More questions on this topic:

Q: There is a textbook http://www.cs.ubc.ca/spider/poole/ci.html on computational intelligence, which also dwells on constraint search, agents etc. that are topics in AI. According to the authors of this book, the term Computational Intelligence is a modern term for Artificial Intelligence, which induces more confusion between the two.

A: I know that it is confusing! As I told you, there are no universally accepted definitions even of the term Artificial Intelligence, although it started in 1956. Poole et al. in their book decided to steal the term and use it for a typical AI book; this adds to the confusion. Instead of going on for an hour on the topic, I gave you my definitions, which seem logical to me but are by no means used by everybody.

Q: For the term soft computing, we can generally refer to Zadeh [1] who coined the term. Some web sites claimed that Soft Computing is sometimes referred to as Computational Intelligence http://library.gsfc.nasa.gov/SubjectGuides/SoftComp.htm. Some authors claimed that Bezdek [2] coined the term Computational Intelligence http://ci.uofl.edu/zurada/ciil_book.html.

[1] L. A. Zadeh, 1994, "Fuzzy Logic, Neural Networks and Soft Computing" Communications of the ACM, 37(3), 77-84.

[2] J. C. Bezdek, 1994, "What is computational intelligence?," in Computational Intelligence Imitating Life, 1-12, New York: IEEE Press.

A: My definition agrees with Bezdek, leaving knowledge-based systems for AI. Zadeh initiated joint conferences on neural, fuzzy and evolutionary topics, calling them soft computing, but this type of division, combining several branches of science and excluding others that solve exactly the same problems (for example statistical methods, decision trees and optimization theory), is not good, because science is defined by the problems, not by the tools that someone arbitrarily chooses. Soft Computing still appears in the names of conferences/books, but CI is becoming more popular, especially after the IEEE changed the name NNS (Neural Network Society) to CIS (Computational Intelligence Society). No one has seriously proposed the name Soft Computing Society.

Q: Take for instance two topics with vague boundaries, Machine Learning and Data Mining; according to some authors, e.g. Jiawei Han and Micheline Kamber [3], the clear distinction between these two topics is that the latter deals with the discovery of patterns in large databases.

[3] J. Han and M. Kamber, 2001, “Data Mining: Concepts and Techniques”, San Francisco: Morgan Kaufmann.

A: There is also no clear distinction here; ML methods may be applied to Data Mining (DM) problems, and not all DM is about huge data. Traditionally ML people were solving simple problems requiring symbolic representations and generated many interesting algorithms. DM people came from databases and on-line analytical tools, stressing the importance of methods that work on large databases. I would say that data mining is a part of computational intelligence specializing in the analysis of large datasets, and incorporating many steps related to cleaning and acquiring the data that CI does not focus on.

Q: Now back to the question regarding CI and AI, here are more additional questions:

1. Is there someone who coined the term Computational Intelligence?

A: The term had been floating around at the beginning of the 1990s; book [2] was not the first to use it. Look at the CI bibliography:

http://www.computelligence.org/download/cibiblio.pdf

The World Congress on CI was held in 1994, bringing NN, FL, and EA people together; Robert Marks discussed it in 1993 in a paper:

http://www.ecs.baylor.edu/faculty/marks/REPRINTS/1993_IntelligenceComputationalVersus.pdf

Probably the first paper discussing CI is by J.C. Bezdek, On the relationship between neural networks, pattern recognition and intelligence. Int. J. Approximate Reasoning 6 (1992) 85-107.

I have myself encouraged IEEE officials to form a CI Society instead of the IEEE Neural Network Society, but this change of name was finalized only this year, when the IEEE CIS was formed.

2. What is your opinion on the authors of http://www.cs.ubc.ca/spider/poole/ci.html who claimed that Computational Intelligence is a modern term for Artificial Intelligence?

A: They wanted to sell an AI textbook and called it CI ... just adding to the confusion.

3. If CI is distinct from AI, is there literature which one can quote that indicated the distinction as mentioned in Q/A No. 2?

A: Yes, for example the Bezdek paper. I have also made some remarks on that in these two recent reviews:

http://ci.uofl.edu/zurada/papers/ZuradaSetionoDuchPIEEE2004.pdf

http://www.fizyka.umk.pl/publications/kmk/04-Challenges.pdf

Unfortunately terminological discussions go on forever. It is true that AI and CI people do not meet at conferences and their societies do not talk to each other; the problems they solve are quite different. I hope that my proposal to distinguish low- and high-level cognition will some day be accepted.

- What are the differences and relations between Pattern Recognition and Machine Learning? Just curious to know.
A: ML developed within AI to automatically find the rules that a reasoning system could use. These were mostly inductive methods based on covering examples with logical rules of some form, such as "version spaces". The goal was to understand the data in the way humans do, using logic and rules that could be useful for reasoning in expert systems. Later ML developed universal methods that could also be used in pattern recognition, such as the AQ family of algorithms, COBWEB conceptual clustering etc. Some people now refer to all adaptive methods as "machine learning".

Pattern recognition developed in engineering from multivariate statistics, probability theory, signal analysis, and the analysis of real, continuous-valued data. It has never tried to derive logical rules, just to model data, predict, cluster and classify.

ML and PR methods have some overlap now, but the two communities are still rather distinct: ML people are connected to the AI community and its conferences, while PR people belong to engineering.

- Q: How to create membership functions in fuzzy modeling? How do fuzzy controller work?
A: This is explained very well in Kecman's book that I have recommended. Unfortunately I do not plan to cover fuzzy modeling in this course; there are separate courses on this subject. In essence it is very easy: instead of using intervals to turn the controller on or off when the temperature is in or out of the range [a,b], a membership function (for example, of trapezoidal shape) is used to turn it on and off gradually.
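A minimal sketch of this idea in Python may help; the temperature ranges below are hypothetical, chosen just for illustration. A crisp controller switches abruptly at the interval edges, while a trapezoidal membership function rises and falls gradually:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function: 0 below a, rising to 1 on [b, c],
    falling back to 0 at d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# A crisp controller is either fully on or fully off at the interval edges:
crisp = lambda t: 1.0 if 18 <= t <= 24 else 0.0

# The fuzzy version turns on and off gradually near the edges:
for t in (16, 18, 21, 24, 26):
    print(t, crisp(t), trapezoid(t, 17, 19, 23, 25))
```

At t = 21 both controllers give 1.0, but at t = 18 the crisp one is already fully on while the fuzzy membership is only 0.5, giving smooth control near the boundary.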

- Q1: Besides [5], are there any papers or books that provide a more comprehensive survey on the algorithms to define and optimize fuzzy membership functions?
Q2. What kind of uncertainty exists, besides input uncertainty, in a given training data in the context of pattern recognition problem?

Q3. If uncertainty besides input uncertainty exists, how does a fuzzy membership function capture and represent this uncertainty? In other words, how to create a membership function that represents this uncertainty?

Q4. Does the handling of uncertainty improve classification accuracy? You mentioned in [6] that the accuracy of crisp rules extracted for a number of real-world datasets proved to be higher than that of any other classification method, including fuzzy rule-based systems. But you also mentioned in [1] that the introduction of fuzziness is not only desirable, but in most cases also unavoidable. Thus, in your opinion, does the introduction of fuzziness improve accuracy in pattern recognition? If not, what is the advantage of introducing fuzziness in pattern recognition?

Q5. In your opinion, which can handle uncertainty in Pattern Recognition problems better: Probability Theory or Fuzzy Theory?

[1] Duch, W. (2004). Uncertainty of data, fuzzy membership functions, and multi-layer perceptrons. Fuzzy Systems, IEEE Transactions on, to appear.

[2] Klir, G. J., & Wierman, M. J. (1998). Uncertainty-Based Information. Heidelberg: Physica-Verlag.

[3] Klir, G. J., & Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic Theory and Applications. Upper Saddle River, NJ: Prentice Hall.

[4] Sarkar, M. (2002). Rough-fuzzy functions in classification. Fuzzy Sets and Systems, 132(3), 353-369.

[5] Medasani, S., Kim, J., & Krishnapuram, R. (1998). An overview of membership function generation techniques for pattern recognition. International Journal of Approximate Reasoning, 19(3-4), 391-417.

[6] Duch, W., Adamczak, R., & Grabczewski, K. (2001). A new methodology of extraction, optimization and application of crisp and fuzzy logical rules. Neural Networks, IEEE Transactions on, 12(2), 277-306.

A: This is indeed enough questions for a separate set of lectures or for a book! I am not such an expert on fuzzy systems, and since the literature on uncertainty is vast there may be books that I do not know about.

A1: I do not know of any good book that really reviews all methods used for the determination of MFs; there are of course many papers on different approaches that one may find by typing "fuzzy membership functions" into Google. You may read the Fuzzy Logic FAQ, although it is old now:

http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/ai-repository/ai/html/faqs/ai/fuzzy/part1/faq.html

You may also post questions to fuzzy logic discussion groups:

http://groups.google.com/groups?group=comp.ai.fuzzy

A2: I have not seen any papers summarizing it, but at least these five types of uncertainty exist:

Stochastic uncertainty: dice, accident, insurance risk - requires probability theory

Measurement: 3 cm; 20 points - distribution of results, dealt with by statistics.

Errors in data: we cannot be certain if the values are correct, some may be rejected, some corrected.

Uncertainty due to many factors, lack of information which is hidden in large amount of data: biomedical, commercial, financial databases, insufficient medical tests - data mining, searching for information in data.

Linguistic imprecision: small, fast, low price - fuzzy logic tries to make these concepts more precise.

Some more information is in my tutorial http://www.is.umk.pl/~duch/ref/kdd-tut03/04tutorial.html

Lotfi Zadeh has also said recently in his talks that the proper area of fuzzy logic application is linguistic imprecision.

A3. Many application papers show how to create membership functions that represent various uncertainties; linguistic concepts are used, such as fast, slow, moderate, and then some simple functions, like semilinear or trapezoidal, are fitted. I cannot describe the whole process here; please consult some books on fuzzy systems, for example Kecman's introduction.

A4. Yes, it may be very useful and lead to more natural answers. Adding Gaussian-distributed noise to inputs helps to find better answers (there is a mathematical proof of this in C. Bishop's book on neural networks). Even if fuzzy rules are sometimes more accurate, they still give yes/no answers where a gradual answer would be more appropriate. Are you sick or healthy? For people who are a bit sick this is not yes/no but gradual: there is a degree of being sick, or a degree of "high" fever. It would be very interesting to learn how to calculate with distributions given as inputs, instead of single numbers. Frequently we know these distributions: for example, in medical tests accuracy is given by Gaussian distributions, in other situations by Poisson, and sometimes by uniform distributions. These distributions should propagate through our computing system and give distributions as outputs. In control processes we may be explicitly interested in reaching values that are within some tolerance, in a process where we can regulate some input parameters and want to obtain some precision at the output. Fuzziness may be useful in many ways.
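One simple way to propagate an input distribution through a computing system is Monte Carlo sampling. The sketch below is my own toy illustration (the function and the Gaussian parameters are hypothetical): a Gaussian input is pushed through a nonlinear function, and the output is again a distribution, summarized by its mean and spread.

```python
import random
import statistics

def propagate(f, mean, sigma, n=100000, seed=42):
    """Push a Gaussian input distribution N(mean, sigma) through the
    function f by sampling; return mean and std of the output."""
    rng = random.Random(seed)
    outputs = [f(rng.gauss(mean, sigma)) for _ in range(n)]
    return statistics.mean(outputs), statistics.stdev(outputs)

# Example: an input value x ~ N(5.0, 0.3) fed into a nonlinear map x -> x^2.
m, s = propagate(lambda x: x * x, 5.0, 0.3)
print(m, s)  # mean near 25.09 (= 5^2 + 0.3^2), spread near 3.0
```

Note that the output mean is not simply f(5.0) = 25: the input uncertainty shifts it, which is exactly why single-number computations can mislead when the inputs are really distributions.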

However, frequently the introduction of fuzziness is done in a naive way and does not contribute to the improvement of pattern recognition systems. I have quoted some examples in [6].

A5. In Pattern Recognition probability concepts are more fundamental; in fact there is little advantage in using fuzzy logic for pattern recognition if the number of rules is too large to give an understandable description of the data. If the goal is to understand the structure in data, and visualization or crisp logic is insufficient, fuzzy rules may be attempted; but if too many are created by neurofuzzy systems they do not teach us anything new, and usually more accurate methods may be used. However, in rare cases it may happen that simple assumptions about the data are ...

- Now, I am studying the theory of Evolutionary Algorithms. In this area researchers address two questions:
Does the EA converge to the global optimum?
Running time complexity: in how many steps can the algorithm obtain an optimal solution?
Researchers have mainly relied on two approaches to analyze their EA algorithms:

1. Schemata Theory (Holland, 1975): this approach only argues that an EA can converge to the global optimum.

2. Markov Chains (from 1991): this approach can prove the convergence of the Simple Genetic Algorithm (a simple version of the Genetic Algorithm with binary encoding and 1-point crossover) and find some lower bounds under some predefined conditions.

In my project I use integer encoding; therefore the problem I want to prove is more complex than the binary one. Could you suggest any directions for proving convergence of EAs with higher cardinality (integer or real encoding)?

I want to combine Evolution and Learning. I hope that it makes the algorithm faster. Do you know any sources or papers about that?

This is a question for the EA course, rather than CI ;-).

EAs are not optimal as far as I know. Best results are obtained when operators that know about the specific problem structure are used - see recent papers of Zbigniew Michalewicz. The combination of EA and learning is very promising; Ryszard Michalski had a good paper on this subject. Use Citeseer to search for these papers - it is linked from my page

http://www.ntu.edu.sg/home/aswduch/CI.html or search on my page for AI/ML

http://www.ntu.edu.sg/home/aswduch/ai-ml.html

My own opinion is that people have put too much effort into EAs and too little into comparison with classical techniques. See for example this survey:

http://www.fizyka.umk.pl/publications/kmk/99globmin.html

- Could you please explain the comparison between existing EAs? Thanks!
I will not talk about optimization in this course at all; EEE has a course, E6002-Genetic Algorithms, where this will be very appropriate. We cannot fit everything in one course ...

- I don't know why most optimization techniques are based on the least square error criterion. I mean, why don't we use least cube error or something else? What principle is hidden behind the choice of this criterion?
It has to be bounded from below! The square error has a minimum of 0, but the cube error can decrease to minus infinity. There are many other error functions, for example based on entropy, but the square error is the simplest: it is easy to compute gradients, and it can be proven that neural networks using this function correctly approximate probabilities. This is an interesting question though, and I once devised a method to accelerate learning by using an initially high exponent a in the error function Sum ||Output(x)-Target(x)||^a. Unfortunately this idea has never been published, so if you would like to pursue it please see me!
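The boundedness argument can be seen in a few lines of Python; this is my own toy demonstration, not part of the lecture material. With exponent 2 the error has a well-defined global minimum at 0, while with the signed cube the "error" can be driven arbitrarily far below zero:

```python
def error(outputs, targets, a):
    """Sum of signed errors (output - target) raised to the power a."""
    return sum((o - t) ** a for o, t in zip(outputs, targets))

targets = [1.0, 1.0]

# Squared error (a=2) is bounded below by 0, so minimizing it is well posed:
print(error([1.0, 1.0], targets, 2))        # 0.0 -- the global minimum

# Signed cube error (a=3) has no lower bound: pushing outputs far below
# the targets makes it arbitrarily negative, so "minimizing" it diverges.
print(error([-100.0, -100.0], targets, 3))  # about -2.06e6, and falling
```

Note that the error function mentioned in the answer, Sum ||Output-Target||^a, uses the norm (absolute value), which stays bounded below by 0 for any exponent a; the divergence problem arises only for odd powers of the signed error.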

- You talked about the difference between nominal and ordinal features. But can you explain a little more about what differences these two types cause when we use them? ^_^
We will see it in some examples later, but it is clear that for ordinal features intervals may be used, for example [yellow-red], while for nominal features only sets, for example {taste_mango, taste_orange ... taste_apple} = taste_fruit. The number of subsets that may be generated from N symbols grows like 2^N. In data mining it will therefore be difficult to find a subset that can be used in a simple rule. It is easier to convert ordinal values to numerical values, and thus easier to deal with them. Most classification and approximation methods work only with numerical values, which are the easiest to handle.
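The combinatorial difference can be made concrete with a short sketch (the flavour names are hypothetical examples). For a nominal feature a rule condition is a subset of values, and the number of candidate subsets explodes as 2^N; for an ordinal feature a condition is just an interval, of which there are only N(N+1)/2:

```python
from itertools import combinations

def candidate_subsets(values):
    """All non-empty proper subsets of a set of nominal values."""
    subs = []
    for r in range(1, len(values)):
        subs.extend(combinations(values, r))
    return subs

flavours = ["mango", "orange", "apple", "banana", "lychee"]

# Nominal feature: a rule may test membership in any subset.
print(len(candidate_subsets(flavours)))  # 2^5 - 2 = 30 candidate subsets

# Ordinal feature: a rule only needs an interval [low, high] of values.
n = len(flavours)
print(n * (n + 1) // 2)                  # 15 intervals for 5 ordered values
```

With 5 values the difference is small, but with N = 30 symbols there are over a billion subsets and only 465 intervals, which is why ordinal (or numerical) encodings are so much easier for rule search.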

- One more example may be your explanation of the Bayesian formula. I think it is cool!! You told us that P(x,c)=P(x|c)*P(c) just means that the evidence (or certainty?) of the coexistence of x and c is determined only by the certainty of c's existence and the conditional existence of x given c. This explanation sounds very good, since it seems that the certainty (uncertainty) of c is propagated to (x,c) by a bridge of x|c. And maybe uncertainty propagation is just what modern logic deals with - that is what I had not thought of before. Maybe you think this is too simple, but I am excited, since this is the first time I have seen something that is shared by logic paradigms. I just wanna find something to explain them all as special cases - too silly, ^_^?
This is true for graphical models that are very fashionable now, so you are on a good track; I will say a bit more about Bayes formula in lecture 4.
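The product rule is easy to verify numerically; the joint probability table below is made up purely for illustration. Computing P(c) as a marginal and P(x|c) as joint divided by marginal, the product P(x|c)*P(c) reconstructs every cell of the joint distribution:

```python
# A hypothetical joint distribution over a binary feature x and class c.
P_joint = {(0, 0): 0.30, (0, 1): 0.10,
           (1, 0): 0.20, (1, 1): 0.40}

# Marginal P(c): sum the joint over x.
P_c = {c: sum(P_joint[(x, c)] for x in (0, 1)) for c in (0, 1)}

# Conditional P(x|c): joint divided by the marginal of c.
P_x_given_c = {(x, c): P_joint[(x, c)] / P_c[c]
               for x in (0, 1) for c in (0, 1)}

# Check the product rule P(x, c) = P(x|c) * P(c) for every cell.
for (x, c), p in P_joint.items():
    assert abs(p - P_x_given_c[(x, c)] * P_c[c]) < 1e-12
print("product rule holds for all cells")
```

This is exactly the "bridge" intuition: the uncertainty about c, P(c), is carried over to the pair (x,c) through the conditional P(x|c), and graphical models chain many such bridges together.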

- Is it good to take more courses for my research at the beginning?
If your topic is already well defined then it may be good to take relevant courses to learn the background; please look at some introductory books too.

- It is said that mathematics is very important; may I know what level of mathematical knowledge is appropriate for graduate research?
In most cases not too high; some linear algebra and analysis may already be sufficient. Computer algorithms are frequently based on rather simple mathematics. Neural networks require mainly good skills in calculating gradients and partial derivatives. But it is hard to say anything general, because some research goes into specialized subjects: optimization benefits from operations research courses, and if you work with signal analysis there are courses covering specialized methods in this field. More math is always better, because it gives you new ways of thinking about the problem.

- Now I have a question on a paper of Hopfield:
H. S. Seung and T. J. Richardson and J. C. Lagarias and J. J. Hopfield, "Minimax and Hamiltonian Dynamics of Excitatory-Inhibitory Networks", NIPS, 1998.

In this paper, I have a question about the convergence proof of this neural network. Moreover, I noticed another type of neural network whose convergence proof is similar to this one, but still a little different because it uses normalization in each iteration. So would you like to discuss it with me before the course opens?

Unfortunately I am not familiar with this paper and do not actively follow what happens in this area of neural networks. Computational intelligence covers so many things that it is impossible to know everything! It is always better to find someone who works on similar subjects; otherwise your poor lecturer will have to spend a lot of time trying to understand papers he or she has little interest in!

- How do you do research when you are in Singapore for half year and in Poland in the other half year? Does the geographical difference affect research?
This is easy nowadays with the use of the Internet; I have many students in Poland and spend some part of the day advising them ... At Nicolaus Copernicus University we had a New Zealander on our faculty, collaborating with his friends in NZ and Australia and some people in the USA; most of his papers were co-authored by people from 3 continents.

The great thing about research - at least in engineering or the natural sciences - is that it is almost totally independent of your language and culture. You may find a common language with people from any country and ethnic background.

2003 results.

By 10 Jan 2003 I had received 11 answers on paper + 1 by email:

Some would like to hear about neural networks:

EID-Virtual Classroom for Artificial Neural Networks (E279-E032-02),
Instructor(s): Mdm Tsai Flora S, A/Prof. Chen Lihui;

SC436-NEURAL NETWORKS (SC436-SC4-02S1), Instructor(s): A/Prof. Rajapakse Jagath Chandana.

Evolutionary algorithms and genetic programming are a subject of:

E6002-GENETIC ALGORITHMS (E6002-02S2)
Instructor(s): A/Prof. Ponnuthurai Nagaratnam Suganthan, Prof. Narasimhan Sundararajan, A/Prof. Wang Han

Data mining

H6405-DATA MINING (H6405-02S1)
Instructor(s): A/Prof. Ng Wee Keong

Fuzzy systems:

E485-NEURO-FUZZY CONTROL (E485-LE-02S1);
Instructor(s): A/Prof. Wijerupage Sardha Wijesoma, A/Prof. Paramasivan Saratchandran, A/Prof. Song Qing

E6224-NEURAL & FUZZY SYSTEMS (E6224-02S1)
Instructor(s): Asst Prof. Mao Kezhi, A/Prof. Paramasivan Saratchandran, A/Prof. Song Qing

Some would like to see specialized topics.

Dempster-Shafer possibilistic theory is sometimes mentioned in AI courses:

E483-ARTIFICIAL INTELLIGENCE (E483-LE-01S2)
Instructor(s): A/Prof. Chen Lihui, A/Prof. Chan Chee Keong, A/Prof. Wang Lipo

SC430-ARTIFICIAL INTELLIGENCE (SC430-SC4-02S1)
Instructor(s): A/Prof. Michel B Pasquier
Introduction to Artificial Intelligence

Time series analysis for financial applications may be more appropriate for:

FE8905 - Artificial Intelligence Techniques in Finance (FE8905)
Instructor(s): A/Prof. Lim Meng Hiot
This course explores some of the AI techniques used in financial applications. Participation in a significant project is mandatory in the course.

A more specialized bioinformatics course:

MSC in BIOINFORMATICS (BI-MSC)
Instructor(s): A/Prof. Kwoh Chee Keong

2003: Thank you for sending the questions. A few are answered below:

- I am getting a little bit confused with cross-validation. After k-fold CV there will be k possible solutions, e.g. k decision trees or k hyperplanes; how are we going to come up with the final solution? A simple average may work in some cases but not in others (e.g. decision trees). Or do we pick the best one, with the highest precision, recall, etc.?
Cross-validation is used for two reasons: to determine the expected statistical accuracy of classification models, and to find the best parameters of the model that do not lead to overfitting the data. The final model is created on all available data using the optimal parameters determined in CV, and then applied to the new data.

For example, the optimal number of nodes in a decision tree is determined using CV, along with the expected statistical accuracy, and then such a tree is created using all data; it may achieve slightly better accuracy on new data than in the CV test, since it has been created using all available data. In models based on basis function expansions, like RBF, the number and some parameters of these functions may be set to the average values that maximized the validation accuracy in the CV test.

For some models, like LDA, CV will be used only to estimate statistical accuracy, since there are no model parameters that may be set.

In most systems this is rather confusing, but in Ghostminer it is rather clear. For example, after running the SSV tree, which is trained using CV, optimal parameters are found, and the following information is reported: the training result from the final tree, the average train/test results from CV, and the individual results on each CV partition.
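The "CV picks the parameter, the final model uses all data" procedure can be sketched with the standard library alone. The toy 1-D nearest-neighbour classifier and the generated data below are my own illustration, not lecture material; the point is only the two-step structure:

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_x, train_y, x, num_neighbours):
    """Majority vote among the num_neighbours nearest 1-D training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = [train_y[i] for i in order[:num_neighbours]]
    return max(set(votes), key=votes.count)

def cv_accuracy(xs, ys, k, num_neighbours):
    """Mean test accuracy over k folds for one parameter setting."""
    accs = []
    for fold in k_fold_indices(len(xs), k):
        test = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in test]
        tr_y = [y for i, y in enumerate(ys) if i not in test]
        hits = sum(knn_predict(tr_x, tr_y, xs[i], num_neighbours) == ys[i]
                   for i in fold)
        accs.append(hits / len(fold))
    return statistics.mean(accs)

# Toy 1-D data: two overlapping Gaussian classes.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(60)] + [rng.gauss(2, 1) for _ in range(60)]
ys = [0] * 60 + [1] * 60

# Step 1: CV is used only to pick the parameter (here: number of neighbours).
best = max((1, 3, 5, 7, 9), key=lambda m: cv_accuracy(xs, ys, 5, m))

# Step 2: the final model uses ALL the data with the chosen parameter;
# the k per-fold models are discarded, never averaged.
print("chosen number of neighbours:", best)
print("prediction for x = 2.5:", knn_predict(xs, ys, 2.5, best))
```

Note that nothing from the k fold models survives except the parameter value, which resolves the question: neither averaging nor picking the best fold model is needed.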

- Would we be able to handle the exam if we can answer only about 75% of the sample questions in all the lectures?
You may pass, but I hope that you will be more ambitious than that ...

- Is it necessary to read up the reference texts for additional knowledge or are the notes sufficient for the exam?
No, this is just for people who will find applications of some of the methods that I have talked about; these books are a good start to get deeper into the subjects.

- There are so many formulas to memorize ... Can we pass the exam if we only understand the topics but do not memorize the formulas?
If you understand the topics then there are no formulas to memorize! Everything should be obvious: true positives, false negatives, the formula for information, joint information, the SOM update rule, or the margin and the SVM criterion. If a concept is understood there is no need to memorize formulas; they can be written down in a natural way.

- Could you help us by telling us which particular topics/lectures we should focus on because there is simply too much to learn, understand and remember?
There were very few longer theoretical derivations; SVM and the bias-variance decomposition are the two main ones, all others fit on one page. A good part of these lectures were demonstrations that I obviously will not require during the exam. What is left is really not that much: some visualization techniques, Bayesian foundations, decision trees, linear discrimination, kernel methods, density estimation, information theory, some applications and model improvements.

Unfortunately, telling you which lectures to focus on would break the fundamental rule that all information related to examinations should be kept secret.

- Could you show us some exam-type questions on Friday?

See the sample questions notes, containing sample questions for all lectures.

- How concise should the answer to an algorithm question be? (e.g. describe PCA, etc.)
Similar to the lecture notes. I do not expect the very detailed, long derivations that you may find in the textbooks. I am sure that those of you who will really need them will go through them anyway.

- Is it 4 questions out of 6, or something else?
This year any 4 questions out of 5 should be answered. If you answer all of them the best answers are selected, but there is no way to give you an A+ as a reward. Please note that each question may cover different lectures, so it is not sufficient to learn half of the course material!

- Question no. 4 in Lecture 4. We are not clear about the contingency table. What is it used for?
Page 11 of Lecture 4 shows the contingency table; it is a simple joint probability distribution P(r,C), where r are the objects in some bin (for example in a histogram) and C are classes. You should know how to compute these numbers, what you will get by summing rows, columns, or all elements, and how to get conditional probabilities from joint probabilities. Use the histograms on the previous page: there are 2 classes and 20 bins. The histograms show the number and the class of the fish, so the table is easy to construct: 20 rows with 2 columns.
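The computations on such a table are short enough to sketch directly; the counts below are hypothetical (3 bins instead of the 20 from the lecture, to keep it readable), not the actual lecture data:

```python
# A small contingency table N(r, C): rows are histogram bins r,
# columns are two fish classes (hypothetical counts).
counts = [[10, 2],   # bin 1: 10 of class 0, 2 of class 1
          [5, 5],    # bin 2
          [1, 12]]   # bin 3

total = sum(sum(row) for row in counts)          # summing all elements
P = [[n / total for n in row] for row in counts] # joint probabilities P(r, C)

P_r = [sum(row) for row in P]                    # row sums: bin marginal P(r)
P_C = [sum(P[r][c] for r in range(3)) for c in range(2)]  # column sums: P(C)

# Conditionals are joint divided by marginal, e.g. P(C|r) = P(r, C) / P(r):
P_C_given_r = [[P[r][c] / P_r[r] for c in range(2)] for r in range(3)]

print("P(r):", P_r)
print("P(C):", P_C)
print("P(C|bin 1):", P_C_given_r[0])  # class probabilities inside bin 1
```

Summing a row gives the probability of landing in that bin, summing a column gives the a priori class probability, summing everything gives 1, and dividing a joint cell by its row marginal gives the conditional class probability inside the bin - exactly the operations asked about in the question.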

- Question no.4 in Lecture 5.
How do we draw spheres in parallel coordinates?
Like anything else in parallel coordinates! Start from a circle, a 2-D sphere. Select some points (x,y) on the circle and for each point plot the value of its coordinates on two parallel axes, one representing x, the second y; join the x and y points on the axes, since they come from the same point on the circle. The result is presented on slide 16. In 3-D, again take some points on the equator and a few other circles; this time they have (x,y,z), so 3 parallel axes are used. The result is again on slide 16 of Lecture 5. You may do it with an 8-D sphere if you know how to calculate the values of the points on such a sphere; more segments, similar to the 3-D case, will appear.

- Question no.4 in Lecture 22.
Does the C4.5 algorithm always create binary trees? Is this a unique feature of it?
Usually binary trees are created, but the original algorithm allows splitting nodes into more than two branches.

- Question no.3 in Lecture 36.
We are confused about the answer to it.
We have concluded with the following answer. Is it correct?
Large variance -> Low entropy(uncertainty) -> More information

Slide 7 of Lecture 36 shows an example, and indeed you see that narrow peaks carry little information (because the average surprise, i.e. the entropy, is low), while broad peaks, large-variance distributions, have high entropy and contain more information; so the middle step of your chain should read: large variance -> high entropy -> more information. Think about a lottery in which a flat distribution over the one million people taking part is assumed; if you win, you are surprised, gain large information, and hopefully a huge prize. If on the other hand you know that only the family of the lottery organizer may win, the amount of information after announcing the winner is low ...
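The lottery intuition can be checked numerically; the distributions below are my own toy examples, not the ones on slide 7. A flat (broad) distribution has maximal entropy, while a sharply peaked one has little:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

n = 8
flat = [1 / n] * n               # broad distribution: any outcome possible
peaked = [0.93] + [0.01] * 7     # narrow peak: one outcome almost certain

print(entropy(flat))    # 3.0 bits: every announcement is a surprise
print(entropy(peaked))  # about 0.56 bits: the winner is almost known
```

The flat "fair lottery" yields log2(8) = 3 bits per announcement, while the "organizer's family wins" distribution yields far less - broad distributions carry more average information, exactly as the slide shows.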

Sample questions that have not been used:

- Draw a sample distribution of data when first PCA component explains large portion of variance, but is useless for visualization.
- Define Voronoi tessellation and Delaunay triangulation.
- Provide examples of at least two measures of topographical distortions used in multidimensional scaling procedures and describe how maps generated using these measures differ.
- Explain how decision tree algorithms avoid overfitting the data.
- Explain the difference between feature ranking and feature selection.

Back to the course page, and back to
Wlodzislaw Duch