Questions should be asked frequently! Please write them down and give or email them so that I could add more detailed explanations to my lecture notes.
Name:  ???
Background: ???
Interest (science only): ???
Expectations: ???
Courses related to this one: ???
Questions, anything else ???
Unfortunately there is not much time during lectures to make a discussion and to answer your question, therefore please send me your questions, this will help me to improve the lectures. Time of the lectures is short and there is always so much more to say ...
Thank you for sending me questions presented below. They are answered below.
Please let me know if you would like me to add your name to the question, I will by default copy only the question.
Giving these lectures I try to find a tradeoff between giving you a solid foundation and an overview of many methods that may be useful in your work, showing their basis and their practical applications, without being too shallow in the analysis of these methods. Unfortunately we cannot spend enough time on a single topic to explore it in depth. For undergraduates it may be better to focus on a few topics and make a very detailed study, some books do it well. For graduate students it may be more important to have some overview and follow on your own in the direction that is useful to you, this is usually much harder to get from books because it would require to read a number if books and papers rather carefully. Perhaps this approach would be more beneficial for you, but please let me know if you have some thought on this issue.
I probably should organize the questions for different lectures, but right now questions newest questions are at the top. Some older questions are below.
A: P+ is the a priori probability of class +, calculated from data as N+/N. P+(M) is the predicted percentage of samples in class +, calculated from predictions of the decision system, not directly from data. If the real class proportion is P+=0.8, P-=0.2, but your system is a majority classifier always saying "choose C+" then P+(M)=1, P-(M)=0. So they are very different.
A: Not quite so .... the equality AUC = (S+ + S-)/2 is true for this one point, it is not a definition of AUC! Calculate the area under the curve as a sum of two triangles and a rectangle to get AUC, as seen in the slide, and then look at the line where it is constant: this line goes through points A and B and is a shiftes y=x or S+ = S- + constant line.
A: Indeed ... but also in the formula above indices are not correct; please see the corrected slide. thank you
A: Although this is unsupervised we can at the end label each node with the class label of a vector that activates the node in the strongest way (is most similar to the weights of this node). Then the map will have labels, as shown in this example; given a new query vecotr X we find the winner node and use its label for classification.
A: Lines connect neighbors on the grid that is in the space of grid coordingates, not in the data space; in 1D case fro example each nde has two neighbors, initially the may have d-D weights that are quite different, but after training they point to nearby areas in the data space, so points are shown close to each other; in 2D case at the end we see that points have 4 (or 2 at the corners) immadiate neighbors, after training nieghbors on the grid, connected with the lines, are also pointing to neighboring parts of space.
A: This is obviously true, as for any fix value Yj sum of probabilities over Xi is the probability that Xi has some value, and this is 1.
A: Yes; note that Gaussian have all higher moments =0, that's why they are fully characterized by the mean and dispersion!
A: This for standardized that is zero.
A: It is hard to unswer such general questions; try to listen to my comments on the audio track and ask something more specific, otherwise I'll have to repeat a part of the lecture in writing ...
A: Very simple: compute mutual information MI index between class and feature values, order (rank) features Xi according to the index value and the most infromative features are going to have high MI(Xi), and those that have nothing to do with your task (class label assignment) will be at the end with low MI(Xi). It is easy to create more such indices, for example take linear correlation coefficient as a ranking index. I'll return to this in next lecture.
A: ||X||=1 is not the same as standardization: here for each vector the length is 1, in standardization for each feature std is 1. The first sums over all features of a single vector, the second sums over all vectors for single feature. To get ||X||=1 one can simply add an extra component xe to some vector X, so that ||(X,xe)||=1 or xe^2=1-||X||^2; here I assume that all X have been rescaled to have initially ||X||<1.
If this is the case, then can we interpret the network as follows: We submit all the data vectors belonging to class 1 to the network. We seek that most data vectors belonging to class 1 will have a strong output in F1(X) of the network. To achieve this, the nodes parameters must be modified. They are modified by training with the data vectors belonging to class 1. Then in the final layer of the network, we sum over all the result (hence summing over n(c)). If we use rectangular functions, we will derive crisp logical rules in F1(X) that can be used to identify the data vectors belonging to class 1. Simplification of these rules will thus give the decision rule to identify data vectors of class 1. We continue the above steps for class 2, ... class k to derive their decision rules. Is the above correct?
A: yes.
A: Please run some examples with PRISM using Yale or Weka, it is the best way to see what happens. The results depend very much on data character and on discretization, but typically many rules arise because A=v may have many values, so even though a small number of rule conditions is created this is still the case; also optimizing precision only leads to rules covering small number of examples and thus many rules.
A: It is quite common that we know some functional dependence but do not know the parameters; for example, if you see a catapult throwing stones from point A and falling at distances X you may try to calculate the angle and the speed, because physics tells you that the trajectory is a parabola. So someone may have measured how the number of different species depends on parameters, but given particular data we do not know the actual parameters.
A: In regression we have y(X)=W*X, and y is a real valued function that is used to fit the data (for example a line going through some data); in classification all we care is what is the class, so y is a two-valued function, for example +1 and -1, so the regression function is binarized. We can use regression methods and then take decisions using some thresholds y(X)>T => C=+1, otherwise -1.
So the two approaches look similar but there is a difference in the condition that are set. For example in regression it could be least mean square error, and for classification SVM conditions that are not suitable for regression.
A: I wrote only the largest coefficient, but LDA solution contains more, and some are put to zero. If one non-zero coefficient corresponding to a feature f1 is left than decision is made using the threshold for this feature, for example the alcohol level; if there is one large and few smlal coefficients the value of the feature corresponding to the large coefficient is the most important, when it changes a bit the whole funciton changes significantly so it will largely decide if y(X)>T or not.
A: We simply define a kernel function K(X,Y)=X*Y in this linear case, all is in d dimensions. The point here is that if a transformation X->Phi(X) from d to m diemensions is defined we will still need scalar products in the Phi space, and the kernel function will be Kp(X,Y)=K(Phi(X),Phi(Y))=Phi(X)*Phi(Y). Here Kp(X,Y) will be a kernel function in d-dimensional space, giving the same values as simple scalar product in the Phi space.
A: Chapter 3 of Duda, hart and Stork has some detailed examples, so I'd recommend that.
A: In this 2D example X=(X1,X2), Y=(Y1,Y2) are two vectors; K(X,Y) has the form as in the slide. Phi(X) and Phi(Y) will be defined in 5D space and if you compute their scalar product Phi(X)*Phi(Y) = K(X,Y).
A: Yes, not quite non-parametric approach where numerical estimation of density is generated, and not quite parametric; but more on that next time.
A: SV are all those vectors that have non-zero alpha coefficients and are needed to "support classification" in SVM algorithm; in separable case they are indeed only on the border. If the problem is not separable (in the kernel space) also those vectors between the margins are called "bounded support vectors".
A: This is not quite as simple and I could not go into details of different type of kernels, positive and conditionaly positive kernels (see Scholkopf and Smola book "Learning with Kernels", 2001 on that). Roughly it is as follows: principal components PC_k(X) are linear combinations of dot products in the kernel space. If the kernel is the square of Euclidean distance (b=2 on the figure on this slide) we have a combination of ||X^i-X||^2 factors. For which points X will this combination PC_k(X)=constant? Alpha coefficients and X^i vectors are fixed, and ||X^i-X||^2 = ||X^i||^2+||X||^2 -2X*X^i; in the kernel PCA procedure for distance based kernel data X is additionally centered so that ||X||=1, and ||X^i||^2 are all constants; therefore PC_k(X) is constant when X*(linear combination of X^i)=X*Z is constant, and that is true for points X on a hyperplane (or a line in 2D) orthogonal to vector Z.
A: Most programs have very few kernels, linear + polynomial + Gaussian, and on most real problems Guassian are best; C and dispersion of Gaussian has to be optimized. Usually one can estimate using crossvalidation which kerenel has a good chance to be the best. In special problems it is worthwhile to design special kernels, for example for text mining or bioinformatics problems.
A: Yes, the summation is finally over SV only but we first have to find which vectors belong to this category, sa we need an algorithm that does optimization and selection. The formulation for the non-spearable case is the same except that now some vectors are going to violate the conditions for correct classification, so we add to minimization the sum of xi variables that measure the extent of this violation.
A: No, there is no closed form expression; W is expressed throuhg alpha and known Y, X. Dual form of L(alpha) is quadratic in alpha but there are two external constraints that should also be kept. The only training method I have mentioned, SMO, is iterative. But it is still very fast.
A: In the next lecture I will say more about it; design of the kerneks (or similarity functions) is now an interesting field.
A: If the value is missing x=? or if x=small or any other nominal value we still use the same 1R algorithm. Sometimes this is a reasonble approach, if for example the value in some cases is not measured and all these cases are from the same class; sometimes it may lead to errors.
A: I have reformulated this slide a bit to stress that discretization is one idea, and it may be daone in many interesting ways, but 1R uses it own simple discretization just as you write and as is illustrated on the following slide; I have changed it to showing discretization for humidity. 1-R has rather naive discretization, adding to the interval with at least B elements (bucket) until non-majority class sample is found.
A: True, but also the number of simple hypothesis is small, while the number of complex ones grows exponentially fast. For example if a simple hypothesis has 3 features and they do not repeat themselves in a tree there is 3! = 6 binary trees that one may construct; but if we have 10 features there will be 3.6 million possible trees, so some of them may by chance fit the data.
This is well recognized in hypothesis testing, where some simple test are used, like t-test. If we try the test millions of times finally something will agree, so the usual confidence that the test should correctly identify chance agreement at something like 95% level, is wrong. 95% confidence level means that the test may be wrong 1 time in 20, but if we try million times than certainly it will be wrong many times. Search for "Bonferroni corrections" to see discussions on this topic and adjustments to confidence levels.
A: To make it more clear I've changed this sentence to: "Since all training vectors are used testing this rule on the same training data gives 0 errors (vector X= X(i) is always nearest). All vectors from little patches in the middle of larger areas that belong to the opposite class, used as the reference vectors by the nearest neighbor rule, will indeed be classified correctly, giving decision borders as in the picture.
If testing on the training data we exclude reference vectors identical with test vectors X= X(i) the error will be non-zero.
Suppose now that only a few prototypes in centers of clusters are used - then borderes will be more smooth and little patches will disapear.
A: The E, or Expectation operation, averages over points X(i), therefore f(i) is used.
A: This is a bit tricky ... in each CV train model M(W|Di) with some parameters W on subset Di of the training data. After CV we selected best model with some parameters W* that on average gives the lowest error on our validation partitions. This may be the end for some approaches, but in many methods the final model may be created using all data. For example, if we optimize the number of neighbors taken into account with the nearest neighbor procedure (kNN), and find that k=7 is best, our model in each CV is kNN(k=3|Di), but the final model will use all training data kNN(k=3|Di); if in CV we find that decision tree should be prunned to some degree d, we need to create the final tree on the whole data with such prunning.
A: Precisely because of this shrinking volume the radius has to increase! If you have a fixed number of N points in d dimensions than the radius of the Voronoi cell that each covers has to grow. In d=10 dimensions only 10% of volume is in the subcube [0,0.8]d. So, if you have 10 randomly placed points and one is (0,0..0) in the corner of this subcube others are probably outside; average distance in each dimension is thus around 0.9, so the total distance is sqrt(10*0.9^2)=2.85, while in 1D average distance is 0.1.
A: Yes, the assumption that we learn something from nearest neighbor is a bad bias if dimensionality is high.
A: Some classes have much smaler number of vectors than the others; "balanced" data has roughly the same number of samples per class. I have added an explanation to this slide.
A: If the threshold is so large that no data is left than P++ has to be zero, isn't it? P++ or the proportion of true positives, is not the probability of correct classification, that grows with the threshold. It is the number of samples from class + classified as class +. Look at the figure in this slide; if the threshold moves to the right there will be a few vectors left, so P++=n++/n is decreasing at the expanse of false negatives P+-.
A: This is not only about PCA, but more generally about representation of nominal data that most methods cannot handle directly. Of course one way is to use binary encoding for each feature value, as you have noticed. An interesting idea is to use distance measures that are defined by probabilities rather than feature values. Methods that use distances can then handle symbolic features directly. I have tried to define numerical vectors that have Euclidean distances repreducing such probabilitistic distances and use them to analyze data. See: Symbolic features in neural networks. Going along these lines one can invent more sophisticated distance measures for symbolic features or objects that have complex structure, and try to represent them in some metric space. For example, MDS will work with any similarity matrix, also probabilistic, and produce numerical features.
A: Perhaps it was not celar that in our example we have Y0=0.2 replies, I have added this to slide, so if there are Y0=1000 respondends N=5000 offers has been send. If all top predictions P(w+|X) were tru we would thus have a line going from (0,0) to (20%,1000) point, but instead we have a curve. as shown in the slide. How is this used? My return from 10%*5000=500 offers sent to those that have highest P(w+|X) predicted chanses for return, is 400! But to get next 400 I need to send 30% of all offers, that is 1500, and then ther remaining 3000 offers will bring only 200 returns, so knowning my costs I may stop at 2000 most promising cases.
I have added a new slide 7 and updated 8 to detail the decision procedure, we have to apply MAP Bayesian decision rule.
A: Yes, we simply check which nodes is the winner and labels are assigned to nodes for the class that majority of the training data in their Voronoi cell (that is area where these nodes are winners) belongs too. SOM is unsupervised algorithm so there is nothing that will enforce pure nodes, that is Voronoi sets from a single class. So indeed it may happen that a given class with small number of samples will never be used. If the training data represents distribution of the data well and there is enough nodes than it usually works well.
A: Yes, we should chooses action that carries with it the lowest risk.
We called the number of class K so j=K should be used and then this formula gives conditional risk for decision wl; the lowest risk k is selected.
I have re-written the decision procedure introducing explicitly conditional risk, and added some bibliographical notes.
A: Yes, this is indicated by |C=wk) condition.
A: Yes, this is the risk or cost that we had before, but now we call it g(X) function and use it to discriminate between classes, as some other functions on the next page.
A: Thank you for careful reading! In slide 12 of lecture 12 I have left P(Wl|Wk) instead of P(Wk|Wl), although commented there that rows correspond to true clases and kept correct notation later; this confusion arised from different convention used in GhostMiner package. So here the risk of doing action j when the true state is l is calculated, and k=l for the lowest risk is selected. So this slide is correct, and slide 12 in L12 is corrected.
A: Here P(wi|X) is our prediction, while Ci(X) is the class label, usually 0 or 1; as I have mentioned sometimes we may have probabilities or degrees that we try to learn, then Ci(X) may be a fraction that sums to 1. In any case if P(wi|X) predictions match Ci(X) labels the error is zero, otherwise there is an error that is used to adjust parameters in models generating P(wi|X) probabilities. So for example for 2 classes we may have C1(X)=0 and C2(X)=1, and if P(w1|X) =1 and P(wi|X)=0 they match.
A: CxZ=ZL is en eigenequation involving the covariance matrix Cx; Z is thus the matrix of eigenvectors of Cx.
As we do not have time to review matrix algebra during this course (it is after all a graduate course), please consult Numerical Recipes book (www.nr.com), chapter on diagonalization methods.
A: This is just a notation where fat X is a matrix of all data vectors X shifted to their means. Assume that the means are already substracted from data, we can do it before calculation of C matrix. In each column we have X(k), k=1..n vectors, first column is for feature 1, last for feature N, the matrix is thus N by n. Transposed matrix has rows corresponding to vectors, and columns to features, it is n by N, so now column 1 contains values of feature 1 for all X(k) vectors. Multiplying first row by the first colum we get C11 or in general elements of XXT matrices contain Cij elements.
A: Because P(e|X) = min_i P(w_i|X), minimum of posterior probabilities, we assume that if there are several identical X they appear in the data, and therefore in the sum. Therfore there is no P(X) in the sum.
A: I keep Software and Computational Intelligence links, where all interesting linkst to software and papers are stored. There are quite a few links to SOM programs, including Matlab version. Such links are unstable, programs vanish, and I can't keep track of all of them - I hope to learn from you about more programs! So please search first in these links and than in the Internet.
A: Unfortunately in this situation higher FDA vectors cannot be obtained easily if the FDA criterion is strictly observed. I have mentioned two methods, based on perturbation and psuedoinverse, but had no time to explain them. They both spoil the FDA criterion a bit.
The best summary is in the A. Webb book, chap. 4.3.3. However, in this paper:
Cheng, Y.-Q., Zhuang, Y.-M. and Yang, J.-Y. (1992) Optimal Fisher discriminant analysis using rank decomposition. Pattern Recognition, 25(1):101-111
a better method is described.
Perhaps the best paper that discusses numerical problems in FDA is:
Efficient Implementation of Optimal Linear Discriminant Analysis. Methods based on Generalized Singular Value Decomposition and other approaches are described. This goes rather far beyond our course and is a good research topic, as different FDA variants give different results and one may go to kernel versions with new variants rather easily, the existing kernel FDA does not work too well.
A good bibliography on
discriminant analysis is here.
A: It menas that if you divide the whole distortion measure by a function f(Rij) only then this is a constant for a given data, only rij distances change when {Y} positions in the target space are optimized. Of course if Rij is used in combination with rij then it has an influence.
A: Yes, MDS does not do any classification, just visualization. One can use data in the target space Y for classification, then MDS is used for dimensionality reduction and may be followed by any classification method. In fact results of such procedure are quite good although no-one seems to be using it.
A: Unfortunately there is no easy way to identify good method for a given data unless we already know something about the type of data we are going to analyze; this is true not only fro visualization but all data modeling technique. METAL project tried to do it for data mining, defining similarity to previously analyzed cases.
A: What good it could do? MDS is already the best representation preserving distances, SOM may only spoil it.
A: This is always good to learn; there are many packages in S or R languages but a few know it, as it is taught only at courses of statistics.
A: True, I thought I've stressed this point, FDA needs some labeling of the data, clustering or knowledge of classes.
A: I should be more careful on this slide distinguishing dependence, second-order statistical correlations and higher-order correlations, which is what was meant; although the slide stresses that uncorrelated features may have higher-order dependencies, it is better to say here that any non-Gaussian distribution after PCA transformation will still have some feature dependencies, rather than correlations, although dependencies are in fact higher-order correlations. I've changed it accordingly.
A: Indeed it is better; Y1=W1TX represents new transformed feature, and we commonly speak of feature space thinking about its dimensions as orthogonal vectors defining directions on which feature values are placed, so sometimes we say that Y2 is orthogonal to Y1, and we really mean the unit vector on the Y1 axis is orthogonal.
A: Certainly this is quite approximate algorithm and in fact it is hard to prove anything exactly about it in more than one dimension, have addded approximation sign there.
A: GA is sometimes used for optimization of parameters in various classifiers, rather then directly. Many papers have been published on this subject (ex: EDRL-MD, Evolutionary Decision Rule Learner), but software is not so easy to find. I have a few links to GA systems on my software page; some packages, like SPSS, use GA for optimization of neural parameters. Yale has GA and PSO (Particle Swarm Optimization) version of SVM that probably works better for classification (although I've not see any comparisons with standard SVMs), but it does not generate rules.
A: Several projects tried to build rule-based classifiers with GA optimization, for example AI Lab in Strassbourg (J. Korczak) has a number of software packages on heir pages.
NeuCom (A Neuro-computing Decision Support) evolves solutions.
Note however, that evolutionary approaches have been around for a long time (20 years) and it is hard to find good comparisons with computationally less expensive mutistart gradient techniques, not to mention standard numerical analysis (branch+bound etc) optimization methods. My experience with PSO and GA in neural parameter optimization has been rather negative, there are better and faster methods for this application. For some reason many people use GA fro optimization for simple problems, and for problems where standard numerical techniques work better; simplest systematic sequential search techniques have been totally neglected, although there is no evidence that they fail, indeed they seem to work quite well,
see here.
A: Only briefly, but I'll try to add some pointers.
A: In this example 4 Gaussian functions, each defined in 4 dimensions, have been used. Their formula is:
G(X;X0,s) = exp(-[||X-X0||2/2s2]) for
X=[X1,X2,X3,X4]
Here ||X-X0|| is Euclidean distance between X and X0 vectors in 4-dimensional space,
with and X0, the center of the first Gaussian at [0,0,0,0].
For the next Gaussians the X0 center is shifted, as given in the slide.
Additional dimensions X5 to X8 were added, made from Xi+4 = Xi + noise. So together these functions have 8 features.
Why? This is an artificial data to show how the scatterograms will look like for such data.
It will also be used later to show how selection of important features work: X1 has more information about the classes than X2-X4, and features X5-X8 are redundant, carrying the same information as X1-X4.
A: The multidimensional Gaussian function G(X;X0,s) is used as probability density function to generate points. This simply means that the probability to select a given point (X,G(X;X0,s)) is equal to the value of the Gaussian in this point, therefore many points come from regions around X0 and a few from regions where Gauss is small.
Gaussian distributions are very common and noisy measurements frequently follow such distributions.
See the entry on
Gaussian noise in Wikipedia.
A: Indeed this is to short, I have expanded the sentence now:
To achieve full invariance requires therefore standardization of data (scaling invariance) and should use covariance matrix.
I have also mentioned that Mahalanobis is still alive; evidently this is not true, as Prasanta Chandra Mahalanobis was born in 1893 and died in 1972. I got him mixed with another great Indian statistician, Calyampudi R. Rao, who was born in 1920.
A: The assumption in the first line is that there are many X vectors, with mean=1 and variance =1; this is the same example as in slide 2. So if the X means are [1,1] and transformation A is given, we can compute both Y means and variances. There is no scaling here, just A transformation. I have changed this slide to make it more explicit.
An elegant way of proving it is to use covariance matrix:
CY=A CX AT. If CX is diagonal, as spherical data with a unit variance is, then CY=AAT, and therefore square of variance for each feature is on the diagonal part Diag(AAT).
A: This is a definition of Mahalanobis distance, a norm with CX in the right corner, so there is nothing to derive here!
A: I've added some explanation there.
A: Just follow the link on my page and get the Ghostminer package from FQS, who will send you a license key valid for a few month.
A: Directly from linear transformation Y=AX and definition of CX given on the previous slide;
CX=1/(n-1) YYT, and (AX)T = XTAT,
so
CY= 1/(n-1) AXXTAT = ACXAT.
A: Distance functions are defined for vectors, we shall not need distances between matrices, although such concept is useful in mathematics.
A: This you should know from the course on linear algebra that unfortunately we do not have time to review, as this is a graduate course, I have to assume that you had a linear algebra course. Any textbook on linear algebra should explain it, I can recommend the Numerical Recipes book that may be downloaded from the address http://www.nr.com/, including numerical routines for diagonalization. Also Matlab has build-in diagonalization routines.
A short explanation is this:
Solving the eigenequation CZ=ZL for C symmetric, positive definite, and L diagonal matrix containing eigenvalues,
leads to an orthogonal matrix Z that contains eignevectors of the matrix C.
Orthogonal means that ZTZ=I, therefore multiplying the eigenequation from the left by ZT gives ZTCZ=L;
this type of transformation, ZTCZ, is called similarity transformation, and it changes C into a diagonal matrix.
Therefore if new vectors Y=ZTX are defined the covariance matrix will be diagonal, as shown in slide 10.
Orthogonal matrix Z defines a rotation of the coordinate system, each new coordinate is a linear combination of the old ones. In this new rotated system all features are decorrelated, because CY is diagonal.
A: We shall talk about that a bit, but unfortunately there is no general methodology, as it very much depends on the applications. One way is to collect or measure whatever we can and than use such techniques as: feature ranking, selection, aggregation or dimensionality reduction to find or construct relevant features.
A: I'll talk about this in Lecture 15/16.
A: Indeed some people called the connectionism "a new AI", but traditional "good old fashioned AI" (GOFAI) has been focused on symbolic representation of knowledge, as is quite evident in the AI textbooks and courses. It is still true that AI people do not come to neural or connectionist conferences, and vice versa. So, even though some AI people would like to see their field expanding and cover also what CI people do.
But it is enough to look at the societies, conferences and journals to see this distinction, with very little mixture between the two groups.
Unfortunately there are no commonly accepted divisions and definitions. Some questions addressed below have already addressed this problem.
I am giving you my own definitions based on the problems addressed, rather then methods used. From this point of view CI is broad and covers all subdisciplines that help to solve problems for which effective algorithms do not exist. This includes AI which has traditionally been focused on a subset of these problems that could be solved with symbolic representation.
Ultimately it does not matter how we label the field, what matters is the type of problems we may solve.
A: Some of these problems indeed require high and low level processes. I have changed slide 4 in Lecture 2 to make it more clear: in my broad sense definition of CI it includes AI, but is not restricted to problem solving, going also in the direction of optimization and pattern recognition.
The first lecture is to see what are the problems requiring CI and what are the sources of inspiration to look for solutions. I will come back to these sources of inspiration talking about different methods. This is quite a broad field so indeed you may be overwhelmed, but other lectures are focused on the methods.
A: Sure.
A: The question in L34 is: prove that Mutual Information = Information Gain when binary feature is used to split the data. This should not be difficult.
A: I have updated the slide to make it more clear; if this integral is 1 than also the final density is normalized to 1. Thank you for pointing it out.
A:I have showed the explanation briefly during the lecture, but it belongs to elementary statsitics, not to this course; see for example "Population variance" at this Encyclopedia of Math page: http://mathworld.wolfram.com/Variance.html
A: For 2 classes recall the definition of SB in CIS-07 s.11, it is made from one vector, the difference of two means, call it D12. Therefore the rank of this matrix, by definition equal to the number of non-zero eigevlaues, will be 1, becuase if we take transform the coordinates to D12 and orthogonal the only non-zero component will be along D12. If we have K classes there will be K-1 independnet vectors Dij obtrained as differences of the means, for example {D12, D23}, but D13=D12+D23 (in a vector sense). Another way of understanding this is that K vectors of means may be represented in K-1 dimensional space. If we have 3 classes, the 3 means Xk are on the triangle, and thus a 2D space exist, so a matrix build from these vectors will have rank 2.
A: The norm of the differnece ||P(\omega_i|X)-C_i(X)|| should measure how far are the predictions P(\omega_i|X)\in [0,1] from the class labels 0 or 1, so to be a good measure of error we have to assume that the class label C_i(X)=0 for X that do not belong to the class \omega_i, or C_i(X)=1 if it belongs ot his class. I have changed C(X) to C_i(X), then it is easy to seee that C_i() is as an indicator function or the class, 1 if in class \omega_i, 0 otherwise.
A: Yes; if it is a single Gaussian than it is of course quite simple, but there are several numerical methods to fit parametric functions to given points from the histogram, see Numerical Recipes for example.
A:
A: Just like in the boosting procedure, if there is a missclassification for X it will put more weight on this sample X, trying to influence the decision borders to handle it correctly.
A: We had no time to go through details here as this will require few extra slides, but you may find the derivation in the chapter 4.2.5 in the Webb book, for example.
A: If we assume that each support vector exerts force on the membrane then its direction has to be along Y^(i)*W/||W|, or the unit vector perpendicular to the membrane, directed in positive or negative way, depending on Y^(i). Now if \alpha_i is taken as the magnitude of the force exerted by X^(i) then we can say that the sum of all forces in the quilibrium is zero, and this is the same condition as the auxilliary condition that we have got from the SVM derivation. The second condition for the torques shows that if W is expressed as a sum of \alpha_i*Y^(i)*X^(i), as we have found, then all torques are also zero.
A: Yes. Of course in some cases we may try to compute gradients to fine good parameters faster, but for large number of paramters this is a difficult job.
A: Indeed you are right, posterior probability is always harder to compute, I have changed that sentence, thank you for careful reading.
A: In practice we simply run the leave-one-out procedure on the training set for different values of k, and select the best k; then we test with this k on the test set. If we want to find the best k this is what we should do.
If there is no test set the procedure that you have described above should be used to estimate the expected accuracy in an unbiased way. The test set is then either the validation part from the crossvalidation or just a single vector from the leave-one-out. This is obviously expensive as we need now for each of the N vectors to do N-1 runs, or a total of N(N-1)evaluations. But after the estimation of expected accuracy is done we still need a good k for new samples, and this should be derived from the whole training data. So the fact that different k may be optimal from different runs does not matter.
A: If we take all the training set as reference vectors, and then use the same set for testing, then 1NN will give 100% becuase for the test vector Xi the training set will contain as the nearest the same vector Xi which is at the distance 0. The only exception is when two or more vectors from diferent classes have exectly the same features, then the result will depend on how we handle ties, but will be in general below 100%.
But if we take crossvalidation, as we did with IB1, or if we use the leave-one-out procedure, then of course accuracy will not be 100%, because the test and the training sets are different.
A: Do not expect me to put the examination questions on the web! This list has been shown already in previous years, so please look at last year's examinations and judge for yourself.
Q: Slide 6 is an excerice, asking you to write down the risk for 2-class 1D problem, and the meaning here is Gauss(x0,sigma), as you have noticed.
Slide 7 is not a continuation of 6, it explicitely assumes the same variance for both classes, so the two variances are identical.
Q1. I will say bayesian approaches include, naive bayesian classifier, maximum likelihood classifier, maximum a posterior classifier and bayesian network. Can I categorized these methods as bayesian approaches?
A: Yes; general philosopy in all these approaches is Bayesian: look for priors, class conditional distributions, evidence, and compute posteriors. Probability experts have some other philosophical approaches, see for example the discussion of Bayesian perspective in an on-line book by ET Jaynes
http://bayes.wustl.edu/etj/etj.html
Q2. The learning phase of bayesian approaches is to learn the probability (prior and conditional) from training data. The simplest way to estimate these probabilities is to calculate the frequency of the training data. But I don't know any other methods? (If we assume the probabilities is in some parametric form, then parameter estimation methods like ML and EM can be used. But for my data has 13 attributes, how to model it in some parametric form?) any non-parametric method can be used for density estimation (prior and conditional) for bayesian approach? Very confused about this.
All these approaches for probability estimation are fine. For some data we have parametric description, especially in engineering, physics, chemistry etc. In your applciation you may not have it so you need to use non-parametric models.
Q3 My plan is to analysis a data set of nba player to estimate his free throw level (high and low). What kind of analysis can I do by using bayesian methods?
A: Looking for factors that have influence on this level using Bayesian belief nets for example. See my software links, section on belief networks.
A: Splitting the tree nodes or adding more nodes to a basis expansion network is done using the training part of the data, but testing is done on the validation part, unseen by the training algorithm. When the complexity will be too high and the training algorithm will start to assign small regions of feature space to a given class the test results are going to degrade, as showing in the last slide in CIS-14. Adding additional split we will notice that although the training results improve the results on the validation part will degrade, so we will not make this split. In decision trees backward pruning is more common, first trees are trained to separate all data (this is possible only if there are no identical vectors with different labels), and then pruned to minimize error on the validation set.
A: If the condition is irrelevant for some training data it can be removed without changing the validity of the rule. On the new, validation or test data, simpler rules should work better, as the irrelevant conditions may only decrease rule accuracy, in a similar way as random conditions.
A: Yes, this is a good summary. There are many approaches to model trees, some based on residual errors. See here for some software and papers on regression trees with linear and non-linear nodes. See also:
DTREG, Classification and Regression Trees And Support Vector Machines (SVM) For Predictive Modeling and Forecasting, providing nice demo for regression trees.
M5 implementation in WEKA is described here.
A: We have derived it in CIS13, slide 10; if probability distributions for two classes are given by multivariate Gaussians and covariance matrices are identical comparing two discriminant functions quadratic terms on both sides cancel and linear discriminants are obtained.
A: The assumption behind this is that probabilities have normal distribution so if there are two with the same sigma we are back at the same problem as above. RDA is discussed in details in Webb, 2.2.2, and chapter 4.3.1 of Friedman, Hastie and Tibshirani book.
A: If accuracy is identical one of the confusion matrices may still be preferred. The worse situation is when we do not learn anything, and this is the first case. Although a random guess is true in 50% of cases it is just random guess. The second situation is also true 50% of the times, but if class 2 is predicted (in 1/4 of all cases) we are sure it was really class 2, while if class 1 is predicted (in 3/4 of the cases) there is probability 2:1 that it is in fact wrong, so in this case we know much more. You may play yourself setting probabilities in another way but keeping the accuracy at 1/2. Please note that the class distribution may change from 1/2, 1/2 to something else, for example to (1/4,3/4) in the second case.
A: I have added more comments here. Inputs are randomly distributed in [-1,+1]d, for d=1 ..10, the task is to approximate the target functions f(X)=exp(-8||X||2). MSE is the error of approximating f(0), and the Bias2 show on the right side is calculated using the formula from Lecture 14.
A: To cover 10% of volume in 1 D we need a line of 0.1 length, in 2D we need a square with side 0.31 = 0.11/2, and in 3D it is 0.47 = 0.11/3, in 10 D it is 0.79 ... That means that in 10D the sub-cube with the side of 0.8 contains approximately 10% of data if it is uniformly distributed. If there are just 10 points uniformly distributed the average distance of neighbors will therefore grow from 0.1 to approximately the length of the diagonal of the hypercube, in 10D it is 2.5.
Projecting points on a single line we get relatively large density, but the distance between these points in 10D is large and thus highly-dimensional spaces are almost empty.
A: Parameters of the model are set to optimize the results on the validation data. So, if a number of numerical experiments using validation set is large the resulting model may overfit the validation data, even though this data is not directly used for training. If 1000 models are generated, with slightly different sets of parameters, by pure chance one may have a correct bias for this particular validation set, but that does not mean that it will generalize well.
For that reason "Crossvalidation selects the validation partitions randomizing the
data", that is the data set is first permuted to have a random order of the vectors, and then partition into k parts, each of which is used for testing in the k-fold CV procedure.
This gives a better estimation of generalization error than just validation on a single partition, because now it is impossible to find parameters that will by chance lead to good results on a selected validations set; the model has to perform well on k different sets that it has not seen.
May I know whether there is some mistakes for that? As from my understanding, it should be
A: Going from contingency tables N(w,x) to confusion matrices is not so straightforward ...
top row in N(w,x) is class w+, bottom w-.
Element P11 of the confusion matrix is the fraction of times we had class w+ cases and they have really been classified as w+;
we have 10 w+ cases for x=0 and 40 for x=1;
if x=1 the MAP model selects always class w+, so all 40 cases are assigned to w+,
and P11=0.40;
the other 10 cases for x=0 that belong to w+ class are going to be assigned by the MAP model erroneously to class w- because all cases with x=0 are assigned to w-,
therefore P12=0.1.
So, this confusion matrix is correct as it is.
To generate it first find the MAP classification rule.
Here it is: if x=0 then w-, if x=1 then w+.
Then use these rules to predict how many cases are correctly identified.
For class w+ and x=0 there are 10 cases assigned to w-, and for x=1 there are 40 cases assigned to w+. So,
P12=P(w+,w-)=0.1 and P11=P(w+,w+)=0.4.
Some tutorials on PCA are available in the net, for example:
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
http://www.fon.hum.uva.nl/praat/manual/Principal_component_analysis.html
For more visualization methods based on discriminant analysis see:
http://www.blackwellpublishing.com/content/BPL_Images/Journal_Samples/ANZS1369-1473~43~2~158/158.pdf
For more, just write "Principal Component Analysis tutorial", and "Fisher Discriminant Analysis tutorial" in Google.
A: Indeed a major problem is formulation of good questions; we can learn methods to solve problems once we are able to formulate them.
It took Summerians hundreds of years to go from a concept of "10 jars of oil", to an abstract concept of "10 of something". This is one of the absolutely non-algorithmic problems and the progress in formulation of new points of view is rather slow, most people just repeat the same thing slightly improving it, without questioning the basics. So there cannot be an easy solution to this problem.
Personally I find cognitive inspirations, studying how brains solve problems, how to simplify complex ways of describing the neurodynamics, very useful. I will go back to such inspirations during lectures many times, even though a formal theory may exist now that seems to be purely statistical.
We should try different assumptions and see what our models are able to explain. In visual system there are many assumptions that people make, and grossly simplified versions of models based on such assumptions find their ways to autofocus in the electronic cameras. Divide the vision field into many regions and then focus on the one with the big variance, for example fast movement or strong contrasts, exploring all such areas.
You are of course right, sorry I have not been too clear here: 8D means that there are 8 features; 4 Gaussian distributions, each in 4D, have been generated, the first centered at (0,0,0,0), the second at (1,1/2,1/3,1/4), the third at 2(1,1/2,1/3,1/4) and the fourth at 3(1,1/2,1/3,1/4); so the first 4 features x1-x4 are made from overlapping Gaussians, presented in different colors; the remaining x5-x8 features are made from these first four, x5=2*x1+eps, where eps are random numbers evenly distributed in [-1,+1] interval. I have added some remarks to the lecture presentation.
A: I am glad to hear that, showing you some perspectives was important part of my plan and of course I feel that I have not done enough ... but then it is always very subjective, I can tell you what in my opinion is worthwhile to pursue, and of course others will have different ideas. I will try to continue adding some answer and advices to this page.
A: The beauty of the kernel-based approaches is that we do not need to write this transformation explicitly, all we need are scalar products, and to be sure that kernels have properties that scalar products in some target spaces have. I do not know about any book or paper where this transformation is directly written and it may be difficult to find it. Although this may be an interesting excercise it is not clear if it could be useful for anything.
A: Indeed, Bayes name is attached to many things, and NB is one of many of them and is a method of probability estimations useful in classification, while Bayesian estimation is commonly used to estimate parameter distribution functions more than single parameters.
A: Taking the log of likelihood will make from these constants an additive term. We were given n1 - n4, so they do not change, and the derivative will be zero. After optimal value of theta is found P(Class1)=(2+theta)/4 is the probability of Class1, times n=125 samples = 
I have made it a bit more explicit updating the slides, but I am afraid that this type of problems come from not making your own notes ... of course the lecturer will finally spell out all the details on the slides, but making notes and thinking about them actually facilitated understanding and forced the motor cortex of the brain (and it is quite a bit of the total cortex) to participate in thinking.
Having everything spoon-fed is not good ...
A: I talked about this quite a bit: only one book, no notes, self-produced or self-annoted material. It is final to highlight sentences and equations, or even add small comments in derviations, but please do not write other things there. On the second thought I understand why there are almost no open book exams ...
A: I did not talked much about ICA, so I cannot expect the full algorithm for ICA to be given as an answer. The main idea what to do is sufficient, but of course a few more variants will be better.
A: It should not be difficult to construct other classifiers that get results close to the optimum, that is the prior probabilities. For example, density estimation methods such as kNN with larger k on average will get optimum result. A tree with a single level (tree-stump) should also be close to optimum, it may be slightly lower or higher in accuracy than the majority classifier due to random fluctuations in data dsitribution. The full overfitted tree has clearly the worst performance.
A: Not the shape of the grid, becuase in SOM it is rectangular, fixed grid. We do plot node paramters if d=1 or 2, as in the Peano curves example.
You can make simplest example doing hand calculations and see what will happen. Let the grid be just 2 nodes, with 1 parameter each, initially w1=0.1 and w2=0.2. Now start showing them data X=+1 and X=-1, move your w1, w2 using SOM update rule once in the direction of +1 and another time -1. Because w2 is closer to +1 it will be the winner and will eventually move to +1, moving back to larger and smaller values, while w1 will be a winner for -1 and will finally become -1. Nothing is changed in physical space, but node parameters may start to point to different areas of the input space.
So what is visualized if d=10? Only the activity of nodes. Seeing that a node is a winner we know that in 10D space there is a cluster somewhere, and the nearby cluster should excite nearby node in 2D space. Similarity between clusters in the input space and winning nodes in on the gris is preserved, this is why SOM is called topographical mapping, although it is not a perfect mapping.
A: This is not hard to imagine: Peano curve again! Try it with SOM software, like the one I showed during the lecture.
A: In all compettive learning algorithms I showed you Voronoi diagram showed which data vectors are assigned to a given prototype in the center of the cell. This is useful in real applications and may be analyzed to understand what this prototype means. Delaunay triangulation helps to see how this space is partitioned, aiding presentation.
A: This is a longer story, I did not tell you how to implement MDS in practice, but if you are interested read about it and other visualization methods in this PhD thesis.
A: What other reasons can you see for distortions in such algorithm? Is it always possible to represent N-D relations in 2-D? What type of distortions may arise? Does "the best" 2D representation always exist?
A: Factor 2 is just to be sure that the first term measuring separability dominates over the second term. Several alternative formulations of this criterion exist and indeed in this form obviously the bigger SSV is the better, so you are completely right. I have added some remarks to the slides.
A: I am not sure that I understand that question. Probability density functions P(C|X) are used to categorize some phenomena; if you want to apply it to cube with faces in Slide 3 you have a choice in modeling: P(face|X) may have local maximia corresponding to X in corners of the cube and may vanish for some radiculous combinations; you may also mdoel just one face expression, ex. P(smiling_face|X).
A: Yes, but can you characterize them a bit better? Can it be a hyperplane, a circle, a hyperboloidal surface?
A: Linear parameters influence the result in a linear way, so if y(x;w,b) = w*x+b then w is a linear paramter. If RBF networks are used then the Gaussian width of their nodes are non-linear parameters, but the output weights are linear. So, what is the simplest way of finding linear parameters in this case? It is stated in lecture notes.
A: Density may be high but focused in small area, so say only around +1 it will be non-zero. Density may be low but non-zero for all values and then it is less useful.
A: Just try it: in most methods you will need to repeat the training procedure n times nad this is usually slow. What do you have to do in k-NN?
A: What happenes when one of the training vectors is slightly changed with borders from the decision tree? This has been clearly illustrated. Now what happens with the kNN borders?
A: I am not sure what you mean? How would you calculate MI for continuous features?
A: I am not sure which book you have; of course you can use bagging with a committee of different classifiers, but it is commonly used with decision trees, creating slightly different tree for each bootstrap subset. In this case different data is the source of model variability, but it may also be differnet types of models and stochastic learning that will create difference models.
A: So far I have answered all specific questions I've got but for a few questions such as this one it looks like you want to get easy answers to learn.
If you read Lecture 6, you will find description of principal components, and you should be able to answer this question easily,
giving me your opinion.
The point is not to learn what I want you to answer but to understand what the topic is about and make your own judgements.
I am sure that you have a good memory, but this is not a poetry or spelling contents.
Questions below are more or less specific, and if you speculate on important PCA properties and send me a ranking list I may point out a few more.
Asking questions like:
Sample Question 7: Describe how semantic map may be constructed and Why SOM and MDS semantic maps preserve natural classification?
Lecture 27, Sample Question 2:  What alternatives for parameter optimization exist?
Lecture 28, What are the difficulties in non-parametric density estimation?
What may I say? Repeat the lecture?
A: Indeed, two slides earlier it was still fine ... please watch for such errors, there are different formulations depending on what is defined as column and what as a vector. I have added explicit dimensions now to remember what is a column and what is a vector.
A: This is the goal, good models should generalize well and for that we need a small bias, that is a very flexible model, and small variance, to follow only real trends in the data, not the noise. If we take one model we face the tradeoff, but ensambles are not single models, they are a series of models that converge to a final model that is better in both respects. The tradeoff does not prevent invention of new, better models, it just says that if you take a fixed model with small bias and overtrain it to get 100% correct answers it will have a high variance and will generalize poorly on novel data.
A: Anyone how tried to view 3D histograms will know it; in rare cases situation may be clear, but usually one has to view it from different angles, trying ot rotate to see anything. It is mainly a problem with visualization, for a small amount of data number of points may also be an issue.
A: Minimization of the error function means changing model parameters (and thus generating multiple models) in such a way that the error function is minimized. I have added a slide with an explicit example to show how confidence in the decisions may be increased and errors reduced at the expense of the rejection rate and decrease in accuracy.
A: In some books cumulative gains are called lift charts, but marketing experts are careful to distinguish them; I have updated lecture 16 slides.
You may also look at detailed examples on the page:
http://www2.cs.uregina.ca/~hamilton/courses/831/notes/lift_chart/lift_chart.html
A: Amazing, but true! For symmetric arrangment of prototype points on y=x+b or y=-x+b line Manhattan metrics leads not to a single line, but gives rectangles that contain points that are equidistant from the two prototype points (for restricted data range, otherwise it gives quater-spaces in the upper and lower corner)!
Q2: How to draw the lines that are at the same distance from two points (x1,y1) and (x2,y2) using Manhattan metric? We need to solve the equation |x1-x| + |y1-y| = |x2-x| + |y2-y| to find a set of lines from the result of x and y. Is this correct?
A: Yes, but it may be easier to use geometrical symmetry arguments when drawing these points than solving such equations, as is evident form the example given above.
Try to see what happens ...
A: Yes, but there are two points to note: 1) this is true only if we take Gaussian distributions, 2) if they are diagonal then Euclidean distances are obtained, for non-diagonal covariance matrix decisions are still based on distance to the Gaussian center, but Mahalanobis distance should be used, with inverse of the covariance matrix.
Indeed, it is better to use the same letter m as a few slides before, thank you, corrected.
A: True, but note also that in slide 6 we show that the posterior probabilities minimize average errors of the decision procedure.
A: What they should sum to, for example sum of all prior probabilities should be 1. Sometimes we do not have proper esitmaiton of all probabilities, but know that class 2 is observed twice as frequently as class1, then we have to sum p1+p2=3 and divde (normalize) each estimation so that p1+p2=1 as it should, therefore sum rules are called "normalization conditions".
A: Indeed I did not write it clearly, it is the same as the cost or loss matrix lambda.
A: I understand that this question reading the lectures sequentially, becuase this topic is then explained in several lectures, mainly 25 and 26 on linear discrimination.
A: In the Duda and Hart book, chapter on non-parametric techniques, where more details may be found, including two problems that illustrate nicely how growing number of dimensions leads to difficulties in estimation of probability densities.
I gave some examples in the lecture notes, showing that to keep the same density - and thus roughly the same accuracy of approximation of the probability density functions - the number of points should increase exponentially with the growing dimensionality. High-diemensional geometry is very non-intuitive and surprising, but we had no time to discuss it. For example, with binary strings of dimension 1000 and randomly created 1 million strings, searching for the neighbors will find 98% of all strings different by 411 to 430 bits, surprisingly far, and only 1 in 10.000 times a string closer than 400 bis or further than 432 bits. Thus using a nearest neighbor strategy in highly dimensional spaces leads to quite wrong bias of the algorithm and hence larger errors, as shown in slide 8.
A: It all depends whether we have enought reliable information.
If we do not know anything we freqently go by the majority rule, using implicitly majority classifier; for example, this leads to stereotypes that many people use in thinking about other people. If we have sufficient data to calculate class-conditional probability densities then maximum likelihood predictor may be used, or even better MAP; if risks of different decisions are known or may be estimated then an extension to minimum risk is possible.
Note that in the usual case comparing two decisions maximum likelihood and MAP are equivalent because the evidence p(X) cancels. There are more complex cases when rejection of low-confidence decisions and additional constrains on decisions may introduce a difference between these two fomrulations, but we have not discussed them.
A: Physics is using parametric models in the description of simple phenomena, and I it is possible to use guidance from physics our model should incorporate it, it would be an appropriate bias. Unfortunately many phenomena are too complex to describe them by parametric models. Take protein folding as an example. The problem is simple: a chain of aminoacids folds into 3-dim structure that has minimum energy (free energy, to be precise). We can write the equations and they are useless, too many parameters to optimize. Assume that a group of molecules is dscribed by a single potential and introduce some parameters to model it. Now the number of parameters describing the aminoacid chain is much smaller, although it still may be of the order of a million. Try to fit these parameters using the known protein structures - Nature has done some computations for you - and use it to predict new structures.
What is the optimum complexity of your model? I have made some remarks very briefly talking about applicaiton of infromation theory to determine optimal model complexity: AIC (Akaike Information Criterion), BIC, Minimum Description Length, etc. These methods try to estimate generalization error analyzing the training error, frequently using some form of bias-variance decomposition.
In computational learning theory this is an important branch. A good intro is in the Hasti et al book, Chapt. 7, that I recommended at the beginning. In practice, however, crossvalidation is the easiest if you can afford computational costs. It is a good topic for theoreticians, but I would not recommend it if engineering applications are the main goal, becuase it is not our biggest problem and we should not expect great breakthrough that will allow us to set easily much better models than we already have.
COLT (Conference on Learning Theory) conferences, http://learningtheory.org/colt2005
are devoted to learning theory, the first 3 topics they mention are:
Analysis of learning algorithms and their generalization ability;
Computational complexity of learning;
Bayesian analysis.
A: Stability of the models is very important, and it is reflected in the variance of the model during crossvalidation training or bootstrap sampling. If the model gives more or less the same answers it has a low variance. Decision trees will give the same answer independently of the order of presentation, but they are not stable and have large variance. Averaging the answers of many trees in bootstrap procedure gives you quite stable models.
This is an important issue: if you want to have a robust, simple description of the data you should find a method that has good bias for the data analyzed (unfortunately different data require different biases, and thus methods) and find the simplest model that is not significantly worse than others, this may have a small variance. We have tried to pick up rules from trees that are the most common in the CV procedure to achieve it, giving a very simple logical description that does not change when we sample the data many times creating new training sets.
Sometimes it really works, but it is a a challange to find such rules.
One way we have tried to solve it was to introduce hetereogenous systems, based on functions of different types, for example different types of tests for decision trees in each node, different neural node functions, or differnt local distance functions in similarity based methods. This type of models may find best bias for the data but are rather hard to train and the issue how to create them is still open. Some papers that attempt to create heterogenous neural, decsion trees and similarity models are linked to the page:
http://www.fizyka.umk.pl/kmk/publications.html
This topic is not yet popular but I consider it to be one of the most important directions in computational intelligence, finding simplest models that can solve the problem. Consider the parity problem: a string of n bits B=000110...1 has even or odd number of 1 bits. XOR is just the parity with n=2 and it is already not easy for some models, but parity with 10 bits is terribly dififcult for RBF or MLP neural networks, similarity models and decision trees. It may be solved in a trivial way by taking as class c=cos(Si Bi), that is a single node network with all weights equal one, if an appropriate function is found.
I have meany references to filters and distances between distributions here, in a paper still in draft form, it should be corrected soon. This is an interesting topic that has applications in many adaptive systems, people compare for example histograms of images trying to find something that looks similar.
This has changed in the latest version. LDA is accessible now only by choosing from Metalearning models "Classification via Regresion" and then choosing from "Functions" Linear Regression insted of M5P; to make standard LDA switch off feature selection and elimination of colinear features, and put ridge regression to 0. These parameters allow for selection of features and ridge regression coefficient C controls margin by adding C||W||2, like in SVMs.
Q: This is an important question and it deserves more time. I will try to answer it in the last lecture summarizing and outlining new directions.
Q: I am not sure what you mean by the "independent rate", because this term is not commonly used. It is a measure of mutual information, or how much can we predict the distribution of values of one variable by knowing the second. There are many ways to measure this: by correlation coefficients, by distances between distributions and other ways; a review of some methods may be found at:
http://www.fizyka.umk.pl/publications/kmk/04-filters.html
I have tried to do it once but have no notes around; it depends on the assumptions about the way your laws are interrelated. For example if A1=F(A2,A3), A2=F(A3,A4),.. An-2=F(An-1,An), then only 4n+1 solutions are possible.
It may be difficult to find general relations, only in special cases this is possible. Conflicting laws are unlikely to appear. From observations of situations (1) or (2) we will infer that there is no relation between A, and (B,C). The density describing associative relations should reflect this.
It is an interesting direction for research that - as far as I know - has not yet been explored. There are some mehtods to extract associative rules from data (WEKA has some programs for that, but I have no time to talk about it), looking at correlations between different features, but this has not been extended to systematic reasoning in more complex situations.
A: Both C and the Gaussian dispersion paramter b (called "bias/slope" in the GM configuraiton for SVM) are determined by crossvalidation. A mesh of C and b paramters is created, crossvalidation run for each (C,b) value, and maps constructed showing accuracy and the number support vectors. This may be done in the off-line case only, but paramters C/b should not be so critical, once they are roughly determined on some subset of data SVM should work well unless the new data is very different. In the on-line case we will first gather some data, determine reasonable C/b parameters and then may proceed on-line. There is no direct attempt to balance bias and variance of SVM (no formulas that measure it are optimized), but indeed, high C allows for small bias at the expense of variance, and small C sets a high bias (requiring large margins) that reduces the variance, so implicitly bias/variance is optimized.
A: Gaussian kernels usually work better, they also can separate complex distributions, polynomials have oscillations between known data clusters and are therefore rather dangerous to use. This is trial and error game, and although data that do not contain identical vectors with different labels may alsways be separated (for example by putting Gaussians with small dispersions at each data point) this is not what we want, becuase generalization will be poor. The "Cleveland heart" example showed it clearly, it is better to aim at similar accuracy on the training and test partitions in crossvalidations.
A: Both are good and one can come up with more approaches ... in practice differences are frequently small, so one should go with the approach that is simpler, computationally less demanding and converges better for a given taks.
Conceptually the differences are significant. In maximal likelihood approaches we assume fixed values of parameters that are estimated, in Bayesian approaches some prior distributions of parameters are assumed, and these distributions are changed ("learned") as a result of new observations. There is a simple realation between maximum likelihood (ML) and max. a poasteriori (MAP) estimation of parameters; MAP is reduced to ML if a uniform distribution for paramters theta that are estimated is assumed.
Dear Sir, can you give me some general ideas about all the existing classification methods? Such as what is the hidden criteria in all of them? If there are too many things, can you recommend to me some books or papers talking about these kinds of topic?
Q: Yes, I am trying to show you how all types of classifiers may be derived from Bayesian approach as special cases, and that sometimes it may be worthwhile to optimize parameters of simplified classifiers directly becuase in the Bayesian formulation the number of parameters may be too large and thus estimations would not be realiable.
There are very many approaches to adaptive systems that may be used for approximation, classification, discovery of rules, correlations or associations. The books I have mentioned in the first lecture - Duda, Hart and Stork, and Webb - contain excellent theoretical introduction to the problem, but no book is complete and becuase there are so many good methods it is probably not worthwhile to spend much time on their improvement, just select what you need and apply it to interesting problems. Solving classification problems may be done in 1000 different ways with 10 pre-processing, 10 classification methods each with 10 parameters there is 10x10x10 combinations! To make progress we should know what types of methods exist, what type of information they provide and when they will be useful, learn how to analyze the data, as some of you have done quite well in the assignement 1.
A good book with little theory comparing different classification methods may be found
at this link.
A:Please check my links first, http://www.ntu.edu.sg/home/aswduch/software.html Link to C4.5 tutorial is there.
A: Normalization will simply shift and rescale all data to [0,1] interval; as a result if there are a few very large or very small data the variance becomes very small, with most data concentrated around the mean and a few samples near 0 or 1. Standardization will make zero mean and keep most of the data withing a single standard deviation from the mean, with the unusall values far from zero. In methods that use similarities or distances - for example in the nearest neighbor methods - standardization usually works better, but in some methods - for example in most neural networks - inputs should be rescaled to [0,1] or even [0.1,0.9], so normalization should be used.
In general optimal scaling of features is neither normalization nor standardization, but should be a part of data model that includes data transformation and classification.
For example, some features may be removed (scaled by zero factor) in feature selection.
Some methods - decision trees or linear discrimination methods, for example - do not require any data transformations, while other methods are very sensitive to such transformations.
If several Gaussians with differnet covariance matrices form your data there will be no way to linearly transform this data to get idetnical covariance matrices. Finding non-linear transformation that could achieve this in non-trivial and would require first identification of these clusters, esitmation of their parameteres and than making non-linear transformations, a rather hopless case. But, as we shall see soon talking about SVMs, adding non-linear combinations of features to inputs may create transformations extending the feature space in a way that linear discrimination will be sufficient.
Quadratic decsion functions come from the fact that we are taking log of the ration of two such distributions; decision functions in d-dimensional space are therefore described as some multiquadratic forms. If more general distributions are taken, or even if the probabilitiy is modeled as a mixture (a sum) of several Gaussians, decision borders will not be polynomial at all.
Yes, independency implies uncorrelated feature in the sense of diagonal covariance matrix; we cannot guarantee independency by uncorrelating features, that is why PCA, creating uncorrelated features by diagonalizing the covariance matrix which is based on the second-order statistics, is not the same as ICA, that seeks independent features, for example by using higher order statistics, like kurtois. Using features from ICA we should get much better NB results if features were initially correlated.
We are trying to estimate the risk of some decision procedure, such as the risk of using Bayesian classifier (including the way data is preprocessed and probabilities are estimated). If there is no loss, accuracy is perfect, there is no risk. Risk of individual decision is therfore connected with loss or costs we shall pay if we make error of particular type. Accuracy = 1-error; losing accuracy means making error. Let's say that  making correct decision you could gain 1, but your system gives you aonly accuracy A; then 1-A is the error, and times Cost of this particular decision (or loss) ti will give you risk, so it does not matter if we talk about gain, accuracy or error.
We could calculate the risk for individual decision: if probability of correct decision for vectors similar to X is close to 1 than the risk is small. Although such local estimation is useful if we are trying to compare different decision-making procedures we have to estimate total risk by averaging over all possible actions.
(2) And the optimization indices of these compressed representations seem the same, too. All of these methods try to reduce the inner scatter distances or the quantization errors: the total dissimilarities between each original point and their corresponding prototype.
A: Of course, although many variants of such indices have been invented in clustering.
(3) In K-means clustering, it seems the algorithm use a fixed number (k clusters) of cluster centers as the prototypes, and the following iterative optimization step tries to reduce the total inner dissimilarities in each cluster. SOM seems better than k-means. Although it also uses a fixed number of prototypes to represent the original data, it is more flexible since it will allocate more prototypes (the nodes) in the more complex (more diverse-as you showed us in the Italian Olive Example, Sir) regions. SOM obtains this flexibility by connecting each node with its neighbors, so the more diverse (complex) region will update its representative node more frequently, and thus pull over more points into this region to get a more detailed presentation. Moreover, SOM is more powerful than k-means since it can also reduce the dimensionality at the mean time. On the other hand, GCS seems more powerful and flexible than SOM. It not only allocates the number of prototypes dynamically according to the diversity/complexity of the region, but also creates more prototypes dynamically. So the number of its prototypes is not fixed, which is more flexible than both k-means and SOM.
A: As a clusterization method K-means algorithm is not bad, SOM may not be so great because of the constraint to form topographical representation. LVQ (Learning Vector Quantization) method (also introduced by Kohonen) works like SOM but does not try to achieve topographical representation and therefore it is more accurate.
(4) So my question comes here, :-). (Sorry for such a long introduction). In the dynamics of GCS algorithm, Sir, you told us there are two alternative criteria of finding the “worst” node-the most possible node that needs adding new neighbors to help it to represent the diverse/complex region. The first criterion is the standard quantization error as you mentioned in the slides, it seems more like the classical inner dissimilarity standards. By this criterion, the goal of GCS is the same to SOM and other clustering algorithms, as mentioned above. The other criterion is the (normalized) frequency counter used in the GCS algorithm. Sir, my question is, the second criterion seems very different from the first one, since the large inner dissimilarity (first criterion) needs not to be implied by the high updating frequency of the node (the second criterion). The high updating frequency may also happen when the density of the points are very high while the diversity (such as covariance) of them is not very large. So, Sir, why can we use these two criteria alternatively? Do we choose the second criterion just because of its computation simplicity? Is there any other interesting thing behind the idea we discussed here?
A: The two criteria may give slightly different results. The original algorithm used the frequency counter but soon other versions were tried.
GCS algorithm guarantees that either quantization errors are similar for each node, or the frequency of winning is roughly the same.
The total quantization error is a sum over all vectors in the Voronoi cell of the node.
High quantization error is achieved when the cell is large and has a few vectors in it; then it is worth splitting using the quantization error criterion.
By the frequency counter criterion the prototype in the cell is selected rarely, so it will not be split. The maps for strongly non-uniform density of data points will be rather different, but if the points have similar density in most areas differences should not be large.
Many variants of GCS exist, including a supervised version (which happens to be a constructive RBF network). It is a good method of visualization and classification, but unfortunately there are no good programs to experiment with, except for the demo I've showned, and the GCSVIS package (see my software links).
A: It is much simpler than that! There is no assumption about Gaussianity of the d-dimensional data in the feature space. We simply need a function that falls gradually with a distance, otherwise all nodes in the mesh will be pulled in the direction of the presented data vector X, and we want only those that are close to the winner to be affected. A Gaussian function simply falls to zero with a distance, but a function h0(t)/||r-rc||2 or anything similar would also be good; functions that grow with the distance are bad because the influence on the weight vectors of neurons that are far from the winner is growing in an unbounded way. We could also invent a simple rule saying that only the winner and the nearest neighbors are going to be pulled in the direction of X. In practice it does not matter much, if the speed of h0(t) decreasing is suffiicently slow the map has a chance to organize itself, otherwise it may not reflect the data distribution.
A: It is an interesting idea and I am glad to hear that you are thinking in this direction.
The literature on SOM is now really extensive, with more than 5000 papers published up to 2000 (the list is here
http://www.cis.hut.fi/research/refs/. Many variants have been considered, and supervised SOM among them. In the Kohonen book (1995 version) and in the SOM Toolbox there is a very simple approach to supervised SOM, based on adding class information to the data. After training this class information has to be removed, otherwise new vectors of an unknown class will have different number of components.
The idea of updating the winner only if it is from the correct class is used in some algorithms in different variants.
Supervised SOM is not very popular, although sometimes it could improve the quality of maps. Most people think that it should be used to visualize distance realations in the feature space, and adding class labeles changes the maps in a way that is hard to understand. DCA (or FDA, as it is usually called) simply gives another angle of view on the data, but SOM creates non-linear mappings that are not so easy to understand if manipulation with the original data is allowed. We already have a problem with data preprocessing, such as normalization or standardization, that change the SOM maps.
My first name - Wlodzislaw - is unfortunately quite difficult to pronounce. V-wo-gee-s-wa-v, where V- like in V-incent; wo - like in wo-man, gee, s, wa - like in wa-nt. Last name - Duch - is easy, with like Bach, but with D.
There is much more to these transformations than I was able to tell you. Variance of variable X tells us only how much we can expect that it will deviate from the mean E(X). If X has Gaussian distribution than we can expect that 2/3 of all data is E(X)± STD(X), and 95% in E(X)± 2STD(X) where STD is the square root of variance.
Standardization is needed in various situations, for example if the data is transformed by changing the units (so X is no measured not in meters but centimeter). If we have two variables, and one has large values becasue it is measured in small units, and the other small values because the units are large, it may be hard to compare their influence on the distances, because the first will dominate and the second will change little. If any clustering of the data is done and similarity is measured using distances only large values will determine how data is grouped, but the second variable may be also useful to distinguish between different groups of data. Therefore it may be better to standardize the data first, making it invariant to scaling transformations.
There may be other linear rescaling of features, for example normalization, shifting all data to the [0,1] interval, so Z=(X-E(X)/(max(X)-min(X)). Unfortunately results of visualization, clusterization and classification depend on how we are transforming the data and there is no best way to define scaling factors. The goal is to balance contribution of different features to the distance evaluation. Normalization and standardization are not ideal methods, they may unfortunately decrease the influence of some useful features. This is a viscious circle: withouth knowing what we want to see in visualization we cannot set our priorities (scaling factors) correctly; to see data structures we need to know how to scale features, but to scale features properly we need to know what is interesting in the data ...
In addition if different features are correlated than rescaling of features may not be sufficient. Suppose that Y=2X, then if we take both X and Y into account we are going to take the same information twice, and other independent features will have relatively lower contribution. Therefore it is good to use Mahalanobis distances, that make the distances invariant agains the rotations, but not the rescaling. We may perform the transformation to the Principal Component space and then standardized the transformed features. Again it may not always be a good thing to do: we may want to reduce the dimensionality of the space, but the may want to reject first some features. Note that for a gourp of linearly dependent features only one will have non-zero eigenvalue. Rejecting features corresponding to small eigenvalues (that is small PCA variance of transformed features) may also be a bad idea becasue they may high discriminative value; I will talk about the discriminant feature analysis next time.
If the data cluster has approximately a multivariate Gaussian shape Mahalanobis distance provides a natural metric to measure the distance from the mean of the cluster to the data: the unit in each direction is the variance (for uncorrelated features C-1 is diagonal with the inverse of the variances), and the measurements are along the pricipal axis.
In signal processing a common practice is to transform data to principal components and then standardize them; this is called "the whitening transformation". After such transformation we have full scaling and rotation distance invariance, that is if the data has been linearly transformed it will be brought to a standard form after whitening.
Correlation coefficient is simply standardized covariance; if we standardize the data first and then compute the covariance matrix then all diagonal elements will be 1 and off-diagonal will be equalt to correlation coefficients. If features are higly positively correlated than if one grows we should expect that the other will also grow; if correlation is 1 that implies linear dependence (please check this).
Surprisingly, independence of features is not the same as zero correlation! I will talk about it next week discussing Independent Component Analysis (ICA).
This is just the point: becase diagonalization of covariance matrix is unique (it is well behaved, positive definite), it gives as uniquely definied eigenvectors. The eigenvector that corresponds to the largest eignevalue transforms our data in such a way that the variance in this direction is largest.
Please read about matrix diagonalization in Chapter 11 of the Numerical REcipes book,
http://www.library.cornell.edu/nr/cbookcpdf.html
A: Yes, it does, but unfortunetely the best ordering depends on the data; in some cases it may be worthwhile to permute the order of parallel axes to avoid line crossing and see the structure better. You can find some examples on the pages linked to my slides, including some oder techniques, like clusterization to make the whole bundle of lines fuzzy. Parallel coordinate techniqe is used frequently in bioinformatics, with each coordinate representing a gene activity level, where the order of genes is arbitrary.
Q: On slide #21, the picture of a 3d sphere does not look symmetrical, which is
contrary to my intuition.  Is it because of the artefacts introduced by image scaling?
This may be one factor but symmetry is also affected by the choice of the points; it is not easy to select points from a sphere to preserve perfect symmetry.
This is indeed what many people do. Development of computational tools is important and it will for some time be sufficient to make your PhD and write a few papers in technical journals. However, to really make progress in bioinformatics knowledge of molecular biology and the importance of biological functions of genes and proteins is the most important, and that goes way beyond technical knowledge. If you try to read papers in Nature and Science you will find that technical part of bioinformatics is usually hidden, and yo make important discoveries one needs to know medical and biological importance of various processes.
AI = higher level cognitive functions, problems solving, thinking, reasonin, planning, therefore symbolic methods, expert systems, knowledge representation, natural language processing. Search-based algorithms are used to find the best solutions by looking at combinations of different moves, like in board gamess (chess, checkers).
CI=lower level cognitive function, signal analysis, object recognition, discovery of interesting patterns in data. Neural networks, statistical methods, pattern recognition methods, are all part of CI. Optimization methods are used to adapt parameters of these systems to account for data structures.
A: I know that it is confusing! As I told you there are no universally accepted definitions even of the term Artificial Intelligence, although it has started in 1956. Pool et al in their book decided to steal the term and use it for a typical AI book; this adds to the confusion. Instead of going on for an hour on the topic I gave you my definitions which seem logical to me but are by no means used by everybody.
Q: For the term soft computing, we can generally refer to Zadeh [1] who coined the term.  Some web sites claimed that Soft Computing is sometimes referred to as Computational Intelligence http://library.gsfc.nasa.gov/SubjectGuides/SoftComp.htm. Some authors claimed that Bezdek [2] coined the term Computational Intelligence http://ci.uofl.edu/zurada/ciil_book.html.
[1] L. A. Zadeh, 1994, "Fuzzy Logic, Neural Networks and Soft Computing" Communications of the ACM, 37(3), 77-84.
[2] J. C. Bezdek, 1994, "What is computational intelligence?," in Computational Intelligence Imitating Life. 1-12, New York: IEEE Press.
A: My definition agrees with Bezdek, leaving knowledge-based system for AI. Zadeh initiated joint conferences on neural, fuzzy and evolutionary topics calling them soft computing, but this type of division, combining several branches of science and excluding others that solve exactly the same problem (for example statistical methods, decision trees and optimization theory) is not good because science is defined by the problems, not by the tools that someone arbitrarily chooses. Soft Computing still apears in the names of conferences/books but CI becomes more popular, especially after IEEE has changed the name NNS (Neural Network Society) to CIS (Computational Intelligence Society). Noone seriously proposed the name Soft Computing Society.
Q: Take for instance, two topics which have vague boundaries are Machine Learning and Data Mining, but the according to some authors, eg. Jiawei Han and Micheline Kamber [3], the clear distinction between these two topics are that the latter deals with the discovery of patterns in large databases.
[3] J. Han and M. Kamber, 2001, “Data Mining: Concepts and Techniques”, San Fancisco: Morgan Kaufmann.
A: There is also no clear distinction here, ML methods may be applied to Data Mining (DM) problems and not all DM is about huge data. Traditionally ML people were solving simple problems requiring symbolic representations and genreated many interesting algorithms. DM people came from the databases and on-line analytical tools, stressing the importance of methods that work on large databases. I would say that data mining is a part of computational intelligence specializing in analysis of large datasets, and incorporating many steps related to cleaning and acquiring the data that CI does not focus on.
Q: Now back to the question regarding CI and AI, here are more additional questions:
1.  Is there someone who coined the term Computational Intelligence?
A: The term has been floating at the begining of 90-ies, book [2] was not the first to use it; look at CI bibiliography:
http://www.computelligence.org/download/cibiblio.pdf
World Congress on CI was held in 1994, bringing NN, FL, and EA people together; Robert Marks has discussed it in 1993 in a paper:
http://www.ecs.baylor.edu/faculty/marks/REPRINTS/1993_IntelligenceComputationalVersus.pdf
Probably the first paper discussing CI is by J.C. Bezdek, On the relationship between neural networks, pattern recogniton adn intelligence.
Int J Approximate Reasoning 6 (1992) 85-107
I have encouraged myself IEEE officials to form CI Society instead of forming IEEE Neural Network Society,
but this change of name was finallized only this year and IEEE CIS was formed.
2.  What is your opinion on the authors of http://www.cs.ubc.ca/spider/poole/ci.html who claimed that Computational Intelligence a modern term for Artificial Intelligence?
A: They wanted to sell AI textbook and called it CI ... just adding to the confusion.
3.  If CI is distinct from AI, is there literature which one can quote that indicated the distinction as mentioned in Q/A No. 2?
A: Yes, for example Bezdek paper. I have also made some remarks on that in these two recent reviews:
http://ci.uofl.edu/zurada/papers/ZuradaSetionoDuchPIEEE2004.pdf
http://www.fizyka.umk.pl/publications/kmk/04-Challenges.pdf
Unfortunately terminological discussions go on for ever.
It is true that AI and CI people do not meet at conferences and their societies do not talk to each other, problems they solve are quite different.
I hope that my proposal to disctinguish low and high-level cognition will some day be accepted.
ML has developed in AI to automatically find the rules that a reasoning system could use. These were mostly inductive methods based on covering examples with logical rules of some form, such as "version spaces". The goal was to understand the data in the way humans do, using logic and rules that could be useful to reason in expert systems. Later ML developed universal methods that could be also used in pattern recognition, such as the AQ family of algorithms, COBWEB conceptual clustering etc. Some people refer now to all adaptive methods as "machine learning".
Pattern recognition has developed in engineering from multivariate statistics, probability theory, signal analysis, analysis of real, continous valued data. It has never tried to derive logical rules, just to model data, predict, cluster and classify.
ML and PR methods have some overlap now, but the two communities are still rather distinct, ML people are connected to AI communitiy and their conferences, while PR people to engineering.
A: This is explained very well in Kecman's book that I have recommended, unfortunately I do not plan to cover fuzzy modeling in this course, thre are separate courses on this subject. In essence it is very easy: instead of using intervals to turn on or off the controler when temperature is in or out of the range [a,b] a membership function (for example, of trapeozidal shape) is used turning it gradually on and off.
Q2. What kind of uncertainty exists, besides input uncertainty, in a given training data in the context of pattern recognition problem?
Q3. If the uncertainty besides input uncertainty exists, how does fuzzy membership function capture and represent this uncertainty? In order words, how to create the membership function that represent this uncertainty?
Q4. Does the handling of uncertainty improve classification accuracy? You mentioned in [6] that accuracy of crisp rules extracted for a number of real world datasets proved to be higher than of any other classification methods, including fuzzy-rule based systems. But you also mentioned in [1] that the introduction of fuzziness is not only desirable, but in most cases it is also unavoidable. Thus, in your opinion, does the introduction of fuzziness improve accuracy in pattern recognition? If not, what is the advantage of introducing fuzziness in pattern recognition?
Q5.  In your opinon, which can handle uncertainty in Pattern Recognition problem more: Probability Theory or Fuzzy Theory?
[1] Duch, W. (2004). Uncertainty of data, fuzzy membership functions, and multi-layer perceptrons. Fuzzy Systems,  IEEE Transactions on, to appear.
[2] Klir, G. J., & Wierman, M. J. (1998). Uncertainty-Based Information. Heidelberg: Physica-Verlag.
[3] Klir, G. J., & Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic Theory and Applications. Upper Saddle River, NJ: Prentice Hall.
[4] Sarkar, M. (2002). Rough-fuzzy functions in classification. Fuzzy Sets and Systems, 132(3), 353-369.
[5] Medasani, S., Kim, J., & Krishnapuram, R. (1998). An overview of membership function generation techniques for pattern recognition. International Journal of Approximate Reasoning, 19(3-4), 391-417.
[6] Duch, W., Adamczak, R., & Grabczewski, K. (2001). A new methodology of extraction, optimization and application of. Neural Networks, IEEE Transactions on, 12(2), 277-306.
A: This is indeed enough questions for separate set of lectures or for a book! I am not such an expert on fuzzy systems and since the literature on the uncertainty is vast there may be books that I do not know about.
A1: I do not know any good book that really reviews all methods used for determination of MFs;
there are of course many papers on different approaches that one may find typing "fuzzy membership functions" in Google. You may read Fuzzy Logic FAQ, although it is old now,
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/ai-repository/ai/html/faqs/ai/fuzzy/part1/faq.html
You may also pose question on fuzzy logic discussion groups
http://groups.google.com/groups?group=comp.ai.fuzzy
A2: I have not seen any papers summarizing it but at least these 4 type of uncertainty exist:
Stochastic uncertainty: dice, accident, insurance risk - requires probability theory
Measurement: 3 cm; 20 points - distribution of results, dealt with by statistics.
Errors in data: we cannot be certain if the values are correct, some may be rejected, some corrected.
Uncertainty due to many factors, lack of information which is hidden in large amount of data: biomedical, commercial, financial databases, insufficient medical tests - data mining, searching for information in data.
Linguistic imprecision: small, fast, low price - fuzzy logic tries to makes these concepts more precise.
Some more information is in my tutorial
http://www.is.umk.pl/~duch/ref/kdd-tut03/04tutorial.html
Lotfi Zadeh also in his talks recently says that the proper area of fuzzy logic applciation is linguistic imprecision.
A3. Many papers on application show how to create membership function that represent various uncertainties; linguistic concepts are used, such as fast, slow, moderate, and then some simple functions like smeinlinear or trapezoidal fitted. I cannot discribe the whole process here, please consult some books on fuzzy systems, for example Kecman's introduction.
A4. Yes, it may be very useful and lead to more natural answers. Adding Gaussian-distributed noise to inputs helps to find better answers (there is mathematical proof of this in the book of C. Bishop on Neural Networks). Even if fuzzy rules are sometimes more accurate they still give yes/no answers where gradual anser would be more appropriate. Are you sick or healty? This for people who are a bit sick is not yes/no, but gradual, there is a degree of being sick, or a degree of "high" fever. It woudl be very interesting to learn how to calcualte with distributions given as inputs, instead of single numbers. Frequently we know this distributions, for example in medical tests accuracy is given by Gaussian distributions, in other situations by Poisson, and sometimes by uniform distributions. These distributions hsould propagate through our computing system and give distributions as outputs. In control processes we may be explicitly interested in reaching values that are within some tolerance in a process where we can regulate some input parameters and one to obtain some precision at the output. Fuzziness may be useful in many ways.
However, frequently introduction of fuzziness is done in a naive way and does not contribute to improvement of pattern recogniton systems. I have quotd some examples in [6].
A5. In Pattern Recognition probability concepts are more fundamental; in fact there is little advantage to use fuzzy logic for pattern recognition if the number of rules is too large to give understandable description of data. If the goal is to understand the structure in data and visualizaiton or crisp logic is insufficient fuzzy rules may be attmepted, but if too many are created by neurofuzzy systmes they do not teach us anything new and usually more accurate methods may be used. However, in rare cases it may happen that simple assumptions about data are
This is a question for the EA course, rather than CI ;-).
EA are not optimal as far as I know. Best results are obtained when operators that know about specific problem structure are used - see recent papers of Zbigniew Michalewicz. Combination of EA and learning is very promissing, Ryszard Michalski had a good paper on this subject. Use Citeseer to search for these papers  - it is linked to my page
http://www.ntu.edu.sg/home/aswduch/CI.html   or search on my page for AI/ML
http://www.ntu.edu.sg/home/aswduch/ai-ml.html
My own opinion is that people have put too much effort into EA and too little in comparison with classical techniques. See for example this survey:
http://www.fizyka.umk.pl/publications/kmk/99globmin.html
I will not talk about optimization in this course at all, EEE has a course E6002-Genetic Algorithms, where this will be very appropriate. We cannot fit everything in one course ...
It has to be bounded from below! Square error has a minimum =0, but cube error can decrease to - inifinity. There are many other error functions, for example based on entropy, but sqare error is the simplest, it is easy to compute gradients, and it can be proven that neural networks using this function approximate correctly probabilities. This is an interesting question though and I have once devised a method to accelerate learning by using initially high exponent a in the error function Sum ||Output(x)-Target(x)||^a. Unfortunately this idea has never been published so if you would like to pursue it please see me!
We will see it in some examples later, but it is clear that for ordinal features intervals may be used, for example [yelow-red], while for nominal features only sets, for example {taste_mango, taste_orange ... taste_apple} = test_fruit. The number of subsets that may be generated from N symbols grows like 2N. In data mining to find a subset that can be used in a simple rule will therefore be difficult. It is easier to convert ordinal values to numerical values and thus easier to deal with them. Most classification and approximation methods work only with numerical values that are the easiest to handle.
This is true for graphical models that are very fashionable now, so you are on a good track; I will say a bit more about Bayes formula in lecture 4.
If your topic is already well defined than it may be good to take relevant courses to learn the background; please look at some introductory books too.
In most cases not too high, some linear algebra and analysis may already be sufficient. Computer algorithms are frequently based on rather simple mathematics. Neural networks require mainly good skills in calculation of gradients and partial derivatives. But it is its hard to say anything general, because some reasearch goes into specialized subjects, optimization benefits from operational research courses, if you work with signal analysis than there are courses covering specialized methods in this field. More math is always better because it gives in ways of thinking about the problem.
Unfortunately I am not familiar with this paper and do not actively follow what happens in this area of neural networks. Computational intelligence covers so many things that it is impossible to known everything! It is always better to find someone who works on similar subjects, otherwise your poor lecturer will have to spend a lot of time trying to understand papers he or she has little interest in!
This is easy knowadays with the use of Internet; I have many students in Poland and spend some part of the day advising them ... In Nicolaus Copernicus University we had a New Zealander with our faculty, collaborating with his friends in NZ and Australia and some USA people, most of his papers were co-authored by people from 3 continents.
Great thinkg about research - at least in engineering or natural sciences - is that it is almost totaly independent of your language and culture.
You may find a common language with people from any country and ethnic background.
2003 results.
By 10 Jan 2003 I've got 11 answers on paper + 1 by email:
Some would like to hear about neural networks:
EID-Virtual Classroom for Artificial Neural Networks (E279-E032-02),
Instructor(s): Mdm . Tsai Flora S, A/Prof. Chen Lihui;
SC436-NEURAL NETWORKS (SC436-SC4-02S1), Instructor(s): A/Prof. Rajapakse Jagath Chandana.
Evolutionary algorithms and genetic programming are a subject of:
E6002-GENETIC ALGORITHMS (E6002-02S2)
Instructor(s): A/Prof. Ponnuthurai Nagaratnam Suganthan, Prof. Narasimhan Sundararajan, A/Prof. Wang Han
Data mining
H6405-DATA MINING (H6405-02S1)
Instructor(s): A/Prof . Ng Wee Keong
Fuzzy systems:
H6405-DATA MINING (H6405-02S1);
Instructor(s): A/Prof . Ng Wee Keong
E485-NEURO-FUZZY CONTROL (E485-LE-02S1);
Instructor(s): A/Prof . Wijerupage Sardha Wijesoma, A/Prof. Paramasivan Saratchandran, A/Prof. Song Qing
E6224-NEURAL & FUZZY SYSTEMS (E6224-02S1)
Instructor(s): Asst Prof . Mao Kezhi, A/Prof. Paramasivan Saratchandran, A/Prof. Song Qing
Some would like to see specialized topics.
Dempster-Shaefer possibilistic theory is sometimes mentioned in AI courses:
E483-ARTIFICIAL INTELLIGENCE (E483-LE-01S2)
Instructor(s): A/Prof. Chen Lihui, A/Prof. Chan Chee Keong, A/Prof. Wang Lipo
SC430-ARTIFICIAL INTELLIGENCE (SC430-SC4-02S1)
Instructor(s): A/Prof. Michel B Pasquier
Introduction to Artificial Intelligence
Time series analysis for finacial application may be more appropriate for:
FE8905 - Artificial Intelligence Techniques in Finance (FE8905)
Instructor(s): A/Prof. Lim Meng Hiot
This course explores some of the AI techniques used in financial applications. Participation in a significant project is mandatory in the course.
More specialized bioinformatics course
MSC in BIOINFORMATICS (BI-MSC)
Instructor(s): A/Prof . Kwoh Chee Keong
2003: Thank you for sending the questions. Few are answered below:
Crossvalidation is used for two reasons: to determine the expected statistical accuracy of classification models, and to find the best parameters of the model that do not lead to overfitting of the data. The final model is created on all available data using optimal parameters determined in CV, and then applied to the new data.
For example, the optimal number of nodes in the decision tree is determined using CV, along with the expected statistical accuracy, and then such a tree is created using all data; it may have slightly better accuracy than those achieved in CV test on the new data, since it has been created using all available data. In models based on basis function expansions, like RBF, the number and some parameters of these functions may be determined as an average number that maximized the validation accuracy in the CV test.
For some models, like LDA, CV will be used only to estimate statistical accuracy, since there are no model parameters that may be set.
In most systems this is rather confusing, but in Ghostminer this is rather clear. For example, after running the SSV tree which is trained by CV optimal parameters are found, and the following information is reported: train result from a final tree, and average train/test results from CV, and individual results on each CV partition.
You may pass, but I hope that you will be more ambitious than that ...
No, this is just for people that will find applications of some of the methods that I have talked about, these books are good start to get deeper into the subjects.
If you understand the topics than there are no formulas to memorize! Everything should be obvious. True positives, flase negatives, formula for information, joint information, SOM update rule, or the margin and SVM criterion, if the concept is understood than there is no need to memorize formulas, they can be written in a natural way.
There were very few longer theoretical derivations, SVM and bias-variance decomposition are the two main ones, all others fit on one-page. Good part of these lectures were demonstrations that I obviously will not require during the exam. What is left is really not that much, some visualization techniques, Bayesian fundations, decision trees, linear discrimination, kernel methods, density estimation, information theory, some applications and model improvements.
Unfortunately telling you which lectures should be focused upon will break the fundamental rule that all information related to examinations whould be kept secret.
Similar to the lecture notes. I do not expect very detialed long derivations that you may find in the textbooks. I am sure that those of you who will really need it will go through them anyway.
This year any 4 questions out of 5 should be answered. If you answer all of them the best questions are selected, but there is no way to give you A+ as a reward. Please note that each question may cover different lectures, so it is not sufficient to learn half of the course material !
Page 11 in Lecture 4 shows the contigence table, it is simple joint probability distribution P(r,C), where r are the objects in some bin (for example in a histogram), and C are classes. You should know how to compute these numbers, what will you get summing row, columns, all elements, how to get conditional probabilities from joint probabilities. Use the histograms on a precious page: there are 2 classes and 20 bins. Histograms are showing the number and the class of the fish, so the table is easy to construct, 20 rows with 2 columns.
Like anything else in parallel coordintes! Start from a circle or 2-D sphere. Select some points (x,y) on the circle and for each point plot the value of its coordinates on two parallel axes, one representing x, the second y, join the x, y points on the axis, since they come from the same point on a circle. Result is presented on slide 16. In 3-D again take some points on equator and a few other circles, this time they have (x,y,z), so 3 parallel axis are used. Result is again on slide 16 of Lecture 5. You may do it with 8-D sphere if you know how to calculate the values of the points on such a sphere. More segments, similar to 3-D case, will apear.
Usually binary trees are created but original algorithm allows to split nodes to more than two branches.
Large variance -> Low entropy(uncertainty) -> More information
Slide 7 of lecture 36 shows an example, and indeed you see that narrow peaks carry little information (becuase average suprise is low), while broad peaks, large variance distributions, contain more infromation. Think about some lottery in which a flat distribution over one milion people taking part in it is assumed; if you win, you are suprised, gain large information, and hopfully a huge prize. If on the other hand you know that only the family of the lottery organizer may win, the amount of information after announcing the winner is low ...
Sample questions that have not been used: