Goal of Computational Intelligence: not only prediction, diagnosis and classification, but also understanding of the data, explanation, reliability assessment, knowledge discovery, and building theories and causal models.
What type of explanation is satisfactory?
Interesting cognitive science problem.
Exemplar and prototype theories of categorization in cognitive science.
Types of explanation:
"Best explanation" depends on the particular field.
Logical rules, if simple enough, are preferred by humans.
Rules require symbols, defined by predicate functions Property(X) ∈ [0,1].
Perception:
Types of logical rules:
Crisp logic rules:
for continuous X use linguistic variables, Property(X) = 0 or 1 (False/True):
IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie)
ELSE IF ... ELSE ...
Crisp logic based on rectangular membership functions: True/False values jump from 0 to 1.
Step functions are used for partitioning of the input space.
Decision regions: hyperrectangular (cuboidal).
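The crisp Brownie rule above can be sketched in code: rectangular (step) membership functions make truth values jump from 0 to 1, so the decision region is a hyperrectangle. The threshold of 1.2 m for "small height" is an illustrative assumption, not taken from the source.

```python
# Minimal sketch of a crisp-logic rule with rectangular membership functions.
# Truth values jump from 0 to 1; the decision region is cuboidal.

def step(x, a, b):
    """Rectangular membership: 1 inside the interval [a, b), else 0."""
    return 1 if a <= x < b else 0

def is_brownie(height, has_hat, has_beard):
    """Crisp rule: IF small-height AND has-hat AND has-beard THEN Brownie."""
    small_height = step(height, 0.0, 1.2)  # 'small' = below 1.2 m (assumed threshold)
    return bool(small_height and has_hat and has_beard)

print(is_brownie(0.9, True, True))   # inside the cuboidal region -> True
print(is_brownie(1.8, True, True))   # height outside the interval -> False
```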
Only a limited number of problems can be analyzed using crisp logic.
Crisp logic rules are most desirable; try them first,
but remember ...
Fuzzy logic rules
Property(X) ∈ [0,1] instead of 0 or 1.
Typical approach: define triangular, trapezoidal, Gaussian and other types of predicate (membership) functions.
Membership function μ(X): the degree of truth that X has the property μ.
Frequently features are partitioned by membership functions, defining large receptive fields, reducing information (?).
For example, using triangular functions.
Fuzzy logic: separable functions, products of one-dimensional factors:

    μ(X) = μ_{1}(X_{1}) · μ_{2}(X_{2}) · ... · μ_{N}(X_{N})

Many other possibilities exist to produce N-dimensional membership functions.
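The separable product of one-dimensional factors can be sketched with triangular membership functions; all parameter values below are illustrative.

```python
# Sketch: N-dimensional fuzzy membership as a product of one-dimensional
# triangular factors mu_i(x_i) (the separable-function construction above).

def triangular(x, a, b, c):
    """Triangular membership with support [a, c] and peak value 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def separable_mu(x, params):
    """mu(X) = prod_i mu_i(X_i) for triangular factors given by (a, b, c)."""
    mu = 1.0
    for xi, (a, b, c) in zip(x, params):
        mu *= triangular(xi, a, b, c)
    return mu

params = [(0.0, 1.0, 2.0), (1.0, 2.0, 3.0)]   # assumed (a, b, c) per feature
print(separable_mu((1.0, 2.0), params))  # both factors at their peaks -> 1.0
print(separable_mu((0.5, 2.0), params))  # 0.5 * 1.0 = 0.5
```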
Fuzzy rules have continuous membership functions, giving some advantages:
But remember ...
Example: Ljubljana breast cancer data
286 cases: 201 no-recurrence events (70.3%), 85 recurrence events (29.7%).
9 features, symbolic with 2 to 13 values.
Single logical rule:
IF number of involved nodes > 2 AND degree of malignancy = 3 THEN recurrence
ELSE no recurrence
gives over 77% accuracy in cross-validation tests;
the best classification systems do not exceed 78% accuracy (an insignificant difference).
Some fuzzy/rough systems provided 200 rules with CV accuracy close to the base rate!
Reason: too complex, hard to find simplest solution.
All knowledge contained in the data is:
IF more than 2 areas were infected
AND cancer is highly malignant
THEN there will be recurrence.
Context-dependent linguistic variables: adapt membership functions in each rule.
Effect: clusters of different sizes at different input areas.
Clusters may have complex decision border shapes: IF X ∈ C THEN Fact_{k} = TRUE.
Simple rules: non-linear feature transformations may be required.
Problems with rule-based classification models:
Simplest things first: try crisp rules, large information reduction.
Data from measurements/observations are not precise.
Finite resolution, small receptive fields: Gaussian error distribution:
x ⇒ G_{x} = G(y; x, s_{x}), where G_{x} is a Gaussian (fuzzy) number.
Given a set of logical rules {R}, apply them to input data {G_{x}}.
Use Monte Carlo sampling to recover p(C_{i} | G_{x}; {R}); this may be used with any classifier.
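The Monte Carlo step above can be sketched as follows: sample from the Gaussian fuzzy number and count the class assignments of any classifier. The one-feature threshold rule and the uncertainty value are illustrative assumptions.

```python
# Monte Carlo sketch: replace a crisp input x by a Gaussian fuzzy number
# G_x = G(y; x, s_x), sample from it, and estimate p(C_i | G_x; {R}) as the
# fraction of samples that each rule set assigns to class C_i.
import random

def classify(x):
    """Stand-in rule set: class 1 if x > 2.0, else class 0 (assumed)."""
    return 1 if x > 2.0 else 0

def p_class_given_gx(x, s_x, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    counts = {0: 0, 1: 0}
    for _ in range(n_samples):
        y = rng.gauss(x, s_x)          # sample from G(y; x, s_x)
        counts[classify(y)] += 1
    return {c: n / n_samples for c, n in counts.items()}

probs = p_class_given_gx(x=2.0, s_x=0.5)
print(probs)  # x exactly on the rule border: both classes near 0.5
```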
The rule R_{a}(G_{x}) = True iff x > a is true with probability

    p(R_{a} = T) = ∫_{a}^{∞} G(y; x, s_{x}) dy ≈ σ(β(x − a))

because the Gaussian cumulative distribution function is approximated by the logistic function σ(x) = 1/(1 + e^{−x}) to within 2% for an appropriate β ∝ 1/s_{x}.
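The 2% claim can be checked numerically. The constant β = 1.702/s_{x} used below is a commonly quoted choice for matching the logistic to the Gaussian cumulative; it is an assumption here, not taken from the source.

```python
# Numerical check: the Gaussian cumulative Phi((x - a)/s) is approximated
# by the logistic sigma(beta (x - a)) to within about 2%.
import math

def gauss_cdf(x, a, s):
    return 0.5 * (1.0 + math.erf((x - a) / (s * math.sqrt(2.0))))

def logistic(x, a, s):
    beta = 1.702 / s   # assumed constant for the approximation
    return 1.0 / (1.0 + math.exp(-beta * (x - a)))

# Scan x in [-5, 5] for a standard Gaussian (a = 0, s = 1).
max_err = max(abs(gauss_cdf(0.01 * i, 0.0, 1.0) - logistic(0.01 * i, 0.0, 1.0))
              for i in range(-500, 501))
print(f"max |Phi - sigma| = {max_err:.4f}")  # below 0.02
```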
Soft trapezoidal membership functions (MFs), realized by neural L-units, are obtained.
Fuzzy logic with soft trapezoidal MF and crisp inputs is equivalent to crisp logic with G_{x} inputs;
MLP neural networks with logistic transfer functions provide such MFs.
MLP model includes FL with trapezoidal membership functions.
If rules are overlapping and their conditions are correlated, the formula reproducing the Monte Carlo results is

    p(C | G_{x}; M) = Σ_{S ⊆ R_{C}, S ≠ ∅} (−1)^{|S|+1} p(⋀_{R ∈ S} R | G_{x})

where the sum runs over all 2^{R_{C}} subsets of the set of classification rules for class C and R_{C} is the number of rules.
This formula simply avoids counting the same regions twice (inclusion-exclusion).
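The alternating-sign sum over rule subsets can be illustrated with two overlapping toy rules; the intervals and the uniform input distribution are assumptions for this sketch.

```python
# Inclusion-exclusion over rule subsets: summing p(R) over overlapping rules
# double-counts overlaps; alternating signs over all non-empty subsets
# count each region exactly once.
from itertools import combinations

rules = [(0.0, 0.6), (0.4, 1.0)]   # two overlapping intervals on [0, 1]

def p_intersection(subset):
    """Probability (uniform input on [0,1]) of satisfying every rule in subset."""
    lo = max(a for a, b in subset)
    hi = min(b for a, b in subset)
    return max(0.0, hi - lo)

p = 0.0
for k in range(1, len(rules) + 1):
    for subset in combinations(rules, k):
        p += (-1) ** (k + 1) * p_intersection(subset)

print(p)  # 0.6 + 0.6 - 0.2 = 1.0: the union is covered exactly once
```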
This is not a fuzzy approach!
Here small receptive fields are used, in fuzzy approach a few large receptive fields define linguistic variables.
Benefits:
Next step: optimization of rules, exploring the confidence-rejection rate tradeoff.
Define the confusion matrix F(C_{i},C_{j}|M), counting the number of cases from class C_{j} assigned by the set of rules M to class C_{i}.
For 2-class problems:

    F(C_{i},C_{j}|M) = | F_{11}  F_{12} |
                       | F_{21}  F_{22} |

where F_{ij} = N_{ij}/N is the frequency of cases.
F(C_{i},C_{j}|R) may be defined for a single rule or a (sub)set of rules.
Ideal: only the diagonal confusion matrix elements, summing to N, should be left.
Define a weighted combination of the number of errors and the "predictive power" of the rules:

    E(M; γ) = γ (F_{12} + F_{21}) − (F_{11} + F_{22})

This is minimized without constraints; it is bounded by N (the number of all training vectors).
Sets of rules M are parameterized by the X_{k}, X'_{k} intervals.
For γ = 0 the predictive power of the rules is maximized.
Rules that make fewer errors on the training data should be more reliable.
The cost function E(M; γ) allows the number of errors to be reduced to zero (large γ) for rules M that reject some instances.
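The effect of γ can be sketched directly from confusion-matrix counts; the form E(M; γ) = γ·(errors) − (correct) and the toy counts below are assumptions for illustration.

```python
# Sketch of the weighted cost: gamma trades the number of errors against
# the "predictive power" (number of correctly classified cases).
def cost(F, gamma):
    errors = F[0][1] + F[1][0]     # off-diagonal counts N_12 + N_21
    correct = F[0][0] + F[1][1]    # diagonal counts N_11 + N_22
    return gamma * errors - correct

F = [[90, 10],
     [5, 95]]                      # toy confusion matrix, N = 200

print(cost(F, gamma=0.0))   # -185: only predictive power counts
print(cost(F, gamma=5.0))   # -110: errors are now heavily penalized
```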
How to understand the data?
Exploratory data analysis: show the data.
Kohonen's topographic SOM is most popular, trying to classify and display at the same time, but poor in both respects.
Assessing the new data:
Investigate the dependence of p(C|G_{X}; M,S) on S.
If new classes appear with significant probability, analyze the dependence of p(C|G_{X}; M,S) on S for individual features X_{i}.
Example:
with S_{i} = 0: "organic problems" with 100% confidence;
with 1% uncertainty (optimal, lowest error): "organic problems" with 90% confidence.
Probabilities of different classes are shown as a function of assumed uncertainties.
S_{i} = r·[Max_{i} − Min_{i}]; the horizontal axis shows r as a percentage of the input feature range, the vertical axis shows probability in percent.
In ambiguous cases the probabilities of several new classes rise quickly with growing r, while the probability of the most probable class quickly decreases.
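The uncertainty scan described above can be sketched for a single threshold rule; the feature range, threshold and case value are illustrative assumptions.

```python
# Sketch of the uncertainty scan: set S_i = r * (Max_i - Min_i) and watch
# how the probability of the winning class changes as r grows.
import math

def p_rule_true(x, a, s):
    """Analytic p(x > a) for a Gaussian fuzzy input G(y; x, s)."""
    if s == 0.0:
        return 1.0 if x > a else 0.0
    return 0.5 * (1.0 + math.erf((x - a) / (s * math.sqrt(2.0))))

feature_range = 10.0            # assumed Max_i - Min_i
x, a = 6.0, 5.0                 # case slightly above the rule threshold
for r in (0.0, 0.01, 0.05, 0.10, 0.20):
    s = r * feature_range
    print(f"r = {r:4.0%}  p(class) = {p_rule_true(x, a, s):.3f}")
```

The winning class starts at 100% confidence and drops toward 50% as the assumed uncertainty grows, which is exactly the behavior used to judge how reliable the classification is.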
PCI, Probabilistic Confidence Intervals
If the probability of new classes grows quickly with the assumed measurement uncertainty, analyze the probabilistic confidence levels.
Assume that the value of a single feature is changed while all others are kept constant; display the two largest probabilities.
A mixed class appears: paranoid schizophrenia.
Sensitivity analysis: which feature has strongest influence on the decision?
Shows probabilities in the neighborhood of the case analyzed for a single feature.
Allows one to evaluate the reliability of classification, but not to explain the data.
May be used with any classifier.
May illustrate the trajectory of changes:
IMDS, Interactive Multi-Dimensional Scaling
Shows interactively the neighborhood of the analyzed case, trying to preserve all topographic relations.
Minimization of a stress function such as:

    S = Σ_{i<j} w_{ij} (d_{ij} − D_{ij})²

where D_{ij} are the original distances, d_{ij} the distances in the map, and w_{ij} are weights allowing control over which distances are to be better preserved.
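The weighted stress can be sketched directly; the toy distance matrices and weights below are illustrative assumptions.

```python
# Weighted MDS stress: S = sum_{i<j} w_ij (d_ij - D_ij)^2, where D_ij are
# the original distances, d_ij the distances in the low-dimensional map,
# and w_ij select which distances should be preserved best.
def stress(D, d, w):
    """Weighted stress over the upper triangle of the distance matrices."""
    n = len(D)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += w[i][j] * (d[i][j] - D[i][j]) ** 2
    return s

D = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]               # original distances
d = [[0, 1.1, 1.8], [1.1, 0, 1.0], [1.8, 1.0, 0]]   # map distances
w = [[1, 1, 2], [1, 1, 1], [2, 1, 1]]               # emphasize the (0, 2) pair
print(round(stress(D, d, w), 3))  # 1*(0.1)^2 + 2*(0.2)^2 + 1*0 = 0.09
```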
Example: focusing on a data point 'p554' from class 'organika'
 Purpose: view (understand) why this data point is classified into class 'organika'.
Classified using the IncNet neural network, for which features 2, 4 and 7 are sufficient to classify class 'organika' correctly.
Progressive zooming by successively mapping the 200, 100, 50 and 20 nearest data points (interactively selected) around point 'p554' (marked by a black circled dot).
[Figures: maps of the 200 and 50 nearest neighbors of 'p554']
An example of a system based on rules derived from data.
An example of data where logical rules do not work.
871 patterns, 6 overlapping vowels, 3 features (formant frequencies).
No reason why simple rules should exist in this case.
Method                        Accuracy (%)   Reference
-- 10xCV tests --
3-NN, Manhattan               88.1 ± 0.4     Kosice
FSM, 65 Gaussian nodes        87.4 ± 0.5     Kosice
SSV dec. tree, 22 rules       86.0           Kosice
-- 2xCV tests --
3-NN, Euclidean               86.1 ± 0.6     Kosice
FSM, 40 Gaussian nodes        85.2 ± 1.2     Kosice
MLP                           84.6           Pal
Fuzzy MLP                     84.2           Pal
SSV dec. tree, beam search    83.3 ± 0.9     Kosice
Bayes Classifier              79.2           Pal
Fuzzy SOM                     73.5           Pal
More than 20 rules give lower accuracy than k-NN.
Simplest explanation: 6 FSM prototypes (one per class), with Gaussian weighting of the Euclidean distance.
10xCV gives 72.4% accuracy: insufficient information in the 3 formants.
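The prototype-based explanation can be sketched as follows: one prototype per class, Gaussian weighting of the Euclidean distance, and the most similar prototype wins. The prototype locations and σ below are illustrative assumptions, not the FSM model's actual parameters.

```python
# Sketch of a prototype classifier with Gaussian weighting of the
# Euclidean distance: similarity = exp(-||x - p||^2 / (2 sigma^2)).
import math

def similarity(x, p, sigma=1.0):
    d2 = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def classify_prototype(x, prototypes):
    """prototypes: dict class -> prototype vector (one per class)."""
    return max(prototypes, key=lambda c: similarity(x, prototypes[c]))

prototypes = {"a": (0.0, 0.0), "e": (2.0, 1.0)}   # toy 2-class, 2-D example
print(classify_prototype((0.2, 0.1), prototypes))  # -> 'a'
print(classify_prototype((1.9, 1.2), prototypes))  # -> 'e'
```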
Understanding the data is possible through:
Rules: frequently rule sets and/or rule-based classification are unstable.
A small number of accurate rules: the best explanation?
Interpretation/evaluation of new cases:
Class probabilities may always be estimated by Monte Carlo.
For crisp and most fuzzy rule sets analytical formulas exist.
Most papers are available from these pages:
http://www.fizyka.umk.pl/kmk/publications.html
http://www.is.umk.pl/~duch/cv/papall.html