Rafał Adamczak Krzysztof Grąbczewski Karol Grudziński Norbert Jankowski Antoine Naud 
What is this tutorial about:
Neural networks are universal approximators/classifiers (see textbooks for references)
... but are they good tools for real applications?
More methods of classification than datasets to classify.
Computational intelligence (CI) methods developed by experts in:
What type of explanation is satisfactory?
Interesting cognitive psychology (CS) problem.
Exemplar and prototype theories of categorization in CS:
Both are true, logical rules are the highest form of summarization.
Types of explanation:
Best explanation  depending on particular field.
Other implications of knowledge extraction:
Use of various forms of knowledge in one system is still rare.
Cf. DISCERN  distributed lexicon, NLP dialog system (Mikkulleinen)
Logical rules, if simple enough, are preferred by humans.
Are rules indeed the only way to understand the data?
Types of logical rules:
small(x)  = True{xx<1} 
medium(x)  = True{xx Î [1,2]} 
large(x)  = True{xx>2} 
Linguistic variables are used in crisp (propositional, Boolean) rules:
IF smallheight(X) AND hashat(X) AND hasbeard(X) THEN (X is a Brownie)
ELSE IF ... ELSE ...
Crisp logic based on rectangular membership functions: True/False values jump from 0 to 1.
Step functions are used for partitioning of the input space.
Decision regions: hyperrectangular (cuboidal).
For example, some rules for the Iris data are:
Setosa  PL <2 AND PW < 0.6 
Virginica  PL > 4.9 OR PW > 1.6 
Versicolor  ELSE 
Here are decision regions for these rules.
Decision trees provide crisp rules applied in a specific order.
Here is a decision tree for Iris data.
Below its decision regions.
If hyperrectangular regions are too simple, rules are not accurate;
Solution: allow linear combinations of some inputs x.
Oblong decision trees, LDA, Linear Discrimination Analysis.
For example, a good rule for Iris is:
IF (SL+1.57SW3.57PL3.56PW<12.63) THEN Irisversicolor
The number of problems that one may analyze using crisp logic may be limited.
Typical approach: define triangular, trapezoidal, Gaussian and other type of membership functions.
Membership function m(x)  degree of truth that x has the property m
Instead of 0, 1 predicate function map into [0,1] interval.
Partition every feature into some membership functions, for example triangular.
Fuzzy logic: separable functions  products of onedimensional factors:
Many other possibilities exist to produce Ndimensional membership functions.
One form of fuzzy rules is:
Triangular and trapezoidal membership functions give such countours in 2D for different Th
Rough set logic
Roughly: trapezoidal shapes of membership functions, but borders may be nonlinear.
At least M conditions out of N are true.
Natural for neural systems, for example, if at least 2 logical conditions out of 4 are true:
IF at least 2 conditions out of {A,B,C,D} are true
THEN (X is a Brownie)
ELSE IF ... ELSE ...
Clusters may have complex decision border shapes. IF (XÎC) THEN Fact_{k} = TRUE Granulation: covering with simpler shapes, corresponding to many rules.
Simple rules  nonlinear feature transformations may be required.


Crisp logic rules are most desirable; try them first,
but remember ...
Fuzzy rules have continuous membership functions, giving some advantages:
But remember ...
Problems with rulebased classification models:
Knowledge accessible to humans:
Prules have the form:
D(X, P) is a distance function determining decision borders.
Prules are easy to interpret!
IF X=You are most similar to the P=Superman THAN You are in the Superleague.
IF X=You are most similar to the P=Weakling THAN You are in the Failedleague.
In both cases “similar” may involve different features, i.e. different D(X, P) functions.
Prules are more general than Frules!
Frules (and crisp rules) are Prules with separable distance function.
If
for example, Manhattan function, then:
If d(X_{i}, P_{i}) = 1 for X_{i} Î [P_{i}  D P_{i} , P_{i} + D P_{i}], and 0 outside, crisp rules are obtained.
Example: Prules for Iris.
First rule extraction/application is considered;
than some remarks on prototypebased and visualizationbased methods are made.
Methodology of rule extraction, not a particular method.
Many decisions depend on particular application, not completely automatic.
Start from crisp rules  maybe they are sufficient?
Regularization of classification models (for example, network or tree pruning)
allows to explore simplicityaccuracy tradeoff.
A set of rules M is obtained.
Next step: optimization of rules exploring the confidencerejection rate tradeoff.
Define confusion matrix F(C_{i},C_{j}M) counting the number of cases from class C_{j} assigned by the set of rules M to the class C_{i}.
For 2 class problems:
F(C_{i},C_{j}M) = 

where F_{ij} is the frequency of cases N_{ij}/N.
F(C_{i},C_{j}R) may be defined for a rule or a class.
Sensitivity of rules: Se=F_{++} / (F_{++}+F_{+})
Î[0,1].
If Se=1 than all  cases (for example, sick) are never assigned to + class (for example, healthy).
Specificity of rules: Sp=F_{} / (F_{} + F_{+})
Î[0,1].
If Sp=1 than the rule never assigns healthy to sick.
These combinations are sometimes maximized (especially in medical statistics).
Rule confidence factor (relative frequency):
R_{c}=F(C_{i},C_{i}R) /
S_{j} F(C_{i},C_{j}R).
Rule support: how many cases does a rule cover?
Various ways of rule evaluation: entropy (information), mrelative frequency and other
Ideal: only the diagonal confusion matrix elements summing to N should be left.
Define weighted combination of the number of errors and the "predictive power" of rules:
This should be minimized without constraints; it is bound by N (number of all training vectors).
Sets of rules M are parameterized by X_{k}, X'_{k} intervals.
For g=0 predictive power of rules is maximized.
Rules that make fewer errors on the training data should be more reliable.
Cost function E(M; g) allows to reduce the number of errors to zero (large g) for rules M that reject some instances.
Optional risk matrix may be used:
Optimization of rules for a single class or linguistic variables for a single rule is possible.
Note: if the confusion matrix F(C_{i},C_{j}M) is discontinuous nongradient minimization methods should be used (simplex, simulated annealing etc).
Result: sets of rules M_{k} of different reliability.
Frequently more accurate rule R^{(1)} is contained in the less accurate rule R^{(2)}
Reliability should be calculated for the border R^{(2)}  R^{(1)} only.
Estimations of reliability may be poor.
Example: Iris hierarchical rules
Data from measurements/observations are not precise.
Finite resolution: Gaussian error distribution:
x => G_{x}=G(y;x,s_{x}), where G_{x} is a Gaussian (fuzzy) number.
Given a set of logical rules {Â} apply them to input data {G_{x} }.
Use Monte Carlo sampling to recover p(C_{i } X; {Â
})  this may be used with any classifier.
Analytical estimation of this probability is based on cumulant function:
Approximation better than 2% for
The rule R_{a}(x) = {x>a} is true for G_{x} with probability:
If the logistic function is used instead of the error function the exact error distribution is
s(x)(1s(x)); for s^{2}=1.7 it is within 3.5% identical with Gauss.
Soft trapezoidal membership functions realized by Lunits are obtained.
Fuzzy logic with such membership functions and crisp inputs is equivalent to crisp logic with G_{x};
this is realized by MLP neural networks with logistic transfer functions.
MLP < = > FL with trapezoidal membership functions.
For conjunctive rule with many independent conditions:
R = r_{1} Ů r_{2} Ů ...
r_{N}
the probability p(C_{i} X) is a product of
If several rules cover the same input region: adding probabilities counts the overlapping regions many times. For two rules created for class C:
This formula simply avoids counting the same regions twice. For more than 2 rules overlapping more terms appear!
This is not a fuzzy approach!
Here small receptive fields are used, in fuzzy approach typically 25 large receptive fields define linguistic variables.
Benefits:
Alternative approaches: flexible matching in machine learning.
Several practical ruleextraction methods developed in our group:
Simplify the network leaving only 0, ±1 weights, use special linguistic units for input discretization.
Use integer weights/biases.
Start from W_{ij} = 0, bias_{i} = 0.5, change by 1.
Use beam search techniques instead of backpropagation.
FSM (Feature Space Mapping)  separable transfer functions, neurofuzzy network.
Crisp rules: FSM + rectangular transfer functions.
Fuzzy rules: FSM + contextdependent fuzzy membership functions.
Localized functions may be treated as prototypes.
SSV separability criterion: separate maximum number of pairs from different classes minimizing the number of separated pairs from the same class.
Select the best prototypes  "supermans".
Simplest approach: select references in knearest neighbor method.
Explanatory data analysis  show the data.
Overview of visualization methods: if time permits ...
SOM  most popular, trying to classify/display at the same time, but poorly.
May be used with any classifier.
Shows the probabilities in the neighborhood of the case analyzed for all/each feature.
Allows to evaluate reliability of classification, but not to explain the data.
Used directly on the data.
Shows interactively the neighborhood of the case analyzed preserving topographic relations.
Just a brief mention of GhostMiner, our data mining system developed in collaboration with Fujitsu Kyushu Ltd (FQS).
Results for many datasets illustrating the methodology described above.
Showing an example of a system based on rules derived from the data.
In real world projects training and finding optimal networks is not our hardest problem ...
Good methods to discover rules exist although proving that simplest sets of rules have been discovered is usually not possible.
Discovering hierarchical structure in the data:
Dealing with unknown values.
Constructing new, more useful features, from hundreds of 'weak' features.
Constructing the simplest explanations for the data.
Constructing theories that allow to reason about data.
Constructing new and modifying existing classes.
Building complex systems interacting with humans.
IEEE Transactions on Neural Networks 12 (2001) 277306
Most papers are available from these pages
https://www.fizyka.umk.pl/kmk/publications.html
https://www.is.umk.pl/~duch/cv/papall.html