LITERATURE SURVEY
on
NEURO-COMPUTING
- Neural Networks and physical systems
with emergent collective computational abilities, J.J.Hopfield - 1982
- Optimization by simulated annealing,
S. Kirkpatrick, C.F. Gelatt, Jr., and M.P. Vecchi - 1983
- Feature Discovery by Competitive
Learning, D.E. Rumelhalt and D. Zipser
- 1985
- Self-organized formation of topologically
correct feature maps, Teuvo
Kohonen - 1982
- General Introduction Chapter,
Neurocomputing, James
A. Anderson - 1988
- Association Chapter, Psychology,
William James - 1890
- A logical calculus of the ideas
immanent in neurons activity, Warren S.
McCulloch and Walter Pitts - 1943
- Introduction and Chapter 4,
The Organization of Behavior, Donald
O. Hebb - 1949
- Learning Internal Representation
by Error Propagation, D.E Rumelhart, G.E.
Hinton, and R.J. Williams - 1986
- Introduction Chapter of Perceptrons,
Marvin Minsky and Seymour Papert - 1969
- Methods to Speed Up Error Back-Propagation
Learning Algorithm, Dilip Sarkar - 1995
- An Emprical Study of Learning Speed
in Back-Propagation Networks, Scott E. Fahlman,
1988
Neural Networks and physical systems with
emergent collective computational abilities, J.J.Hopfield - 1982
Proceedings of National Academy of Sciences
79:2554-2558,1982
- Interactions among large number of elementary components yield computationally
powerful units.
- Seek collective behaviors, that are robust in details.
- Content addressable memory could retrieve entire
info when a partial info is given.
- Time evolution of a system may be described by a set of coordinates. There
will be stable point regions. For a starting point X = Xa
+ delta represents a partial knowledge of the item Xa
and the system generates to total info Xa.
- Model is composed of neurons which may be only in two different states,
V=0 or V=1. Neurons are connected to each other. Each neuron is adjusted according
to the threshold in each neuron. The most important differences from Perceptrons
are:
- It has a strong back-coupling
- Do ask the questions the questions essential to find emergent
computational properties.
- Syncnocity is not needed as in biological neurons.
- Strength of connection : T. Initial Tij is assumed to be
set by prev experience. Tij =Tji .
- Delay in neurons are simulated like in real ones.
- State changes continue until a smallest Energy
is found. In each update in the state, an energy decrease takes place, or
energy is stable. Therefore, update rule is an
energy minimizing rule.
- Hebb based neural learning.
Optimization by simulated annealing,
S. Kirkpatrick, C.F. Gelatt, Jr., and M.P. Vecchi - 1983
Science 220:671-680,1983
- Main problem in hill-climbing is we may be at the wrong hill.
- One way to tackle with the problem is to examine function
parameters, which requires a high amount of time.
- Main idea : Instead of always
going downhill, go downhill most of
the time.
- Temperature is used as a controller
of how much random search is done by the system.
- Thermo dynamical probability is proportional to Boltzman
probability factor.
- In the thigh temperatures, there may still be a probability to
find the optimum. At high temp, because of the formula higher or lower
energy have same probability approximately. But when we go lower temps,
system will tend to stay in lower temperatures.
- The question is from which temperature to start and cooling
algorithm.
- For NP-Hard problems. Heuristic methods in two classes :
- divide-and-conquer
- iterative improvement : Stuck at local minima
- The most probable behavior of the system (for large number of
atom systems) in thermal equilibrium at a given temperature is observed.
- Cost function -> energy ...
- In Metropolis algorithm when
a configuration change occurs if ' delta_E <= 0 ' new
configuration is accepted, if not, it will be accepted or not
accepted with a probability.
- Simulated annealing process first
optimize system in high temperature, then lower temperature by slow
steps, at last the system freeze in zero temp or when no further changes
occur. At each temperature, the simulation must proceed so long for the
system to reach a steady state.
- Large modifications are done in high temps, small modifications
are done in low temperatures.
- Applied problems: layout of computer chips, routing
of wires and TSP.
- For TSP problem, when cities are grouped together,
temperature changes in the solution results in the rapid separation of
different regions of cities.
Feature Discovery by Competitive Learning, D.E. Rumelhalt and D. Zipser - 1985
Cognitive Science, 1985, 9, 75-112, 1985
- In the beginning, history of neural networks
- Rosenblatt was the first who try to apply unsupervised learning
through neurons.
- Rosenblatt claimed that perceptrons and brains are superior than
computers. Because of their statistical properties both,
- They can interpret environment. He believed in logical approach
is not sufficient to mimic brain's intellectual power. Not the result
but HOW it was performed.
- Result: no network can learn what it is not capable of doing in
principle.
Learning In General:
Paradigms of learning
1. Auto Associator, from part to whole
2. Pattern Associator, from one to pair
3. Classification Paradigm
4. Regularity Detector: Each stimulus pattern is
presented by some probability. Discover the statistically salient
features. No prior classification.
Competitive Learning
- Excitatory connections between layers.
- Inhibitory clusters between layers which compute against each
other.
- Works done :
- von der Malsburg (1973)
- Fukushima (1975)
- Particular System:
- - Each unit in a cluster inhibits all units
within the cluster. Non-ovelapping clusters. One units per cluster may
be active.
- - Units learn only if wins the competition.
- - If lower units is inactive and upper
active -> decrease weight, otherwise -> increase weight
- Features of Competitive Learning :
- - Each cluster classifies the stimulus set
of M groups, one for each unit in the cluster.
- - If there is a structure, it will find.
Dipole Experiment
The goal of these set of experiments is to discover
the dimensional structure of the stimulus population and act as binary
detectors which differentiate half of the grid which contains the
stimulus pattern. The overlapping of the stimulus patterns, which are
constructed as dipoles, can be used to discover the structure. The
stimulus lines form a 2-D grid of 4x4 and as a result there may be 34
possible adjacent dipole patterns. At the end of training period, in 2-D
case, grid is divided as vertically, horizontally or diagonally due to
the initial values of weights. For a cluster which contains two units,
50 to 400 trials are needed to become a stable state. For 4 units, 4000
is sufficient. Additionally, for 3-D space, the system discovered the
spatial structure according to the given input stimuli. In the analysis,
there are two main points which affects the performance of the system
found:
- There is a pressure to reduce the number of border stimuli to
minimum.
- There is a tendency to capture equal sized regions.
Learning Words and Letters
The objective of these experiments is to analyze to
feasibility of letter/word recognition using Competitive Learning
Algorithms. The nature of the input stimuli pattern set is examined in
order to get different behaviors of differentiation in same set of
patterns. The input stimuli is given in a rectangular grid. Because of
the sparse distribution of active input lines for all patterns, some
units may not win during all training process. There are two
modification that could be applied: First is to modify the weight even
when the unit loses in the cluster. Other solution is to adjust
threshold dynamically during the trials. Here are some experiments:
- For 2 units in one cluster and two-letter words, like AA and AB,
the group is constructed according to the first letter. Thus, this can
be called as position-specific letter detectors.
- For 4 units, if the element number is four, four groups appeared.
- Result: If there are N units in one cluster, the specific
position which contains N different letters of the word pattern will be
the position of discrimination criteria.
- In another set of experiments, letter similarity is found by the
algorithm. While A and E are in one group, B and S are in another group.
- In correlated inputs which include AA, BA, SB and EB, the result
of experiments showed that strong correlation between highly similar
inputs overrides first position based units grouping.
- Result: One can control unsupervised learning by controlling
statistical structure of the stimulus patterns presented.
Grouping Horizantal and
Vertical Lines
This is a class of problems that the different
classes are not linearly separable. Since no configuration could
distinguish horizontal lines from vertical ones, using more than one
cluster is obligatory. In this problem, different line types overlap
different from the previous problems. To solve the problem, a three
layered architecture is used. The middle layer was composed of two
clusters and 4 elements in each. The input patterns were given with a
"constant teaching" line in the left and an actual line in the right.
For each pattern, horizontal or vertical line-units were active from
both middle-layer clusters. A top most layer which contains only one
cluster with 2 units was inserted to classify input coming from middle
layer as in the dipole stimulus case. Result: System can differentiate
non-linearly separable patterns using some specific architectures.
Self-organized formation of topologically correct feature
maps, Teuvo Kohonen - 1982
Biological Cybernetics 43: 59-69, 1982
- As it is found in some certain regions in brain it may be useful
to implement a topographic self-organization.
- In this organization, nearby units give similar outputs to same
inputs. For the unit, which gives the best signal to an output signal,
synaptic weight is changed. Additionally neighbor unit weights should be
changed to obtain a "near to the best" outputs. The distant units'
weight strengths should be decreased for normalization concerns.
- A global model should be composed of,
- array of processing units capable of functioning using input
signals,
- a function which selects the "best" value,
- a local neighborhood selection criteria,
- a means of modifying parameters.
BOOK:
General Introduction Chapter, Neurocomputing,
James A. Anderson - 1988
Neurocomputing : Foundations of Research, MIT
Press, Cambridge, 1988
- Connectionalist Model:
The overall behavior of the system is determined by the structure and
the strengths of connections. It is possible to change the connection
strengths by various learning algorithms. It is also possible to build
in various kinds of dynamics into the response of the computing elements.
- A single layer of neurons that connects to itself referred as an autoassociative
system.
- Common sense general rules:
- Similar inputs usually gives rise to similar representations.
- Things to be seperated should be given to widely different
representations.
- If something is important, lots of elements should be used to
represent it.
- Do pre-processing as much as possible.
BOOK:
Association Chapter, Psychology, William James - 1890
Psychology (Briefer Course), New York: Holt,
Chapter XVI, "Association," pp. 253-279
- "The brain is not construted to think abstractly, it is
constructed to ensure survival in the world. Mental facts cannot be
properly studied apart from the physical environment of which they take
cognizance. Mind and world evolved together and mutual fit."
- Brains are only as powerful as they have to be, and are often
suprisingly special purpose. They are very poor in arithmetic or logical
functionalities.
- Association: "When two brain
processes are active together or in immediate succession, one of them,
on reoccuring tends to propagate its excitement into the other."
A logical calculus of the ideas immanent in neurons activity,
Warren S. McCulloch and Walter Pitts - 1943
Bulletin of Mathematical Biophysics 5:115-133,
1943
- The first computational neuron model.
- "A single neuron was simple, and that computational power came
because simple neurons are embedded in an interacting nervous system."
- Assumptions are:
- Neurons are binary.
- If there is any inhibitory active input, this will
inhibit the neuron.
- There is a fixed amount of delay.
- There is a fixed threshold for each neuron.
- Network structure is fixed, each has same weight. No adaptation!
- Inhibitory synapse and Excitory synapses exist. If synapsis is
excitory, its strength is positive, and if it is inhibitory, its
strength is negative.
- He proposed, it is possible to implement logical functions using
neurons.
BOOK:
Introduction and Chapter 4, The Organization
of Behavior, Donald O. Hebb - 1949
The Organization of Behavior, New York: Wiley,
Introduction and Chapter 4, 'The first stage of perception: growth of
the assembly" pp. xi-xix, 60-78, 1949
- Hebbian Learning...
- It is important because he was the first who propose
modifications in synaptic weights will give rise to the physiological
learning rule. --> Hebb synapse.
- "When an axon of cell A is near enough to exite a cell B and
repeatedly or persistently takes part in firing it, some growth process
or metabolic change take place in one or both cells such that A's
efficiency, as one the cells firing B, is increased." From these words,
we can conclude, the efficieny as firing B, is denoted as synaptic
weight.
- He mentioned about some "distributed nature of the repsentation".
- Cell assemblies : There were
interconnected, self-reinforcing subsets of neurons that formed the
representation of information in the nervous system.
Learning Internal Representation by Error Propagation,
D.E Rumelhart, G.E. Hinton, and R.J. Williams - 1986
Nature 323: 533-536, 1986
- Crucial importance in Neural Network Research !!!
- It is considerably faster than the other learning algorithms that
is successful with multi-layer networks, the Boltmann machine(in 1986).
- Back-propagation is a generalization of the Widrow-Hoff error
correction rule -> generalize delta-rule.
- A forward pass then a backward propagation of errors while
modifying the errors.
- Back-prop was descovered by two different places at the same time.
- Real-synapses don't tun backwards.
- Can be used for data compression.
- Like XOR problems, these problems cannot be solved without hidden
units which creates their own internal representations of the input
patterns. Recoding of input patterns in hidden layers (Minsky and Pepert
said).
- The aim is to find a powerful rule, generalized delta rule, for
multi-layer networks.
- The full derivations of delta
rule and generalized one.
- Hidden units with linear activation functions provide no
advantage.
- A continous and non-linear activation function is needed, like logistic activation function.
- A learning rate largest possible which does not lead oscillation should be selected. One way
to increase learning rate while not leading oscillation is to include momentum term to the delta rule.
- If all weights start equal, and solution requires unequal
weights, the system can never learn because the error is propagated
backwards with a proportion of weight -> Symmetry
Breaking.
- XOR, parity, encoding, symmetry about center, addition, negation
are the problems experimented. Discrimination of T
& C independent of translation and rotation was another
problem where each hidden-layer unit is organized into a 3x3 grid
receiving input from a square thus overlapping square regions are the
receptive field.
- Every recurrent network has a
mapping to a feed-forward one. The only requirement is to make unique
all weights and sum up the weight changes in backward pass. ->
Learning to be a shift register and Learning to complete sequences was
the two different experiments done.
BOOK CHAPTER:
Introduction Chapter of Perceptrons, Marvin Minsky and Seymour Papert - 1969
Cambridge, MA: MIT Press, Introduction, pp. 1-20
- The most important reason of the dark
ages of neural networks is this book. It cut all the money
which spent for the neural networks.
- This book mainly examines the perceptrons and its limits. The
requirement of the linear seperability
for problems was the most striking need. There were problems if too much
generalization is needed.
- The R-units was taken as the classification layer by Rosenbatt,
but Minsky took it as predicates, and when retina was taken as 2-D grid,
they was able to make all experiments based on geometrical arguments.
What type predicate geometrical functions could be computed using
perceptrons?
- Connectedness was the main
predicate of Misky and Papert. There were two main limits in their
proofs:
- Ordered limit: only a certin
maximum number of retinal points could be connected to the local
decision units computing the local predicate.
- Diameter limit: only a
geometrically restricted region of retina was connected.
- Diameter limited and order limited perceptrons cannot compute the
connectedness. We cannot either !!!! :)
- Parity is another problem.
- Scaling was the other problem.
While small systems work very nice, larger systems not work at all.
Methods to Speed Up Error Back-Propagation Learning
Algorithm, Dilip Sarkar - 1995
ACM Computing Surveys, Vol. 27, No. 4, December 1995
- Nature of error surface may be different in different regions -> dynamical
learning rate
- Characteristic of error surface may be unique in every dimension -> unique
learning rate
- Indirect dynamical Learning Rate Adjustment:
BOTH - Momentum use previous change in weight.
BATCH - Conjugate Gradient : It is about the error surface in perpendicular
directions.
- Direct Adaptive Learning Rate Adjustment:
- ONE learning rate:
- BATCH - "Bold Driver": If E decrease -> the learning rate is increased
by a factor
If E increase -> the last weight correction to every weight is cancelled
-> the learning rate is decrease by a factor, a new trial
Bold Driver is non-local, only one learning rate for all weights.
- BATCH - Controlling Oscillation: When the sign change occurs in delta-W
the amount of weight change is reduced by a FACTOR.
- ONLINE - Self-Determination of Adaptive Learning Rates: By the means
of computing a most extreme gradient, there would be no need for a "guess"
in the rate coefficient.
- BATCH - SAB : every weight has its own coefficient since partial derivative
of E.
coefficient is
increased if consecutive steps has the same sign.iteration without momentum.
coefficient should
be small when derivative of energy function changes sign.Some iterations
with momentum.
- BATCH -SuperSAB : problems of SAB: determining a "good" coeff in the
change of sign.
solution : decrease the learning
rate coefficient with a high factor.
- BOTH - Rescale of error signal especially for deep layers.
- BOTH - Instead of using actual output, use a expected output value ( from
actual output and error term ).
- BOTH - Use different Error Functions.
- BATCH - Effect of training set size. Select a rate inversely proportional
to the set size. Or integrate different set sized problems.
An Emprical Study of Learning Speed in Back-Propagation
Networks, Scott E. Fahlman,
1988
CMU-CS-88-162, September 1988
- Systematic study of learning speed in back-prop like algorithms.
- The discussion about the quality of different benchmark problems. XOR prolem
is not a classical classification problem. NOSISY ASSOCIATIVE MEMORY benchmarks.
- Develop a faster algorithm, Quick-Prop
In back-prop algorithm, it is desired to follow an optimal path to
reach a global or at least local minimum on the weight surface. Using
second order derivatives may give idea about the curvature of error
function. In quick-prop algorithm, Fahlman use current and previous
error slopes and difference between current and previous weights for
each connection to find current weight change. Thus, he try to obtain
a parabola using two slopes and the difference in weights. After parabola
which is assumed to have arms open upward, is found, minimum point of
this parabola is found to be used in weight change:
delta_w(t) = delta_w(t-1) * S(t) / (S(t-1)
- S(t)) where
S(t) = partial_E(t) / partial_w(t)
There are a number of modifications to overcome local
problems:
* When current slope is larger in magnitude and same in sign, a parameter
'mu' is used to adjust step size, using previous step size.
* A weight decay term is added
to keep weights in an acceptable range.
* To prevent from "flat spot" problem, sigmoid-prime
function is modified by the additoin of 0.1 . |
- A discussion about when learning is completed. Some researchers accept any
output over 0.5 as a one and any output below that threshold as zero. Other
researchers require that each output be very close to the specified target
value. Some other researchers declare success when the sum of the squared
error for all outputs falls below some fixed value. Fahlman sugests a 'theshold
and margin' criterion.
- He gave some experimental results, tuning the parameters of back-prop.
- To eliminate 'flat spot' -> Fahlman added a constant 0.1 to the
sigmoid prime. The reason for that 'flat spot' was : " The value of the sigmoid-prime
function goes to zero as the unit's output approaches 0.0 or 1.0. Even if
such an output value represents the maximum possible error, a unit whose output
is close to 0 or 1 will pass back only a tiny fraction of this error to the
incoming weights and to units in earlier layers. Such a unit wil theoretically
recover, but it may take a very long time; in a machine with roundoff error
and potential for truncating very small valus, such units may never recover.'
- Usage of non-linear error function is applied in experiments.
- The QUICK-PROP algorithm!!!!!!!!!!!!!!!!!!!1
- Scaling experiments
- The complement Encoder Problem and XOR problem