Neural networks are a powerful tool for solving diverse regression and classification problems. However,
their naive application sometimes fails to exploit their full potential. This seminar intends to provide
you with examples of how to enhance and tweak neural networks to better suit your specific problem, as well
as to introduce you to some smart ways of applying neural networks that you probably were not even aware of
before.
Unless otherwise noted, the papers below are taken from a 1996 NIPS workshop, published as
Neural Networks: Tricks of the Trade
Orr, Genevieve B., Klaus-Robert Müller (Eds.)
Lecture Notes in Computer Science 1524, 1998
TUM Library Signature: 0108/DAT 001z 2001 A 9991524
While some of the material may not
be completely current, the field is developing so fast that it often pays to revisit good ideas developed
earlier, in order to possibly combine them with today's methods. If you need a general introduction to
neural-network-based methods, we recommend looking at
Bishop, C. M. (1995): Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK.
To sign up for this seminar:
Please email me at ed.mut.ni@redlef
Papers:
 Yann LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller:
Efficient BackProp,
pp. 9-50.
A nice and comprehensive overview of improvements to the classical BP method.
You may focus on a subset of the alternatives described, if you wish.
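One of the concrete tricks the paper recommends is shifting each input variable to zero mean and scaling it to unit variance before training, so that all weights learn at comparable speeds. A minimal sketch (NumPy stands in for whatever framework you use):

```python
import numpy as np

def standardize(X):
    """Shift each input feature to zero mean and scale it to unit
    variance, a preprocessing step recommended in Efficient BackProp."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0.0] = 1.0  # guard against constant features
    return (X - mu) / sigma

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Xs = standardize(X)  # each column now has mean 0 and variance 1
```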
 Lutz Prechelt:
Early Stopping - But When?,
pp. 55-69.
Early stopping is the most common method to avoid overfitting and improve generalization.
However, in practice rule-of-thumb stopping criteria are often used, because the criteria from theory
tend to fail. Here is how to do a better job.
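The basic patience-based criterion can be sketched as follows; the `step`/`val_loss` interface here is hypothetical, so plug in your own training and validation code:

```python
def train_with_early_stopping(step, val_loss, max_epochs=100, patience=5):
    """Run training epochs until the validation loss has not improved
    for `patience` consecutive epochs; return the best loss and the
    epoch it occurred at.  `step()` performs one training epoch and
    `val_loss()` evaluates the current model on held-out data."""
    best, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        step()
        loss = val_loss()
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss stopped improving: stop training
    return best, best_epoch

# Simulated U-shaped validation curve: the loop stops at epoch 8 and
# reports the minimum reached at epoch 3.
curve = iter([5.0, 4.0, 3.0, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1])
best, best_epoch = train_with_early_stopping(lambda: None, lambda: next(curve))
```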
 Shun-Ichi Amari:
Natural Gradient Works Efficiently in Learning,
Neural Computation 10(2), 1998, pp. 251-276.
Gradient descent techniques like Backprop often suffer from plateaus and other shallow regions of
the error surface. To avoid those, this mathematical paper introduces the concept of natural gradients, which are
defined as the direction of steepest descent in a structured parameter space. A visualization is shown in
this paper, page 8.
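As a toy illustration (not the paper's derivation), compare plain and natural gradient steps on a badly scaled quadratic loss, where the curvature matrix stands in for the Fisher information metric:

```python
import numpy as np

# L(w) = 0.5 * w^T A w with strongly anisotropic curvature.  In a real
# network the Fisher information matrix of the model plays the role of A.
A = np.diag([100.0, 1.0])
w0 = np.array([1.0, 1.0])
grad = A @ w0                        # ordinary gradient: [100, 1]

# Plain gradient descent: the step size must stay below 2/100 to avoid
# divergence in the steep direction, so the flat direction crawls.
w_plain = w0 - 0.01 * grad           # -> [0.0, 0.99]

# Natural gradient: premultiply by the inverse metric; with step size 1
# the minimum at the origin is reached in a single update.
w_nat = w0 - np.linalg.inv(A) @ grad  # -> [0.0, 0.0]
```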
 Gary William Flake:
Square Unit Augmented, Radially Extended, Multilayer Perceptrons,
pp. 145-163.
A very simple but effective way to combine the advantages of logistic feedforward and radial basis
function networks.
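The core idea can be sketched in a few lines: feed the network the squared inputs alongside the originals, so a single sigmoid hidden unit can compute a quadratic form and hence a localized, RBF-like response:

```python
import numpy as np

def square_augment(X):
    """Append elementwise squares to the input features.  A hidden unit
    then computes w.x + v.(x**2) + b, a quadratic form, which lets a
    sigmoid carve out bounded regions instead of only half-spaces."""
    return np.concatenate([X, X ** 2], axis=1)

X = np.array([[1.0, 2.0], [3.0, -1.0]])
Xa = square_augment(X)  # -> [[1, 2, 1, 4], [3, -1, 9, 1]]
```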
 Rich Caruana:
A Dozen Tricks with Multitask Learning,
pp. 165-191.
Training different networks for variations of one task? Surprisingly, using those variations as
side constraints (additional output neurons) to the main task may actually improve learning. This also
delivers a very smart way of dealing with incomplete input data: just use them as optional outputs!
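A sketch of the output-side trick (the interface and the auxiliary weight are assumptions, not from the paper): auxiliary targets share the network's hidden layer, and unknown auxiliary values are marked with NaN so they simply drop out of the loss:

```python
import numpy as np

def multitask_loss(pred_main, y_main, pred_aux, y_aux, aux_weight=0.1):
    """Squared-error loss for a main task plus auxiliary output neurons.
    NaN entries in y_aux mark targets that are unknown for an example
    (e.g. an incomplete input moved to the output side); they are masked
    out rather than imputed."""
    loss = np.mean((pred_main - y_main) ** 2)
    mask = ~np.isnan(y_aux)
    if mask.any():
        loss += aux_weight * np.mean((pred_aux[mask] - y_aux[mask]) ** 2)
    return loss

# Main loss (0+1)/2 = 0.5, auxiliary loss 1.0 weighted by 0.1 -> 0.6.
loss_value = multitask_loss(np.array([1.0, 2.0]), np.array([1.0, 3.0]),
                            np.array([1.0, 2.0]), np.array([np.nan, 1.0]))
```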
 Patrice Simard, Yann LeCun, John S. Denker, Bernard Victorri:
Transformation Invariance in Pattern Recognition - Tangent Distance and Tangent Propagation,
pp. 239-274.
This approach introduces prior knowledge about pattern recognition tasks by looking at invariant output
for certain transformations of the input, e.g. scaling and rotation of scanned characters. The parameter
search can be simplified enormously when these invariant transforms are removed by defining a new metric.
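A sketch of the one-sided tangent distance (simplified relative to the paper's two-sided version): the columns of T span the tangent plane of the transformations at pattern y, and the distance is measured to the closest point on that plane:

```python
import numpy as np

def tangent_distance(x, y, T):
    """min over a of ||x - (y + T a)||: directions in the column space
    of T (e.g. rotation or scaling tangents of y) do not count as
    differences between the patterns."""
    a, *_ = np.linalg.lstsq(T, x - y, rcond=None)
    return np.linalg.norm(x - y - T @ a)

T = np.array([[1.0], [0.0], [0.0]])      # one tangent direction
y = np.zeros(3)
d_along = tangent_distance(np.array([2.0, 0.0, 0.0]), y, T)   # ~0.0
d_across = tangent_distance(np.array([2.0, 1.0, 0.0]), y, T)  # 1.0
```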
 Steve Lawrence, Ian Burns, Andrew D. Back, Ah Chung Tsoi, C. Lee Giles:
Neural Network Classification and Prior Class Probabilities,
pp. 299-313.
In real-world classification problems, there is often a huge imbalance in the frequency of occurrence of the
classes, even though rare classes may be as important as frequent ones. This paper shows how to avoid
the resulting problems with network training.
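One simple countermeasure (a sketch of the general idea, not the paper's exact scheme) is to weight each class's loss inversely to its training frequency:

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency weights: rare classes get a proportionally
    larger say in the training loss, so the network cannot minimize the
    error by always predicting the majority class."""
    classes, counts = np.unique(labels, return_counts=True)
    w = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

weights = class_weights([0, 0, 0, 0, 1])  # -> {0: 0.625, 1: 2.5}
```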
 Jürgen Fritsch, Michael Finke:
Applying Divide and Conquer to Large Scale Pattern Recognition Tasks,
pp. 315-342.
This rather advanced paper discusses several methods to deal with huge datasets (thousands to
millions of classes) occurring e.g. in speech recognition problems. The basic idea is to build hierarchies
of smaller networks instead of setting up one big network. You may opt to discuss a subset of the solutions
presented.
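The factorization at the heart of such hierarchies can be sketched with plain arrays standing in for the subnetworks' softmax outputs:

```python
import numpy as np

# Instead of one softmax over all classes, factor
#   p(class) = p(cluster) * p(class | cluster)
# with a small "root" network choosing the cluster and one expert
# network per cluster choosing within it.
p_cluster = np.array([0.7, 0.3])                          # root output
p_within = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]   # expert outputs
p_class = np.concatenate([pc * pw for pc, pw in zip(p_cluster, p_within)])
# p_class -> [0.35, 0.35, 0.27, 0.03], a valid distribution over 4 classes
```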
 Ralph Neuneier, Hans-Georg Zimmermann:
How to Train Neural Networks,
pp. 373-423.
As the bold title suggests, this article comprehensively describes the authors' idea of how to
properly apply neural networks to a given problem. Again, you may focus on a few aspects of their method
or give an overview of the entire procedure.
 G. E. Hinton, R. R. Salakhutdinov:
Reducing the Dimensionality of Data with Neural Networks,
Science, vol. 313, 2006, pp. 504-507.
This very recent work shows how to apply neural networks to dimensionality reduction, acting as a very
powerful nonlinear principal component analysis (PCA). It uses two different kinds of networks and two
training stages. You can even try it out yourself: all the source code (MATLAB) and data are included.
Some background reading may be required, see references in the paper.
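The bottleneck structure can be illustrated with a linear toy (the optimal linear autoencoder coincides with PCA); the paper's deep nonlinear versions, pretrained with RBMs and then fine-tuned, extend exactly this encode/decode pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[1.0, 2.0, 3.0]])  # rank-1 data in R^3
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:1].T                     # encoder weights: top principal direction
code = X @ W                     # 200 x 1 bottleneck representation
Xhat = code @ W.T                # decoder: reconstruct from the code alone
err = np.linalg.norm(X - Xhat)   # ~0: one number per example suffices here
```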
 R. Dybowski:
Confidence Intervals and Prediction Intervals for Feed-Forward Neural Networks,
in R. Dybowski and V. Gant (eds.), Clinical Applications of Artificial Neural Networks, Cambridge
University Press, 2001.
When using neural networks for regression tasks, it is important to assign error bars to the
network output - not only as a training data average, but also for individual data points. The author
of this paper discusses heuristics to modify standard feedforward networks to enable error estimates.
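One heuristic in this spirit (an illustrative sketch, not the chapter's exact procedure): train an ensemble on bootstrap resamples of the data and report the ensemble's spread as a per-input error bar. Here simple stand-in functions replace the trained networks:

```python
import numpy as np

rng = np.random.default_rng(1)
# Each "network" would normally be trained on its own bootstrap resample
# of the data; here perturbed linear models stand in for them.
ensemble = [lambda x, b=b: 2.0 * x + b for b in rng.normal(0.0, 0.1, size=20)]

x = 3.0
preds = np.array([net(x) for net in ensemble])
mean, spread = preds.mean(), preds.std()
# report mean +/- 2 * spread as an approximate error bar for this input
```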
Organizer: Professor Jürgen Schmidhuber.
Contact:
Dr. Martin Felder (ed.mut.ni@redlef). Please direct questions, suggestions, etc. regarding
this seminar to Martin.
Presentation: Each student presents one (main) paper. The
presentation should be about 30-40 minutes. The presentation language
is either German or English. You might want to check
out
these suggestions for your presentation. Talk to your advisor
at least 2 weeks before your scheduled talk and show him your presentation.
Audience: The presenter will give his talk not only to those assigning
ECTS credits (or the Schein) but to the whole group. Every member of the audience
should be prepared, and must have (tried to) read the respective paper at least once
or twice. We expect lots of questions to be asked during or after the talks. Hopefully
this will make the seminar lively and interesting and not a dull must-sit-through event.
For the same reason, the seminar will be held in (2-3) blocks after the winter break
(in accordance with participants' schedules).
Composition: You also must write a summary of your talk. It should be about
10 pages. Hand it in by the end of the semester (but better finish your summary
before you give your talk, because trying to write things down in your own words will
help you realize which parts of the paper(s) are important).
ECTS: 4 (2 SWS).
Grading: In order to get the credits (ECTS/Schein),
you must give a presentation, write a summary and attend every
talk (occasional exceptions to the last requirement
can be made on an individual basis).

First Block:
Someday in July 2007
Exact time and date to be announced
MI 03.07.023
