Neural Networks: Tricks of the Trade

Cognitive Robotics Lab

Neural networks are a powerful tool for solving diverse regression and classification problems. However, their naive application sometimes fails to exploit their full potential. This seminar intends to provide you with examples of how to enhance and tweak neural networks to better suit your specific problem, as well as to introduce you to some smart ways of applying neural networks that you probably were not even aware of before.

Unless otherwise noted, the papers below are taken from a 1996 NIPS workshop, published as

Neural Networks: Tricks of the Trade
Orr, Genevieve B., Klaus-Robert Müller (Eds.)
Lecture Notes in Computer Science 1524, 1998
TUM Library Signature: 0108/DAT 001z 2001 A 999-1524

While some of the material may no longer be completely current, the field is developing so fast that it often pays to look at good ideas developed earlier, in order to possibly combine them with today's updated methods. If you need a general introduction to neural network based methods, we recommend looking at

Bishop, C. M. (1995): Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK.

To sign up for this seminar:
Please email me at


  • Yann LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller: Efficient BackProp, pp. 9-50.
    A nice and comprehensive overview of improvements to the classical backpropagation (BP) method. You may focus on a subset of the alternatives described, if you wish.
  • Lutz Prechelt: Early Stopping - But When?, pp. 55-69.
    Early stopping is the most common method to avoid overfitting and improve generalization. In practice, however, rule-of-thumb stopping criteria are often used, because the criteria from theory tend to fail. Here is how to do a better job.
  • Shun-Ichi Amari: Natural Gradient Works Efficiently in Learning, Neural Computation 10(2), 1998, pp. 251-276.
    Gradient descent techniques like backprop often suffer from plateaus and other shallow regions of the error surface. To avoid these, this mathematical paper introduces the concept of natural gradients, defined as the direction of steepest descent in a structured parameter space. A visualization is shown on page 8 of the paper.
  • Gary William Flake: Square Unit Augmented, Radially Extended, Multilayer Perceptrons, pp. 145-163.
    A very simple but effective way to combine the advantages of logistic feed-forward and radial basis function networks.
  • Rich Caruana: A Dozen Tricks with Multitask Learning, pp. 165-191.
    Training different networks for variations of one task? Surprisingly, using those variations as side constraints (additional output neurons) on the main task may actually improve learning. This also yields a very smart way of dealing with incomplete input data: just use them as optional outputs!
  • Patrice Simard, Yann LeCun, John S. Denker, Bernard Victorri: Transformation Invariance in Pattern Recognition - Tangent Distance and Tangent Propagation, pp. 239-274.
    This approach introduces prior knowledge about pattern recognition tasks by looking at invariant output for certain transformations of the input, e.g. scaling and rotation of scanned characters. The parameter search can be simplified enormously when these invariant transforms are removed by defining a new metric.
  • Steve Lawrence, Ian Burns, Andrew D. Back, Ah Chung Tsoi, C. Lee Giles: Neural Network Classification and Prior Class Probabilities, pp. 299-313.
    In real-world classification problems, there is often a huge imbalance in the frequency of occurrence of the classes, even though rare classes may be as important as frequent ones. This paper shows how to avoid the resulting problems with network training.
  • Jürgen Fritsch, Michael Finke: Applying Divide and Conquer to Large Scale Pattern Recognition Tasks, pp. 315-342.
    This rather advanced paper discusses several methods to deal with huge datasets (thousands to millions of classes) occurring e.g. in speech recognition problems. The basic idea is to build hierarchies of smaller networks instead of setting up one big network. You may opt to discuss a subset of the solutions presented.
  • Ralph Neuneier, Hans-Georg Zimmermann: How to Train Neural Networks, pp. 373-423.
    As the bold title suggests, this article comprehensively describes the authors' idea of how to properly apply neural networks to a given problem. Again, you may focus on a few aspects of their method or give an overview of the entire procedure.
  • G. E. Hinton, R. R. Salakhutdinov: Reducing the Dimensionality of Data with Neural Networks, Science, vol. 313, 2006, pp. 504-507.
    This very recent work shows how to apply neural networks to dimensionality reduction, in the form of a very powerful nonlinear principal component analysis (PCA). It uses two different kinds of networks and two training stages. You can even try it out yourself - all the source code (MATLAB) and data are included. Some background reading may be required; see the references in the paper.
  • R. Dybowski: Confidence Intervals and Prediction Intervals for Feed-Forward Neural Networks, in R. Dybowski and V. Gant (eds.), Clinical Applications of Artificial Neural Networks, Cambridge University Press, 2001.
    When using neural networks for regression tasks, it is important to assign error bars to the network output - not only as a training data average, but also for individual data points. The author of this paper discusses heuristics to modify standard feedforward networks to enable error estimates.
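To give a flavor of what the papers above contain, here is a sketch of the simplest of Prechelt's early-stopping criteria, the "generalization loss" (GL). The function names and the threshold value are our own illustrative choices, not taken from the paper:

```python
def generalization_loss(val_errors):
    """GL criterion: relative increase (in percent) of the current
    validation error over the lowest validation error seen so far."""
    best = min(val_errors)
    return 100.0 * (val_errors[-1] / best - 1.0)

def should_stop(val_errors, alpha=5.0):
    """Stop training once the generalization loss exceeds alpha percent.
    alpha = 5.0 is an illustrative threshold, not a recommendation."""
    return generalization_loss(val_errors) > alpha
```

For example, with validation errors [0.30, 0.25, 0.24, 0.27] the best value so far is 0.24, so GL = 12.5% and training would stop at alpha = 5. The paper compares this and more robust criteria (e.g. ones that also watch training progress) on many benchmark problems.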
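One standard remedy for the class-imbalance problem discussed by Lawrence et al. is to correct the network's estimated posteriors for the mismatch between training-set and true class priors. A minimal sketch (the function name is ours; the paper also examines other remedies, such as rebalancing the training set itself):

```python
import numpy as np

def correct_posteriors(net_outputs, train_priors, true_priors):
    """Rescale estimated posteriors from the class frequencies seen
    during training to the priors expected at deployment, then
    renormalize so they sum to one."""
    scaled = np.asarray(net_outputs) * (np.asarray(true_priors) /
                                        np.asarray(train_priors))
    return scaled / scaled.sum()
```

A network trained on 90%/10% data that outputs (0.9, 0.1) for some input is corrected to (0.5, 0.5) when the true priors are equal - the apparent confidence in the frequent class was an artifact of the training-set composition.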
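The autoencoder idea behind Hinton and Salakhutdinov's paper can also be illustrated at toy scale. The sketch below is ours: it uses a single linear encoder/decoder pair trained by plain gradient descent, whereas the paper uses deep nonlinear networks with a layer-wise pretraining stage. It merely demonstrates the principle of learning a low-dimensional code by minimizing reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points in 5-D that lie near a 2-D subspace
Z_true = rng.normal(size=(200, 2))
X = Z_true @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

E = 0.1 * rng.normal(size=(5, 2))   # encoder: 5-D input -> 2-D code
D = 0.1 * rng.normal(size=(2, 5))   # decoder: 2-D code -> 5-D output
lr = 0.01
for _ in range(5000):
    R = X - X @ E @ D                   # reconstruction residual
    E += lr * (X.T @ R @ D.T) / len(X)  # gradient step for encoder
    D += lr * ((X @ E).T @ R) / len(X)  # gradient step for decoder

mse = np.mean((X - X @ E @ D) ** 2)     # far below np.mean(X**2) after training
```

A linear autoencoder like this recovers (the subspace of) classical PCA; stacking several such layers, making the units nonlinear, and pretraining them greedily is what turns this toy into the deep autoencoder of the paper.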
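Finally, one common route to the error bars discussed by Dybowski is to train an ensemble of networks on bootstrap resamples of the data and read an interval off the spread of their predictions. A minimal sketch of the aggregation step (the function name is ours; the chapter covers this and several other approaches):

```python
import numpy as np

def ensemble_interval(predictions, level=0.95):
    """Mean prediction and an empirical interval from the predictions
    of an ensemble (e.g. networks trained on bootstrap resamples)
    for a single input point."""
    preds = np.asarray(predictions, dtype=float)
    lo, hi = np.quantile(preds, [(1 - level) / 2, 1 - (1 - level) / 2])
    return preds.mean(), (lo, hi)
```

Note that the bootstrap spread captures model uncertainty (a confidence interval); a genuine prediction interval additionally requires an estimate of the noise variance - exactly the distinction the chapter works through.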

Organizer: Professor Jürgen Schmidhuber.

Contact: Dr. Martin Felder. Please direct questions, suggestions, etc. regarding this seminar to Martin.

Presentation: Each student presents one (main) paper. The presentation should be about 30-40 minutes long, in either German or English. You might want to check out these suggestions for your presentation. Talk to your advisor at least two weeks before your scheduled talk and show him your presentation.

Audience: The presenter will give his talk not only to those assigning ECTS credits (or the Schein) but to the whole group. Every member of the audience should be prepared and must have (tried to) read the respective paper at least once or twice. We expect lots of questions to be asked during or after the talks. Hopefully this will make the seminar lively and interesting rather than a dull must-sit-through event. For the same reason, the seminar will be held in (2-3) blocks after the winter break (in accordance with participants' schedules).

Composition: You must also write a summary of your talk of about 10 pages. Hand it in by the end of the semester (but ideally finish it before you give your talk: trying to write things down in your own words will help you realize which parts of the paper(s) are important).

ECTS: 4 (2 SWS).

Grading: In order to get the credits (ECTS/Schein), you must give a presentation, write a summary and attend every talk (occasional exceptions to the last requirement can be made on an individual basis).


First Block:
Someday in July 2007
Exact time and date to be announced
MI 03.07.023