{"title": "Node Splitting: A Constructive Algorithm for Feed-Forward Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1072, "page_last": 1079, "abstract": "", "full_text": "Node Splitting: A Constructive Algorithm for \n\nFeed-Forward Neural Networks \n\nMike Wynne-Jones \n\nResearch Initiative in Pattern Recognition \n\nSt. Andrews Road, Great Malvern \n\nWR14 3PS, UK \n\nmikewj@hermes.mod.uk \n\nAbstract \n\nA constructive algorithm is proposed for feed-forward neural networks, \nwhich uses node-splitting in the hidden layers to build large networks from \nsmaller ones. The small network forms an approximate model of a set of \ntraining data, and the split creates a larger more powerful network which is \ninitialised with the approximate solution already found. The insufficiency \nof the smaller network in modelling the system which generated the data \nleads to oscillation in those hidden nodes whose weight vectors cover re(cid:173)\ngions in the input space where more detail is required in the model. These \nnodes are identified and split in two using principal component analysis, \nallowing the new nodes t.o cover the two main modes of each oscillating \nvector. Nodes are selected for splitting using principal component analysis \non the oscillating weight vectors, or by examining the Hessian matrix of \nsecond derivatives of the network error with respect to the weight.s. The \nsecond derivat.ive method can also be applied to the input layer, where it \nprovides a useful indication of t.he relative import.ances of parameters for \nthe classification t.ask. Node splitting in a standard Multi Layer Percep(cid:173)\nt.ron is equivalent to introducing a hinge in the decision boundary to allow \nmore detail to be learned. 
Initial results were promising, but further evaluation indicates that the long-range effects of decision boundaries cause the new nodes to slip back to the old node position, and nothing is gained. This problem does not occur in networks of localised receptive fields such as radial basis functions or gaussian mixtures, where the technique appears to work well. \n\n1 Introduction \n\nTo achieve good generalisation in neural networks and other techniques for inferring a model from data, we aim to match the number of degrees of freedom of the model to that of the system generating the data. With too small a model we learn an incomplete solution, while too many free parameters capture individual training samples and noise. \n\nSince the optimum size of network is seldom known in advance, there are two alternative ways of finding it. The constructive algorithm aims to build an approximate model, and then add new nodes to learn more detail, thereby approaching the optimum network size from below. Pruning algorithms, on the other hand, start with a network which is known to be too big, and then cut out nodes or weights which do not contribute to the model. A review of recent techniques [WJ91a] has led the author to favour the constructive approach, since pruning still requires an estimate of the optimum size, and the initial large networks can take a long time to train. Constructive algorithms offer fast training of the initial small networks, with the network size and training slowness reflecting the amount of information already learned. The best approach of all would be a constructive algorithm which also allowed the pruning of unnecessary nodes or weights from the network. 
\n\nThe constructive algorithm trains a network until no further detail of the training data can be learned, and then adds new nodes to the network. New nodes can be added with random weights, or with pre-determined weights. Random weights are likely to disrupt the approximate solution already found, and are unlikely to be initially placed in parts of the weight space where they can learn something useful, although encouraging results have been reported in this area [Ash89]. This problem is likely to be accentuated in higher-dimensional spaces. Alternatively, weights can be pre-determined by measurements on the performance of the seed network, and this is the approach adopted here. One node is turned into two, each with half the output weight. A divergence is introduced in the weights into the nodes which is sufficient for them to behave independently in future training without disrupting the approximate solution already found. \n\n2 Node-Splitting \n\nA network is trained using standard techniques until no further improvement on training set performance is achieved. Since we begin with a small network, we have an approximate model of the data, which captures the dominant properties of the generating system but lacks detail. We now freeze the weights in the network, and calculate the updates which would be made to them, using simple gradient descent, by each separate training pattern. Figure 1 shows the frozen vector of weights into a single hidden node, and the scatter of proposed updates around the equilibrium position. \n\nThe picture shows the case of a hidden node where there is one clear direction of oscillation. This might be caused by two clusters of data within a class, each trying to use the node in its own area of the input space, or by a decision boundary pulled clockwise by some patterns and anticlockwise by others. 
If the oscillation is strong, either in its exhibition of a clear direction or in comparison with other nodes in the same layer, then the node is split in two. The new nodes are placed one standard deviation either side of the old position. While this divergence gives the nodes a push in the right direction, allowing them to continue to diverge in later training, the overall effect on the network is small. In most cases there is very little degradation in performance as a result of the split. \n\nFigure 1: A hidden node weight vector and the weight-update vectors proposed by individual training patterns \n\nThe direction and size of oscillation are calculated by principal component analysis of the weight updates. By a traditional method, we are required to make a covariance matrix of the weight updates for the weight vector into each node: \n\nC = (1/p) \u03a3 \u03b4w \u03b4w^T   (1) \n\nwhere p is the number of patterns. The matrix is then decomposed to a set of eigenvalues and eigenvectors; the largest eigenvalue is the variance of oscillation and the corresponding eigenvector is its direction. Suitable techniques for performing this decomposition include Singular Value Decomposition and Householder Reduction [Vet86]. A much more suitable way of calculating the principal components of a stream of continuous measurements such as weight updates is iterative estimation. An estimate is stored for each required principal component vector, and the estimates are updated using each sample [Oja83, San89]. By Oja's method, the scalar product of the current sample vector with each current estimate of the eigenvectors is used as a matching coefficient, M. The matching coefficient is used to re-estimate the eigenvalues and eigenvectors, in conjunction with a gain term \u03bb which decays as the number of patterns seen increases. 
The eigenvectors are updated by a proportion \u03bbM of the current sample, and the eigenvalues by \u03bbM\u00b2. The trace (sum of eigenvalues) can also be estimated simply as the mean of the traces (sum of diagonal elements) of the individual sample covariance matrices. The principal component vectors are renormalised and orthogonalised after every few updates. This algorithm is of order n, the number of eigenvalues required, for the re-estimation, and O(n\u00b2) for the orthogonalisation; the matrix decomposition method can take exponential time, and is always much slower in practice. \n\nIn a recent paper on Meiosis Networks, Hanson introduced stochastic weights in the multi layer perceptron, with the aim of avoiding local minima in training [Han90]. A sample was taken from a gaussian distribution each time a weight was used; the mean was updated by gradient descent, and the variance reflected the network convergence. The variance was allowed to decay with time, so that the network would approach a deterministic state, but was increased in proportion to the updates made to the mean. While the network was far from convergence these updates were large, and the variance remained large. Node splitting was implemented in this system, in nodes where the variances on the weights were large compared with the means. In such cases, two new nodes were created with the weights one standard deviation either side of the old mean: one SD is added to all weights to one node, and subtracted for all weights to the other. Preliminary results were promising, but there appear to be two problems with this approach for node-splitting. First, the splitting criterion is not good: a useless node with all weights close to zero could have comparatively large variances on the weights owing to noise. This node would be split indefinitely. 
Secondly, and more interestingly, the split is made without regard to the correlations in sign between the weight updates, shown as dots in the scatter plots of figure 2. In figure 2a, Meiosis would correctly place new nodes in the positions marked with crosses, while in figure 2b, the new nodes would be placed in completely the wrong places. This problem does not occur in the node splitting scheme based on principal component analysis. \n\nFigure 2: Meiosis networks split correctly if the weight updates are correlated in sign (a), but fail when they are not (b). \n\n3 Selecting nodes for splitting \n\nNode splitting is carried out in the direction of maximum variance of the scatter plot of weight updates proposed by individual training samples. The hidden layer nodes most likely to benefit from splitting are those for which the non-spherical nature of the scatter plot is most pronounced. In later implementations this criterion was measured by comparing the largest eigenvalue with the sum of the eigenvalues, both these quantities being calculated by the iterative method. 
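The iterative estimation and the splitting criterion just described can be sketched in a few lines. This is a minimal illustration rather than the original implementation: the gain schedule, the 0.5 cut-off on the eigenvalue-to-trace ratio, and the function names are assumptions of the sketch, which estimates only the first principal component.

```python
import numpy as np

def first_principal_component(updates, gain=0.01):
    """Oja-style iterative estimate of the first principal component of a
    stream of per-pattern weight updates, with running estimates of its
    variance (the largest eigenvalue) and of the trace of the covariance."""
    v = np.random.randn(updates.shape[1])
    v /= np.linalg.norm(v)
    eigval = trace = 0.0
    for t, dw in enumerate(updates):
        m = dw @ v                        # matching coefficient M
        lam = gain / (1.0 + gain * t)     # gain decaying with patterns seen
        v += lam * m * (dw - m * v)       # Oja's rule: keeps v near unit norm
        eigval += lam * (m * m - eigval)  # running variance along v
        trace += lam * (dw @ dw - trace)  # running trace estimate
    return v / np.linalg.norm(v), eigval, trace

def split_node(w, updates, threshold=0.5):
    """Split the weight vector w one standard deviation either side of its
    old position along the dominant oscillation direction, if that direction
    carries a large enough share of the total variance; else return None."""
    v, eigval, trace = first_principal_component(updates)
    if trace <= 0.0 or eigval / trace < threshold:
        return None                       # scatter roughly spherical: no split
    sd = np.sqrt(max(eigval, 0.0))
    return w + sd * v, w - sd * v
```

The threshold of 0.5 is arbitrary here; the paper compares the largest eigenvalue with the sum of the eigenvalues but does not state a fixed cut-off. 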
This is less simple in cases where there are a number of dominant directions of variance; the scatter plot might, for example, be a four-dimensional disk in a ten-dimensional space, and hence present the possibility of splitting one node into eight. It is hoped that these more complicated splits will be the subject of further research. \n\nAn alternative approach to determining the need of nodes to be split, in comparison with other nodes in the same layer, is to use the second derivatives of the network error with respect to a parameter of the nodes which is normalised across all nodes in a given layer of the network. Such a parameter was proposed by Mozer and Smolensky in [Smo89]: a multiplicative gating function is applied to the outputs of the nodes, with its gating parameter set to one. Small increments in this parameter can be used to characterise the error surface around the unity value, with the result that derivatives are normalised across all nodes in a given layer of the network. Mozer and Smolensky replaced the sum squared error criterion with a modulus error criterion to preserve non-zero gradients close to the local minimum reached in training; we prefer to characterise the true error surface by means of second derivatives, which can be calculated by repeated use of the chain rule (backpropagation). Backpropagation of second derivatives has previously been reported in [Sol90] and [Hea90]. \n\nSince a high curvature error minimum in the space of the gating parameter for a particular node indicates steep gradients surrounding the minimum, it is these nodes which exhibit the greatest instability in their weight-space position. In the weight space, if the curvature is high only in certain directions, we have the situation in figure 1, where the node is oscillating, and is in need of splitting. 
If the curvature is high in all directions in comparison with other nodes, the network is highly sensitive to changes in the node or its weights, and again it will benefit from splitting. \n\nAt the other end of the scale of curvature sensitivity, a node or weight with very low curvature is one to which the network error is quite insensitive, and the parameter is a suitable candidate for pruning. This scheme has previously been used for weight pruning by Le Cun, Denker et al. [Sol90], and offers the potential for an integrated system of splitting and pruning - a truly adaptive network architecture. \n\n3.1 Applying the sensitivity measure to input nodes \n\nIn addition to using the gating parameter sensitivity to select nodes for pruning, Mozer and Smolensky mention the possibility of using it on the input nodes to indicate those inputs to which the classification is most sensitive. This has been implemented in our system with the second derivative sensitivity measure, and applied to a large financial classification problem supplied by THORN EMI Research. The analysis was carried out on the 78-dimensional data, and the input sensitivities varied over several orders of magnitude. The inputs were grouped into four sets according to sensitivity, and MLPs of 10 hidden nodes were trained on each subset of the data. While the low sensitivity groups failed to learn anything at all, the higher sensitivity groups quickly attained a reasonable classification rate. Identification of useless inputs leads to greatly increased training speed in future analysis, and can yield valuable economies in future data collection. This work is reported in more detail in [WJ91b]. 
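The input sensitivity measure of this section can be illustrated with a small sketch. The paper obtains the second derivatives by backpropagation; the version below instead estimates the second derivative of the error with respect to a multiplicative gate on each input by central finite differences, and the function names are hypothetical.

```python
import numpy as np

def input_sensitivities(error_fn, n_inputs, eps=1e-3):
    """Second derivative of the network error with respect to a
    multiplicative gating parameter on each input, evaluated at gate = 1
    by central finite differences. `error_fn(gate)` must return the total
    error when every input pattern is scaled elementwise by `gate`."""
    gate = np.ones(n_inputs)
    e0 = error_fn(gate)
    sens = np.zeros(n_inputs)
    for i in range(n_inputs):
        up, dn = gate.copy(), gate.copy()
        up[i] += eps                      # perturb one gate at a time
        dn[i] -= eps
        sens[i] = (error_fn(up) - 2.0 * e0 + error_fn(dn)) / eps ** 2
    return sens
```

Because each gate multiplies one input everywhere it is used, the resulting derivatives are comparable across inputs, which is the normalisation property the gating parameter provides. 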
\n\n4 Evaluation in Multi Layer Perceptron networks \n\nDespite the promising results from initial evaluations, further testing showed that the splitter technique was often unable to improve on the performance of the network used as a seed for the first split. These tests were carried out on a number of different classification problems, where large numbers of hidden nodes were already known to be required, and with a number of different splitting criteria. Prolonged experimentation and consideration of this failure led to the hypothesis that a split might be made to correct some misclassified patterns in one region of the input space but, owing to the long-range effects of MLP decision boundaries, the changed positions of the planes might cause a much greater number of misclassifications elsewhere. These would tend to cause the newly created nodes to slip back to the position of the node from which they were created, with no overall benefit. This possibility was tested by re-implementing the splitter technique in a gaussian mixture modelling system, which uses a network of localised receptive fields, and hence does not have the long-range effects which occurred in the multi layer perceptron. \n\n5 Implementation of the splitter in a Gaussian Mixture Model, and the results \n\nThe Gaussian Mixtures Model [Cox91] is a clustering algorithm, which attempts to model the distribution of the points in a data set. It consists of a number of multivariate gaussian distributions in different positions in the input space, and with different variances in different directions. The responses of these receptive fields (bumps) are weighted and summed together; the weights are calculated to satisfy the PDF constraint that the responses should sum to one over the data set. 
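The model just described can be written down directly. A minimal sketch of the density evaluation follows, restricted to spherical covariances for brevity; the function and parameter names are assumptions, not taken from the original system.

```python
import numpy as np

def mixture_density(x, centres, variances, weights):
    """Weighted sum of spherical gaussian receptive fields ('bumps').
    x: (n, d) points; centres: (k, d); variances, weights: (k,), with
    the mixture weights summing to one so the model is a valid PDF."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    density = np.zeros(len(x))
    for c, var, w in zip(centres, variances, weights):
        sq = np.sum((x - c) ** 2, axis=1)          # squared distance to centre
        norm = (2.0 * np.pi * var) ** (d / 2.0)    # spherical normaliser
        density += w * np.exp(-sq / (2.0 * var)) / norm
    return density
```

With the weights constrained to sum to one and each component correctly normalised, the mixture integrates to one over the input space, which is the PDF constraint mentioned above. 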
\nFor the experiments on node splitting, the variance was the same in all directions for a particular bump, leading to a model which is a sum of weighted spherical gaussian distributions of different sizes and in different positions. The model is trained by gradient ascent in the likelihood of the model fitting the data, which leads to a set of learning rules for re-estimating the weights, then the centre positions of the receptive fields, then their variances. \n\nFor the splitter, a small model is trained until nothing more can be learned, and the parameters are frozen. The training set is run through once more, and the updates are calculated which each pattern attempts to make to the centre position of each receptive field. The first principal component and trace of these updates are calculated by the iterative method, and any node for which the principal component variance is a large proportion of the trace is split in two. \n\nThe algorithm is quick to converge, and is slowed down only a little by the overhead of computing the principal component and trace. Figure 3 shows the application of the gaussian mixture splitter to modelling a circle and an enclosing annulus; in the circle (a) there is no dominant principal component direction in the data covered by the receptive field of each node (shown at one standard deviation by a circle), while in (b) three nodes are clearly insufficient to model the annulus, and one has just undergone a split. (c) shows the same data set and model a little later in training after a number of splits have taken place. The technique has been evaluated on a number of other simple problems, with no negative results to date. \n\nFigure 3: Gaussian mixture model with node-splitting applied to a circle and surrounding annulus \n\n6 Conclusions \n\nThe splitter technique, based on taking the principal component 
of the influences on hidden nodes in a network, has been shown to be useful in the multi layer perceptron in only a very limited number of cases. The split in this kind of network corresponds to a hinge in the decision boundary, which corrects the errors for which it was calculated, but usually causes more errors in other parts of the input space. This problem does not occur in networks of localised receptive fields such as radial basis functions or gaussian mixture distributions, where it appears to work very well. Further studies will include splitting nodes into more than two, in cases where there is more than one dominant principal component, and applying node-splitting to different modelling algorithms, and to gaussian mixtures in hidden Markov models for speech recognition. \n\nThe analysis of the sensitivity of the network error to individual nodes gives an ordered list which can be used for both splitting and pruning in the same network, although splitting does not generally work in the MLP. This measure has been demonstrated in the input layer, to identify which network inputs are more or less useful in the classification task. \n\nAcknowledgements \n\nThe author is greatly indebted to John Bridle and Steve Luttrell of RSRE, Neil Thacker of Sheffield University, and colleagues in the Research Initiative in Pattern Recognition and its member companies for helpful comments and advice; also to David Bounds of Aston University and RIPR for advice and encouragement. \n\nReferences \n\n[Ash89] Timur Ash. Dynamic node creation in backpropagation networks. Technical Report 8901, Institute for Cognitive Science, UCSD, La Jolla, California 92093, February 1989. \n\n[Cox91] John S Bridle & Stephen J Cox. Recnorm: Simultaneous normalisation and classification applied to speech recognition. 
In Richard P Lippmann, John E Moody & David S Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 234-240, San Mateo, CA, September 1991. Morgan Kaufmann Publishers. \n\n[Han90] Stephen Jose Hanson. Meiosis networks. In David S Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 533-541, San Mateo, CA, April 1990. Morgan Kaufmann Publishers. \n\n[Hea90] Anthony JR Heading. An analysis of noise tolerance in multi-layer perceptrons. Research Note SP4 122, Royal Signals and Radar Establishment, St Andrews Road, Malvern, Worcestershire, WR14 3PS, UK, July 1990. \n\n[Oja83] E Oja. Subspace Methods of Pattern Recognition. Research Studies Press Ltd, Letchworth, UK, 1983. \n\n[San89] TD Sanger. Optimal unsupervised learning in a single-layer linear feed-forward neural network. Neural Networks, 2:459-473, 1989. \n\n[Smo89] MC Mozer & P Smolensky. Skeletonization: A technique for trimming the fat from a neural network. In DS Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 107-115, San Mateo, CA, April 1989. Morgan Kaufmann Publishers. \n\n[Sol90] Yann Le Cun & John S Denker & Sara A Solla. Optimal brain damage. In David S Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598-605, San Mateo, CA, April 1990. Morgan Kaufmann Publishers. \n\n[Vet86] WH Press & BP Flannery & SA Teukolsky & WT Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1986. \n\n[WJ91a] Mike Wynne-Jones. Constructive algorithms and pruning: Improving the multi layer perceptron. In R Vichnevetsky & JJH Miller, editors, Proceedings of the 13th IMACS World Congress on Computation and Applied Mathematics, pages 747-750, Dublin, July 1991. \n\n[WJ91b] Mike Wynne-Jones. 
Self-configuring neural networks, a new constructive algorithm, and assessing the importance of individual inputs. Technical Report X2345!1, Thorn EMI Central Research Laboratories, Dawley Road, Hayes, Middlesex, UB3 1HH, UK, March 1991. \n", "award": [], "sourceid": 531, "authors": [{"given_name": "Mike", "family_name": "Wynne-Jones", "institution": null}]}