Multilayer Perceptron for Prediction of 2006 World Cup Football Game

A multilayer perceptron (MLP) with the back-propagation learning rule is adopted to predict the winning rates of two teams from their official statistical data at the previous stages of the 2006 World Cup Football Game. The training samples come from three classes: win, draw, and loss. At each new stage, new training samples are selected from the previous stages and added to the training set, and the neural network is retrained; this is a type of on-line learning. The 8 input features are selected by ad hoc choice. We use the theorem of Mirchandani and Cao to determine the number of hidden nodes, and after testing the learning convergence, the MLP is determined to be an 8-2-3 model. The learning rate and momentum coefficient are determined by cross-learning. The prediction accuracy reaches 75% if the draw games are excluded.


Introduction
Neural network methods have been used in the analysis of sport data with good performance. Purucker employed supervised and unsupervised neural networks to analyze and predict the winning rate in the National Football League (NFL), and found that a multilayer perceptron neural network (MLP) with supervised learning performed better than Kohonen's self-organizing map network with unsupervised learning [1]. Condon et al. used an MLP to predict the score of a country that participated in the 1996 Summer Olympic Games [2], and the result outperformed that of a regression model. Rotshtein et al. used a fuzzy model with a genetic algorithm and a neural network to predict football games in Finland [3]. Silva et al. used an MLP to build the non-linear relationship between factors and swimming performance in order to estimate the performance of swimmers, and the difference between the true and the estimated results was low [4]. However, there are few discussions on the parameter determination of MLP in different applications. Here, we adopt the supervised multilayer perceptron neural network with the error back-propagation learning rule (BP) to predict the winning rate of the 2006 World Cup Football Game (WCFG). We use a theorem to determine the number of hidden nodes. We also determine the learning rate and momentum coefficient in cross-learning by choosing the settings with the smaller average and deviation of the convergence time.
According to the schedule of the 2006 WCFG, shown in Figure 1, there are 32 teams in this competition and 64 matches overall at 5 stages from the beginning to the end of the tournament. The competition rules in each stage are explained as follows.
(1) Stage 1 is the group match, also known as a round-robin tournament. There is no extra time after the 90 minutes of regular time. In this stage, there are 32 teams in 8 groups (Groups A-H), each group has 4 teams, and each team plays 3 matches. There are 6 matches in each group and there are 8 groups, so stage 1 has 48 matches in total (Match 1-48).
The criterion for gaining points is that winning a game gives 3 points, drawing a game gives 1 point, and losing a game gives 0 points. After stage 1, the two teams with the highest points in each group enter the next stage. From the website of the 2006 WCFG held in Germany [5], we can obtain the official statistical records of the 64 matches provided by FIFA [6]. The report of each match contains 17 statistical items: goals for, goals against, shots, shots on goal, penalty kicks, fouls suffered, yellow cards, red cards, corner kicks, direct free kicks to goal, indirect free kicks to goal, offsides, own goals, cautions, expulsions, ball possession, and fouls committed, which represent the ability index to win the game. From these statistical data, we apply an MLP neural network to predict the winning rate of two teams at the next-stage games (stages 2 to 5) by means of their statistical data from previous games. Figure 2 shows the supervised prediction system, which is composed of two parts: a training part and a prediction part. There are training samples from three classes: win, draw, and loss.

Feature Selection and Normalization
2.1. Feature Selection. We get 64 match reports from [5], and there are 17 statistical items in each match report. We select 8 items by ad hoc choice as the input features; they are the items we consider to represent the capability to win the game most directly. Ad hoc choice is a common step in the real application of an algorithm. These 8 features are marked as $x_1$ = goals for (GF), $x_2$ = shots (S), $x_3$ = shots on goal (SOG), $x_4$ = corner kicks (CK), $x_5$ = direct free kicks to goal (DFKG), $x_6$ = indirect free kicks to goal (IDFKG), $x_7$ = ball possession (BP), and $x_8$ = fouls suffered (FS).

Normalization: Relative Ratio As Input Feature
We use the relative ratio as the input feature. Training samples and prediction samples are normalized by the relative ratio as follows:

$$y_{iA} = \frac{x_{iA}}{x_{iA} + x_{iB}}, \qquad y_{iB} = \frac{x_{iB}}{x_{iA} + x_{iB}}, \qquad i = 1, \ldots, 8; \qquad \text{if } x_{iA} = x_{iB}, \text{ then } y_{iA} = y_{iB} = 0.5. \tag{1}$$

The input features $x_1$ to $x_8$ are converted into $y_1$ to $y_8$ by (1), and then $y_1$ to $y_8$ are fed into the neural network model for training or prediction. In (1), the symbol "A" indicates team A and the symbol "B" indicates team B. The symbol "i" is the index of the 8 features. The input values $y_1$ to $y_8$ are between 0 and 1 after normalization. The rule "if $x_{iA} = x_{iB}$, then $y_{iA} = y_{iB} = 0.5$" also covers the case $x_{iA} = x_{iB} = 0$, where the ratio itself is undefined. An example of the normalization result for Germany (GER) versus Costa Rica (CRC) is listed in Table 2.
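The normalization can be sketched in a few lines of Python. This is a minimal illustration, assuming the relative ratio of a feature is each team's value divided by the sum of both teams' values; the function name `relative_ratio` and the sample numbers are hypothetical and are not taken from Table 2.

```python
def relative_ratio(x_a, x_b):
    """Convert the raw feature vectors of team A and team B into relative ratios.

    Assumes non-negative feature values. Equal values (including 0 and 0,
    where the ratio would be 0/0) are mapped to 0.5 for both teams.
    """
    y_a, y_b = [], []
    for a, b in zip(x_a, x_b):
        if a == b:
            y_a.append(0.5)
            y_b.append(0.5)
        else:
            total = a + b
            y_a.append(a / total)
            y_b.append(b / total)
    return y_a, y_b


# Illustrative values only, in the order GF, S, SOG, CK, DFKG, IDFKG, BP, FS.
team_a = [4, 21, 10, 7, 0, 0, 63, 12]
team_b = [2, 4, 2, 3, 0, 0, 37, 11]
y_a, y_b = relative_ratio(team_a, team_b)
```

Each pair of normalized values sums to 1, so a feature on which both teams are equal contributes no preference to either side.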

Multilayer Perceptron with Back-Propagation Learning Algorithm
The MLP model with the BP learning algorithm has been important since 1986 [7,8]. The weighting coefficient adjustment is described in [7-9]. Figure 3 shows the 8-6-3 MLP with one hidden layer used in this study to predict the winning rate of the football games. There are 8 inputs, 6 hidden nodes, and 3 outputs. The symbols are explained as follows: y is the input data vector with 8 normalized features, w is the connection weight between nodes of two layers, net is the sum of the products of inputs and weighting coefficients, f(net) is the transfer function with values between 0 and 1, o is the output value, d is the desired output, and e is the error value. The transfer function used in the hidden layer and the output layer is the log-sigmoid function $f(\mathrm{net}) = 1/(1 + e^{-\mathrm{net}})$, shown in (2). Using the least-squared error with the gradient descent method, we obtain (3) and (4) for the weighting coefficient adjustment. With a momentum term that carries the inertia of the previous adjustment step, the final adjustment equations are modified as (5) and (6), where η is the learning rate, t is the index of iteration, and β is the momentum coefficient.
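The learning rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: it uses the standard delta rule with momentum, adds a bias weight to every node, updates after each sample, and initializes weights uniformly in [-0.5, 0.5]. None of these details are given in the text, and the class name `MLP` and its methods are hypothetical.

```python
import math
import random


def sigmoid(net):
    """Log-sigmoid transfer function with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))


class MLP:
    """One-hidden-layer perceptron trained by back-propagation with momentum."""

    def __init__(self, n_in, n_hidden, n_out, eta=0.1, beta=0.8, seed=0):
        rng = random.Random(seed)
        # One extra weight per node for a constant bias input (an assumption).
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                    for _ in range(n_out)]
        # Previous adjustments, needed for the momentum term.
        self.dw_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw_o = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
        self.eta, self.beta = eta, beta

    def forward(self, y):
        y1 = list(y) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, y1))) for row in self.w_h]
        h1 = h + [1.0]
        o = [sigmoid(sum(w * v for w, v in zip(row, h1))) for row in self.w_o]
        return h, o

    def train_step(self, y, d):
        """Adjust the weights for one sample and return its mean squared error."""
        h, o = self.forward(y)
        y1, h1 = list(y) + [1.0], h + [1.0]
        # Output deltas: error times the sigmoid derivative o * (1 - o).
        delta_o = [(dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o)]
        # Hidden deltas: output deltas propagated back through the old weights.
        delta_h = [hj * (1.0 - hj) *
                   sum(delta_o[k] * self.w_o[k][j] for k in range(len(o)))
                   for j, hj in enumerate(h)]
        for k, row in enumerate(self.w_o):
            for j in range(len(row)):
                self.dw_o[k][j] = (self.eta * delta_o[k] * h1[j]
                                   + self.beta * self.dw_o[k][j])
                row[j] += self.dw_o[k][j]
        for j, row in enumerate(self.w_h):
            for i in range(len(row)):
                self.dw_h[j][i] = (self.eta * delta_h[j] * y1[i]
                                   + self.beta * self.dw_h[j][i])
                row[i] += self.dw_h[j][i]
        return sum((dk - ok) ** 2 for dk, ok in zip(d, o)) / len(d)


# Usage: repeatedly present one made-up "win" sample (desired output 1-0-0).
net = MLP(n_in=8, n_hidden=6, n_out=3, eta=0.1, beta=0.8, seed=1)
sample, desired = [0.5] * 8, [1.0, 0.0, 0.0]
errors = [net.train_step(sample, desired) for _ in range(200)]
```

The momentum term reuses a fraction β of the previous adjustment, which is why the previous adjustments are stored alongside the weights.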

Training Samples.
At stage 1, we select the teams that win or lose all three games as the training samples. These data are representative samples of win and loss, although the selection is also ad hoc. They are italicized in Table 1. We also select the teams that have draw games. We set the desired output to 1-0-0 for winning the game, 0-1-0 for a draw game, and 0-0-1 for losing the game. From stage 2 on, a winning team's record is added to the training samples for all subsequent stages only if the team won all three games at stage 1.
The selected training teams are marked with a gray background in Figure 1, and the training matches from stage 1 to stage 4 are marked with bold lines and bold underlined numbers. The draw games are also selected as training samples (Match 4, 13, 16, 23, 25, 28, 29, 35, 37, 40, and 44). Games that end in a penalty shoot-out after stage 2 are considered draw games. Only the records of the regular 90 minutes and the 30 minutes of extra time are counted.
4.2. Input Team Data for Predicting. We do not have to predict the game results at stage 1, but the records at stage 1 are extracted in order to predict the game results at the next stages. The input data used to predict the game results at stages 2-5 are described as follows:
(1) The input data for predicting the round of 16 (stage 2): for each team that enters stage 2, we take the average value of its 3 match records as the input data. In total, we get 16 input team data for the 8 games that we want to predict at stage 2. For example, the input team data used to predict the winner of GER versus SWE are listed in Table 3.
(2) The input data for predicting the quarter-finals (stage 3): the input data are taken from the records of stages 1-2. For each team, we take the average value of its 4 match records (3 games from stage 1 and 1 game from stage 2) as the input data. In total, there are 8 input team data for the 4 games that we want to predict in the quarter-finals.
(3) The input data for predicting the semifinals (stage 4): the input data are taken from stages 1-3. For each team, we take the average value of its 5 match records as the input data. In total, we get 4 input team data for the two semifinal games to predict.
(4) The input data for predicting the finals (stage 5): the input data are taken from stages 1-4. For each team that has entered the final stage, we take the average value of its 6 match records as the input data. In total, 4 input team data are ready for the two final-stage games that we want to predict.
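The averaging step in items (1)-(4) can be sketched as follows. This is a minimal illustration, assuming each match record is already a list of feature values and that the average is taken feature by feature; the function name `average_records` and the numbers are hypothetical, and the vectors are truncated to 3 features for brevity.

```python
def average_records(match_records):
    """Average a team's per-match feature vectors feature by feature."""
    n = len(match_records)
    return [sum(values) / n for values in zip(*match_records)]


# Made-up records of one team over its 3 stage-1 matches (3 features shown).
stage1_records = [
    [0.6, 0.5, 0.4],
    [0.4, 0.5, 0.8],
    [0.5, 0.5, 0.6],
]
team_input = average_records(stage1_records)
```

For a later stage the same function is simply called with a longer list, for example 4 records for the quarter-finals or 6 records for the finals.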

Determination of the Number of Hidden Nodes by Theorem
Mirchandani and Cao [10] proposed a theorem stating that the maximum number of separable regions (M) is a function of the number of hidden nodes (H) and the input space dimension (d):

$$M(H, d) = \sum_{k=0}^{d} C(H, k), \quad \text{where } C(H, k) = 0 \text{ for } H < k. \tag{7}$$

There are 52 training samples in total at stage 4, and the input dimension d is 8. Based on (7), a network with 6 hidden nodes gives at most 64 separable regions:

$$M(6, 8) = \sum_{k=0}^{8} C(6, k) = 1 + 6 + 15 + 20 + 15 + 6 + 1 + 0 + 0 = 64.$$

Since 64 regions exceed the 52 training samples, whereas 5 hidden nodes give only M(5, 8) = 32 regions, 6 hidden nodes are sufficient. So from the theorem we adopt the 8-6-3 MLP model.
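The bound in (7) is easy to check numerically. The sketch below assumes the criterion used in the text, namely that the number of separable regions should be at least the number of training samples; the function names are hypothetical.

```python
from math import comb


def max_separable_regions(hidden_nodes, input_dim):
    """M(H, d): sum of C(H, k) for k = 0..d; math.comb returns 0 when k > H."""
    return sum(comb(hidden_nodes, k) for k in range(input_dim + 1))


def min_hidden_nodes(n_samples, input_dim):
    """Smallest H whose region bound M(H, d) reaches the number of samples."""
    h = 1
    while max_separable_regions(h, input_dim) < n_samples:
        h += 1
    return h


regions_6 = max_separable_regions(6, 8)
regions_5 = max_separable_regions(5, 8)
needed = min_hidden_nodes(52, 8)
```

Because d = 8 is larger than H here, every term C(H, k) with k > H vanishes and the bound reduces to 2^H.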

Cross Determination of Parameters for the Back-Propagation Learning
Kecman recommended ranges of the learning rate η and the momentum coefficient β for BP learning [11]. Here we use cross-learning. We use the 43 training team samples selected from stage 1 as the training set. The cross procedures for determining the parameter settings for BP are listed in Table 4.
To determine the momentum coefficient β, we set the number of hidden nodes to 6, the target mean square error (MSE) to 0.01, and a fixed learning rate η = 0.1, and then test five different momentum coefficients (β = 0.5, 0.6, 0.7, 0.8, and 0.9). Each β setting is tested 40 times. The testing results are listed in Table 5. Table 5 shows that the standard deviation of the number of iterations to convergence is the smallest at β = 0.8, which means that the training process is the most stable there. Therefore, we set the momentum coefficient β to 0.8 in the MLP model. With β = 0.8 fixed, we next determine the learning rate η in the same kind of cross-learning. We set the number of hidden nodes to 6, MSE = 0.01, and fixed β = 0.8, and then test five different learning rates (η = 0.1, 0.3, 0.5, 0.7, and 0.9). Each η setting is tested 40 times. The testing results are listed in Table 6. From Table 6, we find that η = 0.9 gives a smaller average convergence time than the other η settings. Finally, based on this systematic analysis, we set the learning rate η = 0.9 and the momentum coefficient β = 0.8 in the MLP model.
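The two-step procedure can be sketched as below. This is a hedged illustration of the selection logic only: `train` stands for any routine that trains the network once and returns the number of iterations to convergence, and `toy_train` is a made-up deterministic stand-in, unrelated to the measurements in Tables 5 and 6.

```python
from statistics import mean, stdev


def cross_determine(train, betas, etas, eta_fixed=0.1, n_tests=40):
    """Pick beta by the smallest spread, then eta by the smallest average.

    `train(eta, beta, seed)` must return the iterations needed to converge.
    """
    # Step 1: fix eta and keep the beta with the most stable convergence.
    beta_runs = {b: [train(eta_fixed, b, s) for s in range(n_tests)]
                 for b in betas}
    best_beta = min(betas, key=lambda b: stdev(beta_runs[b]))
    # Step 2: fix that beta and keep the eta with the fastest convergence.
    eta_runs = {e: [train(e, best_beta, s) for s in range(n_tests)]
                for e in etas}
    best_eta = min(etas, key=lambda e: mean(eta_runs[e]))
    return best_eta, best_beta


def toy_train(eta, beta, seed):
    """Hypothetical stand-in for a real training run (not measured data)."""
    spread = 1 + round(abs(beta - 0.7) * 100)
    return round(100 / eta) + (seed % spread)


best_eta, best_beta = cross_determine(
    toy_train, betas=[0.5, 0.7, 0.9], etas=[0.1, 0.5, 0.9], n_tests=10)
```

Determining one parameter at a time keeps the number of training runs linear in the number of candidate values, at the cost of not exploring every (η, β) pair.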

Refine the Number of Hidden Nodes
From the previous theorem, 6 hidden nodes are needed to converge for 52 training samples with 8 input dimensions. However, in practice, the number of required hidden nodes may be smaller than the theoretical value because of the data distribution. It is therefore worth checking how many hidden nodes the MLP needs to converge in this application. Tests are made from 6 hidden nodes down to 1 hidden node with the 52 training samples at stage 4, removing one hidden node at a time for the learning convergence test. Figure 4 shows the plots of MSE versus iterations for 1-6 hidden nodes. The results show that the MLP converges within 80 iterations when 6, 5, 4, 3, or 2 hidden nodes are used, but it cannot converge with only one hidden node even after 10,000 iterations. For clarity, Figure 4 shows only the first 80 iterations. From this test, we conclude that the MLP needs only two hidden nodes to learn the training samples. We can infer that some training samples are grouped together, so the 6 hidden nodes are not needed.

Prediction Results
From the previous analysis, the final prediction model used in this study is the 8-2-3 MLP with BP learning. The parameter settings for BP learning are η = 0.9, β = 0.8, and MSE = 0.01. The prediction method is explained as follows.
We input the average data of each team into the well-trained MLP. Then we compare the output values of the first output node of the two teams in a game to determine the win or loss. The team with the larger output value at the first output node, meaning the greater ability to win the game, is the winner. The winning rate prediction results of the two teams in each game from stage 2 to stage 5 are listed in Table 7. The symbol "W" marks the predicted winner, whose real output value at the first output node is larger. The symbol "L" marks the predicted loser, whose real output value at the first output node is smaller.
The prediction results must be compared with the real game results. The symbol "Y" means that the prediction is correct, and the symbol "N" means that the prediction is wrong. The symbol "N/A" means that the prediction is not counted because the two teams draw.
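The decision rule can be written as a short function. This sketch assumes each team's MLP output is a 3-element list in the order win, draw, loss; the function name, the output values, and the handling of exactly equal outputs (returning `None`) are assumptions, since the text does not specify them.

```python
def predict_winner(team_a, out_a, team_b, out_b):
    """Return the team whose first (win) output node is larger.

    Returns None when both first outputs are exactly equal, a case the
    text treats as very unlikely.
    """
    if out_a[0] > out_b[0]:
        return team_a
    if out_b[0] > out_a[0]:
        return team_b
    return None


# Made-up MLP outputs for two teams in one game.
winner = predict_winner("A", [0.81, 0.10, 0.07], "B", [0.64, 0.22, 0.15])
```

Only the first output node enters the comparison; the draw and loss nodes are used during training but not in this decision.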
Prediction for football games is not easy because the players use their feet to control the ball. Many factors and situations can change during a game, and thus the game results are often unpredictable. Scoring is not easy in football, so there are many draw games in the records. Most of the time, a neural network can only predict the winner and the loser: it is unlikely that the first output node of the MLP gives exactly equal winning rates for two teams, so a draw game is hard to predict. We therefore exclude the draw games from the calculation of the prediction accuracy. If we exclude the four draw games (Match 54, 57, 59, and 64) and calculate the average prediction accuracy over the other 12 of the 16 games from stage 2 to stage 5, the prediction accuracy is 75% (9/12).
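The arithmetic behind the 75% figure can be reproduced directly from the numbers in the text; the variable names below are illustrative.

```python
# Matches at stages 2-5: round of 16, quarter-finals, semifinals, final stage.
games_per_stage = {2: 8, 3: 4, 4: 2, 5: 2}
draw_matches = [54, 57, 59, 64]   # excluded from the accuracy
correct = 9                        # correct predictions among the counted games

counted = sum(games_per_stage.values()) - len(draw_matches)
accuracy = correct / counted
print(f"{correct}/{counted} = {accuracy:.0%}")
```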
The odds can be calculated from the MLP for betting reference; the formula is defined in terms of $O_A$ and $O_B$, the real outputs from the first output node of the MLP for team A and team B. The odds for team B versus team A are the reverse. The resulting odds are also shown in Table 7.

Conclusions and Discussions
In this study, we adopt a multilayer perceptron with back-propagation learning to predict the winning rate in the 2006 WCFG. We select 8 significant statistical records from the 17 official records of the 2006 WCFG. The 8 records of each team are transformed into relative ratio values with respect to the opposing team.
Then the average ratio values of each team at the previous stages are used as the MLP input. We use the theorem of Mirchandani and Cao to determine the number of hidden nodes, which gives 6 hidden nodes and an 8-6-3 MLP model. After testing the learning convergence, the MLP is determined to be an 8-2-3 model. We can infer that some training samples are grouped together, so the 6 hidden nodes are not needed.
If the draw games are excluded, the prediction accuracy reaches 75% (9/12).
The 8 features are selected by ad hoc choice. If we wanted to select the 8 best features, we would have to analyze C(17, 8) = 24,310 combinations. We can select the feature set for which the error is the smallest or the distance measure is the largest. Commonly used distance measures for feature selection include the divergence, the Bhattacharyya distance, the Matusita distance, and the Kolmogorov distance [12]. Entropy can also be used in feature selection [12].
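The size of this search, and the shape of an exhaustive search over feature subsets, can be sketched as follows. The function `best_subset` and its `score` argument are hypothetical; `score` stands for whichever criterion is chosen, such as a distance measure to maximize.

```python
from itertools import combinations
from math import comb

# Number of ways to choose 8 features out of the 17 statistical items.
n_subsets = comb(17, 8)


def best_subset(n_features, k, score):
    """Return the k-feature index tuple with the highest score."""
    return max(combinations(range(n_features), k), key=score)


# Toy usage with a made-up score that favors high feature indices.
toy_best = best_subset(5, 2, score=lambda subset: sum(subset))
```

With 24,310 candidate subsets, an exhaustive search is feasible only if one evaluation of the score is cheap compared with retraining the network.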
Other methods can also be used for the learning: the conjugate gradient method, the Levenberg-Marquardt method, simulated annealing, and genetic algorithms [13-17]. But the MLP with BP learning is simpler with respect to the determination of the number of hidden nodes, the parameter setting, and the observation of learning convergence in this application. Other pattern classification methods may be used for comparison of the prediction accuracy [12,18].
From the pattern recognition point of view, compared with two-class training sets (win and loss), three-class training sets (win, draw, and loss) can give more reliable prediction accuracy, because the decision regions or boundaries after training can be more precise.

Figure 3: MLP network used in predicting the winning rate of 2006 WCFG.

(4) The training samples for predicting the games in the finals (stage 5): because the two teams (ITA and FRA) did not win all 3 games at stage 1, they are not selected as training samples. The training samples at stage 5 are the same as those at stage 4 (52 training samples).

Table 1 lists the score table for the 32 teams in 8 groups after the 48 matches of stage 1.
(2) The competition rule of stages 2-5 is a single-elimination tournament. Penalty kicks are necessary if two teams are tied after regular time (90 minutes) and additional time (30 minutes). The winner enters the next stage and the loser is eliminated from the competition. Stage 2 is the round of 16, and there are 8 matches (Match 49-56) for 16 teams. Stage 3 is the quarter-finals, and there are 4 matches (Match 57-60) for 8 teams. Stage 4 is the semifinals, and there are 2 matches (Match 61-62) for 4 teams. Stage 5 is the final stage, and there are 2 games for 4 teams (the same teams as in stage 4). One is the third-place game (Match 63), and the other is the final game (Match 64).

Figure 1: Total 64 matches at 5 stages for 32 teams in the schedule of 2006 WCFG.

Table 1: Score table after the 48 matches at stage 1.

Table 2: Data normalization of GER versus CRC.
The selected training samples for predicting the games in the quarter-finals (stage 3): besides the 43 training data from stage 1, we add as training samples the stage 2 match data of those teams that not only won at stage 2 but also won all 3 games at stage 1. We find 3 such teams' records (GER, POR, and BRA) as training samples at stage 2. In addition, one game (SUI versus UKR) ends in penalty kicks, and we consider it a draw game, which contributes two samples. Therefore, we have 48 (43 + 3 + 2 = 48) training samples for the training to predict at stage 3.
Consequently, we can find 21 samples from the 20 matches of 7 teams as the training data. We have 4 teams with 3 wins (GER, POR, BRA, and ESP) and 3 teams with 3 losses (CRC, SCG, and TOG). GER and CRC play one match against each other, and their records are selected as two training samples. Besides, there are 11 draw games (Match 4, 13, 16, 23, 25, 28, 29, 35,

Table 3: Input data used to predict the winning rate of GER versus SWE at stage 2.

Table 4: Procedures of determining the parameters for BP learning rule.

Table 7: Prediction results of winning rates from stage 2 to stage 5.

The MLP is used for predicting the win and loss. The teams with 3 wins, 3 losses, and draws at stage 1 are selected as the training samples. The 8 records and the training samples are selected by ad hoc choice, which is a common step in the real application of an algorithm. New training samples are added to the training set of the previous stages, and then we retrain the neural network; this is a type of on-line learning. The learning rate and the momentum coefficient are determined in cross-learning by the smaller average and deviation of the convergence time.