In recent years, data mining studies have been successfully conducted to estimate parameters in a variety of domains. Data mining techniques have attracted the attention of the information industry and of society as a whole because of the large amount of available data and the pressing need to turn it into useful knowledge. However, the effective use of data is still under development in some areas. Sports is one such area: its use of data has grown slightly in recent years, and many sports organizations have begun to see that there is a wealth of unexplored knowledge in the data they collect. Therefore, this article presents a systematic review of sports data mining. For the years 2010 to 2018, 31 studies were found on this topic. Based on these studies, we present the current panorama, themes, datasets used, proposals, algorithms, and research opportunities. Our findings provide a better understanding of the potential of sports data mining and should motivate the scientific community to explore this timely and interesting topic.
The advent of computing has produced a society that feeds on information. Most of the information is in its raw form, known as data [
Data mining is therefore one of the most effective ways to extract knowledge from large data volumes: it discovers patterns and generates rules for predicting and comparing data, which helps institutions make decisions with a greater degree of confidence [
Historically, the technique of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, data archaeology, and data pattern processing [
Data mining is used to generate knowledge in many scientific, industrial, and mainly business sectors [
Hence, connoisseurs and experts have dedicated themselves to predicting and discussing sporting results. With a large amount of data available (especially since the advent of the Internet), it was natural for statisticians and computer scientists to show interest in discovering patterns and making predictions from these data. Processing sports data with data mining techniques can not only reduce manual workload and errors, but also improve the fairness and development of sports games, assisting coaches and managers in predicting results, assessing players’ performance, identifying talent, defining sporting strategy, and, above all, making decisions [
Based on these premises, this article performs a Systematic Literature Review (SLR) to identify studies that use data mining in the sports field, describe the techniques and algorithms applied, and point out possible research opportunities. Moreover, the current panorama of research, its temporal distribution, and the themes, datasets, and proposals of these works are presented.
This paper is organized as follows: Section
This study used the systematic review method which, according to Brereton et al. [
The planning stage addressed the definition of the scientific questions, the specification of the intervention of interest, the identification of databases, the definition of keywords, the search strategies, the inclusion and exclusion criteria, and the quality of the articles [
RQ1: What is the current overview of the research?
RQ2: What are the most used techniques?
RQ3: What is the temporal distribution of the works?
RQ4: What are the most cited research papers?
RQ5: What are the datasets?
RQ6: What were the results of the studies?
RQ7: What are the algorithms and methods applied?
RQ8: What are the research opportunities?
Usually, inclusion, exclusion, and quality criteria are determined after the definition of research questions [
Inclusion, exclusion, and quality criteria.
Criteria | Description |
---|---|
Inclusion | Studies in English |
Inclusion | Article, conference, or methodology paper |
Inclusion | Studies relevant to sports data mining techniques |
Exclusion | Studies that do not meet the quality criteria |
Exclusion | Studies outside the context of this work |
Exclusion | Studies written in a language other than English |
Exclusion | Studies published before 2010 |
Quality | Studies with full results |
Quality | Studies with different proposals/results |
Aspects of this process may include decisions about the type of revisions that should be included in the research, which is used to manage the selection criteria in a subset of primary studies [
Finally, the method selected to search these databases was Boolean retrieval. Essentially, it partitions the search space, identifying the subset of documents in a collection that satisfies the query criteria [
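To make the idea of Boolean retrieval concrete, the following minimal sketch evaluates an AND/OR/NOT query over a toy collection of titles. The titles, the query terms, and the function name `boolean_search` are illustrative assumptions; they are not the search strings or tools used in this review.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def boolean_search(documents, all_of=(), any_of=(), none_of=()):
    """Return the indices of the documents that satisfy a Boolean query.

    all_of  -> every term must occur (AND)
    any_of  -> at least one term must occur (OR); ignored when empty
    none_of -> no term may occur (NOT)
    """
    hits = []
    for index, text in enumerate(documents):
        tokens = tokenize(text)
        if not all(term in tokens for term in all_of):
            continue
        if any_of and not any(term in tokens for term in any_of):
            continue
        if any(term in tokens for term in none_of):
            continue
        hits.append(index)
    return hits


# Hypothetical titles, used only to illustrate the partition of the collection.
titles = [
    "Data mining for predicting basketball results",
    "Machine learning in sports training",
    "Data mining in retail sales",
    "A survey of mining equipment safety",
]

# (data AND mining) AND (sports OR basketball)
matches = boolean_search(
    titles, all_of=("data", "mining"), any_of=("sports", "basketball")
)
print(matches)
```

The query splits the collection into the documents that match and those that do not, which is the behavior described above; ranking by relevance is deliberately out of scope for this sketch.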
This stage involves five steps:
Figure
Systematic review research flow.
It uses BibTeX files (the bibliographic format used in LaTeX documents) to perform these analyses. Therefore, these files were also exported from the aforementioned databases. It is important to note that the BibTeX files were exported without any filter, which explains the number of records returned.
Thereafter, works published before 2010 were eliminated, leaving 1172 titles (410 articles rejected). Nevertheless, in order to refine the search and eliminate articles outside the scope of this review, the titles, keywords, and abstracts were carefully analyzed according to the exclusion criteria (see Table
Finally, after preselection, the data were synthesized so that the works could be evaluated against the stated quality criteria. Of the 36 preselected articles, 5 were eliminated because they did not describe the methods or techniques applied, leading to a final set of 31 articles with relevant information about data mining in sports.
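The screening steps described above (year filter, scope analysis, quality check) can be sketched as a small filtering pipeline. This is a minimal illustration under stated assumptions: the `Record` fields, the `screen` function, and the example records are hypothetical and do not reproduce the tool or the data used in this review. Only the counts 36, 5, and 31 come from the text.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One bibliographic record; every field here is a hypothetical example."""
    title: str
    year: int
    in_scope: bool         # outcome of reading title, keywords, and abstract
    reports_methods: bool  # quality check: methods or techniques are described


def screen(records, first_year=2010):
    """Apply the year, scope, and quality filters in that order.

    Returns the surviving records and the number left after each step.
    """
    after_year = [r for r in records if r.year >= first_year]
    after_scope = [r for r in after_year if r.in_scope]
    final = [r for r in after_scope if r.reports_methods]
    counts = {
        "after_year": len(after_year),
        "after_scope": len(after_scope),
        "final": len(final),
    }
    return final, counts


# Hypothetical records, not taken from the searched databases.
records = [
    Record("Predicting match results", 2014, True, True),
    Record("Early sports statistics", 2008, True, True),
    Record("Mining retail transactions", 2016, False, True),
    Record("Sports analytics position paper", 2017, True, False),
]
selected, counts = screen(records)
print(counts)

# Counts reported in the text: 36 preselected articles, 5 removed for quality.
reported_final = 36 - 5
print(reported_final)
```

Keeping the count after each step makes the selection auditable, which is the purpose of reporting intermediate totals in a systematic review.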
In this section, the results of this SLR are presented. Each subsequent subsection answers the research questions posed in the planning stage.
The works selected by this SLR are presented in Table
Selected papers (IDs and references).
ID | Reference | ID | Reference |
---|---|---|---|
1 | [ | 17 | [ |
2 | [ | 18 | [ |
3 | [ | 19 | [ |
4 | [ | 20 | [ |
5 | [ | 21 | [ |
6 | [ | 22 | [ |
7 | [ | 23 | [ |
8 | [ | 24 | [ |
9 | [ | 25 | [ |
10 | [ | 26 | [ |
11 | [ | 27 | [ |
12 | [ | 28 | [ |
13 | [ | 29 | [ |
14 | [ | 30 | [ |
15 | [ | 31 | [ |
16 | [ |
Therefore, to provide an overview of the thematic types proposed in the articles, they were categorized into nine classes: Motion Analysis; Performance Evaluation; Sports Data Capture; Generating Eating Plans; Training Planning; Strategic Planning; Predicting Results/Patterns; Sports Data Analytics; and Decision-Making Support. For a better understanding of these categories Table
Thematic types.
Theme | Articles |
---|---|
Motion Analysis | [ |
Performance Evaluation | [ |
Sports Data Capture | [ |
Generating Eating Plans | [ |
Training Planning | [ |
Strategic Planning | [ |
Predicting Results/Patterns | [ |
Sports Data Analytics | [ |
Decision-Making Support | [ |
The distribution of papers among the aforementioned categories is shown in Table
A word cloud is shown in Figure
Frequently occurring words in articles under review.
The size of a word in the cloud reflects how often it appears in the reviewed papers. As can be seen in the figure, words related to data mining and machine learning are the largest, which suggests that these are the techniques most often used to build intelligent computational systems in the sports field.
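A word cloud is driven by word frequencies, so the underlying computation can be sketched in a few lines. The texts, the stop-word list, and the function name `word_frequencies` below are illustrative assumptions and are not the corpus or the tool used to produce the figure.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "on", "with"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count how often each non-stop-word occurs across all texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


# Hypothetical snippets standing in for article abstracts.
abstracts = [
    "Data mining for sports data",
    "Machine learning and data mining in sports",
    "Learning to predict results",
]
frequencies = word_frequencies(abstracts)

# In a word cloud, the font size of each word grows with its count.
print(frequencies.most_common(3))
```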
Analyzing the temporal distribution of the included articles, we observed that 2014 and 2016 had the highest number of publications, with 6 articles each (12 in total, 38.71% of the works reviewed). The years 2010, 2013, and 2017 together account for the same total (12 articles, 38.71%), with 4 works each. The year 2012 had 3 articles (9.68%), whereas 2015 and 2018 had 2 articles each (4 in total, 12.90%). This distribution is shown in Figure
Temporal distribution of the works.
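The percentages above follow directly from the yearly counts and the total of 31 articles. The short check below recomputes them, reading the figures for 2015 and 2018 as 2 articles in each year. The zero for 2011 is an inference from the fact that the text reports no articles for that year; the variable and function names are illustrative.

```python
# Articles per year as reported in the text; 2011 is inferred to be zero
# because no article is reported for that year.
articles_per_year = {
    2010: 4, 2011: 0, 2012: 3, 2013: 4, 2014: 6,
    2015: 2, 2016: 6, 2017: 4, 2018: 2,
}
total = sum(articles_per_year.values())


def share(years):
    """Percentage of all reviewed articles published in the given years."""
    count = sum(articles_per_year[year] for year in years)
    return round(100 * count / total, 2)


print(total)
print(share((2014, 2016)))
print(share((2010, 2013, 2017)))
print(share((2012,)))
print(share((2015, 2018)))
```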
In total, the top 5 articles contributed 133 citations related to sports data mining, as can be seen in Table
Top 5 articles according to citation frequency.
Article | Journal/Conference | Year | Citations |
---|---|---|---|
[ | Journal of Sports Science & Medicine | 2013 | 39 |
[ | Intelligent Systems and Informatics | 2010 | 31 |
[ | Computational and Business Intelligence | 2013 | 26 |
[ | Procedia Computer Science | 2014 | 23 |
[ | IFAC Proceedings Volumes | 2012 | 14 |
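As a quick consistency check, the total of 133 citations and the share of each of the five articles can be recomputed from the table. The venue names, years, and citation counts come from the table; the variable names are illustrative.

```python
# (venue, year, citations) as listed in the table of most cited articles.
top_cited = [
    ("Journal of Sports Science & Medicine", 2013, 39),
    ("Intelligent Systems and Informatics", 2010, 31),
    ("Computational and Business Intelligence", 2013, 26),
    ("Procedia Computer Science", 2014, 23),
    ("IFAC Proceedings Volumes", 2012, 14),
]

total_citations = sum(citations for _, _, citations in top_cited)

# Share of the top-5 total contributed by each article, in percent.
shares = {
    venue: round(100 * citations / total_citations, 1)
    for venue, _, citations in top_cited
}
print(total_citations)
print(shares)
```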
This section presents the proposals or results as well as datasets used by the selected works in Tables
Proposals/results of the works.
Paper | Proposal/Result |
---|---|
[ | Survey of the strategies most used for the recognition and classification of human movement patterns. |
| |
[ | Analysis of sports skills data using time-series image data retrieved from films, focused on table tennis. |
| |
[ | Guide the athletes on how to improve their performance and how to eliminate errors related to the selection of the proper running strategy through the differential evolution algorithm. |
| |
[ | Proposes a new clustering algorithm based on ant colony optimization. |
| |
[ | Proposed an information extraction system to obtain data frames from multiple sports performance documents. |
| |
[ | Automatic generation of optimal food plans for athletes, through the particle swarm optimization algorithm. |
| |
[ | Proposed an automated personal trainer. |
| |
[ | Solution for automatic planning of training sessions. |
| |
[ | A new solution capable of adapting training plans. |
| |
[ | A framework to automatically analyze the physiological signals monitored during a test session. |
| |
[ | Implementation of artificial intelligence routines for automatic evaluation of exercises in weight training. |
| |
[ | Presented three geometric/temporal features of pen trajectories used in a cognitive skills training application for elite basketball players. |
| |
[ | A data mining algorithm for soccer tactics using association rule mining. |
| |
[ | Discussed the application of the association rule mining in sports management, especially, in cricket. |
| |
[ | Presented a relational-learning based approach for discovering strategies in volleyball matches based on optical tracking data. |
| |
[ | A generalized predictive model for predicting the results of the English Premier League. |
| |
[ | A data analysis to identify the important aspects that separate skilled golfers from poor ones. |
| |
[ | Compared the performance of algebraic methods to some machine learning approaches, particularly in the field of match prediction. |
| |
[ | A sports data mining approach, which helps discover interesting knowledge and predict results from sports games such as college football. |
| |
[ | Data mining techniques for predicting basketball results in the NBA (National Basketball Association). |
| |
[ | Developed a tool COP (Cricket Outcome Predictor), which outputs the win/loss probability of a match. |
| |
[ | Classify players into regular or All-Star players from the National Basketball Association and identify the most important features that make an All-Star player. |
| |
[ | Designed and built a big data analytics framework for sports behavior mining and personalized health services. |
| |
[ | Provides a prediction model of sports results based on knowledge discovery in database. |
| |
[ | A machine learning system with unsupervised learning and supervised learning components to analyze chess data. |
| |
[ | Concluded that the most important elements in basketball are two-point shots under the arch and defensive rebound. |
| |
[ | A data mining approach for classification and identification of golf swing from weight shift data. |
| |
[ | Describes machine learning techniques that assist cycling experts in the decision-making processes for athlete selection and strategic planning. |
| |
[ | Predict match outcomes in the 2015 Rugby World Cup. |
| |
[ | Presented a visualization system that uses statistics and movement analysis. Basically, the type of pattern of attack and play can be understood dynamically and visually. |
| |
[ | Conducted a study on a decision support system for techniques and tactics in sports. |
Datasets.
Paper | Dataset |
---|---|
[ | - |
| |
[ | Moving images of 15 male college tennis players. |
| |
[ | Data of Laguna Poreč half-marathon (2017) and Ormož half-marathon (2017). |
| |
[ | A set of sports performance data of college students. |
| |
[ | - |
| |
[ | Training plan generated by an Artificial Sports Trainer and a list of potential nutrition. |
| |
[ | - |
| |
[ | Used exercise datasets for training. |
| |
[ | Sports training plans generated by an Artificial Sport Trainer. |
| |
[ | Training and competition data of a cycling athlete. |
| |
[ | Used sensors attached to various exercise equipment, allowing the collection of characteristics during the workout. |
| |
[ | - |
| |
[ | Football match data from the European Cup 2008 final (Spain vs Germany). |
| |
[ | Data of matches played by India. |
| |
[ | The data from the FIVB Volleyball World Championships finals that were held in Poland and Italy in 2014. |
| |
[ | Data from 2005 to 2016, spanning 11 seasons of the English Premier League. |
| |
[ | Data from 275 male golfers. |
| |
[ | Website data |
| |
[ | Real-life statistical data from |
| |
[ | A dataset comprising 778 games from the regular part of the 2009/2010 NBA season. |
| |
[ | Data from One Day International (ODI) matches during the time period 2001-2015 for each team - |
| |
[ | An NBA men's basketball dataset covering the period 1937 to 2011, publicly available at Open Source Sports. |
| |
[ | - |
| |
[ | - |
| |
[ | Data from 500 games from each of the 10 grandmaster chess players, a total of 5000 chess games. |
| |
[ | Data from the First B basketball league for men in Serbia, from seasons 2005/06, 2006/07, 2007/08, 2008/09 and 2009/2010. |
| |
[ | Weight shift data from golf experiments conducted by the research team. |
| |
[ | Competition results for senior riders including the Australian Championships 2009, World Championships 2007–2010, UCI World Cup Melbourne 2010, UCI World Cup Cali 2011, UCI World Cup Beijing 2011, UCI World Cup Manchester 2011 and Oceania Championships 2010. |
| |
[ | History of statistical data, ranking, and points of 20 rugby teams. |
| |
[ | American football game data. |
| |
[ | - |
It is observed in Table
Here, we present a brief description of the data mining techniques used by the surveyed papers in this SLR. Five main techniques were identified: classification, clustering, association, regression, and heuristics.
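Of the five techniques, association is perhaps the easiest to illustrate compactly: a rule such as "high possession implies win" is scored by its support and confidence. The sketch below computes both measures in plain Python. The match records, the item names, and the function names are hypothetical and are not taken from any reviewed paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    matching = sum(1 for transaction in transactions if itemset <= set(transaction))
    return matching / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        raise ValueError("the antecedent never occurs in the transactions")
    joint_support = support(transactions, set(antecedent) | set(consequent))
    return joint_support / antecedent_support


# Hypothetical match records: each set lists the events observed in one match.
matches = [
    {"home", "high_possession", "win"},
    {"home", "win"},
    {"away", "high_possession", "win"},
    {"away", "loss"},
    {"home", "high_possession", "loss"},
]

rule_support = support(matches, {"high_possession", "win"})
rule_confidence = confidence(matches, {"high_possession"}, {"win"})
print(rule_support, rule_confidence)
```

Association rule mining algorithms search for all rules whose support and confidence exceed chosen thresholds; this sketch only scores a single, hand-picked rule.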
Furthermore, Table
Classification of thematic types by techniques.
Theme | Classification | Clustering | Association | Regression | Heuristic |
---|---|---|---|---|---|
| [ | - | - | - | - |
| |||||
| - | [ | - | - | [ |
| |||||
| [ | - | - | - | - |
| |||||
| - | - | - | - | [ |
| |||||
| [ | [ | [ | - | [ |
| |||||
| [ | - | [ | - | - |
| |||||
| [ | [ | [ | - | - |
| |||||
| [ | [ | - | - | [ |
| |||||
| [ | [ | [ | [ | - |
This section explores the algorithms used in the reviewed works, as in Table
Algorithms.
Paper | Algorithm |
---|---|
[ | Artificial Neural Networks, Statistical Classifiers and Hidden Markov Models. |
| |
[ | C4.5, Random Forest and Naive Bayes Tree. |
| |
[ | Differential Evolution. |
| |
[ | K-means, Ant Colony Optimization. |
| |
[ | Naive Bayes. |
| |
[ | Particle Swarm Optimization. |
| |
[ | - |
| |
[ | Bat Algorithm. |
| |
[ | Particle Swarm Optimization. |
| |
[ | K-means. |
| |
[ | Artificial Neural Networks. |
| |
[ | AISReact. |
| |
[ | Association Rule Mining Algorithms. |
| |
[ | Association Rule Mining Algorithms. |
| |
[ | Inductive Logic Programming |
| |
[ | Gaussian Naive Bayes, Support Vector Machine and Random Forest. |
| |
[ | Random Forest and Classification and Regression Trees. |
| |
[ | Linear Algebra Methods, Artificial Neural Networks and Random Forest. |
| |
[ | Decision Tree, Artificial Neural Networks and Support Vector Machine. |
| |
[ | Naive Bayes, Decision Tree, Support Vector Machine and K-Nearest Neighbors. |
| |
[ | Naive Bayes, Support Vector Machine and Random Forest. |
| |
[ | Random Forest. |
| |
[ | K-means. |
| |
[ | Artificial Neural Networks. |
| |
[ | Hierarchical Clustering. |
| |
[ | Artificial Neural Networks. |
| |
[ | Particle Swarm Optimization, Support Vector Machine, C4.5. |
| |
[ | Bayesian Belief Networks, Naive Bayes and K-means. |
| |
[ | Random Forest. |
| |
[ | - |
| |
[ | - |
It is important to emphasize that the reviewed works cited other algorithms for sports data mining. However, Table
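K-means is among the clustering algorithms listed for several of the reviewed works. As an illustration only, the following sketch implements Lloyd's algorithm in plain Python and groups hypothetical players by two statistics. The data, the function name `kmeans`, and the choice of the first `k` points as initial centroids (made here for determinism) are assumptions; the reviewed papers may have used different implementations and initializations.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Cluster points with Lloyd's algorithm.

    The first k points are used as initial centroids, an assumption
    made so that the example is deterministic.
    """
    centroids = [tuple(point) for point in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical players described by (points per game, rebounds per game).
players = [
    (5.0, 2.0), (25.0, 9.0), (6.0, 3.0),
    (24.0, 10.0), (4.0, 2.5), (26.0, 8.0),
]
centroids, labels = kmeans(players, k=2)
print(centroids)
print(labels)
```

On this toy data the algorithm separates low-scoring from high-scoring players after a single update; real datasets usually need feature scaling and several random restarts.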
Topics related to sports data mining are relevant. However, this domain still has several branches to be explored. Thus, in Table
Sports modalities or application field.
Modalities/Field | Articles |
---|---|
| [ |
| [ |
| [ |
| [ |
| [ |
| [ |
| [ |
| [ |
| [ |
| [ |
| [ |
| [ |
| [ |
According to Table
Finally, it is important to emphasize that the sports domain presents several other subjects to which data mining can be applied. Nevertheless, the objective of this section was to report themes and to highlight some hypotheses for future studies, based on the reviewed papers.
In recent years, sports data mining has evolved. Consequently, many sports organizations have noticed that there is a wealth of unexplored knowledge in the data they collect. Even a small additional insight into the variables can be decisive, increasing the competitive advantage of teams over their rivals. In other words, data mining brings a higher degree of professionalism and reliability to sports. Therefore, this article covered papers published from 2010 to 2018 and available in relevant databases. The review questions considered methods, information, and applications. Thereby, the current panorama, temporal distribution, themes, datasets used, proposals, and results of the reviewed works were presented. Moreover, techniques, algorithms, methods, and research opportunities were reported.
As a result, we found 31 relevant articles, which were separated into nine thematic types; the highest publication frequency occurred in 2014 and 2016. We also presented the most cited articles, their datasets, and their results. Regarding data mining techniques, classification was the most applied. Finally, possible areas to be explored were reported, such as swimming, athletics, hockey, boxing, fencing, and tennis. We expect this article to provide an important source of knowledge for future research and to encourage new studies.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The authors thank the Federal University of Technology, Paraná (UTFPR, Grant: April/2018), and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brasil (CAPES), Finance Code 001, for their financial support.