South Africa has been classified as one of the most violent and dangerous countries in the world. The two elements that push South Africa high in crime rankings are its rates of social violence and homicide. Business Insider reported that South Africa is among the 15 most violent nations on earth. By 1995, South Africa was rated second highest in terms of murder. The crime rate declined for some years but has risen again in recent years. Because of the social violence and crime rates in South Africa, foreign investors are no longer interested in starting or continuing business in the country, and its economy is declining as a result. South Africa's government is looking for solutions to the crime problem in order to redeem the country's image, lower its crime ranking, and boost investor confidence. Many traditional approaches to data analysis have been applied in crime-related studies in South Africa, but the machine learning approach has not been adequately considered. The police and many other agencies that deal with crime hold large databases that can be used to analyze or predict criminal activity across the provinces of South Africa. This research aims to offer a solution to the problem by building a model that can predict crime. The machine learning approach is used to extract useful information from the crime data of South Africa's nine provinces. A crime prediction system that can analyze and predict crime is proposed. To accomplish this, South Africa crime data on 27 crime categories were obtained from the popular data repository "Kaggle." Several data analytics steps were applied to preprocess the datasets, and a machine learning algorithm (linear regression) was used to build a predictive model to analyze the data and predict future crime.
The appropriate authorities and security agencies in South Africa can have insight into the crime trends and alleviate them to encourage the foreign stakeholders to continue their businesses.
The causes of high crime rates in South Africa were linked to factors including the low standard of education, alcohol abuse, a lack of social and vocational skills, poor housing and living conditions, and a lack of parenting skills [
With a significant increase in crime across the nations, it has become necessary to analyze crime data to reduce the crime rate. This helps the police, other security agencies, and citizens to take required actions and unravel the crimes faster. Yearly, enormous data is generated by the police and other law enforcement organizations, and analyzing these data to execute the decision to prevent future crime is the main issue. Performing the analyses of the data will facilitate the recognition of the features responsible for the increase in crime and important steps to curb the crimes. Data mining processes involve evaluating and examining large data such as South Africa crime datasets at Kaggle [
There is a need for an innovative system and new crime analytics methods for protecting South African communities from crime. By using data mining methods shown in Figure
Cross-industry standard process for data mining (CRISP-DM).
Crime category.
Number | Category |
---|---|
1 | All theft not mentioned elsewhere |
2 | Arson |
3 | Assault with the intent to inflict grievous bodily harm |
4 | Attempted murder |
5 | Bank robbery |
6 | Burglary at nonresidential premises |
7 | Burglary at residential premises |
8 | Carjacking |
9 | Commercial crime |
10 | Common assault |
11 | Common robbery |
12 | Driving under the influence of alcohol |
13 | Drug-related crime |
14 | Illegal possession of firearms and ammunition |
15 | Malicious damage to property |
16 | Murder |
17 | Robbery at nonresidential premises |
18 | Robbery at residential premises |
19 | Robbery in cash transit |
20 | Robbery with aggravating circumstances |
21 | Sexual offenses |
22 | Sexual offenses as a result of police action |
23 | Shoplifting |
24 | Stock-theft |
25 | Theft of motor vehicle and motorcycle |
26 | Theft out of or from motor vehicle |
27 | Truck hijacking |
Province area (km2) [
Rank | Province | Area (km²) | Percentage of total area (%) |
---|---|---|---|
1 | Northern Cape | 372,889 | 30.5 |
2 | Eastern Cape | 168,966 | 13.8 |
3 | Free State | 129,825 | 10.6 |
4 | Western Cape | 129,462 | 10.6 |
5 | Limpopo | 125,755 | 10.2 |
6 | North West | 104,882 | 8.6 |
7 | KwaZulu-Natal | 94,361 | 7.7 |
8 | Mpumalanga | 76,495 | 6.3 |
9 | Gauteng | 18,178 | 1.5 |
Total | South Africa | 1,220,813 | 100.0 |
Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology is considered for this work. CRISP-DM is very efficient and suitable for data mining projects. It has been widely used for data mining research in the literature. CRISP-DM steps are described below and depicted in Figure
This study aims to build a predictive model that can analyze the existing South Africa crime data, detect hidden patterns, and generate useful information that can be communicated to the government and/or security agencies to make timely decisions on how to curb crime in the country.
South Africa crime data obtained from the Kaggle repository is used for this work. The activities carried out at this stage include data description, data exploration, and verification of data quality.
The crime dataset was organized and made ready for data analytics. Data selection, data cleansing, data construction, and data integration were performed using the Python library Scikit-learn (sklearn). Some attributes in the comma-separated values (CSV) files contain string values as well as numeric values.
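The preparation steps above can be illustrated with a minimal sketch. It uses pandas rather than Scikit-learn for brevity, which is an assumption of this example and not necessarily the authors' pipeline. The CSV excerpt, the column names (`Province`, `Category`, the year columns), and the helper `prepare` are all hypothetical and do not reproduce the Kaggle data.

```python
import io

import pandas as pd

# Hypothetical excerpt: column names and values are illustrative only,
# not taken from the Kaggle dataset.
RAW_CSV = """Province,Category,2005-2006,2006-2007
Gauteng,Murder,100,110
Gauteng,Arson,20,
Western Cape,Murder,90,95
"""


def prepare(csv_text: str) -> pd.DataFrame:
    """Clean a crime table that mixes string and numeric attributes."""
    df = pd.read_csv(io.StringIO(csv_text))
    year_cols = [c for c in df.columns if c not in ("Province", "Category")]
    # Data cleansing: coerce year columns to numbers; missing or
    # unparsable entries become NaN and are then replaced by 0.
    df[year_cols] = df[year_cols].apply(pd.to_numeric, errors="coerce").fillna(0)
    # Data construction: encode string attributes as integer codes so that
    # a regression model can consume them.
    for col in ("Province", "Category"):
        df[col + "_code"] = df[col].astype("category").cat.codes
    # Data integration: total reported crimes per row across all years.
    df["Total"] = df[year_cols].sum(axis=1)
    return df


prepared = prepare(RAW_CSV)
```

Replacing missing counts with 0 is only one possible choice; dropping incomplete rows or imputing a mean may suit the real data better.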
This is a crucial stage of the data mining process, in which machine learning algorithms are applied to the prepared data to analyze it and to create predictive models that forecast future outcomes from the hidden patterns in the data. The activities of this stage include selecting suitable modeling techniques, that is, the appropriate machine learning algorithm for building the predictive models; generating a test design to check model quality and validity; building the model by running the modeling tool on the prepared dataset to create one or more models; interpreting the model according to domain knowledge, success criteria, and the desired test design; and ensuring the accuracy and generality of the model.
In the execution of the linear regression of some dependent variable
This equation is the regression equation where
This function detects the inputs and output dependencies. The estimated or predicted response,
In this work, linear regression, a machine learning algorithm, was used to build the crime predictive models. Linear regression is a predictive modeling method in which the target variable to be estimated is continuous. The model was implemented in Python using the Scikit-learn modules. Scikit-learn is an efficient data mining tool built on the NumPy, SciPy, and matplotlib modules of Python. Linear regression in Scikit-learn allows studying relationships between two continuous (quantitative) variables: one variable, denoted by
from sklearn import linear_model
import statsmodels.api as sm
regr = linear_model.LinearRegression()
regr.fit (
The performance of a linear regression model is evaluated by the mean of the squared errors or deviations (i.e., the difference between the value estimated from the features and the observed value of the target variable). The difference of actual responses
The degree to which the model meets the project objective is assessed at this stage. After the models are assessed, the generated model that meets the objective of the project is selected.
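As a rough sketch of how the modeling and evaluation stages fit together, the following example fits a Scikit-learn linear regression and scores it on held-out data using the mean squared error and the coefficient of determination. The data are synthetic stand-ins generated for illustration. The feature names `population` and `density`, the coefficients 0.03 and 50.0, and the 75/25 split are assumptions of this sketch and are not results from the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 60 province-year records.
rng = np.random.default_rng(0)
population = rng.uniform(1e6, 1.5e7, size=60)
density = rng.uniform(3, 700, size=60)
X = np.column_stack([population, density])
# Illustrative linear relationship plus noise; the coefficients are made up.
y = 0.03 * population + 50.0 * density + rng.normal(0, 1000, size=60)

# Test design: hold out 25% of the records to check generality.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model building: ordinary least squares fit on the training split.
regr = LinearRegression().fit(X_train, y_train)

# Evaluation: mean squared error and R^2 on the held-out split.
y_pred = regr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```

Scoring on a held-out split rather than on the training data gives a fairer indication of whether the model generalizes to unseen records.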
Strategies for deploying the evaluation results, including the final report, are determined at this stage.
The visualization of 27 crime categories, trends in crimes, and all the results from the linear regression implementation are provided in this section. Linear regression machine learning predictive technique has been widely used in the literature for building predictive models [
2005–2016 South Africa population and crime statistics. (a) Population statistic. (b) Crime statistic.
Number of police stations and number of reported crimes in South African Provinces (2005–2016). (a) Number of police stations per province. (b) Total crimes per province.
Crime category reported in South African provinces (2005–2016).
All province total crime trends.
The trends of the 27 crime categories per province in 2005–2016 are depicted in Figure
Trends in the 27 crime categories per province in 2005–2016.
A data visualization technique known as the Word Cloud (Figure
Word cloud of the crime category in (a) Gauteng, (b) KwaZulu-Natal, (c) Western Cape, and (d) Eastern Cape.
A machine learning model was built using linear regression (with the existing data on crime, population, area, and density) to predict future crime occurrence. Multicollinearity among features can be identified by doing Feature-Feature correlation analysis. In linear regression, the input variables should not be multicollinear, that is, dependent on each other. The heatmap shown in Figure
Correlation heatmap.
Correlation gradient.
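The feature-feature correlation check behind such a heatmap can be sketched as follows. The data are synthetic, and the column `Population_copy` is deliberately constructed as a near-duplicate of `Population` to show what a multicollinear pair looks like. The column names and the 0.9 threshold are assumptions of this example, not values from the study.

```python
import numpy as np
import pandas as pd

# Synthetic features (assumption); "Population_copy" is a deliberately
# redundant column used to demonstrate multicollinearity.
rng = np.random.default_rng(1)
population = rng.uniform(1e6, 1.5e7, size=50)
features = pd.DataFrame(
    {
        "Population": population,
        "Area": rng.uniform(1.8e4, 3.7e5, size=50),
        "Population_copy": population * 1.01 + rng.normal(0, 1e3, size=50),
    }
)

# Pearson correlation matrix: the numbers a correlation heatmap would colour.
corr = features.corr()

# Flag feature pairs whose absolute correlation exceeds an assumed threshold.
THRESHOLD = 0.9
cols = list(corr.columns)
collinear_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > THRESHOLD
]
```

One feature from each flagged pair would typically be dropped before fitting, since strongly correlated inputs make the fitted coefficients unstable.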
Series of experiments were carried out and the regression results are shown in Figure
Linear regression results.
Crime prediction with linear regression approach.
Target =
Crime_Number =
Crime_Number =
Linear regression minimizes the residual sum of squares between the observed target values and the values predicted by the linear approximation.
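For reference, the least-squares criterion can be written in its standard textbook form. The symbols below ($y_i$ for the observed responses, $x_{ij}$ for the features, $b_j$ for the estimated coefficients, $n$ observations, and $r$ features) are assumed notation for this sketch and may differ from the notation used elsewhere in the paper.

```latex
f(\mathbf{x}_i) = b_0 + b_1 x_{i1} + \dots + b_r x_{ir},
\qquad
\min_{b_0, \dots, b_r} \; \sum_{i=1}^{n} \bigl( y_i - f(\mathbf{x}_i) \bigr)^2
```

Dividing the minimized sum by $n$ gives the mean squared error commonly used to evaluate the fitted model.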
From the illustration of linear regression results in Figure
Machine learning techniques can effectively detect valuable hidden patterns in crime data, give good visualization for crime prediction, and thus provide support for crime prevention in South Africa. Crime data analytics can extract unknown, vital information from raw data and thus help the government speed up the process of resolving crime. It would enable the appropriate government authorities to gain a better understanding of crime trends and to mitigate them. When crime is prevented and the environment is peaceful, foreign investors are happy to continue their businesses in South Africa, and economic growth is sustained. This work presents a predictive model that is trained on crime data and takes population and density as inputs to predict the total crime of any province of South Africa. An extension of this work will seek information on other crime factors from the South African police authority and build a predictive model that considers those factors.
South Africa crime data are obtained from Kaggle,
The authors declare no conflicts of interest.
Ibidun Christiana Obagbuwa and Ademola P. Abidoye contributed equally to the manuscript. Ibidun Christiana Obagbuwa contributed to generation of ideas, design, implementation, literature review, and writing of the paper. Ademola P. Abidoye contributed to implementation, literature review, and paper writing.
The authors would like to appreciate the support of Sol Plaatje University for this research. Furthermore, the authors are thankful to Kaggle for the availability of South Africa crime data (2005–2016) and the notebooks.