This paper proposes a novel clustering method for analyzing National Incident-Based Reporting System (NIBRS) data. The method includes determining the correlation between different crime types, developing a likelihood index for crimes to occur in a jurisdiction, and clustering jurisdictions based on crime type. The method was tested on the 2005 assault data from 121 jurisdictions in Virginia. The analyses of these data show that some crime types are correlated with one another and that some crime parameters are correlated with particular crime types. The analyses also show that certain jurisdictions within Virginia share certain crime patterns. This information assists with constructing a pattern for a specific crime type and can be used to determine whether a jurisdiction may be more likely to see this type of crime occur in its area.
The National Incident-Based Reporting System (NIBRS) is a crime reporting program for local, state, and federal law enforcement agencies that provides a wealth of incident-level data for use in analysis. It is part of the Uniform Crime Reporting (UCR) Program, which is administered by the FBI. The UCR Program provides a nationwide view of crime based on data submitted through state programs or directly to the national UCR Program and has been operational for around 70 years. The NIBRS was implemented in the late 1970s to meet law enforcement needs for the 21st century. This vast system houses information on offenses, victims, offenders, property, and persons arrested, as well as the incident itself. The NIBRS data are well structured and readily available to researchers and law enforcement agencies to assist with understanding the intricate nature of crime.
Akiyama and Nolan [
For criminologists, the NIBRS data hold the answers to many long-standing questions about crime, criminal offending, and crime victimization. However, gaining access to some of these answers has remained difficult because of the size and complexity of the data. Effective techniques for criminal justice data, such as data mining and clustering, are of increasing importance to both the research and law enforcement communities [
Clustering categorical data poses a challenge not encountered in clustering numerical data: the attribute categories are not ordered, so it is difficult to define a metric that measures the distance between data objects in a data set. Many of the algorithms that have emerged for clustering categorical data rely on the occurrence/cooccurrence frequencies of attribute values in the data set to determine clusters of similar data objects. The basic goal is to choose a set of attribute categories that provides a summary of the data objects in a cluster. There is a wide range of clustering algorithms for categorical data, including K-modes [
Due to the lack of well-defined mathematical models and optimization goals, most existing graph theory clustering approaches cannot guarantee a proper clustering result in general cases. For example, agglomerative hierarchical clustering methods may fail to produce proper clusters of larger sizes, divisive hierarchical methods may fail to produce clusters of smaller sizes or clusters that differ greatly in size, the k-core method may produce clusters with small edge-cuts, and so forth. Many papers and articles have mentioned these problems and the frustration they cause among users (e.g., see [
The purpose of this paper is to present a novel multidimensional clustering method for the NIBRS data. We first outline a new measure, called the
The rest of the paper is organized as follows. In Section
The data sets available in the NIBRS provide a wealth of incident-level data about each reported crime. As of 2010, approximately 40 states contributed their data to this massive data set. The data and tools are made available by the University of Michigan for use by law enforcement agencies and researchers. In order to devise a manageable set of data for preliminary testing of techniques and for preliminary data analysis, only the 2005 data on assaults were explored. From the 2005 assault data, 121 jurisdictions (counties or cities) in Virginia were selected for examination. These represent all jurisdictions within Virginia with populations greater than 10,000. There were 10,183 incidents reported in these 121 chosen jurisdictions.
For this study, 21 of the 246 available NIBRS indexes were chosen. These 21 indexes were deemed important for describing the relevant characteristics of the victim(s), the offender(s), and the circumstances of each incident. The selected indexes are listed in Table
NIBRS indexes used.
Segment | Index | Number of subindexes |
---|---|---|
Victim indexes | Type of victim | 3 |
Victim indexes | Victim age | 3 |
Victim indexes | Victim sex | 2 |
Victim indexes | Victim race | 5 |
Victim indexes | Victim ethnicity | 3 |
Victim indexes | Victim residence status | 3 |
Victim indexes | Aggravated assault/homicide circumstances | 10 |
Victim indexes | Type of injury | 6 |
Offender indexes | Offender age | 3 |
Offender indexes | Offender sex | 3 |
Offender indexes | Offender race | 5 |
Additional indexes | Injury | 2 |
Additional indexes | Juvenile | 1 |
Additional indexes | Violent crime | 1 |
Additional indexes | Juvict | 1 |
Additional indexes | Multiple victims | 1 |
Additional indexes | Multiple offenders | 1 |
Additional indexes | Multiple offenders and victims | 1 |
Additional indexes | Multiple offenders and one victim | 1 |
Additional indexes | One offender and multiple victims | 1 |
Additional indexes | One offender and one victim | 1 |
In order to facilitate the selected analysis techniques, the data were expanded from one column with many possible entries to multiple columns that each contain zero or one. For example, the Offender Segment index for the sex of the offender has the possible entries of male, female, or unknown. This index column was split into three individual columns, where an entry in the three columns of (1 0 0) means female, (0 1 0) means male, and (0 0 1) means unknown. All created columns were binary (0/1) columns that were used to help classify the characteristics of the incident. From the expansion of the original 21 indexes, 57 binary columns were created. This led to the creation of a 121 × 57 Crime Data Matrix, where each row
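The expansion of one categorical index into binary columns can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the sample incidents are hypothetical, and summing the binary rows of all incidents in a jurisdiction to obtain that jurisdiction's counts is an assumption, not something stated in the text.

```python
# Category order follows the (female, male, unknown) coding described in the text.
OFFENDER_SEX_CATEGORIES = ["female", "male", "unknown"]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unexpected category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def count_by_category(incident_values, categories):
    """Sum the one-hot rows of all incidents in one jurisdiction.

    Assumption: a jurisdiction's entry for a binary column is the number of
    its incidents that have a 1 in that column.
    """
    totals = [0] * len(categories)
    for value in incident_values:
        for j, bit in enumerate(one_hot(value, categories)):
            totals[j] += bit
    return totals


# Hypothetical incidents for a single jurisdiction.
incidents = ["male", "male", "female", "unknown"]
encoded = [one_hot(v, OFFENDER_SEX_CATEGORIES) for v in incidents]
jurisdiction_counts = count_by_category(incidents, OFFENDER_SEX_CATEGORIES)
```

Applying the same expansion to every selected index would yield the 57 binary columns described above.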
Normalization of the rows of the matrix was completed by dividing each row entry by the population of the jurisdiction. This gave a per person rate for each crime parameter or each crime type. Normalization of the columns was completed by averaging the columns and subtracting the average from each entry in the column. Then each entry in the column was divided by the vector length of the column. Equation (
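The two normalization steps can be sketched in plain Python as below. The function name and the sample numbers are hypothetical, and the sketch assumes that the "vector length" of a column is the Euclidean length taken after the column mean has been subtracted.

```python
import math


def normalize_matrix(counts, populations):
    """Normalize a jurisdictions-by-columns matrix of counts.

    Row step: divide each row by the jurisdiction's population (per person rate).
    Column step: subtract the column mean, then divide by the Euclidean length
    of the centered column (assumed meaning of "vector length").
    """
    rates = [[x / pop for x in row] for row, pop in zip(counts, populations)]
    n_rows, n_cols = len(rates), len(rates[0])
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [rates[i][j] for i in range(n_rows)]
        mean = sum(column) / n_rows
        centered = [x - mean for x in column]
        length = math.sqrt(sum(x * x for x in centered))
        for i in range(n_rows):
            # A constant column has zero length; leave it as zeros.
            normalized[i][j] = centered[i] / length if length > 0 else 0.0
    return normalized


# Hypothetical counts for three jurisdictions and two columns.
counts = [[2, 10], [4, 10], [6, 40]]
populations = [1000, 1000, 2000]
normalized = normalize_matrix(counts, populations)
```

After this step every non-constant column has zero mean and unit length, so the dot product of two columns equals the cosine of the angle between them.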
This section explains several different analyses that were performed on the data in the matrix described above in order to attempt to answer the research questions listed in Section
The motivation for comparing the different crime parameters to crime types is to determine whether some characteristics indicate the likelihood of a crime type occurring in a certain jurisdiction. Each crime type may have factors that contribute to that crime appearing in a certain place. An overall increase in crime in an area may or may not correlate with an increase in any one particular type of crime, say hate crime, in that area. However, there may be individual parameters whose increase indicates an increase in a particular type of crime. For example, if the number of juvenile offenders rises in a certain area, this may indicate that hate crimes will also rise in that area. Crime type vectors were also compared against each other in a similar way. For example, the juvenile gang vector was compared against the hate crime vector to see whether these crime types are correlated.
In order to determine the relationship between the
For this comparison of two column vectors, the variation in two vectors must be transformed to eliminate the effects of mean differences. Once the mean deviation is determined, then the correlation can be determined by the cosine of the angle between the vectors. As an example, let
In doing this comparison, with one vector fixed and comparing it to all other vectors and itself, we form a row vector of size 57. Consider the example that compares the hate crime vector with all other vectors. We construct another vector that we refer to as the Hate Crime Character Vector of size 57, which contains all the cosine
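The comparison described above, the cosine of the angle between mean-deviation vectors with one column held fixed, can be sketched as follows. The function names and the toy columns are hypothetical; in the paper's setting `columns` would hold the 57 columns of the Crime Data Matrix and `fixed` the position of the hate crime column.

```python
import math


def mean_deviation(vector):
    """Subtract the mean so that mean differences do not affect the comparison."""
    mean = sum(vector) / len(vector)
    return [x - mean for x in vector]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def correlation(u, v):
    """Correlation coefficient as the cosine between mean-deviation vectors."""
    return cosine(mean_deviation(u), mean_deviation(v))


def character_vector(columns, fixed):
    """Correlate the column at index `fixed` with every column, itself included."""
    return [correlation(columns[fixed], column) for column in columns]


# Toy data: three columns observed over four jurisdictions.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
result = character_vector(columns, fixed=0)
```

The entry at the fixed position is always 1, since a vector is perfectly correlated with itself.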
Hate crime correlation coefficient vector values for other crimes.
Crime type | Hate crime correlation coefficient ( |
---|---|
Argument | 0.1325 |
Assault on law enforcement officers | 0.0371 |
Drug dealing | 0.0265 |
Gangland (organized crime involvement) | 0.1565 |
Juvenile gang | |
Lovers’ quarrel | |
Other felony involved | |
Other circumstances | 0.0961 |
Unknown circumstance | 0.1356 |
Table
Hate crime correlation coefficients for age of victim.
Age of victim | Hate crime correlation coefficient ( |
---|---|
Age < 18 | |
18 ≤ Age ≤ 60 | 0.0540 |
Age > 60 | 0.0474 |
By using the correlation coefficients, a 57-dimensional vector, two data analyses can be performed, each of which indicates the likelihood of a particular crime for each jurisdiction. The first one is a numerical index, called the
Let
Likelihood index and clustering for hate crime.
Rank | Name | Cluster | Hate crime likelihood index |
---|---|---|---|
1 | NEWPORT NEWS | | 0.867 |
2 | NORFOLK | | 0.837 |
3 | CHESAPEAKE | | 0.810 |
4 | GREENSVILLE | | 0.762 |
5 | PORTSMOUTH | | 0.759 |
6 | RICHMOND | | 0.716 |
7 | WYTHE | | 0.681 |
8 | ALEXANDRIA | | 0.678 |
9 | BRISTOL | | 0.667 |
10 | NEW KENT | | 0.618 |
11 | FAUQUIER | | 0.531 |
12 | ROANOKE | A | 0.439 |
13 | RICHMOND | A | 0.386 |
14 | WILLIAMSBURG | A | 0.377 |
15 | VIRGINIA BEACH | | 0.318 |
16 | CHARLOTTESVILLE | A | 0.312 |
17 | PETERSBURG | A | 0.280 |
18 | SPOTSYLVANIA | A | 0.226 |
19 | HOPEWELL | A | 0.205 |
20 | RUSSELL | A | 0.121 |
21 | CLARKE | A | 0.121 |
22 | WINCHESTER | A | 0.118 |
23 | SUFFOLK | A | 0.111 |
24 | MARTINSVILLE | A | 0.075 |
25 | STAUNTON | A | 0.073 |
26 | GALAX | A | 0.070 |
27 | SHENANDOAH | A | 0.063 |
28 | CAROLINE | A | 0.044 |
29 | SURRY | A | |
30 | DANVILLE | A | |
To begin the clustering, a weighted complete graph with 121 vertices is formed. The weight on each edge is the correlation coefficient between the jurisdictions. The novel graph theory clustering method we proposed in [
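The graph construction step alone can be sketched as below; the clustering method itself is not reproduced here. The sketch assumes that the edge weight between two jurisdictions is the correlation coefficient of their rows in the normalized matrix, and the function names and toy rows are hypothetical.

```python
import math
from itertools import combinations


def correlation(u, v):
    """Cosine of the angle between the mean-deviation forms of u and v."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(du, dv)) / norm


def build_complete_graph(rows):
    """Return {(i, j): weight} for every pair of vertices i < j.

    Each vertex is a jurisdiction; the weight is the correlation between the
    two jurisdictions' rows (an assumption about how the weight is computed).
    """
    return {
        (i, j): correlation(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


# Toy rows for four jurisdictions.
rows = [[1, 2, 3], [2, 4, 6], [3, 2, 1], [1, 0, 1]]
edges = build_complete_graph(rows)
```

A complete graph on n vertices has n(n − 1)/2 edges, so the 121 jurisdictions give 7,260 weighted edges.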
The clustering results of 121 counties in Virginia according to hate crime.
It can be seen that of the top 30 jurisdictions with respect to hate crimes, 12 of the top 15 are in Cluster A and the others are in Cluster B. The remaining 91 jurisdictions also fall into Cluster B. The 12 counties (cities) with higher hate crime rates are shown in Figure
Similar analyses were performed with respect to other crime types: drug dealing, juvenile gang, and gangland (organized crime involvement). The results are summarized in Tables
Drug dealing correlation with other crime parameters related to victim.
Victim parameter | Drug dealing correlation coefficient |
---|---|
Type of victim | |
Individual | |
Business | 0.1721 |
Society/public | 0.1729 |
Age of victim | |
Age < 18 | 0.2933 |
Age ≥ 60 | 0.3872 |
18 ≤ age < 60 | |
Sex of victim | |
Male | |
Female | 0.4989 |
Race of victim | |
White | 0.3906 |
Black | |
Asian/Pacific Islander | 0.0764 |
Unknown | 0.0003 |
American Indian | 0.0393 |
Ethnicity of victim | |
Hispanic origin | 0.0688 |
Not of Hispanic origin | |
Unknown | −0.0616 |
Resident status of victim | |
Nonresident | 0.2631 |
Resident | |
Unknown | 0.1290 |
Drug dealing correlation with other crime parameters related to offender.
Offender parameter | Drug dealing correlation coefficient |
---|---|
Offender age | |
Age < 18 | 0.3514 |
Age | 0.3762 |
18 | |
Offender sex | |
Male | 0.5331 |
Female | |
Unknown | 0.1972 |
Offender race | |
White | 0.3145 |
Black | |
American Indian/Alaskan native | −0.0479 |
Unknown | 0.1948 |
Asian/Pacific Islander | 0.113 |
Drug dealing correlation with other crime types.
Other crime types | Drug dealing correlation coefficient |
---|---|
Argument | |
Assault on law enforcement officer(s) | 0.3117 |
Gangland (organized crime involvement) | 0.3456 |
Juvenile gang | 0.0706 |
Lovers’ quarrel | 0.2967 |
Other felony involved | 0.2955 |
Hate crime | 0.0265 |
Likelihood index and clustering for drug dealing.
Rank | Name | Cluster | Drug dealing likelihood index |
---|---|---|---|
1 | CHARLOTTESVILLE | | 0.9094 |
2 | NEWPORT NEWS | | 0.9012 |
3 | CHESAPEAKE | | 0.8851 |
4 | PETERSBURG | | 0.8637 |
5 | RICHMOND | | 0.8365 |
6 | PORTSMOUTH | A | 0.8281 |
7 | HOPEWELL | A | 0.7917 |
8 | ROANOKE | | 0.791 |
9 | SUFFOLK | | 0.7839 |
10 | NORFOLK | | 0.7756 |
11 | BRISTOL | | 0.7365 |
12 | DANVILLE | A | 0.707 |
13 | GREENSVILLE | A | 0.7032 |
14 | GALAX | A | 0.6881 |
15 | CAROLINE | A | 0.6852 |
16 | SUSSEX | A | 0.6396 |
17 | WINCHESTER | A | 0.6327 |
18 | FREDERICKSBURG | | 0.6031 |
19 | CLARKE | A | 0.6021 |
20 | FRANKLIN | A | 0.5808 |
21 | RICHMOND | A | 0.5667 |
22 | LYNCHBURG | A | 0.5395 |
23 | MECKLENBURG | A | 0.5185 |
24 | RADFORD | A | 0.5133 |
25 | GOOCHLAND | A | 0.5114 |
26 | MANASSAS | A | 0.494 |
27 | HENRY | | 0.4844 |
28 | NORTON | A | 0.4456 |
29 | WISE | | 0.4381 |
30 | WILLIAMSBURG | A | 0.395 |
Likelihood index and clustering for gangland (organized crime involvement).
Rank | Name | Cluster | Gangland likelihood index |
---|---|---|---|
1 | NEWPORT NEWS | | 0.9217 |
2 | ROANOKE | | 0.8786 |
3 | CHESAPEAKE | | 0.8713 |
4 | PETERSBURG | | 0.8421 |
5 | NORFOLK | | 0.8129 |
6 | RICHMOND | | 0.7953 |
7 | LYNCHBURG | | 0.6965 |
8 | FREDERICKSBURG | | 0.689 |
9 | BRISTOL | | 0.6787 |
10 | ALEXANDRIA | | 0.6569 |
11 | NORTHAMPTON | | 0.632 |
12 | LOUDOUN | | 0.4939 |
13 | CHARLOTTESVILLE | A | 0.4152 |
14 | STAFFORD | | 0.3853 |
15 | PORTSMOUTH | A | 0.3075 |
16 | HOPEWELL | A | 0.2724 |
17 | GALAX | A | 0.2121 |
18 | CAROLINE | A | 0.1372 |
19 | SUFFOLK | A | 0.1077 |
20 | GREENSVILLE | A | 0.0817 |
21 | CLARKE | A | 0.0689 |
22 | DANVILLE | A | 0.0519 |
23 | HENRY | A | 0.0518 |
24 | RICHMOND | A | 0.0074 |
25 | WINCHESTER | A | |
26 | FRANKLIN | A | |
27 | SUSSEX | A | |
28 | MANASSAS | A | |
29 | WILLIAMSBURG | A | |
30 | GOOCHLAND | A | |
Likelihood index and clustering for juvenile gang.
Rank | Name | Cluster | Juvenile gang likelihood index |
---|---|---|---|
1 | NORFOLK | | 0.8424 |
2 | ROANOKE | | 0.8387 |
3 | RICHMOND | | 0.7987 |
4 | PORTSMOUTH | | 0.7977 |
5 | GREENSVILLE | | 0.7575 |
6 | CHESAPEAKE | | 0.7211 |
7 | NEWPORT NEWS | | 0.7112 |
8 | WILLIAMSBURG | | 0.6639 |
9 | WYTHE | | 0.5719 |
10 | ALEXANDRIA | | 0.5597 |
11 | LYNCHBURG | | 0.5452 |
12 | MARTINSVILLE | | 0.5355 |
13 | PULASKI | | 0.4884 |
14 | HAMPTON | | 0.4848 |
15 | SPOTSYLVANIA | | 0.451 |
16 | POWHATAN | | 0.442 |
17 | PETERSBURG | A | 0.4207 |
18 | CHARLOTTESVILLE | A | 0.4064 |
19 | HOPEWELL | A | 0.3519 |
20 | RICHMOND | A | 0.208 |
21 | GALAX | A | 0.2025 |
22 | BRISTOL | A | 0.2012 |
23 | SUFFOLK | A | 0.1925 |
24 | CAROLINE | A | 0.121 |
25 | CLARKE | A | 0.0939 |
26 | WINCHESTER | A | 0.0893 |
27 | DANVILLE | A | 0.0834 |
28 | SUSSEX | A | 0.051 |
29 | CAMPBELL | A | 0.0119 |
30 | FRANKLIN | A | |
Table
Tables
Table
Each of the other crime parameters related to the victim and the offender was compared to the gangland (organized crime involvement) vector. There were no significant relationships to report from this comparison.
Table
Table
Table
The NIBRS provides a wealth of incident-level data for use in analysis. The methods investigated in this research yielded promising preliminary results. The methods were applied only to the assault data from 2005 but can easily be extended to other crime types and to other years, both to validate these results and to support longitudinal investigation.
The comparison between the crime type vector and the individual parameter vectors helped in two cases (hate crimes and drug dealing) to determine which factors were more closely related to those crimes. The different types of analyses that were conducted on the jurisdictions helped to validate one another. The likelihood index looked at whether a certain crime pattern existed in a jurisdiction, while the clustering method sought to cluster all the jurisdictions based on their crime patterns. This information could help law enforcement agencies or policy makers determine which jurisdictions share common challenges that could be addressed through cooperation and resource sharing between jurisdictions.
The next steps would be to apply this same approach to data from other states, or perhaps a larger region, to examine whether the analyses yield the same information. It will be interesting to see whether the data from Virginia and from other states show the same patterns or whether different patterns emerge. Further research and refinement of these methods should yield tools that would provide researchers, law enforcement agencies, and government officials with a means to find patterns of different crime types and possibly identify jurisdictions that may be likely to experience that type of crime.
Peixin Zhao, Marjorie Darrah, Jim Nolan, and Cun-Quan Zhang certify that there is no actual or potential conflict of interests in relation to this paper.
The first author was partially supported by the China Postdoctoral Science Foundation Funded Project (2011M501149), the Humanity and Social Science Foundation of Ministry of Education of China (12YJCZH303), the Special Fund Project for Postdoctoral Innovation of Shandong Province (201103061), the Informationization Research Project of Shandong Province (2013EI153), the National Statistical Science Project (2013LZ38), and the Independent Innovation Foundation of Shandong University (IIFSDU) (IFW12109).