Detecting Illegal Online Gambling (IOG) Services in the Mobile Environment

Despite the extensive ramifications of illegal online gambling (IOG) services, actions taken by government authorities have had little effect in halting these operations. In order to reduce the prevalence of IOG, the ability to detect malicious uniform resource locators (URLs) is crucial. Text mining and binary classification have been widely adopted to detect and prevent spam short message services (SMSs), but government authorities and various task forces that monitor and regulate gambling also rely on the analysis of malicious URLs. This study proposes a novel system to analyse the characteristics of spam URLs, offering a method that can assist government agencies combatting mobile IOG sites.


Introduction
Despite the gambling market being one of the most regulated industries around the world, recent advancements in telecommunication technology have allowed illegal gambling to flourish online [1][2][3]. According to UNODC [4], 80% of sports and racing betting worldwide is illegally operated, with an estimated value of between 340 billion and 1.7 trillion USD. Most racing bets have wager limits, and past studies have focused on the effect of wagering limits on payouts and losses [5]. Moreover, harm-reduction strategies such as customer messaging have been considered by examining four Australian online sports and racing wagering sites [6]. Unlike authorized platforms, IOG sites do not impose limits on betting. Moreover, regulating these sites has become difficult as they must first be detected and then accurately identified.
Recent studies have found that gaming disorders show aetiological pathways into problematic gambling [7,8], while gambling has been associated with the misuse of substances such as alcohol and nicotine in adolescents [9]. Internet gambling disorder is included in the diagnostic and statistical manual (DSM-5) for mental disorders, with detrimental ramifications for adolescents [10]. Common anxiety disorders such as social anxiety, depression, and loneliness have also been positively associated with gaming in adolescents [11]. The number of games introducing randomly-generated in-game rewards has increased throughout the past decade alongside the number of platforms such as mobile game markets, consoles, and PCs [12,13]. Consequently, loot boxes, virtual items that produce various rewards through a game of chance, have been banned in various nations such as the Netherlands and Belgium.
IOG sites rely on marketing to attract users, often using a "recommendation" system in which new members are invited by original members. However, as this method cannot bring in a large number of customers and illegal services cannot be advertised publicly, IOG sites also use smartphone applications to send out text messages. IOG organizers gather or purchase contact information to invite random users to their platforms. The Korea Internet and Security Agency (KISA) works with smartphone manufacturers and mobile communication companies to provide Android users with a reporting system through which people can report illegal spam messages sent over short message service (SMS) or multimedia messaging service (MMS). In South Korea, approximately ten million spam SMSs are reported by KISA each year, and approximately 50% of these are confirmed to be related to illegal gambling. As some people are not aware of these reporting systems or fail to report spam SMSs, the actual number of messages is likely much larger. These messages severely affect the safety of the online environment and, therefore, must be researched so that they can be effectively blocked by relevant authorities. Although it may not be possible to obstruct all spam messages, authorities must still investigate their patterns, content, and features to develop technologies capable of, for example, extracting URLs. While authorities have already taken actions against many illegal gambling houses, illegal operators are willing to risk continuing due to record profits [14].
This study proposes a system based on artificial intelligence to sort illegal gambling messages from reported suspicious messages with a detection accuracy rate of 97%. Moreover, this study finds that illegal messages exhibit several patterns, including modifications to URLs that stop them from being filtered automatically. By reversing such patterns, the URL information can be reconstructed, and it will be easier for IOG websites to be automatically reported and taken down. As a result of our investigation, we suggest technologies to identify illegal gambling SMSs from reported spam and extract URL information from illegal gambling websites. We believe that this method represents a considerable contribution toward automating the process of classifying and blocking illegal sites, thus helping to keep our online environment safer. We further believe that our proposed methods can form the basis of new safeguards for government agencies, citizens, and the gambling industry against various illegal operations.

Background
The Council of Europe Convention on the Manipulation of Sports Competitions, better known as the "Macolin Convention," defines illegal gambling as "Any sports betting activity whose type or operator is not allowed under the applicable law of the jurisdiction where the consumer is located" [15].
This definition interprets illegal gambling widely, meaning that the same situation might be judged differently in different countries.
IOG websites mainly target people in countries where online gambling is illegal [16]. Broadly speaking, there are two types of IOG: (i) games and (ii) sports gambling. Online Live Casinos, Web Board Games, Internet reel games, and Power Ball are all illegal in South Korea, with some other illegal games such as ladder rides, snail games, and Mario probability games specifically targeting young people. In online sports gambling, users wager on the outcomes of sporting events, such as horse or cycling races. Examples of IOG are presented in Figure 1.

Issues with Illegal Gambling.
Several key issues have arisen as more and more jurisdictions worldwide allow and regulate online gambling. These countries take steps to keep online gambling responsible, such as a dedicated budget for addiction centers and limits on betting amounts [17]. Illegal operators, however, encourage users to bet large amounts frequently and avoid paying taxes.
As illegal gambling is not subject to laws, users might not receive their winnings. In addition to financial fraud, illegal gambling causes several social problems [15]:
(i) Illegal gambling enables money laundering and organized transnational crime
(ii) Match fixing poses a challenge to the dignity of sports
(iii) Illegal gambling causes gambling disorders and related social problems
The disorders mentioned above can be observed in legal gambling but are more severe in illegal gambling (Table 1) [15]. Table 1 is reproduced from Asia Racing Federation 2018. In addition, there is concrete evidence worldwide that illegal gambling contributes to a higher incidence of problems than legal betting.
Illegal gamblers are more likely to be at-risk, moderate-risk, or problem gamblers and less likely to be nonproblem gamblers than those who gamble legally. As a result, problem gambling is more common among people who gamble illegally online, resulting in issues such as depression, alcohol and drug abuse, family breakup, debt, and suicide [18][19][20][21].
In general, illegal gamblers are able to place larger wagers than legal gamblers. At the minimum, the lack of any limitations on gambling activity in illegal environments can spur and worsen the issues of excessive gamblers. Hence, it is necessary to identify IOG websites and block them for social good.

Comparison of Illegal Gambling across Different Nations.
As shown in Table 2, illegal gambling is prevalent, especially in Asia [15]. Table 2 is reproduced from Asia Racing Federation 2018. South Korea constitutes more than 60% of illegal gambling in the world.

Negative Effects of Illegal Gambling on Adolescents.
Owing to behavioral and emotional immaturities, children are vulnerable to gambling issues through social pressure and advertisements [22]. In several high-income nations, the increased availability of legal gambling has led to an increase in underage gambling and gambling disorders in young people [23]. The increase in the number of online video games with probability-based items has reduced the resistance of many adolescents to gambling since 2000. New levels of exposure to illicit gambling sites have created an environment where teenagers, who spend a considerable amount of time on the Internet, are easily influenced. Although teenage gambling is illegal in most countries, the incidence of problem gambling in adolescents is higher than that seen in adults [24].

Process for Blocking Illegal Online Gambling Sites.
The Korea Racing Authority (KRA), the sole racing authority in Korea, investigates IOG operations alongside other government agencies such as the National Gambling Control Committee (NGCC), a national organization that oversees gambling-related public institutions, and the Korea Communication Standard Commission (KCSC), a public institution that screens various illegal websites such as gambling, pornography, and financial fraud. Reporting an IOG site requires evidence such as a URL or screenshots of the IOG site. These pieces of evidence are collected from the KRA and NGCC and then transferred to the KCSC, which reviews the sites and then notifies Internet Service Providers (ISPs). The KCSC requires a three-week window to verify these flagged sites. Illegal sites are the most common evidence that the government can use against perpetrators in subsequent legal action. The process of blocking IOG sites is as follows:
Step #1: crawling URLs associated with IOG
Step #2: collecting suspicious URLs and any supporting evidence
Step #3: submitting URLs and evidence to KCSC
Step #4: KCSC review to verify designation
Step #5: URLs verified to be illegal forwarded to the ISP
Step #6: URLs blocked by ISP
The essential part of the first step is collecting a list of suspicious URLs by sorting through reported sites. The list serves as evidence for cybercrime and allows the KCSC to address criminal activities. These organizations have been collecting IOG data for a considerable period, but it is still difficult to find the sites automatically through Google and SNS platforms. Hence, the data must be collected manually, which is extremely time consuming and allows IOG operators to effectively circumvent enforcement by continuously closing and reopening sites with new URLs. As a result, enabling timely prosecution is now a vital focus for researchers. In this study, we attempt to offer a faster solution.

Defining the Spam SMS.
Spam is defined as any unwanted message sent to a user for commercial gain or simply to cause detriment or discomfort [25]. Another definition, taken from the official KISA website, describes spam as promotional information that has been provided without the agreement of recipients. Spam SMSs include messages that are sent to mobile phones for advertisement purposes, which can range from legal but nonessential information to severely illegal content [26]. These regulatory definitions fall under the purview of criminal law in South Korea, and offenders thus face fines and imprisonment. Based on these definitions, the SMS activities below are considered illegal in South Korea:
(i) Advertisement without agreement from the recipient
(ii) Advertisement between 21:00 and 08:00
(iii) Advertisement that does not clearly identify itself as an "advertisement"
(iv) Advertisement for illegal goods or services

Spam SMS about Illegal Online Gambling.
Spam messages containing the term "gamble" are illegal in South Korea, where all accredited legal gambling is operated by the government, but they remain a common tool for IOG platforms. In these messages, the URL is modified to avoid filters, as shown in Table 3.
IOG spam exhibits the following features:
(i) URLs are presented in an abnormal form to avoid smartphone and application filters
(ii) URLs are easily legible to people but not to detection systems
(iii) Messages employ terminology that obscures the illegality of the advertised service
To extract URLs, it is necessary to understand several conversion conditions used with the messages.
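To make the idea of "conversion conditions" concrete, the following is a minimal Python sketch of reversing a few common URL obfuscations. The rules shown here are hypothetical illustrations, not the rules used in this study, and `restore_url` and `CONVERSION_RULES` are names introduced only for this example.

```python
import re

# Hypothetical conversion rules for illustration only; the rule set used in
# the study is not reproduced here. Each rule maps an obfuscation pattern
# to its plain-text equivalent.
CONVERSION_RULES = [
    (re.compile(r"[\u200b\u200c\u200d]"), ""),  # zero-width characters
    (re.compile(r"\s*[\[\(\{]\s*(?:dot|\.)\s*[\]\)\}]\s*", re.IGNORECASE), "."),  # "[.]", "(dot)"
    (re.compile(r"\s+dot\s+", re.IGNORECASE), "."),  # " dot "
    (re.compile(r"^hxxp", re.IGNORECASE), "http"),  # "hxxp://"
]

# Full-width ASCII variants (U+FF01..U+FF5E) map onto U+0021..U+007E.
FULLWIDTH_TO_ASCII = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}


def restore_url(obfuscated: str) -> str:
    """Apply each conversion rule in order to recover a machine-readable URL."""
    text = obfuscated.strip().translate(FULLWIDTH_TO_ASCII)
    for pattern, replacement in CONVERSION_RULES:
        text = pattern.sub(replacement, text)
    return text


print(restore_url("hxxp://example[.]com"))
print(restore_url("example (dot) com"))
```

Keeping the rules in an ordered list makes it straightforward to add language-specific or country-specific rules later without changing the function itself.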

Related Work.
Data mining approaches such as supervised classification have been employed to detect spam or illegal content in the past [27]. Cascading Style Sheets (CSS) are often used to detect specific page layouts, and prior studies have used SVM techniques and map-reduce algorithms to detect spam emails [28]. Akbari and Sajedi [29] introduced GentleBoost, an algorithm for SMS spam detection, that achieves high accuracy with minimum storage consumption.
Recent studies that detect spam include CNN-based filtering with deep learning [30][31][32]. Spam filtering based on sentimental analysis using SentiWordNet has also been proposed [33]. Various other spam filtering methods are discussed in academic literature, such as similarity-based corpus and Wikipedia link-based spam filtering [34].
Various machine learning models have also been utilized to detect and classify malicious URLs [35]. Yan et al. [36] proposed an unsupervised learning algorithm that trains URL embedding models, an approach that far exceeded the performance of other algorithms such as SVM, DT, LR, NB, and CNN. The accuracy of deep learning methods was far higher than that of conventional machine learning methods when utilizing binary classification to filter spam messages [37].
Liu et al. studied "spear phishing" (targeted phishing efforts) and promotional SMS from a security point of view [38]. Our own previous study on illegal gambling utilized a readable transformation technique (RTT) [39].

Research Design and Methods
We propose a system for classifying messages based on the characteristics identified earlier and then extracting and converting IOG URLs. In order to identify the ideal NLP approach, this study uses real data from spam messages to test binary classification algorithms.
Several studies have classified spam SMSs using machine learning. Nagwani and Sharaff proposed the use of ML algorithms such as Naïve Bayes (NB), support vector machine (SVM), non-negative matrix factorization, and latent Dirichlet allocation to identify spam [40], while Almeida et al. suggested text normalization [41]. Fattahi and Mejri applied natural language processing (NLP) techniques, namely, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) to identify spam SMSs [42]. Choudhary and Jain applied random forest (RF) classification algorithms [43]. Sethi et al. compared NM, RF, and logistic regression (LR) algorithms [44].
NLP aids in the detection, extraction, and interpretation of particular information from text and is widely used in web search engines and in products such as Apple's Siri and Google Translate. For NLP in English text, our study refers to the Natural Language Toolkit (NLTK) documentation. For Korean text, we referred to KoNLPy, an open-source library designed for Korean language text mining. There are five morphological analyzers in KoNLPy: okt, mecab, komoran, kkma, and hannanum.
Examples of the okt options are presented as follows:
(i) okt.morphs() splits text based on the morpheme
(ii) okt.nouns() extracts nouns from the text
(iii) okt.phrases() extracts word segments
Other analyzers have similar options. The next step is a feature vectorizer. Typical examples of vectorizers include the following:
(i) CountVectorizer: a vectorizer that counts the number of occurrences of each word in each text
(ii) TfidfVectorizer: a vectorizer that counts words in each text and rescales the counts with TF-IDF weights so that terms distinctive of particular messages are emphasized
(iii) HashingVectorizer: a vectorizer that uses a hash function to increase the processing speed of the CountVectorizer
This study obtained data from KISA, including suspected spam SMSs reported by smartphone users from 2020 and legal SMSs, such as nonbeneficial public advertisements. SMSs contain 160 or fewer characters. After removing duplicates, 30,527 unique messages were tested. Messages with slight differences, such as one letter or number, were included.
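The three vectorizers can be illustrated with a minimal scikit-learn sketch. The toy English messages below are invented stand-ins for tokenized Korean SMS text, and the `n_features` value is an arbitrary assumption chosen for this example.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Toy English messages standing in for tokenized Korean SMS text (invented).
messages = [
    "free bonus casino join now",
    "casino bonus today only",
    "join casino win big bonus",
    "your parcel arrives today",
    "meeting moved to monday",
    "lunch today at noon",
]

count_vec = CountVectorizer()                    # raw term counts
tfidf_vec = TfidfVectorizer()                    # counts rescaled by TF-IDF
hash_vec = HashingVectorizer(n_features=2**10)   # hashed columns, no vocabulary

X_count = count_vec.fit_transform(messages)
X_tfidf = tfidf_vec.fit_transform(messages)
X_hash = hash_vec.transform(messages)            # stateless, so no fit is needed

print(X_count.shape, X_tfidf.shape, X_hash.shape)
```

With KoNLPy installed, a callable such as `okt.morphs` could presumably be supplied through the vectorizers' `tokenizer` argument in place of the default whitespace-and-punctuation tokenization; that wiring is an assumption and is not shown here.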
This study labeled gambling-related messages (14,334 messages) as "Class 1" and nongambling-related messages (16,193 messages) as "Class 2." This study began with preprocessing, which deleted words and phrases that commonly appear at the beginning of Korean messages, such as "sent from web" or "advertisement." The experiments were then designed to have three parts.
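A minimal sketch of this kind of preprocessing is given below. The English markers, the bracket styles around them, and the helper names are assumptions made for illustration; the actual Korean strings and cleaning rules used in the study are not reproduced here.

```python
import re

# Hypothetical English stand-ins for the leading boilerplate markers; the
# bracket styles are also an assumption.
BOILERPLATE = re.compile(
    r"^\s*(?:\[sent from web\]|\(advertisement\))\s*", re.IGNORECASE
)

GAMBLING, NON_GAMBLING = 1, 2  # "Class 1" and "Class 2"


def strip_boilerplate(message: str) -> str:
    """Remove leading boilerplate markers, repeating until none remain."""
    previous = None
    while previous != message:
        previous = message
        message = BOILERPLATE.sub("", message)
    return message.strip()


def deduplicate(messages):
    """Drop exact duplicates while keeping the first occurrence and the order."""
    return list(dict.fromkeys(messages))


raw = [
    "[sent from web] (advertisement) Join now",
    "[sent from web] (advertisement) Join now",
    "(advertisement) Join now!",
]
cleaned = [strip_boilerplate(m) for m in deduplicate(raw)]
print(cleaned)
```

Only exact duplicates are removed, so near-duplicates that differ by a single character are kept, in line with the description of the data set.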

Deciding Parameter of Vectorizers.
This study used three vectorizers (TfidfVectorizer, CountVectorizer, and HashingVectorizer). A criterion algorithm was set, and the parameters for each vectorizer were determined. Then, the performances of the vectorizers were compared.

Deciding KoNLPy and Matching Options.
To determine the KoNLPy analyzer and its options for the experiment, representative analyzers such as okt, mecab, hannanum, and kkma were tested with each option using the RF algorithm. From the next experiment onward, okt and mecab were used in view of their performance and speed.

ML Algorithm, KoNLPy, and Matching Options.
All ML algorithms were tuned using a hyperparameter tuning process based on the GridSearchCV function. The search ranges were taken from the values found in this study's pilot test. The "Training set" and "Test set" were randomly chosen based on a 3:1 ratio, and each class was split at the same ratio. Cross validation was performed four times.
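The tuning setup described here (a 3:1 split, four-fold cross validation, and a grid search) can be sketched with scikit-learn as follows. The toy data, the choice of `LinearSVC`, and the `C` grid are assumptions made for illustration and are not the values from the pilot test.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 16 messages per class.
gambling = [f"casino bonus bet win round {i}" for i in range(16)]
ordinary = [f"meeting schedule parcel delivery notice {i}" for i in range(16)]
texts = gambling + ordinary
labels = [1] * 16 + [2] * 16

# 3:1 train/test split with class proportions preserved (stratified).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# Hypothetical search range; the study's ranges came from its pilot test.
param_grid = {"svm__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=4, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Placing the vectorizer inside the pipeline means it is refitted on each training fold, so the held-out fold does not leak vocabulary statistics into training.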

Proposed Detection System
We propose two automatic detection systems to identify IOG websites using spam SMS, as shown in Figure 2.
Vectorizers for NLP and feature extraction are selected and configured depending on the language. The algorithm then produces optimized modeling with hyperparameter tuning. After applying samples of spam to the model, SMSs can be classified and extracted. The proposed system uses the morphs option of the mecab analyzer together with the SVM algorithm chosen for this study.
As described in Section 2.6, classified illegal gambling messages exhibit repeated patterns in the ways that they obscure URLs. The extraction and conversion process can be seen as a recovery operation that creates an accessible form of each URL. This study used more than 250 conversion rules to interpret the characters; detailed examples have been provided in the Appendix. As conversion rules can differ based on the language and legal requirements of the country, further collection and analysis are required. The resulting URL is tested through an alive check process to confirm whether it is active. If the alive check is positive, the URL is treated as an active IOG website, and screen capture functions can be used to report it.
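An alive check of this kind could be sketched with the Python standard library as shown below. The use of a HEAD request, the status-code range, and the `is_alive` helper are assumptions made for illustration; the study does not specify how its alive check is implemented, and screen capture is not covered.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_alive(
    url: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 5.0,
) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status.

    `opener` is injectable so the check can be exercised without network
    access; by default it is urllib.request.urlopen.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts;
        # ValueError covers malformed URLs.
        return False
```

Some servers reject HEAD requests, so a fallback to GET may be needed in practice. A call such as `is_alive("http://example.com")` performs a real network request, so results depend on connectivity at the time of the check.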

Parameters of the Vectorizer.
To filter spam SMSs, this study used the RF algorithm [16], which was also used in previous studies, as the criterion algorithm. Each parameter was determined through experiments. Random forest, an ensemble learning method for classification and regression, works by training a large number of decision trees. For classification tasks, the random forest's output is the class chosen by the majority of trees. For regression tasks, the mean prediction of the individual trees is returned [45,46].
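The two aggregation rules can be written in a few lines of Python. This is only a sketch of the aggregation step, assuming the per-tree predictions are already available; the function names are introduced here for illustration.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the mean of the individual tree predictions."""
    return mean(tree_predictions)


print(forest_classify([1, 2, 1, 1, 2]))
print(forest_regress([1.0, 2.0, 3.0]))
```

Library implementations may differ in detail. For example, scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes.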
Each parameter was manually increased (ngram_range and min_df were adjusted in units of 1, and max_df was adjusted in units of 0.1), and the value yielding the best performance was selected. The parameters that exhibit the best performance are as follows:
(i) TfidfVectorizer: ngram_range = (1, 4), min_df = 3, max_df = 0.9
(ii) CountVectorizer: ngram_range = (1, 2), min_df = 3, max_df = 0.9
(iii) HashingVectorizer: ngram_range = (1, 2)
The details of the manual search are omitted here. The outcomes of the vectorizers based on the RF algorithm are listed in Table 4. The F1-score is made up of two components, precision and recall, which it combines into a single number as their harmonic mean; it was created to work well with unbalanced data. Looking at the F1-score and accuracy results, it is clear that CountVectorizer performs best among the vectorizers under the current best-performing parameters.
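As a worked example of how precision and recall combine into the F1-score, the following sketch computes all three from confusion-matrix counts. The counts are invented for illustration and are not results from Table 4.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and
    false negative counts, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: 90 gambling messages caught, 10 false alarms, 30 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier cannot reach a high F1-score by maximizing precision alone or recall alone.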

Deciding KoNLPy and the Matching Options.
Ten combinations of representative KoNLPy analyzers (okt, mecab, hannanum, and kkma) and each option (morphs, nouns, and phrases) were used for tests in this study (Table 5). The loading and execution times for 100K characters, drawn from the official website of KoNLPy, are shown in Table 6. The okt, kkma, and mecab analyzers exhibited excellent accuracy, but kkma was very slow, as indicated in Table 6. As a result, this paper performed experiments using okt and mecab.

ML Algorithm, KoNLPy, and the Matching Options.
The experiments presented in Sections 5.1 and 5.2 were processed using the RF algorithm, whereas the following experiment was processed with the three vectorizers (TfidfVectorizer, CountVectorizer, and HashingVectorizer) and five combinations of KoNLPy analyzers and options that were decided in advance. Four ML algorithms (linearSVM, rbfSVM, LR, and RF) were then added to the experiment. The main objective of the SVM algorithm is to find a line or plane that separates data of different classes with the largest margin. As such, the algorithm finds the optimal linear decision boundary, or hyperplane, that linearly separates the data. The kernel SVM technique maps data that might otherwise be difficult to separate linearly into a high-dimensional feature space and classifies it there. The RBF kernel used by rbfSVM is one kernel type that is known to perform well.
Logistic regression (LR) is a traditional statistical model. It fits a linear function of the input features and passes the result through the logistic function to model the probability that an observation belongs to a class. RandomForest is described in Section 5.1 as an ensemble learning algorithm. Overall, 60 cases (3 × 4 × 5) were tested. The combinations of KoNLPy and matching options are listed in Table 7.
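The 60 cases follow from crossing three vectorizers, four algorithms, and five tokenizer settings, which can be enumerated as in the sketch below. The five tokenizer names are an assumption, taken to be the same analyzer and option pairs named in the later performance comparison; Table 7 remains the authoritative list.

```python
from itertools import product

vectorizers = ["TfidfVectorizer", "CountVectorizer", "HashingVectorizer"]
algorithms = ["linearSVM", "rbfSVM", "LR", "RF"]
# Assumed tokenizer settings (see Table 7 for the actual combinations).
tokenizers = [
    "okt.morphs",
    "okt.nouns",
    "okt.phrases",
    "mecab.morphs",
    "mecab.nouns",
]

# Every (vectorizer, algorithm, tokenizer) triple is one experimental case.
cases = list(product(vectorizers, algorithms, tokenizers))
print(len(cases))
```

Each triple would then be handed to a training and evaluation routine, so the full grid grows multiplicatively with every option added.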
The outcomes of the study with four algorithms, including the RF, are given as follows.
The results from the TfidfVectorizer are depicted in Figure 3. The x-axis consists of the various algorithms and vectorizers, and the y-axis exhibits accuracy. The top three combinations were as follows:

Security and Communication Networks
With TfidfVectorizer, it is apparent that the overall performance of the random forest algorithm is lower than that of the other three algorithms. The accuracy results of the CountVectorizer are illustrated in Figure 4. The top three results are as follows:
(i) okt.morphs and rbfSVM: 97.78% accuracy
(ii) mecab.morphs and linearSVM: 97.75% accuracy
(iii) mecab.morphs and rbfSVM: 97.72% accuracy
Even with CountVectorizer, it is apparent that the random forest algorithm has a lower overall performance than the other three algorithms. The accuracy results of the HashingVectorizer are shown in Figure 5. The top three combinations by accuracy are listed as follows:
(i) mecab.morphs and rbfSVM: 97.96% accuracy
(ii) mecab.morphs and linearSVM: 97.91% accuracy
(iii) mecab.morphs and logistic regression: 97.89% accuracy
Among the 60 experimental outcomes, the case where okt.morphs and linearSVM were applied with TfidfVectorizer yielded the best performance. Several algorithms tested in this study classified more illegal gambling SMSs correctly than the RF algorithm. The proposed detection system selects an optimized model by continuously comparing performances to discover the best vectorizer that works with NLP and matching options. The process of finding optimized parameters for vectorizers and algorithms requires considerable time, as shown in Table 8. Therefore, the speed of the process, the purpose of the vectorizers, and the matching options selected should be considered when choosing between models.

Performance Comparison of the Algorithms.
The purpose of the experiment in this section is to comprehensively examine each algorithm together with the KoNLPy analyzer and option (tokenizer) matched to it. This experiment can be seen as an extension of the experiment in Section 5.3 and was conducted with the TfidfVectorizer, which showed the highest performance on an accuracy basis. The experiment in this section was conducted with a total of seven algorithms. MLP and boosting algorithms were added. MLP was added for the purpose of applying neural networks, and MLP classifiers were utilized. Boosting algorithms are machine learning ensemble techniques that combine several sequential weak learners to improve prediction or classification performance. The boosting algorithm applied in this experiment is AdaBoost. A total of five KoNLPy and option combinations (Okt.morphs, Okt.nouns, Okt.phrases, Mecab.morphs, and Mecab.nouns) were used. The GridSearchCV() function was utilized to find the optimal parameters. In sklearn, the GridSearchCV function identifies the best parameters by sequentially trying the hyperparameter values supplied for a classification or regression algorithm and measuring the resulting performance. The cv (cross-validation) option was set to 4, and scoring was set to accuracy.
As shown in Table 9, no combination of KoNLPy/option (tokenizer) and algorithm clearly demonstrates outstanding performance in this experiment. However, the F1-score indicates that each algorithm has a tokenizer that produces good performance. Generally speaking, the Okt.morphs and Mecab.morphs tokenizers perform well. However, Mecab.nouns performs best in RF, and Okt.nouns performs best in AdaBoost. Therefore, it is important to select algorithms and KoNLPy options that achieve optimal speed as a part of the detection system we propose in the next study.

Conclusion
This study developed technologies to extract URL information and automatically classify messages reported as spam. While spam messages have many attributes that make them readily identifiable to human recipients, it has been difficult to rapidly detect gambling-related messages amongst other spam.
First, this study classified 30,527 messages collected by the KISA from 2020 into gambling- and nongambling-related groups for experiments. Then, NLP was used to extract features, and various ML algorithms and hyperparameter tuning (GridSearch) were used to find optimized parameters. To solve the paper's initial problem, this study finally proposed a novel extraction model that yielded 97% accuracy, which implies that the detection technology could provide even higher accuracy when analyzing a mixture of spam and normal messages in real-world conditions. The proposed technologies can replace current methodologies, which are typically dependent on manual reporting, to quickly and precisely classify approximately 27,000 spam messages that are sent to KISA each day. In particular, the system proposed in this paper can provide a URL pool to quickly block illegal gambling sites based on compiled spam SMS activities. Moreover, our study was able to effectively reduce the time required to detect and block IOG sites, which is the key to stopping operators who evade enforcement by changing their URLs frequently.
This work provides a cornerstone for future researchers interested in detecting illegal gambling and other problematic content that employs spam mass marketing. In the future, we plan to identify optimal parameters (such as the number of hidden layers) centered on DNN and continue research on methods to improve performance. The results of these experiments are limited to text-based data, so further investigation is needed for image-based spam messages. The proposals presented here may be adopted by ISPs, government agencies, or licensed racing regulators in any country. While this study targeted illegal gambling, the proposed technologies can also be applied to any other field that detects illegal content, such as adult content or illegal loans.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.