Automatic Feature Engineering-Based Optimization Method for Car Loan Fraud Detection



Introduction
Car loans, with the characteristics of low threshold, small amount, high liquidity, short cycle, and so forth, have become an important part of online loans. However, the car loan business faces the following three risks: fraud risk, credit risk, and postloan risk. Fraud risk refers to the possibility that the car loan business carried out by the platform attracts fraud gangs to cheat loans. Credit risk refers to whether a single borrower who buys a car has repayment ability. Postloan risk refers to the ability of the platform to dispose of assets after a loan becomes overdue. Thus, in recent years, due to the continued increase in business volume, optimizing fraud detection, solving a series of problems in credit application fraud, financial intermediary identification, and gang monitoring or early warning, and building an antifraud cloud platform by means of artificial intelligence to improve risk control capability have become prevailing topics.
In the evolving machine learning methods, most researchers focus on using state-of-the-art technologies to solve the risk control problem. Neural networks and support vector machines [1] have shown outstanding performance in specific fraud detection tasks [2]. Yang et al. [3] proposed a "card library reconciliation-preprocessing-neural network detection" workflow to secure campus card funds from fraud, where the algorithm can detect campus cards with abnormal transactions in the system. Taha and Malebary [4] proposed the OLightGBM method on the basis of the LightGBM algorithm [5], incorporating a Bayes-based hyperparameter optimization algorithm to achieve credit card fraud detection. The detection accuracy of this method reaches 98.40% on real datasets, which include both fraudulent and legitimate transactions.
This paper studies how to improve the efficiency of fraud detection based on a real car loan dataset. We noticed that traditional fraud detection methods are often limited to batch processing of completed transactions. During the computation, the two most central stages in machine learning, feature engineering and the training process, are very time-consuming, and different algorithms have advantages and disadvantages in performance. Thus, the high time cost increases the difficulty for real-time applications to achieve accurate detection results. Focusing on this problem, Thennakoon et al. [6] proposed a real-time credit card fraud detection method. Similar methods collaborate with software to inform users through the GUI at the moment fraudulent transactions occur, which provides more time for lending institutions to take relevant measures. However, such methods only optimize the business process of fraud detection applications in engineering; the underlying artificial intelligence fraud detection methods stay unchanged. Based on a car loan dataset, we work on utilizing feature engineering to explore data relationships and improve the accuracy and interpretability of the model at the algorithm level. Currently, similar research directions comprise using multiple transformations to adjust existing features and creating new features through the association of different datasets [7], which play an important role in car loan fraud detection. In practical applications, efficiently extracting useful features from large transaction datasets is especially essential. For instance, Bahnsen et al. [8] used the von Mises distribution to analyze the periodic behavior of transaction time on the basis of a transaction aggregation strategy. The periodic features were applied to several popular credit card fraud detection models, and the results showed that the model cost decreased by 13% on average. Wedge et al.
[9] proposed a method based on automatic feature engineering to solve the false positive problem in fraud detection, which used the DFS (deep feature synthesis) algorithm to automatically extract a large number of behavioral features from historical transaction data. 237 features were constructed for each transaction, and the method learned the classifier through random forest. This method performed well on large-scale datasets and reduced the false alarm rate by 54%. Chen et al. [10] proposed a new neural structure, NFS (Neural Feature Search), in which a controller based on a recurrent neural network transforms each original feature through a series of conversion functions.
This controller is trained through reinforcement learning. The method outperforms existing automatic feature engineering methods on public datasets and greatly reduces the time cost, cutting machine learning development time by a factor of 10. It builds better predictive models and generates more meaningful features, while also providing better model performance, preventing unnecessary data leakage, and effectively extracting potentially valuable higher-order transformations, alleviating the feature explosion problem.
According to the research mentioned above, compared with traditional feature engineering, automated feature engineering can often construct features without limiting the depth, so it can take advantage of feature-based behavioral datasets in fraud detection, which are dominated by Boolean features indicating whether some behavior occurred. For example, in financial loan fraud datasets, such features record whether customers undergo a credit rating every year or repay their loans on schedule. The forecasting model is also optimized, saving costs and improving accuracy simultaneously [11].
In recent years, numerical data has become increasingly popular by virtue of being readily available and easy to handle. The financial industry mainly deals with numerical data, and many statistics are directly related to property, such as assets, costs, sanctioned loan amounts, and various ratios, which fit the characteristics of the dataset studied in this paper. Existing automatic feature engineering methods have problems such as large feature scale, complex models, long training time, and a lack of regularizing measures when processing datasets consisting mainly of numerical statistics. Compared with behavioral datasets, such numerical datasets require more time to understand, and a correct understanding of the dataset is essential for subsequent model training and related processing. The reason is that a numerically dominated dataset lacks strongly explanatory behavioral information; its information is more abstract and highly quantified. Compared with behavioral datasets, numerical datasets are therefore more difficult to predict. Meanwhile, numerical datasets in the financial sector also bring difficulty in handling missing and exceptional values and in finding deep connections between relevant features, which affects the final result. This is also an important research topic for future loan fraud detection products, with high potential engineering value.
A car loan dataset characterized mainly by numerical data is studied in this paper, and the existing automatic feature engineering methods are optimized by limiting the depth of feature generation to solve the above problems. The advantages can be summarized as follows: (1) Enhanced feature interpretability. The number of features is reduced by 92.5%, from 1,520 features originally to 114 features, so that the features can be processed faster. Hence, the model can be trained based on cognition, avoiding unexplainable abnormalities when model performance decreases due to complex features.
(2) Reduced total time cost. Although existing automatic feature engineering shortens the time spent processing features, the substantial increase in training time still carries large weight in the overall solution. The method proposed in this paper shortens the combined time of feature engineering and model training by 54.3% compared with traditional automatic feature engineering methods.
Experiment results demonstrate that the proposed automatic feature engineering method improves detection accuracy by 23% in car loan fraud detection. In addition, optimized performance can be seen in base model selection, automatic feature generation, optimization, and time cost control.

Deep Feature Synthesis.
From rule-based to feature-based, the dimensionality of machine learning in problem solving continues to increase, while the reliance on rules and experience continues to decrease. For a given dataset, feature engineering is the last step before the data enters the model. Whether it is handled properly affects the performance of the model to a large extent. Hence, there is no denying that feature engineering is a key part of any machine-learning-based data processing flow. From an engineering perspective, Amazon's chief scientist Li Mu believes that "the time spent on data for machine learning projects should account for more than 80% of the total time" [12].
Generally, there is a disproportion between positive and negative samples in fraud detection datasets, and fraud samples often occupy only a small part of the sample space. The value of feature engineering in fraud detection problems lies in amplifying the features of positive samples and detecting them more accurately. In pursuit of good engineering effects, in-depth domain knowledge and, if necessary, manual processing are required in feature engineering, which greatly affects the efficiency of smart fraud detection products. Automated feature engineering simplifies this cumbersome work, laying a foundation for optimizing the construction and deployment of machine learning models and freeing data scientists from complex feature engineering work. The involvement of automatic feature engineering, especially the invention of open-source tools like Featuretools [13], has improved the overall efficiency of smart fraud detection data products and attracted more and more attention.

Application of Automatic Feature Engineering Method in Car Loan Fraud Detection.
In this paper, the automatic feature engineering method is used in car loan fraud detection.
Around this specific application scenario, the method has the following advantages: (1) Automatic feature engineering provides a quantitative treatment of features, eliminating some meaningless feature engineering operations. (2) Automatic feature engineering provides a method to manipulate features in batches with higher efficiency, reducing the time cost significantly compared with selecting features manually. (3) Generally speaking, domain knowledge is required in feature engineering. Automatic feature engineering reduces the dependence on domain knowledge and facilitates mutual collaboration. Compared with fraud detection methods based on knowledge graphs [14], the data utilization rate of automatic feature engineering is higher. In both precision, which indicates the capacity of the model to avoid including samples from other classes in the analyzed class, and recall, which shows the capacity of the model to include all the samples that are in fact inside a class, automatic feature engineering works better than traditional feature extraction techniques. The average accuracy reaches 94.96% [15], while valuable information can additionally be extracted from fewer features.
The existing automatic feature engineering still has shortcomings, manifested in the following aspects: (1) The number of features extracted through automatic feature engineering is too large. As a result, the time saved in feature engineering is offset by the high cost of model training, and the model tuning time soars. The practical application value of automatic feature engineering is limited by the increased total cost.
(2) Feature dimension explosion [16] leads to a decline in the interpretability of new features. One of the main contributions of this paper is to solve the problems of feature superposition, depth loss, and interpretability decline by limiting the depth. The original features are manually filtered so that the newly generated features have a certain directionality, which ensures the accuracy of the new features while reducing the depth. (3) If the model stays unadjusted, too many automatically generated new features easily amplify the adverse effects of dirty data; the noise in the dataset will also affect the performance of the model.
Automatic feature engineering lacks regularizing measures when adding features. Too many constructions of the same feature can easily result in excessive rewards for a few features, giving them too much weight in the model and eventually leading to overfitting. In fraud detection, fraudsters disguise themselves; when the confidence of a few samples is too high, it is difficult to identify fraudulent behaviors. The car loan dataset studied in this paper is dominated by numerical feature types. As mentioned above, existing automatic feature engineering methods suffer from a large number of features, high model complexity, high training time cost, and a lack of regularization. These methods cannot achieve the expected effect on the car loan dataset. This paper attempts to limit the depth of feature generation through strategies and thus optimize the automated feature engineering method.

Preprocessing.
The dataset used in this article is provided by a domestic financial institution and contains 52 fields. This paper selects 150,000 records as the research sample. The dataset has the following characteristics: (1) Disproportionate data samples: 26,545 out of the 150,000 records are labeled as fraudulent and are referred to as positive samples. That is, 123,455 records are labeled as negative samples, accounting for 82.3% of the total sample space. In fact, data imbalance is a property of this dataset itself and a natural law of car loan samples. There are about 500,000 original records, from which we screened 150,000 and retained the imbalance property appropriately.
(2) Some highly correlated features: a heat map generated from the correlation coefficients of all features (Figure 1) was used to examine the relationships between the original features and the degree of connection between each original feature and the predicted value. The selected features form the x-axis and y-axis, and the correlation coefficient of each pair of features x and y is given by the Pearson formula:

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_{i=1}^{n} (x_i − x̄)²) · sqrt(Σ_{i=1}^{n} (y_i − ȳ)²) ). (1)

(3) Numerical features dominated: several typical characteristics are selected for analysis (Table 1).
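As an illustration, the pairwise Pearson coefficients behind such a heat map can be computed directly with pandas; the column names and values below are hypothetical stand-ins for the dataset's real fields:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the car loan dataset (hypothetical column names);
# the real dataset's 52 fields are not reproduced here.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "asset_cost": rng.normal(300_000, 50_000, 200),
    "loan_amount": rng.normal(200_000, 40_000, 200),
    "credit_score": rng.integers(300, 850, 200).astype(float),
})
# Loan amount tends to track asset cost in practice; inject that link.
df["loan_amount"] += 0.5 * df["asset_cost"]

# Pairwise Pearson correlation coefficients, as visualized in the heat map.
corr = df.corr(method="pearson")
print(corr.round(2))
```

Each off-diagonal cell of `corr` is exactly the Pearson coefficient from the formula above; the heat map in Figure 1 is a color rendering of this matrix.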
The above features share some characteristics: (1) The features in the data are basically numerical.
(2) Among the numerical features, different subtypes are included: some data indicate economic quantities, such as asset cost; some data indicate ratios, such as the features whose names contain "ratio"; and some data describe customer attributes, such as Credit_score.
These features cover almost all the characteristics of the whole dataset and are very useful for guiding our subsequent processing.

2.3.2.
Classification. Sufficient training samples are conducive to improving model performance. We selected 80% of the 150,000 records as the training set and the remaining 20% as the test set. Features are generated with the DFS algorithm [17]. The algorithm can automatically construct a predictive model for complex datasets. As an algorithm that automatically generates features for relational datasets, DFS follows the relationships between the data and the basic fields and sequentially applies mathematical functions along the path to create the final features. Through sequential stacking calculations, each new feature can be defined at a specific depth.
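The 80/20 split described above can be sketched with scikit-learn's train_test_split; the data here is synthetic and scaled down, and stratification is used to preserve the fraud ratio in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paper's 150,000 records; 1,500 rows keep the
# example fast. The ~17.7% positive rate mirrors the dataset's imbalance.
rng = np.random.default_rng(42)
X = rng.normal(size=(1500, 5))
y = (rng.random(1500) < 0.177).astype(int)

# Stratified 80/20 split preserves the fraud/non-fraud ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(len(X_train), len(X_test))  # 1200 300
```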

Entity Feature.
Entity features (efeat) are computed from the values of each entry of an entity to derive new characteristics. These features are derived element-wise by applying a calculation function to a single feature array x_{:,j}. Examples include converting an existing feature in the entity table into another type of value, such as converting a categorical string data type to a predetermined unique value, or rounding a numerical value. Other examples include converting timestamps into four different characteristics: day of the week (1-7), day of the month (1-31), month of the year (1-12), and hour of the day (1-24). For the data used in this paper, discrete features such as manufacturer number and service personnel number are converted into one-hot encoded features, and the mobile phone number fill-in feature is converted into a Boolean feature.
This function is applied to the entire set of values of the j-th feature, x_{:,j}, producing a new value for each instance i:

x′_{i,j} = efeat(x_{:,j}, i). (2)
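A minimal sketch of the efeat conversions described above (timestamp decomposition, one-hot encoding, and the Boolean fill-in feature), using hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "disbursal_ts": pd.to_datetime(["2021-03-05 14:30", "2021-07-19 09:05"]),
    "manufacturer_id": ["M1", "M7"],       # hypothetical discrete feature
    "mobile_no": ["13800000000", None],    # phone-number fill-in feature
})

# Timestamp -> four entity features, as described in the text.
df["weekday"] = df["disbursal_ts"].dt.dayofweek + 1   # 1-7
df["day_of_month"] = df["disbursal_ts"].dt.day        # 1-31
df["month"] = df["disbursal_ts"].dt.month             # 1-12
df["hour"] = df["disbursal_ts"].dt.hour + 1           # 1-24

# Discrete feature -> one-hot coding; fill-in feature -> Boolean.
df = pd.get_dummies(df, columns=["manufacturer_id"])
df["mobile_filled"] = df["mobile_no"].notna()
print(df[["weekday", "month", "hour", "mobile_filled"]])
```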

Relation Feature.
Relation features are obtained through joint analysis of two related entities E_l and E_k, including forward and backward relations. Forward: a forward relation is defined between an instance m of entity E_l and a single instance i of entity E_k. This is considered a forward relationship because m has an explicit dependence on i.

Backward: a backward relation is the relationship from an instance i of entity E_k to all the instances of entity E_l that have a forward relationship to i.
Direct features (dfeat) are applied over the forward relationships. In these, features of a related instance i ∈ E_k are directly transferred as features for the instances m ∈ E_l.
Relational features (rfeat) are applied over the backward relationships. They are derived for an instance i of entity E_k by applying a mathematical function to x^l_{:,j|e_k=i}, which is a collection of values for feature j in the related entity E_l, assembled by extracting all the values of feature j in entity E_l where the identifier of E_k is e_k = i. This transformation is expressed as

x′_{i,j′} = rfeat(x^l_{:,j|e_k=i}). (3)

Some examples of rfeat functions are min, max, and count. rfeat functions could also be applied to the probability density function over x^l_{:,j|e_k=i}. The pseudocode of the algorithm is given in Algorithm 1.
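In tabular terms, an rfeat computation is a group-by aggregation over the foreign key. A minimal sketch with a hypothetical customer/loan pair of entities:

```python
import pandas as pd

# Parent entity: customers (E_k); child entity: their loan records (E_l).
loans = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [100.0, 250.0, 175.0, 80.0, 120.0],
})

# rfeat: apply min/max/count over the backward relationship, i.e. over all
# rows of E_l whose foreign key e_k equals each customer instance i.
rfeat = loans.groupby("customer_id")["amount"].agg(["min", "max", "count"])
print(rfeat)
```

Each row of `rfeat` is the set of relational features for one instance i of E_k, exactly the x^l_{:,j|e_k=i} aggregation in equation (3).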

Growth of Number of Features.
The feature space that can be enumerated by deep feature synthesis grows very quickly. In this paper, we analyze z, the number of features the algorithm will synthesize for a given entity. Due to the recursive nature of feature synthesis, the number of features created for an entity depends on the number created for related entities. Thus, we use z_i to denote the number of features created for an entity at recursion depth i. Let j be the number of original features, e the number of efeat functions, r the number of rfeat functions, and m and n the numbers of backward and forward related entities, respectively. Combining all rfeat, dfeat, and efeat features, we see that

z_i = z_{i−1} · (r·m + n)(e + 1) + e·j. (4)

If i = 0, only efeat features can be calculated, so

z_0 = e·j. (5)

Let p = (r·m + n)(e + 1) and q = e·j; by substitution, we get

z_i = z_{i−1} · p + q. (6)

Replacing z_{i−1} by z_{i−2} · p + q, we get

z_i = z_{i−2} · p² + q·p + q. (7)

Continuing the iteration till z_0, we get

z_i = z_0 · p^i + q · (p^{i−1} + p^{i−2} + ... + p + 1). (8)

Replacing z_0 = q in the above equation, we get

z_i = q · (p^i + p^{i−1} + ... + p + 1). (9)

Therefore, the closed form for z_i is

z_i = q · (p^{i+1} − 1)/(p − 1). (10)

The DFS algorithm provides two feature generation methods: (1) Aggregation: the production process involves variables produced by integrating multiple features, such as the maximum, minimum, and mean values of a stacked feature. (2) Transformation: feature variables are generated by operating on only a single feature.
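The recurrence and its closed form can be verified numerically; the primitive counts below are arbitrary illustrative values, not taken from the paper's experiments:

```python
# Numerically check the closed form z_i = q * (p**(i+1) - 1) / (p - 1)
# against the recurrence z_i = p * z_{i-1} + q with z_0 = q.
def z_recursive(i, p, q):
    z = q  # depth 0: only efeat features
    for _ in range(i):
        z = p * z + q
    return z

def z_closed(i, p, q):
    # (p - 1) always divides p**(i+1) - 1, so integer division is exact.
    return q * (p ** (i + 1) - 1) // (p - 1)

# Hypothetical counts: r=3 rfeat functions, m=2 backward entities,
# n=2 forward entities, e=4 efeat functions, j=10 original features.
p = (3 * 2 + 2) * (4 + 1)   # p = (r*m + n)(e + 1) = 40
q = 4 * 10                  # q = e*j = 40
for depth in range(5):
    assert z_recursive(depth, p, q) == z_closed(depth, p, q)
print(z_recursive(2, p, q))  # 65640 features already at depth 2
```

Even with these modest counts, the feature space reaches tens of thousands by depth 2, which is exactly why limiting the depth matters.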

Choose Features.
In this paper, the DFS algorithm is used to generate features. Two generation methods, namely, aggregation and transformation based on metafeatures, are implemented. We adjusted the parameters of the algorithm. If no constraint is placed on the depth of the algorithm, the relation features will operate with other features many times, making the algorithm go deeper and deeper. When we limit the depth of the algorithm to one or two layers, the frequency of these operations decreases, so the depth of the algorithm also decreases. In addition, we reduced the number of features involved in feature engineering to ensure that invalid features are not overincluded.
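The effect of the depth limit on feature count can be illustrated with a toy synthesizer (a simplified sketch, not the Featuretools API): each extra layer re-aggregates the previous layer's outputs, so capping max_depth directly caps the multiplication of features.

```python
import pandas as pd

# Minimal sketch of depth-limited feature synthesis: each pass applies
# aggregation primitives to the features produced by the previous pass,
# up to max_depth layers. Column and key names are hypothetical.
def synthesize(df, key, numeric_cols, max_depth):
    feats = pd.DataFrame(index=df[key].unique())
    frontier = numeric_cols
    for depth in range(max_depth):
        new_cols = []
        for col in frontier:
            for agg in ("min", "max", "mean"):   # aggregation primitives
                name = f"{agg}({col})_d{depth + 1}"
                feats[name] = df.groupby(key)[col].agg(agg)
                new_cols.append(name)
        # Merge the new columns back so the next layer can re-aggregate
        # them; stopping the loop is exactly the proposed depth limit.
        df = df.merge(feats[new_cols], left_on=key, right_index=True)
        frontier = new_cols
    return feats

loans = pd.DataFrame({"cust": [1, 1, 2, 2, 2],
                      "amount": [100.0, 200.0, 50.0, 75.0, 25.0]})
shallow = synthesize(loans, "cust", ["amount"], max_depth=1)
deep = synthesize(loans, "cust", ["amount"], max_depth=2)
print(shallow.shape[1], deep.shape[1])  # 3 12
```

One extra layer already quadruples the feature count (3 vs 12), mirroring the geometric growth derived above.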
(1) Based on the aggregation method, the generated features in the fields provided by the dataset are shown in Table 2.
(2) Based on the transformation method, the generated features in the fields provided by the dataset are shown in Table 3. The complete set of features finally selected is shown in Table 4.

Experiment Environment.
The server configuration of the experiment is as follows: Intel(R) Core(TM) i7-10875H CPU @ 2.30 GHz and GeForce RTX 2070 Super GPU. The programming environment is Python 3.8 with scikit-learn 0.24.2, and Jupyter Notebook serves as the development environment.

Evaluation Index.
Three model evaluation indicators are considered in this paper: accuracy, receiver operating characteristic curve (ROC), and area under the ROC curve (AUC).
Accuracy measures the proportion of samples that the model classifies correctly. The ROC curve and AUC coefficient are mainly used to test the ability of the model to rank customers correctly. The ROC curve describes the cumulative ratio of negative customers under a certain proportion of positive customers.
The more robust the model's discrimination ability is, the closer the ROC curve is to the upper left corner. The area under the ROC curve is represented by the AUC coefficient. The higher the AUC coefficient, the stronger the risk discrimination ability of the model. The ROC curve and AUC coefficient are commonly used indicators for measuring the pros and cons of risk control models and are very suitable for evaluating the effect of fraud detection.
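For reference, all three indicators can be computed with scikit-learn; the labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Hypothetical scores from a fraud classifier: higher = more likely fraud.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.05, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve
acc = accuracy_score(y_true, (y_score >= 0.5).astype(int))  # threshold at 0.5
print(f"AUC={auc:.3f} accuracy={acc:.3f}")
```

Note that accuracy depends on the chosen threshold, while AUC summarizes ranking quality over all thresholds, which is why AUC is preferred for imbalanced fraud data.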

Comparison of the Base Models.
Several mainstream models in machine learning are selected for pretraining.
Through K-fold cross-validation [18], the average of the K prediction accuracies is used to evaluate the predictive ability of each model. The pretraining results are shown in Figure 3. Among the base models, RandomForestClassifier [19], GradientBoostingClassifier [20], ExtraTreesClassifier, and LightGBM achieved roughly the same results. The training times of these four models are compared in Table 5.
It is evident that LightGBM takes the least time in training, improving efficiency by 74.5% compared with ExtraTreesClassifier. The dataset in this paper contains 150,000 records, while the amount of data credit institutions need to process in actual loan fraud detection is far greater. It is therefore reasonable to choose the LightGBM algorithm with the shortest training time.

Experiment of Automatic Feature Engineering.
In this experiment, the results of the proposed scheme are compared with a benchmark group and a group using the existing method. The settings of each group are shown in Table 6. After the experiments, the comparison of the experimental groups is shown in Table 7.
The ROC curves of the three experiment types are plotted for comparison. The abscissa of the ROC curve is the false positive rate (FPR), and the ordinate is the true positive rate (TPR). FPR indicates how many of all negative cases are predicted to be positive, and TPR indicates how many of all real positive cases are correctly predicted as positive.
Here the calculation method of the ROC curve is given: (1) Sort the test samples from large to small according to their score, the probability that each test sample belongs to the positive class.
(2) From high to low, take each score as the threshold. When the probability of a test sample belonging to the positive class is greater than or equal to this threshold, we consider it a positive sample; otherwise, it is considered a negative sample.
(3) Each time we select a different score as the threshold, we get a pair of FPR and TPR values, that is, a point on the curve. A total of 20 groups of FPR and TPR values are obtained. We can obtain a complete ROC curve by connecting these (FPR, TPR) pairs.

Conclusions
After analyzing the results, we can draw the following conclusions: (1) Comparing the ROC curves and AUC coefficients of all the groups shows that automatic feature group B performs better. The AUC value of group B is 23% higher than that of group A and 25.5% higher than that of the benchmark group.
(2) Compared with automatic feature group A, automatic feature group B shortens the total construction and training time by 54.3%. The number of generated features is reduced by 92.5%. Readability and interpretability are improved.
(3) The goal of automatic feature engineering is to shorten the time spent on feature engineering and thereby allocate more time to model tuning and other steps. In automatic feature group A, the long training time actually adds time cost, which goes against this goal. Automatic feature group B shortens the time spent on feature engineering, while the training cost does not increase significantly, which comes closer to the actual goal of automatic feature engineering.

Discussion
Automatic feature engineering has a wide range of applications in the field of data science. In this paper, the deep feature synthesis algorithm is used to improve the effect of car loan fraud detection. The depth of the DFS algorithm is limited by compressing abstract features that lack interpretability. The following problems are solved: feature dimension explosion, low interpretability, and overly complex features. Compared with traditional automatic feature engineering methods, the method proposed in this paper reduces the number of features by 92.5% and shortens the training time by 54.3%, while the detection accuracy is increased by 23%. The existing automatic feature engineering method for car loan fraud detection is effectively optimized. It is worth noting that when numerical features in the dataset are not dominant or the features are sparse, deep automatic feature synthesis is still valid. Our future work is to study how to optimize the capacity of the model and improve operating efficiency under this condition.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.