Credit card fraud detection through parenclitic network analysis

The detection of fraud in credit card transactions is a major topic in financial research, with profound economic implications. While this has hitherto been tackled through data analysis techniques, the resemblances between this and other problems, like the design of recommendation systems and of diagnostic / prognostic medical tools, suggest that a complex network approach may yield important benefits. In this contribution we present a first hybrid data mining / complex network classification algorithm, able to detect illegal instances in a real card transaction data set. It is based on a recently proposed network reconstruction algorithm that allows creating representations of the deviation of one instance from a reference group. We show how the inclusion of features extracted from the network data representation improves the score obtained by a standard, neural network-based classification algorithm; and additionally how this combined approach can outperform a commercial fraud detection system in specific operation niches. Beyond these specific results, this contribution represents a new example of how complex networks and data mining can be integrated as complementary tools, with the former providing a view of the data beyond the capabilities of the latter.


II. METHODS
In this section we present the main tools used to classify credit card transactions as licit or illicit. Given a credit card transaction $t_i$ with features $f_{i1}, \dots, f_{ik}$, the problem entails detecting whether it is illicit from its features and from the knowledge obtained from a historical training dataset, which is known as a supervised learning problem. From a mathematical point of view, we have to model a function $H : \mathbb{R}^k \to \mathbb{R}$ and find $\delta > 0$ such that if $|H(f_{i1}, \dots, f_{ik})| \le \delta$, then $t_i$ is not illicit. Note that, while there are multiple types of illicit patterns, this aspect is not considered here: any suspicious transaction is treated as a potentially fraudulent one.
We first introduce the concept of parenclitic networks in Section II A, a network reconstruction technique that allows highlighting the differences between one instance and a set of standard (i.e. baseline, or in this case licit) instances [23,24]. We subsequently describe the real data set used for validation (Section II B), including the available raw features (Table II); and the global classification model (Section II C).

A. Parenclitic networks reconstruction
As initially proposed in [23], one may hypothesise that the right classification of an observation does not only come from its features, but also from the structure of correlations between them. Following the mathematical formalism introduced before, if we consider the set $L = \{(x_1, \dots, x_k) \in \mathbb{R}^k : |H(x_1, \dots, x_k)| \le \delta\} \subseteq \mathbb{R}^k$, then $L$ is a manifold in $\mathbb{R}^k$ such that if we take a (new) transaction $t$ with features $t_1, \dots, t_k$ such that $(t_1, \dots, t_k) \notin L$, then $t$ is considered as an illicit transaction. In general it is computationally infeasible to obtain the set $L$ directly from the training dataset, since it is a high-dimensional problem. As an alternative, the parenclitic approach analyses the family of projections of $L$ onto the two-dimensional spaces corresponding to pairs of features. Hence, if we consider a training dataset with $n \in \mathbb{N}$ transactions, each of them described by $k \in \mathbb{N}$ (numeric) features, we can analyse up to $\binom{k}{2} = k(k-1)/2$ two-dimensional projections of pairs of different features, each of them with up to $n$ points in $\mathbb{R}^2$.
In order to quantify the correlation between pairs of features, the parenclitic approach associates to each transaction a network with $k$ nodes (as many as the features considered), whose links measure the correlation between features [24]. Hence the following pre-processing must be completed: for every two-dimensional projection of $L$ given by a pair of features $(f_i, f_j)$ with $1 \le i \ne j \le k$, the correlation for the licit transactions in the training dataset is measured (by means of, for instance, a linear regression or another curve-fitting technique). For the sake of simplicity, we have here considered a linear regression, such that every pair of features $(f_i, f_j)$ with $1 \le i \ne j \le k$ yields a linear fit between $f_i$ and $f_j$ for the licit transactions in the training dataset.
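The pre-processing step described above can be sketched as follows. This is only an illustration under stated assumptions: an ordinary least-squares fit is used, one line is fitted per unordered pair of features, and the function names (`fit_line`, `fit_reference_lines`) and the toy data are hypothetical and not taken from the original work.

```python
from itertools import combinations

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("feature is constant; the slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_reference_lines(licit):
    """Fit one line per unordered feature pair (i, j), i < j.

    `licit` is a list of licit transactions, each a list of k numeric
    features. Returns {(i, j): (slope, intercept)} with k*(k-1)/2 entries.
    """
    k = len(licit[0])
    columns = [[row[i] for row in licit] for i in range(k)]
    return {
        (i, j): fit_line(columns[i], columns[j])
        for i, j in combinations(range(k), 2)
    }

# Toy data (made up): feature 1 = 2 * feature 0 + 1, feature 2 = -feature 0.
licit = [[0.0, 1.0, 0.0], [1.0, 3.0, -1.0], [2.0, 5.0, -2.0], [3.0, 7.0, -3.0]]
lines = fit_reference_lines(licit)
```

With $k = 3$ features the dictionary holds $k(k-1)/2 = 3$ lines, one per pair of features.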
Mathematically, each of these fits is a straight line $r_{ij}$ in the plane of the features $(f_i, f_j)$. Once these $\binom{k}{2}$ linear regression lines are computed, a threshold $\alpha > 0$ is fixed. Given a new (i.e. not included in the training set) transaction $t$ with features $t_1, \dots, t_k$, a network $G = G(t)$ is associated to $t$ as follows:
• $G$ has $k$ nodes $1, \dots, k$.
• For every pair of nodes $1 \le i \ne j \le k$ we compute $w_{ij} \ge 0$ as the (Euclidean) distance from $(t_i, t_j)$ to the line $r_{ij}$ in $\mathbb{R}^2$. As an alternative, the Euclidean distance could be replaced by any pseudo-distance function in $\mathbb{R}^2$. For the sake of simplicity, the Euclidean distance is used in this paper, but similar results can be obtained with other pseudo-distance functions.
• For every pair of nodes $1 \le i \ne j \le k$, the (undirected) link $(i, j)$ is in graph $G$ if and only if $w_{ij} \ge \alpha$.
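The construction of $G(t)$ can be illustrated with a short sketch. It assumes each reference line is stored as a (slope, intercept) pair for $i < j$, so that the distance from the point $(t_i, t_j)$ to the line is $|a\,t_i - t_j + b| / \sqrt{a^2 + 1}$; this parametrisation, the function name `parenclitic_network` and the numeric values are assumptions made for the example.

```python
import math
from itertools import combinations

def parenclitic_network(t, lines, alpha):
    """Build the weighted and binarised parenclitic network of transaction t.

    t      -- list of k feature values
    lines  -- {(i, j): (slope, intercept)} for i < j, fitted on licit data
    alpha  -- distance threshold; links with w_ij >= alpha are kept
    """
    k = len(t)
    weights = {}
    edges = set()
    for i, j in combinations(range(k), 2):
        slope, intercept = lines[(i, j)]
        # Euclidean distance from the point (t_i, t_j) to the line
        # slope * x - y + intercept = 0.
        w = abs(slope * t[i] - t[j] + intercept) / math.hypot(slope, 1.0)
        weights[(i, j)] = w
        if w >= alpha:
            edges.add((i, j))
    return weights, edges

# Hypothetical reference lines for k = 3 features.
lines = {(0, 1): (2.0, 1.0), (0, 2): (-1.0, 0.0), (1, 2): (-0.5, 0.5)}
typical = [1.0, 3.0, -1.0]   # lies exactly on all three lines
unusual = [1.0, 8.0, -1.0]   # feature 1 departs from its usual relations

_, edges_typical = parenclitic_network(typical, lines, alpha=0.5)
weights_unusual, edges_unusual = parenclitic_network(unusual, lines, alpha=0.5)
```

In this toy case the typical transaction yields an empty network, while the unusual one gains the two links that involve the anomalous feature.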
Note that the parenclitic network $G(t)$ summarises the pairs of features whose correlation strongly differs from that of a typical licit transaction; the structure of this network thus contains valuable information about the (abnormal) correlation of features in the credit card transaction. Once this parenclitic network is computed, it is necessary to transform it into a set of features compatible with a data mining algorithm. Towards this end, several structural measures have been extracted, and will be considered as new features associated with the transaction (see next section for details). Among all possible structural measures that could be computed (see, for example, Ref. [25] and references therein), those selected here are summarised in Table I.

Name | Description
--- | ---
Maximum node degree [25] | Maximum degree of all nodes in the network. It is calculated as $M_k = \max_i k_i$, $k_i$ being the degree of node $i$.
Entropy of the degree distribution [26] | Shannon entropy of the distribution of node degrees. It is given by $E = -\sum_{i=0}^{M_k} p_i \log p_i$, $p_i$ being the probability of finding a node of degree $i$.
Assortativity [25] | Pearson's correlation coefficient between the degrees of connected nodes.
Clustering coefficient [25] | Measure of the presence of triangles in the network. It is defined as the number of triangles (groups of three fully-connected nodes) over the number of connected triplets (groups of three nodes connected by at least two links).
Geodesic distance [25] | Average length of the shortest path connecting pairs of nodes.
Efficiency [27] | Inverse of the harmonic mean of the length of all shortest distances.
Information Content [28] | Metric assessing the presence of meso-scale structures in the network.
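As an illustration, four of the seven measures can be computed from an edge list with the standard library alone. The sketch below makes several assumptions not fixed by the text: natural logarithms in the entropy, the usual factor of 3 in the clustering coefficient (so that a fully connected network scores 1), and a contribution of zero from unreachable pairs in the efficiency. All function names are hypothetical.

```python
import math
from collections import Counter, deque
from itertools import combinations

def adjacency(k, edges):
    """Neighbour sets of an undirected graph with nodes 0..k-1."""
    neigh = {v: set() for v in range(k)}
    for i, j in edges:
        neigh[i].add(j)
        neigh[j].add(i)
    return neigh

def max_degree(neigh):
    return max(len(n) for n in neigh.values())

def degree_entropy(neigh):
    """Shannon entropy (natural log) of the degree distribution."""
    counts = Counter(len(n) for n in neigh.values())
    total = len(neigh)
    return -sum(c / total * math.log(c / total) for c in counts.values())

def clustering(neigh):
    """3 * triangles / connected triplets; 0.0 if there are no triplets."""
    triangles = sum(
        1 for a, b, c in combinations(neigh, 3)
        if b in neigh[a] and c in neigh[a] and c in neigh[b]
    )
    triplets = sum(len(n) * (len(n) - 1) // 2 for n in neigh.values())
    return 3 * triangles / triplets if triplets else 0.0

def efficiency(neigh):
    """Mean of 1/d over ordered node pairs; unreachable pairs add 0."""
    k = len(neigh)
    total = 0.0
    for source in neigh:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            for u in neigh[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (k * (k - 1))

# Toy network: a triangle 0-1-2 plus a pendant node 3 attached to node 2.
neigh = adjacency(4, {(0, 1), (0, 2), (1, 2), (2, 3)})
features = [max_degree(neigh), degree_entropy(neigh),
            clustering(neigh), efficiency(neigh)]
```

Assortativity, geodesic distance and Information Content are left out of the sketch for brevity.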

B. Data set description
The data set considered here includes all credit and debit card transactions of clients of the Spanish bank BBVA, from January 2011 to December 2012. Each month, an average of 15 million operations were carried out with 7 million cards, for a total of 250 GB of information.
Transactions are automatically screened by an algorithm designed to detect suspicious transactions, which returns a score from 0 (not suspicious) to 100 (potentially illegal). Afterwards, transactions are classified in two categories, i.e. legal and illegal, as the result of a manual classification performed by the bank's legal personnel, using both the output of the automatic algorithm and customers' complaints. This allows us to detect which transactions were positively detected as frauds by the automatic algorithm, and which were false negatives.
Available fields included a time stamp of the operation, the quantity (both in Euro and in the original currency, if different), and the origin (the card) and destination (the store) of the operation; the two latter fields were anonymised, so that the exact card number and the name of the store could not be recovered. Some additional features have been synthesised from the previous ones, e.g. the average transaction size of a given user. A full list of the available fields is reported in Tab. II. Additionally, a full statistical characterisation of the features can be found in Ref. [29], including the temporal evolution of the structure of the transactions network.

C. Classification models
As previously introduced, in this contribution we explore two different ways of detecting illicit credit card transactions: a classical data mining approach, and the introduction of features extracted from a network representation. In both cases, the process must follow some common steps: it is first necessary to extract the expected behavior, i.e. a set of features representing the typical legal and illegal transaction; and then to build a model that learns from those features, and yields an expected classification for a new transaction not yet studied. Fig. 1 depicts an overview of the whole process. It starts from the original data set, from which a set of raw features is extracted, as described in Section II B and listed in Tab. II. The features corresponding to the licit transactions are then used to recover the normal relations, as described in Section II A, and to reconstruct the parenclitic networks of all transactions. These networks are then binarised, i.e. links with weight below a given threshold are deleted, and a set of topological metrics is extracted (see Table I for a complete list). Note that, at the end of this analysis, all transactions are described by 15 features: 8 coming from the raw data, and 7 from the network analysis.
Artificial Neural Networks (ANNs), and specifically Multi-Layer Perceptrons (MLPs), have been chosen as the final model for classifying new transactions. They are inspired by the structural aspects of biological neural networks, and are represented by a set of connected nodes in which each connection has a weight associated with it; the network learns the classification function by adjusting these weights [9,30]. The output of each artificial neuron $j$ is obtained by applying the activation function $f$ to the weighted sum of its inputs, $W$ being the vector of weights and $f$ the sigmoid activation function, $f(x) = 1/(1 + e^{-x})$. Following the standard configuration, neurons were organized in three layers: an input one, with a number of neurons equal to the number of input features; an intermediate, or hidden one, with ten neurons; and a final output layer comprising just one computational element. The training has been performed with the standard back-propagation algorithm [31]. Finally, the MLP models have been built using the KNIME software [32].
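A forward pass through a network with this layout (15 inputs, 10 hidden neurons, 1 output) can be sketched as below. The sketch is not the KNIME model used in this work: the weights are random and untrained, the bias term is an assumption (the text mentions only the weight vector), and back-propagation training is not shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Sigmoid of the weighted sum of the inputs plus an (assumed) bias."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mlp_forward(features, hidden_layer, output_neuron):
    """One hidden layer followed by a single output neuron."""
    hidden = [neuron(features, w, b) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron(hidden, w_out, b_out)

# Untrained, randomly initialised weights: 15 inputs -> 10 hidden -> 1 output.
rng = random.Random(0)
n_inputs, n_hidden = 15, 10
hidden_layer = [
    ([rng.uniform(-1, 1) for _ in range(n_inputs)], rng.uniform(-1, 1))
    for _ in range(n_hidden)
]
output_neuron = ([rng.uniform(-1, 1) for _ in range(n_hidden)],
                 rng.uniform(-1, 1))

transaction = [rng.random() for _ in range(n_inputs)]  # placeholder features
score = mlp_forward(transaction, hidden_layer, output_neuron)
```

Because the output neuron is a sigmoid, the score always lies strictly between 0 and 1 and can be thresholded to obtain a licit / illicit decision.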
The evaluation of the classification efficiency has been performed using both sensitivity (also known as True Positive Rate -TPR) and Receiver Operating Characteristic (ROC) curves [33].
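Both evaluation measures can be computed directly from labels and classifier scores. The following is a minimal sketch with made-up data, assuming label 1 marks an illicit transaction and that a higher score means a more suspicious one; the function names are hypothetical.

```python
def sensitivity(labels, predictions):
    """TPR = TP / (TP + FN); label 1 marks an illicit transaction."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return tp / (tp + fn)

def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Made-up labels (1 = illicit) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

tpr = sensitivity(labels, predictions)
curve = roc_points(labels, scores)
```

Each distinct score is used once as a threshold, so the curve runs from (0, 0) to (1, 1) as the threshold is lowered.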

III. RESULTS
As explained in Section II A, the parenclitic approach usually requires the definition of a threshold $\alpha$, which is used to binarise the (initially weighted) networks. Instead of using an a priori approach, i.e. the definition of $\alpha$ using expert judgement, we here tackle the problem indirectly, by following the procedure proposed in Ref. [34]. Specifically, we optimise the network reconstruction by finding the link density (and hence the value of $\alpha$) that optimises the efficacy of the classification model. Fig. 2 Left presents the evolution of the classification error (sensitivity or TPR) as a function of the considered link density, for three different scenarios: the use of only the raw features, as described in Tab. II (solid black squares); the use of the features extracted from the parenclitic representation alone (hollow black circles); and the use of the combined sets of features (solid blue triangles). Note that, in the first case, the result is constant, as the original features are not affected by the binarisation process. In order to avoid overfitting, this classification has been performed on a balanced sub data set, composed of an equal number of legal and illegal transactions.
Several conclusions can be drawn from Fig. 2. First of all, the features extracted from the parenclitic networks are not enough, alone, to reach a low classification error. This is to be expected: while important information can be encoded in the interactions between raw features, some important clues may be hidden in the raw features themselves, e.g. abnormal transaction sizes or timings. At the same time, the addition of parenclitic features to the raw data set enhances the obtained results, with the error dropping from 19.2% to 12.23%. This is further illustrated in Fig. 2 Right, depicting the reduction in the classification error (in percentage) when considering only parenclitic features and the whole data set; note that, in the first case, the reduction is negative as the error increases. Finally, the best classification suggests that the optimal link density is 60%, meaning that the 40% of links with the lowest weight should be deleted.
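The relation between a target link density and the threshold $\alpha$ can be made concrete with a small sketch. It assumes that the density is imposed on each network separately, by keeping the heaviest links; whether the procedure of Ref. [34] fixes the threshold per network or globally is not specified here, and the function names and weights are hypothetical.

```python
def threshold_for_density(weights, density):
    """Return alpha such that about `density` of the links satisfy w >= alpha.

    weights -- link weights w_ij of one parenclitic network
    density -- target fraction of links to keep, in (0, 1]
    """
    if not 0.0 < density <= 1.0:
        raise ValueError("density must lie in (0, 1]")
    ranked = sorted(weights, reverse=True)
    n_keep = max(1, round(density * len(ranked)))
    return ranked[n_keep - 1]

def binarise(weights_by_link, alpha):
    """Keep only the links whose weight reaches the threshold."""
    return {link for link, w in weights_by_link.items() if w >= alpha}

# Made-up weights for a network with 5 nodes (10 possible links).
weights_by_link = {
    (0, 1): 0.1, (0, 2): 0.9, (0, 3): 0.4, (0, 4): 0.7, (1, 2): 0.2,
    (1, 3): 0.8, (1, 4): 0.3, (2, 3): 0.6, (2, 4): 0.5, (3, 4): 1.0,
}
alpha = threshold_for_density(weights_by_link.values(), density=0.6)
links = binarise(weights_by_link, alpha)
```

Repeating the classification for a range of densities and keeping the best-performing one corresponds to the indirect selection of $\alpha$ described above; tied weights may leave slightly more links than requested.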
While Fig. 2 is useful to detect the best link density for the analysis, it does not convey information about the real performance of the classification algorithm in an operational environment. For that, Fig. 3 Left presents three ROC curves, corresponding to the use of raw (blue line), parenclitic (green line), and combined features (black line) as before. Note that the results presented here correspond to the optimal link density of 60%, as previously estimated. As previously discussed, the most interesting operational configuration is the one minimising the number of false positives, as this minimises the commercial costs of the organisation. The inset of Fig. 3 thus shows the bottom left part of the curves. It can be appreciated that, after an initial part in which results are comparable, the addition of the parenclitic features slightly increases the number of true positives: note how the black line is above the blue one. Even though this may seem a negligible difference, it is worth noting that any improvement, however small, has a significant impact due to the large number of transactions managed by the system. Increasing the fraud detection [...]
While the results above illustrate that the use of a network representation can improve a fraud detection algorithm, they do not clarify how it ranks against a commercial system. As may be expected, the proposed algorithm is less efficient than the fraud score included in the original data set (see Fig. 4 Left [35]). Nevertheless, there are niches in which the opposite happens, the most important being the analysis of on-line transactions. Fig. 4 Right depicts two ROC curves, respectively for the algorithm based on parenclitic networks (black line) and the commercial system (blue line), when only transactions carried out through the Internet are considered. While the commercial system clearly outperforms the proposed algorithm overall, with an Area Under the Curve (AUC) close to 1.0, the latter is slightly better at low False Positive Rates which, as previously explained, is the region of the plane most interesting for real operations.
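That a classifier can have the lower AUC and still be preferable at low false positive rates can be checked on a toy example. The scores below are invented purely for illustration and do not reproduce the results of Fig. 4; `tpr_at_fpr` and `auc` are hypothetical helper names.

```python
def tpr_at_fpr(labels, scores, max_fpr):
    """Highest TPR reachable while keeping FPR <= max_fpr.

    labels -- 1 for illicit, 0 for licit
    scores -- higher means more suspicious
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for threshold in set(scores):
        flagged = [y for y, s in zip(labels, scores) if s >= threshold]
        tp = sum(flagged)
        fp = len(flagged) - tp
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best

def auc(labels, scores):
    """Probability that a random illicit case outscores a random licit one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 4 illicit and 10 licit transactions with invented scores.
labels = [1, 1, 1, 1] + [0] * 10
model_a = [0.99, 0.98, 0.97, 0.05,
           0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.15, 0.10]
model_b = [0.80, 0.79, 0.78, 0.77,
           0.95, 0.90, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.10]

auc_a, auc_b = auc(labels, model_a), auc(labels, model_b)
low_a = tpr_at_fpr(labels, model_a, max_fpr=0.1)
low_b = tpr_at_fpr(labels, model_b, max_fpr=0.1)
```

Here model B has the larger AUC, yet model A detects more frauds when at most 10% of licit transactions may be flagged, which is the kind of situation described for on-line transactions.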

IV. CONCLUSIONS
Complex networks and data mining models share more characteristics than may prima facie appear, most notably similar objectives: both aim at extracting information from (potentially complex) systems to ultimately generate new compact quantifiable representations. At the same time, they approach this common problem from two different perspectives: the former by extracting and quantitatively evaluating the underlying structure, the latter by creating predictive models based on historical data [22]. In this contribution we test the hypothesis that complex networks can be used to improve data mining models, framed within the problem of detecting fraud instances in credit card transactions. This provides a new example of how complex networks and data mining may be integrated as complementary tools in order to improve the classification rates obtained by classical data mining algorithms.
Results confirm that features extracted from a network-based representation of data, leveraging a recently proposed parenclitic approach [23,24], can play an important role: while not effective by themselves, such features can improve the score obtained by a standard ANN classification model. We further show how the resulting model is especially efficient in detecting frauds in some niches of operations, like medium-sized and on-line transactions. Finally, we illustrate how, in the latter case, the network-based model is able to yield better results than a commercial fraud detection system. All results have been obtained with a unique data set, comprising all transactions managed during two years by a major Spanish bank, and including more than 180 million operations.