Research

. Network data mining has attracted a lot of attention since a large number of real-world problems have to deal with complex network data. In this paper, we present NOESIS, an open-source framework for network-based data mining. NOESIS features a large number of techniques and methods for the analysis of structural network properties, network visualization, community detection, link scoring, and link prediction. The proposed framework has been designed following solid design principles and exploits parallel computing using structured parallel programming. NOESIS also provides a stand-alone graphical user interface allowing the use of advanced software analysis techniques to users without prior programming experience. This framework is available under a BSD open-source software license.


Introduction
Data mining, an interdisciplinary sub eld of computer science, studies the process of extracting valuable information from data by discovering patterns or relationships. Data mining includes classi cation, regression, clustering, or anomaly detection, among other tasks [1]. A large number of tools and techniques are available for tabular data, where all data examples can be represented as tuples in a relation and they share the same set of attributes. However, many problems involve dealing with relational data, where instances are explicitly related through semantic relations. Classic data mining techniques designed for tabular data have severe limitations when we try to fully exploit the relationships available in relational data. e scienti c community has focused on the development of data mining techniques for relational data, leading to the availability of a large collection of techniques for the analysis of the structural properties of networks [2], their pleasant visualization [3], the detection of existing communities [4], and scoring existing or potential links to rank existing ones or predict their existence [5,6]. Network data mining involves di erent problems. Some examples include the prediction of previously unknown protein interactions in protein-protein interaction networks [7], the prediction of collaborations and tendencies in coauthorship networks [8], or ranking the most relevant websites according to a user query [9].
Di erent software tools for analyzing relational data have been developed, according to their main goal and the type of user they are directed to. Several tools provide their functionality to end users through closed graphical user interfaces, leading to improved usability but also neglecting the possibility of using the provided techniques in ways unforeseen by their software developers and integrating them in other software projects. Other tools were designed as software libraries that can be used in di erent software projects, providing more functionality at the cost of limiting their usage to users with prior programming experience.
Most existing frameworks are focused towards a speci c user community or task, as shown in Table 1. For example, Graphviz [10] and Cytoscape [11] are two popular alternatives for network visualization. On the contrary, igraph [12], NetworkX [13], and SNAP [14] are mainly focused on providing reusable collections of algorithms for network manipulation and analysis. Pajek [15], NodeXL [16], Gephi [17], and UCINET [18] are some of the most widely used tools for social network analysis (SNA) (a more

Design of the NOESIS Framework
NOESIS has been designed to be an easily extensible framework whose architecture provides the basis for the implementation of network data mining techniques. In order to achieve this, NOESIS is designed around abstract interfaces and a set of core classes that provide essential functions, which allows the implementation of different features as independent components with strong cohesion and loose coupling. NOESIS components are designed to be maintainable and reusable, yet highly efficient.

System Architecture.
e NOESIS framework architecture and its core subsystems are displayed in Figure 1. ese subsystems are described below. e lowest-level component is the hardware abstraction layer (HAL), which provides support for the execution of algorithms in a parallel environment and hides implementation details and much of the underlying technical complexity.
is component provides different building blocks for implementing well-studied parallel programming design patterns, such as MapReduce [19]. For example, we would just write result � (double) Parallel.reduce(index ->x [index] * y[index], ADD, 0, SIZE-1) to compute the dot product of two vectors in parallel. e HAL not only implements structured parallel programming design patterns but also is responsible for task scheduling and parallel execution. It allows the adjustment of parallel execution parameters, including the task scheduling algorithm. e reflective kernel is at the core of NOESIS and provides its main features. e reflective kernel provides the base models (data structures) and tasks (algorithms) needed to perform network data mining, as well as the corresponding metaobjects and metamodels, which can be manipulated at run time. It is the underlying layer that supports a large collection of network analysis algorithms and data mining techniques, which are described in the following section. Different types of networks are dealt with using a unified interface, allowing us to choose the particular implementation that is the most adequate for the spatial and computational requirements of each application. Algorithms provided by this subsystem are built on top of the HAL building blocks, allowing the parallelized execution of algorithms whenever possible. e data access layer (DAL) provides a unified interface to access external data sources.
is subsystem allows reading and writing networks in different file formats, providing implementations for some of the most important standardized network file formats. is module also enables the development of data access components for other kinds of data sources, such as network streaming.
Finally, an application generator is used to build a complete graphical user interface following a model-driven software development (MDSD) approach. is component provides a user-friendly interface that allows users without programming skills to use most of the NOESIS framework features.

Core Classes.
e core classes and interfaces shown in Figure 2 provide the foundation for the implementation of different types of networks with specific spatial and computational requirements. Basic network operations include adding and removing nodes, adding and removing links, or querying a node neighborhood. More complex operations are provided through specialized components.
An object-oriented purist might be surprised that nodes and links do not explicitly appear as Node and Link classes among the NOESIS core classes. is apparent violation of OO design principles has a technical reason-the performance degradation of garbage-collected programming languages (such as Java, C#, or Python) due to increased memory fragmentation. When dealing with large networks, the presence of millions of individual objects in the heap would significantly increase memory fragmentation and degrade the performance of the garbage collector. erefore, nodes and links are not considered to be core classes in NOESIS, although they can be certainly used when needed (as lightweight facades to the underlying network data structure encapsulated within particular Network subclasses).
NOESIS supports networks with attributes in both their nodes and their links. ese attributes are defined according to predefined data models, including categorical and numerical values, among others.

Supported Data
Formats. Different file formats have been proposed for network datasets. Some data formats are more space efficient, whereas others are more easily parseable.
NOESIS supports reading and writing network datasets using the most common data formats. For example, the GDF file format is a CSV-like format used by some software tools such as GUESS and Gephi. It supports attributes in both nodes and links. Another supported file format is GML, which stands for Graph Modeling Language. GML is a hierarchical ASCII-based file format. GraphML is another hierarchical file format based on XML, the ubiquitous eXtensible Markup Language developed by the W3C.
Other file formats are supported by NOESIS, such as the Pajek file format, which is similar to GDF, or the file format of the datasets from the Stanford Network Analysis Platform (SNAP) [20].

Graphical User
Interface. In order to allow users without programming knowledge to use most of the NOESIS features, a lightweight easy-to-use graphical user interface is included with the standard NOESIS framework distribution. e NOESIS GUI allows loading of nontechnical end users, visualizing them, and analyzing their own network datasets by applying all the techniques provided with NOESIS.
Some screenshots of this GUI are shown in Figure 3. A canvas is used to display the network in every moment. e network can be manipulated by clicking or dragging nodes. At the top of the window, a menu gives access to different options and data mining algorithms.
e Network menu allows loading a network from an external source and exporting the results using different file formats, as well as creating images of the current network visualization as both raster graphics (a bitmap in the PNG or JPEG file format) and vector graphics (in the SVG format). e View menu allows the customization of the network appearance by setting specific layout algorithms and custom visualization styles. In addition, this menu allows binding the visual properties of nodes and links to their attributes. e Data menu allows the exploration of attributes for each node and link. Finally, the Analysis menu gives access to most of the techniques that will be described in the following sections.

Network Analysis Tools
NOESIS is designed to ease the implementation of network analysis tools. It also includes reusable implementations of a large collection of popular network-related techniques, from graph visualization [3] and common graph algorithms to network structural properties [21] and network formation models [22]. e network analysis tools included in NOESIS and the modules that implement them are introduced in this section.

Network Models.
NOESIS implements a number of popular random network generation models, which are described by probability distributions or random processes. Such models have been found to be useful in the study and understanding of certain properties or behaviors observed in real-world networks. Some examples of these models are shown in Figure 4.
Among the models included in NOESIS, the Erdös-Rényi model [23] is one of the simplest ones. e Gilbert model [24] is similar to the Erdös-Rényi model, but a probability of existence is given for links instead. e anchored network model is also similar to the two previous models, with the advantage of reducing the occurrence of isolated nodes, but at the cost of being less than perfectly random. e connected random model is a variation of the anchored model that avoids isolated nodes.
Other models included in NOESIS exhibit specific properties often found in real-world networks. For example, the Watts-Strogatz model [25] generates networks with small-world properties, that is, low diameter and high LGPL Java/JavaScript  clustering. is model starts by creating a ring lattice with a given number of nodes and a given mean degree, where each node is connected to its nearest neighbors on both sides. In the following steps, each link is rewired to a new target node with a given probability, avoiding self-loops and link duplication.
Despite that the small-world properties exhibited by networks generated by the Watts-Strogatz model are closer to properties of real-world networks than those generated by models based on the Erdös-Rényi approach, they still lack some important properties observed in real networks. e Barabási-Albert model [26] is another well-known model  4 Complexity that generates networks whose node degree distribution follows a power law, which leads to scale-free networks. is model is driven by a preferential attachment process, where new nodes are added and connected to existing nodes with a probability proportional to their current degree. Another model with very similar properties to the Barabási-Albert model is Price's citation model [27].
In addition to random network models, a number of regular network models are included in NOESIS. ese models generate synthetic networks that are useful in the process of testing new algorithms. e regular network models include complete networks, where all nodes are interconnected; star networks, where all nodes are connected to a single hub node; ring networks, where each node is connected to its closest two neighbors along a ring; tandem networks, like the ring model but without closing the loop; mesh networks, where nodes are arranged in rows and columns and connected only to their adjacent nodes; toruses; meshes, where nodes in the extremes of the mesh are connected; hypercubes; binary trees; and isolates, a network without links.

Network Structural Properties.
Network structural properties allow the quantification of features or behaviors present in the network. ey can be used, for instance, to measure network robustness or reveal important nodes and links. NOESIS considers three types of structural properties: node properties, node pair properties (for pairs both with and without links among them), and global properties.
NOESIS provides a large number of techniques for analyzing network structural properties. Many structural properties can be computed for nodes. For example, indegree and out-degree indicate the number of incoming and outgoing links, respectively. Related to the node degree, two techniques to measure node degree assortativity have been included: biased [28] and unbiased [29] node degree assortativity. Node assortativity is a score between − 1 and 1 that measures the degree correlation between pairs of connected nodes. e clustering coefficient can also be computed for nodes. e clustering coefficient of a node is the fraction of its neighbors that are also connected among them.
Reachability scores are centrality measures that allow the analysis of how easy it is to reach a node from other nodes. e eccentricity of a node is defined as the maximum distance to any other node [30]. e closeness, however, is the inverse of the sum of the distance from a given node to all other nodes [31]. An adjusted closeness value that normalizes the closeness according to the number of reachable nodes can also be used. Inversely to closeness, the average path length is defined as the mean distance of all shortest paths to any other node. Decay is yet another reachability score, computed as the summation of a delta factor powered by the path length to any other node [22]. It is interesting to note that, with a delta factor close to 0, the measure becomes the degree of the node, whereas with a delta close to 1, the measure becomes the size of the component the node is located at. A normalized decay score is also available.
Betweenness, as reachability, is another way to measure node centrality. Betweenness, also known as Freeman's betweenness, is a score computed as the count of shortest paths the node is involved in [32]. Since this score ranges from 2n − 1 to n 2 − (n − 1) for the number of nodes n in strongly connected networks, a normalized variant is typically used.
Influence algorithms provide a different perspective on node centrality. ese techniques measure the power of each node to affect others. e most popular influence algorithm is PageRank [9] since it is used by the Google search engine. PageRank computes a probability distribution based on the likelihood of reaching a node starting from any other node. e algorithm works by iteratively updating node probability based on direct neighbor probabilities, which leads to convergence if the network satisfies certain properties. A similar algorithm is HITS [33], which stands for hyperlinkinduced topic search. It follows an iterative approach, as PageRank, but computes two scores per node: the hub, which is a score related to how many nodes a particular node links, and the authority, which is a score related to how many hubs link a particular node. Both scores are connected by an iterative updating process: authority is updated according to the hub scores of nodes connected by incoming links, and hub is updated according to authority scores of nodes connected by outgoing links. Eigenvector centrality is another iterative method closely related to PageRank, where nodes are assigned a centrality score based on the summation of the centrality of their neighbor nodes. Katz centrality considers all possible paths but penalizes long ones using a given damping factor [34]. Diffusion centrality [35] is another influence algorithm based on Katz centrality. e main difference is that while Katz considers infinite-length paths, diffusion centrality considers only paths of a given limited length.

Complexity 5
Apart from the aforementioned centrality measures, NOESIS also provides ready-to-use implementations of percolation centrality [36] and cross-clique connectivity [37], ensuring that all mainstream centrality measures are provided by our software platform. Beyond common centrality measures, NOESIS can compute a robustness coefficient [38] to analyze the structural robustness of complex networks, which is specifically useful for evaluating scenarios of targeted attacks.
Different structural properties for links can also be computed by NOESIS, for example, link betweenness, which is the count of shortest paths the link is involved in, or link rays, which is the number of possible paths between two nodes that cross a given link. Some of these properties are used by different network data mining algorithms.

Network Visualization Techniques.
Humans are still better than machines at recognizing certain patterns when analyzing data in a visual way. Network visualization is a complex task since networks tend to be huge, with thousands of nodes and links. NOESIS enables the visualization of networks by providing the functionality needed to render the network and export the resulting visualization using different image file formats.
NOESIS provides different automatic graph layout techniques, such as the well-known Fruchterman-Reingold [39] and Kamada-Kawai [40] force-based layout algorithms (see Figure 5). Force-based layout algorithms assign forces among pairs of nodes and solve the system to reach an equilibrium point, which usually leads to an aesthetic visualization.
Hierarchical layouts [3], which arrange nodes in layers trying to minimize edge crossing, are also included. Different radial layout algorithms are included as well [42]. ese layouts are similar to the hierarchical ones but arrange nodes in concentric circles. Several regular layouts are provided too since they are common for visualizing regular networks, such as meshes or stars.
NOESIS allows tuning the network visualization look and feel. e visual properties of nodes and links can be customized, including color, size, and borders. In addition, visual properties can be bound to static or dynamic properties of the network. For example, node sizes can be bound to a specific centrality score, allowing the visual display of quantitative information.
NOESIS provides an interactive graphical user interface that can be used to explore modest-sized networks and test the results different methods provide using an easy-to-use drag-and-drop interface. is complementary tool can be useful for analyzing small datasets and also for teaching courses on the subject. e current version of NOESIS includes a JavaFX network renderer that makes use of hardware acceleration, the same kind of technology employed by other network analysis tools with a strong focus on network visualization, such as Gephi or Cytoscape. e NOESIS JavaFX-based network renderer offers the same performance gains that can be obtained by other tools that also make use of hardware acceleration (e.g., Gephi). It should be noted, however, that the complementary UI tool included within the NOESIS distribution was not designed to handle large networks since it is often impractical to visualize networks with thousands or millions of nodes and edges in a million-pixel computer screen. NOESIS offers Gephi-like performance, and for visually navigating truly large networks, specialized solutions, such as Cytoscape, are recommended. In such situations, for network data analysis, the programmatic NOESIS API should be used instead of the interactive exploration tool.

Network Data Mining Techniques
Network data mining techniques exist for both unsupervised and supervised settings. NOESIS includes a wide array of community detection methods [4] and link prediction techniques [43]. ese algorithms are briefly described below.

Community Detection.
Community detection can be defined as the task of finding groups of densely connected nodes. A wide range of community detection algorithms have been proposed, exhibiting different pros and cons. NOESIS features different families of community detection techniques and implements more than ten popular community detection algorithms. Some examples of these community detection techniques are shown in Figure 6. e included algorithms, their time complexity, and their bibliographic references are shown in Table 2.
NOESIS provides hierarchical clustering algorithms. Agglomerative hierarchical clustering treats each node as a cluster and then iteratively merges clusters until all nodes are in the same cluster [56]. Different strategies for the selection of clusters to merge have been implemented, including the single-link method [45], which selects the two clusters with the smallest minimum pairwise distance; the complete-link method [46], which selects the two clusters with the smallest maximum pairwise distance; and the average-link method [47], which selects the two clusters with the smallest average pairwise distance.
Modularity-based techniques are also available in our framework. Modularity is a score that measures the strength of particular division into modules of a given network. Modularity-based techniques search for communities by attempting to maximize their modularity scores [57]. Different greedy strategies, including fast greedy [48] and multistep greedy [49], are available. ese greedy algorithms merge pairs of clusters that maximize the resulting modularity, until all possible merges would reduce the network modularity.
Partitional clustering is another common approach. Partitioning clustering decomposes the network and performs an iterative relocation of nodes between clusters. For example, Kernighan-Lin bipartitioning [50] starts with an arbitrary partition into two clusters. en, nodes are iteratively exchanged between both clusters to minimize the number of links between them. is approach can be applied multiple times to subdivide the obtained clusters. K-means community detection [51] is an application of the traditional K-means clustering algorithm to networks and another prominent example of partitioning community detection. 6 Complexity Spectral community detection [56] is another family of community detection techniques included in NOESIS. ese techniques use the Laplacian representation of the network, which is a network representation computed by subtracting the adjacency matrix of the network from a diagonal matrix where each diagonal element is equal to the degree of the corresponding node. en, the eigenvectors of the Laplacian representation of the network are computed. NOESIS includes the ratio cut (EIG1) algorithm [52], Jordan and Weiss NG (KNSC1) algorithm [53], and spectral K-means [54].
e BigClam overlapping community detector is also available in NOESIS [55]. In this algorithm, each node has a profile, which consists in a score between 0 and 1 for each cluster that is proportional to the likelihood of the node belonging to that cluster. Also, a score between pairs of nodes is defined yielding values proportional to their clustering assignment overlap. e algorithm iteratively optimizes each node profile to maximize the value between connected nodes and minimize the value among unconnected nodes.
In the following example, we show how to load a network from a data file and detect communities with the KNSC1 algorithm using NOESIS:  Figure 5: e same dolphin social network [41] represented using different network visualization algorithms: random layout (a); Kamada-Kawai layout (b); Fruchterman-Reingold layout (c); circular layout using the average path length (d).

Link Scoring and Prediction.
Link scoring and link prediction are two closely related tasks. On the one hand, link scoring aims to compute a value or weight for a link according to a specific criterion. Most link scoring techniques obtain this value by considering the overlap or In the time complexity analysis, v is the number of nodes in the network, d is the maximum node degree, and k is the desired number of clusters.
8 Complexity relationship between the neighborhood of the nodes at both ends of the link. On the other hand, link prediction computes a value, weight, or probability proportional to the likelihood of the existence of a certain link according to a given model of link formation. e NOESIS framework provides a large collection of methods for link scoring and link prediction, from local methods, which only consider the direct neighborhood of nodes, to global methods, which consider the whole network topology. Some examples are shown in Figure 7. As the amount of information considered is increased, the computational and spatial complexity of the techniques also increases. e link scoring and prediction methods available in NOESIS are shown in Table 3.
Among local methods, the most basic technique is the common neighbors score [59], which is equal to the number of shared neighbors between a pair of nodes. Most techniques are variations of the common neighbors score. For example, the Adamic-Adar score [60] is the sum of one divided by the logarithm of the degree of each shared node. e resource-allocation index [61] follows the same expression but directly considers the degree instead of the logarithm of the degree. e adaptive degree penalization score [62] also follows the same approach but automatically determines an adequate degree penalization by considering properties of the network topology. Other local measures consider the number of shared neighbors but normalize their value according to certain criteria. For example, the Jaccard score [63] normalizes the number of shared neighbors by the total number of neighbors. e local Leicht-Holme-Newman score [64] normalizes the count of shared neighbors by the product of both neighborhoods' sizes. e Salton score [65] also normalizes similarly but this time using the square root of the product of both node degrees. e Sorensen score [66] considers twice the count of shared neighbors normalized by the sum of both neighbors' sizes.
e hub promoted and hub depressed scores [67] normalize the count of shared neighbors by the minimum and the maximum of both node degrees, respectively. e preferential attachment score [68] only considers the product of both node degrees.
Global link scoring and prediction methods are more complex than local methods. For example, the Katz score [34] sums the influence of all possible paths between two nodes, incrementally penalizing paths by their length according to a given damping factor. e global Leicht-Holme-Newman score [64] is quite similar to the Katz score but resorts to the dominant eigenvalue to compute the final result.
Random walk techniques simulate a Markov chain of randomly selected nodes [69]. e idea is that starting from a seed node and randomly moving through links, we can obtain a probability vector where each element corresponds to the probability of reaching each node. e classical random walk iteratively multiplies the probability vector by the transition matrix, which is the row-normalized version of the adjacency matrix, until convergence. An interesting variant is the random walk with restart [70], which models the possibility of returning to the seed node with a given probability. Flow propagation is another variant of random walk [71], where the transition matrix is computed by performing both row and column normalization of the adjacency matrix.
Some spectral techniques are also available in NOESIS. Spectral techniques, as we mentioned when discussing community detection methods, are based on the Laplacian matrix. e pseudoinverse Laplacian score [72] is the inner product of the rows of the corresponding pair of nodes from the Laplacian matrix. Another spectral technique is the average commute time [72], which is defined as the average number of steps that a random walker starting from a particular node takes to reach another node for the first time and go back to the initial node. Despite that it models a random walk process, it is considered to be a spectral technique because it is usually computed in terms of the Laplacian matrix. Given the Laplacian matrix, it can be computed as the diagonal element of the starting node plus the diagonal element of the ending node, minus two times the element located in the row of the first node and the column of the second one.
e random forest kernel score [73] is a global technique based on the concept of spanning tree, i.e., a connected undirected subnetwork with no cycles that includes all the nodes and some or all the links of the network. e matrixtree theorem states that the number of spanning trees in the network is equal to any cofactor, which is a determinant obtained by removing the row and column of the given node, of an entry of its Laplacian representation. As a result of this, the inverse of the sum of the identity matrix and the Laplacian matrix gives us a matrix that can be interpreted as a measure of accessibility between pairs of nodes.
Using network data mining algorithms in NOESIS is simple. In the following code snippet, we show how to generate a Barabási-Albert preferential attachment network with 100 nodes and 10 links per node and then compute the resource-allocation score for each pair of nodes using NOESIS: Network network � new BarabasiAlbertNetwork(100, 10); LinkPredictionScore method � new Resource AllocationScore(network); Matrix result � method.call(); In Python, the previous example would be implemented as follows: ns � Noesis() network � ns.create_network_from_model ("Barabasi-Albert", 100, 10) predictor � ns.create_link_predictor ("ResourceAlloca-tion") result � predictor.compute(network) ns.end()

Performance Comparison
NOESIS is designed to manage large complex networks comprising thousands or even millions of nodes. It provides  In the time complexity analysis, v is the number of nodes in the network, d is the maximum node degree, and c refers to the number of iterations required by iterative global link prediction methods.
10 Complexity structures to efficiently represent and deal with relational data. While these features are often offered by other network analysis frameworks, they often lack the parallel computing features NOESIS includes.
In this section, we perform an empirical comparison of the performance of different popular network analysis frameworks for some common data analysis tasks. e igraph, SNAP, and NetworkX frameworks have been chosen for their comparison with NOESIS because they offer a programmatic API. Software developers can use them as libraries in their own software projects within the constraints imposed by their corresponding software licenses (virtually none in the BSD licenses of NetworkX, SNAP, and NOESIS). Closed-source tools, such as Pajek, offer no API, whereas other tools strongly focus on particular areas (e.g., Cytoscape on interactive network visualization). We restricted the comparison of NOESIS to those tools with the most similar use cases (i.e., igraph, SNAP, and NetworkX).
For NOESIS, different CPU scheduling techniques were tested: a work-stealing scheduler (WSS), a future scheduler (FS), and a conventional thread pool scheduler (TPS). Our experiments were performed in an Intel Core i7-3630QM processor ( [74], and a Facebook ego network with 4039 nodes and 88234 edges [75]. e experimentation carried out in this section consists of computing different computationally expensive network metrics that are commonly used in network mining: betweenness centrality (BC), link betweenness (LB), closeness (CN), and all shortest paths from every node using Dijkstra's algorithm (APSP). e APSP task could not be implemented using SNAP because of support limitations. We performed 5 runs for each measure and reported the average execution time required to complete the task for each software tool and network dataset. e average execution time for each task/dataset/tool triplet is shown in Table 4. It should be noted that NOESIS is consistently faster that well-known existing frameworks. e superior performance of the NOESIS framework can be attributed to different factors. First of all, NOESIS is written in pure Java, which can be orders of magnitude faster than the Python implementation of igraph and NetworkX. e NOESIS Java code is highly optimized taking into account the peculiarities of the Java virtual machine (e.g., trying to minimize memory fragmentation and using a properly tuned cost model for different operations). As a consequence, the NOESIS Java implementation is surprisingly faster than the C++ implementation of SNAP, which should be expected to be more efficient since highly optimized C++ applications are usually more efficient than their Java counterparts. NOESIS parallel programming design patterns let algorithm designers easily exploit the parallelism in multicore processors and multiprocessor systems. e combination of these features allows NOESIS to complete tasks associated with computing important network-related structural properties faster than some popular software packages, even orders of magnitude faster. In our parallelization experiments, we implemented different CPU scheduling techniques. NOESIS is faster than existing tools regardless of which scheduler is employed. e NOESIS conventional thread pool scheduler, which implements a common solution to the parallelization of software with the help of a single thread pool, consistently obtains the best results in the batch of experiments we have performed. An alternative implementation using Java features is somewhat slower. Similar results were obtained using a workstealing CPU scheduler [76]. In a work-stealing scheduler, each processor in a computer system has a queue of work items (computational tasks (threads)) to perform. When a processor runs out of work, it looks at the queues of other processors and "steals" their work items. As a result, work stealing distributes the scheduling work over idle processors, and as long as all processors have work to do, no scheduling overhead occurs. It is employed in the scheduler for the Cilk programming language [77], the Java fork/join framework [78], and the .NET Task Parallel Library [79]. For the tests we performed, which involved a heavy work load for a single multicore processor, the simplest strategy seemed to work better, yet typical workloads might be different and alternative CPU schedulers might be preferable. Table 4: Performance of different network analysis tools in different tasks over several network datasets (time is shown in milliseconds).

Conclusion
In this paper, we have presented the NOESIS framework for complex network data mining. NOESIS provides a large collection of data analysis techniques for networks, in a much more efficient way than competing alternatives, since NOESIS exploits the parallelism available in existing hardware using structured parallel programming design patterns [80]. As shown in our experimentation, NOESIS is significantly faster than other libraries, even orders of magnitude faster, which is extremely important when dealing with massive networks.
Currently, the NOESIS project has achieved a mature status with more than thirty-five thousand lines of Java code, hundreds of classes, and dozens of packages. In addition, the production code is covered by automated unit tests to ensure the correctness of the implemented algorithms. NOESIS includes a custom library of reusable components with more than forty thousand lines, providing different general functionalities, such as customizable collections, structured parallel programming building blocks, mathematical routines, and a model-driven application generator on top of which the NOESIS graphical user interface is built. Since Python has become a very popular programming language for scientific computing, a complete API binding for Python is provided so that NOESIS can be used from Python.
NOESIS is one of the most complete and efficient "outof-the-box" frameworks for analyzing and mining relational data. Our framework can accelerate the development of software that needs to analyse networks by providing efficient and scalable techniques that cover different aspects of network data mining. Our framework is open source and freely available from its official website, http://noesis.ikor. org, under a permissive Berkeley Software Distribution license.
Data Availability e network datasets used in our performance evaluation are from publicly available studies, which have been cited in our paper. Network datasets can be obtained from the Stanford Large Network Dataset Collection (http://snap. stanford.edu/data).

Disclosure
e NOESIS framework has served as the foundation for the development of the data mining techniques that comprise the Ph.D. thesis of Gómez [81].

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.