Scalable System for Smart Urban Transport Management



Introduction
Advancements in smart devices and their interoperability have led many industries to develop sensor-based solutions that collect and exploit data not only from individual devices but also from their operating environment [1][2][3]. Modern public transport vehicles, such as buses and trains, are now equipped with many sensors that gather a huge amount of information, including speed, direction, location, fuel consumption, driving behaviour, departure and arrival times, and even on-board video [4,5]. While this already assists individual vehicles' navigation and enhances passenger security [6], the real-time combination and analysis of all the data collected across an entire transport network has the potential to provide extremely valuable information to transport companies and their consumers. Consequently, the demand for intelligent and context-aware transport systems [7] has been driven both by increasingly demanding users, who require smartphone apps informing them of the network status, and by operators, who need to monitor and predict the behaviour of the network to provide a better service. Indeed, context awareness can be used for in-route planning to resolve conflicts between different vehicles, accurate route selection when there is congestion, incident monitoring [8,9], suggesting alternate routes based on commuters' preferences [4,10], and helping bus drivers to find the nearest locations for emergency help and refuelling [11,12].
Intelligence in such context-aware public transport systems is provided by channelling the contextual data collected from the transport infrastructure into prediction systems so that the behaviour of the transport system can be forecast. Yet, this can only be delivered if a highly efficient information management architecture is available to handle an unprecedented volume of data in real time [13][14][15][16][17]. That challenge is further exacerbated by the fact that the data to be combined and analysed are heterogeneous and come from a variety of sources and sensors through diverse infrastructures [5]. This requires a high level of coordination in the transport system: data captured in the vehicles need to be transferred in real time to a common processing platform where they are transformed in order to be usable by potential recipients. As data validity is short in a dynamic transport system, real-time processing and delivery of such information to the target audience require a high-performance and scalable platform. Ease of integration, adaptability, and sustainability are other important quality factors, as new devices and their corresponding services have to be accommodated without affecting the system. In addition, high accuracy and usability, particularly in terms of prediction and data visualisation, are also very important requirements.
Traditional system architectures have been unable to meet many of those quality requirements of modern transport systems [18][19][20].
Thus, a unified, adaptable, and scalable software architecture for smart transport systems using service orientation principles, memory caching, and advanced data mining and machine learning models is proposed herein. Engineering such a component-based architecture not only has to meet the identified quality requirements but also has to consider implementation and integration of all the components and tiers in the system. The system architecture needs to be able to acquire, store, manipulate, and integrate information from heterogeneous data sources in order to eventually produce reliable predictions. This will help transport managers and planners to optimise, in real time, scheduling, stop usage, and passengers' time. The system architecture has to offer efficient data structures for quick storage and retrieval, on-the-fly analysis capabilities, and preparation of use-ready information to reduce processing at the time of request. Such a system will be able to conduct effective analytics on the data as they arrive, while keeping processed data ready to be transferred to requesting devices or vehicles.
After discussing advances in transport management systems together with data and architectural considerations, the core components of the proposed architecture are presented with a focus on integration services, data structure, and analytics components. Then, experimental results using real-time data fed from an actual transport network are presented, validating both the proposed architecture and the value of smart data analytics. Finally, the proposed solution is discussed and compared with other systems.

Intelligent Transport Management Systems.
The concept of the intelligent transport system (ITS) emerged in the late 1970s, well before the growth of digital devices and the availability of platforms for sharing the data produced or consumed by these devices [21]. ITSs, defined as a group of technologies, systems, and services for efficient and secure transport services, cover a wide domain that includes both private and commercial transportation systems. Benefitting from intelligent technology integration, ITSs have already provided many new opportunities for building safe, reliable, and scalable service infrastructures for transport [22][23][24]. Consequently, ITSs have become a key element of smart cities and connected digital ecosystems [25]. The complexity of ITS architectures arises from the combination of many different software and hardware components and their connectivity [21]. Thus, it is important to identify the interfaces between those components so that communication between them can be standardised [26]. They provide the core of the architectural framework that can be used for planning, defining, deploying, and integrating intelligent transport systems. Such a framework usually defines the functional entities, the workflows to connect them, and the services offered to the user [27]. These services typically include travel information, management of different traffic modes, collection of data for performance improvement, monitoring of driving behaviour, incident response handling, and management of vehicle fleets in public and private settings [28]. These user services are delivered with the support of many different entities that are connected to each of the services offered by the ITS [25]. In the context of a bus network, they comprise direct users of the system, such as passengers, bus management operators, and transport companies, and external entities that the ITS interacts with.
The latter include data providers (e.g., weather and roadworks) and the different sensors (e.g., GPS, speed, direction, temperature, and light detectors) that are installed on the vehicles [13,29]. Therefore, the design of a successful ITS architecture should consider all these parameters, using information and data flows, to define the system's functional details [30,31]. Each user service offered by the ITS should be represented as a workflow where each step is bound to the required process inputs and outputs. Moreover, as an ITS is a distributed system where different entities may be quite remote from each other, design workflows for its services have to be formally followed to represent how different functions are formed, so that their compliance with standards can be verified. Like any large software system, ITSs are very complex and require formal specification in order to ensure that the relationships between the different components are well defined and their behaviours are predictable [32]. In summary, the ITS architecture plays a key role in the identification of the system complexities and the ability to implement solutions.
Depending on the environment or region in which an ITS is implemented, different stages of deployment should be planned [24], so that the system is developed according to the knowledge of the context it belongs to. As the demand for services changes and new functionality needs to be introduced to meet it, an ITS should be able to improve continuously. This ongoing process involves evolution of existing services and integration of new features while maintaining existing functionalities [24]. Consequently, in common with other large-scale systems that have the capacity to evolve and expand their services, ITSs must offer many different quality characteristics including compatibility, extendibility, interoperability, reliability, integration, and standardisation [29,30]. In addition, they should deliver system performance, accurate predictions based on historical transport data, and scalability, that is, good quality of service under increased load.

Transport Data Analytics.
To take full advantage of data collected from public transport networks, smart analytics tools have to be included in the system to improve quality of service by supporting data-driven decision-making [1][33][34][35][36]. Through integration of data visualisation, data mining, and machine learning techniques in the data acquisition and storage processes, patterns may be detected and predictions can be made. For example, not only could an ITS predict bus arrival times at specific stops, but it could also identify particular behaviours based on information available from different stops either on a given route or across different routes. This may lead to the redesign of routes or part of the network to improve the transport system. There has been a considerable amount of research into designing effective approaches so that accurate and adaptive prediction systems can be offered to both bus companies and passengers [37][38][39]. Historical data, such as arrival times against planned times for given stops over an extended period, are essential to provide a deep insight into the behaviour of a transport network. Combined with machine learning techniques, such data allow forecasting of the network's status, even when there is no information available about external factors [23][40][41][42][43]. Existing systems have exploited neural networks and regression and clustering techniques to predict bus arrival times [43][44][45][46][47], while others have performed route prediction using GPS data observation [48]. Overall, conversion of data into knowledge by the application of smart analytics techniques supports strategic decision-making and system automation, delivering improved operations and services.

Architectural Consideration.
Many ITS architectures [49][50][51][52][53] have been developed according to the architectural concepts previously described and the standards set by the working group on ITS architecture of the International Organization for Standardization (ISO) [28,42,54]. They include a transport-knowledge-based architecture able to optimise the routes of a network. Using a hierarchical strategy of bus nodes and connection points at different levels, static schedule information, such as planned arrivals, and dynamic schedule information are combined for effective predictions [52]. Alternatively, a cloud-based design of an ITS was proposed, taking advantage of cloud platform and Internet of Things technologies [53]. The choice of a cloud platform was motivated by the fact that, as the flow of heterogeneous data produced by a growing number of sensors increases in speed and volume, a traditional client-server architecture would not be able to handle it. However, since those ITS architectures are high level, they do not include the implementations required to support actual public transport systems.
Nevertheless, there have been some attempts at implementing small-scale smart systems, such as the University of Wollongong's shuttle service [21]. That service (UniShuttle) runs buses between different campuses and other key destinations including the city centre and the beach. However, that system's functionality and performance are limited: as scalability was not among its requirements, quality of service has degraded with the increase in the number of buses in the network. Moreover, since its architecture was not designed to be interoperable and compatible with external systems, it cannot integrate with any external data source. Finally, its prediction model only considers real-time data from the segment the bus is driving through, making it blind to information from previous stops such as existing delays.
In summary, there has been an increasing demand for smart context-aware public transport systems in order to deliver a better service. This relies on the development of architectures that can provide a basis for intelligent information processing and decision-making support. So far, as existing architectures have remained quite high level, they have not been able to provide the essential services expected by transport systems. In particular, such architectures have to consider the integration of all data relevant to a specific network. In addition, they need to be flexible in order to adapt and evolve in response to rapidly changing requirements; otherwise, those systems cannot remain competitive when novel technologies become available. Attaching quality attributes to such software architectures can help in their evaluation, optimisation, and adaptation to evolving needs [55]. All those issues, especially those associated with the architectural design and the prediction models, are addressed in the proposed architecture introduced in the following section.

Service-Based Intelligent Transport Architecture

Key Principles.
As the state-of-the-art review has revealed, there is a need for a new and practical architecture for transport systems, designed according to both its intrinsic requirements and its expected interactions with other entities in a smart environment. In addition, it is desirable that this novel service-based intelligent transport architecture allows not only the creation of new smart transport systems but also the transformation and support of existing systems. The principles on which the proposed architecture is based aim at delivering the best quality of service. More specifically, it delivers the following five key features.
(1) Plug and Play Architecture. Any component in the architecture can be replaced by a more advanced one without any additional change being required to the system. This feature is particularly critical for analytics algorithms, since they belong to an extremely dynamic research field. Consequently, any service-based intelligent architecture must be able to be updated regularly to support competitive decision-making.
(2) The Real-Time Cache Manager. While most existing transport systems process information through expensive spatial queries from the database [56,57], usage of a real-time cache manager enables the system to work on a novel data structure based on hierarchical regions so that the effort of performing complex spatial queries can be drastically reduced, contributing to the system performance. Consequently, this facilitates integration of the system with any other internal or external system without affecting the required functionality.
(3) Automation. Absence of automation may lead different processes to rely on each other, which can lead to deadlocks and component failures when they are run in parallel [58]. Therefore, by automating the whole flow of functionality, processes from data acquisition to data visualisation can be autonomous, which ensures system robustness when they are run concurrently.
(4) Genericity. Although the proposed architecture is primarily designed to be used on public transport systems, it can be easily customised to any other data-oriented domain for the production of predictive analytics.
Thus, the implemented analytics algorithms are generic and independent from the application and the data warehouse. In principle, database type or data format should not affect the production of predictive analysis and visualisation for the delivery of strategic decision-making and operational excellence.
(5) Flexibility. Since many systems do not have integration or expansion capabilities, they encounter performance and scalability issues [21]. The proposed architecture is easily applicable to any existing system, not only allowing data integration and offering scalability but also empowering it with predictive analytics and monitoring.

Architecture Description.
The proposed architecture is a generic system, where data processing components are conceived to be scalable and efficient. It is designed not only to deliver the provision of transport services but also to adhere to quality of service requirements and to provide data analytics capability. This architecture relies on six modules.
(1) The Web Services Module. It is based on a dynamic WCF web service management system with asynchronous access, so that requests that do not need an immediate data response do not have to wait for data to be stored and processed [59]. The data collected from the vehicles' different sensors are dropped at designated web services that add the data requests to their processing pool. Those different web service components serve different types of data requests coming into the system, such as data coming from internal networks and those generated by heterogeneous external sources, for example, third-party transport and prediction data providers. For example, data reporting a vehicle's current location and those produced by the vehicle's speed meter are linked so that both speed and location values can be combined in subsequent analyses. Such an approach allows integration of new data sources into the architecture without affecting the inner workings of the system, which delivers elements of pluggability and extendibility.
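The asynchronous drop-off pattern described for the web services can be illustrated with a minimal sketch. It is written in Python with `asyncio` rather than WCF, and the function names, the record fields, and the sample coordinates are all hypothetical; it only shows a caller being acknowledged as soon as a reading is queued, while a background worker stores it and links readings from the same vehicle.

```python
import asyncio

async def receive(queue, reading):
    """Endpoint stand-in (hypothetical): enqueue the reading and acknowledge at once."""
    await queue.put(reading)
    return "accepted"

async def worker(queue, store):
    """Background consumer: takes readings from the processing pool and stores them."""
    while True:
        reading = await queue.get()
        await asyncio.sleep(0)  # placeholder for real storage and processing work
        # Readings are grouped per bus so that speed and location can be combined later.
        store.setdefault(reading["bus_id"], []).append(reading)
        queue.task_done()

async def demo():
    queue = asyncio.Queue()
    store = {}
    task = asyncio.create_task(worker(queue, store))
    acks = [
        await receive(queue, {"bus_id": "5676", "kind": "gps", "value": (55.676, 12.568)}),
        await receive(queue, {"bus_id": "5676", "kind": "speed", "value": 31.0}),
    ]
    await queue.join()  # wait until the worker has drained the pool
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return acks, store
```

Calling `asyncio.run(demo())` returns the acknowledgements together with the per-bus store. A production service would add persistence, back-pressure, and error handling, none of which are shown here.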
(2) The Real-Time Cache Manager Module. It is essential to enhance performance and support real-time monitoring and predictions [60]. Its purpose is to reduce database lookup calls when information is needed so that data can be dispatched on the go without any delay in accessing this information.
This module is also responsible for providing short-term on-the-fly predictions, such as arrival times, based on the transport network's current situation. Moreover, it generates alerts when specific areas are identified as being prone to congestion and/or delay issues. To ensure real-time response for both incoming and outgoing data, the module keeps a snapshot of the current situation, covering location data and predictions for short-term future events.
This is performed by maintaining an optimised hierarchical data structure to support storage and retrieval of the information. In particular, as queries for spatial data can be time-expensive once data volumes are large, the module's data structure enables the system to run spatial queries on filtered and context-specific data, for example, areas of the transport network.
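As an illustration of how a region hierarchy can narrow spatial lookups, the following sketch buckets bus positions into coarse zones and finer cells. The two-level grid, the cell sizes, and the class and method names are assumptions made for this example; the text does not specify the actual structure used by the cache manager.

```python
from collections import defaultdict

ZONE_DEG = 0.1   # assumed zone size, in degrees
CELL_DEG = 0.01  # assumed cell size, in degrees

def region_key(lat, lon):
    """Map a coordinate to a (zone, cell) pair of integer grid indices."""
    zone = (int(lat // ZONE_DEG), int(lon // ZONE_DEG))
    cell = (int(lat // CELL_DEG), int(lon // CELL_DEG))
    return zone, cell

class RegionCache:
    """In-memory snapshot of bus positions, indexed as zone -> cell -> bus."""

    def __init__(self):
        self._zones = defaultdict(lambda: defaultdict(dict))
        self._where = {}  # bus_id -> (zone, cell) of its last known position

    def update(self, bus_id, lat, lon):
        """Record a new position, removing the bus from its previous cell."""
        old = self._where.get(bus_id)
        if old is not None:
            self._zones[old[0]][old[1]].pop(bus_id, None)
        zone, cell = region_key(lat, lon)
        self._zones[zone][cell][bus_id] = (lat, lon)
        self._where[bus_id] = (zone, cell)

    def buses_in_cell(self, lat, lon):
        """Buses sharing the fine-grained cell of the given point."""
        zone, cell = region_key(lat, lon)
        return sorted(self._zones.get(zone, {}).get(cell, {}))

    def buses_in_zone(self, lat, lon):
        """Buses anywhere in the coarse zone of the given point."""
        zone, _ = region_key(lat, lon)
        return sorted(b for cell in self._zones.get(zone, {}).values() for b in cell)
```

A lookup only touches the dictionary entries for one zone or cell, so its cost depends on the number of buses in that region rather than on the size of the whole network.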
(3) The Analytics Component Module. It provides the system's intelligence. It processes data coming from different internal and external sources so that they can be converted into usable information for strategic decision-making. Once data retrieval is completed, the data are fed into the data clean-up component, which removes any anomalies and transforms the data's structure to make it compatible with the data structure required for visualisation, data mining, and predictions.
The visualisation component contributes to the visual representation of the data, as well as predicted probabilities. A key specificity is that, instead of only plotting the data, it also indicates the accuracy associated with prediction processes. The data analytics component comprises many data mining and machine learning techniques so that accurate statistics and prediction models can be generated to support decision-making. Its implementation includes a variety of algorithms including Clustering, Linear Regression, Logistic Regression, Decision Trees, and Neural Networks [61][62][63][64].
(4) The Transformation Module. It is responsible for interoperability with other internal or external systems when data exchange is required. As data are available in many different formats, like JSON and XML, the transformation module offers integration between different data types and structures. Since its components implement a layer for each type, the system implementation is independent of the format of the data that are either imported from or exported to other systems through the web services.
Thus, the transformation module ensures data interoperability by acting as a layer either between different components or for intersystem communication.
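The idea of one layer per data format can be sketched as follows. The record fields (`bus_id`, `speed_kmh`), the payload layouts, and the function names are hypothetical and chosen only for illustration; the point is that downstream components see a single internal structure whatever the incoming format.

```python
import json
import xml.etree.ElementTree as ET

def from_json(payload):
    """JSON layer: parse a JSON reading into the internal record."""
    raw = json.loads(payload)
    return {"bus_id": str(raw["bus_id"]), "speed_kmh": float(raw["speed_kmh"])}

def from_xml(payload):
    """XML layer: parse an XML reading into the same internal record."""
    root = ET.fromstring(payload)
    return {
        "bus_id": root.findtext("bus_id"),
        "speed_kmh": float(root.findtext("speed_kmh")),
    }

# One parser per supported format; adding a format means adding one entry.
PARSERS = {"json": from_json, "xml": from_xml}

def to_internal(fmt, payload):
    """Dispatch to the layer for the given format."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return parser(payload)
```

For instance, `to_internal("json", '{"bus_id": "5676", "speed_kmh": 31}')` and the equivalent XML payload yield the same dictionary, so analytics code never needs to know which format a provider used.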
(5) The Components Factory Module. It equips the system with dynamic flexibility for usage and integration of new components without the need for a new deployment. It holds references to the different components available in the system to conduct different functional activities such as data retrieval, clean-up, and application of prediction components. Although the current system integrates data from an SQL Server database, the system can equally work with components built to handle data coming from any other database system. By using the relevant transformation components, those data can be fed into the prediction system with the expected structure regardless of which database platform they came from. The module connects with all major components of the system, such as the real-time cache manager, thereby giving the system the capacity to use different real-time prediction systems.
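A registry-based factory is one common way to obtain this kind of run-time substitution. The sketch below is an assumption about how such a module could look, not the implementation described in the text; the class name, the roles, and the two toy predictors are hypothetical.

```python
class ComponentFactory:
    """Keeps references to available components and builds them on request."""

    def __init__(self):
        self._registry = {}

    def register(self, role, name, builder):
        """Make a component available under a functional role, e.g. 'predictor'."""
        self._registry[(role, name)] = builder

    def create(self, role, name, **options):
        """Build the component registered for the given role and name."""
        try:
            builder = self._registry[(role, name)]
        except KeyError:
            raise LookupError(f"no component registered for {role}/{name}") from None
        return builder(**options)

def mean_delay_predictor():
    """Toy predictor: next delay is the mean of the observed delays (seconds)."""
    return lambda delays: sum(delays) / len(delays)

def last_delay_predictor():
    """Toy predictor: next delay equals the most recent observed delay."""
    return lambda delays: delays[-1]

factory = ComponentFactory()
factory.register("predictor", "mean", mean_delay_predictor)
factory.register("predictor", "last", last_delay_predictor)
```

Swapping `factory.create("predictor", "mean")` for `factory.create("predictor", "last")` changes the prediction behaviour without touching the calling code, which is the property the module is meant to provide.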
(6) The System Components Module. It plays an important role in processing and archiving the data coming from different sources and converting them to a compatible structure so that the analytics component can easily fetch data. It provides the basis of the prediction process and enables the real-time cache manager to make on-the-fly estimations of arrival times and to detect traffic situations like congestion. Its integration component has endpoints for data coming from different external data providers about, for example, planned trips, roadworks, and diversions. It is responsible for interconnecting different formats and establishing criteria to join information from more than one source to create a single standard schema of information that is used for analysis and prediction.

Evaluation
As the proposed system has been designed to collect, deliver, and process data in real time while being scalable, its meaningful evaluation requires an implementation on an existing, operational, and large-scale transport system. This was performed through a partnership with Mermaid Technology [65], a company that runs an urban transport system with 1200 buses in the city of Copenhagen, Denmark.
This capital city is particularly appropriate to evaluate the system since it is a typical European city with an urban environment, modern establishment, heavy footfall, and streets and roads insufficient to handle its traffic requirements. Although bicycle usage is high, public transport is the preferred option to save time when commuting. Despite the presence of an underground train system connecting various destinations in the city, buses remain the main means of transport.
Mermaid's buses are equipped with on-board sensors collecting, in real time, data on bus operations, bus position (GPS based), speed and direction, status of local signs and traffic lights, and images from CCTV cameras placed on the buses.
This corresponds to a data volume of over 1 KB per second per bus. As those sensors are connected to a central hub, the data are transmitted in real time to a server where information is stored, processed, and shared with other buses to inform them of the activity of neighbouring buses, allowing some journey optimisation. In this study, the main emphasis is on the most challenging routes: those transporting the most passengers and those involved in complex scenarios, that is, routes running through the busy city centre and routes offering many connections to other routes. As a result, out of the 1298 routes operated by Mermaid's 1200 buses, the experiments focused on the data produced by 128 buses and their associated 188 routes. The rich context information captured by the on-board sensors enabled the server system to support provision of real-time, versatile, and customer-driven visualisation, data mining, and predictions. A sample of data stored in the system is displayed in Table 1. For each stop of a completed journey that was expected to start at 23:53:00 on 31st January 2018, the table provides the bus's predicted and actual arrival and departure times. Although the system is able to handle in real time without error (see Section 2.5.1) the flow of sensor data transmitted by the active buses, data associated with a given journey may occasionally be missing. For example, some sensors were not active or collected invalid data; in addition, some buses were rerouted or did not complete their journey. Consequently, data were preprocessed to flag such journeys so that their associated data were not considered for analysis tasks such as visualisation and predictions.
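The journey-flagging step mentioned above could take a form similar to the sketch below. The record layout, the use of seconds since the planned start of the journey, and the three checks are assumptions for illustration; the actual preprocessing rules are not detailed in the text.

```python
def flag_journey(stops, expected_stop_count):
    """Return the reasons a journey should be excluded (empty list if usable).

    `stops` is a list of dicts with a hypothetical 'actual_arrival' field, in
    seconds since the planned start of the journey, or None if not recorded.
    """
    reasons = []
    # Rerouted or abandoned journeys report fewer stops than planned.
    if len(stops) < expected_stop_count:
        reasons.append("incomplete")
    # Inactive sensors leave gaps in the recorded arrivals.
    if any(s.get("actual_arrival") is None for s in stops):
        reasons.append("missing_sensor_data")
    # Arrival times that go backwards along the route are treated as invalid.
    times = [s["actual_arrival"] for s in stops if s.get("actual_arrival") is not None]
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        reasons.append("invalid_timestamps")
    return reasons
```

Journeys for which the function returns a non-empty list would then be left out of visualisation and model training, while remaining available in storage.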
In the following sections, the proposed system is rigorously evaluated using the data produced by Mermaid's buses. Most importantly, performance, scalability, and robustness of the web services and the associated real-time cache components are assessed, including under challenging conditions. Following validation of the core of the system, the analytical value of the implemented analytics components is illustrated. First, examples of the versatility and customisation of real-time visualisation are provided. Second, benefits of real-time data mining and prediction applications are demonstrated.

Performance, Scalability, and Robustness of Web Services and Real-Time Cache Components.
The web services must provide the performance and scalability required for real-time data processing. In particular, prediction criteria have to be updated with the latest data being added along with the real-time data feed. Evaluation was conducted by connecting the web services to real client components loaded on the buses to evaluate the readiness, availability, and automatic scaling of the services provision. In addition to measuring response time according to the number of users, the presence of potential errors was monitored if any call failed. While initially between 500 and 1000 buses were connected to the system, eventually 25,000 unique buses were considered in a staging environment.
As shown in Table 2, despite significant increases in the number of calls made, the throughput, that is, the number of calls processed per minute, delivered by the web services components was only moderately reduced. This indicates that the system is highly scalable and can serve concurrent calls very efficiently. The system is able to handle more than 25,000 unique clients per minute. The response time of less than 395 milliseconds (Table 2), even when serving more than 70,000 requests per minute, shows that the system can handle thousands of concurrent requests without any impact on its performance. Moreover, those requests do not interfere with the response time of other requests made to the system. It is also important to note that data are not static; they continuously change and need to be read from the caching-based database. During those experiments, none of the calls failed as each one was served with the requested data, demonstrating the robustness of the system. That shows the uptime and availability of the service under stress, as requests are served within the threshold time of data availability, that is, one minute. As the server used for these experiments did not have high-end specifications (an Intel Xeon 2-core CPU with only 2 GB RAM installed was used), it is expected that a machine with higher specifications would deliver significantly better response time and throughput without generating any error.
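To give a feel for what these figures imply, Little's law (mean number of requests in the system equals arrival rate multiplied by mean time in the system) can be applied to the two quoted values. This is only a rough, hedged estimate: the text reports bounds ("more than 70,000", "less than 395 milliseconds") rather than means, and the function name is introduced here for illustration.

```python
def requests_in_flight(requests_per_minute, response_time_ms):
    """Little's law: mean concurrency = arrival rate x mean time in system."""
    arrival_rate_per_s = requests_per_minute / 60.0
    return arrival_rate_per_s * (response_time_ms / 1000.0)

# Taking the quoted figures at face value: about 1167 requests per second,
# each occupying the service for about 0.395 s.
estimate = requests_in_flight(70_000, 395)
```

Under these assumptions the estimate is on the order of a few hundred requests being served at any instant, which gives a sense of the concurrency the 2-core test server sustained.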

Real-Time Visualisation.
As the core of the system was validated, it can be exploited for data analytics. Real-time visualisation is an important functionality allowing human users to detect patterns that will support their decisions. However, provision of pertinent visualisation is not an easy task. For instance, by only considering 30 routes out of the 1298 existing ones, visualisation of their operational stops on a map is able to provide meaningful information (see Figure 2(a)). However, the additional mapping of live buses on the stops (Figure 2(b)) to indicate how many connections exist between different routes significantly increases its complexity, making it difficult to gain a clear understanding of the situation of the buses and their routes.
To address this, the monitoring system, built into the system's architecture, creates a variety of visualisation representations at different levels, such as system, zone, and route, so that information can be viewed at the most relevant resolution using the most appropriate depiction. For example, Figure 3 presents, updated in real time, the statuses of all the routes on the 15th of April 2018 between 7:00 and 9:00 am; the colour indicates the type of delay being reported at each stop. Such a map is particularly useful to infer possible relationships between different routes in terms of delays. As the proposed architecture is able to provide simultaneous real-time monitoring of more than 10,000 stops, continuously updating information about delays and expected arrivals, it provides a strategic advantage for decision-making.

Data Mining and Prediction Components.
While visualisations are able to present complex data so that users can gain a better insight into the status of the transport network, usage of data mining and machine learning algorithms permits going much further in terms of pattern recognition and predictions. For example, the system allows extraction of planned and actual arrival and departure times (see data samples in Table 1), so that the bus company can visualise and identify delay patterns for given times, days, routes, and/or drivers. This is illustrated in Figure 4, where the route followed by bus "5676" starting at "18:56" is analysed for a given working week (Monday to Friday). Plotting of delays and early arrivals at all stops for five days reveals that, along that planned 39-minute journey, the bus reached certain stops up to 5 minutes before or after the scheduled times. Moreover, it shows that, at the start of its journey, the bus systematically accumulates delays of up to 5 minutes. Then, after stop 8, when the bus leaves the busy city centre, the drivers are able to reduce that delay, reaching the final stop either on time or even early. Investigation of delays associated with that journey at different times of the day confirms this pattern (data not shown). Such knowledge suggests not only that the bus company should review that route's timetable to better reflect the time needed to arrive at the first stops, but also that those unrealistic arrival times put drivers under pressure to recover that delay. Eventually, this too often leads to early arrival at the last bus stops, providing a suboptimal service to the users.
The ability to predict whether a bus will arrive on time is critical to improve the service provided to users. Since the system gives access to both a large amount of historic data and current real-time data, machine-learning-based classifiers can be trained to inform users of future delays.
As Neural Networks (NNs) have made remarkable progress in the last few years [63,64] and have already been used with success in similar tasks [44], they are the technology of choice for building such a classifier. Thus, the analytics component includes an implementation of a relatively simple NN, that is, a multilayer perceptron network with an input layer passing the input variables to two hidden layers and an output layer predicting the delay status [62]. Since, intuitively, the time of the day, the delay history of the stop of interest, and the delay at the previous stop are expected to provide key information to predict the delay status at a given stop, only four input variables are considered: hour, minute, delay history, and delay duration at the last stop of the current journey. Focusing initially on a single route, 218,027 data points of past journeys completed at different times of the day were used to train the NN. More specifically, whereas the model was trained using all data captured between the 1st of January and the 31st of March 2018 (both working days and weekends), the model was evaluated using all data generated from the 1st of April to the 30th of April 2018. The experiment using those unseen data revealed that the NN-based classifier is able to predict accurately (99.21% prediction probability) whether a bus will be on time at the next bus stop.
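The shape of such a network can be made concrete with a small forward-pass sketch. Only the four inputs and the two hidden layers come from the description above; the layer sizes (8 and 4), the activations, the input scaling, the single "on time" output, and all names are assumptions, and the weights are random and untrained, so the returned probability carries no meaning until the network has been fitted to data.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-1, 1) / math.sqrt(n_in) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, layer, activation):
    """Apply one layer: activation(W x + b) for each output unit."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def encode(hour, minute, delay_history_s, last_stop_delay_s):
    """Scale the four input variables to comparable ranges (assumed scaling)."""
    return [hour / 23.0, minute / 59.0,
            delay_history_s / 600.0, last_stop_delay_s / 600.0]

def predict_on_time_probability(network, features):
    """Input layer -> hidden layer 1 -> hidden layer 2 -> output unit."""
    h1 = dense(features, network[0], relu)
    h2 = dense(h1, network[1], relu)
    return dense(h2, network[2], sigmoid)[0]

rng = random.Random(0)  # fixed seed so the untrained weights are reproducible
network = [make_layer(4, 8, rng), make_layer(8, 4, rng), make_layer(4, 1, rng)]
```

A call such as `predict_on_time_probability(network, encode(18, 56, 120, 90))` returns a value between 0 and 1; training the weights, for example by backpropagation on historical journeys, is deliberately left out of this sketch.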
Although such a classifier produces excellent results that provide value to users, it suffers from the fact that it operates like a "black box": the logic used for decision-making cannot be accessed and, therefore, no new knowledge can be gained from such predictions [66]. To address this, an alternative machine-learning-based predictor was also implemented using the Expectation Maximization (EM) algorithm [67]. As it relies on producing consistent clusters, their composition and nature can provide useful insight into the particular features that lead to the integration of a sample to a group.
Using the same dataset and features as previously, 10 clusters were produced. While delivering lower accuracy than the NN-based classifier (91.73% prediction probability), results can still be used practically. In addition, as Figure 5 shows, profiles of the generated clusters are accessible, making them available for further analysis. First, if a new instance is classified as belonging to one of 6 clusters (2, 5, 6, 7, 8, and 9), this indicates unambiguously that a delay is expected. Second, while membership of clusters 1 and 10 suggests delays, that of clusters 3 and 4 implies that the bus will be on time. Interestingly, while information about delays is a major component of cluster 10, knowledge of the time of day has priority in all the other clusters. Remarkably, cluster 4 considers the minute more important than the hour. As it is associated with the absence of delay, this may indicate that its membership is composed of stop times that are away from hourly peak traffic times. Obviously, additional data mining tools are required to extract more practical knowledge from those profiles.
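The principle behind EM-based clustering can be illustrated with a deliberately reduced sketch: a one-dimensional Gaussian mixture fitted to delay values only. This is not the configuration used in the experiments (which relied on four features and 10 clusters); the data, the two-cluster setup, and the starting means are made up for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each cluster for a single value."""
    scores = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(scores) or 1e-300  # guard against numerical underflow
    return [s / total for s in scores]


def em_1d(values, initial_means, iterations=50, min_var=1e-3):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: how strongly each cluster explains each value
        resp = [responsibilities(x, weights, means, variances) for x in values]
        # M-step: re-estimate weight, mean, and variance of each cluster
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty cluster: keep its previous parameters
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


# Made-up delays in minutes: one group near zero, one group of late arrivals.
delays = [-0.5, 0.0, 0.3, 0.5, -0.2, 3.8, 4.2, 4.5, 5.0, 4.0]
weights, means, variances = em_1d(delays, initial_means=[0.0, 4.0])

# A new observation is assigned to the cluster with the highest responsibility.
new_resp = responsibilities(4.4, weights, means, variances)
assigned = new_resp.index(max(new_resp))
```

Because each cluster is described by explicit parameters (here a mean and a variance), its profile can be inspected directly, which is the interpretability advantage discussed above.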
Additional experiments using data extracted from 227 different routes while keeping the same experimental setup indicate that both approaches, that is, NN and EM, remain accurate in this more challenging setting. Indeed, prediction probabilities only decrease slightly: 98.89% and 88.77% for the NN- and EM-based predictors, respectively. Performances of those models are illustrated in Figure 6, where a lift chart [68], a variation on the receiver operating characteristic (ROC) curve, plots the percentage of correct predictions according to a percentage of the overall stops.
This chart confirms the excellent accuracy of the NN-based predictor.
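One common way to obtain the points of such a chart is a cumulative gains computation: predictions are ranked by the model's confidence, and the share of correct predictions captured is accumulated against the share of stops considered. The sketch below assumes this construction; the exact procedure used to produce Figure 6 is not detailed in the text, and the function name and inputs are hypothetical.

```python
def cumulative_gains(confidences, correct):
    """Return (fraction_of_stops, fraction_of_correct_predictions) points.

    confidences: model confidence for each prediction (higher = more certain).
    correct: booleans telling whether each prediction was right.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    total_correct = sum(1 for _, ok in ranked if ok)
    points = [(0.0, 0.0)]
    found = 0
    for i, (_, ok) in enumerate(ranked, start=1):
        if ok:
            found += 1
        share = found / total_correct if total_correct else 0.0
        points.append((i / len(ranked), share))
    return points


# Made-up values, for illustration only.
points = cumulative_gains([0.9, 0.8, 0.6, 0.4], [True, True, False, True])
```

A curve that rises quickly towards 1.0 indicates that the model's most confident predictions are also its most reliable ones.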

Discussion and Comparison
The primary aim of this work is to develop a real-time predictive monitoring system for urban transport, which is based on a highly adaptable architectural framework for intelligent transport management, therefore addressing many of the issues associated with transport systems presented in Section 2. Transport is a core and vital service for modern economies and hence it is a natural progression to adopt a service orientation philosophy for its management.
Thus, one of the novel aspects of the developed architectural framework is integrating the service-oriented paradigm shift in information technology, together with the associated Quality of Service (QoS) factors that are necessary for the success of such complex systems.
This is combined with the latest advances in data mining and machine learning, novel performance-enhancing techniques, sensor and storage technologies including data gathering and transformation, and advanced visualisation techniques.
As presented in Section 2, a number of transport management systems, architectures, and approaches have been described in the literature. They can be classified into three categories: firstly, high-level frameworks/models and approaches, secondly, practical and special purpose systems, and finally studies that are mainly focused on the use of machine learning algorithms and techniques.
Overall, there have been a number of high-level frameworks/architectures and approaches proposed in the literature. Some of those are highly comprehensive yet personalised using standards for a set of user services based on their individual requirements [51], while others are presented as a guide for the development of future systems standards [54]. However, as their main focus is on standards and requirement policies, they do not cover aspects of practical implementation and construction of a public transport system, which are fully addressed in the proposed framework. Another approach, adopted in [28], is kept relatively simple so that it can be used as a base for developing more complex systems. Indeed, the absence of constraints-based architecture implementation makes it suitable for other applications. However, that approach remains at a relatively high level as it does not present any recommendation or details regarding actual implementation or evaluation. In contrast, those aspects are fully covered in our system as described in Sections 2.4.2 and 2.5.
Moreover, although a few practical architectures and systems have been proposed, they do not offer full solutions. An interesting piece of work was presented in [21] for a bus management system that is based on the IETF presence model, which is a subscription and observer model. Although it reported a reasonable level of prediction when using the historical model, there was no focus on the QoS attributes of the system, which are very important, particularly in terms of performance and scalability, when the number of buses increases in the network. In contrast, these issues have been evaluated extensively in the proposed system: the obtained results demonstrate very good performance and scalability factors (Table 2) and higher prediction accuracy (Figure 6). In addition, whereas on-the-fly integration of data mining is not part of the architecture design of that previous work, it has been addressed in the proposed system. Other works include the following: (i) the exploitation of a pillar-based architecture, which does not reflect the complexity of the implementation and construction of large transport management systems [18], and (ii) the use of mainly three layers, that is, sensors grouping for input, cloud storage, and client interface, following broadly a three-tier architectural approach [53] and, therefore, suffering from all its limitations, particularly in terms of extendibility, reusability, and scalability. Alternatively, some studies have focused on back-end data-driven decision support systems rather than full implementation and evaluation [66]. Overall, unlike other proposed systems, ours offers a fully integrated solution that is general, ensuring portability and adaptability, and is designed considering QoS factors.
Although a large number of papers have focused on the exploitation of data analytics and machine learning algorithms, ranging from clustering to the use of neural networks, they offer prediction results that are either comparable [43,47] or inferior [39,40] to the ones obtained by our fully integrated system (Figure 6). Indeed, many of the listed algorithms and models have been integrated in our system to form one of the core components of the proposed architecture. This allows different algorithms to be evaluated and eventually adopted within the self-contained analytics component, without affecting the rest of the system. Ultimately, the proposed approach covers the complete data-driven transport management application lifecycle, starting with data gathering and transformation and continuing through analytics and visualisation in a modular, adaptable architecture. Such a holistic approach is extremely important as new data sources are continuously emerging, and their integration requires a new generic modular architecture.

Conclusions
This paper presents a generic data integration architecture that provides an implementation blueprint for developing intelligent public transport systems. The architecture follows service-oriented principles and meets the quality requirements for transport management systems with real-time monitoring and forecasting. The implementation and evaluation of the proposed system architecture were conducted on the large-scale bus network of a European capital city.
The web services component of the system was benchmarked for performance with different numbers of concurrent users either requesting information or pushing information into the system through web services. Then, real-time visualisation of the data stream was presented to illustrate the performance achieved through implementation of smart analytics. Finally, results produced by the implemented data mining and machine learning models for real-time prediction were analysed in terms of their accuracy and their value for decision-making. The experimental evaluation of the system has demonstrated the possibility of developing a decision support system for complex urban transport management. The system is able to handle the heterogeneous data generated by thousands of buses going through tens of thousands of bus stops. Moreover, the production of informative visualisations and accurate predictions in real time makes it possible to improve the quality of service along scheduled routes. With the continuous development of sensor and smart analytics technologies and the growth of vehicle infrastructures, the plug-and-play and scalability characteristics of the architecture are essential to permit the integration of new components, while still delivering real-time performance. In particular, the adoption and customisation of additional domain-specific data mining and machine learning techniques will deliver enhanced intelligence to support further decision-making.

Data Availability
Data are available upon request.

Disclosure
Some of the materials contained in this paper were submitted to the School of Computer Science & Mathematics, Kingston University, London, by Nauman Ahmad Khan in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, August 2017 [69].

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.