Privacy-Aware Data Forensics of VRUs Using Machine Learning and Big Data Analytics

The present spread of big data has enabled the realization of AI and machine learning. With the rise of big data and machine learning, the idea of improving the accuracy and efficacy of AI applications is also gaining prominence. In traffic applications, machine learning solutions provide improved safety in hazardous traffic circumstances. The existing architectures face various challenges, of which data privacy is the foremost for vulnerable road users (VRUs). The key reason for failure in traffic control for pedestrians is flawed handling of user privacy. User data are at risk and are prone to several privacy and security gaps. If an intruder succeeds in infiltrating the setup, exposed data can be maliciously manipulated, fabricated, and misrepresented for illegitimate purposes. In this study, an architecture based on machine learning is proposed to analyze and process big data efficiently in a secure environment. The proposed model considers the privacy of users during big data processing. The proposed architecture is a layered framework with a parallel and distributed module that uses machine learning on big data to achieve secure big data analytics. It includes a distinct unit for privacy management that uses a machine learning classifier. A stream processing unit is also integrated with the architecture to process the information. The proposed system is realized using real-time datasets from various sources and experimentally tested with reliable datasets that demonstrate the effectiveness of the proposed architecture. The data ingestion results are also highlighted along with training and validation results.


Introduction
In today's technological world, data are growing rapidly, and people increasingly rely on them. Given the pace at which data grow, it is becoming impracticable to store them on any single server. The world today holds an enormous quantity of data that continues to grow exponentially and is insecure [1]. Moreover, the entire world has gone online with the invention of the web, and every action we take leaves a digital footprint that is prone to vulnerability [2]. With the rise of big data and machine learning, the notion of improving the accuracy and efficacy of AI projects is also gaining importance and is widely recognized [3]. Among the factors driving this growth of data are advances in technology, social media, and the Internet of Things (IoT). IoT is one of the latest concepts of the current age and is widely applied in traffic control and monitoring applications. The future lies in secure IoT, which will turn today's objects into intelligent and smart objects [4]. Smart systems include IoT devices, such as sensors and actuators, process input connectivity, and people. Sensors and actuators act as the backbone of any emerging system. The interactions among all these components create new types of smart applications and services. With the rise of IoT devices, the idea of edge computing is also gaining prominence and is broadly recognized. In traffic applications, machine learning solutions provide improved safety in hazardous traffic circumstances [5][6][7].
As several new and ground-breaking technologies promise benefits through enhanced optimization of community traffic systems, "smart" traffic system development selects the best of these techniques and services to resolve the most pressing traffic challenges [8][9][10]. Hence, the trend toward smart traffic is growing. Many aspects of urban life, from transportation management to building design and community safety, are considered ripe for reinvention. Besides, cutting-edge technologies such as cloud computing, robotics, artificial intelligence, big data, and particularly machine learning seem progressively more within reach [3,11]. The overall big data analytics process goes through several stages to serve its purpose [12].
These stages include identification of the problem, design of the data requirements, preprocessing, data loading, processing and analytics, and data visualization. First, the problems to be addressed using big data analytics must be identified accurately. Then, the data requirements are designed to provide a logical solution to be executed in the later stages. Big data are usually chaotic, messy, incoherent, incomplete, and inconsistent [13]. Therefore, proper preprocessing is required before the big data are processed. The next phase is data loading, which precedes the processing and analytics of big data. The smart traffic environment is based on IoT devices and objects that generate gigantic volumes of data (big data), which require efficient aggregation, processing, and analysis to achieve optimal results for decision-making [14,15]. Efficient, exhaustive analysis of such data is not possible through traditional data analytics techniques. Some big data analytics methods beyond the traditional ones have also been proposed in the past; however, no all-inclusive, common, and effective solution has been proposed to aggregate and process the big data produced in an IoT-based smart traffic environment [16][17][18]. The existing solutions are based on the traditional or classical Hadoop framework. Moreover, the existing solutions overlook the data ingestion (data loading) performance of big data files into Hadoop, which is one of the major factors affecting overall processing [19,20]. Big data analytics involves smart management of the data to provide real-time monitoring of the data population of the VRUs, which has expanded drastically throughout the world. Solutions using IoT big data are proposed for VRU information management along with traffic management. This research adopts a customization of the YARN parallel and distributed framework.
However, to realize reliable smart traffic, many challenges must be addressed, among which privacy is one of the most severe.

Literature Review
A malicious attack on user services can be extremely costly in terms of the trustworthiness of edge computing [21,22]. Hence, this article presents a secure architecture for data management to deal with data security challenges in smart traffic applications. The work related to the proposed architecture, concerning data analytics and machine learning for smart traffic data management, is significant. The key problems highlighted in the existing architectures are the use of a traditional MapReduce (MR) cluster, inadequate data loading, an intangible structure, and testing on only specific datasets [10,23]. A scheme was discussed in detail in the context of V2X connections [24]. Big data analytics approaches are also taken into consideration, including tiers that are responsible for the various steps and activities of data analytics. Though it is a complete four-tier architecture, with tiers ranging from data collection to the use of analysis results, it causes processing delay [25,26], and it uses a classical MapReduce framework that slows down performance. Moreover, this architecture focuses on aggregating data and results while overlooking the efficiency of data loading before analysis.
An approach is proposed for reducing conflicts between VRUs and automated vehicles [27]. This proposal focuses only on automated vehicles and does not support big data processing in general. The key issue in this model is data ingestion performance: it takes a long time to insert big data into the system for processing. Some researchers proposed a model based on data analysis that promotes the notion of smart traffic and processes big data but overlooks data loading efficiency. A framework is presented to overcome the issues of VRUs, but these researchers also overlook data loading and ingestion into a distributed environment. Some models have been proposed to deal with the same problem of big data analysis in the smart environment [14,28], but these solutions use the conventional cluster resource management scheme and suffer from inefficient data loading to the Hadoop server. Moreover, an architecture is proposed to investigate data in a transport environment in a more accessible and efficient way [29]. However, it causes additional processing delay, is tested only on transportation datasets, and also overlooks data loading efficiency when loading big data to the Hadoop server. The additional delay affects the overall performance of big data analytics.
On the contrary, a scheme is proposed using a parallel processing approach. Though a YARN-based solution is offered, data loading efficiency is still overlooked in this architecture. The standard practice of traditional data analytics techniques is to analyze limited data only, which leaves room for errors and biases in the big data scenario. Another challenge that needs to be addressed is inefficient data loading into the traditional cluster management framework, e.g., Hadoop. Traditional data loading is time-consuming, requires more storage, relies on difficult commands, and supports neither append nor partial ingestion. Similarly, the challenges of Hadoop processing based on traditional cluster management include scheduling issues, inefficient load balancing, scalability issues, NameNode availability, and responsibility unification. The objective of this research is to propose a framework based on edge intelligence that processes enormous data efficiently and overcomes the data loading and processing issues. IoT gathers data and directs drivers to follow the free lanes. A specific proposal is designed to realize the MapReduce paradigm integrated with Apache Spark for real-time processing of big data. Spark offers fast computation and allows reusability. Efficient data ingestion into the distributed storage mechanism is missing from the existing loading and storage process.
The current work has deficiencies in big data storage and processing for IoT-enabled intelligent transportation. Furthermore, model parallelism is also missing for effective extrapolation and decision-making. This research proposes a framework to overcome the existing challenges. Trust and privacy in smart traffic applications, particularly with regard to VRUs, are subjective, which makes attacks harder to recognize. Insecure VRU data in smart traffic applications could cause a breakdown in transportation monitoring and control services. Therefore, to enhance security, the level of insecurity in an application must be evaluated first. This study proposes a secure architecture based on machine learning in the smart traffic domain that evaluates the privacy level of the VRUs.

Proposed Architecture
The proposed architecture based on machine learning connects the smart community departments (e.g., the traffic monitoring and control department). The data sources comprise traffic monitoring and control big data. The workflow of the proposed parallel and distributed scheme is depicted in Figure 1. Data are gathered by the respective units from various traffic control sources (e.g., sensors and cameras). To devise an effective parallel and distributed architecture, the data must be scrutinized before computation. The data are generated by different devices such as environmental sensors, security monitoring sensors, traffic cameras, and transportation monitoring sensors. The data are collected by the various departments, such as the traffic-control authorities. This process is known as secure data collection. The data are classified using the machine learning approach and are then given to the proposed parallel and distributed architecture to be processed by the proposed modules. The degree of parallelism is balanced using a fixed block size for each chunk. The default block size of the utility is time-consuming and yields less parallelism, so the default size is optimized and modified to improve data loading efficiency. This data collection is part of a distributed system. It involves overall data management that includes aggregation, collection, and storage. The data are also preprocessed before being fed into the proposed scheme to remove noise and anomalies and thereby speed up processing. Afterward, the data are divided into chunks for parallel processing at the edge level. The distributed storage mechanism is also taken into consideration to assist the parallel processing. The YARN parallel and distributed platform is preferred for big data analytics because cluster management is handled separately by the resource manager, which is part of YARN. Premeditated algorithms are applied while processing the data in the cluster.
The processed results are sent for decision-making to the relevant smart society service providers and are finally forwarded to the users. Following filtration, the Hadoop processing unit processes the data stored in the distributed storage mechanism. Lastly, the analyzed data are used for community planning. The data are collected from the departments, and the decisions are sent back to the community development departments. The objective is to realize a smart traffic scheme that performs processing while keeping the data private. These community departments are the data sources for the proposed system and act as mediators between the system and the user. Architecturally, the proposed solution consists of three modules, namely data security, organization, and processing, which are shown in Figure 2.
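The fixed-block chunking described in this section can be illustrated with a minimal sketch. The function name `split_into_chunks`, the block size of 4, and the sample readings are assumptions made for illustration; the article does not state the block size it uses, and in a Hadoop deployment the split would be governed by the storage layer's block size rather than by application code.

```python
from typing import Iterator, List, Sequence


def split_into_chunks(records: Sequence[str], block_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks holding at most `block_size` records each."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(records), block_size):
        yield list(records[start:start + block_size])


# Hypothetical sensor readings standing in for collected traffic data.
readings = [f"sensor-{i % 3},reading-{i}" for i in range(10)]

# A smaller block size gives more chunks and therefore more parallel tasks.
chunks = list(split_into_chunks(readings, block_size=4))
chunk_sizes = [len(chunk) for chunk in chunks]
```

Each chunk can then be handed to a separate worker, which is the property the architecture relies on for parallel processing at the edge level.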

Data Security Layer.
The proposed design has a security layer that protects the VRUs' data from attacks. It is a part of the smart traffic architecture and provides resilience against attacks. The supplier manager (SM), user manager (UM), and supervisor are the components of the security layer. The SM and UM attend to the supplier and the user, respectively, while the supervisor applies the machine learning algorithm. A CNN deep learning (DL) technique is integrated that classifies data as secure or insecure. The SM is responsible for maintaining the profile of every supplier, and the UM is responsible for maintaining the profiles of users. The supervisor is trained using the classifier. The level of security is predicted using five classes: highly secure (HS), fine secure (FS), moderately secure (MS), highly insecure (HIS), and partly insecure (PIS). [Equations (1) and (2) are not legible in the source; only the symbol F_k survives from Equation (2).] Equation (1) is used to calculate the various levels of security. The purpose of the different security levels is to assign a particular score to each candidate user. The major purpose of the multiclass classification is to build a watch list of future risks. It helps identify intruders with low scores so that they can be analyzed further in future investigations.
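As a rough illustration of how a score could be mapped to the five security levels and used to build a watch list, consider the sketch below. It does not reproduce Equation (1), which is not available here: the score range [0, 1], the cut points in `THRESHOLDS`, the ordering of the levels, and the function names are all hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical cut points on a score in [0, 1]; the article does not give them.
THRESHOLDS = [0.2, 0.4, 0.6, 0.8]
# Assumed ordering from least secure to most secure.
LEVELS = ["HIS", "PIS", "MS", "FS", "HS"]


def security_level(score: float) -> str:
    """Map a score in [0, 1] to one of the five security classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return LEVELS[bisect_right(THRESHOLDS, score)]


def watch_list(scores: Dict[str, float],
               flagged: Tuple[str, ...] = ("HIS", "PIS")) -> List[str]:
    """Return the users whose level falls in the flagged (low-score) classes."""
    return sorted(user for user, s in scores.items()
                  if security_level(s) in flagged)


# Hypothetical user scores for illustration only.
flagged_users = watch_list({"u1": 0.9, "u2": 0.15, "u3": 0.35})
```

Keeping the thresholds in one list makes it easy to substitute the actual class boundaries once they are known.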

Big Data Organization.
The big data organization system involves overall data management, including aggregation, collection, and storage. The data are distributed across various nodes for computation to take load off the central server or cloud. Intelligent applications are supported by acquiring data via the Internet from various local devices. Several devices, including sensors, cameras, and object-mounted devices, record information about the environment in different domains.
These data are later used for analysis to gain insights and produce intelligent decisions. This is the first layer, and it is responsible for assembling the data from the different community departments that manage the smart community development services. A practical community not only holds large volumes of data but also spans versatile and wide-ranging processing areas. Because the data are heterogeneous, the smart community implementation depends on all forms of data processing. Data collection transforms signals measured in practical circumstances into digital form for processing. The collection is done by a dedicated system that converts data from analog to digital form. The smart traffic centers extract the data using various sensors in the community to gather real-time data. The data organization layer further contains data aggregation, where the data are grouped by the identifier of the connected devices. This aggregation is performed because the data are massive and must be assembled for efficient processing. Aggregation improves modularity and processing.
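Grouping records by the identifier of the connected device, as the aggregation step describes, might look like the following minimal sketch. The record layout (a device identifier paired with a numeric reading), the device names, and the function name `aggregate_by_device` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_by_device(records: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group readings under the identifier of the device that produced them."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for device_id, value in records:
        groups[device_id].append(value)
    return dict(groups)


# Hypothetical readings from a camera and a roadside sensor.
sample = [("cam-1", 12.0), ("sensor-7", 3.5), ("cam-1", 15.0)]
grouped = aggregate_by_device(sample)
```

Once grouped, each device's readings can be stored or processed as one unit, which is the modularity benefit the text mentions.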

Big Data Processing.
This unit is the main processing part. It first preprocesses the raw data, handling irrational data combinations, missing values, and out-of-range values before processing. If the data are not inspected for such problems, decision-making could yield misleading results. A transformation is also applied to bring the data to a specific scale. Then, the data are taken by the parallel processing unit, which is the backbone of the proposed architecture. The proposed architecture is based on a parallel computing model called MapReduce, which was introduced to realize big data analytics. This programming paradigm is composed of Map and Reduce functions. It is a useful model that processes huge datasets in parallel. It executes processes in a distributed manner and offers high availability. The underlying system also manages machine failures, performance issues, and efficient communication. Task distribution in the cluster is carried out using the YARN distributed cluster management framework.
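The preprocessing step can be sketched as follows. This is one possible reading, under the assumption that missing and out-of-range readings are simply dropped and that the scaling step is min-max scaling to [0, 1] (the results section mentions min-max normalization); the valid range and the sample speeds are hypothetical.

```python
from typing import List, Optional


def preprocess(values: List[Optional[float]], low: float, high: float) -> List[float]:
    """Drop missing and out-of-range readings, then min-max scale to [0, 1]."""
    kept = [v for v in values if v is not None and low <= v <= high]
    if not kept:
        return []
    lo, hi = min(kept), max(kept)
    if hi == lo:
        # All readings are equal, so there is no spread to scale.
        return [0.0 for _ in kept]
    return [(v - lo) / (hi - lo) for v in kept]


# Hypothetical vehicle speeds: one missing value and one implausible reading.
speeds = [30.0, None, 50.0, 400.0, 40.0]
scaled = preprocess(speeds, low=0.0, high=200.0)
```

Other policies, such as imputing missing values instead of dropping them, would fit the same interface.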
YARN is equipped with dynamic programming for task distribution and cluster management. Previous platforms such as the MapReduce paradigm are responsible only for processing. YARN is preferred because cluster management is handled separately by the resource manager, which is part of YARN. The fair algorithm is integrated with YARN to perform scheduling. Besides, interleaving is possible between the map and reduce phases; therefore, the reduce phase might begin before the map phase finishes.
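To make the Map and Reduce functions concrete, the sketch below counts observations per lane in plain Python. It runs sequentially in a single process and only mirrors the structure of the paradigm (map, shuffle, reduce); on a cluster, each chunk would be mapped on a separate node with scheduling left to YARN. The record format and lane names are hypothetical.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(chunk: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (lane, 1) pair for every 'lane,object' record in the chunk."""
    for line in chunk:
        lane, _ = line.split(",", 1)
        yield lane, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Two hypothetical chunks, as if produced by the chunking step.
chunks = [["lane-1,car", "lane-2,pedestrian"], ["lane-1,cyclist"]]
mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
counts = reduce_phase(shuffle(mapped))
```

Because each call to `map_phase` depends only on its own chunk, the map work can be distributed freely, which is what makes the model suitable for parallel execution.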

Results and Discussion
The proposed scheme is implemented on the parallel and distributed platform of Hadoop version 3.0, which is equipped with the Apache Spark module. Reliable datasets are utilized. The pyspark library is used in Python 3.8. Similarly, the resilient agent evaluation is carried out in a detailed setting with a machine learning classification module; the machine learning library is also implemented in Python 3.8. A comparative analysis of the proposed design against current proposals is provided. The experimental results and comparison demonstrate the effectiveness of the proposed design, and the results are discussed in this section. Results are produced using various reliable datasets to assess the proposed architecture, which is based on parallel and distributed paradigms using premeditated algorithms. We performed noise and anomaly removal on the data on top of the proposed architecture. The anomalies are removed using min-max normalization and the Kalman algorithm. Data ingestion is achieved using a map-only algorithm. The traditional YARN cluster management framework is customized with improved capacity and a fair scheduling algorithm. We applied a dynamic algorithm to set the parameters of the YARN framework dynamically. The processing is performed using MapReduce algorithms. We also optimized the MapReduce algorithm for edge computing so that it can be used at every edge. Thus, notable efficiency is achieved in processing time. The proposed architecture is implemented using the Hadoop parallel and distributed framework along with optimized algorithms. These datasets are preferred because they are used in the literature. We deliberately executed almost the same queries to compare the processing time and throughput of the proposed edge-enabled IoT architecture using customized MapReduce and YARN for parallel processing.

Data Security Results.
The results and experiments for the security layer include the required training on the dataset using an ML classifier. The model is trained on secure and insecure interactions with the proposed architecture. The security layer is assessed in a specific setting. Initially, the model was trained on 365 × 925 matrices. The training process of the Naïve Bayes classifier is shown in Figure 3.
Security and Communication Networks

The proposed resilient agent is also evaluated in a specific environment in which the proposed model is trained. To assess the proficiency of the proposed model, the confusion matrix is used, as depicted in Table 1. To measure the effectiveness of the classifier, the confusion matrix is computed for two classes (secure and insecure), as shown in Table 2. A value is considered secure if it is greater than 0.5; otherwise, it is considered insecure. The performance measures are applied to the ML technique used for the resilient agent. The accuracy of the technique is expressed as percentages in Table 2, which also gives the percentage for each confusion matrix value. Figure 4 confirms the enhanced accuracy in validation and training.
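The two-class evaluation with a 0.5 threshold can be sketched as below. The scores and labels are made up for illustration and are not the article's data; "secure" is treated as the positive class, which is an assumption.

```python
from typing import Dict, Sequence


def confusion_matrix(scores: Sequence[float], labels: Sequence[bool],
                     threshold: float = 0.5) -> Dict[str, int]:
    """Count TP, FP, TN, and FN, with 'secure' as the positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, is_secure in zip(scores, labels):
        # A value is secure only if it is strictly greater than the threshold.
        predicted_secure = score > threshold
        if predicted_secure and is_secure:
            counts["TP"] += 1
        elif predicted_secure and not is_secure:
            counts["FP"] += 1
        elif not predicted_secure and is_secure:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def accuracy(cm: Dict[str, int]) -> float:
    """Fraction of all samples that were classified correctly."""
    total = sum(cm.values())
    return (cm["TP"] + cm["TN"]) / total if total else 0.0


# Hypothetical classifier outputs and ground-truth labels (True = secure).
scores = [0.9, 0.4, 0.7, 0.2, 0.5]
labels = [True, False, False, False, True]
cm = confusion_matrix(scores, labels)
acc = accuracy(cm)
```

Note that a score of exactly 0.5 is classified as insecure here, following the "greater than 0.5" rule stated in the text.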

Training and Validation Results.
The enhanced accuracy in training and validation is a result of the enlarged number of epochs (e.g., 200 epochs). Likewise, Figure 5 shows the proposed model's validation and training loss, which indicates minimal loss. The reduction in the loss is also a result of the enlarged number of epochs (e.g., 200 epochs).

[...] using the specific utility. It has been experimentally shown that loading a dataset into Hadoop takes almost no time when the dataset is small. The overall efficiency of the proposed system, including the modification of all data loading parameters, is shown in Figure 6. In the same way, Figure 7 shows the threshold for the modification of all data loading parameters using the proposed system. The threshold is the set value that marks the point at which the difference between the existing and proposed schemes begins. The proposed scheme is manual in the context of data ingestion and automated in the context of classification and processing.

Conclusion
A smart traffic application is characterized by the extensive expansion of IoT-connected devices, particularly with the rise of big data and machine learning. Machine learning solutions provide efficient and accurate results. However, it becomes challenging to protect the privacy of users in smart traffic management and surveillance, because these produce an enormous amount of big data that must be processed and analyzed efficiently. In this study, an architecture based on machine learning is proposed to process big data efficiently in a secure environment while considering user privacy. The proposed architecture is a layered framework with a parallel and distributed module that uses machine learning on big data to achieve secure big data analytics. A specific privacy layer is proposed that classifies dishonest entities using machine learning.
The proposed system is realized using real-time datasets from various sources and experimentally tested with reliable datasets that demonstrate the effectiveness of the proposed architecture. The data ingestion results are also highlighted along with training and validation results. The proposed design optimizes the existing parallel and distributed framework to achieve efficient processing. The current proposals lack efficient parallel data ingestion and an efficient mechanism for handling communication overhead. Therefore, the security challenges are explored using machine learning in this paper. This paper proposes a separate secure and resilient module to overcome the privacy issue of the users. The proposed architecture is equipped with a resilient agent that uses an ML classifier. A stream processing unit is also integrated with the architecture to process the information produced by edge devices.

Data Availability
The data used to support the findings of the study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.