Complex Power System Status Monitoring and Evaluation Using Big Data Platform and Machine Learning Algorithms: A Review and a Case Study

Efficient and valuable strategies derived from the large amount of available data are urgently needed for a sustainable electricity system that includes smart grid technologies and very complex power system situations. Big Data technologies, including Big Data management and utilization based on the increasing volume of data collected from every component of the power grid, are crucial for the successful deployment and monitoring of the system. This paper reviews the key technologies of Big Data management and intelligent machine learning methods for complex power systems. Based on a comprehensive study of power systems and Big Data, several challenges are summarized to unlock the potential of Big Data technology in smart grid applications. This paper proposes a modified and optimized structure of the Big Data processing platform according to the power data sources and different structures. Numerous open-source Big Data analytical tools and software are integrated as modules of the analytic engine, and self-developed advanced algorithms are also designed. The proposed framework comprises a data interface, Big Data management, an analytic engine, and application and display modules. To fully investigate the proposed structure, three major applications are introduced: development of power grid topology and parallel computing using CIM files, high-efficiency load-shedding calculation, and power system transmission line tripping analysis using 3D visualization. The real-system cases demonstrate the effectiveness and great potential of the Big Data platform; therefore, data resources can achieve their full potential value for strategies and decision-making for the smart grid. The proposed platform can provide a technical solution to the multidisciplinary cooperation of Big Data technology and smart grid monitoring.


Introduction
Along with the fast installation of computers and smart communication devices, the power industry is experiencing tremendous changes both in the scale of the power grid and in system complexity. Building a modern combined energy system of various types of energy, including gas, cold, and heat, on the basis of the smart power system has become a development trend in the energy industry. As discussed in the literature [1][2][3], a modern energy system has several major features: (1) a high penetration of new energy resources is supported and utilized effectively; (2) different types of energy, such as electricity, gas, cold, and heat, complement each other and are integrated; and (3) the system is interconnected and relatively open, with distributed resources and the consumption side extensively involved. A huge amount of measurement data covering production, operation, control, trading, and consumption is continuously collected, communicated, and processed at a speed faster than in any previous period [4].
Appropriate and efficient data management and analysis systems are urgently needed to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats; furthermore, useful information needs to be extracted and shared to meet the fast-growing demands for high accuracy and real-time performance in modern power and energy systems [5]. Hidden values in power system big data cannot be effectively revealed by means of traditional power system analysis; therefore, Big Data technology and analytics are also urgently needed.
The Chinese power industry has considerable interest in Big Data analytics associated with power generation and management in order to cope effectively with severe challenges such as limited resources and environmental pollution, among many others [6]. In fact, Big Data technology has already been successfully applied as a powerful data-driven tool for solving numerous new challenges in the power grid, such as price forecasting [7,8], load forecasting [9], transient stability assessment [10], outlier detection [11], and fault detection and analysis [12], among others [13,14]. Big Data issues and applications are reviewed in detail in [15], and insights into Big Data-driven smart energy management are given in [16]. The major tasks of the architecture for these applications are similar and focus on two issues: big power data modeling and big power data analysis.
1.1. Power Grid and Big Data. Supervisory control and data acquisition (SCADA) devices are mainly used in traditional power industries to collect data and to secure grid operations, providing redundant measurements including active and reactive power flows and injections and bus voltage magnitudes [17]. However, the sampling rate of SCADA is slow. Unlike traditional SCADA systems, the phasor measurement unit (PMU) is able to measure the voltage phasor of the installed bus and the current phasors of all the lines connected to that bus. In particular, PMUs collect data at a sampling rate of 100 samples per second or higher; therefore, a huge amount of data needs to be collected and managed. To be specific, the Pacific Gas and Electric Company in the USA collects over 3 TB of power data from 9 million smart meters across the state grid [18]. The State Grid Corporation of China owns over 240 million smart meters, making the total amount of collected data reach 200 TB per year, while the total volume of data in information centers can reach up to 15 PB. Big Data is also often recognized as challenging in data volume, variety, velocity, and value in many applications [19,20], and these "4V" characteristics are reflected in the following aspects of power system applications, as illustrated in Figure 1.
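To give a feel for the volumes quoted above, the arithmetic can be sketched as follows. This is only a back-of-the-envelope illustration: the 100 samples per second rate, the 240 million meters, and the 200 TB per year come from the text, while the decimal definition 1 TB = 10^12 bytes and the function names are assumptions made here.

```python
# Back-of-the-envelope data volume arithmetic using figures quoted in the text.
PMU_RATE_HZ = 100             # PMU sampling rate from the text ("or higher")
SGCC_METERS = 240_000_000     # smart meters of the State Grid Corporation of China
SGCC_BYTES_PER_YEAR = 200e12  # 200 TB per year; assumes 1 TB = 10**12 bytes

SECONDS_PER_DAY = 24 * 60 * 60


def pmu_samples_per_day(rate_hz=PMU_RATE_HZ):
    """Number of samples one PMU measurement channel produces in a day."""
    return rate_hz * SECONDS_PER_DAY


def bytes_per_meter_per_year(total_bytes=SGCC_BYTES_PER_YEAR, meters=SGCC_METERS):
    """Average yearly data volume attributable to a single smart meter."""
    return total_bytes / meters


if __name__ == "__main__":
    print("PMU samples per channel per day:", pmu_samples_per_day())
    print("MB per smart meter per year:", round(bytes_per_meter_per_year() / 1e6, 2))
```

Under these assumptions a single PMU channel yields several million samples per day, whereas each smart meter contributes less than a megabyte per year; the challenge on the metering side comes from the number of devices rather than from the per-device rate.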
Insights from power system Big Data as a whole can be used to improve power efficiency, identify factors that potentially influence the power system status, understand power consumption patterns, predict equipment usage conditions, and develop competitive marketing strategies. The 4V characteristics can support the whole process of the power system, as illustrated in Figure 2.
1.2. Challenges. From the research status of Big Data technology and its applications in many aspects of the power system described above, it is easily concluded that Big Data management and analytics are definite development trends for future smart grids. However, challenges remain in this research area, and strategies and technologies for unlocking the potential of Big Data are still at an early stage of development. First of all, most existing power system utilities are not prepared to handle the growing volume of data, in terms of both data storage and data analytics. On the one hand, traditional machine learning or statistical computing methods are designed for single machines, and efficient extensions of these methods that can be used for parallel computing or for large-scale data are urgently needed. On the other hand, most of the analytic methods used in the power system are not suitable for handling Big Data; thus, a gap between Big Data analytics and power system applications still exists, and high-performance computing methods are required. Second, a big hurdle is the lack of an intelligent platform integrating advanced methods for Big Data processing, knowledge extraction and presentation, and decision-making support. It is believed that the successful combination of Big Data technologies and power system analysis will bring a number of benefits to the utility grid in the above-mentioned aspects. In response to these challenges, this paper presents a novel Big Data platform for complex power system status monitoring and evaluation using machine learning algorithms.
[Figure labels: power system operating data (SCADA, EMS, load, etc.); management data (government, internet, GPS, etc.); device data (electric equipment, sensors, information sets, etc.); structured data, semi-structured data, and unstructured data.]

Big Data Technologies for Complex Power System Monitoring
With the increasing variety of data recording devices, much more unstructured power Big Data is being recorded continuously. Some data need to be collected or analyzed at different scales or projected to another dimension in order to be described. Therefore, conflicts between data structures or semantics need to be resolved when projecting or transforming heterogeneous data into a unified form; uncertainty and dynamics should also be taken into consideration for data fusion. Based on these concerns, the Big Data platform is designed around a generalized management model that follows the complex logical relations between data objects and represents the data by normalization and extraction of the principal information. The challenge lies in how to design a flexible data management system architecture that accommodates multimode power data. This section introduces the state of the art of Big Data management technologies and data stream and value management.
2.1. State of the Art of Big Data Management Technology. In terms of distributed structures for Big Data management, the most popular designs are Hadoop [21] and Spark [22].
Hadoop was established in 2005 by the Apache Software Foundation, with the key technologies of Map/Reduce [23], the Google File System (GFS) [24] developed by Google Lab, and the nonrelational, high-volume data structure Bigtable [25]; open-source projects like Hive and Pig have constituted the entire Hadoop ecosystem [26].
Hadoop, based on the idea of a distributed structure, enjoys many advantages such as high extensibility and high fault tolerance, and it is able to process heterogeneous massive data at high efficiency and low cost. In the Hadoop ecosystem, files stored in HDFS (Hadoop Distributed File System) follow a subordinate (master/slave) structure and are divided into several blocks; each block has one or more duplicates distributed on different datanodes, so that the redundancy protects the data from loss caused by hardware failures. MapReduce is a programming model and an associated implementation for processing and generating large datasets. The computation is specified by a map function and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines; the flowchart is given in Figure 3. Because several computing processes run concurrently, the data handling capacity can be increased from the terabyte level to the petabyte level.
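The map and reduce roles described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the record layout (bus identifier, voltage magnitude), the function names, and the in-memory shuffle are all assumptions made for illustration, and a real runtime would distribute the map and reduce calls across cluster nodes.

```python
from collections import defaultdict


def map_voltage(record):
    """Map step: emit (key, value) pairs from one raw record.

    The record format (bus_id, voltage in p.u.) is hypothetical.
    """
    bus_id, voltage = record
    yield bus_id, voltage


def reduce_max(key, values):
    """Reduce step: collapse all values sharing one key into a single result."""
    return key, max(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce sequentially in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    measurements = [("bus1", 1.02), ("bus2", 0.98), ("bus1", 1.05), ("bus2", 0.97)]
    print(run_mapreduce(measurements, map_voltage, reduce_max))
```

Because each map call touches one record and each reduce call touches one key, both stages can be split across machines without changing the two user-supplied functions, which is the property the runtime exploits.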
It can be seen that Hadoop technology is able to provide a reliable storage and processing approach; however, there are still limitations due to the Map and Reduce process. For a complex computation process, MapReduce needs a massive number of Jobs to finish, and the relationships between these Jobs have to be managed by developers. Moreover, MapReduce offers less support for interactive data and real-time data processing.
Similar to the computing framework of Hadoop MapReduce, another open-source tool, Spark, developed by the AMPLab at the University of California, Berkeley, shares the advantages of MapReduce. Further, Spark can keep intermediate results in RAM rather than writing them to HDFS; thus, Spark is better suited to iterative algorithms such as those used in data mining and machine learning applications. As a result, Spark is usually applied as a complement to Hadoop.
The key technology of Spark is the Resilient Distributed Dataset (RDD) [27], an abstraction that addresses the slowness of MapReduce frameworks by sharing data in memory rather than on disk, saving a large number of the I/O operations otherwise needed to query the data from disk. Therefore, RDDs can greatly improve the iterative operation of machine learning algorithms and interactive data mining methods.
Recently, a number of Big Data management systems have been developed to handle Big Data issues. For example, four representatives, MongoDB [28], Hive [29], AsterixDB [30], and a commercial parallel shared-nothing relational database system, have been evaluated in [31], with the purpose of studying and comparing Big Data systems using a self-developed microbenchmark and exploring the tradeoffs between the performance of a system for different operations and the richness of the set of features it provides. In terms of Big Data platforms and tools that are suitable for power system and smart grid utilities, the main contributions are made by leading IT companies like IBM [32], HP [33], and Oracle [34]. IBM has carried out a number of cases aimed at improving energy efficiency. For example, Vestas increases wind turbine energy production using a Big Data solution to predict weather patterns more accurately and pinpoint wind turbine placement [35]. CenterPoint Energy applies analytics to millions of streaming messages from intelligent grid devices, enabling it to improve electric power reliability [36]. In the meantime, some newly established small technology companies, like C3 IoT [37], Opower [38], which was acquired by Oracle in June 2016, Solargis [39], and AutoGrid [40], are doing Big Data analytics research and development according to electricity market demand.
The large Internet companies in China, namely, Baidu [41], Aliyun [42], and Tencent [43], are all developing Big Data platforms, tools, and applications according to their own business. For example, Baidu was the first in the world to open its Big Data engine to the public, which consists of the key technologies of Big OpenCloud, Data Factory, Baidu Brain, and others. In this way, Baidu has gained early opportunities to cooperate with the government, organizations, manufacturing companies, medical services, finance, retail, and education fields. Other companies like Inspur [44], Huawei [45], and Lenovo [46] also provide products ranging from computer servers and storage to Big Data analytic software, which have laid a good foundation for the development of the Big Data platform.
According to [47], streaming data is considered to be an ordered sequence that can be read only once or a few times. Therefore, data stream management technology is the key to handling Big Data storage and processing.
Figure 4 shows the comparison between the traditional data processing model and the data stream processing model. In a traditional database, the stored data is static and is not queried or updated often. Users send data manipulation language (DML) statements as queries, and the system returns the results after searching the database. Therefore, I/O exchanges are inevitably generated, which slows down the search. For real-time processing of large amounts of streaming data, the traditional approach cannot meet the requirement. In contrast, the data stream model stores only a synopsis data structure instead of the entire dataset, so the data volume is much smaller and simpler to query than in the traditional model.
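As one possible illustration of a synopsis data structure, the sketch below keeps only a count, running mean, running variance, and extremes of a measurement stream, updated in a single pass (Welford's online algorithm). The class name and the choice of statistics are assumptions; the systems surveyed here use their own, richer synopses.

```python
import math


class StreamSynopsis:
    """Constant-memory summary of a numeric stream, updated one sample at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0            # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        """Fold one new sample into the synopsis; the raw sample is not stored."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        """Sample variance of everything seen so far (0.0 for fewer than 2 samples)."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


if __name__ == "__main__":
    synopsis = StreamSynopsis()
    for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
        synopsis.update(value)
    print(synopsis.count, synopsis.mean, synopsis.variance)
```

Queries such as "what is the average so far" are answered from the synopsis alone, so no scan over stored history, and hence no disk I/O, is needed at query time.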
The early research and design of Big Data stream management systems targeted only single-task applications. In order to handle streaming data with multiple tasks, the continuous query language was first proposed by Terry in Tapestry [48] in 1992, mainly for filtering e-mails and bulletin board systems. It was followed by Mark Sullivan of Bell Labs in 1996, who designed a real-time monitoring tool named Tribeca [49] for network surveillance. Tribeca was able to provide a limited number of continuous query languages and query operations. NiagaraCQ [50] was cooperatively developed by the Oregon Graduate Institute and the University of Wisconsin; it supports a continuous query language and the monitoring of durable and stable datasets across an entire wide-area network. In addition, Viglas and Naughton from the same project proposed a rate-based optimization for data streaming query speed [51]. In order to meet the requirements of data stream applications, a general data stream management system is needed, and the formal concept of a data stream management system was proposed in [52].
Nowadays, the most popular general data stream management systems can be summarized as follows. Aurora [53], which was developed by the Massachusetts Institute of Technology, Brown University, and Brandeis University, has a simple but distinctive framework and can be used especially for data streaming surveillance based on the key technology of trigger networks. Aurora strikes a good balance among accuracy, response time, resource utilization, and practicability, but it has the drawback of a simple query approach using the load shedding technique. TelegraphCQ [54], developed by the University of California, Berkeley, is mainly used for sensor networks and comprises a front end, shared storage, and a back end. A data stream in a constantly changing and unpredictable environment can be adaptively referenced in any query. However, the approximate query mechanism is neglected when resources are insufficient. STREAM [55], developed by Stanford University, is a prototype system based on a relational database. Under limited resources, STREAM can extend the query language and execute queries with high efficiency; thus, STREAM performs better on continuous queries. Other well-known data stream management systems have also been released to cope with data stream challenges, such as Storm by Twitter [56], Data Freeway by Facebook [57], Samza by LinkedIn [58], TimeStream by Microsoft [59], and Gigascope by AT&T [60].
Data value in power systems can provide guidance for data acquisition, data processing, and data application. Data value is determined by several factors, including data correlation, data fidelity, and data freshness [61]. To be specific, data correlation can be considered from two aspects: one is how the data is related to power dispatch, fault evaluation, and risk assessment; the other is the correlation within the data itself, where the data value is higher when the correlation is higher. Data fidelity refers to the conformance of the collected data to the real data situation. Defects in collected data always exist because of the sampling rate, noise, and the differing data acquisition equipment across the entire grid; thus, the real data situation may not be revealed. Finally, data freshness is also an important factor determining the data value, especially in power systems, where most data is streaming data recorded without interruption.

Analytical Tools and Methods for Power System Big Data
Machine learning is applied in power systems, as algorithms can be trained using historical data collected over time, providing useful information for system operators. As historical data is being collected at an increasing speed and in large volumes, effective machine learning approaches are urgently needed to discover valuable information and provide it to power system operators. Big Data is stored in a distributed way on multiple computers; thus, not all machine learning methods are appropriate for processing it. Moreover, if data analytics has to be carried out on a single computer, the data may be too large to fit into the main memory. Most traditional libraries/tools, such as R [62], Weka [63], and Octave [64], implement machine learning algorithms in a single-threaded fashion by design and are not able to analyze large volumes of distributed data. More recently, advanced Big Data processing platforms have been designed and implemented with parallel machine learning algorithms in order to achieve high efficiency. This section first gives a comprehensive literature survey of state-of-the-art machine learning libraries and tools for Big Data analytics in Table 1.
From Table 1, it can be seen that, along with the rapid development of computer technology, a surge in the development of machine learning libraries/tools started in the early 1990s. Within almost a decade, the research trend moved from traditional single-machine algorithm design toward distributed and large volumes of data. Octave is the earliest developed machine learning package, performing numerical experiments using a language that is mostly compatible with Matlab. Similarly, Weka was also developed by universities, and this free software is suitable for academic use because it integrates general-purpose machine learning packages. In particular, R has been widely used in both academia and industry because of its comprehensive statistical computing and graphics software environment. As mentioned above, Octave, Weka, and R are designed for single-threaded computing and thus are not able to handle large volumes of power system data.
In recent years, the R community has developed many packages for Big Data processing. For example, the biglm package [76] is able to perform linear regression for large data, and the bigrf package [77] provides a Random Forest algorithm in which trees can be grown concurrently on a single machine, and multiple forests can be built in parallel on multiple machines and then merged into one. Another group of R packages, such as hive [78], focuses on providing interfaces between R and Hadoop, so that developers can access HDFS and run R scripts in the MapReduce paradigm.
Among the oldest and most venerable machine learning libraries, Shogun was created in 1999 and written in C++, but it is not limited to working in C++. Thanks to the SWIG library [79], Shogun can be used transparently in languages and environments such as Java, Python, C#, Ruby, R, Lua, Octave, and Matlab. Another machine learning project designed for Hadoop, Oryx, comes courtesy of the creators of the Cloudera Hadoop distribution. Oryx is designed to allow machine learning models to be deployed on real-time streamed data, enabling projects like real-time spam filters or recommendation engines.

Machine Learning Algorithms.
Besides the powerful open-source algorithms and tools mentioned above, machine learning and statistical processing methods are also applied to handle various issues of power data. Basic machine learning algorithms are embedded in different open-source libraries/tools. Table 2 gives a comprehensive study and comparison.
The modern power system has benefited in many ways from the successful application of machine learning algorithms. Firstly, system stability and reliability have been remarkably increased. Many studies have reported impressive experimental results of various machine learning algorithms with applications in oscillation detection, voltage stability, fault or transient detection and restoration, islanding detection and restoration, postevent analysis, etc. [80][81][82][83]. With the emergence of Big Data analytics and smart grid technology, these monitoring and detection methods have been greatly improved, and an increasing number of novel approaches are being studied. For instance, real-time identification of dynamic events using PMUs is proposed in [84]; based on data-driven and physics models, the security of power system protection and anomaly detection are greatly improved, thanks to the rich synchrophasor data.
Secondly, power equipment utilization and efficiency are greatly increased. In the power industry, the waste of equipment resources is difficult to handle, and data resources are isolated from one another; thus, it is impossible to evaluate the exact status of each asset. Big Data analytics can provide better validation and calibration of the models, remove the isolation of data resources, and help operators understand the operating characteristics and life cycles of the equipment. For example, a data-driven approach for determining the maintenance priority of circuit breakers is introduced in [85]; the proposed method considers both equipment-level condition monitoring parameters and system-level reliability impact indices, so that a maintenance priority list can be generated.
Thirdly, Big Data visualization can help operators improve situation awareness and assist decision-making. Machine learning and data analytics produce only numerical results or two-dimensional charts and diagrams, which require operators with professional skills or experience to make accurate and timely decisions. The Big Data platform with 3D visualization in [86] manages massive power Big Data with multimode heterogeneous characteristics, showing the tripping lines and affected areas in a 3D environment. Thus, operators can make quicker and more reliable decisions and take possible preventive actions under thunder and lightning weather.

Statistical Processing Control Methods.
Statistical processing control methods were originally applied in industrial quality control, employing statistical methods to monitor and control a process based on historical and online data. In our early work, data-driven methods based on linear principal component analysis (PCA) [87] were applied to power system data analysis [88], and a distributed adaptive learning framework for wide-area monitoring, capable of integrating machine learning and intelligent algorithms, was set up in [89]. In order to handle power system dynamic data and nonlinear variables, dynamic PCA [90] and recursive PCA [91] were also developed to improve model accuracy. It is worth mentioning that linear PCA is unable to handle all process variables because of the Gaussian distribution assumption imposed on them, and many extensions using neural networks have been developed [92,93]. To address the challenges of handling redundant input variables, obtaining higher model accuracy, and utilizing non-Gaussian distributed variables, an improved radial basis function (RBF) neural network model-based intelligent method was also proposed in our early work [94]. The neural input selection is based on a fast recursive algorithm (FRA) [95,96], which was proposed for the identification of nonlinear dynamic systems using linear-in-the-parameter models. More accurate models can be obtained with optimization methods such as particle swarm optimization (PSO), genetic algorithm (GA), differential evolution (DE), artificial bee colony (ABC), and ant colony optimization (ACO), among other heuristic methods. The proper tuning of the algorithm-specific parameters is a crucial factor affecting the performance of these algorithms: improper tuning either increases the computational effort or yields a local optimal solution. In our early work [97][98][99], teaching-learning-based optimization (TLBO) was utilized for training an RBF neural network battery model. The TLBO method does not have any algorithm-specific parameters, which significantly reduces the tuning workload.
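To make the "no algorithm-specific parameters" point concrete, a minimal TLBO sketch is given below. It follows the commonly described teacher phase and learner phase with greedy acceptance; only the population size and iteration count, which every population-based method needs, are set by the user. The sphere objective, the bounds, and all names are stand-ins chosen for illustration and are not the RBF battery model training of [97][98][99].

```python
import random


def sphere(x):
    """Stand-in objective (sum of squares); the minimum is 0 at the origin."""
    return sum(v * v for v in x)


def tlbo(objective, dim, bounds, pop_size=20, iterations=50, seed=0):
    """Minimize `objective` with a basic teaching-learning-based optimization loop."""
    rng = random.Random(seed)
    low, high = bounds

    def clip(v):
        return min(max(v, low), high)

    pop = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(pop_size)]
    cost = [objective(p) for p in pop]

    for _ in range(iterations):
        # Teacher phase: move each learner toward the best solution (the teacher),
        # relative to the class mean scaled by a random teaching factor of 1 or 2.
        teacher = list(pop[cost.index(min(cost))])
        mean = [sum(p[d] for p in pop) / pop_size for d in range(dim)]
        for i in range(pop_size):
            tf = rng.randint(1, 2)
            cand = [clip(pop[i][d] + rng.random() * (teacher[d] - tf * mean[d]))
                    for d in range(dim)]
            c = objective(cand)
            if c < cost[i]:  # greedy acceptance: keep only improvements
                pop[i], cost[i] = cand, c

        # Learner phase: each learner interacts with a random peer, moving away
        # from the peer if it is itself better, and toward the peer otherwise.
        for i in range(pop_size):
            j = rng.choice([k for k in range(pop_size) if k != i])
            sign = 1.0 if cost[i] < cost[j] else -1.0
            cand = [clip(pop[i][d] + sign * rng.random() * (pop[i][d] - pop[j][d]))
                    for d in range(dim)]
            c = objective(cand)
            if c < cost[i]:
                pop[i], cost[i] = cand, c

    best = cost.index(min(cost))
    return pop[best], cost[best]


if __name__ == "__main__":
    solution, value = tlbo(sphere, dim=3, bounds=(-5.0, 5.0), seed=1)
    print(solution, value)
```

In contrast with PSO (inertia and acceleration coefficients) or GA (crossover and mutation rates), nothing in the update rules above has to be tuned, which is the property the text refers to.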
The methods mentioned above can be programmed and integrated as part of the analytic engine to support the processing of power Big Data. The data processing engine can therefore support overall system operation and control by building a dynamic, global, and abstract power data model, from which consequences are inferred and decisions are made. A detailed method comparison can be found in Table 3.
The fundamental assumption of many standard data-driven methods such as PCA, PLS, and LDA is that the measurement signals follow multivariate Gaussian distributions. As introduced in Table 3, PCA and PLS extract latent variables on similar principles, but they work in different ways. PCA tries to extract the largest variance from the covariance matrix of the process variables, while PLS attempts to find factors or latent variables (LVs) that describe the relationship between the output and input variables. PCA and LDA are also closely related in that both find linear combinations of variables to explain the data. However, LDA deals with the discrimination between classes, while PCA deals with the entire set of data samples without considering the class structure of the data. Similar to PLS, SIMs require both the input process data and the output data to form input-output relations. A brief comparison of the basic data-driven methods discussed above is given in Table 4.
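The statement that PCA extracts the largest variance from the covariance matrix can be sketched in a few lines. The example below builds the sample covariance matrix and finds its leading eigenvector by power iteration; this is a simple stand-in for the eigendecomposition or SVD routines a production library would use, and it assumes the covariance matrix is nonzero. All names and the toy data are illustrative.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of data given as a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def leading_component(cov, iterations=200):
    """First principal direction and its variance via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: variance captured along the direction v
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, variance


if __name__ == "__main__":
    # Two perfectly correlated variables: the second is twice the first.
    data = [(float(x), 2.0 * x) for x in range(5)]
    cov = covariance_matrix(data)
    direction, variance = leading_component(cov)
    total = sum(cov[i][i] for i in range(len(cov)))
    print(direction, variance, variance / total)
```

For this toy dataset one component explains all of the variance, which is the sense in which PCA summarizes correlated variables with fewer uncorrelated ones; PLS and LDA would additionally need output values or class labels, respectively.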
The issues of the Gaussian distribution assumption on data, the requirement of input-output relationships, the number of principal components or latent variables, and the computational complexity of these methods are compared in Table 4. In addition, LDA is comparable with PCA, and the datasets should be well documented in order to [...]

[Table 3, recovered rows: method and description]
Principal component analysis (PCA): PCA summarizes the variation in correlated multiattribute data as a set of uncorrelated components, each a linear combination of the original variables.
Partial least squares (PLS): PLS can find the fundamental relations between two data matrices, and latent variables are needed to model the covariance structure in these spaces.
Linear discriminant analysis (LDA): LDA finds a linear combination of features that characterizes or separates two or more classes of objects or events.
Subspace identification methods (SIM): SIMs are powerful tools for identifying the state space process model directly from data.
Time-varying
Recursive PCA (RPCA): RPCA is a generalization of PCA to time series; the eigenvector and eigenvalue matrices are updated with every new data sample.
Dynamic PCA (DPCA): DPCA includes dynamic behavior in the PCA model by applying a time lag shift method while retaining the simplicity of model construction.
Nonlinear
Kernel PCA/PLS: KPCA first maps the input space into a feature space via a nonlinear mapping and then computes the PCs in that feature space.
Neural networks: Neural networks are computational models that can be used to estimate or approximate unknown nonlinear functions.
Independent component analysis (ICA): ICA decomposes multivariate signals into additive subcomponents which are independent non-Gaussian signals.
Gaussian mixture models (GMM): GMMs describe an industrial process by local linear models using finite GMMs and a Bayesian inference strategy.
Support vector data description (SVDD): SVDD defines a boundary around normal samples with a small number of support vectors.
Recovered applications entry: classification, process monitoring [127], oscillation modes detection [128], etc.

For time-varying processes, recursive and adaptive methods are able to track slowly varying processes with a stable model structure. However, the model may be updated arbitrarily if no appropriate updating scheme is available. Meanwhile, dynamic process monitoring methods are easy to implement in practice, but the number of dynamic steps significantly affects the monitoring results, and the window size is difficult to determine.
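The time lag shift used by dynamic PCA, and the role of the number of dynamic steps, can be sketched as follows. Each row of the augmented matrix stacks the current sample with its predecessors, and ordinary PCA would then be applied to that matrix. The function name and the toy data are assumptions for illustration.

```python
def lagged_matrix(samples, lags):
    """Build the time-lag-shifted data matrix used as input to dynamic PCA.

    Each output row is [x(t), x(t-1), ..., x(t-lags)] flattened into one list,
    so the number of columns grows as (lags + 1) times the number of variables.
    """
    if lags < 0 or lags >= len(samples):
        raise ValueError("lags must satisfy 0 <= lags < number of samples")
    rows = []
    for t in range(lags, len(samples)):
        row = []
        for k in range(lags + 1):
            row.extend(samples[t - k])
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Four time steps of two hypothetical process variables.
    series = [[1, 10], [2, 20], [3, 30], [4, 40]]
    for row in lagged_matrix(series, lags=1):
        print(row)
```

The sketch shows why the choice of lag matters: every additional dynamic step adds a full set of columns and removes one usable row, so too many steps inflate the model while too few miss the dynamics.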
Compared to linear monitoring methods, nonlinear approaches can be used in a much wider range of applications because of the flexibility of nonlinear functions, which can model nonlinear relationships between variables. For kernel methods in particular, various nonlinearities can be modelled by introducing different kernel functions. Similarly, neural networks are theoretically capable of modeling any kind of nonlinearity. However, there are still some drawbacks; for example, the structure of a neural network is difficult to determine, and the training of the network parameters is computationally demanding. Kernel-based methods face a similar issue: an appropriate kernel parameter tuning method is needed, and the selection of a kernel function is not trivial. A new approach that tackles the representation of nonlinear behavior as well as non-Gaussian distributed variables is urgently needed.
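A short sketch may help show where the kernel function and its parameter enter a kernel method. Two common kernels are defined below together with the centered Gram matrix on which kernel PCA operates; the parameter values (`gamma`, `degree`, `coef0`) are arbitrary illustrative defaults, which is precisely the tuning burden mentioned above.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma controls the width and must be tuned."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel; degree and coef0 must be tuned."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def gram_matrix(samples, kernel):
    """Pairwise kernel evaluations K[i][j] = kernel(samples[i], samples[j])."""
    return [[kernel(xi, xj) for xj in samples] for xi in samples]


def center_gram(K):
    """Center a symmetric Gram matrix in feature space, as kernel PCA requires."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
    K = gram_matrix(points, rbf_kernel)
    for row in center_gram(K):
        print([round(v, 4) for v in row])
```

Swapping `rbf_kernel` for `polynomial_kernel` changes the nonlinearity being modelled without touching the rest of the procedure, while a poor choice of kernel or parameter changes the Gram matrix and therefore every downstream result.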
For non-Gaussian distributed data, the basic methods cannot perform well because of the Gaussian distribution assumption. Alternatively, ICA, GMM, and SVDD are the three most widely used and promising methods for non-Gaussian process monitoring. Although these methods were developed separately, they are in fact highly related to each other. Sometimes these methods can even be combined, and they are also capable of handling more than one data characteristic. For example, ICA describes the measurement signals as a linear combination of non-Gaussian variables, while GMM makes the similar assumption that the process dataset can be described by several local linear models. Moreover, the calculation of control limits for ICA-based non-Gaussian process monitoring involves kernel density estimation, which is also commonly used for SVDD. The detailed comparative advantages and disadvantages of these methods are listed in Table 5.
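The use of kernel density estimation for a control limit can be sketched as follows: a Gaussian kernel density estimate is fitted to values of a monitoring statistic recorded under normal operation, and the limit is the point below which the estimated distribution holds the chosen confidence level. The bandwidth rule (the normal reference rule of thumb), the bisection search, the sample values, and all names are assumptions for illustration, not the procedure of any cited work.

```python
import math
import statistics


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rule_of_thumb_bandwidth(samples):
    """Normal reference rule: h = 1.06 * std * n^(-1/5)."""
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-0.2)


def kde_cdf(x, samples, bandwidth):
    """CDF of a Gaussian kernel density estimate evaluated at x."""
    return sum(normal_cdf((x - s) / bandwidth) for s in samples) / len(samples)


def kde_control_limit(samples, confidence=0.99, tol=1e-9):
    """Smallest x whose estimated CDF reaches `confidence`, found by bisection."""
    h = rule_of_thumb_bandwidth(samples)
    low = min(samples) - 10.0 * h   # estimated CDF is essentially 0 here
    high = max(samples) + 10.0 * h  # estimated CDF is essentially 1 here
    while high - low > tol:
        mid = 0.5 * (low + high)
        if kde_cdf(mid, samples, h) < confidence:
            low = mid
        else:
            high = mid
    return high


if __name__ == "__main__":
    # Hypothetical monitoring-statistic values from normal operating data.
    normal_operation = [0.5, 1.0, 1.2, 1.9, 2.3, 2.8, 3.1, 4.0]
    limit = kde_control_limit(normal_operation, confidence=0.99)
    print("99% control limit:", limit)
```

A new sample whose statistic exceeds the limit would be flagged as abnormal; because the limit is read from the estimated density rather than from a Gaussian formula, no distributional assumption on the statistic is needed.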

A Real-System Case
In this paper, a Big Data platform integrated with data management and an analytical engine is proposed as a real-system case study. This platform was designed to meet the special conditions of the power grid in South China, such as its large scale, complex geographical and weather conditions, and AC/DC mixed operation over long distances. Big Data technologies are applied to this power network to assist with condition monitoring and state estimation of the transmission and distribution systems, collecting multiplatform power data and realizing high-efficiency processing and analysis of data from the power grid at different levels.

The Framework of Electric Power Big Data Platform.
The framework of the electric power Big Data platform consists of a database, a data interface, a Big Data management system, an analytic engine with various machine learning tools and algorithms, and application and 3D visualization modules; a detailed structure is given in Figure 5. The first challenge is to set up an efficient database for the large volume of multisource heterogeneous power data collected through different sources. A traditional power system database is designed to store structured data files using tables; thus, the size of storage is limited and the data operation efficiency is low. For Big Data platforms, various data are collected, for example, operational data from the production management system and energy management system, real-time data recorded by an online monitoring system and equipment monitoring system, and other forms of heterogeneous data such as weather files, geographic information, images, and video data. In terms of data status, historical data, real-time data, and streaming data are all needed for Big Data processing and analysis. This platform integrates several data storages according to each data structure, so that it can provide useful and timely information to assist decision-making by processing large amounts of data with different structures with high efficiency. All the information and knowledge can be integrated to provide strategies for system operation and evaluation, system inspection, and status estimation for power equipment and the entire power grid.
In order to efficiently manage and store the multisource Big Data, this paper proposes a special data acquisition structure. For various databases, SQOOP is a tool designed for efficiently transferring bulk data between HDFS and structured datastores such as relational databases (MySQL, Oracle). For messages between databases and the platform, MQTT (message queuing telemetry transport) is chosen as part of the data interface. MQTT is well known as an "Internet of Things" connectivity protocol, and it was designed as an extremely lightweight publish/subscribe messaging transport. Based on this data interface, power system data collected by smart devices can be managed in real time. Data preprocessing, including data verification, outlier removal, transformation, and evaluation, can be realized to provide a solid and practical database for the analysis procedure. Moreover, other related unstructured data such as weather conditions, lightning and storms, geographic information, and human activities (local population, age distribution, professions, behavior and activity patterns, internet sentiment, and so on) can be connected to a certain extent with the power load, power generation, consumption, electricity market, and so on. The data sources mentioned above cannot be processed and analyzed simultaneously in the traditional way; only this novel approach of using Big Data to deal with the challenges can establish a more comprehensive knowledge model of the city power grid.
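The verification and outlier-removal steps of such a preprocessing stage can be sketched as follows. This is an illustrative Python fragment, not the platform's implementation: the record layout (timestamp, value), the valid range, and the modified z-score threshold of 3.5 based on the median absolute deviation are all assumptions.

```python
from statistics import median


def preprocess(readings, valid_range=(0.0, 1000.0), mad_threshold=3.5):
    """Clean a list of (timestamp, value) readings; value may be None.

    valid_range and mad_threshold are assumed, illustrative settings.
    """
    low, high = valid_range
    # Step 1, verification: drop missing and physically implausible values.
    verified = [(t, v) for t, v in readings if v is not None and low <= v <= high]
    if len(verified) < 3:
        return verified
    values = [v for _, v in verified]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return verified  # no spread to judge outliers against
    # Step 2, outlier removal: modified z-score based on the median
    # absolute deviation, which is robust to the outliers themselves.
    return [
        (t, v) for t, v in verified
        if 0.6745 * abs(v - med) / mad <= mad_threshold
    ]
```

A median-based rule is used here because a mean-based rule can be distorted by the very spikes it is meant to remove; other rules would fit the same place in the pipeline.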

High-Performance Analytical Engine.
Effectively managing the Big Data is only the first step; the key issue is to set up an analytic engine with high efficiency. Based on the functional modules and the needs of power system applications, this particular analytic engine can provide several practical functions, such as operation risk evaluation, status estimation, and decision-making support. The detailed structure of the Big Data computational engine is given in Figure 6.
This analytical engine integrates a number of open-source basic algorithm packs and self-developed algorithms. The open-source algorithm packs mentioned in Section 3 have been developed and tested by researchers and companies for many years. For example, Apache Spark, a fast and general engine for large-scale data processing, can be used interactively with Scala, Python, and R shells. Many powerful computing libraries can be used together with Apache Spark, such as the numerical computing tool NumPy, the scientific computing tool SciPy, and the data analysis library Pandas, alongside Spark's scalable machine learning library MLlib [70], its API for graphs and graph-parallel computation GraphX [129], and so on. In addition, this platform incorporates the interactive development and operating environments IPython and Jupyter [130]. Effective power grid decision-making depends critically on the analytic methods in the platform. Therefore, effective methods for the real-time exploitation of large volumes of power data are urgently needed. Robust data analytics, high-performance computation, efficient data network management, and cloud computing techniques are critical to the optimized operation of power systems.
For self-developed algorithms, spatial-temporal correlation analysis is able to mine both the strong and weak connections among the numerous variables in a power grid, by setting up a power system spatial-temporal model and a data-driven model based on the process history database. Modelling methods are provided, including artificial neural networks, linear and nonlinear analysis methods, Gaussian-based kernel methods, regression and classification methods, and clustering methods. Pattern recognition methods for spatial-temporal correlations are provided, and the spatial proximity weights, time delay, and correlation effect are calculated and quantified [131]. This idea is suitable for analysing the consumption behaviors of citizens at different locations and times, as well as the effect on power transmission lines of the power grid surroundings, including geographic information, weather variations, human activities, and road vehicles and traffic situations [132]. A knowledge base of interconnected factors within the entire city grid can be set up for analysis and predictions.
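One elementary way to quantify a time delay and a correlation effect between two measured variables is a lagged Pearson correlation. The Python sketch below is only an illustration of that idea under assumed names (`driver`, `response`, `max_lag`); it is not the spatial-temporal model of [131], which is not specified here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def best_time_delay(driver, response, max_lag):
    """Return (lag, r) maximising |corr(driver[t], response[t + lag])|."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        y = response[lag:]          # response shifted back by `lag` samples
        x = driver[: len(y)]        # align the driver with the shifted response
        r = pearson(x, y)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

For example, `driver` could be a temperature series at one site and `response` the load at a nearby substation; the returned lag is then an estimate of how many sampling intervals the load trails the temperature.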
This proposed distributed computational engine is the key element of the entire Big Data platform; many functional modules can be developed based on these open-source tools. It is believed that this novel approach will gradually change the traditional way of power system analysis and operation, and that it is the only efficient way to realize future smart grids with a high level of automation and intelligence.

4.3. 3D Visualization.
The geographic information system (GIS) has been widely used in electric power systems [133,134] and is vital for improving the operation efficiency of the electric power system. It can maintain, manage, and analyze power data and integrate power network models, maps, and related data in a solution for desktop, web, and mobile devices. Most power GIS systems mainly adopt a two-dimensional map as the visualization model. However, 2D GIS has significant limitations in terms of presentation and analysis of geospatial and power data, and it is difficult to display panoramic information on the power running status. The proposed Big Data platform adopts a web-based visualization method based on Cesium and 3D City Database (3DCityDB) [135] to construct a three-dimensional panoramic electric power visualization system, which is shown in Figure 7.
The 3D models of electric towers, lines, equipment, and geographical entities (buildings, roads, etc.) are visualized in the Cesium scene and managed by a Cesium manager. On the server side, the Java Servlets and JavaServer Pages for power-related data processing functions reside in Tomcat, which communicates directly with the web client and processes client requests. Two-dimensional map requests are submitted to the Geoserver, while three-dimensional map requests are processed by a 3DCityDB web feature service. 3DCityDB is a free open-source package consisting of a database schema and a set of software tools to import, manage, analyze, visualize, and export virtual 3D city models according to the CityGML standard. In this architecture, 3DCityDB has two important tasks: the first is to convert a two-dimensional electric map model to a three-dimensional model and save it into the PostgreSQL database, and the second is to provide a three-dimensional web feature service for a power system client based on Cesium.
Based on the model calculation and Big Data analytical engine, the visualization of spatial information and power system applications can be provided as services. Thus, the power system equipment and power grid can be merged with GIS and shown on the map, together with the environmental factors. Therefore, many demands of power grid visualization can be met, including real-time monitoring, analysis, and decision-making, among others. The 3D visualization system provides an effective way of presenting the huge amount of information; it improves the situation awareness of system operators and helps them interpret newly appearing information, so that the accuracy of decision-making for the entire power system can be greatly increased.

Development of Power Grid Topology and Parallel Computing Using CIM Files.
Power element data, connections, and their status are stored as common information model (CIM) files in the power system, which are significant for power system analytics. The first step is to extract the connectivity between each electric point as data to be stored in the relational database. For most of the analytic methods, the above-mentioned CIM file extraction is applied to fit the relational database. However, a topology analysis needs many correlation analyses across multiple complex tables, so it is hard to meet the requirement of real-time, high-speed processing. The platform proposed in this paper develops a fast-processing scheme for the power grid topology setup; thus, the analysis can be realized with high efficiency. The diagram is given in Figure 8. The proposed platform detects any update of the CIM files transmitted to the FTP end, loads the new data into memory, and correlates it with other structured data using Spark SQL, generating a preprocessed data table. After that, a fast search according to "physical-electrical-physical" rules in the power grid is applied to set up a topology of the grid. The whole process is realized based on the Spark SQL database and parallel computation; thus, the analysis efficiency is greatly improved, thanks to the fast and parallel correlation analysis. Under this framework, many tasks can be done easily, including analysis result extraction, power grid topology setup, power system branch model calculation, "bus-branch" model analysis, and other functions. Therefore, this platform is able to provide a database and analytical engine for large-scale parallel computation of the power grid, real-time status analysis for smart grids, and other useful applications.
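The topology search can be sketched in plain Python under one assumed reading of the "physical-electrical-physical" rule: from a piece of equipment (physical), move to the connectivity nodes it is attached to (electrical), and from there to the other equipment on those nodes (physical). The table layout of (equipment, connectivity node) rows, the identifiers, and the treatment of open switches are illustrative assumptions; the platform itself performs this step with Spark SQL, which is not reproduced here.

```python
from collections import defaultdict, deque


def build_adjacency(terminals):
    """terminals: iterable of (equipment_id, connectivity_node_id) rows."""
    equipment_to_node = defaultdict(set)
    node_to_equipment = defaultdict(set)
    for equipment, node in terminals:
        equipment_to_node[equipment].add(node)
        node_to_equipment[node].add(equipment)
    return equipment_to_node, node_to_equipment


def connected_equipment(terminals, source, open_switches=frozenset()):
    """Breadth-first search: equipment -> connectivity node -> equipment."""
    equipment_to_node, node_to_equipment = build_adjacency(terminals)
    visited = {source}
    queue = deque([source])
    while queue:
        equipment = queue.popleft()
        if equipment in open_switches:
            continue  # an open switch is reached but does not conduct further
        for node in equipment_to_node[equipment]:
            for neighbour in node_to_equipment[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
    return visited
```

Rerunning the search after a switch status change gives the updated set of connected equipment, which is the kind of repeated query that benefits from keeping the preprocessed table in memory.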

High-Efficiency Load-Shedding Calculation.
The calculation of load-shedding in the power system can objectively quantify how much loss the real system undergoes after equipment failure; thus, it can measure the operation risks and provide significant information for decision-making on equipment reconditioning or replacement. The actual load reduction for each load type at each electrical point is needed for the calculation; thus, it is very time-consuming to evaluate power grid risk over a large number of predefined fault scenarios. In the proposed platform, a calculation scheme based on Spark and Compute Unified Device Architecture (CUDA) is applied, as shown in Figure 9.
The complete load-shedding scheme contains two stages: an offline test stage and an online parallel computing stage. The computation tasks are first divided into different working regions on Spark; then the Matlab algorithms are packaged and called, and each computation task is passed to working threads in every division, where parallel computing is realized. After that, the results at each step are collected progressively; thus, the risk evaluation tasks for multiple scenarios can be finished. For real-system cases, a total of 6000 files with a size of 1.2 GB are calculated according to the flowchart given in Figure 9, and the comparison with the calculation time on a single machine is given in Table 6.
It can be easily seen that parallel computing is able to solve the problem of low efficiency when risk evaluation over multiple scenarios is carried out in the power system. The load level of each electrical point can be monitored dynamically, and the topology change of the power grid due to any system maintenance or the dropout of multiple power system units can also be calculated with high efficiency; therefore, the computation time is greatly shortened.
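The divide, compute, and collect pattern described above can be sketched with Python's standard thread pool. This fragment only illustrates the task structure: the toy `load_shed` model, the scenario layout, and the worker count are assumptions, and the actual platform distributes the tasks with Spark and CUDA and calls packaged Matlab algorithms. Threads in CPython also do not speed up CPU-bound pure-Python work, so no performance claim is implied.

```python
from concurrent.futures import ThreadPoolExecutor


def load_shed(scenario):
    """Toy model (assumed): shed the demand exceeding the remaining capacity."""
    available = sum(
        capacity
        for name, capacity in scenario["capacity"].items()
        if name not in scenario["outages"]
    )
    demand = sum(scenario["demand"].values())
    return max(0.0, demand - available)


def evaluate_scenarios(scenarios, workers=4):
    """Evaluate independent fault scenarios in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_shed, scenarios))
```

Because each fault scenario is independent of the others, the tasks need no communication until the results are collected, which is what makes this workload well suited to parallel execution.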

Power System Transmission Line Tripping Analysis Using 3D Visualization.
With the support of the Big Data platform, transmission line trip records, power quality data, weather data, and other related data can be collected in order to monitor and analyze the transients. In addition, a three-dimensional visualization system is developed to merge all the analysis results with geographic information, landforms, and even weather conditions, and then display them in a very intuitive way. Therefore, the situation awareness of system operators is greatly enhanced. Two main tasks are introduced in this section. Firstly, the correlation between line trips and power transients is analyzed by employing statistical methods, especially the distribution patterns of line tripping and power quality voltage dips against duration. Secondly, the interconnection rules among line tripping, weather conditions, voltage dips, voltage swells, and other disturbances are explored.
In order to analyze the correlation between transmission line trips and voltage dips, multisource data is needed, consisting of (1) transmission line tripping data, recording tripping time, fault description, fault type, and so on, and (2) voltage disturbance data, including monitoring location, disturbance type, occurrence time, duration, and magnitude. The first step in analyzing the transients is data fusion, which combines the two sets of data according to unified time tags; the preprocessing diagram is given in Figure 10.
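The time-tag fusion step can be illustrated with a small Python sketch that labels each voltage dip with a line trip falling within a time window. The record fields (`time`, `line`), the one-second window, and the function name are illustrative assumptions; the matching rule actually used in Figure 10 is not specified in the text.

```python
from datetime import timedelta


def fuse_by_time(trips, dips, window_seconds=1.0):
    """Attach to each dip the first trip whose time tag lies within the window.

    trips and dips are lists of dicts with a datetime under the key "time";
    the window length is an assumed setting.
    """
    window = timedelta(seconds=window_seconds)
    fused = []
    for dip in dips:
        match = next(
            (trip for trip in trips if abs(dip["time"] - trip["time"]) <= window),
            None,
        )
        fused.append({
            **dip,
            "caused_by_trip": match is not None,
            "line": match["line"] if match else None,
        })
    return fused
```

The resulting `caused_by_trip` flag is what allows the dips to be split into trip-related and non-trip-related groups for the statistical analysis that follows.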
For the analysis of voltage dips at the voltage levels of 110 kV and 10 kV, the voltage dip recordings are divided into four kinds: voltage dips caused by line trips at 110 kV and at 10 kV, and voltage dips not caused by line trips at 110 kV and at 10 kV. Taking the 10 kV voltage level as an example, the scatter plot is generated and shown in Figure 11.
In this figure, each symbol represents a transient event, with duration on the x-axis and magnitude on the y-axis. In order to separate the transient events by cause, a blue dot represents a voltage dip caused by line trips, while a red x shows that the voltage dip was not due to line trips, both at the 10 kV voltage level. The x-axis is plotted on a logarithmic scale.

The scattered points only show the distribution of durations against magnitudes of the voltage dips. It is necessary to combine substation coordinates, maps, and other geographic information with these transients; thus, the transmission line status and the affected substation can be shown in terms of voltage dip magnitudes and durations. Therefore, the possible influence of transmission line trips on the substations can be visualized for system operators. The Big Data platform employs a 3D simulation display system, using data from the management layer as well as the model output directly from the computing engine, including 3D models of power lines and electric equipment, 3D building models, geospatial data, and power attribute data. Geospatial data as a 3D virtual environment can show geographic objects (e.g., roads, bridges, and rivers) around the electricity network. The generated 3D virtual environment with the power transmission line situation is given in Figure 12.
In Figure 12, the green lines represent transmission lines in normal operation, while the red lines are those on which line trips have occurred. In order to show the voltage transient status, a blue cylinder shows the voltage dip magnitude, a pink cylinder shows the duration, and the names of the affected transmission lines are shown in the floating red tags above the cylinders. Therefore, the affected area can be directly visualized through the 3D virtual environment, and the dynamic change of the power grid operational status is easier for the system operators to control. If any transient happens, actions can be taken in time to prevent the accident from spreading.

Discussion and Conclusion
This paper reviewed the issues of Big Data technologies for power systems and employed a Big Data platform for power system monitoring and evaluation analysis. Based on the review of Big Data management technology, analytical tools, and machine learning methods, a case study of the proposed novel Big Data platform for a power system is given, with three application cases introduced. The framework of the power system Big Data platform consists of a database collecting power data from all different parts across the grid, a data interface, a Big Data management system integrating different management technologies, an analytic engine with various machine learning tools and algorithms, applications, and 3D visualization modules for further optimizing the strategy and decision-making assistance.
Based on the various power data sources, the proposed platform has integrated different data interfaces and distributed data storage according to the data structure; thus, the platform is able to handle traditional structured data, semistructured data, and unstructured data simultaneously. For the analytical engine, both open-source tools and self-developed models are integrated as modules. In our earlier work, intelligent processing methods have been shown to be able to handle linear, dynamic, nonlinear, and non-Gaussian distributed variables by setting up accurate and efficient models. This has enabled the decision-making subsystem to focus on generating an optimized equipment maintenance strategy and providing a global view for situation awareness and information integration.

[Figure 10 flowchart labels: Tripping data; Voltage dip data; Same time tag?; In time interval?; Extract relative data; Add to new data file.]

In order to demonstrate the effectiveness of the proposed platform, three real-system cases are introduced, including the development of power grid topology and parallel computing using CIM files, high-efficiency load-shedding calculation, and power system transmission line tripping analysis using 3D visualization. These cases are all realized on the proposed Big Data platform. The key issue in case one is to extract the connectivity between each electric point from different databases. The proposed Big Data platform is suitable for processing high-volume and multimode heterogeneous data with very high efficiency using multiple data storage methods. In case two, with the proposed parallel computing scheme based on Spark and CUDA, load-shedding calculation in power systems under different scenarios can be carried out very quickly, and a comparison between single-machine computing and multiple-machine parallel computing is given, which demonstrates the high efficiency of the scheme. A highlight of the third case is the use of the Big Data platform with the 3D visualization system. With the help of the Big Data virtual environment, the affected transmission lines and areas can be directly detected, with detailed dynamic information on line tripping time, location, duration, and causes. With the help of the 3D visualization system, digital results become more valuable and the situation awareness of system operators is greatly enhanced, which is a reliable way to improve the safety and reliability of the entire power grid.
As mentioned in this survey, the future smart grid will develop towards a huge and complex energy system, which is deeply integrated with traditional power and renewable energies, as well as with powerful information and communication systems. The energy system also represents three levels or perspectives of the entire objective existence: the physical energy level, the industrial information level, and the human society level. Under this big picture, more researchers are focusing on novel dimensions. Among newly developed machine learning and data mining tools, deep learning, transfer learning, and multidata fusion methods have received extensive attention in recent years. Deep learning integrates supervised and unsupervised learning, with artificial neural network structures of multiple hidden layers, and is capable of extracting abstract concepts from data. Transfer learning breaks through fundamental assumptions of statistical learning theory and can improve learning accuracy by utilizing correlated data with different distributions. The multidata fusion technique is capable of analysing heterogeneous datasets collected from different data sources; thus, it can extract more useful information.
By applying the above-mentioned new methods and technologies, more research directions and topics are gradually appearing. Firstly, the load prediction and modeling problem is the earliest application of data mining and analytics. Along with the fast installation of smart meters, much more precise load modeling can be achieved by utilizing the equipment data and electrical measurements at both transient and steady states. More machine learning methods are available for load prediction and modeling, including feedforward artificial neural networks, SVMs, recurrent artificial neural networks, and regression trees, among others. Secondly, the fusion and merged analysis of the power system and the transportation system can be carried out along with the increasing number of electric vehicles. Considering the load data from charging stations, traffic flow and the transportation network, on-board GPS tracks of electric vehicles, and other data related to driving and charging behaviors, research on driving and charging behavior characteristics is achievable. Closely related to that, electricity market prediction and simulation is another possible hot topic, which can also be applied in many aspects such as evaluation of market shares for an individual power company, investment income for power generation, and decision-making for power market mechanism design.
In conclusion, this paper has offered a glimpse of the crossover and merging of the latest Big Data technology and smart grid technology. There is still much research work to do in the future. Among all the application aspects, Big Data technology for human behavior in panorama mode has great and long-term potential in the future real-time smart grid and energy system, and even in city planning, pollution abatement, transportation planning, and other useful applications.

Figure 2: Sketch map of Big Data supporting the whole process of the power system.

3.1. Big Data Analytical Open-Source Tools. Data analysis approaches such as machine learning play an important role

Figure 4: Comparison between the traditional data processing model and the data stream processing model.

Figure 5: Big Data processing and analysing platform for electric power system condition monitoring.

Figure 6: Structure of Big Data platform computational engine.

Figure 10: Diagram of data files fusion preprocess.

Figure 11: Scatter plot of voltage dips and breakdowns against duration under 10 kV voltage level (half logarithmic axis).

Figure 12: 3D display of voltage dips and breakdown transmission lines with geographic information.

Table 1: Open-source/free software of Big Data machine learning method brief descriptions.
http://AForge.net 2008 Andrew Kirillov, Fabio Caversan It is an open-source C# framework in the fields of Computer Vision and Artificial Intelligence; image processing, neural networks, genetic algorithms, fuzzy logic, machine learning, robotics, etc. [68].

Table 2: Comparisons of open-source machine learning tools/algorithms for Big Data.
offer detailed information about the normal operating condition and faulty cases. SIM does not impose any special assumption on the process data since it only investigates the input-output relationship, and different threshold computation methods are available for Gaussian and non-Gaussian distributed data. The numbers of PCs and LVs are important design parameters in PCA and PLS methods, which can affect modeling performance. The main computational burden comes from performing SVD on covariance matrices of different dimensions; thus, the standard PCA has a lower computational cost than other basic methods.

Table 3: An overview of state-of-the-art intelligent processing methods.

Table 4: A brief comparison among basic data-driven methods.

Table 5: Comparisons of the non-Gaussian data methods.

Table 6: The comparisons of parallel computing with single machine results.
time of voltage dips caused by line trips is less than that caused by other reasons, as shown in the left ellipse, with duration around 100 ms.And the voltage dips caused by other reasons last for a longer time, as enclosed in the right ellipse.