A Retrieval Optimized Surveillance Video Storage System for Campus Application Scenarios

This paper analyzes the characteristics of surveillance video data and proposes a campus surveillance video storage system, taking the university campus as the specific application environment. To address the long response time of content-based video retrieval, we design a key-frame index subsystem. The key frames of a video reflect its main content. Key frames extracted from the video are associated with its metadata to establish the storage index, and this key-frame index is used for lookup operations during queries. This method greatly reduces the amount of video data read and effectively improves query efficiency. We model the storage system with a stochastic Petri net (SPN) and verify the improvement in query performance by quantitative analysis.


Introduction
With the promotion of smart city, smart campus, and similar projects, video surveillance deployments have become more fine-grained and multipoint. Video surveillance systems are developing towards being "digital, networked, high-definition, and intelligent" [1]. The number of monitoring devices is increasing, video quality is continually improving, and video retention periods are getting longer. All these changes rapidly increase the amount of data produced by video surveillance systems. How to organize this mass of surveillance video data effectively and how to locate the relevant video quickly during post-event verification are important requirements for a surveillance video storage system.
With the development of distributed file systems and cloud storage [2], a large number of surveillance video storage systems have been built on IP-SAN technologies [3]. These technologies ensure system scalability, load balancing, high availability, and data backup and recovery, but such general-purpose storage designs focus only on write efficiency. Previous designs concentrate on writing multiple video streams concurrently as fast as possible, while optimization for video content queries is lacking. Some effective video organization methods [4] have been proposed, but they give little consideration to the storage of large amounts of video data. Moreover, campus surveillance video data has its own application characteristics. Surveillance video data is write-dominated. Data is generated at a steady rate, without spikes or troughs, and is written sequentially. Video data is streaming media with large file sizes; compared with random writes of small files, video data is written quickly because few OPEN/CLOSE operations are involved. As a kind of evidential data, it rarely needs modification after being written. The content of the data changes regularly with the school schedule. Queries are relatively rare, but each one carries a heavy workload: it needs to read and match the mass of stored data, and it sometimes requires manual intervention, which is very time-consuming.
In view of the above summary of campus surveillance video data read and write characteristics, as well as organizational storage problems, we propose a surveillance video storage system CSVS (campus surveillance video storage) for campus applications.
The main contributions of this paper can be summarized as follows: (1) We use a mature distributed file system to address video data writing, storage space scalability, and data backup and recovery.
(2) We put forward a video key-frame extraction function that takes into account the influence of the school schedule on video data, and we design an index subsystem that combines video metadata with video key frames. This index greatly reduces the amount of data read during retrieval, which improves retrieval efficiency and reduces the workload of manual intervention.
(3) We implement a prototype of the CSVS system. The system is modeled by a stochastic Petri net, which helps us analyze its efficiency in a real environment over long-term operation. Performance analysis and evaluation show that the number of queries fulfilled with the key-frame index is 5 times that of manual search in the same period.
The rest of this paper is organized as follows. Section 2 summarizes related work on video data storage systems. Section 3 presents the campus surveillance video storage system: we illustrate its architecture, explain how we address scalability and security, and describe the organization of data and the key-frame extraction method. In Section 4, we introduce the experimental environment and conduct the performance evaluation. We conclude this paper in Section 5.

Related Work
With the development of security systems, surveillance video storage technology has matured. Based on the characteristics of surveillance video data and SAN storage technology, researchers have proposed a video surveillance storage system based on IP-SAN [5]. Each video frame is stored in a fixed-size data area, and an in-memory cache for video metadata improves search efficiency. However, this caching technique is only suitable for relatively small data sets. In addition, memory is volatile, so data recovery after a failure is difficult to guarantee. In [6], a video data buffer (VDB) is designed based on the group of pictures (GOP), but this design is optimized only for writing. When the cache data area is full, the data is written to disk and no longer kept in the cache, so it does not help video content retrieval. To speed up writing, a cached-write storage system with an I/O polling mechanism is proposed in [7]. Each video stream corresponds to a thread that writes to disk. When the buffer is full, the thread triggers the write mechanism and flushes all cached data to disk. This converts random writes into sequential writes, thereby improving write efficiency. However, its retrieval support is optimized only for time-based search; it does not support search based on video content.
A high-performance disk array called ripple-raid for continuous data storage is proposed in [8]. Surveillance video data is a kind of continuous data, so the design features of this scheme can improve video data write efficiency. Its update strategy and incremental generation of checksum data improve not only write performance but also the energy efficiency of the system. A SAN-based surveillance video storage system called THNVR is proposed in [9]; this system uses the SQLite database to store video metadata and saves unstructured surveillance video data in fixed-length files. Metadata and video data are indexed separately to improve storage and indexing performance. Fixed-length files avoid disk fragmentation, but SQLite only supports relational data and makes no provision for high availability, so it is not suitable for large-scale data storage.
Le et al. [10] propose a scheme for using SMR (shingled magnetic recording) disks in RAID arrays to speed up the storage system. Compared with traditional disk writing, SMR technology increases data density and is well suited to storing log-structured data such as surveillance video. A new block I/O scheduling scheme called BID (bulk I/O dispatch) is designed in [11]. By reordering the block I/O requests to be served, the BID scheme turns random I/Os into sequential I/Os. This saves CPU wait time and therefore improves performance. However, this scheduler targets MapReduce-style applications, whereas surveillance video I/O is already almost entirely sequential, so it brings little benefit to a video storage system.
In [12], the authors propose a distributed video recording system based on IaaS; Hadoop HDFS is used as the storage file system, and MapReduce is used to analyze the video data. However, as the amount of video data grows, the metadata center becomes a bottleneck under highly concurrent retrieval. To address highly concurrent retrieval, mass storage, and related problems, Cao's team proposes a high-performance distributed storage system, DVSS [1]; it uses multiple storage nodes to support linear expansion of the system and to provide large capacity. It uses the Redis database as the metadata index to improve the efficiency of highly concurrent retrieval. However, its GOP-based video frame operation mode is optimized only for writing and does not consider retrieval based on video content.

Design and Implementation of CSVS
Following the digital and networking trend, video surveillance systems have evolved from the original analog signal transmission, through digital signal transmission, to today's networked digital transmission. The DVR (digital video recorder), the representative of digital surveillance systems, is gradually being replaced by the NVR (network video recorder). A DVR combines video control with video storage, which makes the system more integrated and more applicable, but it can only store data on the disk of the local computer, which limits the size of the system data. An NVR receives IPC (IP camera) data and provides video encoding and decoding, storage, real-time display, and so on. It can also forward the stored video data to other storage systems through the network [13] (Figure 1 shows an example). The CSVS (campus surveillance video storage) system proposed in this paper is located at the back end of the NVR. It provides massive video storage for the monitoring system, and it also provides fast and accurate content-based video retrieval.
The implementation goals of the CSVS system are as follows: (1) Scalability: the system needs to support massive data storage and scalable storage space.
(2) Security: the system is required to support data backup, storage, and recovery of lost data.
(3) Fast query: metadata and video data are stored separately in the system, the index of video frames is generated asynchronously, and queries based on video content are supported.

CSVS System Architecture
The CSVS system consists of three parts: the client, the metadata cluster, and the data storage cluster. The metadata cluster includes the video key-frame index center. The system structure is shown in Figure 2.
The client of the system is responsible for initiating tasks and collecting operational results. The client provides internal and external APIs, video data import and export functions, queries based on metadata and on video content, and other operations. All of these operations are launched by the client.
The metadata cluster is responsible for storing the metadata that describes the video sent by the client. The picture data generated by the key-frame extraction task are associated with this metadata. The cluster also provides search and query functions to the client, and it is load balanced so that all of its servers share the tasks.
The data storage cluster is responsible for reading and writing video data. Data replicas are stored in storage volumes on different server nodes to ensure backup and to achieve automatic recovery when data is abnormal. The entire cluster consists of multiple storage volumes, and each storage volume is composed of multiple storage service nodes. When a service node fails, the others continue to provide service. Storage space can be extended linearly by adding storage volumes.

System Scalability and Data Security
The metadata cluster is implemented on the MongoDB database, and the metadata of the video is organized in BSON format for storage [14]. Because metadata is mainly text, it occupies very little storage space compared with video data, so the main pressure for expansion falls on the data storage cluster.
The data storage cluster is implemented on the GlusterFS distributed file system [15]. For this surveillance video scenario, we select specific features to serve our system. We use the replica function to back up the data automatically and the stripe function to improve write performance. The stripe function, which is similar to the RAID 0 technique, lets different disks write different parts of the data at the same time, so it speeds up writing.

Organization Form of Data.
In order to provide fast and accurate retrieval, we store the metadata and the video data separately. We save the metadata to the MongoDB database cluster in key-value form.
The metadata fields include the video ID, the video name, the shooting position, the start time, the end time, the video duration, the video file size, the video storage path, the key-frame flag, and the key-frame storage path, as shown in Figure 3.
The ID field is the unique identifier of the metadata in the database. Video name is the file name of the video data. Position is the location where the video data is recorded. Start time is the recording start time of the video. The end time, duration, file size, and storage path fields listed above record the corresponding properties of the video file. The Frame generate field is used to identify whether a key-frame group has been generated. The Frame path field holds the storage path of the key-frame group of the video. This path stores the picture files generated by the asynchronous key-frame extraction task, and it provides the index for content-based video retrieval.
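The metadata fields described above can be pictured as one small record per video file. The following is a minimal sketch, not the authors' schema: the attribute names, the example paths, and the use of a Python dataclass are all assumptions made for illustration, and the record is only turned into a plain dictionary rather than written to MongoDB.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class VideoMetadata:
    """One metadata record per video file (field names are illustrative)."""
    video_id: str
    video_name: str
    position: str             # where the video was recorded
    start_time: datetime
    end_time: datetime
    file_size: int            # size of the video file in bytes
    video_path: str           # storage path of the video file
    frame_generated: int = 0  # 0: key frames not yet extracted, 1: extracted
    frame_path: str = ""      # storage path of the key-frame group

    @property
    def duration_seconds(self) -> float:
        # The duration is derived from the start and end times.
        return (self.end_time - self.start_time).total_seconds()

    def to_document(self) -> dict:
        # Key-value form, ready to be handed to a document database.
        document = asdict(self)
        document["duration"] = self.duration_seconds
        return document


# Hypothetical example values.
record = VideoMetadata(
    video_id="cam03-0001",
    video_name="cam03-0001.mp4",
    position="canteen",
    start_time=datetime(2017, 3, 1, 8, 0, 0),
    end_time=datetime(2017, 3, 1, 8, 16, 40),
    file_size=256 * 1024 * 1024,
    video_path="/videos/canteen/cam03-0001.mp4",
    frame_path="/keyframes/canteen/cam03-0001/",
)
document = record.to_document()
```

Keeping the key-frame flag and key-frame path in the same record as the video path is what lets a content query move from a matched key frame back to the video file.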

Key-Frame Extraction of Surveillance Video.
Surveillance video is different from other videos. The shooting angle is fixed, so the video background does not change much. Based on this feature, the difference between frames in the video is measured by calculating the histogram distance, and the video key frames are extracted accordingly [16]. Because campus life follows a fixed schedule, the video content is also regular.
Before class time and at canteen meal times, video content is the most abundant, but during class and at night there is almost no change in the content of the video. Retrieving video content by eye inevitably adds unnecessary workload, while matching the content of every frame by machine requires a huge amount of computation. To solve this problem, and in line with the above characteristics, we use the histogram difference as the rule for extracting video key frames [17] and build the content index of the video.
For key-frame extraction, the histogram distance [18] can be calculated by four methods: CORREL, Chi-Square, Bhattacharyya, and INTERSECT.
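The CORREL method is defined as

$$d(H_1, H_2) = \frac{\sum_I H'_1(I)\, H'_2(I)}{\sqrt{\sum_I H'_1(I)^2 \, \sum_I H'_2(I)^2}}, \tag{1}$$

where $H_1$ and $H_2$ represent two histograms and $H'_k(I)$ is defined as follows:

$$H'_k(I) = H_k(I) - \frac{1}{N} \sum_J H_k(J), \tag{2}$$

where $N$ is the number of the bins. The Chi-Square formula is as follows:

$$d(H_1, H_2) = \sum_I \frac{\left(H_1(I) - H_2(I)\right)^2}{H_1(I)}. \tag{3}$$

Bhattacharyya is defined as

$$d(H_1, H_2) = \sqrt{1 - \frac{\sum_I \sqrt{H_1(I)\, H_2(I)}}{\sqrt{\sum_I H_1(I) \, \sum_I H_2(I)}}}. \tag{4}$$

At last, the INTERSECT method has the following formula:

$$d(H_1, H_2) = \sum_I \min\left(H_1(I), H_2(I)\right). \tag{5}$$

Our strategy for extracting key frames is to use as few frames as possible to summarize the complete video content. In order to improve the computational efficiency, we use the INTERSECT method, which is the simplest and fastest of the four histogram distance calculations.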

The key-frame extraction algorithm is described in Algorithm 1.
The threshold is a floating-point variable between 0 and 1. The closer the threshold is to 1, the more key frames are written. Key frames are named in the format "long integer + extension", where the long integer is the sequence number of the frame in the video. A frame sequence number can be used to locate the time point at which the frame appears in the video.
For example, if the duration of a video is 1000 seconds and the frame rate is 25 frames per second, the video is composed of 25000 frames. If the key frame to be queried is the 1024th frame (num), the frame can be located at about the 41st second of the video according to (6), that is, by dividing the frame number by the frame rate (1024/25 ≈ 40.96). The writing procedure is as follows:
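The role of the threshold and of the frame-number naming can be sketched as follows. This is not a reproduction of Algorithm 1: it assumes that each frame is compared with the most recently saved key frame, that similarity is the intersection of normalized grayscale histograms, and that a frame is saved when the similarity falls below the threshold. Frames are modeled as flat lists of 8-bit gray values, and every function name is hypothetical.

```python
def normalized_histogram(frame, bins=16):
    """Histogram of 8-bit gray values, scaled so that the bins sum to 1."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    return [count / len(frame) for count in counts]


def intersect(h1, h2):
    """Histogram intersection; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def extract_key_frames(frames, threshold, extension="jpg"):
    """Return key-frame file names of the form '<frame number>.<extension>'."""
    key_frame_names = []
    reference = None  # histogram of the most recently saved key frame
    for number, frame in enumerate(frames, start=1):
        histogram = normalized_histogram(frame)
        # Save the frame when it differs enough from the last key frame.
        if reference is None or intersect(reference, histogram) < threshold:
            key_frame_names.append(f"{number}.{extension}")
            reference = histogram
    return key_frame_names


def frame_time_seconds(frame_number, frame_rate):
    """Time point of a frame: its sequence number divided by the frame rate."""
    return frame_number / frame_rate


# Toy frames: an all-dark scene, an all-bright scene, and a half-and-half one.
dark = [10] * 100
bright = [200] * 100
mixed = [10] * 50 + [200] * 50

names = extract_key_frames([dark, dark, bright, bright, dark], threshold=0.9)
few = extract_key_frames([dark, mixed, dark], threshold=0.4)
many = extract_key_frames([dark, mixed, dark], threshold=0.6)
```

With the toy input, raising the threshold from 0.4 to 0.6 turns one key frame into three, which matches the statement that a threshold closer to 1 writes more key frames. Because each file name carries the frame number, `frame_time_seconds(1024, 25)` places frame 1024 at 40.96 seconds in a 25 fps video.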

Writing and Retrieving Data
(i) Get video data from NVR.
(ii) Read the video data file, parse out the video file name, the camera position, and the video start and end times, calculate the duration of the video, parse out the video file size, generate a video file path, set the flag to 0, which means that key-frame extraction is needed, and generate a key-frame storage path.
(iii) Insert the data into the metadatabase.
(iv) Write the video data according to the storage path and continue processing the next video data.
(v) Start the asynchronous key-frame extraction task and check which key-frame flags are 0.
(vi) Extract the key frames, write them into the storage path, and set the flag to 1; the writing operation is finished.
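The writing steps can be sketched with in-memory stand-ins. In this illustration a Python list plays the role of the metadatabase and a dictionary plays the role of the storage cluster; the path layout, field names, and the `extract` callback are assumptions, and no MongoDB or GlusterFS call is made.

```python
def write_video(metadata_db, storage, name, position, start, end, data):
    """Steps (ii)-(iv): build the metadata, insert it, then store the video."""
    video_path = f"/videos/{position}/{name}"
    entry = {
        "name": name,
        "position": position,
        "start": start,            # seconds since some epoch
        "end": end,
        "duration": end - start,
        "size": len(data),
        "video_path": video_path,
        "frame_generated": 0,      # 0 means key frames still need extraction
        "frame_path": f"/keyframes/{position}/{name}/",
    }
    metadata_db.append(entry)      # (iii) insert into the metadatabase
    storage[video_path] = data     # (iv) write the video data
    return entry


def run_key_frame_task(metadata_db, storage, extract):
    """Steps (v)-(vi): process every entry whose key-frame flag is still 0."""
    for entry in metadata_db:
        if entry["frame_generated"] == 0:
            video = storage[entry["video_path"]]
            for frame_name, image in extract(video):
                storage[entry["frame_path"] + frame_name] = image
            entry["frame_generated"] = 1


metadata_db, storage = [], {}
entry = write_video(metadata_db, storage, "a.mp4", "gate", 0, 1000, b"video-bytes")
flag_after_write = entry["frame_generated"]

# A placeholder extractor that pretends frame 1 is the only key frame.
run_key_frame_task(metadata_db, storage, lambda video: [("1.jpg", b"image-bytes")])
```

Separating `write_video` from `run_key_frame_task` mirrors the asynchronous design: the write path returns as soon as the video is stored, and the flag tells the background task which videos still need an index.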

Retrieval of Video Data.
The retrieval of video data is divided into ordinary retrieval and video content retrieval. Ordinary retrieval only needs to operate on the metadata database. The retrieval of video content is based on the key-frame index.

Ordinary Retrieval Process
(i) According to the search conditions, such as start time, end time, duration, shooting point, or a combination of these conditions, a query is initiated to the metadatabase by a lookup operation.
(ii) The metadatabase returns all the data entries that match the conditions.
(iii) Parse the video storage path from each of the data entries.
(iv) Read the video files directly based on the video storage paths.
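The ordinary retrieval steps above amount to a filter on metadata followed by a read by path. The sketch below uses an in-memory list and dictionary in place of the metadatabase and the storage cluster; the condition names and record fields are hypothetical.

```python
def find_videos(metadata_db, position=None, start_from=None, end_by=None,
                min_duration=None):
    """Steps (i)-(ii): return entries matching every condition that is set."""
    matches = []
    for entry in metadata_db:
        if position is not None and entry["position"] != position:
            continue
        if start_from is not None and entry["start"] < start_from:
            continue
        if end_by is not None and entry["end"] > end_by:
            continue
        if min_duration is not None and entry["end"] - entry["start"] < min_duration:
            continue
        matches.append(entry)
    return matches


def ordinary_retrieval(metadata_db, storage, **conditions):
    """Steps (iii)-(iv): parse the storage paths and read the video files."""
    paths = [entry["video_path"] for entry in find_videos(metadata_db, **conditions)]
    return [storage[path] for path in paths]


metadata_db = [
    {"position": "canteen", "start": 0, "end": 1000,
     "video_path": "/videos/canteen/a.mp4"},
    {"position": "canteen", "start": 1000, "end": 2000,
     "video_path": "/videos/canteen/b.mp4"},
    {"position": "gate", "start": 0, "end": 1000,
     "video_path": "/videos/gate/c.mp4"},
]
storage = {
    "/videos/canteen/a.mp4": b"A",
    "/videos/canteen/b.mp4": b"B",
    "/videos/gate/c.mp4": b"C",
}
videos = ordinary_retrieval(metadata_db, storage, position="canteen", start_from=500)
```

No video content is inspected at any point, which is why ordinary retrieval only touches the metadata database until the final read.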

Video Content Retrieval Process
(i) Search the database based on metadata conditions (the same process as the ordinary retrieval process).
(ii) The metadatabase returns all the data entries that match the conditions.
(iii) Parse the key-frame storage paths from the returned items one by one.
(iv) Compare the target video content with the key-frame group saved in each storage path.
(v) Return the closest key frame, parse the video storage path of the frame, and calculate the time at which it appears in the video.
(vi) Locate the video based on the storage path and the time at which the frame appears.
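Steps (iii) to (vi) of the content retrieval process can be sketched as a best-match search over stored key frames. The source does not specify how the target is compared with a key-frame group, so this example assumes histogram intersection on normalized histograms; the data layout (a dictionary from key-frame path to frame number and histogram) and all names are illustrative.

```python
def intersect(h1, h2):
    """Histogram intersection; larger means more similar."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def content_retrieval(entries, key_frame_store, target_histogram, frame_rate=25):
    """Find the stored key frame closest to the target and locate it in time."""
    best = None
    for entry in entries:
        # (iii) parse the key-frame storage path of each returned entry
        key_frames = key_frame_store[entry["frame_path"]]
        # (iv) compare the target with every key frame in the group
        for frame_number, histogram in key_frames.items():
            score = intersect(target_histogram, histogram)
            if best is None or score > best["score"]:
                # (v) remember the closest frame, its video, and its time
                best = {
                    "score": score,
                    "video_path": entry["video_path"],
                    "frame_number": frame_number,
                    "seconds": frame_number / frame_rate,
                }
    return best  # (vi) the caller seeks to 'seconds' in 'video_path'


# Entries as returned by a metadata query; key frames keyed by frame number.
entries = [
    {"video_path": "/videos/gate/a.mp4", "frame_path": "/keyframes/gate/a/"},
    {"video_path": "/videos/gate/b.mp4", "frame_path": "/keyframes/gate/b/"},
]
key_frame_store = {
    "/keyframes/gate/a/": {1: [1.0, 0.0], 250: [0.5, 0.5]},
    "/keyframes/gate/b/": {1: [0.0, 1.0], 1024: [0.2, 0.8]},
}
match = content_retrieval(entries, key_frame_store, target_histogram=[0.25, 0.75])
```

Only the small key-frame histograms are read during the search; the full video file is opened once, after the best match is known, which is the source of the reduced read volume.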

Experiment and Performance Evaluation
4.1. Experimental Environment. The CSVS system relies on the campus video surveillance system. We use 11 Hikvision DS-2CD5026XYD HD cameras and 1 Hikvision DS-8632N-I8 network video recorder in our system. We use 5 virtual machines to build the metadata cluster and the storage center. Each machine has a 4-core 2.5 GHz Intel Xeon CPU, 12 GB of memory, and a 100 GB hard disk.

Threshold and Key Frames
According to our key-frame extraction algorithm, the threshold parameter decides the number of key frames generated from a video. The content of the frames in a video changes slowly with time. We use the histogram distance to measure the extent of the difference between frames. The threshold is the criterion for deciding whether a histogram distance is large enough for the frame to be saved. The bigger the threshold is, the more frames are saved.
In Figure 4, we use three videos of different sizes to verify the relationship between the threshold and the number of key frames. The test video sizes are 256 MB, 512 MB, and 1024 MB. We choose five threshold values from 0.915 to 0.955 to generate the key frames. The result shows that the number of key frames increases as the threshold value gets bigger. For the 1024 MB video file, the number of key frames increases from 113 to 1018. A large number of key frames makes the index more accurate but takes up more storage space, so the threshold value should be tuned to balance storage space against index efficiency.

Efficiency of Content Retrieval.
We conducted experiments to check the efficiency of the key-frame index function. The retrieval time is the criterion for comparing our method with the GOP index method without a key-frame index. The result is shown in Figure 5.
We save a group of video files in both our CSVS storage system and the DVSS system. The file sizes range from 64 MB to 2048 MB. We choose one video frame as the target to retrieve in the two systems. The gap in time cost between the two methods grows as the video file size increases: the retrieval time of the common method rises sharply, whereas that of the key-frame index method rises gently.
For performance evaluation, a transient stochastic state classes method is proposed in [19]. The authors provide an approach to continuous-time transient analysis. The transitive closure of transitions identifies a transient stochastic graph, which can conveniently be mapped to a transient stochastic tree for the class analysis. The approach is applicable to any generalized semi-Markov process and is also suitable for performance evaluation of real-time systems. This method is complex, however. Balsamo and his colleagues provide a powerful, general, and rigorous route to product forms in large stochastic models [20]. They present a building-block concept in which a block is composed of a group of logically related places and transitions. The state probability of a building block equals the sum of the state probabilities of the places in the group. This technique effectively avoids state space explosion and can be adopted when a model is complex and difficult to solve. Our purpose is to obtain the stationary state probability.
Stochastic Petri nets are a powerful tool for system performance evaluation [21][22][23]. In this paper, the basic theory of stochastic Petri nets is applied to model the storage system and evaluate its performance. Following [24], we assume that the firing delays of transitions are independent random variables with (negative) exponential distributions, whose rates represent the frequency of data processing for each function in our system. The isomorphism between the stochastic Petri net and a Markov chain is used to calculate the stationary state probability, from which the query efficiency of the storage system is evaluated.
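The stationary state probability of the Markov chain that is isomorphic to an SPN is the solution of $\pi Q = 0$ with $\sum_i \pi_i = 1$, where $Q$ is the generator matrix built from the firing rates. The sketch below solves this system with plain Gaussian elimination. The three-state cyclic chain and its service times (1, 10, and 2 time units) are a toy example of our own, not the CSVS net.

```python
def cyclic_generator(service_times):
    """Generator matrix of a chain that visits states 0, 1, ..., n-1 in a cycle.

    The rate of leaving state i is the reciprocal of its service time.
    """
    n = len(service_times)
    q = [[0.0] * n for _ in range(n)]
    for i, service_time in enumerate(service_times):
        rate = 1.0 / service_time
        q[i][i] = -rate
        q[i][(i + 1) % n] = rate
    return q


def stationary_distribution(q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible chain."""
    n = len(q)
    # Transpose Q so that the unknowns form a column vector, then replace
    # the last (redundant) balance equation with the normalization condition.
    a = [[q[j][i] for j in range(n)] for i in range(n)]
    a[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    pi = [0.0] * n
    for i in reversed(range(n)):
        known = sum(a[i][k] * pi[k] for k in range(i + 1, n))
        pi[i] = (b[i] - known) / a[i][i]
    return pi


pi = stationary_distribution(cyclic_generator([1.0, 10.0, 2.0]))
```

In a simple cycle the chain spends time in each state in proportion to that state's service time, so the slow state dominates: the result is (1/13, 10/13, 2/13). The same solver applies to any irreducible generator matrix obtained from the reachability graph of a net.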
$P_0$ represents query conditions based on metadata, $P_1$ represents the query result based on the video content, $P_2$ represents the query result returned by the metadata, $P_3$ represents the video storage path, $P_4$ represents the time of the key frame in the video, $P_5$ represents the key-frame storage path, $P_6$ represents the ordinary retrieval and getting the results, and $P_7$ represents the video content retrieval by matching the key-frame group and finding the results.
Table 2 shows the meaning of each transition in the CSVS system.
The reciprocal of the firing rate is the service time. We set the service times of the seven transitions $t_0, \ldots, t_6$ by comparing the time cost of the corresponding data processing functions in CSVS and taking their ratio. Transitions $t_0$, $t_1$, $t_4$, $t_5$, and $t_6$ each cost one unit of time to process the same amount of data. Retrieving data by path ($t_2$) costs ten units of time, and matching key frames ($t_3$) costs two units of time. Overall, the service times of $t_0, \ldots, t_6$ are {1, 1, 10, 2, 1, 1, 1}.
In light of the quantitative analysis above, we find that the CSVS system with the key-frame index improves the efficiency of queries based on video content.
In follow-up work on the CSVS system, we will continue to study the key-frame extraction method. We will improve the algorithm based on histogram distance calculation and look for a more accurate key-frame extraction technique. We also plan to optimize the matching algorithm used in key-frame queries and to further improve the efficiency of retrieval based on video content.

Figure 1: Video surveillance system deployment and the position of CSVS system.

3.5.1. Data Writing. The writing of the video data is initiated by the client. The client acquires the video data from the NVR. It generates the metadata information based on the video data and starts the writing operation.

Figure 4: Threshold and number of key frames.

Table 1 .
4.5. Performance Evaluation. Our surveillance video storage system is built on campus and operated by our team. It runs well, but the workload is low. How can its performance in long-term operation be evaluated as the workload increases?

Table 2: Meaning of transition.
state probability.Therefore we adopt ordinary stochastic Petri nets (SPNs) to model and evaluate the performance.