Tracking Full-Body Motion of Multiple Fish with Midline Subspace Constrained Multicue Optimization

Capturing the body motion of fish has been gaining considerable attention from scientists in various fields. In this paper, we propose a method that tracks the full-body motion of multiple fish with frequent interactions. We first model the midline subspace of a fish body, which gives a compact low-dimensional representation of its complex shape and motion. We then propose a particle swarm-based optimization framework whose objective function takes multiple sources of information into account. The proposed multicue objective function describes the details of fish appearance and remains effective through mutual occlusions. Extensive experimental results demonstrate the effectiveness and robustness of the proposed method.


Introduction
The most effective way to quantitatively study the behavior patterns and underlying rules of fish schools is to track each fish. Tracking is also helpful in many related applications such as robotics and virtual reality. For example, based on the body motion of fish, people can build fish-like swimming robots and create vivid virtual fish in computers [1][2][3][4][5][6][7][8][9]. Top-view video and 2D tracking yield sufficiently informative motion data for behavior investigation, because many fish behavior experiments use a shallow water tank in which the fish typically swim within the same horizontal plane. However, tracking the full-body motion of fish, typically multiple interacting fish, is still a challenging task for three reasons. (1) The full-body motion of fish is highly complex and difficult to model with a few parameters. (2) Fish may move abruptly, so the motion continuity assumption no longer holds, which causes conventional tracking approaches to fail. (3) Multiple fish cause frequent mutual occlusions, which corrupt the appearance models of the tracking approaches.
Visual tracking has been an active research topic over the past two decades, and significant improvements have been made in all of its aspects, such as appearance models [10,11,12] and estimation methods [13,14]. Multiple trackers can also be combined to track multiple targets [15][16][17]. Nevertheless, conventional visual trackers, which are designed to track the positions of generic objects, are not applicable to the full-body motion tracking problem considered here. Another branch of multitarget tracking methods follows the detection-and-association framework [18,19], in which the outputs of detectors are associated with trackers across time. In our problem, however, the detectors may fail to give correct output during mutual occlusions, which happen frequently and can last for a long time, causing difficulties for the subsequent data association.
To support biological research, a number of automatic software packages have been developed for multiobject tracking, such as ANY-maze and EthoVision [2][20][21][22]. However, they can track only a few targets and require a professional experimental setup. Qian et al. [23] developed a multiple-fish tracking system based on a scale-space determinant of Hessian (DoH) fish head detector and a Kalman filter. Delcourt et al.'s system can track as many as 100 fish simultaneously but is not suitable for long-period tracking [2]. However, these approaches depend heavily on detection results and motion continuity for data association, and the discriminative information of the head is not fully exploited. The body modeling and motion tracking of multiple fish has been preliminarily attempted [24]; however, performance declines as the number of fish increases.
Tracking becomes more difficult when severe occlusions occur: individuals may be assigned wrong identities, and these errors propagate through the rest of the video. Several existing tracking methods combine the detection and tracking stages so that detection errors can be corrected promptly. Andriluka et al. use prior knowledge of possible articulations and temporal coherency to associate the detections of each individual across frames, which depends on a motion model and on object-specific codebooks constructed by clustering local features. Kalal et al. [10] proposed the tracking-learning-detection (TLD) framework for tracking a single object, which divides the task into three subtasks: tracking, learning, and detection. The tracker follows the target across frames; the detector localizes the object in each frame and corrects the tracker if necessary, and the detector itself is updated online by P-N learning. Wang et al. [25] proposed an effective tracking method that uses a convolutional neural network (CNN) for head identification. They first detect fish heads using a scale-space method; data association across frames is then achieved by identifying the head image pattern of each individual fish in each frame with a CNN tailored to this task. Finally, they combine the predicted motion state with the CNN recognition result to associate detections across frames. However, training samples must be collected for the CNN, and the method cannot handle new targets that appear in the video.
We present in this paper a method that is capable of tracking the full-body deformation of multiple fish with frequent mutual interactions. Since fish motion and deformation are horizontal most of the time, we capture the videos from a top view. To model the complex deformation of the fish body, a midline subspace model is first learned from a large number of training samples; it gives a compact representation of the fish body and thus facilitates subsequent motion estimation. We then propose a multicue cost function that characterizes the subtle appearance details of the fish body during swimming. Moreover, this cost function works under partial occlusions, making the system fully automatic. Extensive experimental results demonstrate the effectiveness and robustness of the method. The contributions of the paper can be summarized as follows: (i) We introduce a midline subspace model, learned from a large amount of data, to model the complex shape and deformation of a fish body. This subspace model is compact and low-dimensional, which greatly facilitates parameter optimization. (ii) We propose a highly discriminative and robust multicue objective function that models different aspects of the image structure of the fish region. (iii) We conduct systematic experiments to demonstrate the effectiveness of the proposed method.

Midline Representation.
Since the videos are captured from a top view, the shape of a fish in the image is approximately symmetrical about its midline, as shown in Figure 1. The deformation of the body can therefore be viewed as driven by the midline. We will show later that once the midline is determined, the contour of the whole body shape can be recovered easily from a reference shape. As shown in Figure 1, a midline can be approximated by a chain of $n - 1$ articulated equal-length line segments. Thus a midline is made up of $n$ joints $\{p_i\}_{i=1}^{n}$: a head point, a tail point, and $n - 2$ middle joints. During one time step of deformation, the length of each segment is kept fixed. We use $\Theta = (x, y, \theta_1, \theta_2, \ldots, \theta_{n-1})$ to denote the parameters of a midline, where $(x, y)$ is the position of the head point, $\theta_1$ is the orientation of the first line segment, and $\theta_i$ ($i > 1$) is the rotation angle of the $i$th segment relative to the first one (i.e., its absolute rotation angle is $\theta_1 + \theta_i$). The first three parameters $\Theta_r = (x, y, \theta_1)$ determine a rigid body transform, and the remaining parameters $\Theta_d = (\theta_2, \theta_3, \ldots, \theta_{n-1})$ account for the nonrigid deformation of the body shape. Given the parameter vector $\Theta$, the $i$th midline point $p_i$ can be recovered by chaining the first $i - 1$ segments from the head point, each at its absolute orientation. The body width $d_i$ at each joint is kept fixed within one time step, so a pair of contour points $q_i^{\pm}$ can be recovered on the two sides of $p_i$. Thus, with a reference contour and a midline represented by $n$ joints, a set of $2n$ contour points can be recovered. Denoting the image at time $t$ by $I_t$, the tracking problem can be formulated as maximizing the probability $p(\Theta_t \mid I_t)$.
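The recovery of joints and contour points from a parameter vector can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the segment length argument `seg_len`, and the choice to place each contour pair along the local normal of the midline are assumptions made here, since the text only states that the pair is recovered from the joint and the fixed body width.

```python
import math

def midline_points(theta, seg_len):
    """Recover the joints from Theta = (x, y, theta_1, theta_2, ...).

    theta_1 is the absolute orientation of the first segment; every later
    angle is relative to the first segment, so its absolute orientation is
    theta_1 + theta_i.
    """
    x, y, theta1, *relative = theta
    angles = [theta1] + [theta1 + t for t in relative]
    pts = [(x, y)]  # head point
    for a in angles:
        px, py = pts[-1]
        pts.append((px + seg_len * math.cos(a), py + seg_len * math.sin(a)))
    return pts

def contour_points(pts, widths):
    """Offset each joint by +/- d_i along the local normal (assumed here)."""
    pairs = []
    last = len(pts) - 1
    for i, (px, py) in enumerate(pts):
        # Tangent estimated from the neighbouring joints (one-sided at the ends).
        ax, ay = pts[max(i - 1, 0)]
        bx, by = pts[min(i + 1, last)]
        tx, ty = bx - ax, by - ay
        norm = math.hypot(tx, ty)
        nx, ny = -ty / norm, tx / norm
        d = widths[i]
        pairs.append(((px + d * nx, py + d * ny), (px - d * nx, py - d * ny)))
    return pairs
```

Calling `midline_points` with two segment angles yields three joints, and `contour_points` then returns one point pair per joint, so $n$ joints give $2n$ contour points as described above.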

Learning Subspace of Midlines.
The major drawback of the above representation is its high dimensionality, since a sufficient number of line segments is essential to guarantee an accurate approximation of the fish body shape. In fact, such a representation is redundant, because the deformation of the fish is governed by fewer factors. We therefore seek to embed the deformation parameters into a lower-dimensional linear subspace, which can be learned from a large number of training samples.
We collect $N_t = 1800$ midlines of various postures and perform principal component analysis (PCA) on their nonrigid deformation parameters $\{\Theta_d^i\}_{i=1}^{N_t}$. We choose $k = 6$ basis vectors $\Phi = (\phi_1, \phi_2, \ldots, \phi_k)$ from the PCA results, so that each $\Theta_d$ is approximated by a linear combination of the basis vectors: $\Theta_d \approx \bar{\Theta}_d + \sum_{i=1}^{k} \xi_i \phi_i$, where the coefficient $\xi_i = (\Theta_d - \bar{\Theta}_d)^T \phi_i$ and $\bar{\Theta}_d$ is the sample mean. We find that 6 basis vectors are sufficient to approximate a midline to a satisfactory accuracy; in fact, they account for 99.9% of the variance. Figure 2 shows the PCA training samples and the first four principal components. Now $\Theta_d$ is replaced with 6 parameters, and the parameters to be estimated can be written as $\Theta' = (x, y, \theta_1, \xi_1, \ldots, \xi_k)$.
The tracking problem then becomes maximizing the probability $p(\Theta'_t \mid I_t)$.
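Learning such a subspace and moving between the full deformation parameters and the $k$ coefficients can be sketched with NumPy. This is an illustrative sketch under stated assumptions: the function names are hypothetical, PCA is computed here through an SVD of the centered data (one common way to obtain the principal directions), and the training matrix is assumed to hold one midline's deformation angles per row.

```python
import numpy as np

def learn_midline_subspace(theta_d, k=6):
    """PCA on deformation parameters; theta_d has shape (N_t, m)."""
    mean = theta_d.mean(axis=0)
    centered = theta_d - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k].T  # shape (m, k); columns play the role of phi_1..phi_k
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return mean, basis, explained

def project(theta_d, mean, basis):
    """Coefficients xi_i = (Theta_d - mean)^T phi_i."""
    return (theta_d - mean) @ basis

def reconstruct(xi, mean, basis):
    """Approximate Theta_d as mean + sum_i xi_i * phi_i."""
    return mean + basis @ xi
```

The returned `explained` ratio is the quantity one would inspect to choose $k$; the 99.9% figure reported above corresponds to this ratio for the authors' data, which is not reproduced here.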

Multicue Objective Function
To take into consideration the image cues from the whole fish body area as well as some surrounding context, the midlines are extended as straight lines at the head to guarantee sufficient coverage. Each line segment $\overrightarrow{p_i p_{i+1}}$ is associated with a rectangular region whose width is $|p_i p_{i+1}|$ and whose length is twice the body width at $p_i$ (i.e., $2 d_i$). Each rectangle moves rigidly with $\overrightarrow{p_i p_{i+1}}$.
Since the image likelihoods of different parts of the fish body should play different roles in tracking, we divide the points into four parts, $\Omega = \bigcup_{i=1}^{4} \Omega_i$, as shown in Figure 3, and sample points uniformly in each rectangle. We compute an image likelihood function for each of the four parts and take the weighted sum of these functions as the final objective function value. Three kinds of image likelihood, which characterize three kinds of information, are considered: temporal appearance coherence, segmentation compatibility, and shape self-symmetry.

Temporal Appearance Coherence.
Appearance coherence is the basic assumption in visual tracking; it enforces the appearance of the estimated target state to be consistent with a reference appearance model. We compute the similarity between the pixel values in each part at time $t$ and their correspondences in a reference frame $t_0$. The normalized cross-correlation (NCC) is adopted as the similarity metric. For example, the similarity score of the first part is the NCC between the pixel values at the sample points of $\Omega_1$ in $I_t$ and those at the corresponding points in $I_{t_0}$. The scores of the other three parts are computed likewise.
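For reference, the zero-mean normalized cross-correlation of two pixel-value vectors can be written as a short function. This is a generic sketch of the standard NCC definition; the function name and the choice to return 0 for a constant (textureless) patch are assumptions of this example rather than details given in the paper.

```python
import math

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        # A constant patch has no defined correlation; treated as 0 here.
        return 0.0
    return sum(p * q for p, q in zip(da, db)) / denom
```

Because the means are subtracted and the result is normalized, the score is unchanged by a global brightness offset or contrast gain between the current and the reference frame.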

Segmentation Compatibility.
Segmentation compatibility is introduced to enforce that the estimated shape is compatible with the segmentation result. Since segmentation performance is stable across time, enforcing segmentation compatibility prevents the tracker from drifting. As we select a larger region that contains some context pixels, both the foreground and the background should be compatible with the reference. Let $B_t$ denote a segmented binary image of $I_t$ and let $x_i^t$ be the $i$th point in $\Omega_1^t$. The segmentation compatibility score of the first part is then computed from two terms, $E^1_{\mathrm{seg\_fg}}$ and $E^1_{\mathrm{seg\_bg}}$: the first term forces the estimated shape to cover more foreground pixels, and the second term encourages the shape to leave fewer background pixels uncovered. The segmentation compatibility scores of the other three parts are computed likewise.
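One plausible way to turn the two terms into numbers is sketched below. The exact formulas are not available in this text, so the definitions used here are assumptions: the foreground term is taken as the fraction of reference-foreground sample points that are also foreground in $B_t$, and the background term as the analogous fraction for background points. The function name is hypothetical.

```python
def segmentation_scores(cur, ref):
    """Foreground and background agreement between two label sequences.

    cur[i] and ref[i] are the 0/1 segmentation labels of corresponding
    sample points in the current mask B_t and in the reference mask.
    """
    if len(cur) != len(ref):
        raise ValueError("label sequences must have equal length")
    fg_ref = sum(ref)
    bg_ref = len(ref) - fg_ref
    fg_hit = sum(1 for c, r in zip(cur, ref) if c == 1 and r == 1)
    bg_hit = sum(1 for c, r in zip(cur, ref) if c == 0 and r == 0)
    # Assumed definitions: agreement ratios in [0, 1].
    e_fg = fg_hit / fg_ref if fg_ref else 0.0
    e_bg = bg_hit / bg_ref if bg_ref else 0.0
    return e_fg, e_bg
```

Both values reach 1 only when the body points fall on foreground and the context points fall on background, which matches the stated intent that foreground and background should both be compatible with the reference.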

Shape Self-Symmetry.
So far, we have exploited the appearance coherence and segmentation cues; however, the internal structure of the shape has been ignored. As discussed previously, the shape is self-symmetrical about the midline. So if a midline is correctly estimated, the image structures on the two sides of the midline should be symmetrical. This prior knowledge offers stable and drift-free guidance for tracking. As with the two previous cues, we compute a self-symmetry score for each of the four parts. Taking the third part $\Omega_3$ as an example, the self-symmetry score is computed as the NCC between two pixel value vectors, which are formed by the pixel values of the sample points on the two sides of the midline (as shown in Figure 4). The order of the pixel values should be arranged so that two points that are symmetrical about the midline occupy the same position in their respective arrays. Formally, the score is $\mathrm{NCC}\big(\{I_t(x_i^{+})\}_i, \{I_t(x_i^{-})\}_i\big)$, where $x_i^{+}$ and $x_i^{-}$ form the $i$th pair of points that are symmetrical about the midline. The shape self-symmetry scores of the other parts are computed likewise.
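The ordering requirement on the two pixel vectors is easiest to satisfy by generating the sample points in mirrored pairs. The sketch below does this for one midline segment; the function name, the grid sizes `n_along` and `n_across`, and the regular grid layout are assumptions made for illustration.

```python
import math

def mirrored_sample_points(p, q, half_width, n_along=4, n_across=3):
    """Sample point pairs that are symmetric about the segment p -> q.

    Returns (plus, minus); plus[k] and minus[k] are mirror images of each
    other, so pixel vectors read in this order are aligned for the NCC.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    tx, ty = dx / length, dy / length  # unit tangent of the segment
    nx, ny = -ty, tx                   # unit normal of the segment
    plus, minus = [], []
    for a in range(n_along):
        s = (a + 0.5) / n_along * length
        cx, cy = p[0] + s * tx, p[1] + s * ty
        for b in range(1, n_across + 1):
            off = b / n_across * half_width
            plus.append((cx + off * nx, cy + off * ny))
            minus.append((cx - off * nx, cy - off * ny))
    return plus, minus
```

Reading the image at `plus[k]` and `minus[k]` for every `k` gives two vectors whose entries correspond to mirrored locations, which is the input the symmetry score expects.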

Combination of Multiple Cues.
The final objective function $E(\Theta_t)$ is the weighted sum of the scores of all four parts, where $w_i$ is the weight of part $i$, which is set empirically in the experiments. Different parts should play different roles in tracking, and this is confirmed by our experiments. We find that the head (part 1) and the tail (part 4) should be associated with larger weights than parts 2 and 3, possibly because the image regions of the head and tail contain more discriminative features than the other two parts.
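The weighting can be illustrated with a small sketch. Two things here are assumptions and not taken from the paper: the three cue scores of a part are simply added to form that part's score, and the weight values are made up to reflect only the stated preference for head and tail.

```python
def combined_score(part_scores, weights):
    """Weighted sum of per-part scores.

    part_scores[i] holds the (appearance, segmentation, symmetry) scores of
    part i; adding the three cues per part is an assumption of this sketch.
    """
    if len(part_scores) != len(weights):
        raise ValueError("one weight per part is required")
    return sum(w * sum(cues) for w, cues in zip(weights, part_scores))

# Hypothetical weights: head (part 1) and tail (part 4) count more.
WEIGHTS = (0.35, 0.15, 0.15, 0.35)
```

With weights that sum to one, a candidate shape whose every cue scores 1 on every part receives a total of 3, and an error in the head or tail region lowers the total more than the same error in a middle part.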

Sequential Particle Swarm Optimization
With the objective function defined, the tracking problem becomes maximizing $E(\Theta_t)$ with respect to the parameters $\Theta'_t$. However, this objective function is highly complex and nondifferentiable. Particle swarm optimization (PSO) is a stochastic optimization technique that has received much attention because of its ability to find the optimum of complex problems.
We adopt a standard particle swarm optimization procedure [26]. A set of candidate solutions is maintained as particles, which move in the parameter space under the influence of the global best solution and each particle's own best solution. In each time step, the tracker generates initial solutions using a second-order dynamic model perturbed by Gaussian noise $n_{t-1} \sim \mathcal{N}(0, \Sigma)$.
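A compact version of this procedure is sketched below. It is a generic PSO under several assumptions that the text does not fix: the second-order model is taken to be the constant-velocity prediction $2\Theta_{t-1} - \Theta_{t-2}$, $\Sigma$ is treated as diagonal with per-dimension standard deviations `sigma`, and the inertia and acceleration coefficients are common textbook values. The function name is hypothetical.

```python
import random

def pso_maximize(objective, prev, prev2, sigma, n_particles=30, n_iters=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `objective` with particles seeded around a motion prediction."""
    rng = random.Random(seed)
    dim = len(prev)
    # Assumed second-order model: constant-velocity prediction plus noise.
    pred = [2.0 * a - b for a, b in zip(prev, prev2)]
    pos = [[pred[d] + rng.gauss(0.0, sigma[d]) for d in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val > best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val > g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val
```

In a tracker, `objective` would evaluate the multicue score of a candidate $\Theta'_t$ on the current frame, and the returned best position would serve as the estimate for time $t$ and as `prev` for the next frame.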

Experiments
We captured our own data for evaluation: 20 zebrafish were placed in a water tank, and a video camera mounted above the tank recorded their movements. The resolution of the camera is 1024 by 1024, and the frame rate is 100 fps. We first evaluate the method in the manner of conventional multiple-target tracking. The tracking of a target is considered complete if no ID switch or target loss occurs during the entire sequence. We manually labeled a 650-frame video for automatic evaluation. A tracker is considered to have failed if both the head and the tail positions of the estimated shape are more than 20 pixels away from the labeled positions, which are taken as ground truth. We also captured an 800-frame video of 10 zebrafish, shot under front lighting. The resolution of this video is 2048 (W) by 2040 (H), and the frame rate is 100 fps. On this data set, we evaluate the adaptability of our method.
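The failure criterion can be expressed directly in code. The function name and argument layout below are hypothetical; only the rule itself (both head and tail errors above 20 pixels) comes from the text.

```python
import math

def tracker_failed(est_head, est_tail, gt_head, gt_tail, thresh=20.0):
    """True if both head and tail are farther than `thresh` pixels from ground truth."""
    head_err = math.dist(est_head, gt_head)
    tail_err = math.dist(est_tail, gt_tail)
    return head_err > thresh and tail_err > thresh
```

Note that under this rule a large error at only one end, for example a badly estimated tail with a well-localized head, does not count as a failure.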
To evaluate the role of different cues, we test different combinations of cues on the labeled data; the results are listed in Tables 1 and 2. We also compare different methods on the labeled data and on the 800-frame video; the results are listed in Tables 3 and 4. Qian's method is more prone to ID switches (IDS) and lost targets as the number of frames in which fish occlude each other increases. For Wang's method, the lower image resolution (1024 × 1024 vs 2048 × 2040) possibly degrades the performance of the CNN. The proposed method utilizes more body shape information and thus performs better.
To evaluate the accuracy of the estimated shape, we plot the errors of the head and tail points of one fish in Figure 5. The figure shows that the error does not accumulate over time. The estimated tail point fluctuates more strongly than the head point, possibly because the appearance of the tail is less stable than that of the head.
We also give some qualitative results of the tracked shape under complex mutual interactions in Figure 6. With the designed multicue objective function and midline-based subspace model, our method is able to overcome a medium degree of partial occlusion. However, we find that our method may fail if the occluded area is too large. Finally, we plot the trajectories of the estimated head positions of the 20 fish in Figure 7. For better visualization, we add time as the third dimension of the plot.

Conclusion
We present in this paper a method that is capable of tracking the full-body deformation of multiple fish with frequent occlusions and interactions. We propose a midline subspace-based model to represent the complex shape and deformation of each fish. We further propose a PSO-based optimization framework whose multicue objective function takes multiple sources of information into account.

Data Availability
The captured video data used to support the findings of this study are available online at https://pan.baidu.com/s/15Ijt5bpud_aqA1Smro1CdQ and the access code is jagt.

Conflicts of Interest
The authors declare no conflicts of interest.