MCMC-based Human Tracking with Stereo Cameras Under Frequent Interaction and Occlusion

Pak Ming, Cheung
Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, China
E-mail: cpming@ust.hk

Kam Tim, Woo
Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, China
E-mail: eetim@ust.hk

Abstract: Human tracking in a video sequence is an important task in civilian surveillance, and successful tracking provides data for security purposes. It remains a challenging problem: because human shape changes rapidly under irregular motion, typical methods perform poorly, especially under occlusion and interaction. Methods based on multiple cameras have recently been proposed, but they carry a high computation cost. In the stereo-camera approach, 3D information is obtained and projected onto an occupancy map. In this paper, we propose an algorithm combining an occlusion and interaction model with Markov chain Monte Carlo (MCMC) so that humans can be tracked under frequent interaction and occlusion. The proposed algorithm is efficient and effective, reducing the number of tracking failures by 78%.

Keywords: human tracking, stereo, interaction, occlusion, MCMC

I. INTRODUCTION

Human tracking in video sequences has recently become popular. It can be used for crime prevention, since suspicious behavior can be detected from movement [10]. Although it is one of the most active topics in computer vision, human tracking still presents difficulties for which there is no practical solution. One reason is occlusion: while a single object can be tracked fairly well, video sequences with multiple objects under persistent occlusion and interaction remain challenging [2]. In [6], occlusion by a fixed object is handled with a 3D model, but there is no practical solution for an environment with persistent occlusion and interaction.

This paper focuses on two challenging tasks: 1) handling occlusion and interaction by modeling them, and 2) keeping the algorithm computationally tractable. Occlusion makes tracking difficult because part of the tracked human cannot be captured by the cameras: the observed pixels and the colour of the occluded object are reduced, so failures such as loss of track and identity switches become more likely. Interaction is another challenging problem. It occurs when two or more objects are close; without an interaction model, nearby targets are "hijacked" by objects with high likelihood [5]. To handle occlusion and interaction, we propose one model for each, both embedded in the global observation model. The case for building a global observation model is made in [12].

A single camera is commonly used in human tracking systems [1][2][4][12][13]. In [2], [4] and [12], a global observation function is introduced and the authors use blob tracking with a particle filter. In [1] and [13], the authors propose approaches that handle interaction and occlusion without such an observation function: in [1], a blob merging and splitting technique is introduced, in which blobs merge when occlusion or interaction occurs; in [13], occlusion is handled in the measurement noise of the Kalman filter that tracks the position of a human.
However, owing to the lack of 3D information and constant changes in pose, tracking humans while handling occlusion and interaction remains hard. A multi-view setup is one possible approach [3][8][9][10]: 3D information can be obtained by combining different views, and occlusion can be handled using images from cameras installed at different locations and viewing angles. In [8], [9] and [10], when a human is occluded in one view, the system tracks that human in another view, which resolves the occlusion. In [3], an occupancy map of the ground plane is built and tracking is based on that map; the occupancy map represents the likelihood that each position is occupied by a human. We follow the approach in [3], but our work differs: instead of multiple cameras, we use stereo cameras. The multi-view approach requires a high computation cost to process the set of images from every view, and it assumes that the videos from the different cameras are synchronized; additional computation is needed for unsynchronized videos, and synchronization may require extra hardware and software [11]. With stereo cameras, 3D information can still be obtained, and many tracking algorithms are based on them [6][7][14]. The computation cost is also much lower than in the multi-view case, so the stereo computation remains tractable.

Occlusion and interaction still need to be handled, however. In [6], occlusion is handled by reducing the tracking window to the visible parts, but the noise may then be large because the window shrinks. In [14], the stereo cameras are mounted on the ceiling looking straight down, so people in the scene are never occluded; this setting is uncommon as it requires a specific environment. We adopt the approach in [7], in which an occlusion model is introduced on an occupancy map.

[Figure 2: Background subtraction pipeline. The rectified image and disparity map are compared against intensity and disparity background models to produce intensity and disparity binary images, which are combined into the foreground image.]

The model estimates the occluded area and compensates for the reduction in likelihood. However, that approach amounts to multiple single-object trackers. Without a joint likelihood, occlusion and interaction are difficult to model, so we propose tracking with a joint likelihood. Because sampling from a high-dimensional space is difficult, we use Markov chain Monte Carlo (MCMC), which can generate samples effectively in high dimensions [2][4][5][6].

Interaction is another difficult problem in a stereo approach, and it can be handled in different ways. In [2], a system without an interaction model is proposed; it records the appearance of each object at additional complexity, so the computation time may be long. Interaction models are introduced in [4] and [5] to handle the interaction. Under occlusion, the likelihood of an occluded human drops sharply, so its tracking window is likely to drift onto another target's position. We adapt the approach in [5], whose principle is that a penalty reduces the likelihood of two objects occupying the same area; in MCMC, reducing the likelihood in those situations reduces the acceptance ratio. We therefore propose an interaction model that penalizes interaction and lowers the likelihood in MCMC. Table 1 lists the notation used in this paper.
Note that $X_t = \{x_{1,t}, x_{2,t}, \ldots, x_{n,t}\}$, where $n$ is the number of humans in the system at time $t$.

TABLE 1. NOTATION

  $I_t^{(R)}$       The right input frame at time $t$
  $I_t^{(L)}$       The left input frame at time $t$
  $I_{1:t}$         The right and left input frames from time 1 to time $t$
  $M_t$             The occupancy map
  $X_t$             The system parameter at time $t$; it consists of the coordinates of each tracked human
  $x_{i,t}$         The parameters of human $i$ at time $t$
  $X_t^{(r)}$       The system parameter at iteration $r$ at time $t$
  $x_{i,t}^{(r)}$   The parameter of human $i$ at iteration $r$ at time $t$
  $V_{obs}$         Observation value
  $V_{occ}$         Occlusion value
  $M_{HO}$          Horizontal occlusion multiplier
  $M_{VO}$          Vertical occlusion multiplier
  $T_{obs}$         Threshold in the observation model

The paper is organized as follows: Section II details the Bayesian formulation, including the human model and the global observation model; Section III covers the Markov chain Monte Carlo method used in tracking.

II. BAYESIAN FORMULATION

In this section, the Bayesian formulation of the proposed system is discussed. The Bayesian formulation is one of the best-known approaches to tracking problems. The quantity of interest is the posterior

$$p(X_t \mid I_{1:t}) \qquad (1)$$

where $X_t$ is the system parameter at time $t$ and $I_{1:t}$ is the image sequence taken by the stereo cameras from time 1 to $t$. We ask: given the images of the video sequence from time 1 to $t$, which $X_t$ maximizes $p(X_t \mid I_{1:t})$? This is a maximum a posteriori (MAP) problem.

Figure 1 shows the overall system diagram. [Figure 1: Overall system diagram. The stereo frames $I_t^{(R)}$ and $I_t^{(L)}$ pass through background subtraction and projection to form the occupancy map $M_t$, on which MAP estimation yields $X_t$.] The inputs of the system at time $t$ are the right ($I_t^{(R)}$) and left ($I_t^{(L)}$) frames captured by the stereo cameras. Background subtraction and projection (Section II.A) remove the background and recover the 3D positions of the foreground points. An occupancy map ($M_t$), which gives the probability that each ground-plane position is occupied, is calculated from the projection. The final step is maximum a posteriori estimation to obtain the system parameters ($X_t$) at time $t$ (Section II.B).

A. Background Subtraction and Projection

We follow the approach in [7]: a background model is built from intensity and disparity, followed by background subtraction. The first part of the background model is intensity based and operates as in Figure 2. The mean intensity of each pixel is calculated in a pre-computation step; if the absolute difference between a pixel's value and the corresponding background value is greater than a threshold, the pixel is marked as foreground at that point. The output of the intensity model is a binary image indicating which pixels belong to the background. A similar operation is carried out for disparity: after stereo matching computes the disparity map of the current stereo pair, the absolute difference between each pixel's disparity and the value of the disparity background model is calculated, yielding a disparity binary image that indicates the background pixels. The intensity-based model alone does not work well when the intensity is affected by, for example, people's shadows or changing lighting; such intensity changes may wrongly mark pixels as foreground. However, a change in a pixel's intensity does not change its disparity, so the disparity background model is robust to those changes. The foreground is obtained by combining the two models in the background subtraction.

Pixels in the foreground are then projected onto the occupancy map according to the X-Y components of their 3D coordinates. The occupancy map records how many observed pixels fall at each point; the higher the value, the more likely the point is occupied by a human.
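To make Section II.A concrete, the following is a minimal Python sketch of the intensity-plus-disparity background subtraction and the ground-plane projection. The thresholds, the fusion rule (requiring both models to flag a pixel), the camera intrinsics, and the map geometry are illustrative assumptions; the paper does not specify them.

```python
import numpy as np

def foreground_mask(gray, disparity, bg_gray, bg_disp, t_gray=25.0, t_disp=3.0):
    """Background subtraction with intensity and disparity models.

    A pixel is kept as foreground only if it differs from both background
    models; the AND fusion suppresses shadow/lighting changes, which move
    intensity but not disparity.  Thresholds are assumed values.
    """
    fg_gray = np.abs(gray.astype(np.float32) - bg_gray) > t_gray
    fg_disp = np.abs(disparity - bg_disp) > t_disp
    return fg_gray & fg_disp

def occupancy_map(fg, disparity, f, baseline, cx, cell=0.05, shape=(128, 128)):
    """Project foreground pixels onto the ground plane (occupancy map M_t).

    Depth follows from the stereo relation z = f * baseline / disparity;
    each map cell accumulates the number of projected pixels, so a larger
    value means the cell is more likely occupied by a human.  For brevity
    this projects in camera coordinates; the paper applies the extrinsic
    transform to real-world coordinates first.
    """
    occ = np.zeros(shape, dtype=np.float32)
    v, u = np.nonzero(fg & (disparity > 0))
    z = f * baseline / disparity[v, u]               # depth per pixel
    x = (u - cx) * z / f                             # lateral offset
    gx = np.clip((x / cell).astype(int) + shape[1] // 2, 0, shape[1] - 1)
    gz = np.clip((z / cell).astype(int), 0, shape[0] - 1)
    np.add.at(occ, (gz, gx), 1.0)                    # accumulate pixel votes
    return occ
```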
B. MAP (Maximum a Posteriori)

In maximum a posteriori (MAP) estimation, we seek the system parameter $X_t$, denoted $\hat{X}_t$, that maximizes the posterior:

$$\hat{X}_t = \arg\max_{X_t} p(X_t \mid I_{1:t}) = \arg\max_{X_t} p(I_{1:t} \mid X_t)\, p(X_t)$$

With the Markov assumption, the equation becomes:

$$\hat{X}_t = \arg\max_{X_t} p\big(I_t^{(R)}, I_t^{(L)} \mid X_t\big)\, p(X_t)$$

Since $M_t$ can be obtained from background subtraction and projection using $I_t^{(R)}$ and $I_t^{(L)}$, the equation becomes:

$$\hat{X}_t = \arg\max_{X_t} p(M_t \mid X_t)\, p(X_t) \qquad (2)$$

The equation has two parts: the likelihood $p(M_t \mid X_t)$, which reflects how likely it is that there is a human at each hypothesized position and is calculated in the global observation model (Section II.D); and the prior $p(X_t)$. Figure 3 shows the calculation of the posterior. In the global observation model, occlusion reduces the likelihood and degrades the result; an occlusion model (Section II.E) is therefore proposed. The prior consists of the per-human terms and can be written as:

$$p(X_t) = \Big[\prod_i p(x_{i,t})\Big]\, \Phi(X_t) \qquad (3)$$

where $x_{i,t}$ is the parameter of human $i$ at time $t$ and $\Phi(X_t)$ is the interaction model (Section II.F). The detail of $x_{i,t}$ is given in the next section.

C. Human Model

The parameter of a human ($x_{i,t}$) has dimension 2:
• $x$: x coordinate on the occupancy map
• $y$: y coordinate on the occupancy map
where $x$ and $y$ give the position of the object on the floor, i.e. the position of its tracking window. All humans are assumed to have the same height, and the camera is assumed to be installed at a fixed location.

D. Global Observation Model

In the global observation model, the likelihood $p(M_t \mid X_t)$ is calculated; it represents how likely it is that humans are located at the hypothesized positions. We define the observation value ($V_{obs}$) as the weighted sum of the observed pixels within the square at the human's position on the occupancy map (an example is the purple dashed square in Figure 6). [Figure 6: Occupancy map, with a tracking window drawn as a dashed square.] $p(M_t \mid X_t)$ is calculated as follows:

$$p(M_t \mid X_t) = \exp\Big[\lambda_1 \Big(\frac{V_{obs}}{T_{obs}} - 1\Big)\Big] \qquad (4)$$

where $\lambda_1$ is a positive constant. $p(M_t \mid X_t)$ grows slowly while the observation value is small and much faster once it is large. This is because a small observation value is likely to come from noise, whereas a large value is more likely to come from pixels projected from a human. The idea is illustrated in 1D in Figure 4. [Figure 4: Observed pixels and likelihood versus position.] The number of observed pixels (blue dashed curve) is computed at each point; the larger it is, the more likely a human is present, so the likelihood (green curve) is high around the mean of the bell-shaped region of the blue curve and close to zero elsewhere.
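A sketch of the observation value and the likelihood of Eq. (4), using the observation mask discussed in the next paragraph. The Gaussian profile of the mask is our assumption (the paper only states that weights grow toward the centre), and $T_{obs}$, $\lambda_1$ and the window size are illustrative settings:

```python
import numpy as np

def observation_mask(w=11, sigma=2.5):
    """Centre-weighted mask over the tracking window (cf. Figure 5).
    A Gaussian profile is an assumption; the paper only requires that
    the centre of the mask carries the largest weight."""
    ax = np.arange(w) - w // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return np.outer(g, g)

def observation_likelihood(occ, x, y, mask, t_obs=400.0, lam1=2.0):
    """Observation value V_obs and likelihood p(M_t | X_t) of Eq. (4).
    (x, y) is the window centre on the occupancy map; for simplicity
    the window is assumed to lie fully inside the map."""
    h = mask.shape[0] // 2
    window = occ[y - h:y + h + 1, x - h:x + h + 1]   # tracking window
    v_obs = float(np.sum(window * mask))             # weighted pixel sum
    return v_obs, float(np.exp(lam1 * (v_obs / t_obs - 1.0)))
```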
When the images are projected onto the occupancy map, the pixels belonging to a human are projected around that human's position, so the tracking window can follow the human. In the observation we want points closer to the human's position to receive a larger weight, so an observation mask (Figure 5) is proposed in the global observation model. [Figure 5: Observation mask, with the largest weights at the centre.] The mask multiplies the pixels in the tracking window (the purple square in Figure 6). The centre of the mask carries the largest weight, which keeps the tracking window anchored around the position of the human.

Occlusion occurs during tracking and reduces the observation value in the global observation model. In the next section, an occlusion model is introduced to handle it.

[Figure 3: Posterior calculation. The likelihood $p(M_t \mid X_t)$ from the occupancy map is multiplied with the prior $p(X_t)$, which incorporates the interaction term $\Phi(X_t)$ and the occlusion value $V_{occ}$, to give the posterior.]

E. Occlusion Model

When the humans captured by the cameras are moving, occlusion occurs frequently. When one human blocks another, the pixels of the blocked person cannot be projected onto the occupancy map; the observed pixels, and with them the observation value, are greatly reduced. This section focuses on how to compensate for that reduction. The model distinguishes two types of occlusion: vertical and horizontal. Vertical occlusion depends on the distance between two humans: the closer one blocks a larger part of the other. Horizontal occlusion depends on the angle formed by the camera and the human in front. We define the occlusion value ($V_{occ}$) as the compensation for the reduction in the observation value, and we propose horizontal and vertical occlusion multipliers to compensate for the horizontal and vertical occlusion, respectively.

[Figure 7: Horizontal occlusion. Humans G and B (green and blue spots) relative to the shadow cast by human R (red spot); $d$ is the distance to the shadow edge and $w$ the frontal width.]

Horizontal occlusion (Figure 7) occurs when a human (the green and blue spots in the figure) enters the shaded region formed by human R (the red spot). We define the horizontal occlusion multiplier ($M_{HO}$) as:

$$M_{HO} = \begin{cases} 1 & \text{if completely blocked} \\ d/w & \text{if partially blocked} \end{cases} \qquad (5)$$

where $d$ is the perpendicular distance from the edge of the human to the edge of the shaded region and $w$ is the width of the human (the side length of the square in Figure 6, representing the frontal width). The horizontal occlusion multiplier is maximal when the human lies entirely inside the shadowed area; when a human is only partially blocked, the multiplier depends on the distance from the centre of the human to the edge of the shadow. Human B (blue spot) in Figure 7 is an example: it sits on the edge of the shadow and is partially blocked.

[Figure 8: Vertical occlusion. Human R (red) nearest the camera blocks humans G (green) and B (blue) at increasing distances.]

The other type of occlusion is vertical (Figure 8): it occurs when a human stands in front of another. In the illustration, human R (the red one) is closest to the camera, and humans G (green) and B (blue) are blocked by R at different distances from the camera. From the trigonometry (the derivation is in the Appendix), the vertical occlusion multiplier ($M_{VO}$) is:

$$M_{VO} = \Big(\frac{d_{CO}}{d_{CF}}\Big)(1 - r) + r \qquad (6)$$

where $r$ is the ratio between the height of the camera and the height of the human, $d_{CF}$ is the distance between the camera and the human in front (human R in the example), and $d_{CO}$ is the distance from the camera to the human that is blocked (humans G and B in the example).
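The two multipliers of Eqs. (5) and (6) in code form. The linear $d/w$ ramp for partial horizontal occlusion is our reading of the reconstructed Eq. (5), and the clamp to [0, 1] is an addition for distant targets, where the geometric formula would leave the valid range:

```python
def horizontal_occlusion(d, w):
    """Eq. (5): 1 when the human is completely inside the shadow of the
    front human; otherwise proportional to d, the perpendicular distance
    from the human's edge to the shadow edge, over the frontal width w."""
    return min(1.0, max(0.0, d / w))

def vertical_occlusion(d_co, d_cf, h_cam, h_human):
    """Eq. (6)/(20): fraction of the rear human's height hidden by the
    front one.  r is the camera-to-human height ratio; the [0, 1] clamp
    is our addition, not part of the paper's formula."""
    r = h_cam / h_human
    return min(1.0, max(0.0, (d_co / d_cf) * (1.0 - r) + r))
```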
The occlusion value ($V_{occ}$) is as follows:

$$V_{occ} = c \cdot M_{HO} \cdot M_{VO} \qquad (7)$$

where $c$ is a constant. The occlusion value is added to the observation value at that point, and the likelihood $p(M_t \mid X_t)$ under occlusion is modified as follows:

$$p(M_t \mid X_t) = \exp\Big[\lambda_1 \Big(\frac{V_{obs} + V_{occ}}{T_{obs}} - 1\Big)\Big] \qquad (8)$$

F. Interaction Model

When humans move, the distances between them vary. The term "interaction" in this paper refers to the tracking windows of two humans being close, which happens frequently when the moving area is small. We assume that a human's position changes only a little from the last frame to the current one; based on this assumption, at the beginning of each frame the starting position of each search window is its position in the last frame. The search, driven by the likelihood and the prior, explores the region around the starting position. However, when two humans are close, the search window of one human may be trapped by a local maximum of the likelihood (a large observation value). This is illustrated for a 1D case in Figure 9. [Figure 9: Likelihood during interaction; from top to bottom: (A) starting positions of R and G, (B) R trapped at G's position, (C) the true positions, each with the observed-pixel curve and the corresponding tracking windows.] In the figure there are two humans close together, with starting positions shown in Figure 9A. There are more observed pixels (a higher likelihood) at the true position of human G. Since the likelihood at that position is higher, the tracking window of human R (the red dashed square) may be trapped there and fail to track R's true position (shown in Figure 9B). The two tracking windows then occupy the same position, which is essentially a failure. This is more likely to occur when the difference between the two likelihoods is large. Under occlusion, the number of observed pixels of an occluded human is greatly reduced, so its likelihood is much lower than that of non-occluded humans, and its tracking window may drift onto and overlay other tracking windows. Figure 10 illustrates this in 2D: human R (at the position of the red square) is blocked by human G (at the position of the green square), so the observed pixels and likelihood of R are reduced. [Figure 10: Interaction under occlusion, with R's window drawn over G's.]

We introduce a term that penalizes tracking windows overlaying others by reducing the likelihood. Based on the idea that two humans, and hence their tracking windows, cannot occupy the same area, we define the penalty function as follows:

$$\Phi(X_t) = \begin{cases} \exp\big[-\lambda_2 \,(d_T - d_H)\big] & \text{if } d_H \le d_T \\ 1 & \text{otherwise} \end{cases} \qquad (9)$$

where $\lambda_2$ is a positive constant, $d_T$ is a threshold, and $d_H$ is the distance between two humans. If the distance between two humans falls below the threshold, a penalty is applied to that configuration so that its likelihood is reduced.
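A sketch of the pairwise interaction penalty of Eq. (9); positions are ground-plane coordinates, and `d_thresh` and `lam2` are assumed settings:

```python
import numpy as np

def interaction_prior(positions, d_thresh=0.6, lam2=4.0):
    """Interaction term Phi(X_t) of Eq. (9), applied to every pair.

    Each pair of humans closer than d_thresh contributes a factor
    exp(-lam2 * (d_thresh - d)) < 1, so configurations with overlapping
    tracking windows receive a lower prior and hence a lower MCMC
    acceptance ratio.
    """
    phi = 1.0
    pts = [np.asarray(p, dtype=float) for p in positions]
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            d = float(np.linalg.norm(pts[i] - pts[j]))
            if d <= d_thresh:
                phi *= float(np.exp(-lam2 * (d_thresh - d)))
    return phi
```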
III. COMPUTATIONAL PROBLEM

Computing the MAP is an optimization problem [2]. The posterior can be computed with a Markov chain whose stationary distribution is the target distribution. We adopt the approach in [5]: the chain is constructed with the Metropolis-Hastings algorithm.

A. MCMC-Based Particle Filter

After initialization, the initial state ($X_t^{(0)}$) of the system at time $t$ is obtained. Then $B+N$ MCMC sampling steps are performed. The first $B$ steps, known as the burn-in period, are discarded; the next $N$ steps are used for the MAP estimate. The Metropolis-Hastings algorithm drives the Markov chain:

$$P = \min(1, a) \qquad (10)$$

where $P$ is the probability that a proposed state is accepted and $a$ is the acceptance ratio. If the step is accepted, the proposed state is used to calculate the MAP; if it is rejected, the previous state is kept. After the $B+N$ iterations, the MAP is estimated by:

$$\hat{X}_t = \frac{1}{N} \sum_{r=B+1}^{B+N} X_t^{(r)} \qquad (11)$$

The dimension of the filter is an important issue: it changes whenever an object enters or exits the scene. In [2] and [5], the authors use MCMC to track the objects in the scene; the system in [5] tracks a fixed number of objects, while the one in [2] handles a varying number of objects. In this paper, the dimension is "fixed", meaning the number of humans is constant during tracking except at the beginning. We adopt the approach in [5]. The acceptance ratio ($a$) is defined as:

$$a = \frac{P(M_t \mid X_t')\, \tilde{P}(X_t' \mid M_{t-1})\, Q\big(X_t^{(r)} \mid X_t'\big)}{P\big(M_t \mid X_t^{(r)}\big)\, \tilde{P}\big(X_t^{(r)} \mid M_{t-1}\big)\, Q\big(X_t' \mid X_t^{(r)}\big)} \qquad (12)$$

where $X_t'$ is the proposed parameter and $X_t^{(r)}$ is the current one. $P(M_t \mid X_t)$ is the global observation model, and $\tilde{P}(\cdot \mid \cdot)$ is the sample approximation to the predictive prior, detailed in Section III.C.

B. Markov Chain Dynamics

We design the following dynamics for the Markov chain. Assume we have $X_t^{(r)}$ at the $r$-th iteration. The proposed parameter $X_t'$ for the $(r+1)$-th iteration is:

$$X_t' = (x^*,\, y^*) = (x + \epsilon_x,\; y + \epsilon_y) \qquad (13)$$

where $X_t^{(r)} = (x, y)$ and $\epsilon_x$, $\epsilon_y$ are generated from zero-mean Gaussian distributions. When $p(M_t \mid X_t)$ (obtained from the global observation model) is high, i.e. the confidence level is high, the current parameters ($X_t$) are likely to be correct, and we set the variance of the Gaussian distribution small. When $p(M_t \mid X_t)$ is small, we set the variance larger; this prevents the tracking window from being "trapped" in local maxima.

C. Acceptance Ratio

In this paper, we define the predictive prior as follows:

$$\tilde{P}(X_t \mid M_{t-1}) = \Phi(X_t) \cdot \prod_i \sum_r P\big(x_{i,t} \mid x_{i,t-1}^{(r)}\big) \qquad (14)$$

where $\Phi(X_t)$ is the interaction model that penalizes overlap on the occupancy map and $P(x_{i,t} \mid x_{i,t-1}^{(r)})$ is the probability of the proposed position given the past state of a human.

The other term in the acceptance ratio is the proposal density $Q(X_t^{(r)} \mid X_t')$, defined as the probability of moving from $X_t'$ to $X_t^{(r)}$ in the proposal. Since the Gaussian proposal is symmetric, $Q(X_t^{(r)} \mid X_t') = Q(X_t' \mid X_t^{(r)})$, and the acceptance ratio ($a$) becomes:

$$a = \frac{P(M_t \mid X_t')\, \tilde{P}(X_t' \mid M_{t-1})}{P\big(M_t \mid X_t^{(r)}\big)\, \tilde{P}\big(X_t^{(r)} \mid M_{t-1}\big)} \qquad (15)$$
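Putting Section III together, a minimal Metropolis-Hastings loop under the symmetric proposal of Eq. (13). Here `posterior` stands for the product $P(M_t \mid X_t)\,\tilde{P}(X_t \mid M_{t-1})$, and the fixed proposal sigma is a simplification of the paper's likelihood-dependent variance:

```python
import numpy as np

def mcmc_map_estimate(posterior, x_init, burn_in=100, n_keep=300, sigma=0.05):
    """MCMC-based tracking step (Section III).

    posterior(x) -> unnormalized P(M_t | x) * P~(x | M_{t-1}); with a
    symmetric Gaussian proposal the Q terms of Eq. (12) cancel, leaving
    the ratio of Eq. (15).  Returns the sample mean of Eq. (11).
    """
    x = np.asarray(x_init, dtype=float)        # joint state, shape (n_humans, 2)
    p_x = posterior(x)
    kept = []
    for step in range(burn_in + n_keep):
        x_prop = x + np.random.normal(0.0, sigma, size=x.shape)  # Eq. (13)
        p_prop = posterior(x_prop)
        a = p_prop / max(p_x, 1e-300)          # acceptance ratio, Eq. (15)
        if np.random.rand() < min(1.0, a):     # accept with prob P, Eq. (10)
            x, p_x = x_prop, p_prop
        if step >= burn_in:                    # discard the burn-in period
            kept.append(x.copy())
    return np.mean(kept, axis=0)               # estimate of Eq. (11)
```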
[Figure 11: Occlusion in the setting. (A) Cameras tilted slightly downward, as in the experiment: part of human G remains visible behind R. (B) Cameras at head level: G is completely blocked.]
[Figure 12: Failure. The true position and the tracking window no longer overlap.]

IV. EXPERIMENT AND RESULTS

We use a pair of stereo cameras viewing downward at a small angle. Figure 11 illustrates the setting. In Figure 11A, the setting is the same as in the experiment: part of human G (the one in green) can still be seen. If the setup is changed to that of Figure 11B, with the cameras at head level, human G is blocked completely. In the model, the parameters of the stereo cameras that need to be estimated are the position of the cameras in real-world coordinates and their heights. If the cameras are relocated, only these two parameters are required to recompute the projection matrix from camera coordinates to real-world coordinates, that is, the extrinsic parameters.

The experimental environment is as follows:
• Number of video sets: 6
• Video length: 3 to 4 minutes each
• Frame rate: 10 FPS
• Size of tracked area: 5.7 m²
• Number of people in the experiments: 1-6
• Severe occlusion occurs during the experiments

The 6 videos were taken from the same position with different numbers of people. Each video contains a fixed number of humans; that is, once all the humans have entered the room, they stay in it until the end of the video. Since the tracked area is small, severe occlusion and interaction occur during the recordings. Figure 12 shows an example of a failure: the purple square is the true position and the red one is the position of the tracking window; we count a failure whenever the two windows no longer overlap. Most of the errors are caused by occlusion. In the tests we record the number of tracking failures, which reflects the performance of the models: a better occlusion and interaction model reduces the number of failures.

Several parameters must be set in the experiment: the variances of the Gaussian distributions of $\epsilon_x$ and $\epsilon_y$ in (13), and the parameters of the global observation model, the occlusion model and the interaction model. The model parameters are set according to performance. The variances of $\epsilon_x$ and $\epsilon_y$ are set according to the correlation of the samples generated by MCMC. There are two ways to reduce the correlation between iterations. The first is to increase the variance: the proposed sample then lands farther from the previous one, which lowers the correlation; however, a larger move is less likely to be accepted, and every rejection repeats the previous sample and increases the correlation. The second is to raise the acceptance ratio so that proposals are accepted more often, achieved by reducing the variance used to generate $\epsilon_x$ and $\epsilon_y$; but then the difference between consecutive samples shrinks. A test is carried out to obtain a suitable value.
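One way such a test could be run is to sweep candidate proposal variances and measure the lag-1 autocorrelation of the resulting chains, keeping the variance that minimizes it; the paper does not describe its procedure, so the following helper is purely illustrative:

```python
import numpy as np

def lag1_autocorr(chain):
    """Lag-1 autocorrelation of one coordinate of an MCMC chain.

    Too large a proposal variance -> many rejections -> repeated samples
    -> high autocorrelation; too small -> tiny moves -> also high.
    Sweep candidate sigmas and keep the one giving the smallest value.
    """
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = float(np.sum(x * x))
    return float(np.sum(x[:-1] * x[1:]) / denom) if denom > 0 else 0.0
```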
The results of the experiments are summarized in Table 2. They are averages over 10 trials with 6 people in the room (approximately one person per square metre); the sequence contains 1888 frames.

TABLE 2. NUMBER OF FAILURES

                                           Without Occlusion Model   With Occlusion Model (Section II.E)
  Without Interaction Model                        118.5                     116.8
  With Interaction Model (Section II.F)             29.8                      25.9

From Table 2, we can see that with the proposed algorithm the number of failures drops from 118.5 to 25.9: a 78% reduction when the occlusion model and interaction model are introduced. Table 3 shows the results for different numbers of people, with both the occlusion model and the interaction model enabled (averages over 10 trials):

TABLE 3. NUMBER OF FAILURES VERSUS NUMBER OF HUMANS

  No. of humans   No. of failures   No. of frames   FPS
        6              25.9             1888        65.3
        5              11.9             1872        68.1
        4               3.3             1806        71.0
        3               1.0             2020        76.2
        2               0               2123        82.6
        1               0               1641        85.0

Note that the algorithm is tested on an Intel® Core™ i5 with 2 GB of RAM. Since we focus on the tracking and the setting is the same as in [7], the FPS measures the time for MAP estimation only. The performance improves as the number of humans in the sequence decreases.

We have also compared our computation time with [7]. In the test, we use the same tracking input as [7] and measure only the tracking step, running the algorithm of [7] and the proposed one on a Pentium 4 (the same grade of CPU as in [7]) and a Core™ i5:

TABLE 4. RUN-TIME COMPARISON

  Algorithm          In [7]      Proposed    Proposed
  CPU                Pentium 4   Pentium 4   Core™ i5
  Time (per frame)   83 ms       40 ms       15 ms

This shows that the time required for tracking is reduced by 51.8% when using the proposed algorithm on the same grade of CPU; the computation time can be further reduced on a more recent computer.

[Figure 13: Left camera views and the corresponding occupancy maps, with the tracking results drawn on both, for the test sequence (panels A-F).]

In Figure 13, the left views of the stereo cameras and the corresponding occupancy maps are shown, with the tracking result drawn on both. Figure 13A shows the input of the background model, with no human present. In Figures 13B-13D, people are entering the room; since the density of humans in the room is high, occlusion and interaction occur. For example, in Figure 13D the human with the green square occludes the human with the blue square. Figure 13E shows a scene with 6 people, who may be occluded and occlude others at the same time, which makes the tracking challenging. Figure 13F is an example of failure: the human with the green square is occluded completely, and the tracking window mistakenly tracks the human with the red square, because the likelihood of the occluded human is greatly reduced.

V. CONCLUSION AND FUTURE DEVELOPMENT

In this paper, we have built a human tracking system with stereo cameras for crowded environments with frequent occlusion and interaction. Our contributions in this work are: 1) the introduction of an occlusion model to handle occluded humans in tracking; 2) the introduction of an interaction model to handle the cases where multiple humans are close or where there are few observed pixels; 3) effective MCMC-based human tracking on an occupancy map; and 4) experiments demonstrating the improvement in tracking brought by the occlusion and interaction models.

The occlusion model in this paper can be extended to non-human objects such as desks and cabinets, treated as a special case of a human with a different width and height. The likelihood of a human blocked by such fixed objects can then be compensated, so occlusion created by fixed objects can be handled by extending the occlusion model.

[Figure 14: Vertical occlusion model, showing the camera height $h_C$, the occluded height $h_O$, the human height $h$, and the distances $d_{CF}$ and $d_{CO}$.]

APPENDIX

A. Derivation of Vertical Occlusion

In Figure 14, $h_O$, $h_C$ and $h$ are the height being occluded, the height of the camera, and the height of the human, respectively.
$d_{CF}$ and $d_{CO}$ are the distances from the camera to the human in front and to the human being occluded, respectively. In vertical occlusion, we are interested in estimating the number of blocked pixels, which is directly proportional to the height being blocked ($h_O$ in Figure 14). Let $p$ be the observed value when there is no occlusion, $p_O$ the observed value when there is occlusion, and $p_C$ the occlusion value. The occlusion model is then:

$$p = p_O + p_C \qquad (16)$$

The vertical occlusion value is:

$$p_C = M_{VO} \cdot \lambda \qquad (17)$$

where $\lambda$ is a positive constant. The vertical occlusion multiplier is the fraction of pixels being occluded, which equals $h_O / h$. By similar triangles:

$$\frac{h_O - h_C}{h - h_C} = \frac{d_{CO}}{d_{CF}} \qquad (18)$$

Changing the subject to $h_O$, we have:

$$h_O = (h - h_C)\,\frac{d_{CO}}{d_{CF}} + h_C \qquad (19)$$

Dividing both sides by $h$ and letting $r = h_C / h$, we have:

$$M_{VO} = \frac{h_O}{h} = \Big(\frac{d_{CO}}{d_{CF}}\Big)(1 - r) + r \qquad (20)$$
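As a sanity check of Eq. (20) with hypothetical values: take $h_C = 2.5$ m, $h = 1.7$ m (so $r \approx 1.47$), $d_{CF} = 2$ m and $d_{CO} = 4$ m. Then $M_{VO} = 2 \times (1 - 1.47) + 1.47 \approx 0.53$, i.e. roughly half of the rear human's height is hidden. At $d_{CO} = d_{CF}$ the formula gives exactly 1, consistent with complete vertical blocking at zero depth separation; for very distant targets, where the downward-tilted camera sees over the front human's head, the value would in practice be clamped to the range [0, 1].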
ACKNOWLEDGMENT

We would like to thank Mr. Cheung [7] for providing his data for testing and comparison.

REFERENCES

[1] T. Yang, Q. Pan, J. Li, and S. Z. Li, "Real-time multiple objects tracking with occlusion handling in dynamic scenes," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 970-975, June 2005.
[2] T. Zhao and R. Nevatia, "Tracking multiple humans in crowded environment," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. II-406-II-413, June-July 2004.
[3] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multicamera people tracking with a probabilistic occupancy map," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 2, 2008.
[4] K. Smith, D. Gatica-Perez, and J. M. Odobez, "Using particles to track varying numbers of interacting people," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 962-969, 2005.
[5] Z. Khan, T. Balch, and F. Dellaert, "MCMC-based particle filtering for tracking a variable number of interacting targets," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1805-1819, Nov. 2005.
[6] T. Osawa, X. Wu, K. Sudo, K. Wakabayashi, H. Arai, and T. Yasuno, "MCMC based multi-body tracking using full 3D model of both target and environment," in Proc. IEEE Conf. Advanced Video and Signal Based Surveillance (AVSS), pp. 224-229, Sept. 2007.
[7] T. K. S. Cheung and K. T. Woo, "Human tracking in crowded environment with stereo cameras," in Proc. 17th Int. Conf. Digital Signal Processing (DSP), pp. 1-6, July 2011.
[8] Q. Cai and J. K. Aggarwal, "Automatic tracking of human motion in indoor scenes across multiple synchronized video streams," in Proc. Sixth IEEE Int. Conf. Computer Vision, 1998.
[9] S. L. Dockstader and A. M. Tekalp, "Multiple camera tracking of interacting and occluded human motion," Proceedings of the IEEE, vol. 89, no. 10, pp. 1441-1455, Oct. 2001.
[10] S. Calderara, R. Cucchiara, and A. Prati, "A distributed outdoor video surveillance system for detection of abnormal people trajectories," in Proc. First ACM/IEEE Int. Conf. Distributed Smart Cameras (ICDSC), pp. 364-371, Sept. 2007.
[11] E. Stoykova, A. A. Alatan, P. Benzie, N. Grammalidis, S. Malassiotis, J. Ostermann, S. Piekh, V. Sainov, C. Theobalt, T. Thevar, and X. Zabulis, "3-D time-varying scene capture technologies—a survey," IEEE Trans. Circuits and Systems for Video Technology, vol. 17, no. 11, pp. 1568-1586, Nov. 2007.
[12] M. Isard and J. MacCormick, "BraMBLe: a Bayesian multiple-blob tracker," in Proc. Eighth IEEE Int. Conf. Computer Vision (ICCV), vol. 2, pp. 34-41, 2001.
[13] T. Zhao and R. Nevatia, "Tracking multiple humans in complex situations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1208-1221, Sept. 2004.
[14] D. Beymer, "Person counting using stereo," in Proc. Workshop on Human Motion, pp. 127-133, 2000.