Human Action Recognition Using Temporal Partitioning of Activities and Maximum Average Correlation Height Filter

We propose a method for human action recognition based on the construction of a set of templates for each activity. Each template is built from the Accumulated Motion Image (AMI) of the video and records where motion has occurred. A Fast Fourier Transform (FFT) is applied to each template, and a 3D spatiotemporal volume is generated for each class. A single-action Maximum Average Correlation Height (MACH) filter is then generated for each class. The filter is applied to the test video, and the actions are classified by comparing the filter response with a threshold. The experiments are conducted on the Weizmann dataset.


INTRODUCTION
Human motion analysis and recognition have become a highly active area of computer vision, driven by the spread of surveillance cameras and personal video devices. An effective solution has not yet been obtained because of the high dimensionality of video data, the intra-class variability caused by changes in scale, viewpoint and illumination, and low resolution and poor video quality. Human action recognition is the process of identifying the human actions that occur in video sequences. Its application domains include surveillance footage, user interfaces, robotics, automatic video organization, patient monitoring systems and athletic performance analysis. For the reasons above, accurate classification of human actions remains difficult.

Aggarwal and Ryoo (2011) provided an overview of current approaches to human activity recognition and explored the various methodologies used. Kim et al. (2010) used AMI and energy histograms for human action recognition. Davis and Bobick (1997) computed Hu moments of the Motion Energy Image and the Motion History Image to create action templates. Ahad et al. (2009, 2010) presented the important variants of the Motion History Image method and suggested some areas for further research. Chandrashekhar and Venkatesh (2006) constructed the Eigen Activity Space by performing PCA on the AEIs of various activities and used it for recognition. Shrivastava and Singh (2012) analysed the performance of three methods of human action recognition. Ping and Zhenijiang (2008) adopted the ideas of spatiotemporal analysis and the use of local features for motion description; these are computationally simple and suitable for various applications. Mahalanobis et al. (1987)

MATERIALS AND METHODS
Recently, much attention has been given to analyzing human actions in spatiotemporal space rather than frame by frame. The proposed method is computationally simple.

Materials used in this research work:
The experiments were conducted on the Weizmann dataset. The proposed method was implemented using Matlab.

Proposed method:
The proposed methodology for human action recognition is shown in Fig. 1, which gives the main components of the recognition system. As shown in the block diagram, for each training video four temporal segments are constructed, a 3D FFT is applied and the MACH filter is constructed. For each testing video, two of the four temporal segments are used; a 3D FFT is applied, the MACH filter is applied and the peak value of the response is calculated. The peak value is compared with the threshold and the actions are classified.

Temporal segmentation of activities:
An activity can be performed in different ways, by the same person or by different people, because of variation in the speed of the action. Even the same person can vary the speed while performing the activity. This temporal variability calls for methods that are robust to such variations. For this reason we divide the given action video into four stages and construct a template for each stage.

Accumulated Motion Image (AMI):
To represent the spatiotemporal features of human actions, the proposed system uses the AMI, which is computed from frame differences as in Eq. (1), where D(x, y, t) = I(x, y, t) − I(x, y, t − 1) and T denotes the total number of frames in a single action video. The AMI indicates where motion has occurred in the video and captures the pose details of the given activity.
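The AMI computation can be sketched as follows. This is a minimal illustration, not the authors' Matlab implementation: it assumes the AMI is the sum of absolute frame differences divided by T, i.e. AMI(x, y) = (1/T) Σ_t |D(x, y, t)|, and the function name `accumulated_motion_image` and the (T, rows, cols) array layout are choices made here for the example.

```python
import numpy as np

def accumulated_motion_image(frames):
    """Accumulate absolute frame differences over one clip.

    frames: grayscale video as an array of shape (T, rows, cols).
    Returns an array of shape (rows, cols).
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames of shape (rows, cols)")
    # D(x, y, t) = I(x, y, t) - I(x, y, t - 1)
    diffs = np.diff(frames, axis=0)
    # Assumed normalisation: divide the accumulated motion by T
    return np.abs(diffs).sum(axis=0) / frames.shape[0]

# Tiny clip: one pixel flickers, the rest of the image is static.
clip = np.zeros((4, 2, 2))
clip[:, 0, 0] = [0, 10, 0, 10]
ami = accumulated_motion_image(clip)
```

Pixels that never change stay at zero, so the non-zero region of the result marks where motion occurred.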
Algorithm for temporal segmentation:
• An activity video is divided into four equal temporal segments.
• An AMI is calculated for each temporal segment, where each segment has an equal number of frames.
• Each AMI acts as a template, so four templates are generated for each activity.
• Templates 2 and 3 provide more information than templates 1 and 4.
• The four-stage templates are referred to as spatiotemporal profiles.
• Temporal segmentation is shown in Fig. 2.

Fig. 2: Temporal segmentation for each action
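The segmentation steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: frames that do not divide evenly into four segments are dropped from the end of the clip (the text does not say how a remainder is handled), the AMI of each segment is taken as the sum of absolute frame differences divided by the segment length, and the name `segment_templates` is hypothetical.

```python
import numpy as np

def segment_templates(frames, n_segments=4):
    """Split a clip into equal temporal segments and build one AMI per segment.

    frames: grayscale video of shape (T, rows, cols).
    Returns an array of shape (n_segments, rows, cols).
    """
    frames = np.asarray(frames, dtype=np.float64)
    per_segment = frames.shape[0] // n_segments
    if per_segment < 2:
        raise ValueError("each segment needs at least two frames")
    # Assumption: trailing frames that do not fill a segment are discarded
    usable = per_segment * n_segments
    segments = np.split(frames[:usable], n_segments, axis=0)
    templates = [
        np.abs(np.diff(seg, axis=0)).sum(axis=0) / seg.shape[0]
        for seg in segments
    ]
    return np.stack(templates)

# 16-frame clip in which motion happens only during the second segment.
clip = np.zeros((16, 2, 2))
clip[4:8, 0, 0] = [0, 1, 0, 1]
templates = segment_templates(clip)
```

Stacking the four templates along the first axis gives a small volume that can be used directly as the spatiotemporal profile of the clip.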

Fast Fourier Transform (FFT):
The FFT is an efficient algorithm for computing the Discrete Fourier Transform (DFT), which is the sampled Fourier transform. It therefore does not contain all the frequencies forming an image, but only a set of samples that is large enough to fully describe the spatial domain image. The number of frequencies corresponds to the number of pixels in the spatial domain image.
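As a small check of the last statement, the following sketch uses NumPy's `np.fft.fft2` on an arbitrary 3 × 4 test image (the image values are made up for the example) and confirms that the transform holds one frequency sample per pixel and that the image can be recovered from those samples.

```python
import numpy as np

# Arbitrary 3 x 4 "image" used only for illustration
image = np.arange(12, dtype=np.float64).reshape(3, 4)

# 2D FFT: one complex frequency sample per pixel
spectrum = np.fft.fft2(image)

# The inverse transform recovers the spatial-domain image
recovered = np.fft.ifft2(spectrum).real
```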

Maximum average correlation height filter:
The MACH filter is used in object classification and palm print identification problems. It produces a single composite template from the instances of a class by optimizing four metrics: Average Correlation Height (ACH), Average Correlation Energy (ACE), Average Similarity Measure (ASM) and Output Noise Variance (ONV). The process results in a 2D template that gives the shape or appearance of an object in the video.

MACH filter for the action:
Training the MACH filter begins with the creation of spatiotemporal volumes by concatenating the templates of an action. A 3D FFT is then performed to represent each spatiotemporal volume in the frequency domain as shown in Eq. (2), where f(x, y, t) is the volume corresponding to the templates of the input sequence and F(u, v, w) is the resulting frequency-domain volume.

L = number of columns, M = number of rows and N = number of frames.
The resulting 3D FFT matrix is converted into a single column vector denoted by x_i.
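With the symbols defined above, the 3D transform referred to as Eq. (2) is presumably the standard 3D DFT. Assuming zero-based indices and x, y, t running over columns, rows and frames respectively, it can be written as:

```latex
F(u, v, w) = \sum_{t=0}^{N-1} \sum_{y=0}^{M-1} \sum_{x=0}^{L-1}
  f(x, y, t)\,
  \exp\!\left[ -j 2\pi \left( \frac{ux}{L} + \frac{vy}{M} + \frac{wt}{N} \right) \right]
```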
The MACH filter is created in the frequency domain as in Eq. (3), where m_x = the mean of all x_i, h = the filter in the frequency domain, C = the diagonal noise covariance matrix and α = the standard deviation parameter. D_x represents the average power spectral density of the training videos and is defined in Eq. (4), where x_i is arranged as a diagonal matrix and * denotes the conjugate operation. S_x is the diagonal average similarity matrix defined in Eq. (5), in which M_x is a diagonal matrix. α, β and χ are parameters that can be set to obtain the desired performance. Finally, the 1-D filter h is obtained.
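The filter synthesis can be sketched as follows. The sketch assumes the usual MACH closed form h = (αC + βD_x + χS_x)^(-1) m_x, with the diagonal of D_x equal to the mean of |x_i|² and the diagonal of S_x equal to the mean of |x_i − m_x|². Because every matrix is diagonal, the inverse reduces to an element-wise division. The white-noise model C = noise_var · I, the default parameter values and the name `mach_filter` are assumptions made for this example and are not taken from the text; the third weight is called `gamma` in the code.

```python
import numpy as np

def mach_filter(volumes, alpha=1e-3, beta=1.0, gamma=1.0, noise_var=1.0):
    """Build a frequency-domain MACH filter from training volumes.

    volumes: real array of shape (n_examples, N, M, L), one
             spatiotemporal volume per training example.
    Returns the filter H with shape (N, M, L).
    """
    volumes = np.asarray(volumes, dtype=np.float64)
    if volumes.ndim != 4:
        raise ValueError("expected shape (n_examples, N, M, L)")
    # 3D FFT of every training volume
    spectra = np.fft.fftn(volumes, axes=(1, 2, 3))
    # Flatten each spectrum into a vector x_i (one row per example)
    x = spectra.reshape(spectra.shape[0], -1)
    m_x = x.mean(axis=0)                       # mean of all x_i
    d_x = (np.abs(x) ** 2).mean(axis=0)        # diagonal of D_x
    s_x = (np.abs(x - m_x) ** 2).mean(axis=0)  # diagonal of S_x
    c = np.full(m_x.shape, noise_var)          # assumed white noise
    # Diagonal matrices: the matrix inverse is an element-wise division
    h = m_x / (alpha * c + beta * d_x + gamma * s_x)
    return h.reshape(volumes.shape[1:])

rng = np.random.default_rng(0)
training = rng.random((3, 4, 5, 6))   # 3 examples, 4 templates of 5 x 6
H = mach_filter(training)
```

Keeping `alpha` and `noise_var` positive guarantees a non-zero denominator even at frequencies where the training spectra vanish.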

Action classification:
The MACH filter is applied to the test video, for which only the 2nd and 3rd templates are used; the entire test video is not passed through the MACH filter. The response is calculated as shown in Eq. (6), where S is the spatiotemporal volume of the test video and H is the MACH filter. The response C is normalized so that its value lies between 0 and 1. The peak value of the filter response is compared with the threshold. If the peak is greater than the threshold, we infer that the action has occurred and the video is classified accordingly.
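The classification step can be sketched as follows. Several details are assumptions rather than the authors' procedure: the response is computed as a circular cross-correlation in the frequency domain, the normalization divides by the product of the norms of the test volume and the filter (which bounds the peak between 0 and 1 by the Cauchy-Schwarz inequality), and the test volume is assumed to have the same shape as the filter. The names `peak_response` and `classify` and the example threshold values are hypothetical.

```python
import numpy as np

def peak_response(test_volume, H):
    """Normalized correlation peak between a test volume and a filter.

    test_volume: real array, same shape as H.
    H: frequency-domain filter.
    """
    s = np.asarray(test_volume, dtype=np.float64)
    S = np.fft.fftn(s)
    # Correlation theorem: multiply by the conjugate filter, then invert
    corr = np.fft.ifftn(S * np.conj(H))
    # Parseval: spatial-domain norm of the filter from its spectrum
    h_norm = np.linalg.norm(H) / np.sqrt(H.size)
    denom = np.linalg.norm(s) * h_norm
    if denom == 0:
        return 0.0
    return float(np.abs(corr).max() / denom)

def classify(test_volume, filters, thresholds):
    """Return the labels whose peak response exceeds the class threshold."""
    return [
        label
        for label, H in filters.items()
        if peak_response(test_volume, H) > thresholds[label]
    ]

rng = np.random.default_rng(0)
volume = rng.random((4, 5, 6))
matched = np.fft.fftn(volume)          # filter matched to the volume itself
labels = classify(volume, {"walk": matched}, {"walk": 0.9})
```

A filter matched exactly to the test volume gives a peak of 1, so in this sketch thresholds close to 1 demand a near-exact match and lower thresholds tolerate more variation.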
Weizmann dataset:
The Weizmann dataset used in the proposed system is relatively large in terms of the number of subjects and actions. It includes 90 low-resolution videos from 9 different people, each performing 10 natural actions. A sample is shown in Fig. 3.

RESULTS AND DISCUSSION
The proposed system was evaluated on the Weizmann dataset, which consists of 10 actions performed by different persons. The actions include bend, jumping jack, jumping, walk, run, skip, gallop side and wave. The thresholds used to classify the actions are shown in Table 1. The percentage of videos giving the correct output is 92% and the percentage giving the wrong output is 8%. This accuracy is obtained by using the Accumulated Motion Image with the MACH filter. The proposed method is able to recognize 9 out of 10 actions.

CONCLUSION
An activity is divided into four temporal segments. An AMI is generated for each temporal segment and acts as the template for that segment. The templates, together with the MACH filter, are used for human action recognition. The computation is simple and fast because no separate classifier is used in the system, and the 3D FFT is applied only to the templates rather than to the entire video. The system performance could be enhanced by fusing multiple features.

Fig. 1: Main structure of the proposed system

Fig. 3: Some samples of the 10 action classes in the Weizmann dataset

Number of actions taken for testing = N. The total classification rate of the proposed system is calculated as follows:
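A conventional form of this rate, assuming N_c denotes the number of test actions classified correctly (a symbol introduced here for illustration), would be:

```latex
\text{Classification rate} = \frac{N_c}{N} \times 100\%
```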

Table 1: Activity and threshold matrix