A Temporal Difference Pyramid Network for Action Recognition

EasyChair Preprint 10797 • 7 pages • Date: August 29, 2023

Abstract

The visual tempo of human actions expresses the dynamic and rhythmic scale of an activity and can distinguish actions with high visual similarity. Traditional convolutional neural networks typically sample the input video at different rates through local receptive fields, and some methods process videos at multiple rates with multi-branch networks, but this incurs high computational cost and requires additional resources. These earlier methods also fail to fuse features across levels when relating visual tempo to high-level attributes and fine-grained rhythm. In this study, we propose a Temporal Difference Pyramid Network (TDPN), which extracts multi-level features from the backbone of an action recognition network. Its two key components are a global temporal modeling module and a local temporal modeling module: the global module extracts low-level features rich in location information, while the local module extracts high-level features rich in semantic information. A multi-head attention mechanism then attends to the different feature levels simultaneously, better capturing the relative relationships among features in the video sequence. Finally, by aggregating features across levels, TDPN obtains informative, pixel-level fine-grained dynamics of the action's visual tempo. Tests on standard action recognition benchmarks, including Something-Something V1, Something-Something V2, and Kinetics-400, show that the proposed TDPN significantly boosts the performance of current video-based action recognition models.
Keyphrases: Pyramid network, TDPN, Temporal difference, Visual tempo, Action recognition
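The abstract describes fusing low-level (location-rich) and high-level (semantic) features with multi-head attention. The paper's implementation is not given here, so the following is only an illustrative NumPy sketch of that general idea: high-level features attend over low-level features via multi-head scaled dot-product attention, and the result is added back residually. All shapes, the head count, and the residual fusion are assumptions, not the authors' design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    """Minimal multi-head scaled dot-product attention.

    q, k, v: arrays of shape (T, d); the d channels are split evenly
    across heads (projection matrices omitted for brevity).
    """
    T, d = q.shape
    dh = d // num_heads
    out = np.zeros_like(q)
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)  # (T, T) per head
        out[:, s] = softmax(scores, axis=-1) @ v[:, s]
    return out

# Hypothetical feature maps pooled to (frames, channels):
T, d = 8, 16
rng = np.random.default_rng(0)
low = rng.normal(size=(T, d))   # low-level features (location information)
high = rng.normal(size=(T, d))  # high-level features (semantic information)

# High-level queries attend to low-level keys/values; residual fusion.
fused = high + multi_head_attention(high, low, low, num_heads=4)
print(fused.shape)  # (8, 16)
```

In a real network the queries, keys, and values would pass through learned linear projections, and the fused features would feed the subsequent aggregation across pyramid levels described in the abstract.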