Title
Action Recognition in Videos Using Deep Learning /
Author
Hakim, Bassel Safwat Chawky.
Preparation Committee
Researcher / Bassel Safwat Chawky Hakim
Supervisor / Howida Abdel Fattah Shedeed
Supervisor / Mohammed Abd El-Rahman Marey
Examiner / Ahmed Samir
Publication Date
2019.
Number of Pages
119 p. :
Language
English
Degree
Master's
Specialization
Information Systems
Approval Date
1/1/2019
Approval Venue
Ain Shams University - Faculty of Computer and Information Sciences - Department of Scientific Computing
Table of Contents
Only 14 of the 119 pages are available for public view

Abstract

Recognizing human actions in a given video is a difficult problem, with challenges ranging from partial occlusion to variations in action speed and viewpoint. Nevertheless, this problem lies at the core of various systems such as abnormal behavior detection, action localization, and online video analysis, which is why it has attracted the attention of many researchers over the past decades and continues to do so.
Despite the number of studies in the literature, action recognition remains a difficult problem. Traditional approaches require considerable effort to find the best combination of features, not only to represent the action in a compact form but also to handle the many existing challenges. Recent methodologies instead rely on deep learning models to learn and extract good representations from the datasets. Although deep learning-based models require large training datasets and costly training time, they have demonstrated advances on several action recognition datasets. Moreover, techniques such as transfer learning allow faster convergence by pretraining the model.
First, a novel ranking and listing of 14 action recognition datasets is presented. The ranking is based on the number of challenges each dataset covers: the higher a dataset's rank, the more realistic it is, and the more realistic a measurement it provides for the models. Based on these datasets, a comparison between the advances of traditional approaches and deep learning-based models is illustrated. Building on this survey, the deep learning baseline models, namely the two-stream convolutional neural network and the 3D convolutional neural network, are described. These baselines underlie almost all other studies, including state-of-the-art models.
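The key idea behind the 3D convolutional baseline, convolving across time as well as space, can be sketched with a minimal single-channel example. This naive NumPy implementation is for illustration only and is not the thesis's implementation; the actual baselines are full deep architectures:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution (cross-correlation, as in CNNs)
    over a single-channel clip of shape (T, H, W)."""
    kt, kh, kw = kernel.shape
    T, H, W = volume.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(volume[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

# A 16-frame 8x8 clip and a 3x3x3 kernel: each output value mixes
# information across both space and time, unlike a 2D convolution
# applied independently to every frame.
clip = np.random.rand(16, 8, 8)
kernel = np.ones((3, 3, 3)) / 27.0  # simple spatiotemporal averaging filter
features = conv3d_valid(clip, kernel)
print(features.shape)  # (14, 6, 6)
```

The temporal dimension shrinks from 16 to 14 exactly as the spatial ones do, which is what lets stacked 3D convolutions build up motion-aware features.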
Second, a novel action recognition benchmark is introduced and used to study several action recognition challenges, such as changes in viewpoint and the effect of shaky videos, using the baseline models described above. It is also used to study the overfitting problem of deep learning models together with potential solutions.
Finally, two main techniques are proposed, both of which can be integrated into many existing action recognition models to improve accuracy. These techniques leverage the video's temporal dimension to learn features representing varying temporal lengths. The first technique, Single Temporal Resolution Single Model (STR-SM), trains the desired model on one specific temporal resolution of the video. Temporal resolution is defined as the number of frames per a specific amount of time: a low temporal resolution uses few frames to represent the action, while a high temporal resolution uses many. A good model using STR-SM therefore picks a temporal resolution low enough to cover a long temporal duration yet high enough to capture the motion details. This technique is faster than traditional approaches, since it processes a long temporal range at once, and more accurate, since it covers more information.
The second technique, Multi Temporal Resolution Multi Model (MTR-MM), tackles the problem of varying action speeds in a novel way. Applying MTR-MM to the desired model requires building several STR model versions, each trained on a specific temporal resolution, combined with late fusion. This leverages the different information present in each temporal resolution, leading to improved accuracy. When applied to the 3D Convolutional Neural Network model, the STR-SM and MTR-MM techniques improve video-wise accuracy over the traditional training approach by 3.63% and 6%, respectively.
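The two techniques can be sketched as follows. The STR-SM/MTR-MM names come from the abstract, but the evenly-spaced frame sampling, the two-second window, and fusion by score averaging are assumptions made for illustration, not the thesis's exact scheme:

```python
import numpy as np

def sample_clip(video, temporal_resolution, duration_s=2.0):
    """STR: pick `temporal_resolution` frames per second over a window of
    `duration_s` seconds, spaced evenly across the clip.
    `video` has shape (num_frames, H, W, C)."""
    n = int(temporal_resolution * duration_s)
    idx = np.linspace(0, len(video) - 1, num=n).round().astype(int)
    return video[idx]

def mtr_mm_predict(video, models, resolutions):
    """MTR-MM: run one STR model per temporal resolution, then late-fuse
    the per-class scores by averaging them."""
    scores = [m(sample_clip(video, r)) for m, r in zip(models, resolutions)]
    return np.mean(scores, axis=0)

# Stub "models" standing in for 3D CNNs trained at 8 and 16 frames/second;
# each returns per-class scores for a 2-class toy problem.
video = np.zeros((60, 4, 4, 3))
models = [lambda clip: np.array([0.2, 0.8]),
          lambda clip: np.array([0.6, 0.4])]
fused = mtr_mm_predict(video, models, resolutions=[8, 16])
print(fused)  # [0.4 0.6]
```

The low-resolution model sees a coarse, long-range view of the motion while the high-resolution model sees fine detail; averaging their scores is one simple late-fusion choice that combines both views.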