Title
Developing a Methodology for Semantic Description of Dynamic Scene
Author
Othman, Noorhan Khaled Fawzy Ibrahim.
Preparation Committee
Researcher / Noorhan Khaled Fawzy Ibrahim
Supervisor / Mostafa Mahmoud Aref
Supervisor / Mohamed Abdel Rahman Marey
Publication Date
2021.
Number of Pages
172 p.
Language
English
Degree
Master's
Specialization
Computer Science (miscellaneous)
Approval Date
1/1/2021
Place of Approval
Ain Shams University - Faculty of Computer and Information Sciences - Computer Science
Table of Contents
Only 14 pages are available for public view from 172.

Abstract

Video is one of the most popular visual media for communication and entertainment between human beings. The recent exponential growth of online videos uploaded to social media applications such as Facebook and Twitter, to free video-sharing websites such as YouTube, and from surveillance cameras installed everywhere has caused the amount of video data to explode at an incredible pace. Dynamic scene understanding is the act of perceiving visual semantics from an observed sequence of video frames. In the context of our work, the term dynamic scene refers to frames of an input video containing moving human agents that behave, perform actions, and interact with each other or with their environment.
The main aim of this work is to develop a deep learning framework whose ultimate goal is generating semantically correlated natural language descriptions for human action events in an input video. Indeed, most input videos are untrimmed and contain numerous events that are interdependent and range across multiple time scales. The task of both localizing the temporal boundaries of event segments and generating a sentence per event is referred to as dense video captioning.
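To make the task concrete, the sketch below shows one way the output of a dense video captioning system could be represented; the type and field names are illustrative, not taken from the thesis.

```python
from dataclasses import dataclass

@dataclass
class EventCaption:
    """One temporally localized event in an untrimmed video (illustrative)."""
    start_sec: float  # temporal boundary: event start, in seconds
    end_sec: float    # temporal boundary: event end, in seconds
    sentence: str     # natural language description of the event

# A dense video captioning system maps a single untrimmed video to a list
# of localized, captioned events; segments may overlap and span different
# time scales, e.g.:
dense_captions = [
    EventCaption(0.0, 12.4, "A man walks into the kitchen."),
    EventCaption(10.1, 25.7, "He opens the fridge and takes out a bottle."),
]
```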
Dense video captioning can enable video streams to be found by search engines such as Google through search engine optimization techniques. Security officers can do an excellent job of detecting and annotating relevant information, but they simply cannot keep up with the terabytes of video being uploaded daily, so automated video analytics can be very helpful for organizing and indexing such very large video repositories. Captions also help audiences who are deaf or hard of hearing follow visual content and improve comprehension, and, when coupled with a text-to-speech system, automatic video captioning supports blind users.
The thesis discusses recent methods for recognizing, locating, and automatically captioning video-based human actions in long, untrimmed videos, using both shallow and deep machine learning methods. It provides a methodology based on an end-to-end deep neural network framework, built upon the encoder-decoder architecture used in machine translation, that identifies event clips, uses attended visual context information, extracts semantic concepts (verbs and nouns) that describe the event content effectively, and considers inter-event sentence dependency between detected events to provide more diverse and interconnected captions. We conducted experiments demonstrating that modelling both kinds of intra-event context information (visual and semantic) together with an attended fusion mechanism gives superior captioning results compared to using either kind of contextual information alone. Inter-event context also boosts performance, but it operates sequentially during training and inference, which slows both down.
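A minimal sketch of the attended-fusion idea described above, assuming PyTorch; the module, dimension, and variable names here are hypothetical, and the model in the thesis is considerably richer.

```python
import torch
import torch.nn as nn

class AttendedFusionCaptioner(nn.Module):
    """Illustrative encoder-decoder caption head that fuses two contexts.

    At each decoding step the decoder state attends separately over visual
    features (e.g. per-frame CNN features) and semantic concept features
    (e.g. verb/noun probabilities), and the two attended contexts are fused
    before predicting the next word. This is a sketch of the general idea,
    not the exact architecture of the thesis.
    """

    def __init__(self, vis_dim, sem_dim, hid_dim, vocab_size):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.sem_proj = nn.Linear(sem_dim, hid_dim)
        self.vis_attn = nn.MultiheadAttention(hid_dim, num_heads=1, batch_first=True)
        self.sem_attn = nn.MultiheadAttention(hid_dim, num_heads=1, batch_first=True)
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)  # attended fusion of both contexts
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.decoder = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, vis_feats, sem_feats, captions):
        # vis_feats: (B, T_v, vis_dim); sem_feats: (B, T_s, sem_dim)
        # captions:  (B, L) token ids of the sentence so far (teacher forcing)
        v = self.vis_proj(vis_feats)
        s = self.sem_proj(sem_feats)
        h, _ = self.decoder(self.embed(captions))   # decoder states as queries
        v_ctx, _ = self.vis_attn(h, v, v)           # attend over visual context
        s_ctx, _ = self.sem_attn(h, s, s)           # attend over semantic context
        fused = torch.tanh(self.fuse(torch.cat([v_ctx, s_ctx], dim=-1)))
        return self.out(fused)                      # (B, L, vocab_size) logits
```

In this sketch the fused context refines the decoder states after the fact; feeding the fused context back into the next decoding step, as is common in attention-based captioners, would require a step-by-step loop, which is also what makes the sequential inter-event conditioning mentioned above slower.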