Video Summarisation: Generating a summary of the content of a longer video

Research Story NU

Currently, video surveillance is used to monitor safety-critical places round the clock, and the surveillance industry alone contributes significantly to the massive volume of video being produced. This large amount of video makes it difficult to browse and analyse its content: computer vision techniques that process individual frames, such as video classification and object or event detection, slow down as the video length increases. This challenge demands video summarisation, in which relevant frames, known as keyframes (or segments), are extracted from the video to form a summary. Checking only this summary saves a lot of time that would otherwise be spent processing uninteresting sections of the video.

Dr Madhushree B carried out this research as a PhD scholar at the Department of Computer Science & Engineering, ITNU, under the guidance of Dr Priyanka Sharma. She explored techniques to address the problem through an unsupervised approach. The research work has three objectives, each investigated independently.

The first objective is to test the feasibility of employing colour-based local visual features in the keyframe selection process. Local visual features have been used successfully in object detection and image classification, so it is intriguing to see whether they also help identify keyframes. The second objective is to explore the role of deep visual features in keyframe extraction; while there are multiple ways to decide whether a frame is a valid keyframe, the deep visual features obtained from a pre-trained Convolutional Neural Network (CNN) are critical when extracting them. The third objective focuses only on the keyframes in which anomalous events occur: the sheer volume of surveillance video makes it challenging to search through footage for anomalous events, so an autoencoder-based video anomaly detection system is used to generate such event summaries.
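The story does not fix a particular backbone network, but a minimal sketch of the second objective, extracting deep visual features from video frames with a pre-trained CNN via transfer learning, might look like the following. The choice of VGG16, the 224x224 input size and the frame-sampling interval are illustrative assumptions, not the thesis's settings.

```python
# A minimal sketch of deep visual feature extraction via transfer learning.
# The backbone (VGG16) and sampling rate are assumptions for illustration.
import cv2
import numpy as np
import tensorflow as tf

# Pre-trained CNN with the classifier head removed; global average pooling
# turns the final convolutional maps into one feature vector per frame.
backbone = tf.keras.applications.VGG16(weights="imagenet",
                                       include_top=False, pooling="avg")

def frame_features(video_path, every_n=30):
    """Return a (num_sampled_frames, 512) array of deep frame features."""
    cap = cv2.VideoCapture(video_path)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:  # sample one frame every `every_n` frames
            rgb = cv2.cvtColor(cv2.resize(frame, (224, 224)),
                               cv2.COLOR_BGR2RGB)
            x = tf.keras.applications.vgg16.preprocess_input(
                rgb.astype(np.float32)[np.newaxis])
            feats.append(backbone.predict(x, verbose=0)[0])
        idx += 1
    cap.release()
    return np.stack(feats)
```

Clustering or scoring such per-frame vectors is what then drives keyframe selection in the methods described below.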

The thesis aims to generate generic and event-based summaries of surveillance videos, for both static (existing) videos and real-time recordings, using unsupervised machine learning techniques. The study showed that unsupervised methods are a more practical solution for video summarisation.

Both the generic and event-based models of unsupervised video summarisation were implemented using transfer learning, convolutional neural networks, recurrent neural networks and autoencoders. KSUMM, a proposed k-means clustering-based video summarisation approach, uses colour-based visual features of the constituent video frames, obtained by partially decoding the videos, together with unsupervised machine learning techniques to perform the summarisation task. The GVSUM (Generic Video Summarisation) approach uses a CNN to extract deep visual features via transfer learning and summarises the videos based on these features. The proposed automatic Anomalous Video Summary Generation (AEVSG) method uses a deep autoencoder to find anomalous clips in surveillance videos, presenting a summary containing only the clips of the input video that involve anomalous events.

These methods generated generic summaries with an F1 score of 0.78 and event-based summaries with an AUC (Area Under the Curve) of 97%, a satisfactory outcome on both standard and custom datasets.
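KSUMM's exact pipeline is not reproduced in this story, but its core idea, clustering frames by colour features with k-means and keeping one representative frame per cluster, can be sketched as follows. The HSV histogram binning and the number of clusters k are hypothetical choices for illustration only.

```python
# A sketch of k-means keyframe selection in the spirit of KSUMM, using a
# colour histogram as the per-frame feature. Binning and k are assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def colour_histogram(frame, bins=8):
    """Flattened 3-D HSV colour histogram, L1-normalised."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None,
                        [bins, bins, bins], [0, 180, 0, 256, 0, 256])
    return (hist / hist.sum()).flatten()

def select_keyframes(frames, k=10):
    """Cluster frames by colour and return one keyframe index per cluster."""
    feats = np.stack([colour_histogram(f) for f in frames])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    keyframe_ids = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # Keyframe = the member frame closest to the cluster centroid.
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        keyframe_ids.append(int(members[dists.argmin()]))
    return sorted(keyframe_ids)
```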
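Similarly, a common unsupervised formulation of autoencoder-based anomaly detection, in the spirit of AEVSG, trains the autoencoder on normal footage only and flags clips whose reconstruction error is high. The small dense architecture and the percentile threshold below are assumptions; the thesis's deep autoencoder is not specified in this story.

```python
# A minimal sketch of autoencoder-based anomaly scoring: train on normal
# clips only, then treat poor reconstruction as a sign of an anomaly.
import numpy as np
import tensorflow as tf

def build_autoencoder(input_dim):
    """Small dense autoencoder over per-clip feature vectors; a real system
    would likely use a convolutional (spatio-temporal) encoder instead."""
    inputs = tf.keras.Input(shape=(input_dim,))
    z = tf.keras.layers.Dense(128, activation="relu")(inputs)
    z = tf.keras.layers.Dense(32, activation="relu")(z)   # bottleneck
    z = tf.keras.layers.Dense(128, activation="relu")(z)
    outputs = tf.keras.layers.Dense(input_dim, activation="sigmoid")(z)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

def anomaly_scores(model, clips):
    """Per-clip reconstruction error; higher error suggests an anomaly."""
    recon = model.predict(clips, verbose=0)
    return np.mean((clips - recon) ** 2, axis=1)

# Usage sketch: fit on features of normal clips, then keep for the
# event-based summary only the clips whose error exceeds a high percentile
# of the errors seen on normal footage.
# model = build_autoencoder(normal_feats.shape[1])
# model.fit(normal_feats, normal_feats, epochs=20, batch_size=64)
# threshold = np.percentile(anomaly_scores(model, normal_feats), 99)
# summary_ids = np.where(anomaly_scores(model, all_feats) > threshold)[0]
```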