Project Idea & Goals

This Python tool was a group project for an artificial intelligence course; its inspiration came from the commercial product BriefCam. The goal of the project was to improve the experience of reviewing security camera footage. Security cameras generally produce either continuous footage or a set of shorter clips containing significant motion or detected objects. The system we produced can improve the review experience for either type of input. It works by taking any number of footage clips and combining them into a single clip of the same length. The following examples come from my inexpensive front door camera, representing an increasingly common use case. These three one-minute summarizations each combine all of the one-minute clips containing motion over two daytime periods and one night-time period.

Object Detection: YOLO & OpenCV

Any detected objects that moved significantly are overlaid in the summary clip, so the viewer receives a highly condensed version of all the clips. People, vehicles, and the rest of the roughly 80 object classes in the COCO dataset can be detected using the YOLO object detection network, which was pre-trained on that dataset. Without a GPU for image detection, the tool could only keep up with a real-time rate by processing every fifth frame of video, making it useful only to the extent that computer time is less valuable than a human's. However, when paired with a powerful GPU or other hardware optimized for neural network inference, the tool would be a legitimate timesaver for reviewing critical security camera footage.

We decided to use YOLOv3 because it produced very accurate results for the types of objects we cared about, namely people and vehicles, even on poor-quality footage from a cheap security camera. One unfortunate consequence is that we did not get the valuable and difficult experience of training our own neural network. Most other groups focused on that aspect, whereas the majority of our project involved manipulating and tracking image data. This was mostly done with OpenCV, an open-source computer vision and image processing library.
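As a rough sketch of how YOLOv3 detection looks when driven through OpenCV's DNN module (the file names, the 416x416 input size, and the detect helper are illustrative assumptions, not necessarily what our scripts used):

```python
import cv2
import numpy as np

# Load the pre-trained YOLOv3 network (config/weights file names are assumptions).
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getLayerNames()
out_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]

def detect(frame, conf_threshold=0.3):
    """Return (class_id, confidence, x, y, w, h) tuples for one frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    detections = []
    for output in net.forward(out_layers):
        for row in output:
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence < conf_threshold:
                continue
            # YOLO reports centre x/y and width/height as fractions of the frame.
            cx, cy, bw, bh = row[0] * w, row[1] * h, row[2] * w, row[3] * h
            detections.append((class_id, confidence, cx - bw / 2, cy - bh / 2, bw, bh))
    return detections
```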

Implementation Pipeline

The process runs as a five-stage pipeline:

Detector Stage:

This stage runs YOLOv3 on every fifth input frame; bounding boxes are stored for detections with confidence above 30%. This bounding box data is serialized to a file so that later stages can run without re-running the detector.
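A minimal sketch of the detector stage's loop, assuming the detect helper above and a pickle file as the serialization format (the real on-disk format may differ):

```python
import pickle
import cv2

def run_detector(video_path, out_path, frame_step=5, conf_threshold=0.3):
    """Run YOLOv3 on every fifth frame and serialize the raw detections."""
    cap = cv2.VideoCapture(video_path)
    detections_by_frame = {}          # frame index -> list of detections
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % frame_step == 0:
            # Keep only boxes whose confidence clears the 30% threshold.
            detections_by_frame[frame_idx] = detect(frame, conf_threshold)
        frame_idx += 1
    cap.release()
    # Serialize so the later stages can run without re-running the detector.
    with open(out_path, "wb") as f:
        pickle.dump(detections_by_frame, f)
```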

Track Stage:

The previous stage generates a large array containing the detections for many frames of the input video. The track stage passes this information into an open-source object tracker, which computes and interpolates tracks. A track is a unique object together with its path through the frames in which it appears.
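We relied on an off-the-shelf tracker rather than writing our own, but the core idea can be illustrated with a greedy IoU-based linking step, with the skipped frames filled in afterwards by linear interpolation. This is a simplified stand-in, not the tracker we actually used:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def link_tracks(detections_by_frame, iou_threshold=0.3):
    """Greedily link detections across processed frames into tracks.
    Each track is a dict of {frame index: (x, y, w, h)}; boxes for the
    skipped frames can be filled in later by linear interpolation."""
    tracks = []
    for frame_idx in sorted(detections_by_frame):
        for _, _, x, y, w, h in detections_by_frame[frame_idx]:
            box = (x, y, w, h)
            # Attach the detection to the track whose latest box overlaps it most.
            best = max(tracks, key=lambda t: iou(t[max(t)], box), default=None)
            if best is not None and iou(best[max(best)], box) >= iou_threshold:
                best[frame_idx] = box
            else:
                tracks.append({frame_idx: box})
    return tracks
```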

Filter Motion Stage:

This stage filters the tracks above down to only those that move more than a certain threshold, keeping static detected objects out of the final summarization.
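One simple way to express the motion check is to measure how far a track's box centre travels between its first and last appearance; the 50-pixel threshold below is an arbitrary placeholder, not the value we actually used:

```python
def filter_static_tracks(tracks, min_travel=50.0):
    """Keep only tracks whose box centre moves more than min_travel pixels."""
    def centre(box):
        x, y, w, h = box
        return (x + w / 2, y + h / 2)

    moving = []
    for track in tracks:
        frames = sorted(track)
        x0, y0 = centre(track[frames[0]])
        x1, y1 = centre(track[frames[-1]])
        if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 >= min_travel:
            moving.append(track)
    return moving
```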

Frame Assign Stage:

The output of the previous stage is a mapping from each unique object to the frames in which it appears and its location in each. To visualize the output, we need the opposite mapping: given a frame number, the unique objects present and their bounding boxes. This step computes that mapping.
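The inversion itself is a small dictionary transformation; a sketch, assuming the track structure from the earlier examples:

```python
from collections import defaultdict

def assign_to_frames(tracks):
    """Invert object -> {frame: box} into frame -> [(object id, box), ...]."""
    by_frame = defaultdict(list)
    for object_id, track in enumerate(tracks):
        for frame_idx, box in track.items():
            by_frame[frame_idx].append((object_id, box))
    return dict(by_frame)
```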

Visualize Stage:

The output of the previous stage is a mapping from frame number to the unique, moving objects in that frame and their positions. This stage reads in the video files, draws bounding boxes onto each frame in which we have identified objects, and then writes the frames to the output video file.
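A simplified sketch of this stage using OpenCV's VideoCapture and VideoWriter; it only draws boxes on a single clip, whereas the real tool also composites moving objects from many clips onto one output timeline:

```python
import cv2

def visualize(video_path, by_frame, out_path, fps=30.0):
    """Draw boxes on frames with identified objects and write the output video."""
    cap = cv2.VideoCapture(video_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for object_id, (x, y, w, h) in by_frame.get(frame_idx, []):
            cv2.rectangle(frame, (int(x), int(y)), (int(x + w), int(y + h)),
                          (0, 255, 0), 2)
            cv2.putText(frame, str(object_id), (int(x), int(y) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        writer.write(frame)
        frame_idx += 1
    cap.release()
    writer.release()
```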