Umbo Light: Comparing Artificial Intelligence versus Traditional IVS

Umbot
Umbo Computer Vision
6 min read · Jun 17, 2017


Introduction

Customers in the security industry have long had to deal with the challenge of monitoring and processing millions of hours of security camera footage. Human operators are expensive, find the job exhausting, and are frequently ineffective. Automating the monitoring of security footage and the alerting of threats saves companies money and frees security managers to deal with the more pressing parts of their jobs.

A number of different solutions have come to market with the goal of filling these needs. These traditional "intelligent video surveillance" (IVS) systems are prone to false alarms, inundating security managers to the point that they can no longer derive real value from the system.

The problem is that just putting a box around a single person as many “intelligent video surveillance” systems do is not good enough. A box gives very little context about what is actually there in the footage. In order to see a human in security video, Umbo’s computer vision models had to “see” the contours of the human shape.

Background

Virtually every commercial intelligent video surveillance system is built on top of motion detection — responding to and flagging moving objects in a video scene. There are a number of approaches toward accomplishing this.

The most widely used approach is "background subtraction". It is popular because it is accurate and fast to compute. The algorithm looks at the difference between the current image (such as a still frame in a video) and a reference image — often called the "background image". The background image is essentially a picture of the scene without anyone or anything moving in it. Differences between what the camera is currently seeing and the background image are assumed to be moving objects, such as a person. There are a number of variations of this algorithm with tweaks applied for specific reasons — to distinguish between subjects and their shadows, for example.
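The core of background subtraction can be sketched in a few lines. The sketch below is a simplified illustration using NumPy (the thresholds, array sizes, and pixel values are invented for the example; production systems use more sophisticated background models):

```python
import numpy as np

def background_subtract(frame, background, threshold=30):
    """Flag pixels that differ from the reference background image.

    frame, background: 2-D uint8 grayscale arrays of the same shape.
    Returns a boolean mask where True marks "foreground" (motion).
    """
    # Widen the dtype before subtracting to avoid uint8 wraparound.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

# A flat "empty scene" background, and a frame with a bright 3x3 square
# standing in for a person walking through the scene.
background = np.full((8, 8), 50, dtype=np.uint8)
frame = background.copy()
frame[2:5, 2:5] = 200  # the "moving object"

mask = background_subtract(frame, background)
print(mask.sum())  # number of pixels flagged as foreground -> 9
```

Note the cast to a wider integer type before subtraction: subtracting uint8 arrays directly would wrap around and silently hide real differences.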

"Temporal differencing" is a second commonly used technique. It looks at the differences between two or three consecutive images — frames in a video — on a pixel-by-pixel basis. Mathematically, it takes the current frame and "subtracts" the pixel values of the frame immediately before it.

Temporal differencing would identify “motion” by subtracting the stationary pixels in these two frames

The advantages of this algorithm are that it is fast to compute and adapts quickly to changing environments — a real issue for a security camera. However, if the scene itself changes (for example, if the camera is a type that moves), then a second algorithm is required to compensate.
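Temporal differencing can be sketched much like background subtraction, except the reference image is simply the previous frame. In the toy example below (sizes and values invented for illustration), a bright block shifts one column to the right between frames; the overlapping pixels cancel out and only the changed pixels are flagged:

```python
import numpy as np

def temporal_difference(prev_frame, curr_frame, threshold=25):
    """Pixel-wise difference between consecutive frames.

    Stationary pixels cancel out; only pixels whose intensity
    changed between frames are flagged as motion.
    """
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

# Two frames: a bright 2x2 block moves one column to the right.
prev_frame = np.zeros((6, 6), dtype=np.uint8)
prev_frame[2:4, 1:3] = 220
curr_frame = np.zeros((6, 6), dtype=np.uint8)
curr_frame[2:4, 2:4] = 220

motion = temporal_difference(prev_frame, curr_frame)
print(motion.sum())  # only the trailing and leading edges differ -> 4
```

This also shows the technique's weakness: the interior of a slowly moving, uniformly colored object cancels itself out, so only its edges register as motion.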

The main shared flaw in these algorithms' object recognition capabilities (especially for humans) is that they operate under a symbolic A.I. paradigm — their computer vision is rules-based. They rely on human programmers to create the rules to be triggered in the video image. To illustrate, a rules-based approach to human body modeling first sees a human as a cylinder, ribbon, or blob. But in each case a programmer has to tell the computer to look for a sequence of cylinders, ribbons, or blobs in the video image. The result works fine for a simple scene with well-defined activities but falls apart in complex situations where we need to be flexible in our interpretations.

In contrast to symbolic artificial intelligence, statistical artificial intelligence approaches these difficult problems in a wholly different way. These methods represent scientific and mathematical attempts to “program” an algorithm to make decisions that would otherwise be deemed too difficult or complex by a symbolic approach. Basically, the computer programs itself. Statistical A.I. has been behind many of the recent successes in the industry including that of AlphaGo and Google’s self-driving car. It is also the technology behind Umbo Light.
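The contrast can be made concrete with a toy example. Below, instead of a programmer hand-writing a rule like "a human blob is tall and solid", a single linear unit learns the boundary from labeled examples via simple perceptron updates. The features, labels, and learning rule here are invented for illustration — Umbo Light's actual models are deep neural networks trained on real video:

```python
import numpy as np

# Toy "detections": [aspect_ratio, fill_ratio] features for moving blobs.
X = np.array([
    [2.5, 0.6],  # tall, solid blob   -> human
    [2.8, 0.7],  # tall, solid blob   -> human
    [0.4, 0.9],  # wide, solid blob   -> vehicle
    [0.5, 0.3],  # wide, sparse blob  -> shadow/noise
])
y = np.array([1, 1, 0, 0])  # 1 = human, 0 = not human

# The "rule" (w, b) is fit from data, not written by a programmer.
w = np.zeros(2)
b = 0.0
for _ in range(100):  # simple perceptron updates
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += 0.1 * (yi - pred) * xi
        b += 0.1 * (yi - pred)

print([1 if xi @ w + b > 0 else 0 for xi in X])  # -> [1, 1, 0, 0]
```

The point is the workflow, not the model: the same training loop, fed different labeled data, produces a different classifier with no new rules written by hand.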

Light and IVS: Head to Head

The traditional IVS system identified many things as human that were not. One of the major difficulties that these systems faced was determining whether or not the moving image was a human or a human’s reflection in a mirrored plate of glass.

Determining a mirrored reflection without context would be difficult even for a human. The vast majority of the traditional IVS' false positive errors could be attributed to these reflections. While both types of systems would flag a mirror reflection at one point or another within the test time span, Umbo Light's neural network-based models were able to do so far less frequently. Unlike the traditional IVS, Umbo Light can improve its error rates with a little human feedback — security managers are able to flag false alerts, which tunes the camera's individual "model" and rapidly improves its performance.
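One simple way a per-camera model could incorporate operator feedback is by adjusting its alert threshold. The update rule and values below are assumptions for illustration only, not Umbo Light's actual mechanism:

```python
def tune_threshold(threshold, was_false_alert, step=0.02):
    """Raise the alert threshold after a flagged false alert;
    lower it slightly after a confirmed true alert.

    Clamped to [0.5, 0.99] so the camera never stops alerting
    entirely and never alerts on everything.
    """
    if was_false_alert:
        return min(0.99, threshold + step)
    return max(0.5, threshold - step / 2)

threshold = 0.80
# Operator flags two mirror-reflection alerts as false, confirms one real one.
for false_alert in [True, True, False]:
    threshold = tune_threshold(threshold, false_alert)

print(round(threshold, 2))  # -> 0.83
```

In practice the feedback would more likely retrain or fine-tune the model itself, but even this crude per-camera adaptation is something a fixed rules-based IVS cannot do.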

This reflected image was accurately rejected by the Umbo Light model as being not human.

One interesting challenge that motion-tracking traditional IVS systems were not able to overcome was avoiding flagging reflected light as motion.

This error was something that Umbo Light was able to easily avoid. The flaw is inherent to the specific ways motion detection is programmed into the traditional IVS system: any pixel change, including a patch of reflected light, registers as motion.

One major thing that the table's statistics are not able to convey is the difference in richness between the alerts the two systems produce.

Identical shots between a traditional IVS and Umbo Light show the difference in the richness of information that both systems provide to the user

The traditional IVS image on the left shows that the motion tracking system had been able to "identify" some of the people in the image. But the way in which it does this — by placing a box around the relevant area — tells very little about the context of the motion within the box itself. If there is a large set of pixels moving, is it a crowd of people or a car? If it is a crowd of people, how many?

Because Umbo Light's neural network models work directly on the visible video image data, they register scenes much the way a human does. They can alert on specific human shapes just as a human operator would — correctly identifying 19 of the 21 humans in the scene.

Beyond this simple demonstration are a number of exciting capabilities to be built out in the future. A box is just a box, but Umbo Light's visual models can then be trained to identify behaviors. Identifying and alerting on behaviors such as loitering, fighting, and items left behind would be just the start of what visual-based IVS systems like Umbo Light could do.

Citations

Kim, In Su, et al. “Intelligent visual surveillance — a survey.” International Journal of Control, Automation and Systems 8.5 (2010): 926–939.

Hu, Weiming, et al. “A survey on visual surveillance of object motion and behaviors.” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 34.3 (2004): 334–352.

Lipton, Alan J., Hironobu Fujiyoshi, and Raju S. Patil. "Moving target classification and tracking from real-time video." Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision (WACV '98). IEEE, 1998.
