For humans, vision is a primary sense for perceiving the world: we use sight to navigate, identify threats, and interpret behaviour. The eyes are by far our most important sense organ; while many other species rely heavily on their sense of smell to gather information, humans are estimated to perceive up to 80% of all impressions through sight.
Computer vision aims to give machines the ability to see. The field has become increasingly important as our expectations of modern machines rise. If we want self-driving cars, industrial pick-and-place robots, and lifelike assistants that interact with us naturally, then we need to build machines with visual capabilities comparable to our own.
So, how do we enable machines to see? At their most basic, computer vision systems analyze each pixel in an image to determine whether a given feature is present. This process is called feature extraction. Generally, approaches to feature extraction fall under two broad categories: model-driven and data-driven methods.
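To make this concrete, here is a minimal sketch of pixel-level feature extraction in Python: a hand-written filter that scores every pixel for one simple feature, vertical-edge strength. The toy image and the threshold are ours, purely for illustration.

```python
import numpy as np

def vertical_edge_map(gray: np.ndarray) -> np.ndarray:
    """Score each pixel by the horizontal intensity change around it.

    A large score means a strong vertical edge -- one simple, hand-coded feature.
    """
    gray = gray.astype(np.float32)
    # Difference between each pixel's right and left neighbours (a 1x3 [-1, 0, 1] filter).
    return np.abs(gray[:, 2:] - gray[:, :-2])

# Toy image: dark on the left half, bright on the right half.
img = np.zeros((8, 8), dtype=np.uint8)
img[:, 4:] = 255

edges = vertical_edge_map(img)
print("feature present:", bool((edges > 128).any()))  # True: a vertical edge exists
```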
Traditional computer vision techniques are model-driven and involve hand-coding features one at a time. However, the rise of deep learning in the last decade has prompted a shift towards data-driven methodologies. In the following sections, we’ll describe the strengths and weaknesses of each approach and indicate the path forward for modern computer vision systems.
Model-Driven Approaches to Computer Vision
Traditional model-driven algorithms look for features that an expert engineer has specified and hand-coded, and these feature models may contain many parameters. The idea is to identify all the features that define one class of object and then use this set of features as a definition of that object. We can then use the definition to search for the object in other images.
For example, if we want to identify images that contain dogs, we would first identify all the features that dogs have in common such as fur or ears. We would also need to think of all the features that distinguish dogs from cats or horses. Once we specify which features define dogs, we can look for them in images.
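Haar cascades are a classic example of this model-driven recipe: a hand-engineered set of rectangular contrast features assembled into an object definition. The sketch below assumes OpenCV is installed and borrows the face cascade that ships with it (OpenCV has no built-in dog cascade), then scans an image for matches; the file paths are placeholders.

```python
import cv2

# Load a classifier built from hand-engineered Haar features.
# OpenCV ships a face cascade; a dog detector would need its own trained cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

img = cv2.imread("photo.jpg")  # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Slide the feature set across the image at multiple scales.
detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in detections:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", img)
print(f"found {len(detections)} matches for the hand-coded object definition")
```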
Traditional model-driven approaches to computer vision are powerful because they rely on a strong understanding of the system and don’t require a large dataset to implement. However, making an exhaustive list of all the rules, exceptions and scenarios you need to accurately identify an object is hard and time-consuming.
To identify dogs, for example, we would need to hand-label the features of each breed, including size, shape, and fur characteristics. It sounds overwhelming, but the advent of artificial neural networks turned this traditional approach on its head: on some benchmark tasks, today's computer vision systems now approach or even exceed human-level accuracy.
Data-Driven Approaches to Computer Vision
Neural networks are algorithms that are loosely modelled on the structure of the human brain. They consist of many simple processing nodes, organized in layers and densely interconnected; large networks can contain millions of trainable weights. To train a network, the weights are initialized to random values and data is passed through the network in one direction. The network's output is then compared to the ground-truth labels, and the weights are adjusted, via backpropagation, to shrink the error. Through this process, the network 'learns' to correctly label input data.
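As a rough sketch of that training cycle, the PyTorch snippet below runs a forward pass on stand-in random data, compares the output with ground-truth labels, and nudges the weights to close the gap. The layer sizes, learning rate, and data are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# A tiny two-layer network: 64 input features -> 32 hidden nodes -> 3 classes.
# Weights start random, exactly as described above.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Stand-in data: 256 random 'images' with random ground-truth labels.
inputs = torch.randn(256, 64)
targets = torch.randint(0, 3, (256,))

for epoch in range(20):
    logits = model(inputs)            # forward pass: data flows one way
    loss = loss_fn(logits, targets)   # compare output to ground truth
    optimizer.zero_grad()
    loss.backward()                   # backpropagation computes weight adjustments
    optimizer.step()                  # adjust the weights to close the gap
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```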
This method introduced the concept of end-to-end learning, where the machine works out the most descriptive features for each object definition on its own. With neural networks, you don’t have to manually decide which features are important – the machine does the work for you.
If you want to teach a neural network to recognize a dog, you don't tell it to look for fur or a tail. Instead, you show it thousands of images containing dogs, and eventually the network learns by example what a dog looks like. This process is called supervised training because it depends on human-provided labels. If your network misclassifies cats as dogs, you simply label more training images and feed those to the network until its prediction accuracy improves.
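The snippet below sketches that end-to-end idea with a torchvision network pretrained on ImageNet, whose 1,000 classes include many dog breeds: raw pixels go in, a label comes out, and no feature was ever hand-coded. It assumes torchvision 0.13 or newer, and the image path is a placeholder.

```python
import torch
from torchvision import models
from PIL import Image

# A network trained end-to-end on ~1.2M labelled ImageNet photos.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()

preprocess = weights.transforms()  # resize, crop, and normalize the raw pixels
img = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # placeholder path

with torch.no_grad():
    probs = model(img).softmax(dim=1)

top = probs.argmax(dim=1).item()
print("predicted class:", weights.meta["categories"][top])
```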
Modern neural networks independently uncover patterns in an image by learning from labelled training images.
While adding complexity to a traditional computer vision model requires additional code, amending a neural network changes only the data and annotations; the framework remains the same. For this reason, deep neural networks are considered a data-driven approach to computer vision.
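Here is a hedged sketch of what "only the data changes" means in practice, assuming labelled images sorted into one folder per class (the directory layout is ours, for illustration): fixing the cat-versus-dog confusion above amounts to adding images to these folders and rerunning the same training code.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Labelled data laid out as data/train/dog/*.jpg, data/train/cat/*.jpg (placeholder paths).
# Adding harder examples to these folders is the whole 'code change'.
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("data/train", transform=tfm)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

print(train_set.classes)  # e.g. ['cat', 'dog'] -- labels come from folder names
# ...the training loop from the earlier sketch runs unchanged on train_loader.
```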
Which Approach to Use
There are clear trade-offs between model- and data-driven computer vision systems. With a data-driven approach, engineers can feed raw pixels to a deep neural network and classify objects of interest with higher accuracy and lower engineering overhead than traditional computer vision systems allow. Data-driven models are also far more versatile and can more readily accommodate complexity than traditional systems.
But since deep learning algorithms discover features by learning from examples, they need large data sets to achieve accurate results. To match the performance of a well-designed traditional computer vision algorithm, a neural network needs enough labelled training data to cover the full range of base cases and expected variations.
At Motion Metrics, we use deep neural networks to detect missing teeth on mining shovels. These networks are trained to account for occlusion and handle expected variations in lighting, pose, etc.
In comparison, traditional approaches are mature and proven and don’t need high-end hardware to get the job done. These systems are also fully transparent, whereas a deep neural network is a black box containing millions of parameters that are tuned during training.
Although deep neural networks are fabulous tools that have rapidly advanced the field of computer vision, they are not a panacea. Whether a computer vision problem is best solved with a deep neural network or a more traditional algorithm depends on several factors, including your access to data, hardware, and engineering resources.
Deep neural networks are better at handling large datasets with high dimensionality, but problems with limited expected variation are often best solved with traditional computer vision techniques that don’t consume excessive computing resources. In many cases, a hybrid approach can offer the best of both worlds through higher performance and better computing efficiency.
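To illustrate, the sketch below pairs a cheap traditional stage (Otsu thresholding and contour detection in OpenCV) that proposes candidate regions with a neural network that classifies only those crops. Here, classify_crop is a hypothetical stand-in for any trained model, such as the ResNet sketched earlier, and the image path is a placeholder.

```python
import cv2

def classify_crop(crop):
    """Hypothetical stand-in for a trained neural network classifier."""
    return "object"  # a real model would return a class label here

img = cv2.imread("frame.jpg")  # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Stage 1 (traditional, cheap): thresholding + contours propose candidate regions.
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Stage 2 (data-driven, expensive): run the network only on plausible crops.
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 1000:  # skip tiny regions
        label = classify_crop(img[y:y + h, x:x + w])
        print(f"region at ({x}, {y}): {label}")
```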