Vision

Most of us know that sunlight is responsible for sustaining life on our planet, but have you ever wondered how it shaped our lives? For starters, almost every being on Earth has some way to sense it (even some bacteria and single-celled organisms). Humans share this ability, but we have a much more complicated system for interacting with light. Light enters our eyes through a lens, the retina converts it into electrical signals, those signals travel along a cable-like structure (our nervous system), and our brain reconstructs them to tell us what our surroundings look like.

This process is what we call vision. It was a fundamental step in our evolution. It is so important that scientists have hypothesized that the development of centralized nervous systems (which ultimately led to our big brains) followed the advent of vision. It makes sense: without sensors capturing such vast amounts of information, why waste resources on building the machinery required to process it?

Significance of Vision to Humans

Watercolor image of three people in a park playing with a ball

If you have ever spontaneously kicked a ball, your brain performed a myriad of tasks unconsciously in a split second. It correctly identified the ball, tracked its movement, predicted its trajectory, calculated the speed at which the ball would arrive at your location, predicted your foot's trajectory, adjusted the strength and angle of impact, and sent the signal from your brain to your foot to change its position. Taking an image as input (in this case, the signal captured by our retina) and transforming it into information (kicking the ball) is the core of computer vision. We will go into more detail about this in the next chapter.

Remarkably, we need no formal education for this. We do not attend classes for most of the decisions we make daily. There is no Mental Math 101 for estimating the force required to kick a ball; we learned that through trial and error while growing up, and some of us may never have learned it at all. This is a striking contrast to the way we build programs, which are mostly rule-based.

Let's try to replicate just the first task our brain performed: detecting that there is a ball. One way to do it is to define what a ball is and then exhaustively search for one in the image. Defining what a ball is turns out to be surprisingly difficult. Balls can be as small as a tennis ball or as big as a Zorb ball, so size won't help us much. We could try to describe its shape, but some balls, like rugby balls, are not perfectly spherical. Not everything spherical is a ball either; otherwise bubbles, candies, and even our planet would all count as balls.

Balls

Pure Programming vs Machine Learning Approach

We could attempt a tentative definition and say "A ball is a sphere-like object used in sports or in play". It seems correct, but we run into another problem. How do you know people are playing a sport? What do you use to detect that they are doing so? What if it is a dog with a ball? Is that not a ball? What if the ball is on its own, with no people and no sport? And what about a shuttlecock? It is something we play with that is not perfectly spherical, yet we do not consider it a ball. All these nuances add up, so a simple problem that humans solve unconsciously is already hard to break down into simple rules. To see how quickly such rules become brittle, consider the sketch below.
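Here is a minimal, purely illustrative sketch of the rule-based approach in Python. Everything in it is hypothetical: the DetectedObject attributes and the thresholds are made up for illustration, and each rule immediately runs into the counterexamples above.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    sphericity: float   # 1.0 means a perfect sphere (hypothetical measurement)
    diameter_cm: float  # estimated physical size (hypothetical measurement)
    in_play: bool       # is someone playing with it? (hard to compute!)

def is_ball(obj: DetectedObject) -> bool:
    # Rule 1: a ball is roughly spherical... but a rugby ball fails this check.
    if obj.sphericity < 0.8:
        return False
    # Rule 2: a ball has a plausible size... but tennis balls and Zorb balls
    # span two orders of magnitude, so this range filters out almost nothing.
    if not 3 <= obj.diameter_cm <= 300:
        return False
    # Rule 3: a ball is used in play... but bubbles pass Rules 1 and 2,
    # and deciding "in play" is itself harder than detecting balls.
    return obj.in_play
```

Each rule patches one counterexample while letting another slip through, and the hardest input (in_play) quietly assumes we have already solved an even harder perception problem.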

We know these things ourselves. This implicit understanding comes from the mental images of what balls look like that we construct over the years. While a shuttlecock does not fit into that mental image of a ball, it is hard to explain why. It is not just its size or its feathers: there are balls of a similar size, and even if we covered a ball with feathers, we would still recognize it as a ball.

A ball covered with feathers

All of this is to show that our ability to distinguish objects extends beyond strict definitions; we often generalize from related concepts and rely on context clues. When a familiar concept takes on a different form, we can still identify it without much trouble. This ability comes naturally to us, but it is not inherent in systems governed by rigid, hard-coded rules.

This underscores the need for more robust systems, ones capable of adapting to a variety of scenarios. This is why the field is so closely related to artificial intelligence. Vision is context-rich, and we need models that can leverage these clues much as we do.
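To contrast with the rule-based sketch above, here is what the machine learning approach can look like in practice. This is a minimal sketch using the Hugging Face transformers object-detection pipeline with a DETR model trained on COCO, whose label set includes "sports ball"; the file name "park.jpg" is a placeholder for any image of your own.

```python
from transformers import pipeline

# Load a pretrained object detector; it learned what balls look like
# from labeled examples instead of hand-written rules.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Run detection on an image (replace "park.jpg" with your own file or URL).
for detection in detector("park.jpg"):
    if detection["label"] == "sports ball":
        print(f"Found a ball (confidence {detection['score']:.2f}) at {detection['box']}")
```

The model was never given an explicit definition of a ball; it generalized one from thousands of labeled images, which is exactly the shift in approach this section motivates.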

Let's take the example of Indiana Jones running from a boulder. There is something ball-shaped and there is running, but no one would call that a sport! We know this because we rely on context clues. The boulder Indiana Jones is running away from looks heavy and twice his size. His face reflects distress. The space is narrow and looks like a cave, which is an unusual setting for sports. And we recognize his attire, which is not how players usually dress.

The Motivation Behind Creating Artificial Systems Capable of Simulating Human Vision and Cognition

Although human vision and computer vision have similar inputs and outputs, they are different processes, even if they sometimes overlap. Computer vision is primarily concerned with developing and understanding the algorithms and models behind vision systems and their decisions. It is not constrained to creating systems that replicate human vision; it can also tackle problems that would be too tedious, time-consuming, expensive, or error-prone for humans. Our ball example is still a simple one, and it might not seem very useful on its own. However, a model capable of tracking a ball can be used in sports events to provide faster and fairer decisions during gameplay. With the popularization of image-to-text and text-to-speech models, we could also make live sports events more accessible to people with vision disabilities by automatically tracking the ball and the players and describing the action in real time. Even simple use cases can have a positive impact on society. We will discuss this further in Section 3.

We are now on the cusp of an AI renaissance: a moment in time when we can train, deploy, and share our models freely, and when our models can detect things in images that we would not be able to see ourselves.

The limits of computer vision keep expanding, too. We can now generate images from text and describe images in text, and we can do it from our smartphones. Computer vision applications are everywhere. The possibilities are ours to explore, and that is precisely what we will do in this course.

We welcome you to the field of computer vision. Take a seat. Enjoy the ride. It is going to be amazing.
