Where do CAPTCHA images come from?

CAPTCHA examples. Credit: Google.

«Select all images with a bus». Who knows how many times you have had to pass an online security check, the so-called CAPTCHA, and select all the boxes containing a bus, traffic lights, pedestrian crossings, or a bicycle. Have you ever wondered where these photographs come from and why they are shown to us? Many of the images used come from Google Street View. These tests serve to block access by bots, i.e. automated programs designed to carry out repetitive and often malicious actions.

The original system was designed in 2007 by researchers at Carnegie Mellon University and became the property of Google in 2009 with the acquisition of the company reCAPTCHA Inc. In its early incarnations, the system did not use photographs, but relied entirely on the interpretation of images containing deliberately distorted typographical characters.

The goal was to harness human visual capabilities to improve OCR (Optical Character Recognition) technology: users transcribed words that were difficult to recognize automatically, and their answers were used to train OCR systems. According to a 2012 TechCrunch article, «the system is designed to reduce spam and fraud, but it also helps digitize the text of printed materials, such as books and newspapers. Google uses reCAPTCHA, for example, to digitize the contents of Google Books and Google News archives».

Starting in 2012, the approach changed with the introduction of photographs extracted from the Google Street View project. After the first reports of images appearing in CAPTCHAs, a Google spokesperson confirmed the news with the following statement:

(At Google) we extract data like street names and street signs from Street View images to enhance Google Maps with useful information like business addresses and locations. Based on the data and results of these reCAPTCHA tests, we will determine whether using images can also be an effective way to further refine our tools to combat online abuse caused by bots and algorithms.

Some experts had even hypothesized that users' work in recognizing traffic lights, pedestrian crossings, vehicles, etc. served to train the artificial intelligence algorithms underlying the autonomous driving system of Waymo, a subsidiary of Google's parent company Alphabet, but around the middle of 2021 company representatives told Vox that «the company does not use this image data to train its autonomous cars».