April 13, 2026

We imagine what a person looks like just by hearing their voice: studies on the role of hormones

“Hello? Who is this?” A female voice answers from the other end of the line, with a deep timbre and a measured cadence. Even before we know exactly who the caller is, our mind has already set to work: an adult woman, confident, perhaps with dark hair, strong features and a calm expression. Imagining the face of the person speaking to us is not magic but biology: the development of facial features and of the vocal cords is in fact jointly driven by hormones. Our brain exploits this deep evolutionary link to instantly translate acoustic cues into a coherent visual and psychological identity. Moreover, the brain areas that process faces and voices are closely connected to one another in a complex multisensory network.

Imagining the face from the voice: a shared biology between sounds and features

Face and voice are not independent systems: they develop under the influence of the same chemical messengers, especially sex hormones during puberty. Researchers have found that these substances simultaneously shape both our vocal cords and the bone structure of our face.

Take testosterone, for example. Higher levels of this male hormone thicken the vocal cords, which translates into a lower, deeper voice. At the same time, the same hormone promotes the development of specific facial features, such as a wider jaw and a more prominent brow ridge. Estrogens, by contrast, limit the thickening of the vocal folds, keeping the voice higher-pitched, and help shape faces with larger eyes and fuller lips.

In evolutionary biology, this idea is known as the “backup signal” hypothesis. It holds that evolution has ensured that face and voice transmit redundant information, as if they were two backup copies of the same message, in order to communicate reliably our core physical characteristics, such as gender, age and health. The efficiency of this system is striking: recent experiments published in Scientific Reports, a Nature Portfolio journal, show that a listener only needs to hear a single vowel (for example a simple “a”) pronounced by a stranger to guess their gender and age with very high precision.

The imprint of movement and personality: the Cognition study

Beyond basic biology, there is a dynamic element that ties what we hear to what we see: the specific way we move our muscles to speak. When we articulate words, our face contracts, creating expressions that are entirely our own. The very mechanics of speech production shape our vocal tract, simultaneously determining both facial movements and the actual sound that comes out of the lips. There is, in other words, a kind of unique “signature” in our style of communication that the brain is able to pick up and translate from one sense to another.

As you might expect, we do not stop at picturing outward appearance. When we are exposed to a face or a voice, our mind immediately leaps ahead, trying to work out the personality of the person in front of us. A study published in Cognition addresses this, observing that we automatically assume that a relaxed, gentle voice belongs to a friendly face.

The voice-face association is processed by a multisensory brain network

On an anatomical level, there are direct structural connections between the areas of the brain responsible for the visual processing of faces and those dedicated to listening to voices, which together form a sophisticated multisensory neural network. An entire line of research is devoted to showing that, when presented with the voice of a stranger and photographs of faces never seen before, participants are able to match the voice to the face of the real speaker at a rate statistically above chance. Our brain analyzes acoustic details, picks up hormonal and behavioral cues in the background and instinctively searches for the face that fits those characteristics.

The face-voice connection is so ingrained that hearing a recently learned voice speeds up and improves our ability to visually recognize the face associated with it later on. Imagine shaking hands with a stranger who, as he introduces himself, reveals an unmistakable voice, perhaps because of an unusually high pitch or a very particular cadence. If, a few days later, you happen to see a photograph of him right after hearing his voice again, your brain will recognize that face in a fraction of a second. The speed-up is large enough to be measured: recognition is faster than for a person met the same evening but with a more common or “average” voice. This shows that the voice and the face are not stored in watertight compartments in our memory, but are intertwined from the very first moments we meet a person.

Sources

Bülthoff & Newell, 2017, Crossmodal priming of unfamiliar faces supports early interactions between voices and faces in person perception
Nagrani et al., 2018, Seeing Voices and Hearing Faces: Cross-modal biometric matching
Lavan, 2023, How do we describe other people from voices and faces?
Sorokowski et al., 2023, Comparing accuracy in voice-based assessments of biological speaker traits across speech types
Smith et al., 2016, Concordant Cues in Faces and Voices: Testing the Backup Signal Hypothesis
Masi et al., Multimodal Cues to Change Your Mind: The Intertwining of Faces, Voices, and Behaviors in Impression Updating

Alexander Marchall

Alexander Marchall is a distinguished journalist with over 15 years of experience in the realm of international media. A graduate of the Columbia School of Journalism, Alex has a fervent passion for global affairs and geopolitics. Prior to founding The Journal, he contributed his expertise to several leading publications.