Kairos Time: All Dogs Live in the Moment -- With their owner at their side, dogs can change at any age

[2019] Dogs perceive and spontaneously normalize formant-related speaker and vowel differences in human speech sounds

Citation: Holly Root-Gutteridge, Victoria F. Ratcliffe, Anna T. Korzeniowska and David Reby, Biology Letters, Volume 15, Issue 12, December 2019, https://doi.org/10.1098/rsbl.2019.0555

Notice

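I have tried translating the paper named in the title into Japanese. I corrected the machine-translated Japanese against the English text, but inconsistent notation and translation errors may remain. If you want to know the content precisely, please see the original paper.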

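Abstract

Domestic animals have been shown to recognize basic phonemic information from human speech and to recognize familiar speakers from their voices.

However, whether animals can spontaneously identify words across unfamiliar speakers (speaker normalization), or spontaneously discriminate between unfamiliar speakers across words, has not yet been investigated. In this study, we assessed these abilities in domestic dogs using the habituation–dishabituation paradigm.

We found that dogs habituated to a series of different short words from the same unfamiliar speaker significantly dishabituated to a novel word from a new speaker of the same gender.

This suggests that the dogs spontaneously categorized the initial speaker across different words.

Conversely, dogs habituated to the same short word produced by different speakers of the same gender significantly dishabituated to a novel word, suggesting that they had spontaneously categorized the word across different speakers.

These results indicate that the ability to spontaneously recognize the same phonemes across different speakers, and the ability to spontaneously recognize cues to the identity of unfamiliar speakers across utterances, are both present in domestic dogs and are thus not uniquely human traits.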

1. Background

Speech sounds vary among speakers owing to differences in body size, age, gender and other idiosyncratic attributes, and thus effective speech perception relies on a listener’s ability to recognize phonemes independent of such speaker variability, a perceptual mechanism known as speaker normalization. In human speech, vowels are represented by specific formant frequency patterns, but the absolute values of the formants vary across speakers owing to size-, age- or other individual differences in vocal tract length. Yet these speaker-related differences in formant values encode socially relevant indexical and identity cues across phonemes. Thus, human listeners must normalize these two dimensions of speech variation to recognize words across different speakers and to identify individual speakers across different words, an ability that was once posited to be uniquely human. Although some non-human animals can be trained to recognize phonemes across speakers and have also been shown to recognize familiar humans from their voices (review), both the extent to which animals can spontaneously perform speaker normalization to recognize words across unfamiliar speakers and their ability to spontaneously discriminate between unfamiliar speakers across speech sounds remain to be investigated.

Here, we use domestic dogs (Canis familiaris) to investigate these abilities in a non-human mammal that is regularly exposed to human speech utterances that function as interspecific signals. Indeed, dogs are known to recognize basic phonemic information, for example, when following commands (even in the absence of tonal cues), and can recognize familiar human voices speaking known phrases. However, in order to recognize words across speakers, dogs must attend to the relative positions of formants in human speech rather than to their absolute values by normalizing variation in the acoustic signal that is related to speaker identity or gender. Moreover, to discriminate between unfamiliar speakers, dogs must also be able to attend to these same speaker cues across different phonemes. As performing one task could preclude the other, we investigated whether dogs would spontaneously normalize variation in human speech to recognize words across speakers, and speakers across words, using the habituation–dishabituation paradigm. This paradigm has been used widely in perceptual studies involving animal or non-verbal participants, and has been used previously to explore dogs’ ability to discriminate conspecific barks produced by different individuals.
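The difference between relative and absolute formant positions can be illustrated with a minimal sketch. It assumes an idealized uniform tube closed at one end, a textbook approximation of a neutral vocal tract that the study does not use. The tract lengths and the speed of sound are hypothetical illustration values, not data from the paper.

```python
# Illustrative sketch only: an idealized quarter-wavelength tube model,
# not the stimuli or the analysis used in the study.
SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air


def tube_formants(tract_length_m, n_formants=3):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND_M_S / (4.0 * tract_length_m)
            for n in range(1, n_formants + 1)]


def formant_ratios(formants):
    """Express each formant relative to the first one (the relative pattern)."""
    return [f / formants[0] for f in formants]


# Hypothetical vocal tract lengths for a longer and a shorter tract.
long_tract = tube_formants(0.175)
short_tract = tube_formants(0.14)

print(long_tract)   # absolute formant values for the longer tract
print(short_tract)  # higher absolute values for the shorter tract
print(formant_ratios(long_tract), formant_ratios(short_tract))  # same relative pattern
```

In this simplified model, changing the tract length shifts every formant but leaves the ratios between formants unchanged. Real speakers differ in more than tract length, so this only shows why a relative pattern is more stable across speakers than absolute values.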

To investigate dogs’ ability to spontaneously discriminate between unfamiliar speakers, we tested whether dogs would habituate to a short series of different single syllable words [i.e. H-vowel-D] that varied only in the vowel and were produced by the same unfamiliar speaker, then dishabituate to the presentation of a new [H-vowel-D] word from a different speaker, then re-habituate to a final novel [H-vowel-D] word from the original speaker (electronic supplementary material, figure S1A). We predicted that if the dogs spontaneously categorized the identity of the initial speaker across words and recognized a change in speaker, then they would show a longer response to the dishabituation stimulus word than to the final habituation or re-habituation stimuli words.

Next, we investigated dogs’ ability to spontaneously normalize voice differences across speakers in order to discriminate between phonemes. We exposed them to four examples of the same word produced by four unfamiliar, same-gender speakers, then introduced a new speaker producing a new word (electronic supplementary material, figure S1B). We predicted that if the dogs spontaneously categorized the word produced by the different speakers, then they would show an increase in response duration to the dishabituation stimulus, demonstrating that they recognized the change in word and had spontaneously normalized production across speakers.

2. Methods and materials

Voices from 13 adult men and 14 adult women who were not familiar to the dogs were sampled with a randomized presentation of voices across conditions. We used four habituation, one dishabituation and one re-habituation sound stimulus trials with 6 s of silence between each audio stimulus presentation. Speaker identity and order of presentation of vowels were all pseudo-randomized across stimuli. For further details, see electronic supplementary material, Methods.

For trials in condition 1 (speaker discrimination), the discrimination of unfamiliar voices was tested with sequences using the voices of four unfamiliar speakers who produced monosyllabic words. Each stimulus word started with ‘h’ and ended in ‘d’, and included one of nine vowel sounds: ‘had’, ‘head’, ‘heard’, ‘heed’, ‘hid’, ‘hod’, ‘hood’, ‘whod’ and ‘hud’. In condition 2 (speaker normalization), the discrimination of the vowels [a], [i] and [o] was tested using ‘had’, ‘hid’ and ‘whod’. These vowels were chosen and paired so as to be clearly distinct from one another and difficult for dogs to confuse. In both conditions, half of the stimulus sequences involved female voices and the other half involved male voices. While these short words may be familiar to dogs, they are not typically used in commands in the English language.
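The trial structure for condition 1 (four habituation trials, one dishabituation trial and one re-habituation trial, as described in the Background and Methods) can be sketched as follows. The speaker labels and the function name are hypothetical, and plain random sampling stands in for the study's pseudo-randomization, whose exact procedure is not given here.

```python
import random

# The nine [H-vowel-D] stimulus words listed in the text.
HVD_WORDS = ["had", "head", "heard", "heed", "hid", "hod", "hood", "whod", "hud"]
SILENCE_S = 6  # silence between stimulus presentations, as stated in the Methods


def speaker_discrimination_sequence(original_speaker, new_speaker, rng):
    """Six trials: H1-H4 from one speaker, DH from a new speaker, RH from the original."""
    words = rng.sample(HVD_WORDS, 6)  # six distinct words, one per trial (assumed sampling)
    labels = ["H1", "H2", "H3", "H4", "DH", "RH"]
    speakers = [original_speaker] * 4 + [new_speaker, original_speaker]
    return list(zip(labels, speakers, words))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for label, speaker, word in speaker_discrimination_sequence("speaker_A", "speaker_B", rng):
    print(label, speaker, word, f"(then {SILENCE_S} s of silence)")
```

Every trial uses a different word, so the only change specific to the dishabituation trial is the speaker.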

A total of 70 dogs participated in the between-subject design study. Each dog heard six sounds, with 24 dogs retained in each of the two conditions (see electronic supplementary material for demographic details). Videos were assessed before coding and discarded if the dog either did not visibly respond to the stimulus by moving any part of their face or body including their eyes (n = 4 dogs) or was distracted during trials by non-stimulus sounds or events (n = 18). The stimuli were presented from an Apple iBook Air through a Behringer Europort MPA40BT-PRO speaker that was set to conversational volume (approx. 65 dB) and placed on one side of the dog, counterbalanced across subjects. The dogs’ reactions were filmed on a Sony FDR-AX100 camcorder positioned on a tripod. Duration was measured as the time between the initial onset of response (e.g. looking, ears moving into forward position, eyes looking in direction of the speaker, head turning or moving towards the speaker), until the dog stopped visibly responding or the beginning of the next trial. All above-mentioned responses were coded as ‘change in behaviour’. Lack of response was coded as duration equals zero. All videos were coded blind in Sportscode Gamebreaker 11 (Sportstec, Warriewood, NSW, Australia) by H.R.G. with 25% double-coded blind by A.T.K. (see electronic supplementary material for details).
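The duration rule described above can be written as a small function. This is a sketch of the stated coding rule, not the Sportscode workflow used by the authors; the function name, argument names and timestamps are hypothetical.

```python
def response_duration(onset_s, offset_s, next_trial_start_s):
    """Time from response onset until the dog stops responding or the next trial begins.

    onset_s is None when the dog did not respond; offset_s is None when the
    response was still ongoing at the start of the next trial.
    """
    if onset_s is None:
        return 0.0  # lack of response is coded as a duration of zero
    if offset_s is None:
        end = next_trial_start_s
    else:
        end = min(offset_s, next_trial_start_s)  # cap at the start of the next trial
    return max(0.0, end - onset_s)


# Hypothetical timestamps in seconds, for illustration only.
print(response_duration(1.0, 3.5, 10.0))    # response ends before the next trial
print(response_duration(8.0, None, 10.0))   # response cut off by the next trial
print(response_duration(None, None, 10.0))  # no response
```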

Statistical tests were performed in SPSS v. 25 (SPSS Inc., Chicago, IL, USA). Linear mixed effect models (LMEs) fitted with restricted-maximum-likelihood estimation were used to examine the effect of trial on listener response duration. Dog identity was included as a random effect and fixed effects included trial, dog sex, age in years, breed-group, recording location and speaker-gender, with the significance threshold set at p < 0.007 using a Bonferroni correction for multiple comparisons. The variables met LME assumptions and residuals were normal as indicated by Shapiro–Wilk tests.
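A Bonferroni correction divides the nominal significance level by the number of comparisons. The sketch below assumes a nominal level of 0.05 and seven comparisons. Both values are assumptions chosen because they reproduce the reported threshold of about 0.007; the text does not state them.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


# Assumed inputs: nominal alpha of 0.05 and seven comparisons (not stated in the text).
threshold = bonferroni_threshold(0.05, 7)
print(round(threshold, 4))  # 0.0071


def is_significant(p_value, corrected_threshold):
    """A result counts as significant only if p falls below the corrected threshold."""
    return p_value < corrected_threshold


# The between-condition comparison reported in the Results has p = 0.016.
print(is_significant(0.016, threshold))  # False
```

Under these assumptions, a p-value of 0.016 would pass an uncorrected 0.05 level but not the corrected threshold.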

3. Results

Duration of the dogs’ responses in each trial was not significantly different across conditions (F(1, 187.5) = 5.961, p = 0.016, with corrected threshold of p = 0.007). For both conditions, only the habituation trial factor had a significant effect on response duration, while there were no other significant fixed effects (p > 0.05 for all other variables, see electronic supplementary material for details).

The LME results were similar for both conditions: habituation trial had a significant effect on response duration (condition 1, speaker discrimination: F(5, 115) = 4.271, p = 0.001; condition 2, speaker normalization: F(5, 115) = 5.421, p < 0.001). Response duration decreased in both conditions from habituation trial 1 to trial 4 (condition 1: p = 0.047; figure 1a; condition 2: p = 0.001, figure 1b), showing that dogs habituated to the stimuli over time.

Figure 1. Boxplots of duration of response to stimulus sounds for (a) condition 1: speaker discrimination (n = 24 dogs), and (b) condition 2: speaker normalization (n = 24 dogs). P-values < 0.05 marked by *, p < 0.01 marked by **, p < 0.001 marked by ***, and outliers are marked by circles. H, habituation trial; DH, dishabituation trial; RH, re-habituation trial.

For both conditions, dogs’ response durations increased significantly for the dishabituation trial compared to final habituation trial 4 (condition 1: p = 0.007, condition 2: p = 0.001) and the re-habituation trial (condition 1: p = 0.001, condition 2: p < 0.001), showing that they dishabituated to the change in stimulus and re-habituated to the repeated stimulus. Response duration in the re-habituation trial was not significantly different from the final habituation trial 4 (condition 1: p = 0.413, condition 2: p = 0.778), while the dishabituation trial response duration was not significantly different from habituation trial 1 (condition 1: p = 0.467; condition 2: p = 0.953). Thus, the duration of dogs’ responses to the dishabituation trial was similar to that of their original response to the first stimulus.

These results show that dogs habituated to the same speaker producing four different words dishabituated to a new speaker producing a new word (figure 1a). This demonstrates that dogs can spontaneously categorize short words as belonging to the same unfamiliar speaker based on the presentation of a very limited set of four stimuli, and are thus able to detect a change in speaker identity when a new speaker produces a new word that was not used in the habituation sequence. Conversely, dogs habituated to the same word spoken by four different speakers of the same gender and then dishabituated to a new word spoken by a new speaker that differed only in its vowel, demonstrating that dogs detected a change in the vowel sound, which can only be achieved by categorizing the vowels as similar in the habituation sequence, despite speaker differences in formant frequencies (figure 1b).

4. Discussion

Our results provide the first demonstration that spontaneous speaker normalization is not unique to humans, as we show that domestic dogs can spontaneously discriminate the same words across speakers. We also show that dogs are capable of spontaneously discriminating between unfamiliar speakers of the same gender across different words, suggesting that they have the ability to extract identity information from unfamiliar human voices on the basis of very little acoustic exposure. As interindividual differences in pitch were removed from vocal stimuli, dogs could only discriminate the speakers based on filter-related cues common to the different vowels, and/or on subtle idiosyncratic information encoded in the surrounding consonants.

Previous work on speaker normalization in non-human animals has relied on training the animal to give a behavioural cue when they have successfully discriminated (for a review, see [20]). Our work builds on that of Baru [21], who trained dogs to discriminate between synthesized vowels [a] and [i] through recognizing formants as patterns, and responding by lifting a corresponding paw. However, as Baru’s result used only synthesized voices and required the dogs to participate in up to 400 conditioning/reinforcement trials with negative reinforcement electric shocks to achieve accuracy, this level of discrimination was unlikely to represent a spontaneous ability in dogs [21]. Other experiments using natural voices have demonstrated that such diverse species as zebra finches (Taeniopygia guttata) [22] and chinchillas (Chinchilla lanigera) [23], among others, can be trained to normalize speaker differences to discriminate vowels. However, these studies too do not represent spontaneous responses as the research paradigms likewise relied on trained behaviours to indicate discrimination. Here, we measured spontaneous responses to natural voice stimuli in a habituation–dishabituation experiment and found that dogs did not require special or extensive training to spontaneously normalize speakers and vowels.

Speech perception depends on the ability to parse relatively small differences in sounds and recognize these as meaningful [24]. Originally, it was believed that speech production and speech perception were inextricably linked abilities, and that perception required the brain to create a mental model of the articulatory gestures that produced the speech to recognize and categorize the sounds [24]. This ‘motor theory’ posited that speech perception was unique to modern humans, as earlier hominins and other animals could not articulate their vocal apparatus to produce speech sounds and therefore could not make the mental connection between articulatory motions and the perceived sounds [25,26]. However, Kuhl & Miller [23] hypothesized that the two mechanisms of production and perception are in fact separate, and, furthermore, suggested that speech perception may at least be partly independent of speech production. This was based on evidence that the ability to perceive speech sound differences is present in both very young human infants (less than 1 month old) and also non-human animals including chinchillas, neither of which can produce normal speech sounds [23,27]. Thus, their ‘general auditory ability’ hypothesis decoupled perception from production and suggested that humans have evolved speech that can exploit existing perceptual categories rather than originating new abilities [23,27]. Because dogs are not capable of speech production, our result that dogs can normalize speaker differences to categorize vowels from formants lends some support to this theory, suggesting that the ability to perform speaker normalization may be a latent ancestral trait. However, as dogs have undergone a long period of domestication of at least 13 000 years [28], it is possible that these normalization abilities result from artificial selection by humans for dogs that were more responsive to human vocal cues. Testing speaker normalization abilities in captive grey wolves (Canis lupus) that do not share the same domestication history may help to clarify this point.

We also show that dogs can spontaneously discriminate between unfamiliar human voices, even when the words spoken are not meaningful to the dogs, on the basis of very limited exposure to just four words. This builds on previous results for familiar voice recognition by both dogs [12] and cats [29,30]. Further investigations could establish which aspects of the human voice are most important for the dogs’ perception of speaker identity, and the effects that changing language, pitch or other forms of speech modulation have on dogs’ perceptions of speaker identity. It is known that wolves can recognize familiar conspecifics from their howls [16] and that dogs can recognize familiar humans by their speech [12], but it has not yet been established if this cross-species ability was present in wolves or was specifically selected for during the domestication process.

In conclusion, dogs were found to spontaneously discriminate between both phonemic and identity cues in human speech. Dogs normalized differences in vocal production between same-gender speakers to recognize vowels and they could also use these differences to help to discriminate between unfamiliar speakers within genders. Thus, spontaneous speaker normalization to recognize vowels from formant patterns is not a uniquely human trait.
