Nonlinear dynamics in auditory cortical activity reveal the neural basis of perceptual warping in speech categorization

Surrounding context influences speech listening, resulting in dynamic shifts to category percepts. To examine its neural basis, event-related potentials (ERPs) were recorded during vowel identification with continua presented in random, forward, and backward orders to induce perceptual warping. Behaviorally, sequential order shifted individual listeners’ categorical boundary, versus random delivery, revealing perceptual warping (biasing) of the heard phonetic category dependent on recent stimulus history. ERPs revealed later (∼300 ms) activity localized to superior temporal and middle/inferior frontal gyri that predicted listeners’ hysteresis/enhanced contrast magnitudes. Findings demonstrate that interactions between frontotemporal brain regions govern top-down, stimulus history effects on speech categorization.


Introduction
In speech perception, listeners group similar sensory cues to form discrete phonetic labels, a process known as categorical perception (CP). Spectral features vary continuously, but reducing acoustic cues to discrete categories enables more efficient use of speech sounds for linguistic processing. 1,2 The extent to which phonetic speech categories formed from acoustic-sensory cues are influenced by perceptual biasing (top-down influences) has been debated. On one hand, categories might arise from innate psychophysiological constraints. 3 Alternatively, there is ample evidence that top-down processing influences speech categorization, as suggested by enhancements observed in highly proficient listeners 4-7 and by biasing effects, in which individuals hear a different category depending on the surrounding speech context. 8

Changes in auditory-perceptual categories due to stimulus history are a form of nonlinear dynamics. Nonlinear dynamics in CP are especially prominent at the perceptual boundary, where different patterns of behavioral identification can result for otherwise identical speech sounds: hysteresis (i.e., the percept continuing in the same category beyond the theoretical boundary) or enhanced contrast (i.e., the percept changing to the other category before the theoretical boundary). 9-11 Both stop-consonant and vowel continua can produce context-dependent shifts in perception, though stronger perceptual warping occurs with more ambiguous speech sounds. 12

Event-related brain potentials (ERPs) have been used to examine the neural underpinnings of speech categorization. 13-15 ERPs reveal the brain performs its acoustic-to-phonetic conversion within ∼150 ms and differentiates even the same speech sounds when they are categorized with different perceptual labels. 13 Yet it remains unknown how neural representations of categories change with recent state history, as seen in hysteresis and other perceptual nonlinearities inherent to speech perception. 10 Shifting percepts near a categorical boundary due to presentation order (i.e., how stimuli are sequenced) should yield measurable neural signatures if speech perception is indeed warped dynamically.
Here, we evaluated the effects of nonlinear dynamics on speech categorization and its brain basis. We aimed to resolve whether perceptual hysteresis in CP occurs at early (i.e., auditory-sensory) or later (i.e., higher-order, linguistic) stages of speech analysis. We measured behavioral and multichannel EEG responses during rapid phoneme identification tasks in which tokens along an identical continuum were presented in random versus serial (forward or backward) order. Based on previous studies examining nonlinear dynamics 9,10 and top-down influences in speech CP, 4,5,7 we hypothesized (1) the location of listeners' perceptual boundary would shift according to the direction of stimulus presentation (i.e., random versus forward versus backward) and (2) perceptual warping would be accompanied by late modulations in the ERPs.

Stimuli and task
We used a 7-token (hereafter "Tk1-Tk7") vowel continuum from /u/ to /A/ synthesized in MATLAB (The MathWorks, Natick, MA) via a conventional source-filter implementation. Each 100 ms token had a fundamental frequency of 100 Hz (i.e., a male voice). Adjacent tokens were separated by equidistant steps in first formant (F1) frequency spanning from 430 Hz (/u/) to 730 Hz (/A/). We selected vowels over consonant-vowel (CV) syllables because pilot data suggested vowels were more prone to nonlinear perceptual effects (see supplementary material, Fig. S1, for details). 17 We delivered stimuli binaurally through insert earphones (ER-2; Etymotic Research, Elk Grove Village, IL) at 76 dBA SPL. Stimulus presentation was controlled by MATLAB coupled to a TDT RP2 signal processor (Tucker-Davis Technologies, Alachua, FL).
There were three conditions based on how tokens were sequenced: (1) random presentation, and two sequential orderings in which tokens stepped serially between the continuum end points in F1 frequency, (2) forward /u/ to /A/, 430-730 Hz (i.e., vowel lowering), and (3) backward /A/ to /u/, 730-430 Hz (i.e., vowel raising). Forward and backward directions along such a continuum were expected to produce perceptual warpings (i.e., hysteresis). 10 The random and serial order conditions were presented in three blocks (one random, one forward, one backward), with block order randomized between participants. We allowed breaks between blocks to avoid fatigue.
Within each condition, listeners heard 100 presentations of each vowel (total = 700 per block). On each trial, listeners rapidly reported which phoneme they heard with a binary keyboard response ("u" or "a"). Following their response, the interstimulus interval was jittered randomly between 800 and 1000 ms (20 ms steps, uniform distribution).

Psychometric function analysis
Identification scores were fit with the sigmoid P = 1/[1 + e^(−b1(x − b0))], where P is the proportion of trials identified as a given vowel, x is the step number along the continuum, and b0 and b1 are the location and slope of the logistic fit estimated using nonlinear least squares regression. 14,18 Leftward/rightward shifts in b0 location for the sequential versus random stimulus orderings would reveal changes in the perceptual boundary characteristic of perceptual nonlinearity. 10 These metrics were analyzed using a one-way mixed-model analysis of variance (ANOVA) (subjects = random factor) with a fixed effect of condition (three levels: random, forward, and backward) and Tukey-Kramer adjustments for multiple comparisons. Reaction times (RTs) were computed as the median response latency for each token per condition. RTs outside of 250-2000 ms were considered outliers (i.e., guesses or attentional lapses) and were excluded from analysis [n = 2487 trials (∼7%) across all tokens/conditions/listeners]. 13,14 RTs were analyzed using a two-way mixed-model ANOVA (subjects = random) with fixed effects of condition (three levels: random, forward, and backward) and token (seven levels).
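The logistic fit above can be sketched as follows. The study used MATLAB's nonlinear least squares; an equivalent Python sketch (via scipy.optimize.curve_fit, with hypothetical identification proportions standing in for real data) illustrates how b0 and b1 are estimated:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, b0, b1):
    """Logistic psychometric function: P = 1 / (1 + exp(-b1 * (x - b0)))."""
    return 1.0 / (1.0 + np.exp(-b1 * (x - b0)))

steps = np.arange(1, 8)  # token steps Tk1-Tk7
# Hypothetical proportions of "a" responses per token (illustration only)
p_a = np.array([0.02, 0.05, 0.20, 0.55, 0.85, 0.96, 0.99])

# Nonlinear least squares fit; b0 = boundary location, b1 = slope
(b0, b1), _ = curve_fit(sigmoid, steps, p_a, p0=[4.0, 1.0])
```

A leftward or rightward shift in the fitted b0 between sequential and random conditions would index the boundary movement described above.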

Cross-classification analysis of behavioral response sequences
To determine the effect of sequential presentation order (i.e., forward versus backward F1) on behavioral responses, we performed cross-classification analysis on single runs of the identification data (i.e., responses from tokens 1-7 or 7-1) in the Generalized Sequential Querier program. 19 This compared listeners' category labels for each continuum token (e.g., instances where Tk3 presentations were labeled as "u" versus "a") when the stimulus continuum was presented in the forward (i.e., rising F1) versus backward (i.e., falling F1) direction. Biasing due to presentation order was quantified using Yule's Q, an index of standardized effect size transformed from an odds ratio; it varies from −1 to +1 and is preferable to the raw odds ratio because it is relatively unskewed, affording more direct statistical analysis. 20 In the current application, a Q of +1 means "u" was selected more in the forward F1 condition and "a" more in the backward F1 condition; a Q of −1 indicates the opposite pattern; and values effectively equal to 0 indicate presentation order had no effect on response selection. This analysis allowed us to determine whether the direction of stimulus presentation (i.e., increasing/decreasing F1) shifted listeners' category labels toward one end point of the continuum or the other (i.e., evidence of perceptual hysteresis). The nonzero responses at Tk3/Tk5 were used to classify participants as "hysteresis" versus "enhanced contrast" listeners (i.e., those showing late versus early biasing in their category labeling). See supplementary material, Table S1, for details on the cross-classification analysis results. 17
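As a concrete sketch, Yule's Q can be computed directly from a 2 x 2 table of condition-by-response counts. The counts below are hypothetical, not the study's data:

```python
def yules_q(a, b, c, d):
    """Yule's Q from a 2x2 contingency table of counts:

                      labeled "u"   labeled "a"
        forward F1         a             b
        backward F1        c             d

    Q = (ad - bc) / (ad + bc), equivalently (OR - 1) / (OR + 1),
    bounded on [-1, +1], unlike the raw odds ratio.
    """
    return (a * d - b * c) / (a * d + b * c)

# Hypothetical counts for one boundary token:
# "u" responses dominate in the forward direction, so Q is positive
q = yules_q(80, 20, 30, 70)
```

Values near +1 or −1 here would indicate strong order-dependent biasing for that token, while values near 0 would indicate none.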

EEG recording
Continuous EEGs were recorded during the speech identification task from 64 sintered Ag/AgCl electrodes at standard 10-10 scalp locations (NeuroScan Quik-Cap array). 21 Continuous data were sampled at 500 Hz (SynAmps RT amplifiers; Compumedics NeuroScan, Charlotte, NC) with an online passband of DC-200 Hz. Electrodes placed on the outer canthi of the eyes and the superior/inferior orbit monitored ocular movements. Contact impedances were <10 kΩ. During acquisition, electrodes were referenced to an additional sensor placed ∼1 cm posterior to the Cz channel. Data were re-referenced to the common average for analysis.
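Common average referencing simply subtracts the instantaneous mean across all channels from each channel. A minimal sketch, with simulated numbers standing in for the recordings:

```python
import numpy as np

rng = np.random.default_rng(0)
eeg = rng.normal(size=(64, 500))  # (channels, samples); simulated recording

# Common average reference: subtract the across-channel mean at each sample,
# so the 64 channels sum to zero at every time point
car = eeg - eeg.mean(axis=0, keepdims=True)
```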

Cluster-based permutation analysis
To reduce data dimensionality, channel clusters were computed by averaging adjacent electrodes over five a priori left/right frontocentral scalp areas defined in previous speech ERP studies (see Fig. 2). 14,22 We used cluster-based permutation statistics 23 implemented in BESA Statistics 2.1 (BESA GmbH) to determine whether channel cluster ERP amplitudes differed with presentation order. This ran an initial F-test across the whole waveform (i.e., −200 to 800 ms), contrasting the random, forward, and backward F1 conditions. This step identified time samples and channel clusters where neural activity differed between conditions (p < 0.05). Critically, BESA corrects for multiple comparisons across space and time. This was followed by a second-level analysis using permutation testing (N = 1000 resamples) to identify significant post hoc differences between pairwise stimulus conditions (i.e., random/forward/backward stimulus orderings). Contrasts were corrected with Scheffé's test using Bonferroni-Holm adjustments. Last, we repeated this analysis for tokens 3-5, representing stimuli surrounding the categorical boundary where warping was expected.
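The logic of a cluster-based permutation test can be sketched generically. This is not BESA's implementation; the threshold, the sign-flip scheme, and the paired two-condition contrast (rather than the omnibus F-test) are illustrative simplifications, run here on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_cluster_mass(tvals, thresh=2.0):
    """Largest summed |t| over a contiguous run of supra-threshold samples."""
    best = cur = 0.0
    for t in np.abs(tvals):
        cur = cur + t if t > thresh else 0.0
        best = max(best, cur)
    return best

def paired_t(diff):
    """Pointwise paired t-statistic for a (subjects, timepoints) difference."""
    n = diff.shape[0]
    return diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))

def cluster_perm_p(cond_a, cond_b, n_perm=1000):
    """Permutation p-value: randomly flip each subject's condition labels
    (sign of the difference) to build a null for the max cluster mass."""
    diff = cond_a - cond_b
    observed = max_cluster_mass(paired_t(diff))
    null = np.empty(n_perm)
    for i in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        null[i] = max_cluster_mass(paired_t(diff * flips))
    return (null >= observed).mean()

# Simulated example: 12 subjects, 100 time samples, effect at samples 60-80
n_subj, n_time = 12, 100
cond_b = rng.normal(0.0, 0.5, (n_subj, n_time))
cond_a = rng.normal(0.0, 0.5, (n_subj, n_time))
cond_a[:, 60:80] += 1.5  # injected "late" condition effect
p = cluster_perm_p(cond_a, cond_b)
```

Because the test statistic is the maximum cluster mass across the whole waveform, the resulting p-value is corrected for multiple comparisons over time, which is the property the text relies on.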

Distributed source analysis
We used Classical LORETA Analysis Recursively Applied (CLARA) distributed imaging with a four-shell ellipsoidal head model [conductivities of 0.33 (brain), 0.33 (scalp), 0.0042 (bone), and 1.00 (cerebrospinal fluid)] on difference waves to determine the intracerebral sources accounting for perceptual nonlinearities in speech categorization. 24 Difference waves were computed as the voltage difference in ERPs for each of the three pairwise stimulus contrasts (i.e., random-forward, random-backward, forward-backward). All 64 electrodes were used (rather than the channel cluster subset) since full head coverage is needed to reconstruct inverse solutions. Source images were computed at a latency of 320 ms, where the scalp ERPs maximally differentiated stimulus order based on the cluster-based statistics [see Fig. 3(A)]. Correlations between changes in b0 and CLARA activations evaluated which source regions predicted listeners' perceptual warping of speech categories.

Psychometric function data
Listeners perceived vowels categorically in all three presentation orderings, as evidenced by their sigmoidal identification functions [Fig. 1(A)]. Slopes varied with presentation order [F(2,28) = 6.96, p = 0.0463]; this was driven by the forward condition producing stronger categorization than random (p = 0.0364) [Fig. 1(C)]. The categorical boundary did not appear to change with condition when analyzed at the group level [F(2,28) = 1.78, p = 0.1875] [Fig. 1(D)].
Despite limited changes in boundary location at the group level [Fig. 1(D)], perceptual nonlinearities were subject to stark individual differences [Figs. 1(E)-1(G)]. Some listeners were consistent in their percept of individual tokens regardless of presentation order [i.e., a "critical boundary" response pattern] (n = 1); others persisted with responses well beyond the putative category boundary at the continuum midpoint (i.e., hysteresis) (n = 9); and others changed responses earlier than expected (i.e., enhanced contrast) (n = 4). Response patterns were, however, highly stable within listeners; a split-half analysis showed b0 locations were strongly correlated between the first and last halves of task trials (r = 0.86, p < 0.0001). This suggests that while perceptual nonlinearities (i.e., b0 shifts) varied across listeners, response patterns were highly repeatable within individuals.
We performed further cross-classification analysis to characterize these individual differences in categorization nonlinearities. Table S1 shows participants' Yule's Q values at Tk3/5 (i.e., the tokens flanking b0) and, thus, their predominant "mode" of hearing the speech continua (see supplementary material for details on individual listening strategies). 17 Individuals with negative Qs showed hysteresis response patterns (n = 9), while those with positive Qs showed enhanced contrast patterns in perception (n = 4). Still others (n = 2) showed neither hysteresis nor enhanced contrast and thus no perceptual nonlinearities.

Electrophysiological data
Figure 2 shows scalp ERP channel clusters to token 4 (the critical stimulus at the perceptual boundary) across presentation orders (see supplementary material, Fig. S2, for raw ERP data). 17 Cluster-based permutation tests 23 revealed nonlinear (stimulus order) effects emerging ∼320 ms after speech onset, localized to left temporal areas of the scalp (omnibus ANOVA; p = 0.03) [Fig. 3(A), shading]. Condition effects were not observed in other channel clusters. Post hoc contrasts revealed order effects were driven by larger neural responses for the random versus forward F1 condition (p = 0.003). CLARA source reconstruction localized this nonlinear effect (i.e., ERP random@Tk4 > ERP forward@Tk4) to underlying brain regions in the bilateral superior temporal gyri (STG) and middle (MFG) and inferior (IFG) frontal gyri [Fig. 3(B)]. No differences were found when grouping neural responses by behavioral response patterns, including when accounting for differences in listeners' categorical boundaries. However, this might be expected given the low sample size ("n") within each subgroup.

We assessed the behavioral relevance of these neural findings via correlations between regional source activations (i.e., CLARA amplitudes at 320 ms) [Figs. 3(C) and 3(D)] and listeners' behavioral CP boundary locations (b0). We found modulations in right MFG (rMFG) and left IFG with stimulus order were associated with behavioral b0 shifts characteristic of perceptual warping, but in opposite directions. Listeners with increased rMFG activation for random versus ordered (forward) stimulus presentation showed less movement of their perceptual boundary (Pearson's r = −0.72, p = 0.0027). In contrast, modulations in left IFG with the direction of serial ordering (i.e., forward versus backward) were positively related to listeners' boundary shifts [Fig. 3(D)].

Discussion
By measuring EEG responses to acoustic-phonetic continua presented in different contexts (random versus serial orderings), our data expose the brain mechanisms by which listeners assign otherwise identical speech tokens to categories depending on context. Behaviorally, perceptual nonlinearities were more prominent for vowels compared to CVs (see supplementary material) 17 and were subject to stark individual differences. Behavioral warping corresponded with neural effects emerging ∼300 ms over the left hemisphere, with underlying sources in a frontotemporal circuit (bilateral STG, right MFG, left IFG). Our findings reveal that stimulus presentation order strongly influences the neural encoding of phonemes and suggest that sequential warpings in speech perception emerge from top-down, dynamic modulation of early auditory cortical activity via frontal brain regions.

Perceptual nonlinearities in categorization are stronger for vowels than CVs
We found vowels elicited stronger perceptual warping (i.e., changes in the CP boundary) than CV tokens (see supplementary material). 17 Vowels are generally perceived less categorically than CVs. 1,12,27,28 With the vowel state space already being more flexible than that of consonants, listeners are freer to alter perception based on the history of other vowels. Formant frequencies intrinsic to vowels vary relatively continuously but are static within a token; in contrast, formant transitions in CVs allow frequency comparisons within the stimulus itself. 29,30 Vowel percepts are thus more ambiguous categorically and, consequently, more susceptible to contextual influences and individual differences. 31 Indeed, we find the magnitude and direction of perceptual warping strongly varies across listeners, consistent with prior work on perceptual hysteresis in both the auditory and visual domains. 10,32

Perceptual warping of categories is subject to stark individual differences
Behaviorally, we found minimal group-level differences in psychometric functions, with only an increase in slope for the forward /u/ to /A/ direction versus random presentation. A change in identification slope indicates sequential presentation led to more abrupt category changes. The reason behind this direction-dependent effect is unclear but could be related to differences in perceptual salience between the continuum end points. We can rule out differences due to vowel loudness, as both /u/ and /A/ end points had nearly identical loudness according to ANSI (2007) 33 (/A/ = 71.9 phon; /u/ = 71.2 phon). 34 Alternatively, /A/ might have been heard as a more prototypical vowel (i.e., a perceptual magnet), 35 perhaps owing to its higher frequency of occurrence in the English language. 36,37 Another explanation is that tokens in the forward ordering increased in F1 frequency, and previous work has demonstrated listeners are more sensitive to changes in rising versus falling pitch. 38,39 Thus, the increase in F1 may be more salient from a pitch (or spectral percept) standpoint. Conversely, RTs were faster in sequential compared to random presentation orders. RTs index the speed of processing, which increases (i.e., slows down) for more ambiguous or degraded tokens 7,30 and decreases (i.e., speeds up) for more prototypical tokens. 25 Faster RTs during sequential presentation suggest a quasi-priming effect whereby responses to adjacent tokens were facilitated by the preceding (phonetically similar) stimulus.

Fig. 3. Perceptual nonlinearities in the auditory cortical ERPs emerge by ∼320 ms via interplay between frontotemporal cortices. (A) Cluster-based permutation statistics contrasting responses to the identical Tk4 (the continuum's midpoint) in random, backward, and forward conditions. Nonlinearities in speech coding emerge by ∼300 ms (highlighted region) in the left channel cluster. Line = maximal difference (322 ms). Negative = up. (B) CLARA source imaging contrasting the difference in activations to Tk4 during random versus forward conditions. Nonlinearities in perceptual processing localize to bilateral superior temporal gyri and middle/inferior frontal gyri. (C) and (D) Brain-behavior correlations between the change in regional source activations and the magnitude of the hysteresis effect. Changes in rMFG contrasting "randomness" (i.e., Δ random-forward) are negatively associated with shifts in the CP boundary. Contrastively, modulations in left IFG contrast the direction of serial ordering (i.e., Δ forward-backward) and are positively related to behavior.
Behavioral changes in category boundary location were most evident at the individual rather than the group level (cf. Refs. 8 and 40) and when speech tokens were presented sequentially. These findings suggest stimulus history plays a critical role in the current percept of phonemes. Listeners demonstrated three distinct response patterns (hysteresis, enhanced contrast, and critical boundary; see supplementary material, Table S1), 17 differences that were largely obscured at the group level. This is consistent with previous work demonstrating trial-by-trial differences in the nonlinear dynamics of speech categorization. 9-11 Critically, response patterns were highly stable within individuals, suggesting listeners have a dominant response pattern and/or apply different decision strategies (cf. biases) during categorization. This latter interpretation is also supported by the different regional activation patterns and their behavioral correlations. It is also reminiscent of lax versus strict observer models in signal detection frameworks, where, for suprathreshold stimuli, listeners' response selection is primarily determined by their internal bias (i.e., a preference for tokens at one end of the continuum). 41

Electrophysiological correlates of perceptual warping
ERPs revealed late (∼320 ms post-stimulus) differences in response to token 4 (i.e., the categorical boundary) between the forward and random conditions over the left hemisphere. Sound-evoked responses in auditory cortex typically subside after ∼250 ms. 42,43 This suggests the stimulus order effects observed in our speech ERPs likely occur in higher-order brain regions subserving linguistic and/or attentional processing. The leftward lateralization of responses also suggests context-dependent coding might be mediated by canonical language-processing regions (e.g., Broca's area). 44 Indeed, source analysis confirmed engagement of extra-auditory brain areas including IFG and MFG, whose activations scaled with listeners' perceptual shifts in category boundary. In contrast, auditory STG, though active during perceptual warping, did not correlate with behavior, per se.
Beyond its established role in speech-language processing, left IFG is heavily involved in category decisions, particularly under states of stimulus uncertainty (i.e., randomness, noise). 7,14,31 Relatedly, we find direction-related modulations in the perceptual warping of speech categories (to otherwise identical sounds) are predicted by left IFG engagement. IFG involvement in our tasks is consistent with notions that frontal brain regions help shape behavioral category-level predictions at the individual level. 45 Contrastively, rMFG correlated with changes in behavior between random versus forward stimulus presentation, a contrast of ordered versus unordered sequencing. MFG regulates behavioral reorienting and serves to break (i.e., gate) attention during sensory processing. 46 Additionally, it is active when holding information in working memory, such as when performing mental calculations, 47 and has been implicated in processing ordered numerical sequences and counting. 48 The observed perceptual nonlinearities induced by serial presentation might therefore be driven by such buffer and comparator functions of rMFG as listeners hold prior speech sounds in memory and compare present to previous sensory-memory traces. In contrast, unordered speech presented back to back would not load those operations, which may explain the reduced rMFG activity for random presentation. The simultaneous activation of canonical auditory areas (STG) concurrent with these two frontal regions leads us to infer that while auditory cortex is sensitive to category structure (present study; Refs. 7 and 14), top-down modulations from the frontal lobes dynamically shape category percepts online during speech perception.