• databases@iab-rubric.org
  • IIT Jodhpur

Audio Negation Benchmark


Abstract. Text-to-audio models are expected to interpret prompts with precision and generate audio outputs accordingly. Large-scale training has improved their ability to follow textual instructions. However, the role of negation in such models remains underexplored. This study investigates whether audio generation systems can reliably distinguish between affirmative and negated prompts, for example, “A person is whistling” versus “A person is not whistling.” To enable systematic evaluation, we introduce a benchmark dataset containing one million negated prompts, covering diverse linguistic and acoustic contexts. We further propose strategies in both the text and audio domains to analyze how negation is represented and processed. Experimental results show that current audio generation models fail to handle negation, often collapsing affirmative and negated prompts into similar outputs. These findings reveal a critical limitation in generative models and highlight the need for specialized benchmarks and training objectives to advance negation comprehension in multimodal generative systems.

  • 1,000,000 negated prompts
  • 4 negation types
  • 3 negation scopes
  • 3 T2A models tested
  • 0.96 negation diversity index
  • 99.6% human-validated prompts

The Negation Benchmark

The negative prompt dataset is built from AudioCaps, a large-scale dataset of nearly 90K audio clips paired with human-written captions; we refer to these as the original audio and the original prompt. We use the original prompts to generate negative prompts. Negation can be introduced in many ways: for example, the caption “Fire truck sirens” can be expressed as “Fire truck sirens are absent” or “The area is quiet without any truck sirens.” To produce a rich one-to-many mapping, we create a text negation module that takes the original caption as input and outputs negative captions in different writing styles. This yields 1 million negative prompts, which we then provide to a text-to-audio generation model to assess how well it processes negation.

Examples

In the examples below you can listen to:

  • Original Audio: the ground-truth recording from AudioCaps.
  • Positive Audio: generated by a T2A model from the original caption.
  • Negative Audio: generated from negated prompts in our dataset, covering Lexical, Syntactic, Semantic, and Mixed negations.

Each cell shows a spectrogram (top) and an audio player (bottom). Notice how the negative outputs often look and sound nearly identical to the positive ones — the signature of affirmation bias in current T2A models.

Example 1: “Fire truck sirens”

| | Original Audio | Positive Audio | Lexical Negation | Semantic Negation | Syntactic Negation | Mixed Negation |
| --- | --- | --- | --- | --- | --- | --- |
| Spectrogram | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] |
| Audio | siren_original.wav | siren_positive.wav | siren_neg_lex.wav | siren_neg_sem.wav | siren_neg_syn.wav | siren_neg_mix.wav |
| Caption | Fire truck sirens | Fire truck sirens | Fire truck sirens are absent | The area is quiet without the fire truck sirens. | There are no fire truck sirens. | No fire truck sirens are heard. |

Example 2: “Multiple gun shots”

| | Original Audio | Positive Audio | Lexical Negation | Semantic Negation | Syntactic Negation | Mixed Negation |
| --- | --- | --- | --- | --- | --- | --- |
| Spectrogram | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] |
| Audio | gunshot_original.wav | gunshot_positive.wav | gunshot_neg_lex.wav | gunshot_neg_sem.wav | gunshot_neg_syn.wav | gunshot_neg_mix.wav |
| Caption | Multiple gunshots | Multiple gunshots | Multiple gunshots are absent | There are no gunshots | No multiple gunshots | The scene is quiet without any gunshots |

Example 3: “Someone whistles a song”

| | Original Audio | Positive Audio | Lexical Negation | Semantic Negation | Syntactic Negation | Mixed Negation |
| --- | --- | --- | --- | --- | --- | --- |
| Spectrogram | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] | [spectrogram] |
| Audio | whistle_original.wav | whistle_positive.wav | whistle_neg_lex.wav | whistle_neg_sem.wav | whistle_neg_syn.wav | whistle_neg_mix.wav |
| Caption | Someone whistles a song | Someone whistles a song | The whistling of a song is absent | There is no whistling of a song | Someone does not whistle a song | The person is silent instead of whistling a song |

Text Negation Module

This module produces negative prompts for each original caption in two steps:

  1. Sound event counting. Count the sounding objects in the caption. (Example: “Moving of objects followed by a beep” has two sounding objects: moving objects and beep.)
  2. Prompt generation. Generate negative prompts based on the negation type, negation scope, and sound count.

The number of prompts generated for an original caption depends on how the caption is categorized, that is, on its sound count and on the negation types and scopes that apply to it. Examples are shown in the diagrams below.

Categorization flow diagram
Figure 1. Categorization flow. The original caption is decomposed into sounding objects, then negative prompts are produced by combining a Negation Type with a Negation Scope for the given Sound Count.
Example outputs for each negation type and scope
Figure 2. Example negative prompts produced for each combination of negation type, scope, and sound count.
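
The two steps of the module can be illustrated with a small rule-based sketch. This is only an illustration under stated assumptions, not the benchmark's actual generator: the templates, the connective list, and the scope names ("all", "event_0", "event_1") are hypothetical, and the function names count_sound_events and generate_negations are introduced here for the example.

```python
import re

# Hypothetical templates, one per negation type, loosely modeled on the
# example captions on this page. They are NOT the benchmark's actual templates.
TEMPLATES = {
    "lexical": "absence of {event}",
    "syntactic": "no {event}",
    "semantic": "the scene is quiet without any {event}",
    "mixed": "no {event} can be heard in the quiet scene",
}

# Assumed connectives that separate sounding objects within a caption.
CONNECTIVES = r"\s+(?:followed by|and then|and|while)\s+"


def count_sound_events(caption):
    """Step 1: split a caption into sounding objects (a simple heuristic)."""
    parts = re.split(CONNECTIVES, caption.strip().rstrip("."), flags=re.IGNORECASE)
    # Lowercase and drop a leading article so events slot into the templates.
    return [re.sub(r"^(?:a|an|the)\s+", "", p.strip().lower()) for p in parts if p.strip()]


def generate_negations(caption):
    """Step 2: emit one prompt per (negation type, scope) pair."""
    events = count_sound_events(caption)
    prompts = []
    for neg_type, template in TEMPLATES.items():
        negated = [template.format(event=e) for e in events]
        # Scope "all": every sounding object is negated.
        prompts.append((neg_type, "all", " and ".join(negated)))
        # Scope "event_i": only object i is negated (needs sound count >= 2).
        if len(events) > 1:
            for i, clause in enumerate(negated):
                kept = " and ".join(e for j, e in enumerate(events) if j != i)
                prompts.append((neg_type, f"event_{i}", f"{clause}, while {kept} can be heard"))
    # Capitalize the first letter of each prompt.
    return [(t, s, p[0].upper() + p[1:]) for t, s, p in prompts]


for row in generate_negations("Moving of objects followed by a beep"):
    print(row)
```

In this sketch a caption with sound count 1 yields four prompts (one per negation type) and a caption with sound count 2 yields twelve, which shows why the number of prompts depends on the categorization. The actual counts per category in the benchmark may differ.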

Evaluation Protocols

Text Modality

Evaluation protocols in text modality include:

  • Embedding similarity on a frozen encoder.
  • Embedding similarity on a fine-tuned encoder.
  • Scores on NegBLEURT.
Aim: measure how similar the negative re-captions are to the original affirmative baseline.
Text-modality evaluation pipeline
Figure 3. Text-modality evaluation pipeline — (i) fine-tune MPNet so that positive and negative pairs differ, then evaluate on sentence similarity; (ii) fine-tune a BLEURT-style regression head to output a contradiction-vs-entailment score.
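
The embedding-similarity protocol can be sketched as a cosine similarity between the original caption and each negated prompt. The sketch below is a toy illustration only: embed is a bag-of-words stand-in introduced for this example, not the MPNet encoder mentioned in Figure 3, so the scores it prints say nothing about the benchmark results. The captions are taken from Example 3.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in encoder: bag-of-words counts (an assumption, not MPNet)."""
    return Counter(text.lower().replace(".", "").split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


original = "Someone whistles a song"
negated = {
    "lexical": "The whistling of a song is absent",
    "semantic": "There is no whistling of a song",
    "syntactic": "Someone does not whistle a song",
    "mixed": "The person is silent instead of whistling a song",
}

# Compare every negated prompt against the affirmative baseline.
scores = {neg_type: cosine(embed(original), embed(text)) for neg_type, text in negated.items()}
for neg_type, score in scores.items():
    print(f"{neg_type:10s} {score:.3f}")
```

With a real sentence encoder in place of embed, a high score for a negated prompt would indicate that the encoder does not separate it from the affirmative caption; the fine-tuned variant of the protocol aims to push these scores down.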

Audio Modality

Evaluation protocols in audio modality include:

  • Embedding similarity on frozen audio encoders.
  • Audio Question Answering (AQA).

Sample AQA questions for sound count 1 and sound count 2 are shown below.

AQA question design
Figure 4. Sample AQA questions for sound count 1 and sound count 2. For each NEGATIVE AUDIO (LEX / SYN / SEM / MIX), the option set is dynamically restricted to the same negation type, the original, and two distractors. The table on the bottom-right reports the number of questions and options per question for each combination.
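
The option-set rule described in Figure 4 can be sketched as follows. This is a minimal illustration under several assumptions that the page does not confirm: the question wording is hypothetical, distractors are drawn from a caller-supplied pool of unrelated captions, and the negated caption of the matching type is treated as the target answer. The name build_aqa_question is introduced only for this example.

```python
import random


def build_aqa_question(original, negated_by_type, neg_type, distractor_pool, seed=0):
    """Assemble a four-option question for one negative audio clip."""
    rng = random.Random(seed)  # seeded so the option order is reproducible
    # Only the negated caption of the same negation type is offered.
    target = negated_by_type[neg_type]
    # Distractors must differ from both the original and the target caption.
    candidates = [c for c in distractor_pool if c not in (original, target)]
    options = [target, original] + rng.sample(candidates, 2)
    rng.shuffle(options)
    return {
        "question": "Which caption best describes the audio?",  # hypothetical wording
        "options": options,
        "answer_index": options.index(target),  # assumption: negated caption is the target
    }


siren_negations = {
    "lexical": "Fire truck sirens are absent",
    "semantic": "The area is quiet without the fire truck sirens.",
    "syntactic": "There are no fire truck sirens.",
    "mixed": "No fire truck sirens are heard.",
}
pool = ["Multiple gunshots", "Someone whistles a song", "Fire truck sirens"]

question = build_aqa_question("Fire truck sirens", siren_negations, "lexical", pool)
print(question)
```

Restricting the options to one negation type keeps each question focused on whether the audio reflects the negated or the affirmative caption, rather than on telling negation styles apart.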