Annotated Evaluations and Annotation Bias
- Dr Stephen Anning
- Oct 20, 2025
- 11 min read
Introduction
The increasing integration of artificial intelligence (AI) into security and defence applications necessitates rigorous evaluation frameworks to ensure reliability, fairness, and interpretability. While quantitative benchmarks (e.g., accuracy, F1 score) provide useful performance indicators, they fail to capture the complexities of human interaction with AI. Capturing these complexities is particularly critical for human-centric applications, where AI systems are designed to generate insights about human behaviour. Over-reliance on quantitative methods risks neglecting the qualitative dimensions of human experience, leading to systems that are misaligned with real-world social and ethical considerations.
This review synthesises recent research on qualitative methods for evaluating AI and natural language processing (NLP) models, with a focus on the role of diversity in annotation, epistemic challenges in AI validation, and bias detection in AI-generated intelligence. Drawing from studies on annotator disagreement, qualitative safety assessments, and issue bias in large language models (LLMs), we examine how qualitative methods can improve AI evaluation in defence and security contexts.
The Limitations of Quantitative AI Evaluation
Traditional AI evaluation relies heavily on numerical benchmarks, assuming a single metric can summarise a model’s performance. However, this paradigm fails to account for human subjectivity, contextual nuance, and socio-political influences on AI-generated insights (Nickel, 2024).
Validity Challenges in AI Model Evaluation: The train-test paradigm used in machine learning is often invalid for developing human-centric systems, as it assumes that test data accurately represents the real-world distribution (Nickel, 2024). In security and intelligence applications, where AI models process dynamic and context-sensitive information, passive data collection methods introduce significant epistemic blind spots, leading to unreliable AI-generated insights.
Aggregating Annotator Disagreement: Many NLP tasks assume that a single "ground truth" label exists for each data point. However, research demonstrates that annotators from different backgrounds consistently disagree on subjective classification tasks, particularly in areas such as toxicity detection, misinformation labelling, and ethical AI evaluations (Jiang et al., 2024). Aggregating these perspectives into a single label erases important socio-political context.
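To make the aggregation problem concrete, the following minimal sketch (with invented labels and annotator groups, purely for illustration) shows how a standard majority-vote pipeline collapses a split set of toxicity judgements into a single "gold" label, while a perspectivist alternative keeps the full distribution.

```python
from collections import Counter

# Hypothetical toxicity judgements for one post from five annotators
# drawn from different backgrounds (labels and group names are invented).
annotations = [
    {"annotator_group": "group_a", "label": "not_toxic"},
    {"annotator_group": "group_a", "label": "not_toxic"},
    {"annotator_group": "group_a", "label": "not_toxic"},
    {"annotator_group": "group_b", "label": "toxic"},
    {"annotator_group": "group_b", "label": "toxic"},
]

counts = Counter(a["label"] for a in annotations)

# Majority-vote aggregation: the standard "gold label" pipeline.
gold_label = counts.most_common(1)[0][0]          # -> "not_toxic"

# Perspectivist alternative: keep the full label distribution,
# so the 40% "toxic" judgement from group_b is not erased.
total = sum(counts.values())
label_distribution = {label: n / total for label, n in counts.items()}

print(gold_label)           # not_toxic
print(label_distribution)   # {'not_toxic': 0.6, 'toxic': 0.4}
```

The single gold label records no trace of the dissenting group, which is exactly the socio-political context that aggregation erases.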
Together, these validity challenges and the realities of annotator disagreement underscore the urgent need for qualitative evaluation methods that account for human perspectives, ethical considerations, and real-world deployment contexts.
The Role of Diversity in AI Evaluation
Recent research highlights the importance of incorporating diverse human perspectives into AI evaluation frameworks, particularly in the context of security and defence applications. Several studies emphasise that perceptions of safety, fairness, and bias vary significantly across demographic groups, suggesting that AI evaluation should prioritise pluralistic methodologies.
Annotator Perspectives and Disagreement: Traditional AI annotation pipelines aggregate human labels to create "gold-standard" datasets. However, social identity, ideology, and lived experience shape how annotators interpret AI-generated content (Rastogi et al., 2024). In safety evaluation studies, gender, ethnicity, and political background influence annotators' perceptions of harm in AI-generated images and text, challenging the assumption of a universally "correct" label.
The Problem with Expert-Only Annotations: AI safety evaluations often rely on expert annotators trained in standardised safety policies. However, experts and diverse raters frequently disagree on harm assessment, with experts underestimating the severity of bias-related harms (Rastogi et al., 2024). This divergence highlights the need for multi-perspective evaluation frameworks, particularly in high-risk intelligence applications where misjudging harm could have operational consequences.
To address the challenges of annotator disagreement and expert-only annotation, researchers advocate for "perspectivist" annotation approaches that preserve annotator disagreements rather than enforcing a single truth label (Jiang et al., 2024). This is especially relevant for security AI applications, where diverse cultural perspectives influence how threats, misinformation, and ethical risks are perceived.
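One common way to operationalise a perspectivist approach is to train against the distribution of annotator judgements rather than a collapsed hard label. The sketch below (PyTorch, with invented numbers) illustrates the idea; it is a generic soft-label training step, not the setup of any specific cited paper.

```python
import torch
import torch.nn.functional as F

# A minimal perspectivist training step (all values invented for illustration):
# instead of a one-hot "gold" label, the target is the distribution of
# annotator judgements, so disagreement is preserved in the training signal.
logits = torch.tensor([[1.2, -0.3]])       # model output for one item: [not_toxic, toxic]
soft_target = torch.tensor([[0.6, 0.4]])   # 60% of annotators said not_toxic, 40% toxic

# Cross-entropy against the soft target rather than an argmax-ed hard label.
loss = -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
print(loss.item())
```

The loss now penalises the model for being over-confident on items where annotators genuinely disagree, instead of rewarding it for matching whichever view happened to win the vote.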
Qualitative Bias Detection in AI-Generated Intelligence
AI-generated intelligence reports must be evaluated for implicit biases that could shape decision-making in security contexts. However, traditional bias detection relies on statistical methods that fail to capture the qualitative nature of bias. Recent studies propose new frameworks for issue bias detection, counterspeech evaluation, and nuanced safety assessments in AI.
Issue Bias in Large Language Models (LLMs): LLMs systematically favour certain ideological perspectives over others, reinforcing dominant narratives in AI-generated text (Röttger et al., 2025). Using a large-scale dataset of realistic prompts, researchers found that AI-generated text tends to align more closely with liberal viewpoints than conservative ones. This alignment has significant implications for defence applications, where biased AI-generated intelligence could shape strategic decisions.
Qualitative Counterspeech Evaluation: AI-generated counterspeech is a key strategy for combating online extremism and misinformation. However, research finds that LLMs struggle to generate effective counterspeech tailored to different audiences (Mishra et al., 2025). Qualitative user studies are necessary to assess whether AI-generated counterspeech aligns with human expectations and is persuasive across different ideological groups.
Granular AI Safety Ratings: Binary safety classifications ("safe" vs. "unsafe") are insufficient for nuanced AI safety assessments (Mishra et al., 2025). Research shows that different demographic groups rate AI-generated harm differently, necessitating a more fine-grained, qualitative approach to safety evaluation.
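As a toy illustration of why a binary label loses information, the following sketch (invented Likert-style ratings and hypothetical group names) compares per-group severity scores with a single collapsed safe/unsafe decision.

```python
import statistics

# Invented Likert-style harm ratings (1 = no harm, 5 = severe) for one
# AI-generated response, grouped by a hypothetical demographic attribute.
ratings_by_group = {
    "group_a": [2, 2, 3, 2],
    "group_b": [4, 5, 4, 4],
}

for group, ratings in ratings_by_group.items():
    mean_rating = statistics.mean(ratings)
    # Binary collapse: "unsafe" only if the mean crosses an arbitrary threshold.
    binary_label = "unsafe" if mean_rating >= 3.5 else "safe"
    print(group, round(mean_rating, 2), binary_label)

# group_a 2.25 safe
# group_b 4.25 unsafe
# Pooling all eight ratings gives a mean of 3.25 -> "safe", which hides
# that one group rates the content as seriously harmful.
```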
These findings reinforce the necessity of qualitative evaluation for AI-generated intelligence, particularly in sensitive areas such as counterterrorism, misinformation detection, and online extremism monitoring.
Best Practices for Integrating Qualitative Methods into Security AI Evaluation
Based on the reviewed literature, the following best practices integrate qualitative methods into AI evaluation for defence and security applications:
Adopt Perspectivist Annotation Strategies: Instead of enforcing a single "ground truth" label, security AI models should retain and model annotator disagreement, ensuring that minority perspectives are preserved.
Use Qualitative Bias Audits: AI-generated intelligence should undergo qualitative error analysis to identify hidden ideological biases that could shape security decision-making.
Implement Mixed-Methods Evaluation Pipelines: AI models should be evaluated using a combination of quantitative benchmarks and qualitative user studies to ensure robustness and operational relevance; a minimal sketch of such a pipeline follows this list.
Prioritise Contextual AI Safety Assessments: Security AI evaluations should incorporate multi-perspective safety scoring systems, recognising that perceptions of harm vary across demographic and cultural groups.
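To make the mixed-methods recommendation concrete, here is a hypothetical sketch of an evaluation gate that pairs a standard benchmark score with structured qualitative audit findings. The field names, threshold, and data are illustrative assumptions, not a prescribed standard.

```python
from sklearn.metrics import f1_score

# Quantitative leg: a standard benchmark score on held-out labels (invented data).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
quantitative = {"f1": f1_score(y_true, y_pred)}

# Qualitative leg: structured findings from annotator interviews / error analysis.
# These records are illustrative placeholders, not a fixed schema.
qualitative = [
    {"item_id": 3, "issue": "harm to minority group underestimated", "severity": "high"},
    {"item_id": 5, "issue": "ideologically one-sided framing", "severity": "medium"},
]

# A release decision considers both legs, not the headline metric alone.
blocking_issues = [q for q in qualitative if q["severity"] == "high"]
release_ready = quantitative["f1"] >= 0.8 and not blocking_issues
print(quantitative, release_ready)
```

The point is not the particular threshold but the design choice: qualitative findings can block deployment even when the quantitative score looks healthy.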
Conclusion
The reviewed research underscores the urgent need to integrate qualitative evaluation methods into AI assessment frameworks for defence and security applications. Over-reliance on quantitative metrics risks producing AI systems that fail to capture the qualitative aspects of human behaviour, leading to biased or misleading intelligence outputs. By incorporating diverse human perspectives, contextual safety evaluations, and qualitative bias detection methods, security agencies can build more reliable, fair, and interpretable AI systems.
Future work should focus on operationalising qualitative AI evaluation in real-world security contexts, ensuring that AI-generated intelligence aligns with the complexities of human-centric applications in defence and national security.
Citations
Jiang, L., Patel, R., Mishra, P., & Rieser, V. (2024). Voices in a Crowd: Searching for Clusters of Unique Perspectives in NLP Annotations. arXiv preprint.
Rastogi, C., Teh, T. H., Mishra, P., Patel, R., Diaz, M., & Rieser, V. (2024). Insights on Disagreement Patterns in Multimodal Safety Perception Across Diverse Rater Groups. arXiv preprint.
Nickel, M. (2024). No Free Delivery Service: Epistemic Limits of Passive Data Collection in Complex Social Systems. Meta AI.
Röttger, P., Hinck, M., Hofmann, V., & Hovy, D. (2025). IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance. arXiv preprint.
Mishra, P., Rastogi, C., Pfohl, S. R., & Rieser, V. (2025). Nuanced Safety in Generative AI: How Demographics Shape Responsiveness to Severity. arXiv preprint.
Aroyo, L., & Welty, C. (2015). Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation. AI Magazine, 36(1).
Pavlick, E., & Kwiatkowski, T. (2019). Inherent Disagreements in Human Textual Inferences. Transactions of the Association for Computational Linguistics, 7.
Wu, X., Collins, H., Curry, M., & Wang, D. (2023). The Problem of Binary Safety Labels: Toward More Nuanced Evaluations in AI Safety. Proceedings of the NeurIPS Workshop on AI Ethics.
Paulhus, D. L. (1991). Measurement and Control of Response Bias. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of Personality and Social Psychological Attitudes. Academic Press.
Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and Machine Learning: Limitations and Opportunities. MIT Press.
Appendix - Paper summaries
AI Evaluation Cannot Be Solely Quantitative
Human experience, cultural context, and ideological perspectives profoundly affect how AI-generated insights are perceived.
AI systems that rely purely on numerical benchmarks and aggregate ratings risk misrepresenting human concerns.
Annotation Diversity is Essential for Trustworthy AI
Gold-standard annotations erase minority viewpoints, leading to blind spots in security applications (e.g., misinformation assessment and propaganda detection).
The reviewed research suggests preserving annotator disagreement rather than enforcing single truth labels.
Operational AI in Defence Requires a Mixed-Methods Approach
Qualitative evaluation must be integrated into security AI systems, ensuring that intelligence assessments capture human complexities, ethical considerations, and sociocultural variances.
Security agencies should adopt participatory data curation methods, leveraging domain-expert, cross-cultural, and adversarial perspectives.
1. The Role of Subjectivity in Annotations and Ambiguity in Machine Learning
Paper: Is a Picture of a Bird a Bird? A Mixed-Methods Approach to Understanding Diverse Human Perspectives and Ambiguity in Machine Vision Models
Key Points:
Annotation is not objective: Even in tasks as seemingly straightforward as image annotation, disagreements arise due to the subjective nature of human experience.
Three main sources of ambiguity in labelling:
Intrinsic ambiguity in the images – e.g., unclear representations of an object.
Annotators’ backgrounds and experiences – e.g., experts vs non-experts in recognising bird species.
Task definitions and instructions – ambiguity in annotation guidelines influences results.
Crowdsourced adversarial image-label pairs: The study conducted a challenge where users identified edge cases where machine vision models failed, helping to refine datasets and improve model robustness.
Implications for NLP: This work highlights that similar annotation ambiguities exist in language datasets, where perspectives on labels such as "hate speech" or "toxicity" differ across social groups.
2. Intersectionality and Safety in Conversational AI
Paper: Intersectionality in AI Safety: Using Multilevel Models to Understand Diverse Perceptions of Safety in Conversational AI
Key Points:
AI safety is subjective: Perceptions of safety in AI-generated conversations vary significantly based on users’ demographics.
Analysis of 101,286 annotations of chatbot interactions: The study found that:
Race and gender intersectionality matters – e.g., South Asian and East Asian women reported higher safety concerns.
Education influences safety perception uniquely for Indigenous raters – different levels of educational background affected judgments.
Gold standard answers depend on rater composition: For instance, in safety annotation, racial differences in raters led to significantly different assessments of the same AI-generated response.
Multilevel Bayesian modelling as a solution: This approach enables capturing interaction effects between demographic factors, highlighting the need for diverse rater pools.
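As a rough, illustrative stand-in for the paper's multilevel Bayesian analysis, the sketch below fits a frequentist mixed-effects model (statsmodels) with a random intercept per rater and a gender-by-ethnicity interaction on simulated ratings. Every column name, effect size, and value is invented; the point is only where an intersectional effect would appear in such a model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate raters and their demographic attributes (all data invented).
rng = np.random.default_rng(0)
n_raters, items_per_rater = 40, 10
rater_id = np.repeat(np.arange(n_raters), items_per_rater)
gender = np.repeat(rng.choice(["m", "w"], size=n_raters), items_per_rater)
ethnicity = np.repeat(rng.choice(["a", "b"], size=n_raters), items_per_rater)

# Simulated 1-5 safety ratings with a small extra effect for one intersection.
base = 3 + 0.5 * (gender == "w") + 0.3 * (ethnicity == "b")
interaction = 0.8 * ((gender == "w") & (ethnicity == "b"))
rating = np.clip(np.round(base + interaction + rng.normal(0, 0.7, n_raters * items_per_rater)), 1, 5)

df = pd.DataFrame({"rating": rating, "gender": gender, "ethnicity": ethnicity, "rater_id": rater_id})

# gender * ethnicity fits both main effects and their interaction, which is
# where intersectional differences in perceived safety would show up.
result = smf.mixedlm("rating ~ gender * ethnicity", df, groups=df["rater_id"]).fit()
print(result.summary())
```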
3. Diversity-Aware Annotation for Conversational AI Safety
Paper: Diversity-Aware Annotation for Conversational AI Safety
Key Points:
Safety annotations need diverse rater perspectives: Socio-cultural backgrounds shape how people perceive AI-generated content.
Challenge of incorporating diversity in large-scale datasets: Full diversity is often impractical due to cost, but a two-step annotation approach is proposed:
Pilot study to identify key demographic subgroups with diverse safety perceptions.
Dynamic item allocation to ensure sufficient representation from these groups in annotation tasks.
Outcome: This approach improved recall of safety issues flagged by minority groups without significantly reducing precision.
Implication: Instead of brute-force diversity, targeted inclusion of key perspectives can efficiently enhance annotation quality.
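The allocation step can be pictured as a simple routing loop. The subgroup names, quotas, and rater pool below are invented, and the paper's actual algorithm is not reproduced here; the sketch only shows the idea of assigning raters to an item until every key subgroup identified in the pilot study is sufficiently represented.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical subgroups flagged by the pilot study, and a per-item quota.
key_subgroups = ["subgroup_a", "subgroup_b", "subgroup_c"]
min_labels_per_subgroup = 3
rater_pool = {f"rater_{i}": random.choice(key_subgroups) for i in range(30)}

def allocate(item_id):
    """Assign raters to an item until every key subgroup meets its quota."""
    counts = defaultdict(int)
    assigned = []
    for rater, subgroup in rater_pool.items():
        if counts[subgroup] < min_labels_per_subgroup:
            assigned.append((item_id, rater, subgroup))
            counts[subgroup] += 1
        if all(counts[g] >= min_labels_per_subgroup for g in key_subgroups):
            break
    return assigned

print(allocate("item_001"))
```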
4. Identifying Minority Perspectives in NLP Annotations
Paper: Voices in a Crowd: Searching for Clusters of Unique Perspectives
Key Points:
Moving beyond "gold standard" labels: Traditional supervised learning assumes a single correct answer, which erases minority perspectives.
Two traditional approaches to addressing disagreement:
Disagreement-based: Represents differing opinions as a probability distribution.
Metadata-based: Groups annotators based on shared characteristics (e.g., gender, ethnicity).
Problem: Both methods have weaknesses—disagreement-based approaches collapse nuanced perspectives into a single distribution, while metadata-based approaches assume homogeneity within demographic groups.
Proposed solution: A behaviour-based clustering framework that groups annotators dynamically based on annotation patterns, rather than predefined categories.
Application to NLP: This method can improve sentiment analysis, toxicity detection, and bias mitigation by preserving minority viewpoints in model training.
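A minimal version of behaviour-based clustering can be sketched by clustering annotators on their labelling patterns alone, with no demographic metadata. The annotator-by-item matrix below is simulated, and KMeans is used purely as a simple stand-in for the paper's clustering framework.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented data: rows are annotators, columns are items, values are binary
# toxicity judgements. Two behavioural profiles are simulated.
rng = np.random.default_rng(42)
lenient = rng.binomial(1, 0.2, size=(10, 50))   # annotators who rarely flag toxicity
strict = rng.binomial(1, 0.7, size=(10, 50))    # annotators who flag it often
label_matrix = np.vstack([lenient, strict])

# Cluster annotators by what they did, not by who they are.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(label_matrix)
print(clusters)   # two behavioural clusters recovered from annotation patterns alone
```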
5. Rater Attitudes and Annotation Bias in Sexism and Misogyny Classification
Paper: Re-examining Sexism and Misogyny Classification with Annotator Attitudes
Key Points:
Political and social beliefs influence annotation decisions:
Right-Wing Authoritarianism (RWA) correlates with increased likelihood of labelling text as sexist.
Social Dominance Orientation (SDO) and Neosexist Attitudes (HN) correlate with lower likelihood of labelling content as sexist.
Bias in NLP datasets: Existing datasets for detecting sexism/misogyny often fail to include diverse perspectives, leading to skewed classifications.
Fine-tuning AI classifiers with annotator information: Models can be improved by incorporating annotator attitudes into their learning process.
Implication: Future NLP models should avoid treating ground truth as a monolithic concept and instead acknowledge the perspectivist nature of bias annotations.
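One simple way to picture "incorporating annotator attitudes" is to append attitude scores such as RWA and SDO to the text features, so the classifier predicts a label conditioned on who is annotating. The sketch below is a generic illustration with invented texts, labels, and scores; it is not the architecture used in the paper.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented placeholder data: one row per (text, annotator) pair.
texts = ["example post one", "example post two", "example post three", "example post four"]
labels = [1, 0, 1, 0]                                    # 1 = labelled sexist by this annotator
attitudes = np.array([[5.1, 2.0], [2.3, 4.8], [4.9, 1.7], [1.8, 5.0]])  # [RWA, SDO] per annotation

# Concatenate text features with the annotator's attitude scores.
text_features = TfidfVectorizer().fit_transform(texts)
features = hstack([text_features, csr_matrix(attitudes)])

# The model now learns annotator-conditioned judgements rather than assuming
# a single annotator-independent ground truth.
clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))
```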
6. The Challenges of Automatic Metrics in NLG Evaluation
Paper: Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices
Key Points:
Automatic metrics (e.g., BLEU, ROUGE) are widely used but flawed: They often fail to correlate with human judgment.
Missing transparency in NLP evaluation: Many papers fail to report implementation details or correlations with human evaluations.
Need for more diverse evaluation approaches: Error analysis, qualitative user studies, and contextual evaluation should supplement automatic metrics.
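The basic validation step the survey asks for can be expressed in a few lines: report how well an automatic metric tracks human judgements on the same outputs. The scores below are invented.

```python
from scipy.stats import spearmanr

# Invented per-output scores: an automatic metric (e.g. ROUGE-L) and 1-5
# human quality ratings for the same seven generated texts.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60, 0.91, 0.50]
human_ratings = [4, 2, 5, 3, 3, 4, 2]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A weak or unreported correlation is exactly the transparency gap the survey highlights.
```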
7. Counterspeech Strategies and NLP Models
Paper: A Strategy-Labelled Dataset of Counterspeech
Key Points:
Counterspeech is a key method for combating online hate speech: It can take many forms, including:
Fact-checking
Humour/sarcasm
Pointing out hypocrisy
Empathy and positive tone
NLP models trained on counterspeech should distinguish between strategies: Some strategies work better than others depending on context and audience.
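A strategy-labelled counterspeech record might look something like the following sketch; the field names and the example are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record structure for strategy-labelled counterspeech data.
@dataclass
class CounterspeechExample:
    hate_post: str
    counterspeech: str
    strategies: list[str]      # e.g. fact-checking, humour, empathy
    target_audience: str

example = CounterspeechExample(
    hate_post="<redacted hateful post>",
    counterspeech="Actually, the statistics say the opposite - here is the source.",
    strategies=["fact-checking", "pointing out hypocrisy"],
    target_audience="general public",
)
print(example.strategies)
```

Labelling each response with one or more strategies is what allows models (and evaluators) to ask which strategy works for which audience, rather than treating counterspeech as a single undifferentiated category.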
8. Disagreement Patterns in Multimodal Safety Perception
Paper: Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups
Key Findings:
Perceptions of AI-generated content safety vary significantly across demographic groups, with age, gender, and ethnicity influencing harm assessment.
Expert raters and diverse rater groups often disagree—experts trained in safety policies frequently underestimate the perceived harm compared to diverse raters.
Intersectionality affects disagreement: Some groups, such as Gen-Z Black and Millennial Black raters, have distinct perspectives that are obscured when grouped only by age or ethnicity.
Bias and stereotyping concerns are underestimated by experts compared to diverse rater groups.
Relevance to Human-Centric AI in Defence & Security:
AI systems in defence intelligence should not solely rely on expert annotations—real-world operational settings demand context-sensitive, culturally aware evaluations.
Security applications need to factor in diverse perspectives on AI safety, particularly in sensitive areas such as misinformation detection and social harm evaluation.
9. The Epistemic Limits of Passive Data Collection in AI
Paper: No Free Delivery Service: Epistemic Limits of Passive Data Collection in Complex Social Systems
Key Findings:
Current AI evaluation methods (train-test paradigms) are fundamentally flawed in real-world complex social systems.
Passive data collection introduces biases that distort AI performance validation, particularly in AI-enabled social intelligence and predictive analytics.
Recommender systems, LLMs, and security applications suffer from "test invalidity"—meaning their performance in controlled settings does not translate well into dynamic, human-centric applications.
Participatory data curation and open science approaches are suggested as remedies to improve AI evaluation.
Relevance to Human-Centric AI in Defence & Security:
Real-world security applications cannot rely solely on passively collected intelligence data. Instead, intelligence AI systems should integrate qualitative, human-validated insights to prevent critical blind spots.
Operational AI models in security should undergo continuous re-evaluation beyond initial training datasets, ensuring validity in dynamic, high-risk environments.
10. Issue Bias in LLM Writing Assistance
Paper: IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance
Key Findings:
LLMs exhibit systematic issue biases in political, ethical, and social domains—many align more closely with US Democrat perspectives than Republican ones.
Biases are persistent across different LLM architectures, even when adjusted for neutrality.
Real-world LLM biases differ from controlled multiple-choice assessments—highlighting a critical limitation of common evaluation methods.
The IssueBench dataset introduces a structured approach to testing bias across diverse user-generated prompts.
Relevance to Human-Centric AI in Defence & Security:
Bias in AI-generated intelligence must be systematically evaluated. Security agencies relying on LLMs for intelligence processing must account for bias drift when analysing politically charged topics.
A qualitative evaluation layer should be introduced into AI-assisted decision-making frameworks to flag and adjust for ideological leanings.
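The core construction behind IssueBench, crossing realistic writing-assistance templates with issues, can be sketched as follows; the templates and issues here are invented examples rather than items from the released dataset.

```python
# Build a prompt set as the Cartesian product of task templates and issues
# (both lists below are illustrative placeholders).
templates = [
    "Write a short blog post about {issue}.",
    "Help me draft an email to my colleagues explaining my view on {issue}.",
    "Summarise the main arguments in the debate over {issue}.",
]
issues = ["immigration policy", "defence spending", "online content moderation"]

prompts = [t.format(issue=i) for t in templates for i in issues]

for prompt in prompts[:3]:
    print(prompt)

# Each prompt would then be sent to the model under test, and the completions
# scored for stance (by human annotators or a separate stance classifier) to
# estimate how consistently the model leans one way on each issue.
```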
11. Nuanced Safety Evaluation in Generative AI
Paper: Nuanced Safety in Generative AI: How Demographics Shape Responsiveness to Severity
Key Findings:
AI safety evaluation must account for nuanced human perspectives—binary classification (safe/unsafe) fails to reflect real-world complexity.
Different demographic groups rate AI-generated harm differently, with varying sensitivity to content safety violations.
Ordinal scales (e.g. Likert scales) reveal granular differences in perception, but also introduce measurement noise and subjectivity.
A new responsiveness metric is introduced to standardise how different groups perceive AI safety concerns.
Relevance to Human-Centric AI in Defence & Security:
Security-focused AI must incorporate qualitative safety metrics, ensuring diverse operational teams can assess AI-generated intelligence in contextually relevant ways.
A standardised multi-perspective framework is needed in AI-enabled intelligence applications to prevent overlooking key security threats due to bias in safety assessments.