arXiv Computational Linguistics AI News Source

Public articles: 303
Collected articles: 330
Trust: 75
Refresh: 360 min

Health: Healthy
Source type: Research
Full-text rights: Full text allowed
Last ingested: 2026-06-26
ID: arxiv-cs-cl
Status: Enabled

Use the abstract and metadata; check each paper's license before using its full text.

Latest public articles

From Lexicon to AI: A Structured-Data Pipeline for Specialized Conversational Systems in Low-Resource Languages

2026-06-26 04:00 UTC

A systematic methodology transforms structured linguistic resources like Hindi WordNet into 1.25 million instruction-response pairs to fine-tune a 12B-parameter language model using resource-efficient LoRA and 4-bit quantization. Evaluation via a Hindi language learning chatbot shows superior pedagogical effectiveness (91.0) compared to general-purpose models (79.4-83.6) while maintaining competitive semantic performance. This work offers a practical alternative for low-resource languages, enabling specialized AI development for hundreds of languages with existing WordNet resources.

Hindi WordNet converted into 1.25 million instruction-response pairs to fine-tune a 12B-parameter model
Resource-efficient fine-tuning using LoRA and 4-bit quantization

Where Larger Models Excel: The Primacy of Constraint-Guided Reasoning

2026-06-26 04:00 UTC

A new study reveals that larger language models outperform smaller ones in reasoning tasks due to constraint-guided reasoning. Using the AdvCluster framework, researchers found stable performance gaps of 6.43% and 7.38% across model pairs. The analysis identifies constraint identification and structured reasoning as key advantages.

Larger models consistently outperform smaller ones on reasoning benchmarks in math, physics, chemistry, and programming.
The core advantage identified is constraint-guided reasoning: better identification and use of explicit and implicit constraints.

Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars

2026-06-26 04:00 UTC

This pilot study presents NEST-V1, a lightweight Transformer-based multimodal framework that generates emotion-conditioned Nepali Sign Language avatars from spoken input. On a dataset of 600 audio samples covering 4 common words and 3 emotional states, the system achieves 81.1% ASR accuracy and 79.21% emotion recognition accuracy with only 22.1M parameters, suitable for edge deployment. The work establishes a technical foundation for emotion-aware sign language translation in low-resource settings.

NEST-V1 is a multimodal framework that translates spoken Nepali words into sign language avatars with emotional expressions (happy, neutral, sad).
It uses a shared acoustic encoder for simultaneous ASR and emotion classification, achieving 81.1% and 79.21% accuracy on 600 labeled audio samples.

Reducing Conversational Escalation in Large Language Model Dialogue with Nonviolent Communication Constraints

2026-06-26 04:00 UTC

This study investigates using Nonviolent Communication (NVC) principles as lightweight prompt-level constraints to guide large language models (LLMs) toward more de-escalating dialogue behavior in emotionally charged situations. Through a dual-agent simulation framework across multiple models and user resistance levels, NVC-constrained prompting consistently reduces conversational escalation and stabilizes interactions with highly resistant users.

LLMs are increasingly used in conflict situations, but prior safety research overlooks behaviors that may escalate conflict unintentionally.
Researchers reformulated NVC principles into process-oriented guidelines discouraging blame, emphasizing user emotions, and encouraging clarification before advice.

Context Recycling for Long-Horizon LLM Inference

2026-06-26 04:00 UTC

Large language models excel at short-context reasoning but degrade over long conversations due to context window limits. ContextForge recycles context via structured query generation, external memory retrieval, and controlled synthesis, reducing token overhead while maintaining answer quality. On a 15-turn healthcare benchmark, ContextForge improved consistency and reduced token consumption.

LLMs degrade in long conversations due to context window limitations
ContextForge combines structured query generation, external memory retrieval, and controlled synthesis to recycle context

Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare

2026-06-26 04:00 UTC

A new study finds that using linguistic features such as assertive certainty, explicit moral vocabulary, and emotion words in fine-tuning data significantly shifts LLM reasoning toward stronger pro-animal-welfare stances, while hedged language and concrete sensory description dilute that stance. The research offers practical guidance for animal-welfare advocates.

Ten linguistic features were tested on Llama-3.2-1B using stance-contrast probes.
Eight features produced statistically significant shifts; seven increased pro-animal-welfare reasoning.

Investigating LLM's Problem Solving Capability -- a Study on Statics Questions

2026-06-26 04:00 UTC

A new study evaluates LLM performance on statics problems using a model distillation approach. LLMs perform well on text-only problems but accuracy drops when diagrams and multi-step reasoning are introduced. The decline is primarily due to difficulties in multi-step reasoning, not image recognition limits.

25 text-only statics questions were distilled from ChatGPT, with additional datasets including diagrams and modified numerical values.
LLMs perform well on text-only statics problems but accuracy decreases with diagrams and multi-step reasoning.

Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training

2026-06-26 04:00 UTC

A study finds that post-training for helpfulness (SFT and RL) significantly degrades animal compassion values instilled during mid-training, while coding post-training better preserves them. Helpfulness training also causes a large drop in English general moral reasoning but not cross-lingually, whereas the compassion degradation transfers consistently across languages. This suggests mid-trained values are encoded more deeply and cross-lingually than reasoning improvements from domain-specific post-training. The paper recommends coding post-training for value-preserving model development.

Helpfulness post-training (SFT and GRPO) reduces animal compassion scores by ~30 percentage points on AHB compared to coding training.
On English MORU, helpfulness training lowers general moral reasoning by 25.5 percentage points, but shows no cross-lingual effect.

Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

2026-06-26 04:00 UTC

A new benchmark, Know2Guess, aims to evaluate LLMs' ability to distinguish between knowledge-based answering and guessing, considering data contamination. It includes 1,200 items across five domains and tests models like FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct. Qwen2.5-3B-Instruct shows best reliability but still has calibration issues.

Know2Guess benchmark contains 1,200 items across five domains with contamination metadata
Evaluation shows incomplete transition from answering to abstaining

HierBias: Context-Conditioned Hierarchical Media Bias Detection with Multi-Task Type Classification

2026-06-26 04:00 UTC

HierBias is a novel hierarchical context-conditioned media bias detector that formally models document context for bias prediction, theoretically proving reduced Bayes error and improving sample efficiency via multi-task learning. It achieves 0.853 F1 and 0.723 MCC on BABE and BASIL, surpassing state-of-the-art.

HierBias models document context for sentence-level bias classification, reducing Bayes error.
Multi-task training of binary detection and fine-grained type classification improves sample efficiency.
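
The two metrics reported for HierBias, F1 and the Matthews correlation coefficient (MCC), can both be derived from a binary confusion matrix. The following is a minimal sketch using the standard definitions; the counts are invented for illustration and are not taken from the paper.

```python
import math

def f1_score(tp, fp, fn):
    """F1 for the positive class: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts for a biased/unbiased sentence classifier.
tp, tn, fp, fn = 40, 45, 5, 10
example_f1 = f1_score(tp, fp, fn)
example_mcc = mcc(tp, tn, fp, fn)
print(f"F1={example_f1:.3f}  MCC={example_mcc:.3f}")
```

MCC uses all four cells of the confusion matrix, so it is typically lower than F1 on the same predictions, which matches the pattern in the reported figures.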

The cognitive, affective, and behavioral expression of self-stigma among people who use drugs in online substance use communities

2026-06-25 04:00 UTC

This study developed a codebook for self-stigma across cognitive, affective, and behavioral domains and analyzed Reddit posts from people who use drugs. Results show that self-stigma is prevalent, and behavioral indicators often precede core ones, challenging progressive models.

Developed a ten-indicator codebook for self-stigma covering cognitive, affective, and behavioral domains.
Analyzed 72,115 posts from 1,660 users, with 5.3% containing self-stigma.

Dream at SemEval-2026 Task 13: SALSA for Single-Pass Machine-Generated Code Detection

2026-06-25 04:00 UTC

Large language models have transformed code generation, raising concerns about authorship, assessment integrity, and software trust. SemEval-2026 Task 13 Subtask A defines detection as binary classification over code snippets, with emphasis on out-of-distribution generalization across unseen programming languages and domains. The authors propose SALSA (Single-pass Autoregressive LLM Structured Classification), which maps each class to a dedicated output token and trains the model to emit a single-token label. By combining balanced sampling, parameter-efficient fine-tuning, and conservative training, the system achieves OOD F1=0.789 on the official leaderboard, significantly outperforming the CodeBERT baseline (F1=0.305).

LLM-generated code detection is critical for academic integrity and software security
SALSA simplifies detection via single-pass autoregressive structured classification

LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

2026-06-25 04:00 UTC

The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent studies show that LLMs can generate fluent critiques and approximate reviewer scores, their reliability, robustness, and security as decision-support systems remain insufficiently understood. This survey offers a systems-level analysis of LLM-based scientific peer review, focusing on two core evaluative functions: critique generation and score prediction. We present a structured taxonomy of modeling approaches (including prompt-based, supervised, retrieval-augmented, and alignment-optimized approaches), and synthesize empirical findings across existing benchmarks. We analyze dataset constraints, evaluation shortcomings, and domain concentration biases that limit current assessment practices. Beyond performance metrics, we identify emerging robustness risks, including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking, which expose automated review pipelines to strategic manipulation. From a data mining perspective, we outline key open challenges in modeling subjective disagreement and cross-domain generalization. By reframing automated peer review as a high-stakes, multi-objective decision problem, this survey provides a roadmap for developing robust, transparent, and trustworthy AI-assisted scientific evaluation systems.

LLMs can generate fluent critiques and approximate scores, but their reliability as decision-support systems is not well understood.
The survey presents a taxonomy of modeling approaches: prompt-based, supervised, retrieval-augmented, and alignment-optimized.

LLM Performance on a Real, Double-Marked GCSE Benchmark

2026-06-25 04:00 UTC

A new study introduces a dataset of 32,534 double-marked real student responses to GCSE mock exams, covering five subjects and handwritten work. Top LLMs agree with examiners more closely than examiners agree with each other, handling subjective and handwriting tasks effectively, with little dependence on model size.

Dataset includes 32,534 double-marked GCSE mock exam responses across 328 questions and five subjects.
Top LLMs outperform human inter-examiner agreement.

Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

2026-06-25 04:00 UTC

Dustin is a sparse verification framework for long-context speculative decoding that combines draft model lookahead signals with target model historical attention to identify critical tokens, achieving 27.85x self-attention speedup and 9.17x end-to-end decoding speedup at 32k sequence length on Qwen2.5-72B with negligible accuracy loss.

Speculative decoding for long-context LLMs is bottlenecked by KV cache loading during verification
Existing compression methods (static eviction or dynamic selection) fail to balance efficiency and accuracy

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

2026-06-25 04:00 UTC

A new arXiv paper investigates the geometric relationship between detection and control directions in language models. While models can perfectly detect hallucination (AUC=1.0), the direction for detection and the direction for causing refusal have a cosine of only 0.12, indicating that detection does not imply controllability. This gap generalizes across models and sizes, originates in pretraining, and a 15-degree rotation can partially bridge it.

The angle between detection and control directions in language models averages 83 degrees, with cosine of 0.12.
Models can perfectly linearly separate fake entities but struggle to refuse generating them.
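
The reported 83-degree angle and 0.12 cosine are two views of the same quantity. A minimal sketch that checks the conversion; the two vectors are hypothetical 2-D stand-ins, not directions extracted from any model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between two vectors in degrees."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding
    return math.degrees(math.acos(cos))

# Cosine that corresponds to an 83-degree angle.
cos_83 = math.cos(math.radians(83))

# Hypothetical "detection" and "control" directions placed 83 degrees apart.
detection = [1.0, 0.0]
control = [math.cos(math.radians(83)), math.sin(math.radians(83))]
print(round(cos_83, 2), round(angle_degrees(detection, control), 1))
```

A cosine this close to zero means the two directions are nearly orthogonal, which is why separating inputs along one direction says little about steering along the other.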

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

2026-06-25 04:00 UTC

A new framework uses error-aware TF-IDF retrieval to correct ASR errors, achieving significant WER improvements on Persian FLEURS.

Proposes error-aware TF-IDF for RAG to fix ASR hallucinations
Integrates text normalization and sparse penalty matrix

AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents

2026-06-25 04:00 UTC

AgentOdyssey is a novel evaluation framework that procedurally generates open-ended text games to test agents' ability to learn continuously during deployment. It challenges the traditional ML assumption of no learning at test time, interleaving learning and inference throughout. The framework measures world knowledge acquisition, episodic memory, exploration, action diversity, and model cost. Experiments show even the strongest agents fall far below human performance, with short-term memory emerging as a key beneficial mechanism.

AgentOdyssey procedurally generates open-ended text games to evaluate test-time continual learning in agents.
It breaks the traditional assumption of no learning at test time, requiring agents to learn and infer throughout deployment.

Small edits, large models: How Wikipedia advocacy shapes LLM values

2026-06-25 04:00 UTC

A new study shows that a small group of Wikipedia editors can significantly influence how large language models discuss animal welfare with just 125 edits. Using gradient-based attribution methods, the research traced the impact of these edits, finding that animal welfare-related Wikipedia content dominates model responses to relevant queries.

Pro-Animal Wikipedians (PAW) influenced LLM behavior on animal welfare through only 125 edits across 115 pages.
Attribution analysis showed PAW-edited sections comprised 68% of top documents for animal welfare queries vs. 52% for unrelated queries.

Graph-Based Phonetic Error Correction of Noisy ASR

2026-06-25 04:00 UTC

Researchers propose G-SPIN, a structured ASR correction framework that combines phonetic graph modeling with contextual language understanding. It uses a graph neural network to generate acoustically plausible candidate sets, a masked language model for scoring, and an instruction-tuned large language model for final re-ranking, enabling lightweight, modular inference-time correction.

ASR errors often stem from phonetic similarity and affect critical tokens
G-SPIN constructs phonetic candidate sets via GNN, scores with MLM, and re-ranks with LLM

QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

2026-06-24 04:00 UTC

Standard tokenizer evaluation metrics like fertility rate fail to capture morphological correctness for agglutinative languages. The QuechuaTok benchmark compares four tokenization strategies on Southern Quechua, using morphological boundary accuracy (MorphAcc) alongside traditional metrics. Results show that while BPE achieves the lowest fertility rate (1.636), its MorphAcc is only 6.67%, whereas the morphology-aware PRPE tokenizer reaches 83.33% MorphAcc, demonstrating that fertility rate alone is insufficient for agglutinative languages.

Fertility rate inadequately reflects morphological accuracy for agglutinative languages.
QuechuaTok systematically compares BPE, Unigram LM, WordPiece, and PRPE tokenizers on Southern Quechua.
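
The contrast between fertility rate and morphological boundary accuracy can be sketched in a few lines. Fertility is computed here as average tokens per word. The boundary-accuracy definition (share of gold morpheme boundaries the tokenizer reproduces) is an assumption, since the summary does not give the benchmark's exact formula. The segmentations are illustrative and not drawn from the benchmark.

```python
def fertility(segmentations):
    """Average number of tokens produced per word."""
    return sum(len(tokens) for tokens in segmentations) / len(segmentations)

def boundaries(tokens):
    """Character offsets of the internal boundaries between tokens of one word."""
    offsets, position = set(), 0
    for token in tokens[:-1]:
        position += len(token)
        offsets.add(position)
    return offsets

def morph_boundary_accuracy(predicted, gold):
    """Share of gold morpheme boundaries reproduced by the tokenizer (assumed definition)."""
    hit = total = 0
    for pred_tokens, gold_morphs in zip(predicted, gold):
        gold_b = boundaries(gold_morphs)
        hit += len(gold_b & boundaries(pred_tokens))
        total += len(gold_b)
    return hit / total if total else 0.0

# Illustrative gold morpheme segmentations for two words.
gold = [["wasi", "kuna", "pi"], ["rima", "ni"]]
# Hypothetical frequency-driven output: few tokens, boundaries in the wrong places.
coarse = [["wasiku", "napi"], ["rimani"]]

print(fertility(coarse), morph_boundary_accuracy(coarse, gold))
print(fertility(gold), morph_boundary_accuracy(gold, gold))
```

In this toy case the coarse segmentation has the lower fertility yet recovers none of the gold boundaries, which is the failure mode the benchmark is designed to expose.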

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

2026-06-24 04:00 UTC

This study challenges the effectiveness of exact-match retrieval recall as a proxy for retriever quality. In tau-bench, retrieved policy clauses performed nearly as well as gold policies in downstream classification tasks, despite only 7% exact-match recall. The findings suggest that relying solely on recall may underestimate the practical utility of retrieved policies.

Exact-match retrieval recall is often used as a proxy for retriever quality but can be misleading.
Tested policy classification in tau-bench using Qwen2.5-3B/7B classifiers.
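
Exact-match recall counts a retrieved clause only when it is string-identical to a gold clause, which is one way it can understate usefulness. A minimal sketch of that effect; the function and the policy clauses are hypothetical examples, not tau-bench data.

```python
def exact_match_recall(retrieved, gold):
    """Fraction of gold clauses that appear verbatim among the retrieved clauses."""
    if not gold:
        return 0.0
    retrieved_set = set(retrieved)
    return sum(1 for clause in gold if clause in retrieved_set) / len(gold)

# Hypothetical gold policy clauses for one task.
gold_clauses = [
    "Refunds are allowed within 24 hours of booking.",
    "Basic economy tickets cannot be changed.",
]
# The second retrieved clause paraphrases a gold clause but is not identical.
retrieved_clauses = [
    "Refunds are allowed within 24 hours of booking.",
    "Changes are not permitted on basic economy tickets.",
]
recall = exact_match_recall(retrieved_clauses, gold_clauses)
print(recall)
```

The paraphrased clause carries the same policy signal for a downstream classifier but contributes nothing to the recall score.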

Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

2026-06-24 04:00 UTC

This study audits eight automatic scorers across three evaluation constructs, finding that no scorer transfers across all datasets. In the generated-answer attribution construct, metric rankings invert and an NLI scorer collapses on long-form tasks. A prompt-based LLM judge avoids collapse but is costly and non-deterministic. The research concludes that metric choice must be validated on the target dataset.

Eight automatic scorers are audited across three constructs; none transfers stably across datasets.
Metric rankings invert in generated-answer attribution; NLI scorer collapses on long-form tasks.

One Year Later...The Harms Persist, But So Do We!

2026-06-24 04:00 UTC

A new study evaluates six proprietary LLMs across 16 DSM-5 conditions using adversarial attacks, finding that safeguards only reliably hold for suicide and self-harm, with failure rates up to 100% for conditions like eating disorders, substance use disorder, and major depressive disorder. The authors call for clearly defined harm categories and targeted safeguards.

Evaluation of six proprietary LLMs across 16 DSM-5 conditions
Reliable safeguards only for suicide and self-harm; up to 100% failure for others

Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

2026-06-24 04:00 UTC

This paper proposes a training-free framework called IBA (Identify-Before-Answer) for Knowledge-Based Visual Question Answering (KB-VQA). It decouples entity identification from evidence ranking by prompting an MLLM to select high-confidence entities from candidate names, then using an off-the-shelf text re-ranker for evidence. Experiments on Encyclopedic-VQA and InfoSeek show consistent outperformance over fine-tuned multimodal reranking baselines with reduced training and inference complexity.

Proposes training-free IBA framework that decouples entity identification from evidence ranking.
MLLM selects entity from candidate names; off-the-shelf text re-ranker for evidence.

Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability

2026-06-24 04:00 UTC

A new framework uses LLMs to quantify product desirability from qualitative feedback without explicit scores, achieving Pearson correlations up to 0.97 and classification accuracy up to 94% on PDT datasets. GPT-4o-mini matches larger models at 94% lower cost, and the system includes confidence ratings and explainable AI.

LLMs produce numerical sentiment scores from qualitative responses with high correlation to expert labels (up to 0.97).
GPT-4o-mini achieves performance comparable to larger models at a 94% cost reduction.
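
The Pearson correlation used to compare model scores with expert labels can be computed as follows. This is a sketch of the standard formula; both score lists are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expert desirability labels and model-produced scores.
expert_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [1.2, 1.9, 3.2, 3.8, 5.1]
r = pearson(expert_scores, model_scores)
print(f"r={r:.3f}")
```

Pearson correlation measures linear agreement only, so a model can score highly while still being offset or scaled relative to the expert labels.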

Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment

2026-06-24 04:00 UTC

A new study shows that self-generated text recognition (SGTR) finetuning can effectively prevent and reverse emergent misalignment (EM) in large language models, outperforming benign finetuning methods. The research finds that EM results from destabilization of a model's aligned character rather than learning harmful content, and SGTR works by fortifying character consistency.

Emergent misalignment (EM) arises from destabilization of a model's aligned character, not direct learning of harmful content.
Self-generated text recognition (SGTR) finetuning is effective for both prevention and reversal, uniquely consistent in prevention.

Quantifying Prior Dominance in RAG Systems

2026-06-24 04:00 UTC

This paper introduces the Normalized Context Utilization (NCU) metric to quantify contextual information gain in RAG systems. Experiments show that small language models match or outperform large models in strict factual extraction, while a commercial API overrides external evidence in nearly half of adversarial conflicts.

NCU metric uses continuous token log-probabilities to distinguish contextual extraction from parametric recall.
Small language models outperform larger ones in strict extraction tasks, showing diminishing returns of scaling.

ModTGCN: Modularity-aware Graph Neural Networks for Text Classification

2026-06-24 04:00 UTC

The paper proposes ModTGCN, a modularity-aware GNN for text classification that jointly optimizes cross-entropy and a modularity-based auxiliary objective to promote class-coherent document communities while preserving discriminative representations. It achieves consistent gains on five benchmarks, with larger improvements on low-homophily datasets.

Incorporates global community structure to mitigate over-smoothing
Modularity objective computed on document similarity graph from transformer embeddings
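
For reference, classic Newman modularity of a hard partition on an undirected, unweighted graph can be computed as below. This is only a sketch of the standard quantity; the auxiliary objective in the paper is presumably a differentiable variant on a similarity graph, which the summary does not specify. The toy graph is hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q = sum_c [ e_c / m - (d_c / 2m)^2 ].

    edges: list of (u, v) pairs of an undirected, unweighted graph.
    community: dict mapping each node to its community label.
    e_c is the number of edges inside community c, d_c its total degree.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy document graph: two triangles joined by a single bridging edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
single_group = {node: "A" for node in range(6)}

q_split = modularity(toy_edges, by_triangle)
q_single = modularity(toy_edges, single_group)
print(q_split, q_single)
```

Splitting the graph along the bridge gives a positive score, while placing every node in one community gives zero, so maximizing the quantity rewards partitions with dense within-group links.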

EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL

2026-06-24 04:00 UTC

EXPO-SQL proposes an execution-based clause-level policy optimization method that assigns fine-grained rewards to each clause in a SQL query by analyzing execution results, including error messages and incremental clause-wise execution, addressing the issue of insufficient learning signals caused by coarse-grained query-level rewards in existing RL methods. Experiments show it significantly outperforms existing supervised fine-tuning, prompting, and RL methods on multiple Text-to-SQL benchmarks.

Existing RL methods assign uniform query-level rewards to all clauses, treating correct and incorrect ones equally.
EXPO-SQL provides fine-grained clause-level rewards by analyzing execution results and incremental clause execution.
