Research

Papers, preprints, and open-source artifacts on the evaluation and reliability of AI systems in high-stakes environments.

Extract-0: A Specialized Language Model for Document Information Extraction

Henrique Godoy · arXiv preprint, 2025

Extract-0 is a 7-billion-parameter language model optimized specifically for document information extraction that outperforms models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameter-efficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resources.
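As a quick sanity check on the figures quoted in the abstract above, the short Python sketch below recomputes the share of weights touched by LoRA and the absolute gap in mean reward against each baseline. The variable names are illustrative only and are not taken from the paper or its code; the numbers are copied from the abstract as rounded there.

```python
# Figures as quoted (and rounded) in the abstract; names are illustrative.
trainable_params = 40.4e6   # weights modified by LoRA fine-tuning
total_params = 7.66e9       # total model parameters

# Share of weights that are actually updated, in percent.
trainable_pct = 100 * trainable_params / total_params

extract0_reward = 0.573
baseline_rewards = {"GPT-4.1": 0.457, "o3": 0.464, "GPT-4.1-2025": 0.459}

# Absolute difference in mean reward between Extract-0 and each baseline.
reward_margins = {
    name: round(extract0_reward - reward, 3)
    for name, reward in baseline_rewards.items()
}

print(f"trainable share: {trainable_pct:.2f}%")
for name, margin in reward_margins.items():
    print(f"margin over {name}: {margin:+.3f}")
```

The computed share rounds to the 0.53% stated in the abstract, and the margins are simple differences of the reported mean rewards, not a significance test.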

BibTeX
@misc{godoy2025extract0,
  title={Extract-0: A Specialized Language Model for Document Information Extraction},
  author={Godoy, Henrique},
  year={2025},
  eprint={2509.22906},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.22906}
}

Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?

Henrique Godoy · arXiv preprint, 2025

Language models are increasingly used in Brazil, but most evaluation remains English-centric. Alvorada-Bench is a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. Evaluating twenty models under zero-shot, role-playing, and chain-of-thought prompting produces 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, revealing that models can accurately assess their own certainty. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024, the top model (o3) achieved perfect scores on Languages questions, while even the weakest system (GPT-4.1 Nano) underperforms humans only in Mathematics. Through exams that distill decades of Brazilian educational priorities and assess millions of students yearly, Alvorada-Bench establishes whether language models can navigate the intersection of language, culture, and reasoning that defines academic readiness in Brazil.
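The response count in the abstract above can be reproduced with one multiplication. The sketch assumes that each of the twenty models answers every question exactly once under each of the three prompting styles, which is what the totals imply; the variable names are illustrative and not taken from the benchmark's code.

```python
# Counts quoted in the abstract; names are illustrative.
questions = 4515
models = 20
prompting_styles = ["zero-shot", "role-playing", "chain-of-thought"]

# Assumption: one response per (question, model, prompting style) triple.
total_responses = questions * models * len(prompting_styles)

print(total_responses)
```

Under that assumption the product matches the 270,900 responses reported.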

BibTeX
@misc{godoy2025alvorada,
  title={Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?},
  author={Godoy, Henrique},
  year={2025},
  eprint={2508.15835},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.15835}
}