Building Reliable Agentic AI Systems
This article presents the Preclinical Information Center (PRINCE), a platform developed by Bayer AG with Thoughtworks. It uses Agentic RAG and Text-to-SQL to integrate decades of safety study reports, evolving from keyword search to an intelligent research assistant capable of answering complex questions and drafting regulatory documents. The article discusses key engineering decisions through context engineering and harness engineering, emphasizing trust, transparency, and human-in-the-loop integration.
A Case Study in building production-ready agentic AI systems
This paper presents the Preclinical Information Center (PRINCE), a cloud-hosted platform developed by Bayer AG with Thoughtworks to address pharmaceutical industry challenges in drug development. PRINCE leverages Agentic Retrieval-Augmented Generation and Text-to-SQL to integrate decades of safety study reports. We describe PRINCE's evolution from keyword-based search to an intelligent research assistant capable of answering complex questions and drafting regulatory documents. We reflect on key engineering decisions through the lens of context engineering—how information was shaped and routed between specialized agents—and harness engineering—how orchestration, recovery, and observability were built around the models to maintain control and reliability. The system prioritizes trust through transparency, explainability, and human-in-the-loop integration. PRINCE demonstrates AI's transformative potential in pharmaceuticals, significantly improving data accessibility and research efficiency while ensuring governance and compliance.
16 June 2026
Sarang Sanjay Kulkarni
Sarang Kulkarni is a Principal Consultant at Thoughtworks, working at the intersection of software engineering, data platforms, and applied AI. He focuses on building production-grade GenAI systems, particularly Retrieval-Augmented Generation (RAG) and multi-agent workflows, and helps teams take these systems from early ideas to real-world use. Sarang also contributes to Thoughtworks’ Global AI Service Development team and teaches an O’Reilly course on building production-ready RAG applications.
Contents
The Challenge: Navigating the Preclinical Data Maze
The Solution: PRINCE - An Evolutionary Platform
System Architecture: Engineering a Reliable Agentic RAG System
The Agentic RAG System
Clarify User Intent
Think & Plan: Process Reflection
The Researcher Agent
The Reflection Agent: Data Validation and Sufficiency
The Writer Agent: Answer Synthesis and Formatting
Building Trust in a Production LLM System
Transparency and Explainability
Evaluation
Monitoring
Engineering for Resilience: Error Handling and Recovery
Enhancing Data Quality: Named Entity Recognition and Annotation
The Journey Continues: Iterative Development
Conclusion
Preclinical drug discovery is inherently complex and data-intensive. Researchers face the significant challenge of efficiently accessing and analyzing vast volumes of information generated during this critical phase. Traditional keyword-based search methods, often reliant on rigid Boolean logic, frequently fall short when confronted with the nuanced and intricate nature of preclinical research questions.
The advent of Large Language Models (LLMs) has presented a transformative opportunity. By combining the generative power of LLMs with the precision of information retrieval systems, Retrieval-Augmented Generation (RAG) has emerged as a promising technique. This approach holds the potential to revolutionize preclinical data access, enabling researchers to pose complex questions in natural language and receive accurate, context-rich answers grounded in proprietary data.
Recognizing this potential early, Bayer committed to exploring how these technologies could address longstanding challenges in preclinical research.
In this post, we share that journey—how Bayer's early investment in generative AI has resulted in PRINCE, an agentic AI system built on Agentic RAG. This case study explores the technical architecture, engineering decisions, and lessons learned in transforming preclinical data retrieval from a challenging maze into an intuitive conversational experience.
Many of the engineering decisions behind PRINCE can now be understood through the lens of context engineering and harness engineering, although when the system was first designed we did not use these terms. Context engineering shaped what information each model received, what it did not receive, and how context moved between specialized steps such as research, reflection, and writing. Harness engineering shaped the scaffolding around the models: orchestration, tool boundaries, state persistence, retries, fallbacks, validation, reflection loops, observability, and human review.
While this post focuses on the technical architecture and engineering challenges, our paper published in Frontiers in Artificial Intelligence covers the product evolution and business impact in more detail.
The Challenge: Navigating the Preclinical Data Maze
The preclinical research landscape at Bayer, like many large pharmaceutical organizations, is characterized by a diverse and extensive array of data. This includes highly structured datasets from various studies, alongside vast amounts of unstructured information embedded within text documents such as study reports, publications, and regulatory submissions. Researchers frequently encountered significant hurdles in accessing and analyzing this information effectively:
Data Silos: information was fragmented and scattered across numerous disparate systems and repositories, making it exceedingly difficult to gain a comprehensive, holistic view of preclinical data related to a specific compound or study.
Limited Search Capabilities: traditional keyword-based search engines struggled with the complexity and variability of preclinical terminology and research questions, often yielding irrelevant, incomplete, or overwhelming results.
Time-Consuming Manual Analysis: extracting specific insights or compiling information across multiple documents required considerable manual effort, diverting valuable researcher time away from core scientific activities.
These inherent challenges highlighted a clear need for a more efficient, intelligent, and integrated approach to preclinical data retrieval and analysis.
The Solution: PRINCE - An Evolutionary Platform
To address these challenges, Bayer developed the Preclinical Information Center (PRINCE) platform. PRINCE was conceived as a unified gateway to preclinical data. It initially focused on consolidating previously siloed structured study metadata and exposing it in a searchable form, so that users could apply advanced filters and retrieve information primarily from that metadata.
However, a significant portion of Bayer's valuable preclinical knowledge resides within unstructured PDF study reports accumulated over decades. Due to numerous system migrations over the years, the structured metadata associated with these reports could be incomplete or missing, or could even contain incorrect annotations. Crucially, the authoritative “gold standard” information was consistently present within the approved PDF study reports.
The emergence of Generative AI, particularly RAG, provided the key to unlocking this wealth of unstructured data. By integrating RAG capabilities, PRINCE began to shift the paradigm from a filter-based 'search' tool to a natural language 'ask' system, enabling researchers to query the content of these study reports directly.
This evolution reflects PRINCE's progression through three distinct phases:
Search: the initial phase focused on creating a unified gateway to thousands of nonclinical study reports, consolidating multiple in-house data silos from various preclinical domains into a searchable format, primarily leveraging structured metadata.
Ask: this phase introduced an AI-powered question-answering system built on Retrieval-Augmented Generation (RAG). This enabled researchers to derive insights directly from unstructured data, including scanned PDFs of historical reports, by posing questions in natural language.
Do: the current phase positions PRINCE as an active research assistant capable of executing complex tasks. This is achieved through the integration of multi-agent systems, allowing the platform to handle intricate queries, orchestrate workflows, and support activities like drafting regulatory documents.
This deliberate evolution from Search to Ask to Do represents a strategic response to the industry's need for greater efficiency and innovation in preclinical development. By providing researchers with increasingly powerful tools to access, analyze, and act upon preclinical data, PRINCE aims to enable faster data-driven decision-making, reduce the need for unnecessary experiments, and ultimately accelerate the development of safer, more effective therapies.
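To make the Ask phase concrete, the following is a minimal sketch of the retrieve-then-generate pattern behind RAG. It is not PRINCE's implementation: the word-count "embedding", the function names (`embed`, `retrieve`, `build_prompt`), and the prompt wording are all assumptions chosen so the example runs with the standard library alone. A production system would use a learned embedding model and a vector store instead.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: List[str], k: int = 2) -> List[str]:
    """Return up to k passages most similar to the question, best first."""
    query = embed(question)
    scored = [(cosine(query, embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:k] if score > 0]


def build_prompt(question: str, passages: List[str], k: int = 2) -> str:
    """Assemble a grounded prompt: numbered context passages, then the question."""
    context = "\n".join(
        f"[{i}] {p}" for i, p in enumerate(retrieve(question, passages, k), start=1)
    )
    return (
        "Answer using only the context below and cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The prompt returned by `build_prompt` would then be sent to an LLM. Numbering the passages is one simple way to let the generated answer point back to its sources.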
System Architecture: Engineering a Reliable Agentic RAG System
The system functions as an interactive conversational UI, powered by a robust backend infrastructure. Its architecture, designed for handling complex queries and delivering accurate, context-rich answers, is orchestrated using LangGraph and served via a FastAPI application.
Figure 1 provides the system context—UI, backend, data stores, LLM fallbacks, and observability—while Figure 2 zooms into how the system coordinates its specialized agents.
Figure 1: System context and supporting platforms.
User Request: the process begins when a user submits a request through the conversational UI, which is built with React.
Orchestration: the user's request is routed to a LangGraph-based orchestration layer in the backend. This workflow engine coordinates a multi-stage process: clarifying user intent, thinking and planning, conducting research (using RAG and Text-to-SQL), validating that the gathered data is complete, and finally generating a response through the Writer agent. The workflow includes deliberate pause points and feedback loops to ensure data completeness before proceeding. (We explore the details of this agentic workflow in a dedicated section later.)
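The staged workflow with its feedback loop can be sketched in plain Python as a small state machine. This is an illustration only and deliberately does not use LangGraph: the node functions, the `WorkflowState` fields, the sufficiency check, and the `MAX_RESEARCH_ROUNDS` cap are all assumptions, and the human pause points are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_RESEARCH_ROUNDS = 3  # assumed cap so the feedback loop always terminates


@dataclass
class WorkflowState:
    question: str
    plan: List[str] = field(default_factory=list)
    findings: List[str] = field(default_factory=list)
    answer: str = ""
    research_rounds: int = 0
    trace: List[str] = field(default_factory=list)  # nodes visited, in order


def clarify(state: WorkflowState) -> str:
    state.question = state.question.strip()  # placeholder for intent clarification
    return "plan"


def plan(state: WorkflowState) -> str:
    state.plan = ["search study reports", "query structured data"]  # placeholder
    return "research"


def research(state: WorkflowState) -> str:
    state.research_rounds += 1
    # Placeholder: a real researcher would call RAG or Text-to-SQL tools here.
    state.findings.append(f"finding from round {state.research_rounds}")
    return "reflect"


def reflect(state: WorkflowState) -> str:
    # Assumed sufficiency rule: one finding per planned task, or budget exhausted.
    enough = len(state.findings) >= len(state.plan)
    out_of_budget = state.research_rounds >= MAX_RESEARCH_ROUNDS
    return "write" if enough or out_of_budget else "research"


def write(state: WorkflowState) -> str:
    state.answer = f"Answer to '{state.question}' from {len(state.findings)} findings."
    return "end"


NODES: Dict[str, Callable[[WorkflowState], str]] = {
    "clarify": clarify, "plan": plan, "research": research,
    "reflect": reflect, "write": write,
}


def run_workflow(question: str) -> WorkflowState:
    """Run nodes until one returns 'end'; each node names its successor."""
    state = WorkflowState(question=question)
    current = "clarify"
    while current != "end":
        state.trace.append(current)
        current = NODES[current](state)
    return state
```

Having each node return the name of its successor keeps the routing decision, including the loop from reflection back to research, visible in one place per step.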
Data Retrieval and State Management: the Researcher agents interact with a comprehensive and distributed data ecosystem:
Vector representations of all study reports are stored in OpenSearch, forming the core knowledge base for information retrieval.
Curated structured data, resulting from various ETL and harmonization processes, is accessed via Athena.
The state of the agent's execution is meticulously tracked. After each logical step (a LangGraph node execution), the corresponding state is persisted in PostgreSQL using a LangGraph checkpointer.
Broader application-level state is managed in DynamoDB.
The system leverages internal GenAI platforms that host models from OpenAI, Anthropic, Google, and open-source providers. These platforms expose all models via a unified OpenAI-compatible endpoint, making it easy to swap models and choose the best tool for each task. They also manage the control plane, enforcing rate limits and other safeguards to prevent abuse.
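The per-step state persistence mentioned above can be illustrated with a small checkpoint store. This sketch uses SQLite from the standard library as a stand-in for PostgreSQL and does not reproduce the LangGraph checkpointer; the table layout, the `thread_id` key, and the function names are assumptions.

```python
import json
import sqlite3
from typing import Any, Dict, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the checkpoint table. SQLite stands in for PostgreSQL."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS checkpoints (
            thread_id TEXT    NOT NULL,
            step      INTEGER NOT NULL,
            node      TEXT    NOT NULL,
            state     TEXT    NOT NULL,
            PRIMARY KEY (thread_id, step)
        )
        """
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, thread_id: str, step: int,
                    node: str, state: Dict[str, Any]) -> None:
    """Persist the state produced by one node execution."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
        (thread_id, step, node, json.dumps(state)),
    )
    conn.commit()


def latest_checkpoint(conn: sqlite3.Connection,
                      thread_id: str) -> Optional[Dict[str, Any]]:
    """Return the most recent checkpoint for a conversation, or None."""
    row = conn.execute(
        "SELECT step, node, state FROM checkpoints "
        "WHERE thread_id = ? ORDER BY step DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    if row is None:
        return None
    step, node, state = row
    return {"step": step, "node": node, "state": json.loads(state)}
```

Writing one row per completed step means a workflow that pauses or crashes can be resumed from `latest_checkpoint` rather than restarted from the user's original request.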
Resilience and Error Handling: robustness is a critical design principle, with multiple fallback mechanisms in place:
If a specific LLM fails, the system automatically retries the request several times before falling back to an alternative model or platform to ensure service continuity.
To recover quickly from transient failures, retries are implemented at both the individual LLM call level and the logical node level (i.e., an entire step in the agent's plan).
Agents also receive the context of the errors, so that they can chart a different trajectory or an alternative plan of action in response.
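The three mechanisms above (retries per LLM call, fallback to another model, and node-level retries that feed errors back to the agent) can be sketched as follows. The function names, the attempt counts, the exponential backoff, and the shape of the error context are assumptions made for illustration, not details taken from PRINCE.

```python
import time
from typing import Callable, List, Sequence, Tuple


class AllModelsFailed(Exception):
    """Raised when every configured model has exhausted its attempts."""

    def __init__(self, errors: List[str]):
        super().__init__("; ".join(errors))
        self.errors = errors


def call_with_fallback(
    prompt: str,
    models: Sequence[Tuple[str, Callable[[str], str]]],
    attempts_per_model: int = 3,
    base_delay: float = 0.0,
) -> Tuple[str, str]:
    """Try each model in order, retrying before moving on to the next one.

    Returns (model_name, response) from the first call that succeeds.
    """
    errors: List[str] = []
    for name, call in models:
        for attempt in range(1, attempts_per_model + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                errors.append(f"{name} attempt {attempt}: {exc}")
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AllModelsFailed(errors)


def run_node_with_retry(node: Callable[[List[str]], str], max_attempts: int = 2) -> str:
    """Retry a whole logical step, passing earlier errors to the next attempt.

    The node receives the list of previous error messages so that it can
    choose a different plan instead of repeating the failing one.
    """
    error_context: List[str] = []
    for _ in range(max_attempts):
        try:
            return node(error_context)
        except Exception as exc:
            error_context.append(str(exc))
    raise RuntimeError(f"node failed after {max_attempts} attempts: {error_context}")
```

Keeping the two retry levels separate lets a transient model outage be absorbed inside `call_with_fallback`, while failures of a whole step, such as a rejected SQL query, reach the agent as context for replanning.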
Observability and Evaluation: the entire system is monitored for performance and reliability:
General system health and metrics are tracked using CloudWatch.
Langfuse serves as the primary observability tool, providing detailed traces of all production traffic. This allows for in-depth debugging of issues. Furthermore, evaluation datasets are stored and managed within Langfuse, making it easier to analyze performance scores and diagnose specific failures. Evaluation uses the RAGAS evaluation framework. Live traffic is evaluated daily, while the dataset evaluation is done whenev
[truncated for AI cost control]