Monday, November 10, 2025

The Evolution of Statistical Programming: Human Expertise + AI

At Intego Clinical, we are developing an AI Assistant designed specifically for statistical programmers working in R. A key part of our strategy is helping clients transition toward a technology-agnostic environment and supporting their move from SAS to R with confidence. An internally hosted AI Code Assistant, trained on our own best practices and aligned with our quality standards, will play a major role in making that transition smoother, faster, and more consistent.

Across the industry, AI experimentation is everywhere, but much of it still depends on public, general-purpose models. We believe the greatest long-term value will come from self-hosted AI systems designed around an organization’s own workflows, security requirements, and programming standards. Building AI capabilities in-house allows companies to maintain full control of their data, tailor the assistant to their specific processes, and create sustainable, enterprise-ready solutions that stand apart from generic tools. This philosophy is central to how we are shaping our own AI strategy.

To better understand how this technology is reshaping our field, we asked our senior data scientists, Kostiantyn Drach (KD), Oleksandr Leonov (OL), Lyudmyla Polyakova (LP), and Viktoriia Shevtsova (VS), to share their insights about the project and the future of AI in clinical statistical programming. Their answers highlight why the shift toward R, standardization, and self-hosted AI models is not only a technological evolution but also a transformation of the role and value of statistical programmers in clinical research.

Why does AI matter for statistical programmers today?
KD: For years, statistical programming in clinical trials was built on a predictable foundation: stable workflows, SAS as the industry standard, and strict regulatory alignment. But the landscape is changing. The shift toward R – driven by open-source innovation, modern analytics tooling, and reproducibility – is transforming both the technology stack and the expectations placed on statistical programmers. AI, especially AI built on self-hosted models, does much to facilitate this shift.

What challenges do teams face when moving from SAS to R?
OL: This transition is not just a “change of syntax”. Teams quickly discover that moving from SAS to R exposes deeper gaps – in workflow reuse, documentation culture, standardization, and cross-project consistency.

How does AI help address these new challenges?
KD: This is precisely where enterprise AI Assistants become valuable. The AI Assistant is being built not as a chatbot or a code autocompletion tool, but as an intelligent statistical programming partner – one that understands how code, standards, and clinical context fit together.

Why isn’t a generic large language model (LLM) enough for clinical programming?
LP: General-purpose LLMs are trained on open-source code and broad technical text – but clinical programming is not “general purpose”. It operates in a world of strict structure, traceability, and regulatory validation.

In clinical trials, the goal is not just to “generate working R code”, but to ensure that the code aligns with CDISC standards (SDTM, ADaM), preserves traceability and auditability, follows validated templates for TLF generation, and finally reflects the study protocol and SAP context.

Without access to these domain constraints and corporate knowledge, a large language model remains an “intelligent code generator”, but not a clinical programming assistant.

What exactly is the AI Assistant, and what makes it different from generic AI tools?
VS: The AI Assistant is being developed to support statisticians and programmers in their everyday workflows – from defining ADaM derivations to validating SDTM mappings and producing traceable TLF outputs in R. Unlike generic code generators, it is built by statistical programmers and for statistical programmers. It encodes the practical expertise accumulated inside the Intego Group and combines Retrieval-Augmented Generation (RAG) with fine-tuning.

Why does the RAG approach matter here?
LP: A generic LLM can generate code – but it cannot understand which derivation rule is validated, which TLF template is compliant, or which SOP version applies. RAG solves this by letting the model retrieve before it generates. The process works as follows:

First, in the retrieval phase, the system locates relevant documents or templates (e.g., ADaM specs, SOPs, QC logs).

Second, in the augmentation phase, these fragments are injected into the prompt.

Finally, in the generation phase, the model produces an answer grounded in corporate and regulatory context.
This turns the AI Assistant into an expert system: not a black-box generator of plausible answers, but an auditable system whose outputs can be traced and justified.
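
The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Assistant's actual implementation: the document IDs and texts are hypothetical, retrieval uses a toy word-overlap score where a real system would more likely use embedding search, and `generate` is a stub standing in for a call to a self-hosted model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str  # hypothetical reference, e.g. a spec or SOP identifier
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve(query: str, corpus: list[Document], top_k: int = 2) -> list[Document]:
    """Retrieval phase: rank documents by word overlap with the query (toy scoring)."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc.text)), doc) for doc in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep at most top_k documents and drop those with no overlap at all.
    return [doc for score, doc in ranked[:top_k] if score > 0]


def augment(query: str, fragments: list[Document]) -> str:
    """Augmentation phase: inject the retrieved fragments into the prompt."""
    context = "\n".join(f"[{doc.doc_id}] {doc.text}" for doc in fragments)
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Generation phase: stub standing in for a call to a self-hosted model."""
    return f"(model output grounded in a prompt of {len(prompt)} characters)"


# Hypothetical knowledge base entries, invented for this example.
corpus = [
    Document("ADAM-SPEC-001", "ADSL derivation rule for treatment start date TRTSDT"),
    Document("SOP-QC-007", "QC log review procedure for TLF outputs"),
    Document("SDTM-MAP-003", "SDTM mapping notes for DM domain"),
]

query = "How is the treatment start date derived in ADSL?"
fragments = retrieve(query, corpus)
prompt = augment(query, fragments)
answer = generate(prompt)
sources = [doc.doc_id for doc in fragments]  # source list returned with the answer
```

Returning `sources` alongside `answer` is what makes the output traceable: a reviewer can see which fragments the model was shown before it generated anything.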

What engineering principles make RAG truly enterprise-grade?
VS: To be useful in a regulated environment, the knowledge base must be engineered around task-level building blocks rather than raw text. That means semantic chunking by derivation rule, function, or SOP reference; rich metadata (dataset, variables, CDISC version, package version); and traceability logs for GxP/QA validation.

The result is a system where every answer is auditable and source-linked.
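
As a rough illustration of these building blocks, the Python sketch below models one knowledge-base chunk with the metadata fields named above, plus a trace record that links an answer to its sources. The class, function names, field values, and the use of a SHA-256 content hash are all assumptions made for the example, not a description of the actual system.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One task-level unit of the knowledge base, e.g. a single derivation rule."""
    chunk_id: str
    text: str
    dataset: str
    variables: tuple[str, ...]
    cdisc_version: str
    package_version: str


def filter_chunks(chunks: list[Chunk], dataset: str, cdisc_version: str) -> list[Chunk]:
    """Use metadata to narrow retrieval to the applicable dataset and standard version."""
    return [
        c for c in chunks
        if c.dataset == dataset and c.cdisc_version == cdisc_version
    ]


def trace_record(query: str, used_chunks: list[Chunk]) -> dict:
    """Build a log entry that links an answer to the exact content it drew on."""
    return {
        "query": query,
        "sources": [
            {
                "chunk_id": c.chunk_id,
                # The hash pins the exact text, so later edits to a chunk are detectable.
                "sha256": hashlib.sha256(c.text.encode("utf-8")).hexdigest(),
            }
            for c in used_chunks
        ],
    }


# Hypothetical chunks; all identifiers and version strings are placeholders.
chunks = [
    Chunk("adsl-trtsdt", "TRTSDT is the date of first exposure to treatment.",
          "ADSL", ("TRTSDT",), "ADaMIG 1.3", "1.0.0"),
    Chunk("adae-trtemfl", "TRTEMFL flags treatment-emergent adverse events.",
          "ADAE", ("TRTEMFL",), "ADaMIG 1.3", "1.0.0"),
]

selected = filter_chunks(chunks, dataset="ADSL", cdisc_version="ADaMIG 1.3")
record = trace_record("How is TRTSDT derived?", selected)
log_line = json.dumps(record)  # one JSON line per answer, ready for an audit log
```

Chunking at the level of a single rule keeps each source link precise: the log points to one derivation rather than to a whole specification document.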

Why does fine-tuning matter if RAG already provides good context?
LP: RAG is a powerful foundation as it gives the model access to the right information at the right moment. But retrieval alone does not change how the model reasons about that information.
If RAG is external memory, then fine-tuning is internal competence. RAG retrieves facts; fine-tuning teaches interpretation and domain-specific reasoning.

How will fine-tuning improve the AI Assistant?
OL: Fine-tuning helps transform an LLM from a code generator into a true statistical programming expert. It adapts the model to domain-specific data and introduces Reinforcement Learning from Human Feedback (RLHF) for continuous improvement in accuracy, style, and compliance awareness.
In regulated domains like clinical programming, expertise must evolve with new standards, SOP revisions, and therapeutic area conventions, which means corporate models will be trained continuously, not once.

How can the AI Assistant help teams transition from SAS to R?
KD: The move from SAS to R is not just about changing tools; it is a transformation of workflows and validation practices. The AI Assistant is designed to serve as a bridge: generating R equivalents of SAS procedures, explaining the statistical reasoning behind transformations, and aligning outputs with company-specific conventions.

Beyond that, it acts as a training companion, providing guided explanations and real-world examples to accelerate learning.

Why does self-hosted AI matter for clinical research?
OL: In clinical research, data privacy is a regulatory requirement, not an option. The AI Assistant is being designed as a self-hosted AI solution, operating within the organization’s own infrastructure.
As discussed in our earlier post, Why Self-Hosted LLMs May Be the Future for Clinical Research, running models on your own infrastructure offers three key strategic advantages:

data stays entirely within internal systems, ensuring confidentiality and compliance;
the model can be customized to company SOPs and CDISC implementations;
validation and governance remain fully auditable for QA and regulatory review.

How will AI change the role of the statistical programmer?
VS: As AI becomes part of statistical workflows, routine code generation will be increasingly automated. Human expertise will shift toward designing and supervising how the model applies clinical knowledge. In practice, this means moving from writing code to defining logic and standards. AI handles repetition, and the human provides judgment, context, and compliance oversight.
