Regeneron: Proactive Pharmaceutical Process Monitoring

Regeneron’s BioPerceptron platform is an innovative and integrated method to detect subvisible particles in formulations. It leverages deep learning, AI, end-to-end cloud orchestration, and advanced visualization capabilities to transform proprietary, unstructured data into proactive pharmaceutical process monitoring.

< 15 min

time to run each MFI classification


> 94%

positive prediction rate

0.1 - 10 μm

size of SVPs that can be detected


Monitoring subvisible particles (SVPs) during the development of injectable formulations is critical to ensuring patient safety and product quality, both at Regeneron and across the biopharmaceutical industry.

Currently, there is a gap in routinely measuring and controlling subvisible particles smaller than 10 μm in biotherapeutic products. Recent studies indicate that proteinaceous particles in the subvisible size range (0.1–10 μm) may aggregate and lead to downstream process failure or immunogenic reactions in patients.

Regeneron’s BioPerceptron platform — a deep learning solution built with Dataiku for high-throughput biomedical image processing that leverages AI, end-to-end cloud orchestration, and advanced visualization capabilities — addresses this gap, transforming proprietary, unstructured data into proactive pharmaceutical process monitoring. 

Solution Overview & Benefits

While the problem of detecting unseen contaminants in drug formulations is not new (and neither is the use of deep learning methods for image analysis and classification), Regeneron’s approach is a timely use of deep learning in a way that makes life-saving medicines more efficacious and less risky by extending the industry’s highest standards of quality biomanufacturing.

Current industry standards use light obscuration techniques (based on a particle's ability to reduce measured light intensity as it passes through a light beam) to measure SVPs, but the method cannot distinguish between different types of particles (synthetic vs. proteinaceous) in a test solution.

Regeneron’s IT teams partnered with formulation development scientists to develop a deep learning convolutional neural network (CNN) approach that assigns weights (importance) to various features of an image to be able to “learn” and differentiate characteristics of one image versus another. 
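To make the idea of "weights assigned to image features" concrete, here is a minimal sketch of the convolution operation at the core of a CNN. It is an illustration only, not Regeneron's model: the function name `convolve2d`, the toy image, and the hand-picked kernel are all assumptions. In a trained CNN the kernel values would be learned from labeled images rather than written by hand.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation of a grayscale image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            # Weighted sum of the image patch under the kernel.
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 "image": dark background (0) with a bright region on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked vertical-edge kernel; a CNN would learn such weights from data.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, edge_kernel)
# The response is strongest in the column where dark meets bright.
```

A real network stacks many such filters with nonlinearities and a final classification layer, so that edge, texture, and shape responses combine into a particle-type prediction.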

This initial solution was then expanded into a cloud-native platform that can:

  1. Automatically parse and ingest existing formulations data sources.
  2. Analyze and classify the particles detected within the images present.
  3. Detect anomalies by applying quality threshold limits.
  4. Illustrate the diagnostic findings through a self-service visualization service.
  5. Provide rapid feedback for contaminant detection and corrective action.
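Step 3 above, applying quality threshold limits to classified particles, could look roughly like the sketch below. The limit values, label names, and the `flag_anomalies` helper are hypothetical placeholders; the source does not state which limits Regeneron applies.

```python
from collections import Counter

# Hypothetical per-type limits in particles per mL (illustrative values only).
LIMITS_PER_ML = {"protein": 600, "silicone_oil": 6000}


def flag_anomalies(predicted_labels, sample_volume_ml, limits=LIMITS_PER_ML):
    """Compare per-type particle concentrations against threshold limits."""
    counts = Counter(predicted_labels)
    report = {}
    for particle_type, limit in limits.items():
        concentration = counts.get(particle_type, 0) / sample_volume_ml
        report[particle_type] = {
            "per_ml": concentration,
            "limit": limit,
            "exceeds": concentration > limit,
        }
    return report


# Example: classifier output for a 1 mL sample (made-up counts).
labels = ["protein"] * 700 + ["silicone_oil"] * 100
report = flag_anomalies(labels, sample_volume_ml=1.0)
```

Because the limits are keyed by particle type, a sample can be flagged for one contaminant class while passing for another, which is the kind of diagnostic detail that size-only counting cannot provide.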

Thanks to development on Dataiku combined with the use of GPUs to streamline image processing pipelines, each microscopic flow imaging (MFI) classification takes less than 15 minutes to complete. The classification achieves a positive prediction rate of better than 94% for silicone oil and protein SVPs across various sizes.
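The source does not define "positive prediction rate." The sketch below assumes it means positive predictive value (precision), that is, TP / (TP + FP) computed per particle class. The function name and the example labels are made up for illustration and are not Regeneron's data.

```python
def positive_predictive_value(y_true, y_pred, label):
    """Fraction of particles predicted as `label` that truly are `label`."""
    # True labels of every particle the model assigned to `label`.
    predicted_as_label = [t for t, p in zip(y_true, y_pred) if p == label]
    if not predicted_as_label:
        return None  # Undefined when the model never predicts this class.
    true_positives = sum(1 for t in predicted_as_label if t == label)
    return true_positives / len(predicted_as_label)


# Made-up ground truth and predictions for five particles.
y_true = ["protein", "protein", "silicone_oil", "silicone_oil", "protein"]
y_pred = ["protein", "protein", "protein", "silicone_oil", "silicone_oil"]

ppv_protein = positive_predictive_value(y_true, y_pred, "protein")
ppv_oil = positive_predictive_value(y_true, y_pred, "silicone_oil")
```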

In addition to potentially improved product quality and safety plus improved process development for more efficient manufacturing scale-up, the BioPerceptron platform is modular, which offers the potential of simplifying regulatory validation by only changing single components rather than an entire system.

Challenges Overcome

For this use case, there were business and technical challenges as well as data and modeling challenges. 

On the business side, unseen contaminant aggregates in pharmaceutical drug development present a multi-factorial challenge. They are very small (in the range of 1-25 μm) and they vary in type (silicone oil droplets, protein aggregates, fibers, glass particles, or air bubbles). Classifying particles by type as well as size has several advantages; in particular, knowing which type of SVP is present in a pharmaceutical product helps diagnose the source of the contaminant.

In addition, as with any machine learning or AI exercise, there are some more specific data and modeling challenges associated with this particular use case:

Understanding the Data Landscape 

The data in this case consists of very large, high-resolution microscopy images. The challenge was to identify the different particles and their real-world characteristics quickly enough to apply industry quality limits in a timely manner.

To solve this challenge, IT worked closely with their formulations research partners to replicate development and manufacturing conditions and generate samples representative of real-world data. These MFI files are captured and tagged with appropriate metadata from the microscopy system. Then, the image datasets are pushed through the parallelized high-throughput classification pipeline.

Verifying the Categorization of the Datasets Used for Model Learning

Regeneron applied state-of-the-art data validation and unsupervised learning methods to understand underlying patterns within the image data and systematically capture data inconsistencies. Methods include neural network-based dimensionality reduction, hierarchical clustering, and multi-dimensional variance analysis.
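As one illustration of how clustering can surface labeling inconsistencies, the sketch below runs a small single-linkage hierarchical clustering and flags particles whose label disagrees with the majority label of their cluster. Everything here is assumed for illustration: the two features (diameter in μm and circularity), the data points, the labels, and the helper names. The source does not describe Regeneron's actual features or linkage method, and the neural network-based dimensionality reduction step is omitted.

```python
import math
from collections import Counter


def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def find_label_suspects(clusters, labels):
    """Indices whose label differs from their cluster's majority label."""
    suspects = []
    for members in clusters:
        majority = Counter(labels[i] for i in members).most_common(1)[0][0]
        suspects.extend(i for i in members if labels[i] != majority)
    return sorted(suspects)


# Hypothetical (diameter_um, circularity) features and training labels.
points = [(1.0, 0.9), (1.2, 0.95), (1.1, 0.92), (5.0, 0.4), (5.2, 0.35), (1.15, 0.9)]
labels = ["oil", "oil", "oil", "protein", "protein", "protein"]

clusters = single_linkage(points, n_clusters=2)
suspects = find_label_suspects(clusters, labels)
```

In this toy data the last particle sits among the oil droplets in feature space but carries a protein label, so it is returned as a candidate for human review rather than being silently trusted or discarded.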

Choosing a Training Dataset Adequately Representative of Real-World Conditions

To bring this use case to life, Regeneron needed training data that was adequately representative of real-world conditions in terms of both particle size and type. Their solution balanced the use of randomized and stratified sampling to accurately capture the distribution of particle types and sizes expected in production settings without compromising the model’s need for sufficient data samples.
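One way to combine randomized and stratified sampling is to group particles into strata by type and size bin, then draw a random sample within each stratum up to a cap. The sketch below shows that idea under stated assumptions: the size bins, the cap, the record fields, and the function names are hypothetical and are not taken from Regeneron's pipeline.

```python
import random
from collections import defaultdict


def size_bin(size_um):
    """Hypothetical size bins; real bin edges would come from the scientists."""
    if size_um < 2:
        return "<2um"
    if size_um < 10:
        return "2-10um"
    return ">=10um"


def stratified_sample(records, key, per_stratum, seed=0):
    """Randomly draw up to `per_stratum` records from each stratum."""
    rng = random.Random(seed)  # Seeded for reproducible training sets.
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        # Rare strata contribute everything they have; common ones are capped.
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up, imbalanced particle records.
records = (
    [{"type": "protein", "size_um": 1.0}] * 100
    + [{"type": "protein", "size_um": 5.0}] * 10
    + [{"type": "silicone_oil", "size_um": 1.5}] * 50
)

sample = stratified_sample(
    records,
    key=lambda r: (r["type"], size_bin(r["size_um"])),
    per_stratum=20,
)
```

Capping per stratum keeps abundant particle classes from dominating training. A cap that is too low starves the model of data, which is the trade-off the paragraph above describes.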

Tuning Model Parameters to Optimize Predictive Value

Regeneron integrated existing cloud capabilities with their deep learning pipeline to seamlessly leverage elastic and parallelizable compute resources for time-consuming model learning cycles.
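The pattern of fanning tuning runs out over parallel workers can be sketched with Python's standard library. This is a local stand-in only: a thread pool replaces elastic cloud compute, and `evaluate` is a placeholder toy objective, not a real training cycle. The hyperparameter names and grid values are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Placeholder for one training-and-validation cycle; returns a toy score."""
    learning_rate, batch_size = params
    # Toy objective peaking at learning_rate=0.01, batch_size=64.
    return -abs(learning_rate - 0.01) * 100 - abs(batch_size - 64) / 64


# Hypothetical hyperparameter grid: every (learning_rate, batch_size) pair.
grid = list(product([0.001, 0.01, 0.1], [32, 64, 128]))

# Each configuration is independent, so the runs can execute concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, grid))

best_params = grid[scores.index(max(scores))]
```

Because each configuration is evaluated independently, the same structure scales from a few local workers to many cloud workers without changing the search logic.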

Watch now: Abdul Shaik, Head of Enterprise Data & Analytics at Regeneron, talks about getting value from AI at the Everyday AI Conference NYC in September 2023.


Regeneron’s BioPerceptron platform is innovative because of its: 

  • Analysis of particles under 10 μm, which is novel given resolution constraints and morphological invariances in very small particles.
  • Automated ingestion of MFI images and model execution through cloud orchestration methods.
  • Interactive and self-service results (particle visualization dashboards) for computational biologists and development formulations scientists working collaboratively with the data science team.
  • Pre-model execution data scan in near real-time, which outputs an initial data triage that can be used for data profiling and model evaluation.
  • Model results integrated automatically with the source MFI data.
  • Extensible framework, which is model agnostic, allowing upgrades and changes to the algorithm without changing the workflow.
  • High throughput: runs can be parallelized, with each MFI classification run taking less than 15 minutes.

While each of those points may not be noteworthy in isolation, the innovation in Regeneron’s solution lies in the aggregate.
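The "model agnostic" property in the list above can be illustrated with a small interface sketch: the workflow depends only on a `predict` method, so one classifier can be swapped for another without touching the surrounding steps. The class names, the `diameter_um` field, and the toy rules are hypothetical and are not Regeneron's implementation.

```python
from typing import List, Protocol, Sequence


class ParticleClassifier(Protocol):
    """Anything with a matching `predict` method can plug into the workflow."""

    def predict(self, images: Sequence[dict]) -> List[str]:
        ...


class AlwaysProtein:
    """Trivial baseline model used only to demonstrate swapping."""

    def predict(self, images):
        return ["protein" for _ in images]


class SizeRule:
    """Toy rule-based model keyed on a hypothetical `diameter_um` field."""

    def predict(self, images):
        return ["silicone_oil" if img["diameter_um"] < 5 else "protein" for img in images]


def run_workflow(model: ParticleClassifier, images):
    """Workflow step that never needs to know which model it was given."""
    labels = model.predict(images)
    return {"n_images": len(images), "labels": labels}


images = [{"diameter_um": 2.0}, {"diameter_um": 8.0}]
baseline_result = run_workflow(AlwaysProtein(), images)
rule_result = run_workflow(SizeRule(), images)
```

Keeping the contract this narrow is also what would let a single component be revalidated on its own when a model is upgraded.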

“We consider subvisible particle classification as a single use case in a larger, cloud-native framework for ingesting, parsing, and processing images with the intent of addressing pressing biological questions.”
Shah Nawaz, CTO/VP, Digital Transformation at Regeneron

Keys to Success

There are several factors that enabled Regeneron’s innovation with the development of the BioPerceptron platform. 


Watch now: Shah Nawaz, CTO/VP, Digital Transformation at Regeneron, takes the stage at the Everyday AI Conference NYC in September 2023.

First, Regeneron has established robust data transfer, data validation, and data privacy frameworks to be able to scale pilot projects quickly once they prove feasibility. In addition, through an iterative process of human validation and statistical methods, Regeneron was able to establish the right datasets to test the validity of CNN models in trial.

Importantly, thanks to Dataiku, Regeneron has a well-established AI and machine learning test bed and scientific compute environment that lets teams iterate through experiments relatively quickly. Because both data and non-data professionals can use and collaborate in Dataiku, Regeneron found success by bringing together the right mix of engaged wet-lab scientists, data scientists, and compute specialists with a collaborative spirit.
