Monitoring subvisible particles (SVPs) during the development of injectable formulations is critical to ensuring patient safety and product quality, both at Regeneron and across the biopharmaceutical industry at large.
Currently, there is a gap in routinely measuring and controlling subvisible particles smaller than 10 μm in biotherapeutic products. Recent studies indicate that proteinaceous particles in the subvisible size range (0.1–10 μm) can aggregate and lead to downstream process failures or immunogenic reactions in patients.
Regeneron’s BioPerceptron platform, a deep learning solution for high-throughput biomedical image processing built with Dataiku that combines end-to-end cloud orchestration with advanced visualization capabilities, addresses this gap by transforming proprietary, unstructured data into proactive pharmaceutical process monitoring.
Solution Overview & Benefits
While the problem of determining possible unseen contaminants in drug formulations is not new (and neither is the use of deep learning for image analysis and classification), Regeneron’s approach is a timely application of deep learning that makes life-saving medicines more efficacious and less risky by extending the industry’s highest standards of quality biomanufacturing.
Current industry standards use light obscuration techniques (based on the ability of a particle to reduce measured light intensity when passing through a light beam) to measure SVPs, but the method cannot distinguish different types of particles (synthetic vs. proteinaceous) in a test solution.
Regeneron’s IT teams partnered with formulation development scientists to develop a deep learning convolutional neural network (CNN) approach that assigns weights (importance) to various features of an image, enabling the model to “learn” and differentiate the characteristics of one image versus another.
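As an illustration of the general approach (not Regeneron’s actual architecture, which is not published here), a minimal CNN classifier for small grayscale particle crops might look like the following PyTorch sketch; the class names, input size, and layer dimensions are all assumptions.

```python
# Minimal sketch of a CNN particle classifier (illustrative only; the
# real BioPerceptron architecture is not described in this article).
import torch
import torch.nn as nn

# Assumed particle classes, based on the types discussed in this article.
PARTICLE_CLASSES = ["silicone_oil", "protein_aggregate", "fiber", "glass", "air_bubble"]

class ParticleCNN(nn.Module):
    """Classify small grayscale particle crops (assumed 64x64 pixels)."""
    def __init__(self, n_classes: int = len(PARTICLE_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 16 -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, n_classes),  # learned weights differentiate classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Quick shape check on a dummy batch of 8 crops.
model = ParticleCNN()
logits = model(torch.randn(8, 1, 64, 64))
print(logits.shape)  # torch.Size([8, 5])
```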
This initial solution was then expanded into a cloud-native platform that can (a hypothetical pipeline sketch follows this list):
- Automatically parse and ingest existing formulations data sources.
- Analyze and classify the particles detected within the ingested images.
- Detect anomalies by applying quality threshold limits.
- Illustrate the diagnostic findings through a self-service visualization service.
- Provide rapid feedback for contaminant detection and corrective action.
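To make the flow above concrete, here is a hypothetical skeleton of how such a pipeline might be staged. The function names and the threshold value are invented for illustration; in the actual platform these stages are orchestrated in Dataiku.

```python
# Hypothetical end-to-end pipeline skeleton (illustrative only).
from dataclasses import dataclass
from pathlib import Path

# Assumed quality threshold: flag a run if more than 2% of detected
# particles are classified as protein aggregates (value is made up).
PROTEIN_AGGREGATE_LIMIT = 0.02

@dataclass
class RunResult:
    run_id: str
    class_counts: dict
    flagged: bool

def ingest(source_dir: Path) -> list[Path]:
    """Parse an MFI export directory and collect image files."""
    return sorted(source_dir.glob("*.png"))

def classify(images: list[Path]) -> dict:
    """Stand-in for the CNN classifier; returns counts per particle class."""
    # In the real platform this step runs the trained model on GPUs.
    return {"silicone_oil": 0, "protein_aggregate": 0, "other": 0}

def apply_thresholds(run_id: str, counts: dict) -> RunResult:
    """Detect anomalies by applying the quality threshold limit."""
    total = max(sum(counts.values()), 1)
    flagged = counts.get("protein_aggregate", 0) / total > PROTEIN_AGGREGATE_LIMIT
    return RunResult(run_id, counts, flagged)

def run_pipeline(run_id: str, source_dir: Path) -> RunResult:
    result = apply_thresholds(run_id, classify(ingest(source_dir)))
    # A real pipeline would also publish results to the visualization
    # dashboards and alert scientists when result.flagged is True.
    return result
```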
Thanks to development on Dataiku combined with the use of GPUs to streamline image processing pipelines, each micro-flow imaging (MFI) classification takes less than 15 minutes to complete. The classification achieves a positive prediction rate better than 94% for silicone and protein SVPs across various sizes.
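“Positive prediction rate” here corresponds to positive predictive value (precision): of the particles the model labels as a given class, the fraction that truly belong to it. A short sketch with toy labels, using scikit-learn:

```python
# Positive predictive value (precision) per class, using scikit-learn.
from sklearn.metrics import precision_score

# Toy labels for illustration: 0 = silicone oil, 1 = protein aggregate.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

ppv = precision_score(y_true, y_pred, average=None)
print(ppv)  # PPV per class; the article reports >0.94 in production
```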
In addition to potentially improved product quality and safety, plus more efficient process development for manufacturing scale-up, the BioPerceptron platform is modular: individual components can be changed without replacing the entire system, which has the potential to simplify regulatory validation. The Regeneron team used Dataiku as the front-end development tool to build and orchestrate workflows that utilized a host of AWS services, including Amazon Bedrock for large language models (LLMs).
Challenges Overcome
For this use case, there were business and technical challenges as well as data and modeling challenges.
On the business side, unseen contaminant aggregates in pharmaceutical drug development present a multi-factorial challenge. They are very small (in the range of 1–25 μm) and they vary in type (silicone oil droplets, protein aggregates, fibers, glass particles, or air bubbles). Classifying type as well as size has several advantages: understanding what type of SVP exists in the pharmaceutical product aids in diagnosing the source of the contaminant.
In addition, as with any machine learning or AI effort, there were several specific data and modeling challenges associated with this particular use case:
Understanding the Data Landscape
The data in this case consists of high-resolution microscopy images. The challenge was to identify the different particles and their real-world characteristics from very large, high-resolution microscopy images quickly enough to apply industry quality limits.
To solve this challenge, IT worked closely with their formulations research partners to replicate development and manufacturing conditions and generate samples representative of real-world data. The resulting MFI files are captured and tagged with appropriate metadata by the microscopy system, and the image datasets are then pushed through the parallelized, high-throughput classification pipeline.
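A hypothetical sketch of the capture-and-tag step, assuming the microscopy system writes a CSV sidecar of per-image metadata (the file layout and field names are invented for illustration):

```python
# Hypothetical MFI ingestion: pair each image with its sidecar metadata.
import csv
from pathlib import Path

def load_run_metadata(run_dir: Path) -> dict:
    """Read an assumed 'metadata.csv' keyed by image filename."""
    with open(run_dir / "metadata.csv", newline="") as f:
        return {row["image"]: row for row in csv.DictReader(f)}

def tagged_images(run_dir: Path):
    """Yield (image_path, metadata) pairs ready for the classification pipeline."""
    meta = load_run_metadata(run_dir)
    for image_path in sorted(run_dir.glob("*.png")):
        yield image_path, meta.get(image_path.name, {})
```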
Verifying the Categorization of the Datasets Used for Model Learning
Regeneron applied state-of-the-art data validation and unsupervised learning methods to understand underlying patterns within the image data and systematically capture data inconsistencies. Methods include neural network-based dimensionality reduction, hierarchical clustering, and multi-dimensional variance analysis.
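A simplified sketch of this validation idea follows, substituting PCA for the neural network-based dimensionality reduction the team describes, then hierarchically clustering the embeddings to surface images whose cluster disagrees with their assigned label:

```python
# Sketch: flag label/cluster disagreements in a curated image dataset.
# (PCA stands in here for neural network-based dimensionality reduction.)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 512))   # assumed per-image feature vectors
labels = rng.integers(0, 3, size=200)    # assumed curated class labels

embeddings = PCA(n_components=10).fit_transform(features)
clusters = AgglomerativeClustering(n_clusters=3).fit_predict(embeddings)

# Images whose cluster is dominated by a different label are candidates
# for re-review (a crude stand-in for systematic inconsistency capture).
for c in range(3):
    members = labels[clusters == c]
    majority = np.bincount(members).argmax()
    suspects = np.flatnonzero((clusters == c) & (labels != majority))
    print(f"cluster {c}: majority label {majority}, {len(suspects)} suspects")
```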
Choosing a Training Dataset Adequately Representative of Real-World Conditions
To bring this use case to life, Regeneron needed training data that was adequately representative, in terms of both size and type, of real-world conditions. Their solution balanced randomized and stratified sampling to accurately capture the distribution of particle types and sizes expected in production settings without compromising the model’s need for sufficient data samples.
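A minimal sketch of the stratified side of that balance, using scikit-learn to hold the joint type-and-size distribution steady between training and test splits (the strata definition and bin counts are assumptions):

```python
# Stratified split that preserves particle type x size-bin proportions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
particle_type = rng.integers(0, 3, size=n)  # assumed 3 particle types
size_bin = rng.integers(0, 4, size=n)       # assumed 4 size bins
strata = particle_type * 4 + size_bin       # joint stratum per image
image_ids = np.arange(n)

train_ids, test_ids = train_test_split(
    image_ids, test_size=0.2, stratify=strata, random_state=42
)
print(len(train_ids), len(test_ids))  # 800 200, with matched strata proportions
```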
Tuning Model Parameters to Optimize Predictive Value
Regeneron integrated existing cloud capabilities with their deep learning pipeline to seamlessly leverage elastic and parallelizable compute resources for time-consuming model learning cycles.
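As a toy illustration of the kind of parallelism that elastic compute enables (not Regeneron’s actual training setup), independent hyperparameter configurations can be fanned out across workers:

```python
# Toy parallel hyperparameter sweep; each configuration could run on its
# own cloud worker in a real elastic setup.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def train_and_score(config: dict) -> tuple:
    """Stand-in for one model learning cycle; returns a fake score."""
    lr, batch = config["lr"], config["batch_size"]
    return config, 1.0 / (1.0 + abs(lr - 1e-3) * batch)  # dummy objective

if __name__ == "__main__":
    grid = [{"lr": lr, "batch_size": b}
            for lr, b in product([1e-4, 1e-3, 1e-2], [32, 64])]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(train_and_score, grid))
    best_config, best_score = max(results, key=lambda r: r[1])
    print(best_config, best_score)
```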
Innovation
Regeneron’s BioPerceptron platform is innovative because of its:
- Analysis of particles under 10 μm, which is novel given resolution constraints and the morphological similarity of very small particles.
- Automated ingestion of MFI images and model execution through cloud orchestration methods.
- Interactive and self-service results (particle visualization dashboards) for computational biologists and formulation development scientists working collaboratively with the data science team.
- Pre-model execution data scan in near real-time, which outputs an initial data triage that can be used for data profiling and model evaluation.
- Model results integrated automatically with the source MFI data.
- Extensible framework, which is model agnostic, allowing upgrades and changes to the algorithm without changing the workflow.
- High throughput: classification runs can be parallelized, with each MFI classification completing in less than 15 minutes.
While no single one of these points is noteworthy in isolation, the innovation in Regeneron’s solution lies in the aggregate.
Keys to Success
There are several factors that enabled Regeneron’s innovation with the development of the BioPerceptron platform.
First, Regeneron has established robust data transfer, data validation, and data privacy frameworks that allow pilot projects to scale quickly once they prove feasible. In addition, through an iterative process of human validation and statistical methods, Regeneron was able to establish the right datasets to test the validity of the CNN models under trial.
Importantly, thanks to Dataiku, Regeneron has a well-established AI and machine learning test bed and scientific compute environment that enables relatively rapid iteration through experiments. With the ability for both data and non-data professionals to use and collaborate in Dataiku, Regeneron found success by bringing together the right mix of engaged wet-lab scientists, data scientists, and compute specialists with a collaborative spirit.