A Foundation Model for Cell Segmentation

Cells are a fundamental unit of biological organization, and identifying them in imaging data – cell segmentation – is a critical task for various cellular imaging experiments. While deep learning methods have led to substantial progress on this problem, most models in use are specialist models that work well for specific domains. Methods that have learned the general notion of “what is a cell” and can identify them across different domains of cellular imaging data have proven elusive. In this work, we present CellSAM, a foundation model for cell segmentation that generalizes across diverse cellular imaging data. CellSAM builds on top of the Segment Anything Model (SAM) by developing a prompt engineering approach for mask generation. We train an object detector, CellFinder, to automatically detect cells and prompt SAM to generate segmentations. We show that this approach allows a single model to achieve human-level performance for segmenting images of mammalian cells (in tissues and cell culture), yeast, and bacteria collected across various imaging modalities. We show that CellSAM has strong zero-shot performance and can be improved with a few examples via few-shot learning. We also show that CellSAM can unify bioimaging analysis workflows such as spatial transcriptomics and cell tracking. A deployed version of CellSAM is available at https://cellsam.deepcell.org/.


Introduction
Accurate cell segmentation is crucial for quantitative analysis and interpretation of various cellular imaging experiments.
Modern spatial genomics assays can produce data on the location and abundance of 10^1 to 10^2 protein species and 10^2 to 10^4 RNA species simultaneously in living and fixed tissues [1][2][3][4][5]. These data shed light on the biology of healthy and diseased tissues but are challenging to interpret. Cell segmentation enables these data to be converted to interpretable tissue maps of protein localization and transcript abundances. Similarly, live-cell imaging provides insight into dynamic phenomena in bacterial and mammalian cell biology. Mechanistic insights into critical phenomena such as the mechanical behavior of the bacterial cell wall 6,7, information transmission in cell signaling pathways [8][9][10][11], heterogeneity in immune cell behavior
While promising, these studies reported challenges adapting SAM to these new use cases 36,43. These challenges include reduced performance and uncertain boundaries when transitioning from natural to medical images. Cellular images contain additional complications: they can involve different imaging modalities (e.g., phase microscopy vs. fluorescence microscopy), thousands of objects in a field of view (as opposed to dozens in a natural image), and uncertain and noisy boundaries (artifacts of projecting 3D objects into a 2D plane) 43. In addition to these challenges, SAM's default prompting strategy does not allow for accurate inference for cellular images. Currently, the automated prompting of SAM uses a uniform grid of points to generate masks, an approach poorly suited to cellular images given the wide variation of cell densities. More precise prompting (e.g., a bounding box or mask) requires prior knowledge of cell locations. This creates a weak tautology: SAM can find the cells provided it knows a priori where they are. This limitation makes it challenging for SAM to serve as a foundation model for cell segmentation, because it can accelerate labeling but still requires human input for inference. A solution to this problem would enable SAM-like models to serve as foundation models and knowledge engines, as they could accelerate the generation of labeled data, learn from them, and make that knowledge accessible to life scientists via inference.
In this work, we developed CellSAM, a foundation model for cell segmentation (Fig. 1). CellSAM extends the SAM methodology to perform automated cellular instance segmentation. To achieve this, we first assembled a comprehensive dataset for cell segmentation spanning five different morphological archetypes. To automate inference with SAM, we took a prompt engineering approach and explored the best ways to prompt SAM to generate high-quality masks. We observed that bounding boxes consistently generated high-quality masks compared to alternative approaches. We further identified a compute-efficient method to fine-tune SAM to achieve even better performance. To facilitate automated inference through prompting, we developed CellFinder, a transformer-based object detector that uses the Anchor DETR framework.
Within CellSAM, CellFinder and SAM share the same ViT backbone; the bounding boxes generated by CellFinder are then used as prompts for SAM, enumerating masks for all the cells in an image. We trained CellSAM on a large, diverse corpus of cellular imaging data, enabling it to achieve state-of-the-art (SOTA) performance on nine datasets. We also evaluated CellSAM's zero-shot performance using a held-out dataset 44, demonstrating that it outperforms existing methods for zero-shot segmentation. The datasets described in this work are available at https://deepcell.readthedocs.io/en/master/data-gallery/; a deployed version of CellSAM is available at our lab's web portal https://deepcell.org.

Construction of a dataset for general cell segmentation
A significant challenge with existing cellular segmentation methods is their inability to generalize across various imaging modalities and cell morphologies. To address this, we curated a dataset from the literature containing 2D images of various cell morphologies (mammalian cells in tissues and adherent cell culture, yeast cells, bacterial cells, and mammalian cell nuclei) and imaging modalities (fluorescence, brightfield, phase contrast, hematoxylin & eosin staining, and mass cytometry imaging). We inspected each ingested dataset for data leaks between training and testing splits and removed them when present. Our final dataset consisted of TissueNet 18, DeepBacs 45, BriFiSeg 46, Cellpose 15,16, Omnipose 47,48, YeastNet 49, YeaZ 50, the 2018 Kaggle Data Science Bowl dataset (DSB) 51, and an internally collected dataset of phase microscopy images across eight mammalian cell lines (Phase400). For evaluation, we group these datasets into four types: Tissue, Cell Culture, Bacteria, and Yeast. As the DSB 51 comprises cell nuclei that span several of these types, we evaluate it separately and refer to it as Nuclear. While our method focuses on whole-cell segmentation, we included DSB 51 because cell nuclei are often used as a surrogate when the information necessary for whole-cell segmentation (e.g., cell membrane markers) is absent from an image. A summary of the dataset is shown in Figure 2a.
To evaluate CellSAM's zero-shot performance, we used a held-out LIVECell 44 dataset. A detailed description of data sources and pre-processing steps can be found in Appendix A.

Bounding boxes are accurate prompts for cell segmentation with SAM
For accurate inference, SAM needs to be provided with approximate information about the location of cells in the form of prompts. To better engineer prompts, we first assessed SAM's ability to generate masks when provided prompts derived from ground truth labels, either point prompts (derived from the cell's center of mass) or bounding box prompts. For these tests, we used the pre-trained model weights that were publicly released 35. Our benchmarking results, shown in Figure 2b, revealed that bounding boxes had significantly higher zero-shot performance than point prompting, although both approaches struggled with Tissue imaging data. To improve SAM's mask generation ability for cellular image data, we explored fine-tuning SAM on our compiled data to help it bridge the gap from natural to cellular images.
During these fine-tuning experiments, we observed that fine-tuning all of SAM was unnecessary; instead, we only needed to fine-tune the layers connecting SAM's ViT to its decoder, the model neck, to achieve good performance. All other layers can be frozen. Fine-tuning SAM in this fashion led to a model capable of generating high-quality cell masks when prompted by ground truth bounding boxes, as seen in Figure 2b.
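The two ground-truth prompt types compared above can be derived directly from an integer label mask. The following is a minimal sketch, assuming the label convention described in Appendix A (zero for background, unique positive integers for cells); the function name `prompts_from_labels` and the (x, y) and (x_min, y_min, x_max, y_max) output conventions are illustrative choices, not taken from the CellSAM code base.

```python
def prompts_from_labels(label_mask):
    """Derive a point prompt and a box prompt for every labeled cell.

    label_mask: 2D list of ints, 0 = background, positive ints = cell ids.
    Returns {cell_id: {"point": (x, y), "box": (x_min, y_min, x_max, y_max)}}.
    """
    stats = {}
    for r, row in enumerate(label_mask):
        for c, lab in enumerate(row):
            if lab == 0:  # skip background pixels
                continue
            s = stats.setdefault(
                lab, {"n": 0, "sum_r": 0, "sum_c": 0,
                      "r0": r, "r1": r, "c0": c, "c1": c})
            s["n"] += 1
            s["sum_r"] += r
            s["sum_c"] += c
            s["r0"], s["r1"] = min(s["r0"], r), max(s["r1"], r)
            s["c0"], s["c1"] = min(s["c0"], c), max(s["c1"], c)

    prompts = {}
    for lab, s in stats.items():
        prompts[lab] = {
            # Center of mass of the cell's pixels, as (x, y).
            "point": (s["sum_c"] / s["n"], s["sum_r"] / s["n"]),
            # Tightest axis-aligned box around the cell's pixels.
            "box": (s["c0"], s["r0"], s["c1"], s["r1"]),
        }
    return prompts


example_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [2, 0, 0, 0],
]
example_prompts = prompts_from_labels(example_mask)
```

Note that the center of mass of a non-convex cell can fall outside the cell itself, which is one reason a point prompt may be less informative than a box.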

CellFinder and CellSAM enable accurate and automated cell segmentation
Given that bounding box prompts yield accurate segmentation masks from SAM across various datasets, we sought to develop an object detector that could generate prompts for SAM in an automated fashion. Given that our zero-shot experiments demonstrated that ViT features can form robust internal representations of cellular images, we reasoned we could build an object detector on top of the image features generated by SAM's ViT. Previous work has explored this space and demonstrated that ViT backbones can achieve SOTA performance on natural images 52,53. For our object detection module, we use the Anchor DETR framework 54, using the same ViT backbone as the SAM module; we call this object detection module CellFinder. Anchor DETR is well suited for object detection in cellular images because it formulates object detection as a set prediction task. This allows it, in theory, to perform cell segmentation in images that are densely packed or contain overlapping objects, common occurrences in cellular imaging data. These cases are challenging to address with existing methods. Bounding box methods (e.g., the R-CNN family 55,56) rely on non-maximum suppression, leading to poor performance in this regime. Methods that frame cell segmentation as a dense, pixel-wise prediction task (e.g., Mesmer 18 and Cellpose 15) assume that each pixel can be uniquely assigned to a single cell and cannot handle overlapping objects.
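To illustrate why non-maximum suppression (NMS) is problematic for densely packed cells, the sketch below implements a generic greedy NMS and applies it to two heavily overlapping boxes. It is a textbook version written for illustration, not the post-processing of any specific detector; the box coordinates, scores, and the 0.5 IoU threshold are made-up example values.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps
    an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two genuinely distinct but tightly packed cells, plus one isolated cell.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (30, 30, 40, 40)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

In this example the second box is suppressed even though it corresponds to a separate cell, because its IoU with the first box (2/3) exceeds the threshold. A set prediction formulation does not need this suppression step.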
We train CellSAM in two stages; the full details can be found in Appendix B. In the first stage, we train CellFinder on the object detection task. We convert the ground truth cell masks into bounding boxes and train the ViT backbone and the CellFinder module. Once CellFinder is trained, we freeze the model weights of the ViT and fine-tune the SAM module as described above. This accounts for the distribution shifts in the ViT features that occur during the CellFinder training. Once training is complete, we use CellFinder to prompt SAM's mask decoder. We refer to the collective method as CellSAM; Figure 1 outlines an image's full path through CellSAM during inference. We benchmark CellSAM's performance using a suite of metrics (Figures 2c and 2d and Supplemental Figure S2) and find that it outperforms Cellpose models trained on comparable datasets. We highlight two features of our benchmarking analyses below.
• CellSAM is a strong generalist model. Generalization across cell morphologies and imaging datasets has been a significant challenge for deep learning-based cell segmentation algorithms. To evaluate CellSAM's generalization capabilities, we compared CellSAM and Cellpose models trained as specialists (i.e., on a single dataset) or generalists (i.e., on the entire dataset). Consistent with the literature, we observed that Cellpose's performance degraded when trained as a generalist (Figure 2c), as specialist Cellpose models had a higher F1 score across all datasets. We observed that the reverse was true for CellSAM; the F1 score remained the same or improved in four of the five data categories and across seven of the nine datasets (Figure 2 and Supplemental Figure S2).
• CellSAM achieves SOTA zero-shot performance. To further evaluate CellSAM's capacity for generalization, we evaluated its performance on an entirely unseen dataset, LIVECell 44, without further fine-tuning. When compared against the Cellpose-generalist model, we find that CellSAM's zero-shot segmentation performance is considerably better, albeit still not accurate enough to be used in real-world settings. We note that some of the poor reported performance is due to label errors in the LIVECell dataset 16.

Discussion
Cell segmentation is a critical task for cellular imaging experiments. While deep learning methods have made substantial progress in recent years, there remains a need for methods that can generalize across diverse images and further reduce the marginal cost of image labeling. In this work, we sought to meet these needs by developing CellSAM, a foundation model for cell segmentation. Transformer-based methods for cell segmentation are showing promising performance.
CellSAM builds on these works by integrating the mask generation capabilities of SAM with transformer-based object detection to empower both scalable image labeling and automated inference. We trained CellSAM on a diverse dataset curated from the literature. Our benchmarking demonstrated that CellSAM achieves SOTA performance on cell segmentation and that this performance is aided by our attempts to create a general segmentation model. Given its utility in image labeling and accuracy during inference, we believe CellSAM is a valuable contribution to the field and will help create the data infrastructure required for cellular imaging's AI-powered future.
The work described here has importance beyond aiding life scientists with cell segmentation. Foundation models are immensely useful for natural language and vision tasks and hold similar promise for the life sciences, provided they are suitably adapted to this new domain. We can see several uses for CellSAM that might be within reach of future work. First, given its generalization capabilities, it is likely that CellSAM has learned a general representation for the notion of "cells" used to query imaging data. These representations might serve as an interface between imaging data and other modalities (e.g., single-cell RNA sequencing), provided there is suitable alignment between cellular representations for each domain 57,58. Second, much like what has occurred with natural images, we foresee that the integration of natural language labels in addition to cell-level labels might lead to vision-language models capable of generating human-like descriptors of cellular images with entity-level resolution 33. Third, the generalization capabilities may enable the standardization of cellular image analysis pipelines across all the life sciences. If the accuracy is sufficient, microbiologists and tissue biologists could use the same collection of foundation models for interpreting their imaging data even for challenging experiments 59,60. Last, new efforts seek to generate AI scientists capable of generating hypotheses and exploring them through the design and execution of new experiments 61. Foundation models like CellSAM could contribute to this vision by serving as this scientist's "eyes", converting complex imaging data to structured knowledge that can be operationalized.
While the work presented here highlights the potential foundation models hold for cellular image analysis, much work remains to be done for this future to manifest. Extension of this methodology to 3D imaging data is essential; recent work on memory-efficient attention kernels 62 will aid these efforts. Exploring how to enable foundation models to leverage the full information content of images (e.g., multiple stains, temporal information for movies, etc.) is an essential avenue of future work. Expanding the space of labeled data remains a priority; this includes images of perturbed cells and cells with more challenging morphologies (e.g., neurons). Data generated by pooled optical screens 63 may synergize well with the data needs of foundation models. Compute-efficient fine-tuning strategies must be developed to enable flexible adaptation to new image domains. Lastly, prompt engineering is an important area of future work, as it is critical to maximizing model performance. The work we presented here can be thought of as prompt engineering, as we leverage CellFinder to produce bounding box prompts for SAM. As more challenging labeled datasets are incorporated, the nature of the "best" prompts will likely evolve. Finding the best prompts for these new data, rather than the best vision pipelines, is a task that will likely fall on both the computer vision and life science communities.

A Dataset Construction
To train CellSAM, we combined nine separate datasets spanning a variety of modalities: TissueNet 18, DeepBacs 45, BriFiSeg 46, Cellpose 15,16, Omnipose 47,48, YeastNet 49, YeaZ 50, the 2018 Kaggle Data Science Bowl (DSB) 51, and an internally collected dataset of phase microscopy images across eight mammalian cell lines (Phase400). The LIVECell 44 dataset was held out for zero-shot testing. Our collective dataset included images across multiple imaging modalities (brightfield, phase contrast, H&E staining, fluorescence, and mass cytometry), imaging targets (histology sections, yeast, cell culture, bacteria, nuclei), length scales, and morphologies. During preprocessing, every image in our dataset was normalized using Contrast Limited Adaptive Histogram Equalization (CLAHE) 64 with a kernel size of 128 pixels. We treated nuclear and whole-cell channels as the green and blue channels of an RGB image, respectively; the red channel was always blank. For nuclear-only datasets (i.e., BriFiSeg and DSB), we moved the green channel to blue so that the blue channel was never empty.
If available, we used pre-determined train/val/test splits for each dataset; otherwise, we introduced 80-10-10% data splits. For datasets with multiple fields of view (FOVs) of the same object set, we required all FOVs to belong to the same split.
For published datasets with a pre-existing data leak, we assigned all duplicated samples to the train split. Our assembled dataset uses a fixed image size of 512 by 512 pixels. Images shorter than 512 pixels on either axis were zero-padded up to 512. Images with more than 512 pixels on either axis were tiled into 512 by 512 pixel crops with a 25% overlap, and empty regions were filled with zeros. Any cropped images without valid annotations were removed. We follow a widely used annotation scheme for labeling our masks, with zero representing the background and unique positive integers representing different objects. While this format precludes accurate segmentation of overlapping objects, labels of this kind were not present in the dataset we compiled. We filtered out a cell label as invalid if 1) the label contained disjoint regions, typically caused by random mouse clicking, or 2) the label had a height or width of only 1 pixel. The cropped images with filtered annotations were used for training, validation, and testing. LIVECell 44 annotations were converted from the COCO format to this labeling format for consistency. We used Cellpose's 16 pre-processing function livecell_ann_to_masks() to remove overlapping regions. To match the phase contrast cell size in the training set, we rescaled the LIVECell images by 2.0 before the standard image preprocessing pipeline. We used the scikit-image 65 skimage.transform.rescale() function with bicubic interpolation for images and nearest-neighbor interpolation for annotation masks.
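The tiling step can be sketched as follows. This is a minimal illustration under stated assumptions: a 25% overlap is interpreted as a stride of 75% of the tile size (384 pixels for 512-pixel tiles), tiles start at the top-left corner, and the last tile along each axis is allowed to run past the image edge and is zero-filled. The helper names `tile_starts` and `crop_with_zero_pad` are hypothetical, and the actual preprocessing code may place tiles differently.

```python
def tile_starts(length, tile=512, overlap=0.25):
    """Start offsets of tiles along one axis.

    Consecutive tiles overlap by `overlap` * `tile` pixels. The last tile
    may extend beyond `length`; the overhang is meant to be zero-filled.
    """
    stride = int(tile * (1 - overlap))
    starts = [0]
    while starts[-1] + tile < length:
        starts.append(starts[-1] + stride)
    return starts


def crop_with_zero_pad(image, y0, x0, tile):
    """Cut a tile x tile crop from a 2D list, filling out-of-bounds
    pixels with zeros."""
    height, width = len(image), len(image[0])
    return [
        [image[y][x] if y < height and x < width else 0
         for x in range(x0, x0 + tile)]
        for y in range(y0, y0 + tile)
    ]


# A 1000-pixel axis is covered by three 512-pixel tiles at stride 384.
starts_1000 = tile_starts(1000)

# Small toy case: a 5 x 5 image of ones, 4-pixel tiles, 25% overlap.
toy = [[1] * 5 for _ in range(5)]
toy_starts = tile_starts(5, tile=4, overlap=0.25)
corner = crop_with_zero_pad(toy, toy_starts[-1], toy_starts[-1], 4)
```

Images smaller than the tile size yield a single start offset of 0, so the same crop helper also covers the zero-padding case for small images.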
To summarize our dataset format, our images have RGB channels with a fixed size of 512 by 512 pixels, stored as float32 arrays of shape (3, 512, 512) with values in the range [0, 1]. The blue channel was the main channel, reserved for whole-cell images, and always non-empty. The green channel was the supplementary channel used for nuclear images but could be empty. If only nuclear images were available (e.g., for the DSB dataset), they were placed in the blue channel. The red channel was always empty. Our label masks had the same height and width as the images, stored as int32 arrays of shape (1, 512, 512) with values in the range [0, number of objects]. We stored the processed dataset in two formats. The numpy npy format was used for CellSAM fine-tuning and model evaluation. The COCO format 53 was used for CellFinder and ViT backbone training.
coco AP @ 0.5 IoU) or we compute the argmax over the cell dimension N to generate a tensor W × H, where each pixel corresponds to an integer that is unique for each cell.
Thresholding. Given CellSAM's model architecture, there are three different thresholds at inference time. First, we thresholded the scores of the bounding boxes generated by CellFinder at 0.4 across all datasets. After the boxes were passed through the Mask Decoder, we thresholded the overall mask score output by the IoU prediction head of the Mask Decoder at 0.5. Lastly, we thresholded the mask decoder output after applying the sigmoid function to each pixel, again at 0.5.
CellSAM Postprocessing. We use the same postprocessing steps that are used by SAM 35. These consist of hole filling and island removal for each predicted cell.
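The three thresholds can be summarized in a short sketch. The data layout below (a list of dictionaries with `box_score`, `mask_iou_score`, and `mask_logits` keys) is a hypothetical stand-in for the model outputs, and whether each comparison is strict or inclusive is an assumption. In the real pipeline the box filter runs before the Mask Decoder; here all three filters are applied in one pass for brevity.

```python
import math

BOX_SCORE_THRESHOLD = 0.4  # CellFinder bounding box score
MASK_IOU_THRESHOLD = 0.5   # score from the mask decoder's IoU prediction head
PIXEL_THRESHOLD = 0.5      # per-pixel probability after the sigmoid


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def filter_predictions(detections):
    """Apply the three inference-time thresholds and return binary masks.

    detections: list of dicts with keys
      "box_score" (float), "mask_iou_score" (float),
      "mask_logits" (2D list of floats, pre-sigmoid).
    """
    kept = []
    for det in detections:
        if det["box_score"] < BOX_SCORE_THRESHOLD:
            continue  # low-confidence box: never becomes a prompt
        if det["mask_iou_score"] < MASK_IOU_THRESHOLD:
            continue  # decoder predicts a poor-quality mask
        kept.append([
            [1 if sigmoid(v) > PIXEL_THRESHOLD else 0 for v in row]
            for row in det["mask_logits"]
        ])
    return kept


example_detections = [
    {"box_score": 0.9, "mask_iou_score": 0.8,
     "mask_logits": [[2.0, -1.0], [0.5, -3.0]]},
    {"box_score": 0.3, "mask_iou_score": 0.9,
     "mask_logits": [[4.0, 4.0], [4.0, 4.0]]},
    {"box_score": 0.7, "mask_iou_score": 0.2,
     "mask_logits": [[4.0, 4.0], [4.0, 4.0]]},
]
example_masks = filter_predictions(example_detections)
```

Only the first example detection survives: the second fails the box score threshold and the third fails the mask score threshold.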

B.1.1 Model Implementation and Training
CellSAM is implemented in PyTorch 70. For CellFinder, we modified the official Anchor DETR repository. For CellSAM, we modified the official Segment Anything repository. We used PyTorch Lightning 71 to scale the training. Prototyping was done using NVIDIA RTX 4090 GPUs. We used machines with either NVIDIA A6000s or A100s (40GB and 80GB versions) for the experiments in the paper.

C Benchmarking
We benchmarked the performance of CellSAM models against Cellpose 15,16 trained on our compiled datasets.

C.1 Cellpose Model Training.
We follow the hyper-parameters described in the original paper 16 to train specialist and generalist Cellpose models from scratch. We use the SGD optimizer with a weight decay of 10^-4 and a batch size of eight. We train each model for 300 epochs with a base learning rate of 0.1. The learning rate increases linearly from 0 to 0.1 over the first ten epochs, then decreases by a factor of two every five epochs after the 250th epoch. The main channel (--chan) is 3 (blue), and the supplementary channel (--chan2) is 2 (green). Other hyper-parameters were kept at the default setting. We trained each model on a single NVIDIA A6000 GPU with 11.4GB GPU memory utilization. In total, we train nine specialist models and one generalist model.
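The learning rate schedule described above can be written as a small function. This sketch follows the prose description only and is not copied from the Cellpose source; in particular, it assumes zero-indexed epochs, that the rate reaches 0.1 at epoch 10, and that the first halving takes effect five epochs after epoch 250. The function name `learning_rate_at` is illustrative.

```python
def learning_rate_at(epoch, base_lr=0.1, warmup_epochs=10,
                     decay_start=250, decay_every=5):
    """Learning rate for a given zero-indexed epoch.

    Linear warm-up from 0 to base_lr, a constant plateau, then a halving
    every `decay_every` epochs once `decay_start` has been passed.
    """
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    if epoch < decay_start:
        return base_lr
    n_halvings = (epoch - decay_start) // decay_every
    return base_lr / (2 ** n_halvings)


# Full schedule for the 300-epoch training run.
schedule = [learning_rate_at(e) for e in range(300)]
```

Under these assumptions the rate is 0.05 at epoch 5, 0.1 from epoch 10 through epoch 254, 0.05 at epoch 255, and 0.1 / 2^9 by the final epoch.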

C.2 Metrics
We used the Metrics package present in the DeepCell library 18,23, which is a set of tools for object-level evaluation of cell segmentations. Predictions that match the ground truth labels (determined by a mask IoU ≥ 0.6) are true positives (TP), predictions with no matching ground truth labels are false positives (FP), and ground truth labels without a valid match are false negatives (FN). We compute the recall, precision, and F1 scores using the following formulas:
• Recall: $\mathrm{recall} = \frac{TP}{TP + FN}$.
• Precision: $\mathrm{precision} = \frac{TP}{TP + FP}$.
• F1: $F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$.
Details of the implementation of these metrics are described in prior work 23.
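As an illustration of these object-level metrics, the sketch below matches predicted masks to ground truth masks at an IoU threshold of 0.6 and computes recall, precision, and F1. It is a simplified greedy matcher written for this example, with masks represented as sets of (row, column) pixel coordinates; it is not the DeepCell Metrics implementation, which is described in prior work 23.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def object_level_scores(pred_masks, gt_masks, iou_threshold=0.6):
    """Count TP/FP/FN by one-to-one matching and derive the scores."""
    matched_gt = set()
    tp = 0
    for pred in pred_masks:
        best_j, best_iou = None, 0.0
        for j, gt in enumerate(gt_masks):
            if j in matched_gt:
                continue  # each ground truth cell is matched at most once
            iou = mask_iou(pred, gt)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= iou_threshold:
            matched_gt.add(best_j)
            tp += 1
    fp = len(pred_masks) - tp  # predictions without a valid match
    fn = len(gt_masks) - tp    # ground truth cells without a valid match
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"TP": tp, "FP": fp, "FN": fn,
            "recall": recall, "precision": precision, "F1": f1}


gt_example = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred_example = [{(0, 0), (0, 1), (1, 0)}, {(9, 9)}]
scores_example = object_level_scores(pred_example, gt_example)
```

Because the threshold is above 0.5 and the label format does not allow masks to overlap, a prediction can exceed the threshold for at most one ground truth cell, so greedy matching is adequate for this sketch.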
We also used the COCO evaluation metrics 53 during CellFinder's development. The COCO metrics are a widely used benchmark for assessing the object-level quality of object detection and instance segmentation methods. These metrics report Average Precision (AP), the area under the Precision-Recall curve for a given object class. In our case, we only had a single object class: cells. The AP is computed for different IoU thresholds, ranging from 0.5 to 0.95, with a step size of 0.05. We report the mean AP across all IoU thresholds, denoted as mAP, as well as the AP at IoU=0.5, denoted as AP50, to quantify CellFinder's performance. Because the object density is much higher in cellular images than in natural images, we modified the limit for the maximum number of detections from 100 to 10,000. We also fed the actual confidence score per binary prediction of the CellSAM model to the COCO evaluator. For the Cellpose models, we used a fixed confidence score of 1.0.

Fig. S2 Per dataset performance across a suite of metrics from the DeepCell package, and additionally, we included the AP50 from the COCO metrics. We show the error rate (1-metric) on these bar plots. We demonstrate the superior performance of CellSAM-specific and CellSAM-general across multiple datasets and evaluation metrics. Legend: Cellpose-specific, Cellpose-general, CellSAM-specific, CellSAM-general.

Fig. 2 CellSAM is a strong generalist model for cell segmentation. a) For training and evaluating CellSAM, we curated a diverse cell segmentation dataset from the literature. The number of annotated cells is given for each data type. Nuclear refers to a heterogeneous dataset (DSB) 51 containing nuclear segmentation labels. b) Zero-shot (ZS) and fine-tuned mask generation error (1-F1 score) for SAM when using point and bounding box prompts. All prompting in this figure was done with ground truth prompts. The best performance is achieved with bounding box prompts and fine-tuning. c) Segmentation performance for CellSAM and Cellpose on different data types. We compare the segmentation error (1-F1) for models that were trained as specialists (i.e., on one dataset) or generalists (the full dataset). Models were trained for a similar number of steps across all datasets. We observed that CellSAM-general has a lower error than Cellpose-general on almost all tested datasets. Further, we observed that generalist training improved CellSAM's performance over specialist training; the reverse was true for Cellpose. d) Zero-shot performance of CellSAM-general and Cellpose-general on the LIVECell dataset. Here, we show greater than 4x segmentation performance on an unseen dataset. e) Qualitative results of CellSAM segmentations for different data and imaging modalities. Predicted segmentations are outlined in red.
Fig. S1 Per dataset performance comparing zero-shot point prompting, zero-shot box prompting, and fine-tuned box prompting across a suite of metrics from the DeepCell package, and additionally, we included the AP50 from the COCO metrics. We show the error rate (1-metric) on these bar plots. We demonstrate the superior performance of CellSAM-specific and CellSAM-general across multiple datasets and multiple evaluation metrics.

Fig. 1 (schematic; panel labels: Image, Vision Transformer, CellFinder, Prompt Encoder, Output Segmentation)
15,16, ViT) to generate information-rich image features. These image features are then sent to two downstream modules. The first module, CellFinder, decodes these features into bounding boxes using a transformer-based encoder-decoder pair. The second module combines these image features with prompts to generate masks using SAM's mask decoder. CellSAM integrates these two modules using the bounding boxes generated by CellFinder as prompts for SAM. CellSAM is trained in two stages, using the pre-trained SAM model weights as a starting point. In the first stage, we train the ViT and the CellFinder model together on the object detection task. This yields an accurate CellFinder but results in a distribution shift between the ViT and SAM's mask decoder. The second stage closes this gap by fixing the ViT and SAM mask decoder weights and fine-tuning the remainder of the SAM model (i.e., the model neck) using ground truth bounding boxes and segmentation labels.