Annotating human BMMCs#

Annotate a dataset of human bonemarrow mononuclear cells (BMMCs) [LBC+21]. This data corresponds to the gene expression portion of the NeurIPS 2021 challenge 10x multiome data.

The original raw version of this dataset can be obtained from GEO. Running this notebook will automatically download the data from figshare.

Preliminaries#

Import packages & data#

import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns

from cell_annotator import CellAnnotator, ObsBeautifier
from cell_annotator.utils import _shuffle_cluster_key_categories_within_sample

Load the gene expression portion of the NeurIPS 2021 challenge 10x multiome data. This dataset has been obtained from GEO. We computed an embedding using scANVI to visualize the data [XLM+21].

adata = sc.read("data/human_bmmcs.h5ad", backup_url="https://figshare.com/ndownloader/files/51994907")
adata

AnnData object with n_obs × n_vars = 69249 × 13431
    obs: 'pct_counts_mt', 'n_counts', 'n_genes', 'size_factors', 'phase', 'cell_type', 'batch', 'pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker', '_scvi_batch', '_scvi_labels'
    var: 'feature_types', 'gene_id'
    uns: 'DonorNumber_colors', 'Samplename_colors', 'Site_colors', '_scvi_manager_uuid', '_scvi_uuid', 'batch_colors', 'cell_type_colors', 'dataset_id', 'genome', 'neighbors', 'organism', 'phase_colors', 'umap'
    obsm: 'X_pca', 'X_scANVI', 'X_umap', '_scvi_extra_categorical_covs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

Preprocess the data#

Create numerical cell type labels to be used for annotation and a shuffled version thereof.

adata.obs["leiden"] = adata.obs["cell_type"].copy()
adata.obs["leiden"] = adata.obs["leiden"].cat.rename_categories(
    {cat: str(i) for i, cat in enumerate(adata.obs["cell_type"].cat.categories)}
)

adata = _shuffle_cluster_key_categories_within_sample(adata, sample_key="Site", key_added="leiden_shuffled")

Visualize

sc.pl.embedding(adata, basis="umap", color=["Site", "DonorNumber", "cell_type", "leiden_shuffled"], wspace=0.7, ncols=5)

../../_images/2decdd6d9256f398228cd23edc3b54cd220845ca2aa5a7fb5c792404856dab1b.png

We’re using the Site annotation to denote different samples, and we randomly shuffled the order of cluster labels within each sample to resemble a realistic scenario, in which we’re given a dataset with several samples, with independent clustering per sample (i.e., pre-integration).

Query cell type labels per sample#

In this notebook, we will demonstrate the slightly more advanced usage of this package, which involves a few method calls, but gives you more control and insights.

If you’re new and looking for the simplest way to get started, head over to our basic tutorial: Annotating the heart atlas.

cell_ann = CellAnnotator(
    adata,
    species="human",
    tissue="bone marrow mononuclear cells",
    cluster_key="leiden_shuffled",
    sample_key="Site",
    model="gpt-4.1",
)

cell_ann

INFO     ✅ OPENAI API key is available                                                                            
INFO     Initializing `4` SampleAnnotator objects(s).

🧬 CellAnnotator
================
📋 Species: human
🔬 Tissue: bone marrow mononuclear cells
⏳ Stage: adult
🔗 Cluster key: leiden_shuffled
🔬 Sample key: Site

🤖 Provider: openai
🧠 Model: gpt-4.1

🔋 Status: ❌ Not working

📊 Samples: 4
🏷️  Sample IDs: 'site1', 'site2', 'site3', 'site4'

Query expected cell types and actual markers#

Let’s start by querying expected cell types and their markers in this tissue. We can use this step as a validation, to see how much the model “knows” about our data.

cell_ann.get_expected_cell_type_markers()

INFO     Querying cell types.                                                                                      
INFO     Writing expected cell types to `self.expected_cell_types`                                                 
INFO     Querying cell type markers.                                                                               
INFO     Writing expected marker genes to `self.expected_marker_genes`.                                            
INFO     Filtering marker genes to only include those present in adata.var_names.                                  
INFO     Filtered 4 marker genes and removed 0 cell types with no marker genes left after filtering.

Let’s take a look at these.

cell_ann.expected_marker_genes

{'Hematopoietic stem cells (HSCs)': ['CD34',
  'BAALC',
  'GATA2',
  'SPI1',
  'MEIS1'],
 'Multipotent progenitor cells (MPPs)': ['CD34',
  'BAALC',
  'MEIS1',
  'GATA2',
  'SPI1'],
 'Common myeloid progenitors (CMPs)': ['CD34',
  'GATA2',
  'SPI1',
  'MEIS1',
  'RUNX1'],
 'Common lymphoid progenitors (CLPs)': ['CD34',
  'BAALC',
  'MEIS1',
  'GATA3',
  'IL7R'],
 'Megakaryocyte-erythroid progenitors (MEPs)': ['GATA1',
  'GATA2',
  'SPI1',
  'CD34',
  'MEIS1'],
 'Granulocyte-monocyte progenitors (GMPs)': ['CD34',
  'SPI1',
  'GATA2',
  'CEBPA',
  'CSF1R'],
 'Monocytes': ['CD14', 'LYZ', 'FCGR2A', 'CSF1R', 'SPI1'],
 'Macrophages': ['CD163', 'CSF1R', 'SPI1', 'CD14', 'CD68'],
 'Neutrophils (precursors and mature)': ['S100A8',
  'S100A9',
  'CSF3R',
  'MPO',
  'PRTN3'],
 'Eosinophils (precursors and mature)': ['PRG2', 'GATA2', 'CEBPA', 'IL5RA'],
 'Basophils (precursors and mature)': ['GATA2', 'IL3RA', 'CEBPA', 'FCER1G'],
 'Erythroblasts': ['GYPA', 'HBB', 'HBA2', 'ALAS2', 'KLF1'],
 'Pro-erythroblasts': ['GYPA', 'HBB', 'ALAS2', 'KLF1', 'CD34'],
 'Normoblasts (late erythroblasts)': ['HBB', 'HBA2', 'GYPA', 'ALAS2', 'KLF1'],
 'B cells (early and mature)': ['CD19', 'CD79B', 'MS4A1', 'POU2F2', 'SPI1'],
 'T cells (mature, some precursors)': ['CD3E', 'CD3D', 'CD3G', 'CD2', 'CD7'],
 'Natural killer (NK) cells': ['KLRK1', 'KLRD1', 'GNLY', 'GZMB', 'NKG7'],
 'Plasma cells': ['XBP1', 'MZB1', 'IGHG1', 'IGKC', 'JCHAIN'],
 'Dendritic cells (conventional and plasmacytoid)': ['CD86',
  'ITGAX',
  'FCER1G',
  'CLEC4C',
  'SPI1'],
 'Mesenchymal stromal cells': ['VIM', 'CD44', 'ENG', 'VCAN'],
 'Endothelial precursor cells': ['PECAM1', 'CD34', 'ENG', 'FLI1'],
 'Mast cells (precursors and some mature forms)': ['KIT', 'FCER1G', 'GATA2']}

Ok, that looks reasonable. We could modify these expected cell types and marker genes by providing a dictionary and overwriting CellAnnotator.expected_marker_genes, thereby infusing prior knowledge. This information will be used later on to guide cell type label querying.

In the next step, we’ll compute marker genes for the clusters in our dataset.

cell_ann.get_cluster_markers()

INFO     Iterating over samples to compute cluster marker genes.

We can also take a look at these marker genes. Under the hood, we created once cell_annotator.SampleAnnotator class per sample, and marker genes are written into these. Let’s take a look at an example:

cell_ann.sample_annotators["site3"].marker_gene_dfs["0"].head()

	gene	specificity	auc
0	BANK1	0.803347	0.969336
1	RALGPS2	0.754415	0.943715
2	OSBPL10	0.932700	0.881645
3	MS4A1	0.917240	0.880816
4	EBF1	0.842919	0.845968

These are the top marker genes for one cluster in one sample.

Annotate clusters#

Using the marker genes computed above, we can now query the model for cell type labels. The expected cell types and marker genes queried above help to make predictions more robust and consistent across samples.

cell_ann.annotate_clusters()

INFO     Iterating over samples to annotate clusters.
WARNING  Not enough markers provided for cluster 0 in sample `site1` (0<2).                                        
WARNING  Not enough markers provided for cluster 10 in sample `site1` (1<2).                                       
WARNING  Not enough markers provided for cluster 18 in sample `site1` (0<2).                                       
WARNING  Not enough markers provided for cluster 8 in sample `site3` (0<2).                                        
WARNING  Not enough markers provided for cluster 14 in sample `site4` (1<2).                                       
INFO     Querying cell-type label de-duplication.                                                                  
INFO     Removed 15/50 cell types.                                                                                 
INFO     Iterating over samples to harmonize cell type annotations.
WARNING  Invalid global cell type name: Hematopoietic stem/progenitor cell (HSC/MPP) in sample `site2`. Re-using   
         the original name: Hematopoietic stem/progenitor cell (HSC/MPP).                                          
INFO     Writing updated cluster labels to `adata.obs[`cell_type_predicted'].

🧬 CellAnnotator
================
📋 Species: human
🔬 Tissue: bone marrow mononuclear cells
⏳ Stage: adult
🔗 Cluster key: leiden_shuffled
🔬 Sample key: Site

🤖 Provider: openai
🧠 Model: gpt-4.1

🔋 Status: ❌ Not working

📊 Samples: 4
🏷️  Sample IDs: 'site1', 'site2', 'site3', 'site4'

Within each cell_annotator.SampleAnnotator, we can inspect annotation results:

cell_ann.sample_annotators["site3"].annotation_df.head(10)

	n_cells	marker_genes	reason_for_failure	marker_gene_description	cell_type	cell_state	annotation_confidence	reason_for_confidence_estimate	cell_type_harmonized
0	455	BANK1, RALGPS2, OSBPL10, MS4A1, EBF1, BLK, PAX5	None	The markers MS4A1 (CD20), PAX5, EBF1, and BLK ...	B cell	Normal	High	Multiple highly specific and canonical B cell ...	B cell
1	356	NFIA, SLC25A21, KIAA1211, ANK1, MBOAT2, CDC42B...	None	Several markers in cluster 1, including ANK1 a...	Early erythroid progenitor	Normal	Medium	While several markers overlap with erythroid l...	Early erythroid progenitor
2	213	SLC4A1, TSPAN5, ANK1, SLC25A37, SOX6, SPTA1, M...	None	The cluster markers SLC4A1, ANK1, SLC25A37, SO...	Normoblast (late erythroblast)	Normal	High	Markers are classic and highly specific for er...	Normoblast (late erythroblast)
3	258	ZNF385D, XACT, KIAA1211, MED12L, TAFA2, RYR3, ...	None	The markers for this cluster (ZNF385D, XACT, K...	Unknown	Normal	Low	None of the provided markers are standard for ...	Unknown
4	353	TCF7L2, MTSS1, WARS, PECAM1, MS4A7, RNF144B, C...	None	PECAM1 (CD31) is a classic marker for endothel...	Endothelial precursor cells	Normal	High	PECAM1 is highly specific for endothelial cell...	Endothelial precursor cells
5	38	IFNG-AS1, MZB1, TXNDC5, DENND5B, SEL1L3, AL589...	None	Cluster 5 expresses MZB1 and IFNG-AS1, both of...	Plasma cells	Normal	High	Key plasma cell marker MZB1 is present, suppor...	Plasma cells
6	248	NEGR1, CST3, CCSER1, RUNX2, HLA-DPB1, FLT3, CIITA	None	RUNX2 is a transcription factor that can be se...	Multipotent progenitor (MPP)/early dendritic-l...	Normal	Medium	Marker overlap with both lymphoid/dendritic pr...	Multipotent progenitor (MPP)/early dendritic-l...
7	623	APP, AC023590.1, RUNX2, FAM160A1, PLXNA4, CUX2...	None	The marker genes include APP (amyloid beta pre...	Unknown	Normal	Low	None of the marker genes are classical markers...	Unknown
8	129		Not enough markers provided for cluster 8 in s...	Unknown	Unknown	Unknown	Unknown	Unknown	Unknown
9	581	CAMK4, BCL11B, INPP4B, LEF1, ANK3, IL7R, MLLT3	None	CAMK4, BCL11B, LEF1, and IL7R are canonical ma...	T cells (early/precursor)	Normal	High	The presence of multiple key T-lineage specify...	T cell (early precursor or thymocyte)

You can see that the model returns Unknown when the provided information isn’t enough to output a reliable prediction.

Annotations have been automatically harmonized across samples (cell_type_harmonized) and written to the underling anndata.AnnData object, by default to cell_type_predicted.

Evaluate#

First, let’s get clusters into a consistent ordering across annotations. This will make it easier to interpret the results in a confusion matrix.

obr = ObsBeautifier(adata, model="o4-mini")  # reasoning models work best here

# bring categories into a more meaningful order
obr.reorder_categories(keys=["cell_type", "cell_type_predicted"])

# assign colors to the categories
obr.assign_colors(keys=["cell_type", "cell_type_predicted"])

INFO     ✅ OPENAI API key is available                                                                            
INFO     Querying label ordering.                                                                                  
INFO     Reordering categories for key 'cell_type'                                                                 
INFO     Reordering categories for key 'cell_type_predicted'                                                       
INFO     Querying cluster colors.                                                                                  
INFO     Assigning colors for key 'cell_type'                                                                      
INFO     Assigning colors for key 'cell_type_predicted'

Compute ground-truth with predicted cell type labels in a confusion matrix (across all samples).

df = adata.obs.groupby(["cell_type", "cell_type_predicted"], observed=True).size().unstack()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df, annot=False, cmap="Blues", xticklabels=True, yticklabels=True)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

../../_images/1b2518372425a0594433877b5e785807f0575dd99593fe80c9654f1cfd08f586.png

We can also compare ground-truth and predicted cell types in the UMAP embedding.

sc.pl.embedding(adata, basis="umap", color=["cell_type", "cell_type_predicted"], wspace=0.6)

../../_images/ac604ceca88bcdb6bdba862cfc4dd7bd9d9ded326bee3d80434ccc7261680a59.png

Annotating human BMMCs

Contents