Annotating the heart atlas#

Annotate a subsampled version of the heart cell atlas [LitvivnukovaTLopezM+20]. Head to https://www.heartcellatlas.org/ for more information on this dataset.

Running this notebook will automatically download the data from figshare.

Preliminaries#

Import packages & data#

import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns

from cell_annotator import CellAnnotator, ObsBeautifier
from cell_annotator.utils import _shuffle_cluster_key_categories_within_sample

Load a subsampled version of the heart cell atlas [LitvivnukovaTLopezM+20]. This dataset has been obtained using scvi.data.heart_cell_atlas_subsampled(). We computed an embedding with scVI to visualize the data [LRC+18].

adata = sc.read("data/heart_atlas.h5ad", backup_url="https://figshare.com/ndownloader/files/51994787")

adata
AnnData object with n_obs × n_vars = 18641 × 1200
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', '_scvi_batch', '_scvi_labels'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'cell_source_colors', 'cell_type_colors', 'donor_colors', 'hvg', 'log1p', 'neighbors', 'umap'
    obsm: 'X_scVI', 'X_umap', '_scvi_extra_categorical_covs', '_scvi_extra_continuous_covs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

Preprocess the data#

Create numerical cell type labels to be used for annotation and a shuffled version thereof.

adata.obs["leiden"] = adata.obs["cell_type"].copy()
adata.obs["leiden"] = adata.obs["leiden"].cat.rename_categories(
    {cat: str(i) for i, cat in enumerate(adata.obs["cell_type"].cat.categories)}
)

adata = _shuffle_cluster_key_categories_within_sample(adata, sample_key="cell_source", key_added="leiden_shuffled")

Visualize

sc.pl.embedding(
    adata, basis="umap", color=["donor", "cell_source", "cell_type", "leiden_shuffled"], wspace=0.4, ncols=5
)
../../_images/7f027001433cca107eb5c89a2f0be880c458976fb6edf6d403ef9018688ad32d.png

We’re using the cell_source annotation to denote different samples, and we randomly shuffled the order of cluster labels within each sample to resemble a realistic scenario, in which we’re given a dataset with several samples, with independent clustering per sample (i.e., pre-integration).

Query cell type labels per sample#

In this notebook, we will demonstrate the simplest usage of this package, which is just a single method call. If you want to have more control over what’s happening behind the scences, to check model outputs at intermediate steps or to infuse prior knowledge, head over to our more advanced tutorial: Annotating human BMMCs.

cell_ann = CellAnnotator(
    adata, species="human", tissue="heart", cluster_key="leiden_shuffled", sample_key="cell_source", model="gpt-4.1"
)
cell_ann
INFO     ✅ OPENAI API key is available                                                                            
INFO     Initializing `4` SampleAnnotator objects(s).
🧬 CellAnnotator
================
📋 Species: human
🔬 Tissue: heart
⏳ Stage: adult
🔗 Cluster key: leiden_shuffled
🔬 Sample key: cell_source

🤖 Provider: openai
🧠 Model: gpt-4.1

🔋 Status: ❌ Not working

📊 Samples: 4
🏷️  Sample IDs: 'Sanger-CD45', 'Sanger-Nuclei', 'Sanger-Cells', 'Harvard-Nuclei'

Note

A word on model choice:

For each LLM providor (like OpenAI, Gemini, etc.), there’s a default model which will be used if you don’t specify one explicitly. Such a default model is enough for simpler use cases or to experiment with the package. Running this notebook with the default model will cost you less than 0.01 USD. Here, we’re using a more powerful model, to demonstrate the full strength of the package. Running with this more powerful model will incur some small cost, which depends on the model, the LLM provider, and your usage tier (it’s usually far less than 1 USD).

cell_ann.annotate_clusters()
INFO     Querying cell types.                                                                                      
INFO     Writing expected cell types to `self.expected_cell_types`                                                 
INFO     Querying cell type markers.                                                                               
INFO     Writing expected marker genes to `self.expected_marker_genes`.                                            
INFO     Filtering marker genes to only include those present in adata.var_names.                                  
INFO     Filtered 8 marker genes and removed 0 cell types with no marker genes left after filtering.               
INFO     Iterating over samples to annotate clusters.
INFO     Querying cell-type label de-duplication.                                                                  
INFO     Removed 3/19 cell types.                                                                                  
INFO     Iterating over samples to harmonize cell type annotations.
INFO     Writing updated cluster labels to `adata.obs[`cell_type_predicted'].
🧬 CellAnnotator
================
📋 Species: human
🔬 Tissue: heart
⏳ Stage: adult
🔗 Cluster key: leiden_shuffled
🔬 Sample key: cell_source

🤖 Provider: openai
🧠 Model: gpt-4.1

🔋 Status: ❌ Not working

📊 Samples: 4
🏷️  Sample IDs: 'Sanger-CD45', 'Sanger-Nuclei', 'Sanger-Cells', 'Harvard-Nuclei'

Under the hood, this creates one cell_annotator.SampleAnnotator object per sample. Let’s take a look at one of them:

cell_ann.sample_annotators["Sanger-Cells"]
🧬 SampleAnnotator
==================
📋 Sample: Sanger-Cells
🔢 Clusters: 8
🔬 Cells: 1,753

🧬 Markers: ✅ Computed
🏷️  Annotation: ✅ Complete

Within each cell_annotator.SampleAnnotator, we can inspect annotation results:

cell_ann.sample_annotators["Sanger-Cells"].annotation_df
n_cells marker_genes reason_for_failure marker_gene_description cell_type cell_state annotation_confidence reason_for_confidence_estimate cell_type_harmonized
1 51 AIF1, S100A4, LYZ, TYROBP, FCER1G, S100A9, LAPTM5 None The marker genes AIF1, S100A4, LYZ, TYROBP, FC... Infiltrating Cardiac Macrophages Normal High Markers are canonical for infiltrating/activat... Infiltrating Cardiac Macrophages
2 1191 VWF, EGFL7, SLC9A3R2, CLDN5, F8, PECAM1, EMCN None The cluster 2 markers include VWF, PECAM1, and... Vascular Endothelial Cells Normal High All cluster markers are highly specific and we... Vascular Endothelial Cells
4 35 CCL5, PTPRC, CORO1A, HCST, NKG7, CD69, CCL4 None Cluster 4 expresses CCL5 and CCL4 (chemokines ... Natural Killer (NK) cell Activated High Multiple canonical NK cell markers are present... Natural Killer (NK) cell
5 3 KRT18, ITLN1, KRT19, HP, SLPI, PRG4, RARRES2 None Cluster 5 expresses KRT18 and KRT19, which are... Mesothelial cells of the pericardium Normal High Cluster 5 expresses multiple markers (KRT18, K... Mesothelial cells of the pericardium
6 262 NDUFA4L2, RGS5, AGT, CPE, ACTA2, COX4I2, TPM2 None Markers like RGS5 and ACTA2 are classic pericy... Pericyte Normal High Presence of RGS5 and ACTA2 together is a hallm... Pericytes
8 6 PLP1, CHL1, LGI4, CRYAB, TMEM176B, NRXN1, GPM6B None PLP1 is a canonical marker for Schwann cells a... Schwann cells and Cardiac Glial-like cells Normal High Multiple canonical markers for Schwann and gli... Schwann cells and Cardiac Glial-like cells
9 91 DCN, C1S, SERPINF1, C7, FBLN1, CFD, LUM None Markers DCN and LUM are canonical markers for ... Cardiac Fibroblasts Normal High Strong expression of canonical cardiac fibrobl... Cardiac Fibroblasts
10 114 TAGLN, TPM2, MYH11, ACTA2, SOD3, CRYAB, IGFBP5 None Markers TAGLN, TPM2, MYH11, and ACTA2 are cano... Vascular Smooth Muscle Cells Normal High There is strong co-expression of multiple cano... Vascular Smooth Muscle Cells

Further, annotations have been automatically harmonized across samples (cell_type_harmonized) and written to the underling anndata.AnnData object, by default to cell_type_predicted.

Evaluate results#

First, let’s get clusters into a consistent ordering across annotations. This will make it easier to interpret the results in a confusion matrix.

obr = ObsBeautifier(adata, model="o4-mini")  # reasoning models work best here

# bring categories into a more meaningful order
obr.reorder_categories(keys=["cell_type", "cell_type_predicted"])

# assign colors to the categories
obr.assign_colors(keys=["cell_type", "cell_type_predicted"])
INFO     ✅ OPENAI API key is available                                                                            
INFO     Querying label ordering.                                                                                  
INFO     Reordering categories for key 'cell_type'                                                                 
INFO     Reordering categories for key 'cell_type_predicted'                                                       
INFO     Querying cluster colors.                                                                                  
INFO     Assigning colors for key 'cell_type'                                                                      
INFO     Assigning colors for key 'cell_type_predicted'

Compute ground-truth with predicted cell type labels in a confusion matrix (across all samples).

df = adata.obs.groupby(["cell_type", "cell_type_predicted"], observed=True).size().unstack()

# Plot the heatmap
plt.figure(figsize=(8, 8))
sns.heatmap(df, annot=False, cmap="Blues", xticklabels=True, yticklabels=True)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
../../_images/b88506d4b86e67a70e19580f6f79a141b502fac412cb7f447e9b845dec96193e.png

Note that your results might look slightly different, but we have found the the model behaves mostly robust and many variations are synonyms or closely related cell types. We can also compare ground-truth and predicted cell types in the UMAP embedding.

sc.pl.embedding(adata, basis="umap", color=["cell_type", "cell_type_predicted"], wspace=0.4, ncols=5)
../../_images/749be02a3efba94e056f939155cd818f63e99e5bca735c374bdf0bd3e9f0c665.png

Colors have been assigned automatically and should ideally be comparable across the two annotations.