cell_annotator.CellAnnotator

class cell_annotator.CellAnnotator(adata, species, tissue, stage='adult', cluster_key='leiden', sample_key=None, model=None, max_completion_tokens=None, provider=None, api_key=None)

Main class for annotating cell types across multiple samples.

Orchestrates the annotation workflow: creates a SampleAnnotator instance for each sample, coordinates marker gene computation and cell type annotation, and harmonizes results across samples. Works with any of the supported LLM provider backends.

Parameters:
  • adata (AnnData) – AnnData object containing single-cell data.

  • sample_key (str | None (default: None)) – Key in obs indicating sample/batch membership. If None, treats the entire dataset as a single sample.

  • species (str) – Species name (e.g., ‘homo sapiens’, ‘mus musculus’).

  • tissue (str) – Tissue name (e.g., ‘brain’, ‘heart’, ‘lung’).

  • stage (str (default: 'adult')) – Developmental stage (e.g., ‘adult’, ‘embryonic’, ‘fetal’).

  • cluster_key (str (default: 'leiden')) – Key of the cluster column in adata.obs.

  • model (str | None (default: None)) – Model name. If None, uses the default model for the selected or auto-detected provider. Examples: ‘gpt-4o-mini’, ‘gemini-2.5-flash-lite’, ‘claude-haiku-4-5’.

  • max_completion_tokens (int | None (default: None)) – Maximum number of tokens the model is allowed to use for completion.

  • provider (str | None (default: None)) – LLM provider name. If None, auto-detects from model name or uses the first available provider with a valid API key. See PackageConstants.supported_providers for the list of supported providers.

  • api_key (str | None (default: None)) – Optional API key for the selected provider. If None, uses environment variables. Useful for programmatically providing API keys or using different keys per instance.
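A minimal construction sketch using only the parameters documented above. The helper name `make_annotator`, the species and tissue values, and the `"sample"` column name are illustrative assumptions, not part of the package. The import is deferred into the function so the snippet loads even where the package is not installed.

```python
def make_annotator(adata, api_key=None):
    """Build a CellAnnotator for a multi-sample dataset.

    Hypothetical helper: every value below is illustrative.
    """
    # Deferred import: only needed when the helper is actually called.
    from cell_annotator import CellAnnotator

    return CellAnnotator(
        adata,
        species="homo sapiens",
        tissue="brain",
        stage="adult",
        cluster_key="leiden",  # column in adata.obs holding cluster labels
        sample_key="sample",   # assumed obs column; None treats the data as one sample
        model=None,            # None: default model for the selected provider
        provider=None,         # None: auto-detect the provider
        api_key=api_key,       # None: fall back to environment variables
    )
```

Passing `api_key` explicitly is mainly useful when different instances should use different keys; otherwise leaving it as None defers to the environment.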

Attributes table

  • api_keys – Access to API key manager.

Methods table

  • annotate_clusters([min_markers, ...]) – Annotate clusters based on marker genes.

  • check_api_access([provider, model]) – Check API access and log warnings if needed.

  • get_cluster_markers([method, ...]) – Get marker genes per cluster.

  • get_expected_cell_type_markers([n_markers, ...]) – Get expected cell types and marker genes.

  • list_available_models() – List available models for the current provider.

  • query_llm(instruction, response_format[, ...]) – Query the LLM with a given instruction.

  • test_query([return_details]) – Test if the LLM setup is working correctly.

Attributes

CellAnnotator.api_keys

Access to API key manager.

Methods

CellAnnotator.annotate_clusters(min_markers=2, restrict_to_expected=False, key_added='cell_type_predicted')

Annotate clusters based on marker genes.

Parameters:
  • min_markers (int (default: 2)) – Minimum number of marker genes required per cluster.

  • key_added (str (default: 'cell_type_predicted')) – Name of the key in .obs where updated annotations will be written.

  • restrict_to_expected (bool (default: False)) – If True, only use expected cell types for annotation.

Returns:

Updates the following attributes:
  • self.annotation_df

  • self.adata.obs[key_added]

  • self.annotated
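The three documented steps can be chained as in the sketch below. The call order (expected markers, then cluster markers, then annotation) is an assumption inferred from the method descriptions on this page, and `run_annotation` is a hypothetical helper that accepts an already constructed annotator.

```python
def run_annotation(annotator, key_added="cell_type_predicted"):
    """Run the marker and annotation steps, then return the predicted labels.

    Hypothetical helper; the step order is an assumption.
    """
    # Step 1: query the model for expected cell types and their marker genes.
    annotator.get_expected_cell_type_markers(n_markers=5)

    # Step 2: compute data-driven marker genes for each cluster.
    annotator.get_cluster_markers(method="wilcoxon", max_markers=7)

    # Step 3: annotate clusters; results are written to adata.obs[key_added].
    annotator.annotate_clusters(min_markers=2, key_added=key_added)

    return annotator.adata.obs[key_added]
```

Since all three methods update attributes in place rather than returning results, the helper reads the labels back from `annotator.adata.obs[key_added]`.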

CellAnnotator.check_api_access(provider=None, model=None)

Check API access and log warnings if needed.

Return type:

bool

Parameters:
  • provider (str | None)

  • model (str | None)

CellAnnotator.get_cluster_markers(method='wilcoxon', min_specificity=0.75, min_auc=0.7, max_markers=7, use_raw=False, use_rapids=False)

Get marker genes per cluster.

Parameters:
  • method (Optional[Literal['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']] (default: 'wilcoxon')) – Method for sc.tl.rank_genes_groups.

  • min_specificity (float (default: 0.75)) – Minimum specificity threshold for marker genes.

  • min_auc (float (default: 0.7)) – Minimum AUC threshold for marker genes.

  • max_markers (int (default: 7)) – Maximum number of marker genes per cluster.

  • use_raw (bool (default: False)) – Whether to use raw data for calculations.

  • use_rapids (bool (default: False)) – Whether to use RAPIDS for GPU acceleration.

Return type:

None

Returns:

Updates the following attributes:
  • self.marker_dfs

  • self.marker_genes
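Because `method` is restricted to a fixed set of strings and both thresholds are fractions, it can help to validate arguments before starting a potentially slow computation. The sketch below is one way to do that; `compute_markers` and `ALLOWED_METHODS` are hypothetical names, and the sketch deliberately rejects `method=None` even though the type hint permits it.

```python
# Values taken from the documented Literal type of `method`.
ALLOWED_METHODS = ("logreg", "t-test", "wilcoxon", "t-test_overestim_var")


def compute_markers(annotator, method="wilcoxon", min_specificity=0.75,
                    min_auc=0.7, max_markers=7):
    """Validate arguments, compute cluster markers, and return marker_genes.

    Hypothetical helper around the documented get_cluster_markers call.
    """
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {ALLOWED_METHODS}, got {method!r}")

    # Specificity and AUC are both bounded by 0 and 1.
    for name, value in (("min_specificity", min_specificity), ("min_auc", min_auc)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    annotator.get_cluster_markers(
        method=method,
        min_specificity=min_specificity,
        min_auc=min_auc,
        max_markers=max_markers,
    )
    # The method returns None; results are stored on the instance.
    return annotator.marker_genes
```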

CellAnnotator.get_expected_cell_type_markers(n_markers=5, filter_to_var_names=True, provide_var_names=True)

Get expected cell types and marker genes.

Parameters:
  • n_markers (int (default: 5)) – Number of marker genes per cell type.

  • filter_to_var_names (bool (default: True)) – Whether to filter marker genes to only include those present in adata.var_names.

  • provide_var_names (bool (default: True)) – If True, include the available gene names in the prompt and instruct the model to restrict itself to this set.

Return type:

None

Returns:

Updates the following attributes:
  • self.expected_cell_types

  • self.expected_marker_genes

CellAnnotator.list_available_models()

List available models for the current provider.

Return type:

list[str]

Returns:

List of available model names.
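One possible use of the returned list is to pick the first preferred model that the current provider actually offers. `pick_model` is a hypothetical helper, and the model names in the usage note are only the examples mentioned in the constructor documentation.

```python
def pick_model(annotator, preferred):
    """Return the first name in `preferred` that the provider lists, else None.

    Hypothetical helper built on the documented list_available_models call.
    """
    available = set(annotator.list_available_models())
    for name in preferred:
        if name in available:
            return name
    return None
```

For instance, `pick_model(annotator, ["gpt-4o-mini", "claude-haiku-4-5"])` yields whichever of the two appears first in the preference list and is available, or None if neither is.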

CellAnnotator.query_llm(instruction, response_format, other_messages=None)

Query the LLM with a given instruction.

Parameters:
  • instruction (str) – Instruction to provide to the model.

  • response_format (type[BaseOutput]) – Response format class.

  • other_messages (list | None (default: None)) – Additional messages to provide to the model.

Return type:

BaseOutput

Returns:

Parsed response.

CellAnnotator.test_query(return_details=False)

Test if the LLM setup is working correctly.

Performs a simple query to verify that the API key is valid and the model can be accessed successfully.

Parameters:
  • return_details (bool (default: False)) – If True, returns (success, message) tuple with detailed information. If False, returns only boolean success status.

Return type:

bool | tuple[bool, str]

Returns:

  • If return_details=False: True if the test query succeeds, False otherwise.

  • If return_details=True: tuple of (success, message) with detailed status.