The supervised (sp) accessor
The supervised metrics accessor provides metrics that compare spatial single-cell expression profiles with those from a single-cell RNA sequencing dataset.
- segtraq.sp.supervised.marker_purity(sdata, cell_type_key: str, markers: dict[str, dict[str, list[str]]], use_quantiles: bool = False, tables_key: str = 'table', tables_cell_id_key: str = 'cell_id', tables_centroid_x_key: str = 'x_centroid', tables_centroid_y_key: str = 'y_centroid', weight_cont: float = 0.7, require_neighbor_expression: bool = True, neighbors_key: str = 'spatial_connectivities', inplace: bool = True) → DataFrame
Compute per-cell marker purity using positive markers globally and negative markers restricted to the focal cell’s neighborhood.
- Positive markers:
For each cell’s annotated type c, use markers[c]["positive"].
Compute per-cell precision/recall/F1 relative to these markers, using either a quantile-based threshold or expression > 0.
- Negative markers (neighborhood-aware):
- For a cell i of type c:
Find the cell types present in its spatial neighborhood, using adata.obsp[neighbors_key].
Take c’s negative markers: markers[c]["negative"].
Intersect them with the union of positive markers of the neighbor cell types.
Keep only genes that are expressed in at least one neighbor of the source type (if require_neighbor_expression=True).
→ “relevant negatives” for this cell.
Compute per-cell precision/recall/F1 for these relevant negatives.
- Finally:
Combine positive_F1 and negative_F1 into an overall F1_purity that rewards high positive-F1 and low negative-F1. The weighting factor weight_cont controls the importance of negative-F1.
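The selection of "relevant negatives" described above can be sketched in plain Python. This is only an illustration of the described logic, not the library's implementation: the function name relevant_negatives and the dictionary-based inputs (cell types, neighbor lists, per-cell counts) are hypothetical stand-ins for the AnnData table and adata.obsp[neighbors_key].

```python
def relevant_negatives(cell, cell_types, neighbors, expression, markers,
                       require_neighbor_expression=True):
    """Return the set of 'relevant negative' genes for one focal cell.

    Hypothetical inputs (not segtraq data structures):
      cell_types: dict cell_id -> cell-type label
      neighbors:  dict cell_id -> list of neighboring cell ids
      expression: dict cell_id -> dict gene -> count
      markers:    {cell_type: {"positive": [...], "negative": [...]}}
    """
    focal_type = cell_types[cell]
    negatives = set(markers[focal_type]["negative"])
    relevant = set()
    for nb in neighbors[cell]:
        nb_type = cell_types[nb]
        # Genes that are negative for the focal type and positive for the
        # neighbor's type.
        positives = set(markers.get(nb_type, {}).get("positive", []))
        for gene in negatives & positives:
            # Optionally require that this neighbor actually expresses the gene.
            if require_neighbor_expression and expression[nb].get(gene, 0) <= 0:
                continue
            relevant.add(gene)
    return relevant


# Toy data: a T cell (c1) next to a B cell (c2) and a macrophage (c3).
cell_types = {"c1": "T cell", "c2": "B cell", "c3": "Macrophage"}
neighbors = {"c1": ["c2", "c3"], "c2": ["c1"], "c3": ["c1"]}
markers = {
    "T cell": {"positive": ["CD3E"], "negative": ["MS4A1", "CD68", "EPCAM"]},
    "B cell": {"positive": ["MS4A1"], "negative": ["CD3E"]},
    "Macrophage": {"positive": ["CD68"], "negative": ["CD3E"]},
}
expression = {
    "c1": {"CD3E": 5, "MS4A1": 1},
    "c2": {"MS4A1": 7},
    "c3": {"CD68": 0},
}

strict = relevant_negatives("c1", cell_types, neighbors, expression, markers)
loose = relevant_negatives("c1", cell_types, neighbors, expression, markers,
                           require_neighbor_expression=False)
print(sorted(strict))  # ['MS4A1']
print(sorted(loose))   # ['CD68', 'MS4A1']
```

In this toy example EPCAM is never selected because no neighboring cell type lists it as a positive marker, and CD68 is dropped in the strict case because the neighboring macrophage does not express it.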
- Parameters:
sdata (SpatialData-like) – Must contain tables[tables_key] as an AnnData with expression and .obs metadata.
cell_type_key (str) – Column in the AnnData .obs with cell-type labels.
markers (dict) – A dictionary mapping cell types to their positive and negative markers, in the format {cell_type: {"positive": list[str], "negative": list[str]}}.
use_quantiles (bool, optional, default=False) – If True, define predictions by the top-|markers| fraction per cell (rank-based); if False, use direct expression-based criteria (expression > 0).
tables_key (str, optional, default="table") – Key of the AnnData table in sdata.tables.
tables_cell_id_key (str, optional, default="cell_id") – Column in the AnnData .obs with unique cell IDs.
tables_centroid_x_key (str or None, optional, default="x_centroid") – Column in the cell table with the x-coordinate of the cell centroid.
tables_centroid_y_key (str or None, optional, default="y_centroid") – Column in the cell table with the y-coordinate of the cell centroid.
require_neighbor_expression (bool, optional, default=True) – If True, contamination is only counted when the relevant gene is expressed in at least one neighboring cell of the source type.
weight_cont (float, optional, default=0.7) – Weighting factor for negative marker F1 in the overall F1_purity. Higher values give more weight to negative F1. Must be in the range [0, 1].
neighbors_key (str, optional, default="spatial_connectivities") – Key in adata.obsp containing a cell x cell adjacency / connectivity matrix that defines the spatial neighborhood.
inplace (bool, optional, default=True) – If True, store marker purity results in sdata.tables[tables_key].obs.
- Returns:
- Columns:
['positive_precision', 'positive_recall', 'positive_F1', 'negative_precision', 'negative_recall', 'negative_F1', 'F1_purity']
indexed by cell ID.
- Return type:
pandas.DataFrame
- segtraq.sp.supervised.mutually_exclusive_coexpression_rate(sdata, markers: dict[str, dict[str, list[str]]], tables_key: str = 'table', pseudocount: float = 0.5, inplace: bool = True) → DataFrame
Compute mutual exclusivity between marker genes using Fisher’s exact test. Returns a DataFrame with one row per unordered gene pair.
- Parameters:
sdata (SpatialData-like) – Must contain tables[tables_key] as an AnnData with expression and .obs metadata.
markers (dict) – A dictionary mapping cell types to their positive and negative markers, in the format {cell_type: {"positive": list[str], "negative": list[str]}}.
tables_key (str, optional, default="table") – Key of the AnnData table in sdata.tables.
pseudocount (float, optional, default=0.5) – Pseudocount added to all cells of the contingency table to avoid division by zero when computing odds ratios. This is equivalent to the Haldane-Anscombe correction, see https://pmc.ncbi.nlm.nih.gov/articles/PMC7398076.
inplace (bool, optional, default=True) – If True, store the resulting DataFrame in sdata.tables[tables_key].uns["MECR"].
- Returns:
- Columns:
['gene1', 'gene2', 'odds_ratio', 'pvalue', 'a', 'b', 'c', 'd']
where (a, b, c, d) are the counts in the contingency table.
- Return type:
pandas.DataFrame
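To make the role of the pseudocount concrete, the following sketch builds a 2 x 2 contingency table for one gene pair and computes the Haldane-Anscombe corrected odds ratio. It is a hedged illustration, not the library's code: the helper names are hypothetical, a gene is treated as detected when its count is greater than 0, and the cell layout (a = both genes detected, b = only gene1, c = only gene2, d = neither) is an assumption that the source does not spell out. The p-value from Fisher's exact test is not computed here.

```python
def contingency_counts(expr1, expr2):
    """Count cells by detection status of two genes (detected = count > 0).

    Assumed layout: a = both detected, b = gene1 only,
    c = gene2 only, d = neither.
    """
    a = b = c = d = 0
    for x, y in zip(expr1, expr2):
        if x > 0 and y > 0:
            a += 1
        elif x > 0:
            b += 1
        elif y > 0:
            c += 1
        else:
            d += 1
    return a, b, c, d


def corrected_odds_ratio(a, b, c, d, pseudocount=0.5):
    """Odds ratio with the pseudocount added to every cell of the table."""
    return ((a + pseudocount) * (d + pseudocount)) / (
        (b + pseudocount) * (c + pseudocount)
    )


# Toy counts for two marker genes across six cells; the genes never co-occur.
gene1 = [3, 0, 2, 0, 0, 1]
gene2 = [0, 4, 0, 1, 0, 0]

a, b, c, d = contingency_counts(gene1, gene2)
odds_ratio = corrected_odds_ratio(a, b, c, d)
print(a, b, c, d)  # 0 3 2 1
print(round(odds_ratio, 4))  # 0.0857
```

Without the pseudocount, a table with a zero in b or c would make the odds ratio undefined; with it, the ratio stays finite, and a value well below 1 indicates that the two genes are rarely detected together.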
- segtraq.sp.supervised.neighbor_contamination(sdata, cell_type_key: str, markers: dict[str, dict[str, list[str]]], tables_key: str = 'table', tables_cell_id_key: str = 'cell_id', tables_centroid_x_key: str = 'x_centroid', tables_centroid_y_key: str = 'y_centroid', require_neighbor_expression: bool = True, neighbors_key: str = 'spatial_connectivities', uns_key: str = 'negative_marker_contamination', uns_key_binary: str = 'negative_marker_contamination_binary', inplace: bool = True)
Compute per-cell negative-marker contamination and directed c_src→c_tgt contamination summaries.
- Per-cell outputs (written to .obs):
- negative_marker_contamination_counts:
Total transcripts in the focal cell that belong to genes that are (i) negative markers of the focal cell type and (ii) positive markers of at least one neighboring cell type.
- negative_marker_contamination_fraction:
For each such gene g, compute x_i(g) / (x_i(g) + mean_neighbor(g)), averaged across genes (neighbors pooled across all neighbor types).
- Type x type outputs (written to .uns):
- uns_key:
A directed matrix with entries = mean over genes of x_i(g) / (x_i(g) + mean_src(g)), aggregated across all contributing (cell, gene) observations for (c_src, c_tgt).
- uns_key_binary:
A directed matrix with entries = (# target cells of type c_tgt contaminated by c_src) / (# target cells of type c_tgt). Here “contaminated by c_src” is binary per target cell: at least one relevant gene (negative in target, positive in source) is detected in the target cell and detected in at least one neighbor of type c_src.
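The two per-cell quantities described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function names and the dictionary inputs are hypothetical, the list of relevant genes is taken as given, and genes with zero counts in both the focal cell and its neighbors are skipped to avoid dividing by zero (the source does not say how that case is handled).

```python
def contamination_counts(focal_counts, genes):
    """Total transcripts in the focal cell that belong to the relevant genes."""
    return sum(focal_counts.get(g, 0) for g in genes)


def contamination_fraction(focal_counts, neighbor_counts, genes):
    """Mean over genes of x_i(g) / (x_i(g) + mean_neighbor(g)).

    focal_counts:    dict gene -> count in the focal cell
    neighbor_counts: list of dicts gene -> count, one per neighboring cell
    genes:           relevant genes (negative in focal type, positive in a neighbor type)
    """
    if not neighbor_counts:
        return 0.0
    ratios = []
    for g in genes:
        x = focal_counts.get(g, 0)
        mean_nb = sum(nb.get(g, 0) for nb in neighbor_counts) / len(neighbor_counts)
        denom = x + mean_nb
        if denom == 0:
            continue  # assumption: ignore genes absent from focal cell and neighbors
        ratios.append(x / denom)
    return sum(ratios) / len(ratios) if ratios else 0.0


# Toy data: a focal cell with two neighbors, pooled across neighbor types.
focal = {"CD3E": 9, "MS4A1": 2, "CD68": 1}
nbs = [{"MS4A1": 6}, {"MS4A1": 2, "CD68": 3}]
relevant = ["MS4A1", "CD68"]

counts = contamination_counts(focal, relevant)
fraction = contamination_fraction(focal, nbs, relevant)
print(counts)  # 3
print(round(fraction, 4))  # 0.3667
```

Here MS4A1 contributes 2 / (2 + 4) and CD68 contributes 1 / (1 + 1.5), so the fraction approaches 1 only when the focal cell carries far more of a gene than its neighbors do on average.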
- Parameters:
sdata (SpatialData-like) – Must contain tables[tables_key] as an AnnData with expression and .obs metadata.
cell_type_key (str) – Column in the AnnData .obs with cell-type labels.
markers (dict) – {cell_type: {"positive": list[str], "negative": list[str]}}.
tables_key (str, optional, default="table") – Key of the AnnData table in sdata.tables.
tables_cell_id_key (str, optional, default="cell_id") – Column in the AnnData .obs with unique cell IDs.
tables_centroid_x_key (str or None, optional, default="x_centroid") – Column in the cell table with the x-coordinate of the cell centroid.
tables_centroid_y_key (str or None, optional, default="y_centroid") – Column in the cell table with the y-coordinate of the cell centroid.
require_neighbor_expression (bool, optional, default=True) – If True, contamination is only counted when the relevant gene is expressed in at least one neighboring cell of the source type.
neighbors_key (str, optional, default="spatial_connectivities") – Key in adata.obsp containing a cell x cell adjacency / connectivity matrix that defines the spatial neighborhood.
uns_key (str, optional, default="negative_marker_contamination") – Key in .uns under which the directed source → target mean contamination fraction matrix is stored.
uns_key_binary (str, optional, default="negative_marker_contamination_binary") – Key in .uns under which the directed source → target binary contamination proportion matrix is stored.
inplace (bool, optional, default=True) – If True, store per-cell contamination metrics in sdata.tables[tables_key].obs and type-level matrices in .uns.
- Returns:
per_cell_df (pd.DataFrame) – Per-cell contamination metrics, indexed by cell ID.
contamination_matrix_df (pd.DataFrame) – Directed type x type mean contamination fraction matrix (c_src rows, c_tgt columns). This matrix reports the average strength of contamination from one cell type into another.
contamination_binary_df (pd.DataFrame) – Directed type x type binary contamination proportion matrix (c_src rows, c_tgt columns). This matrix reports the proportion of cells of a target type that are contaminated by each source type.
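The binary contamination proportion can be illustrated with a small pure-Python sketch that follows the definition given for uns_key_binary. It is not the library's code: the function name binary_contamination and the dictionary inputs are hypothetical, and the result is returned as a dict keyed by (c_src, c_tgt) rather than as a DataFrame.

```python
def binary_contamination(cell_types, neighbors, expression, markers):
    """Proportion of target cells of each type contaminated by each source type.

    A target cell counts as contaminated by c_src if at least one gene that is
    negative for the target type and positive for c_src is detected (> 0) in
    the target cell and in at least one neighbor of type c_src.
    """
    types = sorted(set(cell_types.values()))
    n_target = {t: 0 for t in types}
    hits = {(s, t): 0 for s in types for t in types}
    for cell, tgt in cell_types.items():
        n_target[tgt] += 1
        negatives = set(markers.get(tgt, {}).get("negative", []))
        for src in types:
            genes = negatives & set(markers.get(src, {}).get("positive", []))
            src_nbs = [nb for nb in neighbors.get(cell, []) if cell_types[nb] == src]
            contaminated = any(
                expression[cell].get(g, 0) > 0
                and any(expression[nb].get(g, 0) > 0 for nb in src_nbs)
                for g in genes
            )
            if contaminated:
                hits[(src, tgt)] += 1
    return {key: hits[key] / n_target[key[1]] for key in hits}


# Toy data: two T cells flanking one B cell.
cell_types = {"c1": "T", "c2": "B", "c3": "T"}
neighbors = {"c1": ["c2"], "c2": ["c1", "c3"], "c3": ["c2"]}
markers = {
    "T": {"positive": ["CD3E"], "negative": ["MS4A1"]},
    "B": {"positive": ["MS4A1"], "negative": ["CD3E"]},
}
expression = {
    "c1": {"CD3E": 5, "MS4A1": 1},
    "c2": {"MS4A1": 7},
    "c3": {"CD3E": 4},
}

matrix = binary_contamination(cell_types, neighbors, expression, markers)
print(matrix[("B", "T")])  # 0.5
print(matrix[("T", "B")])  # 0.0
```

Only c1 carries the B-cell marker MS4A1 while neighboring a B cell that expresses it, so one of the two T cells is counted as contaminated by B cells; the B cell contains no CD3E and is therefore not contaminated by T cells.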