The supervised (sp) accessor
The supervised metrics accessor provides metrics for comparing spatial single-cell expression profiles with those from a single-cell RNA sequencing dataset.
- segtraq.sp.supervised.marker_purity(sdata, cell_type_key: str, markers: dict[str, dict[str, list[str]]], tables_key: str = 'table', tables_raw_counts_layer: str | None = None, tables_cell_id_key: str = 'cell_id', tables_centroid_x_key: str = 'x_centroid', tables_centroid_y_key: str = 'y_centroid', tables_gene_key: str | None = None, require_neighbor_expression: bool = True, neighbors_key: str = 'spatial_connectivities', inplace: bool = True) → DataFrame
Compute per-cell marker purity using balanced accuracy.
- For each cell of type c:
  - Positive markers of c are expected to be expressed.
  - Relevant negative markers are negative markers of c that are also positive markers of neighboring cell types.
- The score combines:
  - positive_marker_recall: fraction of expected positive markers that are expressed (analogous to recall/sensitivity at the marker level).
  - negative_marker_avoidance: fraction of relevant negative markers that are not expressed (analogous to specificity at the marker level).
  - marker_balanced_accuracy: mean of positive_marker_recall and negative_marker_avoidance.
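As a rough illustration of how the three scores relate, the following minimal sketch computes them for one cell from the set of genes detected in it. The helper `marker_scores`, the gene names, and the choice to return NaN when a marker list is empty are assumptions made for this example. They are not taken from segtraq's implementation.

```python
def marker_scores(
    expressed: set[str],
    positive: list[str],
    relevant_negative: list[str],
) -> dict[str, float]:
    """Score one cell from the genes detected in it (hypothetical helper)."""
    # Recall: share of expected positive markers that are detected.
    recall = (
        sum(g in expressed for g in positive) / len(positive)
        if positive
        else float("nan")  # assumption: undefined when nothing can be evaluated
    )
    # Avoidance: share of relevant negative markers that are NOT detected.
    avoidance = (
        sum(g not in expressed for g in relevant_negative) / len(relevant_negative)
        if relevant_negative
        else float("nan")
    )
    return {
        "positive_marker_recall": recall,
        "negative_marker_avoidance": avoidance,
        # Balanced accuracy is the plain mean of the two rates.
        "marker_balanced_accuracy": (recall + avoidance) / 2,
    }


# Illustrative T cell: two of three positive markers detected, and one of two
# relevant negative markers (positive markers of a neighboring B cell) detected.
scores = marker_scores(
    expressed={"CD3E", "CD2", "MS4A1"},
    positive=["CD3E", "CD2", "CD8A"],
    relevant_negative=["MS4A1", "CD79A"],
)
```

Here the recall is 2/3 and the avoidance is 1/2, so the balanced accuracy is their mean, 7/12.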
- Parameters:
sdata (SpatialData-like) – Must contain tables[tables_key] as an AnnData with expression and .obs metadata.
cell_type_key (str) – Column in the AnnData .obs with cell-type labels.
markers (dict) – A dictionary mapping cell types to their positive and negative markers, in the format {cell_type: {"positive": list[str], "negative": list[str]}}.
tables_key (str, optional, default="table") – Key of the AnnData table in sdata.tables.
tables_raw_counts_layer (str | None, optional) – Layer containing count data. If None, adata.X is used if it looks like counts. If a layer is specified, it must exist and contain count-like values.
tables_cell_id_key (str, optional, default="cell_id") – Column in the AnnData .obs with unique cell IDs.
tables_centroid_x_key (str or None, optional, default="x_centroid") – Column in the cell table with the x-coordinate of the cell centroid.
tables_centroid_y_key (str or None, optional, default="y_centroid") – Column in the cell table with the y-coordinate of the cell centroid.
tables_gene_key (str or None, default=None) – Column in sdata.tables[tables_key].var containing gene identifiers. If None, sdata.tables[tables_key].var_names are used.
require_neighbor_expression (bool, optional, default=True) – If True, contamination is only counted when the relevant gene is expressed in at least one neighboring cell of the source type.
neighbors_key (str, optional, default="spatial_connectivities") – Key in adata.obsp containing a cell x cell adjacency / connectivity matrix that defines the spatial neighborhood.
inplace (bool, optional, default=True) – If True, store marker purity results in sdata.tables[tables_key].obs.
- Returns:
A DataFrame indexed by cell ID with the columns ['positive_marker_recall', 'negative_marker_avoidance', 'marker_balanced_accuracy', 'n_evaluated_positive_markers', 'n_evaluated_negative_markers'].
- Return type:
pandas.DataFrame
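The markers argument is a nested dictionary, so it can help to check its shape before passing it on. The sketch below builds a small example and validates it. The gene names are illustrative, and `check_markers` is a hypothetical helper written for this example, not a segtraq function.

```python
# Illustrative marker dictionary in the documented format:
# {cell_type: {"positive": list[str], "negative": list[str]}}
markers = {
    "T cell": {"positive": ["CD3E", "CD2"], "negative": ["MS4A1", "CD79A"]},
    "B cell": {"positive": ["MS4A1", "CD79A"], "negative": ["CD3E", "CD2"]},
}


def check_markers(markers: dict[str, dict[str, list[str]]]) -> list[str]:
    """Return human-readable problems found in a marker dictionary (hypothetical helper)."""
    problems = []
    for cell_type, entry in markers.items():
        # Both keys must be present for every cell type.
        missing = {"positive", "negative"} - set(entry)
        if missing:
            problems.append(f"{cell_type}: missing keys {sorted(missing)}")
            continue
        # A gene listed as both positive and negative for one type is contradictory.
        overlap = set(entry["positive"]) & set(entry["negative"])
        if overlap:
            problems.append(f"{cell_type}: both positive and negative: {sorted(overlap)}")
    return problems


problems = check_markers(markers)
```

An empty `problems` list means every cell type has both keys and no gene is listed on both sides.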
- segtraq.sp.supervised.mutually_exclusive_coexpression_rate(sdata, markers: dict[str, dict[str, list[str]]], tables_key: str = 'table', tables_gene_key: str | None = None, tables_raw_counts_layer: str | None = None, pseudocount: float = 0.5, inplace: bool = True) → DataFrame
Compute mutual exclusivity between marker genes using Fisher’s exact test. Returns a DataFrame with one row per unordered gene pair.
- Parameters:
sdata (SpatialData-like) – Must contain tables[tables_key] as an AnnData with expression and .obs metadata.
markers (dict) – A dictionary mapping cell types to their positive and negative markers, in the format {cell_type: {"positive": list[str], "negative": list[str]}}.
tables_key (str, optional, default="table") – Key of the AnnData table in sdata.tables.
tables_gene_key (str or None, default=None) – Column in sdata.tables[tables_key].var containing gene identifiers. If None, sdata.tables[tables_key].var_names are used.
tables_raw_counts_layer (str | None, optional) – Layer containing count data. If None, adata.X is used if it looks like counts. If a layer is specified, it must exist and contain count-like values.
pseudocount (float, optional, default=0.5) – Pseudocount added to each of the four entries of the contingency table to avoid division by zero when computing odds ratios. With the default of 0.5 this is the Haldane-Anscombe correction; see https://pmc.ncbi.nlm.nih.gov/articles/PMC7398076.
inplace (bool, optional, default=True) – If True, store the resulting DataFrame in sdata.tables[tables_key].uns["MECR"].
- Returns:
A DataFrame with the columns ['gene1', 'gene2', 'odds_ratio', 'pvalue', 'a', 'b', 'c', 'd'], where (a, b, c, d) are the counts in the contingency table.
- Return type:
pd.DataFrame
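To make the per-pair statistics concrete, here is a minimal standard-library sketch of a pseudocount-corrected odds ratio and a Fisher's exact test for one gene pair. Two things are assumptions of this example, because the text above does not specify them: the table layout (a = both genes detected, b = gene1 only, c = gene2 only, d = neither) and the one-sided alternative (fewer co-expressing cells than expected). The function names are hypothetical.

```python
from math import comb


def corrected_odds_ratio(a: int, b: int, c: int, d: int, pseudocount: float = 0.5) -> float:
    """Odds ratio after adding the pseudocount to all four table entries."""
    a, b, c, d = (v + pseudocount for v in (a, b, c, d))
    return (a * d) / (b * c)


def fisher_exact_less(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher p-value: P(co-expression count <= a) with fixed margins.

    Assumed layout: a = both detected, b = gene1 only, c = gene2 only, d = neither.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    # Smallest co-expression count compatible with the fixed margins.
    k_min = max(0, row1 + col1 - n)
    # Sum hypergeometric probabilities for all tables with at most `a` co-expressing cells.
    favourable = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(k_min, a + 1))
    return favourable / comb(n, col1)


# Illustrative pair: 40 cells with gene1 only, 50 with gene2 only, 10 with neither,
# and no cell expressing both.
odds_ratio = corrected_odds_ratio(0, 40, 50, 10)
p_value = fisher_exact_less(0, 40, 50, 10)
```

An odds ratio well below 1 together with a small p-value points to mutually exclusive expression. Without the pseudocount, the zero in the table would make the odds ratio exactly 0 here, and a zero in b or c would cause a division by zero.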
- segtraq.sp.supervised.neighbor_contamination(sdata, cell_type_key: str, markers: dict[str, dict[str, list[str]]], tables_key: str = 'table', tables_raw_counts_layer: str | None = None, tables_cell_id_key: str = 'cell_id', tables_centroid_x_key: str = 'x_centroid', tables_centroid_y_key: str = 'y_centroid', tables_gene_key: str | None = None, require_neighbor_expression: bool = True, neighbors_key: str = 'spatial_connectivities', uns_key: str = 'contamination_counts_matrix', uns_key_binary: str = 'contamination_fraction_matrix', inplace: bool = True)
Compute per-cell negative-marker contamination and directed c_src→c_tgt contamination summaries.
- Per-cell outputs (written to .obs):
  - contamination_counts: total transcripts in the focal cell that belong to genes that are (i) negative markers of the focal cell type and (ii) positive markers of at least one neighboring cell type.
  - contamination_fraction: for each such gene g, x_i(g) / (x_i(g) + mean_neighbor(g)), averaged across genes, where x_i(g) is the count of g in focal cell i (neighbors pooled across all neighbor types).
- Type x type outputs (written to .uns):
  - uns_key: a directed matrix whose entries are the mean over genes of x_i(g) / (x_i(g) + mean_src(g)), aggregated across all contributing (cell, gene) observations for (c_src, c_tgt).
  - uns_key_binary: a directed matrix whose entries are (# target cells of type c_tgt contaminated by c_src) / (# target cells of type c_tgt). Here "contaminated by c_src" is binary per target cell: at least one relevant gene (negative in target, positive in source) is detected in the target cell and in at least one neighbor of type c_src.
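The per-cell quantities described above can be sketched for a single focal cell as follows. This is a simplified illustration under stated assumptions, not segtraq's code: counts are held in plain dictionaries, the list of relevant genes is given up front, genes with a zero denominator are skipped, and the helper name `per_cell_contamination` is hypothetical.

```python
def per_cell_contamination(
    focal_counts: dict[str, int],
    neighbor_counts: list[dict[str, int]],
    relevant_genes: list[str],
) -> tuple[int, float]:
    """Return (contamination_counts, contamination_fraction) for one focal cell.

    relevant_genes: negative markers of the focal type that are positive markers
    of at least one neighboring cell type (assumed to be precomputed).
    """
    # Total transcripts of relevant genes found in the focal cell.
    counts = sum(focal_counts.get(g, 0) for g in relevant_genes)
    if not neighbor_counts:
        return counts, float("nan")  # assumption: undefined without neighbors

    fractions = []
    for g in relevant_genes:
        x = focal_counts.get(g, 0)
        # Mean count of g over all neighbors, pooled across neighbor types.
        mean_nb = sum(nb.get(g, 0) for nb in neighbor_counts) / len(neighbor_counts)
        if x + mean_nb > 0:  # assumption: skip genes absent from cell and neighbors
            fractions.append(x / (x + mean_nb))
    fraction = sum(fractions) / len(fractions) if fractions else float("nan")
    return counts, fraction


# Illustrative T cell carrying two transcripts of a B-cell marker.
counts, fraction = per_cell_contamination(
    focal_counts={"CD3E": 10, "MS4A1": 2, "CD79A": 0},
    neighbor_counts=[{"MS4A1": 6, "CD79A": 4}, {"MS4A1": 2, "CD79A": 0}],
    relevant_genes=["MS4A1", "CD79A"],
)
```

For MS4A1 the ratio is 2 / (2 + 4) = 1/3 and for CD79A it is 0 / (0 + 2) = 0, so the averaged fraction is 1/6, while the contamination count is 2.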
- Parameters:
sdata (SpatialData-like) – Must contain tables[tables_key] as an AnnData with expression and .obs metadata.
cell_type_key (str) – Column in the AnnData .obs with cell-type labels.
markers (dict) – A dictionary mapping cell types to their positive and negative markers, in the format {cell_type: {"positive": list[str], "negative": list[str]}}.
tables_key (str, optional, default="table") – Key of the AnnData table in sdata.tables.
tables_raw_counts_layer (str | None, optional) – Layer containing count data. If None, adata.X is used if it looks like counts. If a layer is specified, it must exist and contain count-like values.
tables_cell_id_key (str, optional, default="cell_id") – Column in the AnnData .obs with unique cell IDs.
tables_centroid_x_key (str or None, optional, default="x_centroid") – Column in the cell table with the x-coordinate of the cell centroid.
tables_centroid_y_key (str or None, optional, default="y_centroid") – Column in the cell table with the y-coordinate of the cell centroid.
tables_gene_key (str or None, default=None) – Column in sdata.tables[tables_key].var containing gene identifiers. If None, sdata.tables[tables_key].var_names are used.
require_neighbor_expression (bool, optional, default=True) – If True, contamination is only counted when the relevant gene is expressed in at least one neighboring cell of the source type.
neighbors_key (str, optional, default="spatial_connectivities") – Key in adata.obsp containing a cell x cell adjacency / connectivity matrix that defines the spatial neighborhood.
uns_key (str, optional, default="contamination_counts_matrix") – Key in .uns under which the directed source → target mean contamination fraction matrix is stored.
uns_key_binary (str, optional, default="contamination_fraction_matrix") – Key in .uns under which the directed source → target binary contamination proportion matrix is stored.
inplace (bool, optional, default=True) – If True, store per-cell contamination metrics in sdata.tables[tables_key].obs and type-level matrices in .uns.
- Returns:
per_cell_df (pd.DataFrame) – Per-cell contamination metrics, indexed by cell ID.
contamination_matrix_df (pd.DataFrame) – Directed type x type mean contamination fraction matrix (c_src rows, c_tgt columns). This matrix reports the average strength of contamination from one cell type into another.
contamination_binary_df (pd.DataFrame) – Directed type x type binary contamination proportion matrix (c_src rows, c_tgt columns). This matrix reports the proportion of cells of each target type that are contaminated by each source type.
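The binary proportion matrix can be illustrated with a small pure-Python sketch. It follows the definition given for uns_key_binary, including the requirement that the gene is also detected in a neighbor of the source type. The data layout (dictionaries keyed by cell ID), the decision to ignore neighbors of the same type, and the function name `binary_contamination` are assumptions of this example.

```python
from collections import defaultdict


def binary_contamination(
    cells: dict[str, dict],
    neighbors: dict[str, list[str]],
    markers: dict[str, dict[str, list[str]]],
) -> dict[tuple[str, str], float]:
    """Map (c_src, c_tgt) to the share of c_tgt cells contaminated by c_src.

    cells: {cell_id: {"type": str, "expressed": set of detected genes}} (assumed layout).
    """
    n_target = defaultdict(int)
    n_hit = defaultdict(int)
    for cell_id, cell in cells.items():
        tgt = cell["type"]
        n_target[tgt] += 1
        sources_hit = set()
        for nb_id in neighbors.get(cell_id, []):
            nb = cells[nb_id]
            src = nb["type"]
            if src == tgt:
                continue  # assumption: a type does not contaminate itself
            # Relevant genes: negative in the target type, positive in the source type.
            relevant = set(markers[tgt]["negative"]) & set(markers[src]["positive"])
            # Contaminated if one such gene is detected in both target and neighbor.
            if relevant & cell["expressed"] & nb["expressed"]:
                sources_hit.add(src)
        # Each target cell counts at most once per source type.
        for src in sources_hit:
            n_hit[(src, tgt)] += 1
    return {(src, tgt): hits / n_target[tgt] for (src, tgt), hits in n_hit.items()}


markers = {
    "T cell": {"positive": ["CD3E", "CD2"], "negative": ["MS4A1", "CD79A"]},
    "B cell": {"positive": ["MS4A1", "CD79A"], "negative": ["CD3E", "CD2"]},
}
cells = {
    "t1": {"type": "T cell", "expressed": {"CD3E", "MS4A1"}},
    "t2": {"type": "T cell", "expressed": {"CD3E"}},
    "b1": {"type": "B cell", "expressed": {"MS4A1", "CD79A"}},
}
neighbors = {"t1": ["b1"], "t2": ["b1"], "b1": ["t1", "t2"]}
result = binary_contamination(cells, neighbors, markers)
```

In this toy data only one of the two T cells carries a B-cell marker that is also detected in its B-cell neighbor, so the single nonzero entry is the B cell to T cell proportion of 0.5.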