Baseline metrics

The baseline module contains several metrics that help you assess the quality of your segmentation. They quantify the number of cells in your data, the number of transcripts and genes per cell, the percentage of unassigned transcripts, the transcript density, and a variety of morphological features. Each method returns its value or data frame and also writes it into the spatialdata object.

To follow along with this tutorial, you can download the data from here.

[1]:

%load_ext autoreload
%autoreload 2

import math

import matplotlib.pyplot as plt
import spatialdata as sd

import segtraq

sdata = sd.read_zarr("../../data/proseg2.zarr")

# putting the spatialdata object into a SegTraQ constructor
# this has the advantage that we only need to set keywords like cell IDs or transcript IDs once
st = segtraq.SegTraQ(
    sdata,
    images_key=None,
    tables_area_volume_key="volume",
    points_background_id=0,
    tables_x_key="centroid_x",
    tables_y_key="centroid_y",
)

/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/src/segtraq/SegTraQ.py:128: UserWarning: Missing 4 cell IDs in shapes: [np.int64(11273), np.int64(10307), np.int64(10308), np.int64(6870)]... These cells are present in tables, but not in shapes. This might lead to inconsistencies in the spatialdata object.
  validate_spatialdata(

Let’s start by having a look at the spatialdata dataset.

To get a first impression of the quality of the data, we can check how many cells it contains. We can also check how many transcripts were measured in total, how many genes they map to, and what percentage of transcripts was not assigned to any cell.

[2]:

st.bl.num_cells()

[2]:

[3]:

st.bl.num_transcripts()

[3]:

[4]:

st.bl.num_genes()

[4]:

[5]:

st.bl.perc_unassigned_transcripts()

[5]:

np.float64(2.0661238167303786)
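
To make the last number concrete: the percentage of unassigned transcripts is the share of transcripts that carry the background ID. The following is a minimal sketch of that idea, assuming each transcript has a cell ID and that the background ID is 0 (as set via points_background_id above). The function name and the toy data are hypothetical and not part of SegTraQ.

```python
def perc_unassigned(cell_ids, background_id=0):
    """Return the percentage of transcripts whose cell ID equals the background ID."""
    cell_ids = list(cell_ids)
    if not cell_ids:
        raise ValueError("no transcripts given")
    n_unassigned = sum(1 for cell_id in cell_ids if cell_id == background_id)
    return 100.0 * n_unassigned / len(cell_ids)


# toy example: 8 transcripts, 2 of which fall into the background (ID 0)
toy_cell_ids = [3, 3, 0, 7, 7, 7, 0, 12]
toy_result = perc_unassigned(toy_cell_ids)
print(toy_result)  # 25.0
```

Read this way, the value of about 2.07 reported above means that roughly 2% of all transcripts were not assigned to any cell.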

Next, let’s see how many transcripts were detected per cell. We can do this using the transcripts_per_cell() method.

[6]:

transcripts_per_cell = st.bl.transcripts_per_cell()
# removing the background
transcripts_per_cell = transcripts_per_cell[transcripts_per_cell["cell_id"] != 0]
transcripts_per_cell.head()

[6]:

	cell_id	transcript_count
1	3446	2345
2	1097	1777
3	2722	1458
4	2006	1407
5	2100	1314

[7]:

# plotting the number of transcripts per cell
plt.figure(figsize=(10, 6))
plt.hist(transcripts_per_cell["transcript_count"], bins=100)

# adding a line for the median
plt.axvline(
    transcripts_per_cell["transcript_count"].median(),
    color="black",
    linestyle="dashed",
    linewidth=1,
)
plt.text(
    transcripts_per_cell["transcript_count"].median() + 5,
    50,
    f"Median: {transcripts_per_cell['transcript_count'].median():.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Number of Transcripts")
plt.ylabel("Count")
plt.title("Distribution of Transcripts per Cell")
plt.show()

We can do the same for the number of genes per cell. This number is usually lower than the transcript count, since a gene is often detected more than once in the same cell.

[8]:

genes_per_cell = st.bl.genes_per_cell()
# removing the background
genes_per_cell = genes_per_cell[genes_per_cell["cell_id"] != 0]
genes_per_cell.head()

[8]:

	cell_id	gene_count
1	1	389
2	2	45
3	3	5
4	4	120
5	5	41
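
The difference between the two per-cell counts can be illustrated with a small sketch. It assumes that transcripts are available as (cell ID, gene) pairs and that gene_count is the number of distinct genes in a cell. The helper name and the toy data are hypothetical, and SegTraQ may compute these counts differently internally.

```python
from collections import Counter, defaultdict

# hypothetical toy transcripts as (cell_id, gene) pairs; cell ID 0 is the background
toy_transcripts = [
    (1, "EEF1G"), (1, "EEF1G"), (1, "XBP1"),
    (2, "TIMP3"),
    (0, "EEF1G"),
]


def count_per_cell(transcripts, background_id=0):
    """Return (transcript_count, gene_count) dicts keyed by cell ID, skipping the background."""
    transcript_count = Counter()
    genes_seen = defaultdict(set)
    for cell_id, gene in transcripts:
        if cell_id == background_id:
            continue
        transcript_count[cell_id] += 1   # every transcript counts
        genes_seen[cell_id].add(gene)    # each gene counts only once per cell
    gene_count = {cell_id: len(genes) for cell_id, genes in genes_seen.items()}
    return dict(transcript_count), gene_count


transcript_count, gene_count = count_per_cell(toy_transcripts)
print(transcript_count)  # {1: 3, 2: 1}
print(gene_count)        # {1: 2, 2: 1}
```

By construction, the gene count of a cell can never exceed its transcript count.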

[9]:

# plotting the number of genes per cell
plt.figure(figsize=(10, 6))
plt.hist(genes_per_cell["gene_count"], bins=100)

# adding a line for the median
plt.axvline(
    genes_per_cell["gene_count"].median(),
    color="black",
    linestyle="dashed",
    linewidth=1,
)
plt.text(
    genes_per_cell["gene_count"].median() + 5,
    50,
    f"Median: {genes_per_cell['gene_count'].median():.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Number of Genes")
plt.ylabel("Count")
plt.title("Distribution of Genes per Cell")
plt.show()

In addition to the number of transcripts per cell, we can investigate the transcript density, which is computed as the number of transcripts divided by the cell size. The size is taken from the column passed as tables_area_volume_key, which is the volume in this example. Note that the background does not appear in this data frame.

[10]:

transcript_density = st.bl.transcript_density()
transcript_density.head()

[10]:

	cell_id	transcript_density
1	3446	0.745935
2	1097	1.361132
3	2722	1.176612
4	2006	1.169772
5	2100	0.935158
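
As a quick sanity check, the density can be recomputed by hand. The sketch below uses the transcript_count and volume values of cells 1 and 2 from the obs table printed at the end of this tutorial. The function name is hypothetical, and the formula (count divided by the value in the volume column) is inferred from those numbers rather than taken from the SegTraQ source.

```python
def transcript_density(transcript_count, size):
    """Transcripts per unit of cell size (area or volume, depending on the key used)."""
    if size <= 0:
        raise ValueError("cell size must be positive")
    return transcript_count / size


# transcript_count and volume of cells 1 and 2, copied from the final obs table
density_cell_1 = transcript_density(535, 1131.67400)
density_cell_2 = transcript_density(48, 256.04938)
print(round(density_cell_1, 6), round(density_cell_2, 6))
```

Both results agree with the transcript_density column of that table (0.472751 and 0.187464) to the displayed precision.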

[11]:

# plotting the transcript density per cell
plt.figure(figsize=(10, 6))
plt.hist(transcript_density["transcript_density"], bins=100)

# adding a line for the median
plt.axvline(
    transcript_density["transcript_density"].median(),
    color="black",
    linestyle="dashed",
    linewidth=1,
)
plt.text(
    transcript_density["transcript_density"].median() + 0.05,
    50,
    f"Median: {transcript_density['transcript_density'].median():.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Transcript Density (transcripts per area)")
plt.ylabel("Count")
plt.title("Distribution of Transcript Density per Cell")
plt.show()

Next, let’s look at some morphological features, such as the cell area, circularity, and elongation. We can get those with the method morphological_features(). If you only want to compute certain features, you can select them with the features_to_compute argument. This can drastically reduce the runtime, since elongation and eccentricity in particular can take a while to compute.

[12]:

morphological_features = st.bl.morphological_features()
morphological_features.head()

[12]:

	cell_id	cell_area	perimeter	circularity	bbox_width	bbox_height	extent	solidity	convexity	elongation	eccentricity	compactness	num_polygons
0	1	223.75	77.291274	0.470665	13.0	28.0	0.614698	0.874023	0.867660	2.153846	0.885685	26.699178	3
1	2	48.50	28.825554	0.733494	9.0	9.0	0.598765	0.906542	0.955507	1.090909	0.399653	17.132217	1
2	3	35.00	27.756656	0.570878	9.0	7.0	0.555556	0.853659	0.914635	1.368421	0.682625	22.012341	1
3	4	75.25	42.007644	0.535870	12.0	11.0	0.570076	0.817935	0.865989	1.016854	0.181313	23.450394	3
4	5	92.50	44.342251	0.591175	14.0	11.0	0.600649	0.864486	0.863819	1.119565	0.449652	21.256597	1
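
Several of these columns are simple functions of the others. The sketch below shows relationships that reproduce the values in the table above to the displayed precision. Whether SegTraQ computes the features exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def circularity(area, perimeter):
    """Isoperimetric quotient 4*pi*A / P**2; equals 1.0 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter**2


def compactness(area, perimeter):
    """Squared perimeter divided by area; larger values mean less compact shapes."""
    return perimeter**2 / area


def extent(area, bbox_width, bbox_height):
    """Fraction of the bounding box that is covered by the cell."""
    return area / (bbox_width * bbox_height)


def eccentricity_from_elongation(elongation):
    """Eccentricity of an ellipse whose axis ratio (major / minor) is `elongation`."""
    return math.sqrt(1.0 - 1.0 / elongation**2)


# values of cell 1, copied from the table above
area, perimeter = 223.75, 77.291274
print(f"circularity:  {circularity(area, perimeter):.4f}")
print(f"compactness:  {compactness(area, perimeter):.4f}")
print(f"extent:       {extent(area, 13.0, 28.0):.4f}")
print(f"eccentricity: {eccentricity_from_elongation(2.153846):.4f}")
```

Under these definitions, circularity and compactness carry the same information, since their product is always 4π.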

Let’s plot all of the distributions in one plot using subplots.

[13]:

# exclude 'cell_id' from the features to plot
features = [f for f in morphological_features.columns if f != "cell_id"]

# define grid layout
num_features = len(features)
cols = 6
rows = math.ceil(num_features / cols)

# create subplots
fig, axes = plt.subplots(rows, cols, figsize=(cols * 4, rows * 3))
axes = axes.flatten()

for i, feature in enumerate(features):
    ax = axes[i]
    data = morphological_features[feature]

    ax.hist(data, bins=100)
    median_val = data.median()
    ax.axvline(median_val, color="black", linestyle="dashed", linewidth=1)
    ax.text(
        median_val + 0.05,
        ax.get_ylim()[1] * 0.9,
        f"Median: {median_val:.2f}",
        color="black",
        fontsize=8,
    )

    ax.set_title(feature, fontsize=10)
    ax.set_xlabel("Value", fontsize=8)
    ax.set_ylabel("Count", fontsize=8)

# hide unused subplots (does not rely on the loop variable from above)
for unused_ax in axes[num_features:]:
    fig.delaxes(unused_ax)

fig.tight_layout()
plt.show()

Instead of computing metrics per cell, we can also compute some metrics per gene. For example, we can look at the percentage of each gene’s transcripts that were not assigned to any cell. This can help to detect potential biases in our segmentation.

[14]:

perc_unassigned_transcripts_per_gene = st.bl.perc_unassigned_transcripts_per_gene()
perc_unassigned_transcripts_per_gene.sort_values(by="perc_unassigned", ascending=False).head()

[14]:

	total	unassigned	perc_unassigned
feature_name
HPV16-E7	1	1	100.000000
CIDEC	169	105	62.130178
GHR	118	69	58.474576
PLIN1	240	137	57.083333
SLC1A1	22	12	54.545455

In our example, most transcripts were assigned to a cell. However, if you find that a large share of transcripts is unassigned, you can follow up with a gene set enrichment analysis (GSEA) to look for specific biases.
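
When ranking genes for such a follow-up, note that genes with very few transcripts can dominate the top of the list: HPV16-E7 reaches 100% with a single transcript. The sketch below recomputes the percentages from the total and unassigned counts in the table above and drops genes below a minimum total count. The function name and the threshold of 20 are arbitrary assumptions for illustration, not a SegTraQ feature.

```python
def rank_unassigned_genes(gene_counts, min_total=20):
    """Rank genes by percentage of unassigned transcripts, skipping low-count genes.

    gene_counts maps a gene name to a (total, unassigned) tuple.
    """
    ranked = []
    for gene, (total, unassigned) in gene_counts.items():
        if total < min_total:
            continue  # too few transcripts for a reliable percentage
        ranked.append((gene, 100.0 * unassigned / total))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# (total, unassigned) counts copied from the table above
gene_counts = {
    "HPV16-E7": (1, 1),
    "CIDEC": (169, 105),
    "GHR": (118, 69),
    "PLIN1": (240, 137),
    "SLC1A1": (22, 12),
}

ranked = rank_unassigned_genes(gene_counts, min_total=20)
for gene, perc in ranked:
    print(f"{gene}: {perc:.2f}%")
```

With this threshold, HPV16-E7 is excluded and CIDEC becomes the top-ranked gene. A suitable threshold depends on the panel and the sequencing depth.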

Finally, we can inspect the AnnData table to verify that the metrics we computed are stored in it.

[15]:

sdata.tables["table"]

[15]:

AnnData object with n_obs × n_vars = 17885 × 5095
    obs: 'cell_id', 'centroid_x', 'centroid_y', 'centroid_z', 'volume', 'label_id', 'region', 'gene_count', 'transcript_count', 'cell_area', 'perimeter', 'circularity', 'bbox_width', 'bbox_height', 'extent', 'solidity', 'convexity', 'elongation', 'eccentricity', 'compactness', 'num_polygons'
    var: 'total', 'unassigned', 'perc_unassigned'
    uns: 'spatialdata_attrs', 'num_cells', 'num_transcripts', 'num_genes', 'perc_unassigned_transcripts'
    layers: 'X_estimated'
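
The same check can be done programmatically instead of reading the printed summary. The sketch below compares a list of expected columns against the obs columns copied from the output above. The helper name and the choice of expected columns are assumptions for illustration.

```python
def missing_keys(expected, present):
    """Return the expected keys that are not in `present`, preserving their order."""
    present = set(present)
    return [key for key in expected if key not in present]


# obs columns copied from the AnnData summary printed above
obs_columns = [
    "cell_id", "centroid_x", "centroid_y", "centroid_z", "volume", "label_id",
    "region", "gene_count", "transcript_count", "cell_area", "perimeter",
    "circularity", "bbox_width", "bbox_height", "extent", "solidity",
    "convexity", "elongation", "eccentricity", "compactness", "num_polygons",
]

# hypothetical selection of per-cell metrics we expect to find
expected_obs = [
    "gene_count", "transcript_count", "transcript_density",
    "cell_area", "perimeter", "circularity",
]

missing = missing_keys(expected_obs, obs_columns)
print(missing)  # ['transcript_density']
```

In the summary printed above, transcript_density is not listed among the obs columns, whereas it does appear in the obs table after run_baseline() below. In a live session, the second argument could be sdata.tables["table"].obs.columns.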

Alternatively, all baseline (bl) metrics can be computed in one call with run_baseline().

[16]:

st.run_baseline()
sdata.tables["table"].obs.head()

[16]:

	cell_id	centroid_x	centroid_y	centroid_z	volume	label_id	region	gene_count	cell_area	perimeter	...	bbox_height	extent	solidity	convexity	elongation	eccentricity	compactness	num_polygons	transcript_count	transcript_density
0	1	154.66340	1053.59360	-0.115346	1131.67400	1	cell_labels	389.0	223.75	77.291274	...	28.0	0.614698	0.874023	0.867660	2.153846	0.885685	26.699178	3.0	535.0	0.472751
1	2	348.67285	287.59875	0.524985	256.04938	2	cell_labels	45.0	48.50	28.825554	...	9.0	0.598765	0.906542	0.955507	1.090909	0.399653	17.132217	1.0	48.0	0.187464
2	3	720.20215	1218.87230	0.660745	148.57190	3	cell_labels	5.0	35.00	27.756656	...	7.0	0.555556	0.853659	0.914635	1.368421	0.682625	22.012341	1.0	6.0	0.040384
3	4	877.44590	1707.41210	-0.980020	233.92108	4	cell_labels	120.0	75.25	42.007644	...	11.0	0.570076	0.817935	0.865989	1.016854	0.181313	23.450394	3.0	150.0	0.641242
4	5	1253.77540	1426.49510	0.859675	327.17487	5	cell_labels	41.0	92.50	44.342251	...	11.0	0.600649	0.864486	0.863819	1.119565	0.449652	21.256597	1.0	47.0	0.143654

5 rows × 22 columns

[17]:

sdata.tables["table"].var.head()

[17]:

	total	unassigned	perc_unassigned
gene_symbol
EEF1G	29662	568	1.914908
XBP1	24838	42	0.169096
TIMP3	18879	74	0.391970
RAB11FIP1	15609	24	0.153757
H3F3B	13492	368	2.727542