Baseline metrics

Baseline metrics#

The baseline module contains several metrics that can help you to assess the quality of your segmentation. This includes metrics to quantify the number of cells in your data, the number of transcripts/genes per cell, the percentage of unassigned transcripts, the transcript density, and a variety of morphological features

[1]:
%load_ext autoreload
%autoreload 2

import math

import matplotlib.pyplot as plt
import spatialdata as sd

import segtraq as st

# this dataset was downloaded from https://spatialdata.scverse.org/en/latest/tutorials/notebooks/datasets/README.html#
# it is the third dataset in the list
sdata = sd.read_zarr("/g/huber/projects/CODEX/segtraq/data/spatialdata_xenium_breast_cancer_rep1.zarr")
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning.
  warnings.warn(
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/xarray_schema/__init__.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  from pkg_resources import DistributionNotFound, get_distribution
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:532: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  left = partial(_left_join_spatialelement_table)
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:533: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  left_exclusive = partial(_left_exclusive_join_spatialelement_table)
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:534: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  inner = partial(_inner_join_spatialelement_table)
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:535: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  right = partial(_right_join_spatialelement_table)
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:536: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  right_exclusive = partial(_right_exclusive_join_spatialelement_table)
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/zarr/creation.py:610: UserWarning: ignoring keyword argument 'read_only'
  compressor, fill_value = _kwargs_compat(compressor, fill_value, kwargs)
/home/meyerben/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/anndata/_core/aligned_df.py:68: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)

Let’s start by having a look at the spatialdata dataset.

[2]:
sdata
[2]:
SpatialData object, with associated Zarr store: /g/huber/projects/CODEX/segtraq/data/spatialdata_xenium_breast_cancer_rep1.zarr
├── Images
│     ├── 'morphology_focus': DataTree[cyx] (1, 25778, 35416), (1, 12889, 17708), (1, 6444, 8854), (1, 3222, 4427), (1, 1611, 2213)
│     └── 'morphology_mip': DataTree[cyx] (1, 25778, 35416), (1, 12889, 17708), (1, 6444, 8854), (1, 3222, 4427), (1, 1611, 2213)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 8) (3D points)
├── Shapes
│     ├── 'cell_boundaries': GeoDataFrame shape: (167780, 1) (2D shapes)
│     └── 'cell_circles': GeoDataFrame shape: (167780, 2) (2D shapes)
└── Tables
      └── 'table': AnnData (167780, 313)
with coordinate systems:
    ▸ 'global', with elements:
        morphology_focus (Images), morphology_mip (Images), transcripts (Points), cell_boundaries (Shapes), cell_circles (Shapes)

To get a first impression of the quality of the data, we can check how many cells there are in the data. We can also check how many transcripts were measured in total, how many genes they map to, and how many transcripts were not assigned to a cell.

[3]:
st.bl.num_cells(sdata)
[3]:
167780
[4]:
st.bl.num_transcripts(sdata)
[4]:
42638083
[5]:
st.bl.num_genes(sdata)
[5]:
541
[6]:
st.bl.perc_unassigned_transcripts(sdata)
[6]:
0.06603498098167312

Next, let’s see how many transcripts were detected per cell. We can do this using the transcripts_per_cell() method.

[7]:
transcripts_per_cell = st.bl.transcripts_per_cell(sdata)
# removing the background
transcripts_per_cell = transcripts_per_cell[transcripts_per_cell["cell_id"] != -1]
transcripts_per_cell.head()
[7]:
cell_id transcript_count
1 82952 1703
2 11728 1652
3 1359 1592
4 105998 1568
5 79034 1519
[8]:
# plotting the number of transcripts per cell
plt.figure(figsize=(10, 6))
plt.hist(transcripts_per_cell["transcript_count"], bins=100)

# adding a line for the median
plt.axvline(
    transcripts_per_cell["transcript_count"].median(),
    color="black",
    linestyle="dashed",
    linewidth=1,
)
plt.text(
    transcripts_per_cell["transcript_count"].median() + 5,
    50,
    f"Median: {transcripts_per_cell['transcript_count'].median():.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Number of Transcripts")
plt.ylabel("Count")
plt.title("Distribution of Transcripts per Cell")
plt.show()
../_images/notebooks_baseline_11_0.png

We can also do the same thing for the number of genes per cell, since there are often multiple transcripts measured per gene.

[9]:
genes_per_cell = st.bl.genes_per_cell(sdata)
# removing the background
genes_per_cell = genes_per_cell[genes_per_cell["cell_id"] != -1]
genes_per_cell.head()
[9]:
cell_id gene_count
1 1 21
2 2 40
3 3 10
4 4 9
5 5 35
[10]:
# plotting the number of genes per cell
plt.figure(figsize=(10, 6))
plt.hist(genes_per_cell["gene_count"], bins=100)

# adding a line for the median
plt.axvline(
    genes_per_cell["gene_count"].median(),
    color="black",
    linestyle="dashed",
    linewidth=1,
)
plt.text(
    genes_per_cell["gene_count"].median() + 5,
    50,
    f"Median: {genes_per_cell['gene_count'].median():.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Number of Genes")
plt.ylabel("Count")
plt.title("Distribution of Genes per Cell")
plt.show()
../_images/notebooks_baseline_14_0.png

Next to the number of transcripts per cell, we can also investigate the transcript density, which is computed as the number of transcripts divided by the cell area. Note that the background does not appear in this data frame.

[11]:
transcript_density = st.bl.transcript_density(sdata)
transcript_density.head()
[11]:
cell_id transcript_density
1 82952 1.232467
2 11728 2.516099
3 1359 2.559001
4 105998 3.072366
5 79034 3.110092
[12]:
# plotting the transcript density per cell
plt.figure(figsize=(10, 6))
plt.hist(transcript_density["transcript_density"], bins=100)

# adding a line for the median
plt.axvline(
    transcript_density["transcript_density"].median(),
    color="black",
    linestyle="dashed",
    linewidth=1,
)
plt.text(
    transcript_density["transcript_density"].median() + 0.05,
    50,
    f"Median: {transcript_density['transcript_density'].median():.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Transcript Density (transcripts per area)")
plt.ylabel("Count")
plt.title("Distribution of Transcript Density per Cell")
plt.show()
../_images/notebooks_baseline_17_0.png

Finally, let’s look at some morphological features, such as the cell area, circularity, elongation, etc. We can get those with the function morphological_features(). If you only want to compute certain features, you can select them with the features_to_compute argument. This can drastically reduce the runtime, as especially the features elongation and eccentricity can take a while to compute.

[13]:
morphological_features = st.bl.morphological_features(sdata)
morphological_features.head()
[13]:
cell_id cell_area perimeter circularity bbox_width bbox_height extent solidity convexity elongation eccentricity compactness
0 1 56.919458 30.298952 0.779140 10.625000 8.500000 0.630250 0.965899 0.983011 1.190205 0.542289 16.128517
1 2 188.324123 59.574355 0.666802 20.400024 14.024994 0.658222 0.899008 0.955540 1.659024 0.797919 18.845720
2 3 14.495154 17.303524 0.608364 6.375000 4.462494 0.509524 0.822026 0.970329 1.278102 0.622764 20.656002
3 4 40.347058 24.257933 0.861617 8.287476 7.437500 0.654580 0.981329 0.999023 1.050888 0.307414 14.584639
4 5 102.527238 40.730717 0.776614 10.412476 15.087494 0.652632 0.953189 0.989881 1.675849 0.802455 16.180981

Let’s plot all of the distributions in one plot using subplots.

[15]:
# exclude 'cell_id' from the features to plot
features = [f for f in morphological_features.columns if f != "cell_id"]

# define grid layout
num_features = len(features)
cols = 6
rows = math.ceil(num_features / cols)

# create subplots
fig, axes = plt.subplots(rows, cols, figsize=(cols * 4, rows * 3))
axes = axes.flatten()

for i, feature in enumerate(features):
    ax = axes[i]
    data = morphological_features[feature]

    ax.hist(data, bins=100)
    median_val = data.median()
    ax.axvline(median_val, color="black", linestyle="dashed", linewidth=1)
    ax.text(
        median_val + 0.05,
        ax.get_ylim()[1] * 0.9,
        f"Median: {median_val:.2f}",
        color="black",
        fontsize=8,
    )

    ax.set_title(feature, fontsize=10)
    ax.set_xlabel("Value", fontsize=8)
    ax.set_ylabel("Count", fontsize=8)

# hide unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

fig.tight_layout()
plt.show()
../_images/notebooks_baseline_21_0.png