Module: Baseline Metrics

Contents

Module: Baseline Metrics#

When assessing the quality of a segmentation, the first thing one usually looks at are summary metrics such as the number of segmented cells, number of transcripts/genes per cell, the percentage of unassigned transcripts, the transcript density, and a variety of morphological features.

a2664e3787fa4f6e92d64d42be3c79dc

The baseline (bl) module contains several metrics that can help you to assess the quality of your segmentation. The methods all return their corresponding values or dataframes, and also write them into the spatialdata object.

To follow along with this tutorial, you can download the data from here.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import math

import matplotlib.pyplot as plt
import spatialdata as sd

import segtraq

sdata = sd.read_zarr("../../data/xenium_5K_data/proseg2.zarr")

# putting the spatialdata object into a SegTraQ constructor
# this has the advantage that we only need to set keywords like cell IDs or transcript IDs once
st = segtraq.SegTraQ(
    sdata,
    images_key=None,
    tables_area_key=None,
    points_background_id=0,
    tables_centroid_x_key="centroid_x",
    tables_centroid_y_key="centroid_y",
)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:531: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  left = partial(_left_join_spatialelement_table)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:532: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  left_exclusive = partial(_left_exclusive_join_spatialelement_table)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:533: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  inner = partial(_inner_join_spatialelement_table)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:534: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  right = partial(_right_join_spatialelement_table)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:535: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  right_exclusive = partial(_right_exclusive_join_spatialelement_table)
no parent found for <ome_zarr.reader.Label object at 0x7f17072adfd0>: None
no parent found for <ome_zarr.reader.Label object at 0x7f17072c6350>: None
no parent found for <ome_zarr.reader.Label object at 0x7f17072c6710>: None
no parent found for <ome_zarr.reader.Label object at 0x7f17072b3230>: None
no parent found for <ome_zarr.reader.Label object at 0x7f17075536f0>: None
no parent found for <ome_zarr.reader.Label object at 0x7f17072ccef0>: None
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/src/segtraq/SegTraQ.py:130: UserWarning: Missing 4 cell IDs in shapes: [np.int64(11273), np.int64(10307), np.int64(10308), np.int64(6870)]... These cells are present in tables, but not in shapes. This might lead to inconsistencies in the spatialdata object.
  validate_spatialdata(
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/src/segtraq/utils.py:1255: UserWarning: Duplicate IDs detected in index 'cell_id' for shapes 'nucleus_boundaries'. Resetting and renaming index to `segtraq_id` to ensure uniqueness.
  nucleus_shapes = _ensure_index(
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/src/segtraq/SegTraQ.py:130: RuntimeWarning: No area column specified for tables. Area will be automatically computed from shapes.
  validate_spatialdata(

To get a first impression of the quality of the data, we can check how many cells there are in the data. We can also check how many transcripts were measured in total, how many genes they map to, and how many transcripts were not assigned to a cell.

[3]:
st.bl.num_cells()
[3]:
17885
[4]:
st.bl.num_transcripts()
[4]:
2189801
[5]:
st.bl.num_genes()
[5]:
5095
[6]:
st.bl.perc_unassigned_transcripts()
[6]:
2.0661238167303786

Next, let’s see how many transcripts were detected per cell. We can do this using the transcripts_per_cell() method.

[7]:
transcripts_per_cell = st.bl.transcripts_per_cell()
transcripts_per_cell.head()
[7]:
cell_id transcript_count
0 3446 2345
1 1097 1777
2 2722 1458
3 2006 1407
4 2100 1314

We can plot the median and distribution of this to see how the number of transcripts differs across cells.

[8]:
# plotting the number of transcripts per cell
plt.figure(figsize=(10, 6))
plt.hist(transcripts_per_cell["transcript_count"], bins=100)

# adding a line for the median
plt.axvline(
    transcripts_per_cell["transcript_count"].median(),
    color="black",
    linestyle="dashed",
    linewidth=1,
)
plt.text(
    transcripts_per_cell["transcript_count"].median() + 5,
    50,
    f"Median: {transcripts_per_cell['transcript_count'].median():.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Number of Transcripts")
plt.ylabel("Count")
plt.title("Distribution of Transcripts per Cell")
plt.show()
../_images/notebooks_baseline_11_0.png

We can also do the same thing for the number of genes per cell, since there are often multiple transcripts measured per gene.

[9]:
genes_per_cell = st.bl.genes_per_cell()
genes_per_cell.head()
[9]:
cell_id gene_count
0 1 389
1 2 45
2 3 5
3 4 120
4 5 41
[10]:
# plotting the number of genes per cell
plt.figure(figsize=(10, 6))
plt.hist(genes_per_cell["gene_count"], bins=100)

# adding a line for the median
plt.axvline(
    genes_per_cell["gene_count"].median(),
    color="black",
    linestyle="dashed",
    linewidth=1,
)
plt.text(
    genes_per_cell["gene_count"].median() + 5,
    50,
    f"Median: {genes_per_cell['gene_count'].median():.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Number of Genes")
plt.ylabel("Count")
plt.title("Distribution of Genes per Cell")
plt.show()
../_images/notebooks_baseline_14_0.png

Next to the number of transcripts per cell, we can also investigate the transcript density, which is computed as the number of transcripts divided by the cell area. Note that the background does not appear in this data frame.

[11]:
transcript_density = st.bl.transcript_density()
transcript_density.head()
[11]:
cell_id transcript_density
0 3446 3.674109
1 1097 4.498734
2 2722 5.559581
3 2006 4.456057
4 2100 3.913626
[12]:
x = transcript_density["transcript_density"].dropna()

p99 = x.quantile(0.99)
x_clip = x[x <= p99]

plt.figure(figsize=(10, 6))
plt.hist(x_clip, bins=100)

# median from full distribution
med = x.median()

plt.axvline(med, color="black", linestyle="dashed", linewidth=1)
plt.text(
    med + 0.05,
    plt.ylim()[1] * 0.9,
    f"Median: {med:.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Transcript Density (transcripts per area)")
plt.ylabel("Count")
plt.title("Distribution of Transcript Density per Cell")
plt.show()
../_images/notebooks_baseline_17_0.png

We can also compute the mean number of transcripts per detected gene per cell, which is computed by averaging per-gene transcript counts across genes observed in each cell. Note that only detected genes are considered and background transcripts are excluded.

[13]:
mean_transcripts_per_gene_per_cell = st.bl.mean_transcripts_per_gene_per_cell()
mean_transcripts_per_gene_per_cell.head()
[13]:
cell_id mean_transcripts_per_gene
0 1 1.375321
1 2 1.066667
2 3 1.200000
3 4 1.250000
4 5 1.146341
[14]:
x = mean_transcripts_per_gene_per_cell["mean_transcripts_per_gene"].dropna()

p99 = x.quantile(0.99)
x_clip = x[x <= p99]

plt.figure(figsize=(10, 6))
plt.hist(x_clip, bins=100)

# median from full distribution
med = x.median()

plt.axvline(med, color="black", linestyle="dashed", linewidth=1)
plt.text(
    med + 0.05,
    plt.ylim()[1] * 0.9,
    f"Median: {med:.2f}",
    color="black",
)

# adding labels and title
plt.xlabel("Transcript Density (transcripts per area)")
plt.ylabel("Count")
plt.title("Distribution of Transcript Density per Cell")
plt.show()
../_images/notebooks_baseline_20_0.png

Finally, let’s look at some morphological features, such as the cell area, circularity, elongation, etc. We can get those with the function morphological_features(). If you only want to compute certain features, you can select them with the features_to_compute argument. This can drastically reduce the runtime, as especially the features elongation and eccentricity can take a while to compute.

[15]:
morphological_features = st.bl.morphological_features()
morphological_features.head()
[15]:
cell_id cell_area perimeter circularity bbox_width bbox_height extent solidity convexity elongation eccentricity compactness num_polygons
0 1 223.75 77.291274 0.470665 13.0 28.0 0.614698 0.874023 0.867660 2.153846 0.885685 26.699178 3
1 2 48.50 28.825554 0.733494 9.0 9.0 0.598765 0.906542 0.955507 1.090909 0.399653 17.132217 1
2 3 35.00 27.756656 0.570878 9.0 7.0 0.555556 0.853659 0.914635 1.368421 0.682625 22.012341 1
3 4 75.25 42.007644 0.535870 12.0 11.0 0.570076 0.817935 0.865989 1.016854 0.181313 23.450394 3
4 5 92.50 44.342251 0.591175 14.0 11.0 0.600649 0.864486 0.863819 1.119565 0.449652 21.256597 1

Let’s plot all of the distributions in one plot using subplots.

[16]:
# exclude 'cell_id' from the features to plot
features = [f for f in morphological_features.columns if f != "cell_id"]

# define grid layout
num_features = len(features)
cols = 6
rows = math.ceil(num_features / cols)

# create subplots
fig, axes = plt.subplots(rows, cols, figsize=(cols * 4, rows * 3))
axes = axes.flatten()

for i, feature in enumerate(features):
    ax = axes[i]
    data = morphological_features[feature]

    ax.hist(data, bins=100)
    median_val = data.median()
    ax.axvline(median_val, color="black", linestyle="dashed", linewidth=1)
    ax.text(
        median_val + 0.05,
        ax.get_ylim()[1] * 0.9,
        f"Median: {median_val:.2f}",
        color="black",
        fontsize=8,
    )

    ax.set_title(feature, fontsize=10)
    ax.set_xlabel("Value", fontsize=8)
    ax.set_ylabel("Count", fontsize=8)

# hide unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

fig.tight_layout()
plt.show()
../_images/notebooks_baseline_24_0.png

Instead of measures per cell, we can also compute some metrics per gene. For example, we can see the percentage of how often a gene was not assigned to any cell. This can help to detect potential biases in our segmentation.

[17]:
perc_unassigned_transcripts_per_gene = st.bl.perc_unassigned_transcripts_per_gene()
perc_unassigned_transcripts_per_gene.sort_values(by="perc_unassigned", ascending=False).head()
[17]:
total unassigned perc_unassigned
feature_name
HPV16-E7 1 1 100.000000
CIDEC 169 105 62.130178
GHR 118 69 58.474576
PLIN1 240 137 57.083333
SLC1A1 22 12 54.545455

In our example, most transcripts were assigned to a cell. However, if you detected that a large number of transcripts were unassigned, you could follow up with a gene set enrichment analysis (GSEA) to look for specific biases.

Finally, we can check the anndata object to verify that all of our metrics are stored in there.

[18]:
sdata.tables["table"]
[18]:
AnnData object with n_obs × n_vars = 17885 × 5095
    obs: 'cell_id', 'centroid_x', 'centroid_y', 'centroid_z', 'volume', 'label_id', 'region', 'gene_count', 'transcript_count', 'transcript_density', 'mean_transcripts_per_gene', 'cell_area', 'perimeter', 'circularity', 'bbox_width', 'bbox_height', 'extent', 'solidity', 'convexity', 'elongation', 'eccentricity', 'compactness', 'num_polygons'
    var: 'total', 'unassigned', 'perc_unassigned'
    uns: 'spatialdata_attrs', 'num_cells', 'num_transcripts', 'num_genes', 'perc_unassigned_transcripts'
    layers: 'X_estimated'

Alternatively, all bl metrics can be computed in one run via run_baseline.

[19]:
st.run_baseline()
sdata.tables["table"].obs.head()
[19]:
cell_id centroid_x centroid_y centroid_z volume label_id region gene_count mean_transcripts_per_gene transcript_count ... circularity bbox_width bbox_height extent solidity convexity elongation eccentricity compactness num_polygons
0 1 154.66340 1053.59360 -0.115346 1131.67400 1 cell_boundaries 389.0 1.375321 535.0 ... 0.470665 13.0 28.0 0.614698 0.874023 0.867660 2.153846 0.885685 26.699178 3.0
1 2 348.67285 287.59875 0.524985 256.04938 2 cell_boundaries 45.0 1.066667 48.0 ... 0.733494 9.0 9.0 0.598765 0.906542 0.955507 1.090909 0.399653 17.132217 1.0
2 3 720.20215 1218.87230 0.660745 148.57190 3 cell_boundaries 5.0 1.200000 6.0 ... 0.570878 9.0 7.0 0.555556 0.853659 0.914635 1.368421 0.682625 22.012341 1.0
3 4 877.44590 1707.41210 -0.980020 233.92108 4 cell_boundaries 120.0 1.250000 150.0 ... 0.535870 12.0 11.0 0.570076 0.817935 0.865989 1.016854 0.181313 23.450394 3.0
4 5 1253.77540 1426.49510 0.859675 327.17487 5 cell_boundaries 41.0 1.146341 47.0 ... 0.591175 14.0 11.0 0.600649 0.864486 0.863819 1.119565 0.449652 21.256597 1.0

5 rows × 23 columns

[20]:
sdata.tables["table"].var.head()
[20]:
total unassigned perc_unassigned
gene_symbol
EEF1G 29662 568 1.914908
XBP1 24838 42 0.169096
TIMP3 18879 74 0.391970
RAB11FIP1 15609 24 0.153757
H3F3B 13492 368 2.727542

Session Info#

[21]:
print(sd.__version__)  # spatialdata
0.7.2