Module: Region Similarity#

Assuming that transcripts are homogeneously distributed throughout the cell, we expect there to be similar expression of genes in the nucleus and in the rest of the cell. If this is not the case, it can be indicative of transcript spillover from adjacent cells.

dabc717dd6c54c13ae7eb9b6032a53ff

The region similarity (rs) module provides metrics to assess how similar the expression profiles are between subcellular regions (e.g., cell and nucleus).

To follow along with this tutorial, you can download the data from here.

[1]:

%load_ext autoreload
%autoreload 2

[2]:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import spatialdata as sd
import spatialdata_plot  # noqa
from scipy.stats import linregress

import segtraq

/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:531: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  left = partial(_left_join_spatialelement_table)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:532: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  left_exclusive = partial(_left_exclusive_join_spatialelement_table)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:533: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  inner = partial(_inner_join_spatialelement_table)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:534: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  right = partial(_right_join_spatialelement_table)
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/.venv/lib/python3.13/site-packages/spatialdata/_core/query/relational_query.py:535: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in enum.member() if you want to preserve the old behavior
  right_exclusive = partial(_right_exclusive_join_spatialelement_table)

We start out by loading the data into a SegTraQ object.

[3]:

# loading the spatialdata object
sdata = sd.read_zarr("../../data/xenium_v1_data/sdata_proseg_v3_crop.zarr")

# putting the object into a SegTraQ object
st = segtraq.SegTraQ(
    sdata,
    points_cell_id_key="assignment",
    points_background_id=None,
    points_gene_key="gene",
    tables_area_key=None,
    tables_cell_id_key="cell",
    shapes_cell_id_key="cell",
    tables_centroid_x_key="centroid_x",
    tables_centroid_y_key="centroid_y",
)

sdata

/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/src/segtraq/SegTraQ.py:130: RuntimeWarning: No area column specified for tables. Area will be automatically computed from shapes.
  validate_spatialdata(
/g/huber/users/meyerben/notebooks/spatial_transcriptomics/SegTraQ/src/segtraq/bl/baseline.py:564: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  areas = geom.area.values

[3]:

SpatialData object, with associated Zarr store: /g/huber/projects/CODEX/segtraq/data/20260204_SegTraQ_sdata/xenium_v1_data/sdata_proseg_v3_crop.zarr
├── Images
│     └── 'morphology_focus': DataTree[cyx] (1, 2500, 2500), (1, 1250, 1250), (1, 625, 625), (1, 313, 313), (1, 156, 156)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 10) (3D points)
├── Shapes
│     ├── 'cell_boundaries': GeoDataFrame shape: (1533, 1) (2D shapes)
│     ├── 'cell_boundaries_z0': GeoDataFrame shape: (1518, 2) (2D shapes)
│     ├── 'cell_boundaries_z1': GeoDataFrame shape: (1520, 2) (2D shapes)
│     ├── 'cell_boundaries_z2': GeoDataFrame shape: (1507, 2) (2D shapes)
│     ├── 'cell_boundaries_z3': GeoDataFrame shape: (1305, 2) (2D shapes)
│     └── 'nucleus_boundaries': GeoDataFrame shape: (1501, 1) (2D shapes)
└── Tables
      └── 'table': AnnData (1533, 480)
with coordinate systems:
    ▸ 'global', with elements:
        morphology_focus (Images), transcripts (Points), cell_boundaries (Shapes), cell_boundaries_z0 (Shapes), cell_boundaries_z1 (Shapes), cell_boundaries_z2 (Shapes), cell_boundaries_z3 (Shapes), nucleus_boundaries (Shapes)

As you can see, the spatialdatadataset contains cell and nuclear masks as shapes. It is important that you have a nuclear segmentation in your object, otherwise you will not be able to compute the metrics below.

Intersection over Union between cell and nucleus masks#

First, we get the nucleus that overlaps most with each cell and compute the Intersection over Union (IoU) between cell and nuclear masks using the method match_nuclei_to_cells().

[4]:

results_df = st.rs.match_nuclei_to_cells()
results_df.head()

[4]:

	cell	nucleus_id	iou	nucleus_fraction
0	70066	78094.0	0.372469	0.909792
1	70067	NaN	NaN	NaN
2	70069	78081.0	0.621191	0.938538
3	70070	78084.0	0.357129	0.969828
4	70071	78083.0	0.662761	0.994992

For each cell_id, we obtain the ID (nucleus_id) of the nucleus mask with the highest IoU. If a cell does not overlap with any nucleus, the function returns a missing value for nucleus_id. In addition to the IoU, we also report the fraction of the nucleus that overlaps with the cell. If the nucleus has an invalid geometry, IoU and nucleus_fraction are reported as NA.

Let’s see what this looks like when we plot the IoU spatially.

[5]:

# link annotations with cell boundaries
sdata.tables["table"].obs["region"] = "cell_boundaries"
sdata.set_table_annotates_spatialelement("table", region="cell_boundaries")

# plot
sdata.pl.render_shapes(
    element="cell_boundaries",
    color="iou",
    cmap="viridis",
    fill_alpha=0.5,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.render_shapes(
    element="nucleus_boundaries",
    fill_alpha=0.2,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.show(title="Overlay of nuclei and cell masks colored by IoU", colorbar=True)

WARNING  Found 35 NaN values in color data. These observations will be colored with the 'na_color'.

../_images/notebooks_region_similarity_11_1.png

We will quickly set up some helper functions to facilitate plotting.

[6]:

# helper functions for plotting
def plot_histogram(
    df,
    column,
    bins=30,
    figsize=(6, 5),
    color="steelblue",
    edgecolor="black",
    show_median=True,
    median_kwargs=None,
    median_color="red",
    title=None,
    xlabel=None,
    ylabel="Count",
    ax=None,
):
    values = df[column].dropna()
    median_value = np.median(values)

    if ax is None:
        fig, ax = plt.subplots(figsize=figsize, constrained_layout=True)

    ax.hist(values, bins=bins, color=color, edgecolor=edgecolor)

    if show_median:
        median_kwargs = median_kwargs or {}
        ax.axvline(
            median_value,
            linestyle="--",
            linewidth=2,
            color=median_color,
            label=f"Median = {median_value:.2f}",
            **median_kwargs,
        )
        ax.legend()

    ax.set_title(title or f"Distribution of {column}")
    ax.set_xlabel(xlabel or column)
    ax.set_ylabel(ylabel)

    plt.show()


def plot_regression(
    df,
    x,
    y,
    figsize=(6, 6),
    dropna=True,
    ci=95,
    scatter_kws=None,
    line_kws=None,
    title=None,
    xlabel=None,
    ylabel=None,
    r2_loc=(0.05, 0.95),
    r2_fmt="{:.3f}",
    ax=None,
):
    data = df[[x, y]]
    if dropna:
        data = data.dropna()

    # Regression stats
    slope, intercept, r_value, p_value, std_err = linregress(data[x], data[y])
    r_squared = r_value**2

    if ax is None:
        fig, ax = plt.subplots(figsize=figsize, constrained_layout=True)

    scatter_kws = scatter_kws or {"alpha": 0.6}
    line_kws = line_kws or {"color": "red"}

    sns.regplot(
        data=data,
        x=x,
        y=y,
        ci=ci,
        scatter_kws=scatter_kws,
        line_kws=line_kws,
        ax=ax,
    )

    # R² annotation
    ax.text(
        r2_loc[0],
        r2_loc[1],
        rf"$R^2 = {r2_fmt.format(r_squared)}$",
        transform=ax.transAxes,
        verticalalignment="top",
        fontsize=12,
        bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8),
    )

    ax.set_xlabel(xlabel or x)
    ax.set_ylabel(ylabel or y)
    ax.set_title(title or f"{y} vs. {x}")
    ax.grid(True)

    plt.show()

We can now investigate what the distribution of IoUs looks like.

[7]:

plot_histogram(
    df=sdata["table"].obs,
    column="iou",
    xlabel="Intersection over Union (IoU)",
)

../_images/notebooks_region_similarity_15_0.png

Expression similarity between cell and nucleus#

Now that we have matched each cell with a nucleus, we can investigate how similar the expression profiles are between the nucleus and the whole cell (including the nucleus). For this, we use the function similarity_nucleus_cell().

[8]:

similarity_df = st.rs.similarity_nucleus_cell(metric="cosine_sim")

[9]:

plot_histogram(
    df=sdata["table"].obs,
    column="similarity_nucleus_cell",
    title="Cosine similarity between cell and nucleus expression",
    xlabel="Cosine similarity",
)

../_images/notebooks_region_similarity_19_0.png

The histogram shows the distribution of cosine similarity values across cells. It is right-skewed, with a median of 0.77. This provides intuition about possible spatial spillover: cells may be contaminated by neighboring cells, assuming that nuclei capture expression with less contamination due to their smaller radius.

[10]:

plot_regression(
    df=sdata["table"].obs,
    x="iou",
    y="similarity_nucleus_cell",
    title="Cosine sim. vs. IoU with Regression Line",
    ylabel="Cosine similarity between cell and nucleus intersection",
)

../_images/notebooks_region_similarity_21_0.png

The calculated R2 (coefficient of determination) is 0.307, indicating that similarity/correlation between cell and nucleus increases with their IoU.

The spatial plot below shows cell boundaries colored by cosine similarity between the nucleus and the cell. Cells whose transcripts are distributed across the nucleus region tend to have higher similarity with the nucleus, while cells whose transcripts are located outside the nuclear region —potentially due to spillover from neighboring cells— tend to exhibit lower similarity.

[11]:

# link annotations with cell boundaries
sdata.tables["table"].obs["region"] = "cell_boundaries"
sdata.set_table_annotates_spatialelement("table", region="cell_boundaries")

# plot
fig, ax = plt.subplots(1, 2, figsize=(10, 5), constrained_layout=True)

sdata.pl.render_shapes(
    element="cell_boundaries",
    color="iou",
    cmap="viridis",
    fill_alpha=1.0,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.show(ax=ax[0], title="Overlay of nuclei and cell masks colored by IoU", colorbar=True)

sdata.pl.render_shapes(
    element="cell_boundaries",
    color="similarity_nucleus_cell",
    cmap="viridis",
    fill_alpha=1.0,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.show(ax=ax[1], title="Overlay of nuclei and cell masks colored by cosine similarity", colorbar=True)

WARNING  Found 35 NaN values in color data. These observations will be colored with the 'na_color'.
WARNING  Found 121 NaN values in color data. These observations will be colored with the 'na_color'.

../_images/notebooks_region_similarity_24_1.png

Let’s have a look at some cells witha low IoU, but a high similarity between the nucleus and the whole cell.

[12]:

obs = sdata["table"].obs
df = obs[["cell", "iou", "similarity_nucleus_cell"]].dropna()
df.loc[df["iou"] < 0.1].sort_values("similarity_nucleus_cell", ascending=False).head()

[12]:

	cell	iou	similarity_nucleus_cell
70961	70961	0.099841	0.738821
75471	75471	0.091001	0.727011
71219	71219	0.095490	0.714954
70949	70949	0.090434	0.713838
72782	72782	0.084363	0.700521

We can see that there are some cells with high similarity_nucleus_cell despite low IoU. Let’s investigate one of these cells in more detail. By plotting the cell (centroid marked by grey cross) with its nucleus (centroid marked by black cross) and assigned transcripts (red), we can see that for this cell, the nucleus is small in comparison to the cell, however the transcripts are still somewhat homogeneously distributed within the cell, leading to the high similarity.

[13]:

cid = (
    df.loc[df["iou"] < 0.1]
    .dropna(subset=["similarity_nucleus_cell"])
    .sort_values("similarity_nucleus_cell", ascending=False)
    .iloc[0]["cell"]
)
cid

[13]:

np.float64(70961.0)

[14]:

# helper function for plotting
def plot_cell_with_nucleus_and_transcripts(
    cid: float | int,
    title: str,
    pix_to_um_scale_factor: float = 0.2125,
    repositioned_transcripts: bool = False,
    padding=200,
    genes=None,
    center_layer="nucleus_boundaries",
    outer_layer="cell_boundaries",
):
    # add annotation of this cell to .obs
    sdata["table"].obs["focal_cell"] = sdata["table"].obs.index == cid

    # compute x,y of cell and nucleus centroids in µm space
    centroid_x_cell_px = sdata["cell_boundaries"].loc[cid].geometry.centroid.x
    centroid_y_cell_px = sdata["cell_boundaries"].loc[cid].geometry.centroid.y
    centroid_x_cell = centroid_x_cell_px / pix_to_um_scale_factor
    centroid_y_cell = centroid_y_cell_px / pix_to_um_scale_factor

    nid = sdata["table"].obs.loc[sdata["table"].obs["cell"] == cid, "nucleus_id"]
    centroid_x_nucleus = sdata["nucleus_boundaries"].loc[nid].geometry.centroid.x / pix_to_um_scale_factor
    centroid_y_nucleus = sdata["nucleus_boundaries"].loc[nid].geometry.centroid.y / pix_to_um_scale_factor

    # add annotation of this cell to .points and build new `PointsModel``
    trans = sdata.points["transcripts"].compute()
    trans["focal_cell"] = "other_cells"
    trans.loc[trans["assignment"] == cid, "focal_cell"] = "focal_cell"
    trans["focal_cell"] = trans["focal_cell"].astype("category")
    if repositioned_transcripts:
        trans = trans.drop(columns=["x", "y", "z"])
        trans = trans.rename(columns={"repositioned_x": "x", "repositioned_y": "y", "repositioned_z": "z"})
    sdata.points["transcripts_2"] = sd.models.PointsModel.parse(trans)
    T = sd.transformations.get_transformation(sdata.points["transcripts"])
    sd.transformations.set_transformation(sdata.points["transcripts_2"], T)

    # zoom in for better visibility
    sdata_cropped = sdata.query.bounding_box(
        axes=["x", "y"],
        min_coordinate=[centroid_x_cell - padding, centroid_y_cell - padding],
        max_coordinate=[centroid_x_cell + padding, centroid_y_cell + padding],
        target_coordinate_system="global",
    )

    # plot cell and transcripts of that cell
    axes = plt.subplots(1, 1, figsize=(8, 8), constrained_layout=True)[1]

    plot = sdata_cropped.pl.render_shapes(
        element=center_layer,
        fill_alpha=0.2,
        outline_alpha=1.0,
        outline_width=0.5,
        outline_color="black",
    ).pl.render_shapes(
        element=outer_layer,
        color="focal_cell",
        fill_alpha=0.1,
        outline_alpha=1.0,
        outline_width=0.5,
        outline_color="blue",
    )

    if genes is None:
        plot = plot.pl.render_points(
            "transcripts_2",
            color="focal_cell",
            fill_alpha=0.1,
            groups=["focal_cell"],
            palette=["red"],
        )
    else:
        plot = plot.pl.render_points(
            "transcripts",
            color="gene",
            fill_alpha=0.1,
            groups=genes,
            palette=["red"],
        )

    plot.pl.show(
        ax=axes,
        title=title,
        colorbar=True,
    )

    # landmark for cell and nucleus centroid
    axes.scatter([centroid_x_cell], [centroid_y_cell], marker="+", s=400, c="black", linewidths=2, zorder=10, alpha=0.4)
    axes.scatter([centroid_x_nucleus], [centroid_y_nucleus], marker="+", s=400, c="black", linewidths=2, zorder=10)

[15]:

plot_cell_with_nucleus_and_transcripts(cid, title="Cell with high similarity_nucleus_cell despite high IoU")

../_images/notebooks_region_similarity_30_0.png

Instead of the true (raw) coordinates, we can also plot the repositioned coordinates of transcripts, which shows that proseg places them inside the cell.

[16]:

plot_cell_with_nucleus_and_transcripts(cid, repositioned_transcripts=True, title="Repositioned transcripts")

../_images/notebooks_region_similarity_32_0.png

Such cells can be identified easily via the metric pct_outside_cell, which comptues the percentage of transcripts that are assigned to a cell, but localized outside its border.

[17]:

st.ps.percentage_transcripts_in_compartments().head()

[17]:

	n_total_all_genes	n_outside_cell_all_genes	n_in_nucleus_overlap_all_genes	n_in_cytoplasm_all_genes	pct_outside_cell_all_genes	pct_nucleus_all_genes	pct_cytoplasm_all_genes
assignment
70066.0	16	5	6	5	31.250000	37.500000	31.250000
70067.0	5	2	0	3	40.000000	0.000000	60.000000
70069.0	31	10	11	10	32.258065	35.483871	32.258065
70070.0	86	14	31	41	16.279070	36.046512	47.674419
70071.0	34	12	19	3	35.294118	55.882353	8.823529

So, just looking at the similarity between the nucleus and the entire cell isn’t specific enough. As a better metric, we therefore compute the similarity between the transcripts in the cell region intersecting the nucleus and the transcripts in the remaining cell region (we call this the cytoplasm, however with Proseg, this can also include transcripts that were originally measured outside of the cell, as shown above).

Similarity between Nucleus and Cytoplasm#

The function similarity_nucleus_cytoplasm() computes the cosine similarity between the spatial transcript feature counts within the nucleus and the cytoplasm. It is applicable only when nuclear masks are available. It returns NaN when the regions have no or not a sufficient number of overlapping transcripts (min_transcripts, min_genes). A low correlation/similarity may indicate that the cell boundary extension captures neighborhood or ambient signal rather than true intracellular expression—a concern similarly highlighted by segmentation benchmarks such as Baysor. This is based on the assumption that transcripts are homogeneously distributed within the cell and not localized in specific subcellular regions.

[18]:

sim_nuc_cyto_df = st.rs.similarity_nucleus_cytoplasm()
sim_nuc_cyto_df.head()

[18]:

	cell	nucleus_id	iou	nucleus_fraction	similarity_nucleus_cytoplasm
0	70066	78094.0	0.372469	0.909792	NaN
1	70067	NaN	NaN	NaN	NaN
2	70069	78081.0	0.621191	0.938538	NaN
3	70070	78084.0	0.357129	0.969828	0.547531
4	70071	78083.0	0.662761	0.994992	NaN

The histogram below shows that the cosine similarity between the nucleus and remaining part of the cell is 0.47.

[19]:

plot_histogram(
    df=sdata["table"].obs,
    column="similarity_nucleus_cytoplasm",
    xlabel="Cosine Similarity - Nucleus vs. Cytoplasm",
)

../_images/notebooks_region_similarity_39_0.png

The scatter plot below visualizes the cosine similarity between transcript counts in the intersection and remainder of each cell, plotted against the Intersection over Union (IoU) with the nucleus. The calculated R2 value is 0.006, indicating that similarity_nucleus_cytoplasm is independent of IoU.

[20]:

plot_regression(
    df=sdata["table"].obs,
    x="iou",
    y="similarity_nucleus_cytoplasm",
    title="Cosine sim. vs. IoU with Regression Line",
    ylabel="Cosine similarity between nucleus and cytoplasm",
)

../_images/notebooks_region_similarity_41_0.png

The spatial plots below shows the spatial distribution of computed correlations. The cosine similarity between parts can be a measure of how much neighboring signal is captured.

[21]:

# link annotations with cell boundaries
sdata.tables["table"].obs["region"] = "cell_boundaries"
sdata.set_table_annotates_spatialelement("table", region="cell_boundaries")

axes = plt.subplots(1, 3, figsize=(16, 6), constrained_layout=True)[1].flatten()

sdata.pl.render_shapes(
    element="nucleus_boundaries",
    fill_alpha=0.2,
    outline_width=0.5,
    outline_alpha=1.0,
    outline_color="black",
).pl.render_shapes(
    element="cell_boundaries",
    color="iou",
    cmap="viridis",
    fill_alpha=0.5,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.show(
    ax=axes[0],
    title="Intersection over Union (IoU)",
    colorbar=True,
    figsize=(6, 6),
)

sdata.pl.render_shapes(
    element="nucleus_boundaries",
    fill_alpha=0.2,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.render_shapes(
    element="cell_boundaries",
    color="similarity_nucleus_cell",
    cmap="viridis",
    fill_alpha=0.5,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.show(
    ax=axes[1],
    title="Nucleus vs. whole cell",
    colorbar=True,
    figsize=(6, 6),
)

sdata.pl.render_shapes(
    element="nucleus_boundaries",
    fill_alpha=0.2,
    outline_width=0.5,
    outline_alpha=1.0,
    outline_color="black",
).pl.render_shapes(
    element="cell_boundaries",
    color="similarity_nucleus_cytoplasm",
    cmap="viridis",
    fill_alpha=0.5,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.show(
    ax=axes[2],
    title="Nucleus vs. cytoplasm",
    colorbar=True,
    figsize=(6, 6),
)

WARNING  Found 35 NaN values in color data. These observations will be colored with the 'na_color'.
WARNING  Found 121 NaN values in color data. These observations will be colored with the 'na_color'.
WARNING  Found 274 NaN values in color data. These observations will be colored with the 'na_color'.

../_images/notebooks_region_similarity_43_1.png

Above, we can see that there are some cells with low cosine similarity between parts despite high IoU. Let’s investigate these. This happens when the transcripts of a gene are only localized in the nucleus but not in the cytoplasm or vice versa.

[22]:

obs = sdata.tables["table"].obs
cid = (
    obs.loc[obs["iou"] > 0.65]
    .dropna(subset=["similarity_nucleus_cytoplasm"])
    .sort_values("similarity_nucleus_cytoplasm")
    .iloc[0]["cell"]
)
cid

[22]:

np.uint32(70882)

[23]:

plot_cell_with_nucleus_and_transcripts(
    cid, title="Cell with low cosine similarity between nucleus and cytoplasm despite high IoU"
)

../_images/notebooks_region_similarity_46_0.png

Similarity between the cell border with its neigbhorhood#

The function similarity_nucleus_cytoplasm() computes the cosine similarity between (1) the cell border and an eroded interior and (2) the cell border and the immediate neighborhood.

Specifically, the function:

Erodes each cell polygon to obtain a center region.
Defines the border region as the set difference between the full cell and its eroded center.
Computes gene expression profiles for center and border.
Computes the correlation between center and border expression.
Computes the correlation between border expression and the neighborhood expression profile.
The ratio of the latter two

[24]:

border_nhood_df = st.rs.similarity_border_neighborhood()

[25]:

border_nhood_df.head()

[25]:

	cell	similarity_center_border	similarity_border_neighborhood	ratio_border_neighborhood_to_center
0	70069	0.489025	0.276086	0.564564
1	70070	0.394818	0.179005	0.453386
2	70072	0.480105	0.481392	1.002680
3	70073	0.559856	0.380872	0.680304
4	70074	0.440363	0.355462	0.807202

The histogram below shows the distribution of the correlation between the cell center and border.

[26]:

plot_histogram(
    df=sdata["table"].obs,
    column="similarity_center_border",
    xlabel="Cosine Similarity - Cell center vs. border",
)

../_images/notebooks_region_similarity_51_0.png

The histogram below shows the distribution of the correlation between the cell border and its neigbhorhood (with a distance of twice cell radius from its centroid). As expected, the correlation to the cell center is higher than to the cell neighborhood. This will differ depending on whether the cell is located within a homogeneous or a heterogeneous neighborhood.

[27]:

plot_histogram(
    df=sdata["table"].obs,
    column="similarity_border_neighborhood",
    xlabel="Cosine Similarity - Cell border vs. neighborhoos",
)

../_images/notebooks_region_similarity_53_0.png

The scatter plot below visualizes the cosine similarity between the cell border and its neigbhorhood plotted against the cosine simiarity between the cell border and cell center. The calculated R2 (coefficient of determination) value is 0.323.

[28]:

plot_regression(
    df=sdata["table"].obs,
    x="similarity_center_border",
    y="similarity_border_neighborhood",
)

../_images/notebooks_region_similarity_55_0.png

The spatial plots below shows the spatial distribution of computed correlations.

[29]:

# link annotations with cell boundaries
sdata.tables["table"].obs["region"] = "cell_boundaries"
sdata.set_table_annotates_spatialelement("table", region="cell_boundaries")

axes = plt.subplots(1, 4, figsize=(20, 5), constrained_layout=True)[1].flatten()

sdata.pl.render_shapes(
    element="cell_centers",
    fill_alpha=0.2,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.render_shapes(
    element="cell_boundaries",
    color="similarity_center_border",
    cmap="viridis",
    fill_alpha=0.5,
    outline_alpha=1.0,
    outline_width=0.5,
    outline_color="black",
).pl.show(
    ax=axes[0],
    title="Similarity: center vs. border",
    colorbar=True,
    figsize=(6, 6),
)


sdata.pl.render_shapes("cell_boundaries", color="similarity_center_border").pl.show(
    ax=axes[1], title="Similarity: center vs. border", coordinate_systems="global"
)

sdata.pl.render_shapes("cell_boundaries", color="similarity_border_neighborhood").pl.show(
    ax=axes[2], title="Similarity: border vs. neighborhood", coordinate_systems="global"
)

sdata.pl.render_shapes("cell_boundaries", color="ratio_border_neighborhood_to_center").pl.show(
    ax=axes[3], title="Ratio: sim_border_center/sim_border_nhood", coordinate_systems="global"
)

WARNING  Found 409 NaN values in color data. These observations will be colored with the 'na_color'.
WARNING  Found 409 NaN values in color data. These observations will be colored with the 'na_color'.
WARNING  Found 409 NaN values in color data. These observations will be colored with the 'na_color'.
WARNING  Found 409 NaN values in color data. These observations will be colored with the 'na_color'.

../_images/notebooks_region_similarity_57_1.png

Let’s investigate an example cell with one of the highest ratio_border_neighborhood_to_center.

[30]:

obs = sdata.tables["table"].obs
cid = obs.sort_values("ratio_border_neighborhood_to_center", ascending=False).iloc[0]["cell"]
cid

[30]:

np.uint32(71173)

We select one of the most highly abundant genes in this cell, CCDC80. The plot below shows that CCDC80 is distributed across the cell (yellow) border and its neighborhood, but absent from the cell center.

[31]:

tx = sdata.points["transcripts"].compute()
tx.loc[tx["assignment"] == cid, "gene"].value_counts()

[31]:

gene
CCDC80      5
POSTN       5
MRC1        4
FGL2        3
FBLN1       3
           ..
KLRF1       0
MPO         0
CD79B       0
PDCD1LG2    0
CD3G        0
Name: count, Length: 480, dtype: int64

[32]:

plot_cell_with_nucleus_and_transcripts(
    cid,
    title="Cell with high ratio_border_neighborhood_to_center",
    center_layer="cell_centers",
    outer_layer="cell_boundaries",
    repositioned_transcripts=True,
    genes=["CCDC80"],
)

INFO     input has more than 103 categories. Uniform 'grey' color will be used for all categories.

../_images/notebooks_region_similarity_62_1.png

All of the metrics are also stored in the spatialdata object.

[33]:

sdata.tables["table"]

[33]:

AnnData object with n_obs × n_vars = 1533 × 480
    obs: 'cell', 'original_cell_id', 'centroid_x', 'centroid_y', 'centroid_z', 'component', 'volume', 'surface_area', 'scale', 'region', 'cell_area', 'similarity_nucleus_cell', 'focal_cell', 'n_outside_cell_all_genes', 'n_in_nucleus_overlap_all_genes', 'n_in_cytoplasm_all_genes', 'pct_outside_cell_all_genes', 'pct_nucleus_all_genes', 'pct_cytoplasm_all_genes', 'nucleus_id', 'iou', 'nucleus_fraction', 'similarity_nucleus_cytoplasm', 'similarity_center_border', 'similarity_border_neighborhood', 'ratio_border_neighborhood_to_center'
    var: 'gene', 'total_count', 'lambda_bg_0', 'lambda_bg_1', 'lambda_bg_2', 'lambda_bg_3'
    uns: 'spatialdata_attrs', 'proseg_run'
    obsm: 'spatial'
    obsp: 'transition_counts'

Session Info#

[34]:

print(sd.__version__)  # spatialdata
print(spatialdata_plot.__version__)

0.7.2
0.2.13

Module: Region Similarity

Contents