Label transfer from scRNA-seq dataset via Pearson correlation#

In this notebook, we will show how to transfer labels from a reference single-cell RNA-sequencing (scRNA-seq) dataset via a lightweight approach. We will use Pearson correlation to find the celltype with the highest expression correlation for each cell the spatial transcriptomics datasset, according to CellSPA.

Load scRNA-seq dataset#

We use the scRNA-seq dataset of diffuse large B-cell lymphoma (DLBCL) samples from Dong et al. published on CellxGene.

[1]:

from pathlib import Path

import anndata as ad

scRNAseq_data_path = Path("/g/huber/projects/CODEX/segtraq/data/label_transfer/scRNA_seq_DLBCL_matched_with_2_C1.h5ad")
adata_ref = ad.read(scRNAseq_data_path)

/home/lazic/miniforge3/envs/sc_analysis_sdata/lib/python3.10/site-packages/anndata/__init__.py:42: FutureWarning: `anndata.read` is deprecated, use `anndata.read_h5ad` instead. `ad.read` will be removed in mid 2024.
  warnings.warn(

[2]:

adata_ref

[2]:

AnnData object with n_obs × n_vars = 29184 × 442
    obs: 'cell_type', 'is_primary_data', 'sample_id'

The counts in the reference dataset are already log-normalized, so we can copy that matrix to a layer called ‘logcounts’.

[3]:

adata_ref.X.toarray()

[3]:

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 2.38975334, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]], shape=(29184, 442))

[4]:

adata_ref.layers["logcounts"] = adata_ref.X.copy()

We aggregate log-normalized gene expression values (logcounts) across cells grouped by their annotated cell type, computing the mean expression per gene within each cell type. This yields a reference expression profile for each cell type.

[5]:

import numpy as np

logcounts = adata_ref.layers["logcounts"]
# If it's sparse, convert to dense
if not isinstance(logcounts, np.ndarray):
    logcounts = logcounts.toarray()

celltypes = adata_ref.obs["cell_type"]

[6]:

import pandas as pd

logcounts_df = pd.DataFrame(logcounts, columns=adata_ref.var_names)
logcounts_df["celltype"] = celltypes.values

ref_mean_df = logcounts_df.groupby("celltype").mean()

/tmp/ipykernel_3092810/2272690639.py:6: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  ref_mean_df = logcounts_df.groupby("celltype").mean()

[7]:

ref_mean_df

[7]:

	A2M	ABCA1	ACACA	ACADM	ACKR1	ACKR4	ACTA2	ADAM10	ADAM17	ADCY7	...	IL9	IGLC3	HMGB1	GNAS	IL22	IGLC2	DRAIC	ANGPT2	ICOSLG	IL6ST
celltype
B	0.000000	0.016769	0.156024	0.159486	0.000000	0.022256	0.050791	0.213838	0.229673	0.270148	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
DC_pc	0.036694	0.169812	0.117063	0.444065	0.004440	0.003707	0.140305	0.437622	0.294588	0.784672	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
Endothelia	1.471162	0.178528	0.270104	0.289596	0.885165	0.013624	0.112109	0.361705	0.256111	0.163421	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
Epithelia	0.019746	0.080427	0.254660	0.178775	0.001124	0.026345	0.017461	0.119152	0.206132	0.108029	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
Fibro_Muscle	0.266040	0.312430	0.264807	0.343359	0.072844	0.003145	0.434594	0.369848	0.277516	0.193193	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
Mono_Macro	0.656003	0.890168	0.149249	0.331428	0.014655	0.002430	0.114052	0.358778	0.334295	0.585345	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
NK	0.021211	0.106054	0.201798	0.357114	0.005284	0.000000	0.101637	0.559331	0.245895	0.659352	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
Pericyte	2.253352	0.360183	0.252609	0.435539	0.326374	0.000000	0.803320	0.553716	0.318709	0.157481	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
T_CD4	0.029272	0.077661	0.152805	0.257307	0.008759	0.000000	0.137884	0.459278	0.233179	0.363228	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
T_CD8	0.031461	0.045086	0.151711	0.311616	0.004273	0.001105	0.106523	0.336085	0.214291	0.386563	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
T_dividing	0.051053	0.032399	0.196901	0.487031	0.003520	0.000000	0.150578	0.765854	0.408252	0.345907	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
Tumor_DLBCL	0.017330	0.074456	0.278579	0.419433	0.004762	0.000889	0.196989	0.418151	0.366993	0.365700	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
cDC	0.077264	0.341498	0.163797	0.296956	0.021874	0.005103	0.095915	0.394451	0.305254	0.900343	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

13 rows × 442 columns

Load spatialData object#

Next, we load the spatialdata object of a Xenium DLBCL sample.

[8]:

import spatialdata as sd

sdata = sd.read_zarr(
    "/g/huber/projects/GSK_lazic/B_NHL/BNHL_RICOVER/tma_2/2_C1/proseg_output/spatialData_2_C1_proseg.zarr"
)

/home/lazic/miniforge3/envs/sc_analysis_sdata/lib/python3.10/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning.
  warnings.warn(
/home/lazic/miniforge3/envs/sc_analysis_sdata/lib/python3.10/site-packages/xarray_schema/__init__.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  from pkg_resources import DistributionNotFound, get_distribution
version mismatch: detected: RasterFormatV02, requested: FormatV04
/home/lazic/miniforge3/envs/sc_analysis_sdata/lib/python3.10/site-packages/zarr/creation.py:614: UserWarning: ignoring keyword argument 'read_only'
  compressor, fill_value = _kwargs_compat(compressor, fill_value, kwargs)

We perform quality control steps (filtering based on #transcripts and #genes per cell) and log-normalization.

[9]:

adata = sdata["table"]

[131]:

import scanpy as sc

import segtraq as st

# Compute QC metrics
transcript_counts = st.bl.transcripts_per_cell(sdata)
gene_counts = st.bl.genes_per_cell(sdata)

transcript_counts.index = transcript_counts.index.astype(str)
gene_counts.index = gene_counts.index.astype(str)

transcript_counts_filtered = transcript_counts.loc[adata.obs_names]
gene_counts_filtered = gene_counts.loc[adata.obs_names]

adata.obs["total_transcripts"] = transcript_counts_filtered["transcript_count"]
adata.obs["total_genes"] = gene_counts_filtered["gene_count"]

[144]:

adata.obs

[144]:

	centroid_x	centroid_y	cell_size	axis_minor_length	axis_major_length	centroid-0	centroid-1	eccentricity	solidity	perimeter	sample_id	TMA	region	mask_id	total_transcripts	total_genes
cell_id
1	377.79850	513.17664	395.71875	6.590376	8.878127	512.720930	376.953488	0.670050	0.860000	24.899495	2_C1	2	cell_labels	1	993	73
2	349.86170	643.56384	92.53125	2.653300	5.366563	643.600000	349.000000	0.869227	0.500000	0.000000	2_C1	2	cell_labels	2	985	89
3	401.73760	398.32178	397.68750	5.502993	9.527791	396.305556	400.916667	0.816339	0.800000	21.071068	2_C1	2	cell_labels	3	959	136
4	257.44656	319.80533	773.71875	9.718573	11.625859	319.301205	257.168675	0.548814	0.813725	34.384776	2_C1	2	cell_labels	4	932	161
5	195.35858	549.35860	194.90625	3.843169	5.055352	548.785714	194.928571	0.649668	0.875000	11.414214	2_C1	2	cell_labels	5	926	47
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5843	539.39000	343.35000	196.87500	6.445665	16.611864	343.063492	539.158730	0.921653	0.684783	40.798990	2_C1	2	cell_labels	5843	61	85
5844	256.99690	636.03420	316.96875	7.896618	12.941854	636.506667	256.333333	0.792277	0.842697	35.142136	2_C1	2	cell_labels	5844	61	41
5845	310.64035	493.85090	112.21875	6.976622	7.382183	493.272727	310.151515	0.326891	0.733333	24.727922	2_C1	2	cell_labels	5845	60	109
5846	366.64166	153.53334	236.25000	8.121170	12.488528	152.528571	366.328571	0.759686	0.777778	36.556349	2_C1	2	cell_labels	5846	60	46
5847	494.74417	190.87209	169.31250	8.103957	8.811781	190.180000	494.300000	0.392685	0.847458	26.384776	2_C1	2	cell_labels	5847	60	33

5845 rows × 16 columns

[133]:

# Filter outliers
qc_range = {"total_transcripts": (10, 2000), "total_genes": (5, np.inf)}

for key, (low, high) in qc_range.items():
    mask = (adata.obs[key] >= low) & (adata.obs[key] <= high)
    adata = adata[mask.to_numpy()].copy()

[134]:

# Normalize and log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers["logcounts"] = adata.X.copy()

/home/lazic/miniforge3/envs/sc_analysis_sdata/lib/python3.10/site-packages/legacy_api_wrap/__init__.py:82: UserWarning: Some cells have zero counts
  return fn(*args_all, **kw)

Label transfer via correlation#

Now, we compute the Pearson correlation between each query cell in the processed AnnData object and the mean expression profiles of reference cell types.

We first identify the common genes between the query data and the reference matrix, and then compute a correlation matrix where each entry reflects the similarity between a query cell and a reference cell type.

For each query cell, we:

Assign the most correlated reference cell type (i.e., the one with the highest Pearson correlation),

Store both the correlation value and the predicted cell type.

This provides a label transfer from the reference to the query dataset based on global transcriptomic similarity.

[ ]:

from anndata import AnnData
from scipy.spatial.distance import cdist


def assign_celltype_by_pearson(adata: AnnData, ref_mean_df: pd.DataFrame, layer: str = "logcounts") -> pd.DataFrame:
    """
    Assigns cell types to cells in `adata` by computing Pearson correlation
    with reference expression profiles.

    Parameters
    ----------
    adata : AnnData
        Query dataset with gene expression data in the specified layer.
    ref_mean_df : pd.DataFrame
        Reference expression matrix with shape (cell_types x genes).
    layer : str, default="logcounts"
        Layer in `adata` to use for expression values.

    Returns
    -------
    pd.DataFrame
        DataFrame with `cell_id` and assigned `celltype`.
    """
    # TODO - check if cdist does really Pearson correlation

    # Extract query expression data
    X_query = pd.DataFrame(
        (adata.layers[layer].toarray() if hasattr(adata.layers[layer], "toarray") else adata.layers[layer]),
        index=adata.obs_names,
        columns=adata.var_names,
    )

    # Align gene order
    common_genes = X_query.columns.intersection(ref_mean_df.columns)
    if len(common_genes) == 0:
        raise ValueError("No common genes found between query and reference.")

    X_query = X_query[common_genes]
    X_ref = ref_mean_df[common_genes]

    # Compute Pearson correlation (1 - correlation distance)
    cor_mat = 1 - cdist(X_query.values, X_ref.values, metric="correlation")
    cor_mat_df = pd.DataFrame(cor_mat, index=adata.obs_names, columns=X_ref.index)

    # Assign best-matching reference cell type
    best_celltype = cor_mat_df.idxmax(axis=1)

    # Return as DataFrame
    return pd.DataFrame({"cell_id": adata.obs_names, "celltype": best_celltype.values})

[146]:

result_df = assign_celltype_by_pearson(adata, ref_mean_df)

/tmp/ipykernel_2382657/4082630372.py:48: FutureWarning: The behavior of DataFrame.idxmax with all-NA values, or any-NA and skipna=False, is deprecated. In a future version this will raise ValueError
  best_celltype = cor_mat_df.idxmax(axis=1)

[158]:

result_df["celltype"].value_counts()

[158]:

celltype
Tumor_DLBCL     2293
T_CD8            952
Mono_Macro       789
T_CD4            599
T_dividing       521
Fibro_Muscle     204
cDC              134
Pericyte         123
B                100
Endothelia        62
NK                35
DC_pc             23
Epithelia          6
Name: count, dtype: int64

[156]:

result_df.set_index("cell_id", inplace=True)
adata.obs["celltype_corr"] = result_df.loc[adata.obs_names, "celltype"]

[163]:

# define colors for cell types

col_celltype = {
    "B": "#fb8072",
    "cDC": "#bc80bd",
    "DC_pc": "#910290",
    "Endothelia": "#fdb462",
    "Epithelia": "#959059",
    "Fibro_Muscle": "#fed9a6",
    "Mono_Macro": "#a6cee3",
    "NK": "#2782bb",
    "Pericyte": "#d2cd7e",
    "T_CD4": "#3c7761",
    "T_CD4_reg": "#66a61e",
    "T_CD8": "#66c2a5",
    "T_dividing": "#1b9e77",
    "Tumor_DLBCL": "#d45943",
}

[170]:

import matplotlib.pyplot as plt

# Set up side-by-side plots
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Cell size
sc.pl.embedding(
    adata,
    basis="X_custom_spatial",
    color="cell_size",
    size=50,
    ax=axs[0],
    show=False,
    title="Cell Size",
)

# Plot 2: Cell types with custom colors
sc.pl.embedding(
    adata,
    basis="X_custom_spatial",
    color="celltype_corr",
    size=50,
    palette=col_celltype,  # your custom palette from before
    ax=axs[1],
    show=False,
    title="Cell Type Assignment",
)

plt.tight_layout()
plt.show()

../_images/notebooks_labeling_via_correlation_26_0.png

Label transfer from scRNA-seq dataset via Pearson correlation

Contents

Label transfer from scRNA-seq dataset via Pearson correlation#

Load scRNA-seq dataset#

Load spatialData object#

Label transfer via correlation#