Utility functions
The utility functions provide helper methods for supervised QC, including label transfer and computation of reference marker genes.
- segtraq.run_label_transfer(sdata, adata_ref: AnnData, ref_cell_type: str, tables_raw_counts_layer: str | None = None, ref_raw_counts_layer: str | None = 'raw', tables_key: str = 'table', tables_cell_id_key: str = 'cell_id', tables_gene_key: str | None = None, points_key: str = 'transcripts', points_cell_id_key: str = 'cell_id', points_gene_key: str = 'feature_name', tx_min: float = 10.0, tx_max: float = 2000.0, gn_min: float = 5.0, gn_max: float = inf, cell_type_key: str = 'transferred_cell_type', ref_gene_key: str | None = None, use_hvg: bool = False, exclude_gene_prefixes: tuple[str, ...] = ('MT-', 'RPL', 'RPS'), inplace: bool = True) → DataFrame | None
Transfer cell type labels from a reference AnnData object to cells in a SpatialData table using Pearson correlation to reference mean expression profiles.
Raw counts are selected first, normalized with sc.pp.normalize_total, log-transformed with sc.pp.log1p, and then used for label transfer. If a raw-count layer is provided, it is used preferentially. Otherwise, .X is expected to contain raw counts.
- Parameters:
sdata (SpatialData) – SpatialData object containing the query dataset. Cell-level expression data are expected in sdata.tables[tables_key].
adata_ref (AnnData) – Reference AnnData object containing annotated cells.
ref_cell_type (str) – Column in adata_ref.obs containing the reference cell type labels.
tables_raw_counts_layer (str or None, default=None) – Layer in sdata.tables[tables_key].layers containing raw counts for the query data. If None, raw counts are expected in sdata.tables[tables_key].X.
ref_raw_counts_layer (str or None, default="raw") – Layer in adata_ref.layers containing raw counts for the reference data. If None, raw counts are expected in adata_ref.X.
tables_key (str, default="table") – Key identifying the cell-level AnnData table in sdata.tables.
tables_cell_id_key (str, default="cell_id") – Column in sdata.tables[tables_key].obs containing unique cell identifiers.
tables_gene_key (str or None, default=None) – Column in sdata.tables[tables_key].var containing gene identifiers. If None, sdata.tables[tables_key].var_names are used.
points_key (str, default="transcripts") – Key identifying the transcript-level points element in sdata.points.
points_cell_id_key (str, default="cell_id") – Column in the transcript points table containing cell identifiers.
points_gene_key (str, default="feature_name") – Column in the transcript points table containing gene names.
tx_min (float, default=10.0) – Minimum number of detected transcripts required for a cell to be retained.
tx_max (float, default=2000.0) – Maximum number of detected transcripts allowed for a cell to be retained.
gn_min (float, default=5.0) – Minimum number of detected genes required for a cell to be retained.
gn_max (float, default=np.inf) – Maximum number of detected genes allowed for a cell to be retained.
cell_type_key (str, default="transferred_cell_type") – Column name used to store transferred labels in the query table’s .obs.
ref_gene_key (str or None, default=None) – Column in adata_ref.var containing gene identifiers. If None, adata_ref.var_names are used.
use_hvg (bool, default=False) – If True, restrict label transfer to highly variable genes computed from the reference dataset.
exclude_gene_prefixes (tuple of str, default=("MT-", "RPL", "RPS")) – Gene prefixes to exclude from the HVG set before label transfer. Set to an empty tuple to disable this filtering.
inplace (bool, default=True) – If True, write transferred labels to sdata.tables[tables_key].obs[cell_type_key] and return None. If False, return a DataFrame with transferred labels and Pearson correlation scores.
- Returns:
If inplace=False, returns a DataFrame with columns including tables_cell_id_key, cell_type_key, and “pearson_score”. If inplace=True, modifies sdata in place and returns None.
- Return type:
pandas.DataFrame or None
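The correlation-based transfer described above can be illustrated with a minimal, dependency-free sketch. It is not the segtraq implementation: the helper names (`normalize_log1p`, `pearson`, `mean_profiles`, `transfer_labels`), the fixed `target_sum` of 10,000, and the toy counts are all assumptions made for illustration. The sketch normalizes raw counts per cell, log-transforms them, averages reference cells per cell type, and assigns each query cell the label of the best-correlated reference profile.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale a cell's raw counts to a fixed total, then apply log(1 + x).

    target_sum is an assumption of this sketch, not a documented segtraq setting.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def mean_profiles(ref_cells, ref_labels):
    """Average the normalized expression of reference cells per cell type."""
    sums, n_cells = {}, {}
    for counts, label in zip(ref_cells, ref_labels):
        expr = normalize_log1p(counts)
        acc = sums.setdefault(label, [0.0] * len(expr))
        for i, value in enumerate(expr):
            acc[i] += value
        n_cells[label] = n_cells.get(label, 0) + 1
    return {lab: [v / n_cells[lab] for v in acc] for lab, acc in sums.items()}

def transfer_labels(query_cells, profiles):
    """Assign each query cell the label of its best-correlated reference profile."""
    results = []
    for counts in query_cells:
        expr = normalize_log1p(counts)
        scores = {lab: pearson(expr, prof) for lab, prof in profiles.items()}
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results

# Toy data: 4 genes and two reference cell types (hypothetical values).
ref_cells = [[50, 40, 1, 0], [45, 55, 0, 2], [1, 0, 60, 30], [0, 3, 40, 50]]
ref_labels = ["T cell", "T cell", "B cell", "B cell"]
query_cells = [[20, 25, 1, 0], [0, 1, 15, 22]]

profiles = mean_profiles(ref_cells, ref_labels)
transferred = transfer_labels(query_cells, profiles)
labels = [label for label, _ in transferred]
```

Each entry of `transferred` pairs a label with its correlation, mirroring the `cell_type_key` and “pearson_score” columns returned when inplace=False.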
- segtraq.markers_from_reference(adata: AnnData, ref_cell_type: str, ref_gene_key: str | None = None, ref_raw_counts_layer: str | None = 'raw', mode: str = 'de', max_fpr: float | None = None, auc_pos_thresh: float = 0.9, method: str = 'wilcoxon', pval_adj_thresh: float = 0.05, logfc_pos_thresh: float = 1.0, vote_fraction_pos: float = 0.5, min_pos_frac: float = 0.1, max_neg_frac: float = 0.05, t_pos: float = 0.25, t_neg: float = 1.0, min_cells_per_celltype: int = 10, n_jobs: int = 1) → dict[str, dict[str, list[str]]]
Compute positive and negative markers per cell type using pairwise contrasts (AUC/pAUC or DE) followed by voting and a rarity-based definition of negative markers.
Positive markers: For each cell type c, a gene g is considered a positive marker if it is “up in c” in at least ceil(vote_fraction_pos * M_c) of its valid pairwise comparisons (M_c). Additionally, g must be expressed (> 0) in at least min_pos_frac fraction of cells of type c in the reference dataset.
Negative markers: For each ordered pair (a, b) of cell types, genes that are up in a vs b become negative-marker candidates for b if (1) they are expressed (> 0) in at most max_neg_frac fraction of cells of type b, and (2) they are not up in b vs any other cell type (computed across all ordered contrasts).
Overlap filtering: positive and negative marker lists are filtered separately:
Positive lists: genes appearing in ≥ t_pos * n_types lists are dropped.
Negative lists: genes appearing in ≥ t_neg * n_types lists are dropped.
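The voting rule and the overlap filter can be sketched in a few lines of plain Python. This is a literal reading of the rules stated above, not the library's code: the function names, the input dictionaries (`up_counts`, `n_valid`, `expr_frac`), and the toy values are hypothetical, and any edge-case handling in segtraq may differ.

```python
import math

def vote_positive(up_counts, n_valid, expr_frac,
                  vote_fraction_pos=0.5, min_pos_frac=0.1):
    """Keep gene g for cell type c if it is 'up in c' in at least
    ceil(vote_fraction_pos * M_c) valid contrasts and is expressed in at
    least min_pos_frac of the cells of type c."""
    markers = {}
    for c, gene_counts in up_counts.items():
        needed = math.ceil(vote_fraction_pos * n_valid[c])
        markers[c] = sorted(
            g for g, k in gene_counts.items()
            if k >= needed and expr_frac[c].get(g, 0.0) >= min_pos_frac
        )
    return markers

def overlap_filter(marker_lists, threshold):
    """Drop genes that appear in >= threshold * n_types marker lists."""
    n_types = len(marker_lists)
    occurrences = {}
    for genes in marker_lists.values():
        for g in set(genes):
            occurrences[g] = occurrences.get(g, 0) + 1
    cutoff = threshold * n_types
    return {c: [g for g in genes if occurrences[g] < cutoff]
            for c, genes in marker_lists.items()}

# Hypothetical toy input: four cell types, each with M_c = 3 valid contrasts.
n_valid = {"A": 3, "B": 3, "C": 3, "D": 3}
up_counts = {
    "A": {"CD3E": 3, "ACTB": 2, "MS4A1": 1},
    "B": {"MS4A1": 3, "ACTB": 2},
    "C": {"LYZ": 2, "ACTB": 3},
    "D": {"NKG7": 3, "LYZ": 1},
}
expr_frac = {
    "A": {"CD3E": 0.8, "ACTB": 0.9, "MS4A1": 0.02},
    "B": {"MS4A1": 0.7, "ACTB": 0.9},
    "C": {"LYZ": 0.05, "ACTB": 0.95},
    "D": {"NKG7": 0.6, "LYZ": 0.3},
}

voted = vote_positive(up_counts, n_valid, expr_frac)
# A threshold of 0.5 is used here so that the cutoff is 2 lists for 4 types.
filtered = overlap_filter(voted, threshold=0.5)
```

In this toy input, ACTB passes the vote for three cell types and is therefore removed by the overlap filter, while LYZ is rejected for type C because it is expressed in too few cells. Note that under this literal reading the cutoff scales with the number of cell types, so a small threshold combined with few cell types can remove every gene.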
- Parameters:
adata (AnnData) – Reference single-cell dataset (cells x genes).
ref_cell_type (str) – Column in adata.obs containing cell type labels.
ref_gene_key (str or None, default=None) – Column in adata.var containing gene identifiers. If None, adata.var_names are used.
ref_raw_counts_layer (str or None, default="raw") – Layer in adata.layers containing raw counts. If None, raw counts are expected in adata.X.
mode ({"auc", "de"}, optional (default: "de")) –
“auc”: compute markers using pairwise AUC/pAUC.
“de”: compute markers using pairwise DE.
max_fpr (float or None, optional (default: None)) – (AUC mode only) If None, compute full AUC. If in (0, 1], compute standardized pAUC over [0, max_fpr] using sklearn’s roc_auc_score(max_fpr=max_fpr).
auc_pos_thresh (float, optional (default: 0.9)) – (AUC mode only) Minimum AUC/pAUC for a gene to be considered “up in c_i vs c_j”.
method (str, optional (default: "wilcoxon")) – (DE mode only) DE method passed to sc.tl.rank_genes_groups (“wilcoxon”, “t-test”, “logreg”, …).
pval_adj_thresh (float, optional (default: 0.05)) – (DE mode only) FDR (adjusted p-value) cutoff for positive markers.
logfc_pos_thresh (float, optional (default: 1.0)) – (DE mode only) Minimum log fold-change for positive markers (c > d).
vote_fraction_pos (float, optional (default: 0.5)) – Fraction of valid pairwise contrasts in which a gene must be “up in c” (AUC mode) / significantly up in c (DE mode) to be called a positive marker of c.
min_pos_frac (float, optional (default: 0.1)) – Minimum fraction of cells of type c in which a gene must be expressed (counts > 0) in the reference dataset to be considered a positive marker of c.
max_neg_frac (float, optional (default: 0.05)) – Maximum fraction of cells of type c in which a gene may be expressed (counts > 0) in the reference dataset to be considered a negative marker of c.
t_pos (float, optional (default: 0.25)) – Overlap filter threshold for positive markers.
t_neg (float, optional (default: 1.0)) – Overlap filter threshold for negative markers.
min_cells_per_celltype (int, optional (default: 10)) – Minimum number of cells required per cell type to be included in pairwise computations.
n_jobs (int, optional (default: 1)) – Number of parallel jobs for running pairwise computations.
- Returns:
A dictionary mapping each cell type to its positive and negative markers: {cell_type: {“positive”: [genes], “negative”: [genes]}}
- Return type:
dict
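A short sketch of how the returned dictionary might be consumed, for example to look up which cell types a gene marks. The `markers` value below is a hand-written stand-in with the documented shape, and `invert_markers` is a hypothetical helper, not part of segtraq.

```python
# Hand-written example with the documented structure:
# {cell_type: {"positive": [genes], "negative": [genes]}}
markers = {
    "T cell": {"positive": ["CD3E", "CD2"], "negative": ["MS4A1"]},
    "B cell": {"positive": ["MS4A1", "CD79A"], "negative": ["CD3E"]},
}

def invert_markers(markers, kind="positive"):
    """Map each gene to the cell types for which it is a marker of the given kind."""
    gene_to_types = {}
    for cell_type, lists in markers.items():
        for gene in lists[kind]:
            gene_to_types.setdefault(gene, []).append(cell_type)
    return gene_to_types

positive_lookup = invert_markers(markers, "positive")
negative_lookup = invert_markers(markers, "negative")
```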
- segtraq.validate_spatialdata(sdata: SpatialData, images_key: str | None = 'morphology_focus', tables_key: str = 'table', tables_cell_id_key: str = 'cell_id', tables_area_key: str | None = 'cell_area', tables_centroid_x_key: str | None = 'x_centroid', tables_centroid_y_key: str | None = 'y_centroid', tables_gene_key: str | None = None, tables_raw_counts_layer: str | None = None, points_key: str = 'transcripts', points_cell_id_key: str = 'cell_id', points_background_id: str = 'UNASSIGNED', points_x_key: str = 'x', points_y_key: str = 'y', points_z_key: str | None = 'z', points_gene_key: str = 'feature_name', shapes_key: str | list[str] = 'cell_boundaries', shapes_cell_id_key: str = 'cell_id', nucleus_shapes_key: str | None = 'nucleus_boundaries', nucleus_shapes_cell_id_key: str = 'cell_id') → bool
Validates the integrity of a SpatialData object by checking the consistency of cell IDs across points, shapes, and tables.
This function ensures that:
- All points have corresponding shapes and tables.
- Cell IDs in points match those in shapes and tables.
- If shapes are present, they contain all cell IDs from the points.
- If tables are present, they contain all cell IDs from the shapes.
- Parameters:
sdata (sd.SpatialData) – The SpatialData object to validate.
tables_key (str, optional) – Key for accessing tables in the SpatialData. Default is “table”.
tables_cell_id_key (str, optional) – Column name in the tables DataFrame (AnnData.obs) that contains cell IDs. Default is “cell_id”.
tables_area_key (str or None, optional, default="cell_area") – Column in the cell table with cell area (2D). If None, area/volume-based metrics will be computed via segtraq.bl.morphological_features.
tables_centroid_x_key (str or None, optional, default="x_centroid") – Column in the cell table with the x-coordinate of the cell centroid.
tables_centroid_y_key (str or None, optional, default="y_centroid") – Column in the cell table with the y-coordinate of the cell centroid.
tables_gene_key (str or None, default=None) – Column in sdata.tables[tables_key].var containing gene identifiers. If None, sdata.tables[tables_key].var_names are used.
tables_raw_counts_layer (str | None, optional) – Layer containing count data. If None, adata.X is used if it looks like counts. If a layer is specified, it must exist and contain count-like values.
points_key (str, optional) – Key for accessing points (e.g., transcripts) in the SpatialData. Default is “transcripts”.
points_cell_id_key (str, optional) – Column name in the points DataFrame indicating cell assignments. Default is “cell_id”.
points_background_id (str, optional) – Identifier used for unassigned or background transcripts in the points DataFrame. Default is “UNASSIGNED”.
shapes_key (str or list of str, optional) – Key(s) for accessing shapes (e.g., cell boundaries) in the SpatialData. Default is “cell_boundaries”. Can be a list if multiple shape layers are present.
shapes_cell_id_key (str, optional, default="cell_id") – Cell ID key for sdata.shapes[shapes_key]. Must match either the shapes index name or a column name (which will be set as the index if needed). If None, the index is assumed to contain cell IDs and renamed to “segtraq_cell_id”.
nucleus_shapes_key (str or None, optional, default="nucleus_boundaries") – Key in sdata.shapes for nucleus boundary polygons, if available. If None, a nucleus mask can be obtained via segtraq.run_cellpose.
nucleus_shapes_cell_id_key (str, optional, default="cell_id") – Cell ID key for sdata.shapes[nucleus_shapes_key]. Must match either the shapes index name or a column name (which will be set as the index if needed).
- Raises:
TypeError – If the input is not an instance of sd.SpatialData.
ValueError – If the SpatialData object does not contain points or if there are inconsistencies in cell IDs.
- Returns:
True if the SpatialData object passes all validation checks. Otherwise, an error or warning is raised.
- Return type:
bool
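The cell ID consistency checks can be pictured as set comparisons. The sketch below is only an illustration under stated assumptions: `check_cell_id_consistency` is a hypothetical helper operating on plain lists of IDs, whereas the real function works on a SpatialData object and performs additional checks.

```python
def check_cell_id_consistency(point_ids, shape_ids, table_ids,
                              background_id="UNASSIGNED"):
    """Return True if shapes cover all assigned point IDs and the table
    covers all shape IDs; raise ValueError otherwise."""
    # Background transcripts are not assigned to any cell, so skip them.
    assigned = {i for i in point_ids if i != background_id}
    shapes = set(shape_ids)
    tables = set(table_ids)

    missing_in_shapes = assigned - shapes
    if missing_in_shapes:
        raise ValueError(
            f"Cell IDs in points but not in shapes: {sorted(missing_in_shapes)}"
        )

    missing_in_tables = shapes - tables
    if missing_in_tables:
        raise ValueError(
            f"Cell IDs in shapes but not in tables: {sorted(missing_in_tables)}"
        )
    return True

# Consistent toy input (hypothetical IDs).
ok = check_cell_id_consistency(
    point_ids=["c1", "c2", "UNASSIGNED", "c1"],
    shape_ids=["c1", "c2"],
    table_ids=["c1", "c2"],
)

# Inconsistent toy input: a transcript is assigned to "c3", which has no shape.
try:
    check_cell_id_consistency(["c1", "c3"], ["c1"], ["c1"])
    failed = False
except ValueError:
    failed = True
```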
- segtraq.cellpose(sdata: SpatialData, channel: str = 'DAPI', images_key: str = 'morphology_focus', images_data_key: str = 'scale0/image', shapes_key: str = 'cell_boundaries', key_added: str = 'nucleus_boundaries', inplace: bool = True, **kwargs)
Runs the Cellpose segmentation algorithm on image data from a SpatialData object. The function extracts the image from the SpatialData object, applies Cellpose, and stores the resulting segmentation masks as polygons in the shapes attribute of the SpatialData object.
- Parameters:
sdata (SpatialData) – A SpatialData object containing segmented and transcript-assigned spatial transcriptomics data (images, tables, points, shapes and optional labels).
channel (str, default="DAPI") – The channel(s) to be used for segmentation.
images_key (str, default="morphology_focus") – Key in sdata.images for a nuclear or morphology image (e.g., DAPI). Used for segmentation.
images_data_key (str, default="scale0/image") – Key for accessing data in sdata.images if they are stored as a DataTree. Consider using a higher scale (lower resolution) for segmentation to speed up computation and reduce memory usage during Cellpose.
shapes_key (str, default="cell_boundaries") – Key in sdata.shapes for cell boundary polygons. Used to get transformations.
key_added (str, default="nucleus_boundaries") – The key under which the segmentation masks will be stored in the shapes attribute of the SpatialData object.
inplace (bool, default=True) – Whether to modify the SpatialData object in place.
**kwargs – Additional keyword arguments to be passed to the Cellpose algorithm.