Utility functions

The utility functions provide helpers for supervised QC, including label transfer and computation of reference marker genes.

segtraq.run_label_transfer(sdata, adata_ref: AnnData, ref_cell_type: str, tables_key: str = 'table', tables_cell_id_key: str = 'cell_id', points_key: str = 'transcripts', points_cell_id_key: str = 'cell_id', points_gene_key: str = 'feature_name', tx_min: float = 10.0, tx_max: float = 2000.0, gn_min: float = 5.0, gn_max: float = inf, cell_type_key: str = 'transferred_cell_type', ref_ensemble_key: str | None = None, query_ensemble_key: str | None = 'gene_ids', inplace: bool = True) → DataFrame | None

Transfer cell labels from a reference AnnData to sdata.tables[tables_key] by Pearson correlation to reference mean profiles.

Parameters:
  • sdata (SpatialData-like) – Container with .tables[tables_key] as AnnData, and points needed for QC if absent. The values in sdata.tables[tables_key].X should ideally be normalized and log1p-transformed; otherwise the transformation is applied before label transfer.

  • adata_ref (AnnData) – Reference dataset, ideally normalized and log1p-transformed; otherwise the transformation is applied before label transfer.

  • ref_cell_type (str) – Column in adata_ref.obs with reference cell types.

  • tables_key (str) – Key of the AnnData table in sdata.tables.

  • tables_cell_id_key (str, default="cell_id") – Column in the cell table uniquely identifying each cell.

  • points_key (str, optional) – The key to access the transcript data within sdata.points (default is “transcripts”).

  • points_cell_id_key (str, optional) – The column name in the transcript data representing cell identifiers (default is “cell_id”).

  • points_gene_key (str, optional) – The column name in the transcript data representing gene names (default is “feature_name”).

  • tx_min (float) – Minimum number of transcripts per cell for pre-filtering.

  • tx_max (float) – Maximum number of transcripts per cell for pre-filtering.

  • gn_min (float) – Minimum number of genes per cell for pre-filtering.

  • gn_max (float) – Maximum number of genes per cell for pre-filtering.

  • cell_type_key (str) – Column name to store transferred labels in .obs when inplace=True.

  • ref_ensemble_key (str or None, default=None) – Column name in adata_ref.var that contains unique gene/Ensembl IDs. If None, adata_ref.var_names will be used.

  • query_ensemble_key (str or None, default="gene_ids") – Column name in sdata.tables[tables_key].var that contains unique gene/Ensembl IDs. If None, sdata.tables[tables_key].var_names will be used.

  • q_gene_key (str)

  • inplace (bool) – If True, writes labels into sdata.tables[tables_key].obs and returns None. If False, returns a DataFrame with the columns ['cell_id', 'transferred_cell_type', 'pearson_score'].

Returns:

None when inplace=True; otherwise a DataFrame of assignments.

Return type:

None or pd.DataFrame
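
The core idea, assigning each cell the reference type whose mean expression profile it correlates with best, can be sketched in plain Python. This is a minimal illustration, not the segtraq implementation: the helper names pearson and transfer_labels and the toy values are assumptions made for this example, and the real function additionally handles gene matching, normalization, and pre-filtering.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def transfer_labels(query, ref_profiles):
    """Map each query cell to (best reference type, its Pearson score)."""
    assignments = {}
    for cell_id, expr in query.items():
        scores = {ct: pearson(expr, profile) for ct, profile in ref_profiles.items()}
        best = max(scores, key=scores.get)
        assignments[cell_id] = (best, scores[best])
    return assignments


# Hypothetical log-normalized values over three shared genes.
ref_profiles = {"T cell": [2.0, 0.1, 0.0], "B cell": [0.1, 2.2, 0.3]}
query = {"c1": [1.5, 0.0, 0.1], "c2": [0.0, 1.8, 0.4]}
assignments = transfer_labels(query, ref_profiles)
```

Here c1 is assigned "T cell" and c2 is assigned "B cell"; the stored score plays the role of the pearson_score column returned when inplace=False.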

segtraq.markers_from_reference(adata: AnnData, cell_type_key: str, mode: str = 'de', max_fpr: float | None = None, auc_pos_thresh: float = 0.9, method: str = 'wilcoxon', pval_adj_thresh: float = 0.05, logfc_pos_thresh: float = 1.0, vote_fraction_pos: float = 0.5, min_pos_frac: float = 0.1, max_neg_frac: float = 0.05, t_pos: float = 0.25, t_neg: float = 1.0, min_cells_per_celltype: int = 10, n_jobs: int = 1) → dict[str, dict[str, list[str]]]

Compute positive and negative markers per cell type using pairwise contrasts (AUC/pAUC or DE) followed by voting and a rarity-based definition of negative markers.

Positive markers: For each cell type c, a gene g is considered a positive marker if it is “up in c” in at least ceil(vote_fraction_pos * M_c) of its valid pairwise comparisons (M_c). Additionally, g must be expressed (> 0) in at least min_pos_frac fraction of cells of type c in the reference dataset.
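
The voting rule for positive markers can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name positive_markers and the input layout (up_calls[c][d] as the set of genes called up in c versus d, expr_frac[c][g] as the fraction of cells of type c expressing gene g) are hypothetical.

```python
import math
from collections import Counter


def positive_markers(up_calls, expr_frac, vote_fraction_pos=0.5, min_pos_frac=0.1):
    """Keep genes that win enough pairwise contrasts and are expressed widely enough."""
    markers = {}
    for c, contrasts in up_calls.items():
        m_c = len(contrasts)  # number of valid pairwise comparisons for c
        needed = math.ceil(vote_fraction_pos * m_c)
        # One vote per contrast in which the gene was called "up in c".
        votes = Counter(g for genes in contrasts.values() for g in genes)
        markers[c] = sorted(
            g
            for g, v in votes.items()
            if v >= needed and expr_frac[c].get(g, 0.0) >= min_pos_frac
        )
    return markers


# Hypothetical pairwise calls for three cell types.
up_calls = {
    "A": {"B": {"g1", "g5"}, "C": {"g1"}},
    "B": {"A": {"g3"}, "C": {"g3", "g5"}},
    "C": {"A": {"g4", "g6"}, "B": {"g4"}},
}
expr_frac = {
    "A": {"g1": 0.8, "g5": 0.6},
    "B": {"g3": 0.7, "g5": 0.3},
    "C": {"g4": 0.9, "g6": 0.02},
}
pos = positive_markers(up_calls, expr_frac)
```

With M_c = 2 and vote_fraction_pos = 0.5, one winning contrast is enough, so g5 qualifies for both A and B, while g6 is rejected for C because it is expressed in only 2% of C cells.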

Negative markers: For each ordered pair (a, b) of cell types, take genes up in a vs b and consider them negative-marker candidates for b if (1.) they are expressed (> 0) in at most max_neg_frac fraction of cells of type b, and (2.) are not up in b vs any cell type (computed across all ordered contrasts).
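
A minimal sketch of the negative-marker rule, using a hypothetical input layout (up_calls[a][b] as the set of genes called up in a versus b, expr_frac[b][g] as the fraction of cells of type b expressing gene g); the function name negative_markers is an assumption for this example.

```python
def negative_markers(up_calls, expr_frac, max_neg_frac=0.05):
    """Genes up in some other type, rare in b, and never called up in b."""
    # Condition (2): collect every gene that is up in b in at least one contrast.
    up_anywhere = {
        b: set().union(*contrasts.values()) for b, contrasts in up_calls.items()
    }
    negatives = {}
    for a, contrasts in up_calls.items():
        negatives.setdefault(a, set())
        for b, genes in contrasts.items():
            for g in genes:
                # Condition (1): expressed in at most max_neg_frac of cells of type b.
                rare_in_b = expr_frac[b].get(g, 0.0) <= max_neg_frac
                if rare_in_b and g not in up_anywhere.get(b, set()):
                    negatives.setdefault(b, set()).add(g)
    return {b: sorted(genes) for b, genes in negatives.items()}


# Hypothetical calls for two cell types.
up_calls = {"A": {"B": {"g1", "g5"}}, "B": {"A": {"g3"}}}
expr_frac = {
    "A": {"g1": 0.8, "g5": 0.6, "g3": 0.01},
    "B": {"g1": 0.02, "g5": 0.3, "g3": 0.7},
}
neg = negative_markers(up_calls, expr_frac)
```

In this toy input, g1 becomes a negative marker of B (up in A, expressed in 2% of B cells), whereas g5 does not, because 30% of B cells express it.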

Overlap filtering: Overlap filtering is applied separately to positive and negative markers:

  • Positive lists: genes appearing in ≥ t_pos * n_types lists are dropped.

  • Negative lists: genes appearing in ≥ t_neg * n_types lists are dropped.
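
The overlap filter can be illustrated with a short sketch. The function name overlap_filter and the toy marker lists are assumptions; only the rule "drop genes appearing in at least t * n_types lists" is taken from the description above.

```python
from collections import Counter


def overlap_filter(markers, t):
    """Drop genes that appear in >= t * n_types of the per-type marker lists."""
    n_types = len(markers)
    # Count in how many cell-type lists each gene occurs (once per list).
    counts = Counter(g for genes in markers.values() for g in set(genes))
    return {
        c: [g for g in genes if counts[g] < t * n_types]
        for c, genes in markers.items()
    }


# Hypothetical marker lists for three cell types.
markers = {"A": ["g1", "g5"], "B": ["g3", "g5"], "C": ["g4"]}
filtered = overlap_filter(markers, t=0.5)
unchanged = overlap_filter(markers, t=1.0)
```

With t = 0.5 and three cell types the cutoff is 1.5 lists, so g5 (present in two lists) is removed. With t = 1.0, the default for t_neg, a gene is dropped only if it appears in every list, so nothing changes here.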

Parameters:
  • adata (AnnData) – Reference single-cell dataset (cells x genes).

  • cell_type_key (str) – Column in adata.obs containing cell type labels.

  • mode ({"auc", "de"}, optional (default: "de")) –

    • “auc”: compute markers using pairwise AUC/pAUC.

    • “de”: compute markers using pairwise DE.

  • max_fpr (float or None, optional (default: None)) – (AUC mode only) If None, compute full AUC. If in (0, 1], compute standardized pAUC over [0, max_fpr] using sklearn’s roc_auc_score(max_fpr=max_fpr).

  • auc_pos_thresh (float, optional (default: 0.9)) – (AUC mode only) Minimum AUC/pAUC for a gene to be considered “up in c_i vs c_j”.

  • method (str, optional (default: "wilcoxon")) – (DE mode only) DE method passed to sc.tl.rank_genes_groups (“wilcoxon”, “t-test”, “logreg”, …).

  • pval_adj_thresh (float, optional (default: 0.05)) – (DE mode only) FDR (adjusted p-value) cutoff for positive markers.

  • logfc_pos_thresh (float, optional (default: 1.0)) – (DE mode only) Minimum log fold-change for positive markers (c > d).

  • vote_fraction_pos (float, optional (default: 0.5)) – Fraction of valid pairwise contrasts in which a gene must be “up in c” (AUC mode) / significantly up in c (DE mode) to be called a positive marker of c.

  • min_pos_frac (float, optional (default: 0.1)) – Minimum fraction of cells of type c in which a gene must be expressed (counts > 0) in the reference dataset to be considered a positive marker of c.

  • max_neg_frac (float, optional (default: 0.05)) – Maximum fraction of cells of type c in which a gene may be expressed (counts > 0) in the reference dataset to be considered a negative marker of c.

  • t_pos (float, optional (default: 0.25)) – Overlap filter threshold for positive markers.

  • t_neg (float, optional (default: 1.0)) – Overlap filter threshold for negative markers.

  • min_cells_per_celltype (int, optional (default: 10)) – Minimum number of cells required per cell type to be included in pairwise computations.

  • n_jobs (int, optional (default: 1)) – Number of parallel jobs for running pairwise computations.

Returns:

A dictionary mapping each cell type to its positive and negative markers: {cell_type: {“positive”: [genes], “negative”: [genes]}}

Return type:

dict
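
For AUC mode, the "up in c_i vs c_j" call can be sketched with a full AUC computed from pairwise comparisons, which is equivalent to the normalized Mann-Whitney U statistic. This is a simplified stand-in: it does not cover the standardized pAUC that the max_fpr parameter requests from sklearn's roc_auc_score, and the names auc and up_genes as well as the toy values are assumptions for illustration.

```python
def auc(pos, neg):
    """Probability that a value from pos exceeds a value from neg (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def up_genes(expr_c, expr_d, auc_pos_thresh=0.9):
    """Genes whose expression separates cells of type c from cells of type d."""
    return {g for g in expr_c if auc(expr_c[g], expr_d[g]) >= auc_pos_thresh}


# Hypothetical per-gene expression values for cells of two types.
expr_c = {"geneX": [3, 4, 5], "geneY": [1, 2, 0]}
expr_d = {"geneX": [0, 1, 1], "geneY": [1, 0, 2]}
called_up = up_genes(expr_c, expr_d)
```

Here geneX separates the two groups perfectly (AUC of 1.0) and is called up, while geneY has an AUC of 0.5 and is not.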

segtraq.validate_spatialdata(sdata: SpatialData, images_key: str | None = 'morphology_focus', tables_key: str = 'table', tables_cell_id_key: str = 'cell_id', tables_area_key: str | None = 'cell_area', tables_centroid_x_key: str | None = 'x_centroid', tables_centroid_y_key: str | None = 'y_centroid', points_key: str = 'transcripts', points_cell_id_key: str = 'cell_id', points_background_id: str = 'UNASSIGNED', points_x_key: str = 'x', points_y_key: str = 'y', points_z_key: str | None = 'z', points_gene_key: str = 'feature_name', shapes_key: str | list[str] = 'cell_boundaries', shapes_cell_id_key: str = 'cell_id', nucleus_shapes_key: str | None = 'nucleus_boundaries', nucleus_shapes_cell_id_key: str = 'cell_id') → bool

Validates the integrity of a SpatialData object by checking the consistency of cell IDs across points, shapes, and tables.

This function ensures that:

  • All points have corresponding shapes and tables.

  • Cell IDs in points match those in shapes and tables.

  • If shapes are present, they contain all cell IDs from the points.

  • If tables are present, they contain all cell IDs from the shapes.

Parameters:
  • sdata (sd.SpatialData) – The SpatialData object to validate.

  • tables_key (str, optional) – Key for accessing tables in the SpatialData. Default is “table”.

  • tables_cell_id_key (str, optional) – Column name in the tables DataFrame (AnnData.obs) that contains cell IDs. Default is “cell_id”.

  • tables_area_key (str or None, optional, default="cell_area") – Column in the cell table with cell area (2D). If None, area/volume-based metrics will be computed via segtraq.bl.morphological_features.

  • tables_centroid_x_key (str or None, optional, default="x_centroid") – Column in the cell table with the x-coordinate of the cell centroid.

  • tables_centroid_y_key (str or None, optional, default="y_centroid") – Column in the cell table with the y-coordinate of the cell centroid.

  • points_key (str, optional) – Key for accessing points (e.g., transcripts) in the SpatialData. Default is “transcripts”.

  • points_cell_id_key (str, optional) – Column name in the points DataFrame indicating cell assignments. Default is “cell_id”.

  • points_background_id (str, optional) – Identifier used for unassigned or background transcripts in the points DataFrame. Default is “UNASSIGNED”.

  • shapes_key (str or list of str, optional) – Key(s) for accessing shapes (e.g., cell boundaries) in the SpatialData. Default is “cell_boundaries”. Can be a list if multiple shape layers are present.

  • shapes_cell_id_key (str, optional, default="cell_id") – Cell ID key for sdata.shapes[shapes_key]. Must match either the shapes index name or a column name (which will be set as the index if needed).

  • nucleus_shapes_key (str or None, optional, default="nucleus_boundaries") – Key in sdata.shapes for nucleus boundary polygons, if available. If None, a nucleus mask can be obtained via segtraq.run_cellpose.

  • nucleus_shapes_cell_id_key (str, optional, default="cell_id") – Cell ID key for sdata.shapes[nucleus_shapes_key]. Must match either the shapes index name or a column name (which will be set as the index if needed).

Raises:
  • TypeError – If the input is not an instance of sd.SpatialData.

  • ValueError – If the SpatialData object does not contain points or if there are inconsistencies in cell IDs.

Returns:

True if the SpatialData object passes all validation checks. Otherwise, an error or warning is raised.

Return type:

bool
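
The cell-ID consistency checks can be pictured as set comparisons. The sketch below is a simplified illustration on plain Python collections, not the actual validation code: the function name check_cell_ids and its arguments are hypothetical, and the real function operates on the points, shapes, and tables elements of a SpatialData object.

```python
def check_cell_ids(point_ids, shape_ids, table_ids, background_id="UNASSIGNED"):
    """Return True if cell IDs are consistent, otherwise raise ValueError."""
    # Background transcripts are not expected to have a shape or table row.
    assigned = set(point_ids) - {background_id}
    shapes, table = set(shape_ids), set(table_ids)

    problems = []
    missing_shapes = assigned - shapes
    if missing_shapes:
        problems.append(f"cell IDs in points but not in shapes: {sorted(missing_shapes)}")
    missing_rows = shapes - table
    if missing_rows:
        problems.append(f"cell IDs in shapes but not in table: {sorted(missing_rows)}")

    if problems:
        raise ValueError("; ".join(problems))
    return True


ok = check_cell_ids(
    point_ids=["a", "b", "UNASSIGNED"],
    shape_ids=["a", "b"],
    table_ids=["a", "b"],
)
```

Transcripts labeled with the background identifier are excluded before comparison, mirroring the role of points_background_id.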

segtraq.add_nuc_shapes_via_cellpose(sdata: SpatialData, channel: str = 'DAPI', images_key: str = 'morphology_focus', images_data_key: str = 'scale0/image', shapes_key: str = 'cell_boundaries', key_added: str = 'nucleus_boundaries', inplace: bool = True, **kwargs)

Runs the Cellpose segmentation algorithm on the image data of a SpatialData object. The function extracts the image data, applies Cellpose, and adds the resulting segmentation masks to the SpatialData object, where they are stored as polygons in the shapes attribute.

Parameters:
  • sdata (SpatialData) – A SpatialData object containing segmented and transcript-assigned spatial transcriptomics data (images, tables, points, shapes and optional labels).

  • channel (str, default="DAPI") – The channel(s) to be used for segmentation.

  • images_key (str, default="morphology_focus") – Key in sdata.images for a nuclear or morphology image (e.g., DAPI). Used for segmentation.

  • images_data_key (str, default="scale0/image") – Key for accessing data in sdata.images if they are stored as a DataTree. Consider using a higher scale (lower resolution) for segmentation to speed up computation and reduce memory usage during Cellpose.

  • shapes_key (str, default="cell_boundaries") – Key in sdata.shapes for cell boundary polygons. Used to get transformations.

  • key_added (str, default="nucleus_boundaries") – The key under which the segmentation masks will be stored in the shapes attribute of the spatialdata object. Defaults to “nucleus_boundaries”.

  • inplace (bool, default=True) – Whether to modify the SpatialData object in place.

  • **kwargs – Additional keyword arguments to be passed to the Cellpose algorithm.