Improving spatial transcriptomics data processing and annotation under suboptimal image-based cell segmentation

Igor Davidović*, Mateja Ilić, Jelena Kusic-Tisma, Nevena Vezmar, Aleksandra Vitkovac, Marija Lazić, Mila Ljujić and Aleksandra Divac Rankov

Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Serbia

igor.davidovic [at] imgge.bg.ac.rs

Abstract

Spatial transcriptomics provides information not only about the cell from which an RNA molecule originates but also about the spatial position of that molecule. In this study, we analyzed spatial transcriptomics data from 3-day-old zebrafish larvae, generated using the Stereo-seq platform (BGI STOmics; 1 × 1 cm chip) and imaged with a Leica Thunder fluorescent imaging system. One of the main challenges in spatial analysis is cell segmentation, which is image-based and performed automatically by the Stereo-seq Analysis Workflow (SAW) and StereoMap. Consequently, low-quality images and tissue-dependent variability in staining affinity can lead to inaccurate cell boundary definitions.

This limitation can affect clustering, annotation and downstream analyses. To address these issues, spatial smoothing was performed using a square grid-based approach. The optimal observation (square) size was determined by balancing the number of counts per bin against the empirically determined average cell size. The Stereo-seq platform provides a dense grid of spatial bins with 500 nm spacing between the centres of adjacent bins. Bin 20 (a square of 20 × 20 of these bins) was selected as the observation unit, corresponding to a 10 × 10 µm observation area. This area does not match the average cell size exactly; it slightly exceeds it. The resulting h5ad file was used to benchmark multiple bioinformatic tools, including StereoPy, Scanpy, Tangram, Squidpy, STAGATE, Seurat and GraphST.
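The bin arithmetic and the grid aggregation described above can be illustrated with a minimal, dependency-free sketch. It is not the SAW or StereoPy implementation: the input layout (tuples of x index, y index, gene, count), the function names and the toy gene names are assumptions made for this illustration, and the platform's native 500 nm bins are called "spots" in the code to distinguish them from the aggregated bins.

```python
from collections import defaultdict

SPOT_PITCH_UM = 0.5  # centre-to-centre spacing stated in the text (500 nm)


def bin_side_um(bin_size, pitch_um=SPOT_PITCH_UM):
    """Side length, in micrometres, of a square bin spanning bin_size spots."""
    return bin_size * pitch_um


def aggregate_to_bins(spots, bin_size=20):
    """Sum per-gene counts of all spots that fall into the same square bin.

    spots: iterable of (x, y, gene, count), with x and y as integer spot
    indices (a hypothetical input layout chosen for this illustration).
    Returns {(bin_x, bin_y): {gene: summed_count}}.
    """
    bins = defaultdict(lambda: defaultdict(int))
    for x, y, gene, count in spots:
        key = (x // bin_size, y // bin_size)  # integer division assigns the bin
        bins[key][gene] += count
    return {key: dict(genes) for key, genes in bins.items()}


# Toy data: five spots spread over three bin-20 observations.
toy_spots = [
    (0, 0, "geneA", 1),
    (19, 19, "geneA", 2),
    (20, 0, "geneA", 3),
    (39, 19, "geneB", 4),
    (5, 25, "geneB", 5),
]
binned = aggregate_to_bins(toy_spots, bin_size=20)
print(bin_side_um(20))  # side of one bin-20 observation, in micrometres
print(binned[(0, 0)])   # summed counts for the first observation
```

Increasing `bin_size` raises the counts per observation at the cost of spatial resolution, which is the trade-off the text describes when weighing counts per bin against average cell size.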

Benchmarking indicated that Scanpy and StereoPy were the most suitable tools for preprocessing and postprocessing under low-quality segmentation. However, square-grid-based smoothing produces observations that only approximate real cell boundaries. As a result, some observations within each cluster do not accurately reflect an individual cell, and tools that rely on cell-by-cell annotation are unreliable. Cluster-level annotation was therefore the better approach. Marker genes were determined for each cluster by comparing that cluster against all remaining clusters. The resulting marker genes were used for manual or semi-automatic annotation based on anatomical enrichment analysis using BgeeDB and topGO.
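The one-versus-rest comparison behind cluster-level marker detection can be sketched as follows. The abstract does not state which statistic was used, so this example substitutes a plain log2 fold change of mean expression with a pseudocount. The function name, data layout, cluster labels and gene names are hypothetical and serve only to show the comparison structure.

```python
import math
from statistics import mean


def one_vs_rest_log2fc(expr, labels, pseudocount=1.0):
    """Rank genes per cluster by log2 fold change against all other clusters.

    expr: {gene: [value per observation]}
    labels: cluster label per observation (at least two distinct clusters).
    Returns {cluster: [(gene, log2fc), ...]} sorted by decreasing log2fc.
    """
    result = {}
    for cluster in sorted(set(labels)):
        scores = []
        for gene, values in expr.items():
            # Split observations into the cluster of interest and the rest.
            inside = [v for v, lab in zip(values, labels) if lab == cluster]
            outside = [v for v, lab in zip(values, labels) if lab != cluster]
            ratio = (mean(inside) + pseudocount) / (mean(outside) + pseudocount)
            scores.append((gene, math.log2(ratio)))
        result[cluster] = sorted(scores, key=lambda item: item[1], reverse=True)
    return result


# Toy data: four observations in two clusters, two genes.
toy_labels = ["c0", "c0", "c1", "c1"]
toy_expr = {
    "geneA": [7, 7, 1, 1],
    "geneB": [0, 0, 3, 3],
}
markers = one_vs_rest_log2fc(toy_expr, toy_labels)
print(markers["c0"][0])  # top-ranked candidate marker for cluster c0
```

Because each cluster is scored as a whole, a few poorly bounded observations within it have limited influence on the ranking, which is consistent with the preference for cluster-level over cell-by-cell annotation. In practice a statistical test with multiple-testing correction would replace the bare fold change; Scanpy's `sc.tl.rank_genes_groups`, for example, compares each group against the rest by default.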

Collectively, these strategies improved spatial-data processing and annotation under suboptimal imaging conditions, yielding a high-quality dataset suitable for downstream analysis.

Keywords: spatial transcriptomics, Stereo-seq, zebrafish

Acknowledgement: This study is part of the Enhancing Non-Communicable Disease Research Excellence Through Zebrafish Capacity Building (ZeNCure) project, supported by the European Union under the Horizon Europe programme Widening Participation and Spreading Excellence, HORIZON-WIDERA-2023-ACCESS-02, Grant Agreement number 101160259.