Upgraded 'Mind Reading' in the Cellular World! MultiKano: A Data Augmentation Method Based on Cutting-Edge Mathematics and AI to Decode Single-Cell Multi-Omics Identity Codes

Sharing an article co-authored by Associate Professor Chen Shengquan from Nankai University and Associate Professor Yang Qingzhu from the Capital University of Physical Education and Sports, published in the journal Protein & Cell.

The title is “Development of MultiKano: An Automated Cell Type Annotation Tool Based on Kolmogorov-Arnold Networks and Data Augmentation for Cell Type Identification in Single-Cell Multi-Omics Data”.

Research Background

The breakthrough in single-cell omics sequencing technology provides an unprecedented level of detail, allowing biologists to explore gene activity patterns and cellular functional dynamics at the resolution of individual cells.

However, single-omics sequencing technologies, due to their focus on one type of omics data, may fail to capture the complex relationships between biomolecules within individual cells.

Accurate cell type annotation is key to the effective use of single-cell multi-omics data and a prerequisite for downstream analyses, which makes it a core step in any single-cell multi-omics workflow.

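However, commonly used clustering-based annotation methods have significant shortcomings: cells are first clustered without labels and each cluster is then assigned a type by hand, typically from marker genes, which is labor-intensive and can be subjective.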

A more efficient and accurate alternative is automated cell type annotation, which utilizes well-labeled datasets to train models and then applies the trained models to annotate newly generated datasets.

Many computational methods for automated annotation have been proposed, but most are designed specifically for scRNA-seq data.

Such single-omics methods cannot fully leverage the information in multi-omics profiles, which limits their ability to capture cellular complexity and diversity.

To close this gap, the researchers propose MultiKano, which they present as the first automated cell type annotation method designed specifically for single-cell multi-omics data. It integrates single-cell transcriptome and chromatin accessibility data.

MultiKano introduces a novel data augmentation strategy based on paired scRNA-seq and scATAC-seq profiles and combines it with Kolmogorov-Arnold Networks (KAN) to improve the model’s generalization ability.

Research Content

1. MultiKano Framework Design

MultiKano consists of three main modules: a data preprocessing module, a data augmentation module, and a KAN module. Data augmentation and the KAN network together strengthen its ability to annotate multi-omics data. The specifics are as follows:

Data Preprocessing

MultiKano first preprocesses the scRNA-seq and scATAC-seq profiles of the given single-cell multi-omics dataset separately.

For example, scRNA-seq data undergoes normalization of gene expression levels, while scATAC-seq data is transformed into computable feature matrices through peak counts or gene activity scores.
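As a rough illustration of what normalizing gene expression levels can look like, here is a minimal sketch that scales each cell to a fixed library size and then applies log1p. The article does not say which normalization MultiKano uses, so the function name `log_normalize`, the `target_sum` of 10,000, and the toy counts are all assumptions made for this example.

```python
import math

def log_normalize(counts, target_sum=10_000.0):
    """Scale one cell's raw counts to a fixed library size, then apply log1p.

    Assumed scheme for illustration only; not confirmed as MultiKano's own.
    """
    total = sum(counts)
    if total == 0:
        # A cell with no counts stays all-zero instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]

# Toy example: three cells, four genes (raw counts, made up).
rna_counts = [
    [0, 3, 1, 6],
    [2, 0, 0, 2],
    [0, 0, 0, 0],
]
rna_normalized = [log_normalize(cell) for cell in rna_counts]
```

After this step every non-empty cell has the same total on the pre-log scale, so differences in sequencing depth no longer dominate the comparison between cells.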

Data Augmentation

To capture cellular heterogeneity in extremely noisy single-cell multi-omics data, the researchers further designed a data augmentation module.

This module rests on the principle that two cells of the same type have similar biological characteristics. The scRNA-seq profile of one cell can therefore be cross-matched with the scATAC-seq profile of another cell of the same type to generate a synthetic cell, and these synthetic cells serve as input for the KAN model.
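The cross-matching idea can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `augment_by_cross_pairing`, the uniform sampling over cell types, the number of synthetic cells, and the toy data are all hypothetical, since the article does not describe how pairs are sampled.

```python
import random
from collections import defaultdict

def augment_by_cross_pairing(rna, atac, labels, n_synthetic, seed=0):
    """Pair the RNA profile of one cell with the ATAC profile of a
    different cell that carries the same cell-type label."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)
    # Only types with at least two cells can be cross-paired.
    eligible = [t for t, idxs in by_type.items() if len(idxs) >= 2]
    synthetic = []
    if not eligible:
        return synthetic
    for _ in range(n_synthetic):
        cell_type = rng.choice(eligible)          # assumption: uniform over types
        i, j = rng.sample(by_type[cell_type], 2)  # two distinct cells
        synthetic.append((rna[i], atac[j], cell_type))
    return synthetic

# Toy paired dataset: four cells, two types (values made up).
rna = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
atac = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = ["T", "T", "B", "B"]
synthetic_cells = augment_by_cross_pairing(rna, atac, labels, n_synthetic=6)
```

Each synthetic cell keeps a real RNA profile and a real ATAC profile, so the augmentation adds new combinations without inventing new measurements.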

KAN Module

The KAN model is based on the Kolmogorov-Arnold representation theorem, replacing each weight parameter with a learnable one-dimensional function parameterized as a spline.

With this configuration, each KAN node simply sums its incoming signals without applying any further non-linearity.

Because the learnable activation functions sit on the edges of the network, KAN can efficiently learn complex nonlinear mappings and reduce the risk of overfitting.
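To make the "functions on edges, sums at nodes" idea concrete, here is a minimal forward-pass sketch of one KAN-style layer. It is not MultiKano's implementation: the class names `PiecewiseLinearEdge` and `KANLayer` are hypothetical, the edge functions use simple piecewise-linear interpolation on a fixed grid as a stand-in for the splines used in practice, the values are randomly initialized, and no training is shown.

```python
import random

class PiecewiseLinearEdge:
    """Learnable 1-D function on one edge: linear interpolation between
    trainable values placed on a fixed grid (an order-1 spline)."""

    def __init__(self, grid, values):
        self.grid = grid      # sorted knot positions
        self.values = values  # trainable value at each knot

    def __call__(self, x):
        g, v = self.grid, self.values
        x = min(max(x, g[0]), g[-1])  # clamp to the grid range
        for k in range(len(g) - 1):
            if x <= g[k + 1]:
                t = (x - g[k]) / (g[k + 1] - g[k])
                return (1 - t) * v[k] + t * v[k + 1]
        return v[-1]

class KANLayer:
    """One layer: an independent edge function for every (output, input) pair."""

    def __init__(self, n_in, n_out, grid, seed=0):
        rng = random.Random(seed)
        self.edges = [
            [PiecewiseLinearEdge(grid, [rng.uniform(-1, 1) for _ in grid])
             for _ in range(n_in)]
            for _ in range(n_out)
        ]

    def forward(self, x):
        # Each output node only sums its edge outputs; no extra activation.
        return [sum(edge(xi) for edge, xi in zip(row, x)) for row in self.edges]

grid = [-1.0, 0.0, 1.0]
layer = KANLayer(n_in=3, n_out=2, grid=grid)
layer_out = layer.forward([0.2, -0.5, 0.9])
```

In an MLP the same layer would multiply each input by a scalar weight and then apply a fixed activation at the node. Here the per-edge function carries both roles, which is the structural difference the text describes.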

Compared to traditional MLPs, KAN generalizes better on single-cell multi-omics data by reducing parameter redundancy and enhancing feature representation, which makes it well suited to single-cell data analysis. According to the authors, KAN models had not previously been used for annotating single-cell data.

Figure 1. Overview of MultiKano.

(A) Data preprocessing module: Given a paired single-cell multi-omics dataset, MultiKano first preprocesses the scRNA-seq and scATAC-seq profiles separately.

(B) Data augmentation module: Using the preprocessed profiles, MultiKano generates synthetic cells by pairing the scRNA-seq profile of one cell with the scATAC-seq profile of another cell of the same type.

(C) KAN module: Finally, MultiKano concatenates the two omics profiles of each cell and uses the KAN model to predict cell types.

2. Experimental Validation

To evaluate whether MultiKano, which integrates single-cell transcriptome and chromatin accessibility data, outperforms existing automated cell type annotation methods for single-omics data, the researchers conducted five-fold cross-validation on six paired single-cell multi-omics datasets covering different species, tissues, and experimental protocols.
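For readers unfamiliar with five-fold cross-validation, the splitting step can be sketched as follows. The helper name `k_fold_indices`, the plain shuffled (non-stratified) split, and the cell count of 23 are assumptions for illustration; the article does not specify how the folds were drawn.

```python
import random

def k_fold_indices(n_cells, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every cell is tested exactly once."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled cells into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_cells=23, k=5))
```

Each of the five rounds trains on four folds and evaluates on the remaining one, so the reported metrics are computed on cells the model did not see during training.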

The researchers assessed annotation performance with metrics such as accuracy, Cohen’s kappa (Kappa), and the macro-averaged F1 score (F1-macro).
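The three metrics can be computed from true and predicted labels as in the sketch below. The formulas are the standard definitions; the function names and the toy labels are made up for this example and are not results from the study.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: only one label on both sides
    return (p_observed - p_expected) / (1 - p_expected)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare types count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (made up).
y_true = ["T", "T", "B", "B", "NK", "NK"]
y_pred = ["T", "T", "B", "T", "NK", "B"]
acc = accuracy(y_true, y_pred)       # 4 of 6 correct
kappa = cohen_kappa(y_true, y_pred)  # 0.5 for these labels
f1 = macro_f1(y_true, y_pred)
```

Accuracy and Kappa are dominated by abundant cell types, while F1-macro weights every type equally, which is why the three numbers can tell different stories on imbalanced datasets.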

On these six datasets, MultiKano was compared with cell type annotation methods that use only scRNA-seq profiles and with random guessing (RG).

MultiKano showed the best overall performance across all six datasets, with particularly pronounced advantages in accuracy and Kappa.

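Additionally, the researchers performed one-sided paired Wilcoxon signed-rank tests comparing MultiKano with scPred. The reported P-values are 2.980×10⁻⁸ for accuracy, 2.980×10⁻⁸ for Kappa, and 0.237 for F1-macro, so the advantage is statistically significant for accuracy and Kappa, while the F1-macro value as given here does not reach conventional significance thresholds.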
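A one-sided paired Wilcoxon signed-rank test can be sketched in pure Python for the simple case with no zero and no tied differences. This is an illustrative exact version, not the authors' code: the function name `wilcoxon_greater_exact` and the score lists are hypothetical, and real analyses would normally use a statistics library that also handles ties.

```python
def wilcoxon_greater_exact(a, b):
    """Exact one-sided p-value for H1: paired scores in a exceed those in b.

    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    abs_sorted = sorted(abs(d) for d in diffs)
    if 0 in abs_sorted or len(set(abs_sorted)) != len(abs_sorted):
        raise ValueError("this sketch assumes no zero or tied differences")
    rank = {v: r for r, v in enumerate(abs_sorted, start=1)}
    # Test statistic: sum of the ranks of the positive differences.
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)
    n = len(diffs)
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of ranks {1..n} whose sum is s.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    # Under H0 each rank is positive with probability 1/2.
    return sum(counts[w_plus:]) / 2 ** n

# Made-up per-fold accuracies for two methods.
method_a = [0.90, 0.85, 0.88, 0.92, 0.81]
method_b = [0.80, 0.79, 0.86, 0.81, 0.78]
p_value = wilcoxon_greater_exact(method_a, method_b)  # 1/32: all five pairs favor a
```

When every one of n pairs favors the same method, this exact test returns 2⁻ⁿ, its smallest possible value. The reported 2.980×10⁻⁸ coincides numerically with 2⁻²⁵, although the number of pairs used in the study is not stated in this article.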

The researchers then extended the comparison to methods that use only scATAC-seq profiles, where MultiKano again performed best across all six datasets.

Because it additionally draws on scRNA-seq data, which is relatively high in quality and low in noise, MultiKano achieved excellent annotation performance in this comparison.

In summary, MultiKano successfully integrated single-cell transcriptome and chromatin accessibility data for cell type annotation and outperformed methods designed for scRNA-seq or scATAC-seq data alone.

To further validate the advantages of MultiKano, the researchers expanded the benchmark to include machine learning methods that take multi-omics profiles as input.

The raw single-cell multi-omics data were processed in the same way as for MultiKano, and the scRNA-seq and scATAC-seq profiles of each cell were concatenated as input for these methods.

Even with multi-omics input, the traditional machine learning methods still fell short of MultiKano in predictive performance.

3. Ablation Study Validation

The researchers conducted three ablation studies on the six single-cell multi-omics datasets:

(i) Peak counts versus gene activity scores as the scATAC-seq input, to demonstrate the effectiveness of using peak counts.

(ii) MultiKano with versus without data augmentation, to demonstrate the effectiveness of the data augmentation module.

(iii) KAN versus MLP as the classifier, to demonstrate the effectiveness of the KAN module.

All results indicated that each module is an effective component that contributes to MultiKano’s annotation accuracy.

4. Robustness and Cross-Dataset Application

Because datasets from different tissues often contain different numbers of cell types, the researchers tested whether MultiKano consistently outperforms other methods regardless of the number of cell types in the dataset.

They gradually removed cell types from the Cortex dataset, reducing it from 20 types to 2, and found that MultiKano’s performance remained markedly more stable than that of traditional methods.
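A robustness experiment of this kind needs a way to build progressively smaller versions of a dataset. The sketch below keeps only the k most frequent cell types. The removal order (rarest types first) is an assumption, as are the function name `keep_top_k_types` and the toy labels; the article only says that the number of types went from 20 down to 2.

```python
from collections import Counter

def keep_top_k_types(labels, k):
    """Return indices of cells that belong to the k most frequent cell types."""
    kept = {t for t, _ in Counter(labels).most_common(k)}
    return [i for i, t in enumerate(labels) if t in kept]

# Toy labels with four types of decreasing abundance (made up).
labels = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"] * 1
subsets = {k: keep_top_k_types(labels, k) for k in range(4, 1, -1)}
```

Each subset would then go through the same cross-validation as the full dataset, so that performance can be plotted against the number of remaining cell types.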

Additionally, on the BMMC dataset, which contains nested batch effects, MultiKano still annotated cells accurately even when the test and training sets differed in geography and donor, demonstrating its robustness to batch effects.
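A cross-batch evaluation holds out whole batches rather than random cells. Below is a minimal sketch of such a split. The function name `split_by_batch` and the batch labels such as `site1_donor1` are hypothetical placeholders, not identifiers from the BMMC dataset.

```python
def split_by_batch(batches, held_out):
    """Put every cell from the held-out batches in the test set."""
    held_out = set(held_out)
    train_idx = [i for i, b in enumerate(batches) if b not in held_out]
    test_idx = [i for i, b in enumerate(batches) if b in held_out]
    return train_idx, test_idx

# One batch label per cell (placeholder names).
batches = ["site1_donor1", "site1_donor2", "site2_donor1", "site2_donor1", "site1_donor1"]
train_idx, test_idx = split_by_batch(batches, held_out=["site2_donor1"])
```

Because no cell from the held-out batch is seen during training, good test performance indicates that the model has learned cell type signal rather than batch-specific artifacts.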

Figure 2. Performance of MultiKano and baseline methods.

(A) Box plots of annotation performance for MultiKano, baseline methods for scRNA-seq data, and RG across six single-cell multi-omics datasets.

Research Significance

The biological significance of this study lies in carrying the validation through from method to mechanism. Taking the SkinA dataset as an example, the researchers examined where MultiKano and TOSICA disagree and analyzed the “Infundibulum-Inconsistent” cells (predicted as Infundibulum by MultiKano and as other types by TOSICA).

The top 30 differential genes of these cells overlapped strongly with those of the “Infundibulum-Consistent” cells, and the overlap rate for the top 100 differential genes was 69%, suggesting that their transcriptional patterns are closer to the Infundibulum type.
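An overlap rate between two ranked gene lists is a simple set computation, sketched below. The function name `top_k_overlap` and the placeholder gene identifiers are made up for illustration; they are not genes or results from the study.

```python
def top_k_overlap(ranked_a, ranked_b, k):
    """Fraction of the top-k entries of one ranked list that also appear
    in the top-k entries of the other."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    return len(top_a & top_b) / k

# Placeholder rankings of differential genes for two groups of cells.
ranked_inconsistent = ["g1", "g2", "g3", "g4", "g5"]
ranked_consistent = ["g2", "g1", "g6", "g3", "g7"]
overlap_top3 = top_k_overlap(ranked_inconsistent, ranked_consistent, k=3)  # 2 of 3 shared
overlap_top5 = top_k_overlap(ranked_inconsistent, ranked_consistent, k=5)  # 3 of 5 shared
```

With real rankings, a value such as 0.69 at k = 100 means that 69 of the top 100 differential genes are shared between the two groups.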

GO and KEGG analyses showed enrichment in biological processes related to skin barrier function, cell cycle regulation, and epidermal development, consistent with the physiological functions of Infundibulum cells, validating the accuracy of MultiKano’s annotations.

Epigenetic validation results indicated that differentially accessible peaks (DAPs) from scATAC-seq were enriched in skin development-related pathways, further supporting the annotation results of MultiKano.

Results and Outlook

By integrating multi-omics data, an innovative data augmentation strategy, and KAN networks, MultiKano achieves automated and efficient annotation of single-cell multi-omics data for the first time, with performance that surpasses both single-omics and traditional multi-omics methods.

It not only improves annotation accuracy but also helps reveal cell type-specific gene regulatory mechanisms through multi-omics integration.

Future research on MultiKano could deepen multi-omics data integration strategies, explore the use of unpaired multi-omics data, and expand to new application scenarios such as cancer single-cell heterogeneity analysis or cell differentiation trajectory studies.

Written by: Chengzi Aoruizhi

Typeset by: Shenghe Editorial Department
