BMC Bioinformatics. 2019 Sep 18;20(1):480. doi: 10.1186/s12859-019-3050-8.

Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection.

Author information

Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht University, David de Wied building, Universiteitsweg 99, Utrecht, 3584 CG, The Netherlands.
Laboratorio de Modelado Molecular, Bioinformática y diseño de fármacos. Departamento de Posgrado. Escuela Superior de Medicina del Instituto Politécnico Nacional (IPN), Mexico City, Mexico.
Faculty of Medicine, National Autonomous University of Mexico; Federico Gomez Children's Hospital of Mexico, Mexico City, Mexico.
Life Sciences and Health, CWI, Amsterdam, Netherlands.
UMR 782 GMPA, Université Paris-Saclay, INRA, AgroParisTech, Thiverval-Grignon, France.



MicroRNAs (miRNAs) are noncoding RNA molecules heavily involved in human tumors, few of which circulate in the human body. Finding a tumor-associated miRNA signature, that is, the minimal set of miRNAs that must be measured to discriminate among different types of cancer and between tumor and normal tissue, is of utmost importance. Feature selection techniques from machine learning can help; however, they often provide naive or biased results.


An ensemble feature selection strategy for miRNA signatures is proposed. miRNAs are chosen based on consensus on feature relevance among high-accuracy classifiers of different typologies. This methodology aims to identify signatures that are considerably more robust and reliable when used in clinically relevant prediction tasks. Using the proposed method, a 100-miRNA signature is identified in a dataset of 8023 samples extracted from TCGA. When eight state-of-the-art classifiers are run on the 100-miRNA signature and on the original 1046 features, global accuracy differs by only 1.4%. Importantly, this 100-miRNA signature is sufficient to distinguish between tumor and normal tissues. The approach is then compared against other feature selection methods, such as UFS, RFE, EN, LASSO, Genetic Algorithms, and EFS-CLA. The proposed approach provides better accuracy when tested in a 10-fold cross-validation with different classifiers. It is also applied to several GEO datasets across different platforms, with some classifiers showing more than 90% classification accuracy, which demonstrates its cross-platform applicability.
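
The consensus idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the voting rule (each classifier votes for its `top_k` most relevant features), the function name `consensus_signature`, the classifier labels, and the miRNA names and scores are all assumptions made up for illustration.

```python
from collections import Counter


def consensus_signature(relevance_by_classifier, top_k, signature_size):
    """Return the features that most classifiers rank among their top_k.

    relevance_by_classifier maps a classifier label to a dict of
    {feature_name: relevance_score}. This voting rule is an assumed,
    simplified stand-in for the consensus step, not the published method.
    """
    votes = Counter()
    for scores in relevance_by_classifier.values():
        # Rank this classifier's features by relevance, highest first;
        # break ties by name so the result is deterministic.
        ranked = sorted(scores, key=lambda f: (-scores[f], f))
        votes.update(ranked[:top_k])
    # Order features by vote count (descending), then by name.
    ordered = sorted(votes, key=lambda f: (-votes[f], f))
    return ordered[:signature_size]


# Toy relevance scores (hypothetical placeholders, not TCGA data).
toy_scores = {
    "random_forest": {"miR-A": 0.9, "miR-B": 0.7, "miR-C": 0.1, "miR-D": 0.2},
    "svm": {"miR-A": 0.8, "miR-B": 0.1, "miR-C": 0.6, "miR-D": 0.3},
    "logistic": {"miR-A": 0.5, "miR-B": 0.6, "miR-C": 0.2, "miR-D": 0.1},
}

signature = consensus_signature(toy_scores, top_k=2, signature_size=2)
print(signature)
```

In this toy case all three classifiers place miR-A in their top two and two of them place miR-B there, so those two features form the signature. A real pipeline would obtain the relevance scores from trained, cross-validated classifiers and would likely weight or filter votes by classifier accuracy.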


The 100-miRNA signature is sufficiently stable to provide almost the same classification accuracy as the complete TCGA dataset, and it is further validated on several GEO datasets, across different types of cancer and platforms. Furthermore, a bibliographic analysis confirms that 77 out of the 100 miRNAs in the signature appear in lists of circulating miRNAs used in cancer studies, in stem-loop or mature-sequence form. The remaining 23 miRNAs offer potentially promising avenues for future research.


Classifiers; Dataset; Feature selection; Machine learning; MicroRNAs; miRNA
