Self-Organizing Map (SOM) unveils and visualizes hidden sequence characteristics of a wide range of eukaryote genomes

Gene. 2006 Jan 3:365:27-34. doi: 10.1016/j.gene.2005.09.040. Epub 2005 Dec 20.

Abstract

Novel tools are needed for comprehensive comparison of interspecies characteristics across the massive amounts of genomic sequence currently available. The Self-Organizing Map (SOM), an unsupervised neural network algorithm, is an effective tool for clustering and visualizing high-dimensional complex data on a single map. For genome informatics, we modified the conventional SOM on the basis of batch-learning SOM, making the learning process and the resulting map independent of the order of data input. We generated SOMs for tri- and tetranucleotide frequencies in 10- and 100-kb sequence fragments from 38 eukaryotes for which almost complete genome sequences are available. The SOMs recognized species-specific characteristics (key combinations of oligonucleotide frequencies) in the genomic sequences, permitting species-specific classification of the sequences without any information regarding the species. We also generated a SOM for tetranucleotide frequencies in 1-kb sequence fragments from the human genome and found that sequences from four functional categories (5' and 3' UTRs, CDSs and introns) were classified primarily according to category. Because of its high classification and visualization power, SOM is an efficient and powerful tool for extracting a wide range of genome information.
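The pipeline described in the abstract can be illustrated with a minimal sketch, assuming a generic batch SOM rather than the authors' specific modification. It splits a sequence into fixed-size fragments, turns each fragment into an oligonucleotide frequency vector, and performs one batch update in which every vector is assigned to its best-matching unit before any unit moves. The function names (`kmer_frequencies`, `fragments`, `batch_som_epoch`), the Gaussian neighbourhood, and the `radius` parameter are illustrative assumptions and do not come from the paper.

```python
from itertools import product
import math

def kmer_frequencies(seq, k=4):
    """Relative frequencies of all 4**k k-mers, in A/C/G/T lexical order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]

def fragments(seq, size):
    """Non-overlapping fragments of exactly `size` bases; a short tail is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def batch_som_epoch(weights, grid, data, radius):
    """One batch update (assumed generic form, not the paper's exact rule).

    weights: one weight vector per map unit
    grid:    (x, y) map coordinates of each unit
    data:    input frequency vectors
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Step 1: find the best-matching unit of every vector with weights held fixed.
    bmus = [min(range(len(weights)), key=lambda j: dist2(v, weights[j]))
            for v in data]

    # Step 2: each unit becomes the neighbourhood-weighted mean of all inputs.
    new_weights = []
    for j, w in enumerate(weights):
        num = [0.0] * len(w)
        den = 0.0
        for v, b in zip(data, bmus):
            d2 = (grid[j][0] - grid[b][0]) ** 2 + (grid[j][1] - grid[b][1]) ** 2
            h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighbourhood (assumption)
            den += h
            num = [n + h * x for n, x in zip(num, v)]
        new_weights.append([n / den for n in num] if den > 0 else list(w))
    return new_weights
```

Because all assignments are made before any weight changes, shuffling `data` yields the same map up to floating-point rounding, which is the order-independence property the abstract attributes to batch learning. For input like that in the abstract, one would call `kmer_frequencies(frag, k=4)` on each element of `fragments(genome, 10_000)`, giving 256-dimensional vectors.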

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • 3' Untranslated Regions
  • 5' Untranslated Regions
  • Algorithms*
  • Animals
  • Base Sequence
  • Chromosome Mapping
  • Computational Biology
  • Eukaryotic Cells*
  • Genome*
  • Genome, Human
  • Genome, Plant
  • Genomics
  • Humans
  • Introns
  • Microsatellite Repeats
  • Molecular Sequence Data
  • Neural Networks, Computer*
  • Oligonucleotides / genetics
  • Sequence Analysis, DNA
  • Species Specificity
  • Trinucleotide Repeats

Substances

  • 3' Untranslated Regions
  • 5' Untranslated Regions
  • Oligonucleotides