HomoloGene Build Procedure
The input for HomoloGene processing consists of the proteins from the input organisms. These sequences are compared to one another (using
blastp) and then are matched up and put into groups, using a tree built from sequence similarity to guide the process, where closer related organisms are matched up first, and then further organisms are added as the tree is traversed toward the root. The protein alignments are mapped back to their corresponding DNA sequences, where distance metrics can be calculated (e.g. molecular distance, Ka/Ks ratio).
Sequences are matched using synteny when applicable. Remaining sequences are matched up by using an algorithm for maximizing the score globally, rather than locally, in a bipartite matching. Cutoffs on bits per position and Ks values are set to prevent unlikely "orthologs" from being grouped together. These cutoffs are calculated based on the respective score distribution for the given groups of organisms. Paralogs are identified by finding sequences that are closer within species than other species.