Overall diagram of the analysis used to identify and classify relaxases, T4CPs, and T4SSs. To classify T4SSs, we selected genes from four well-known systems (A) and used them as templates in homology searches (B). To cluster homologous proteins, we used all-against-all BLASTP (E value of <1e−4) and the Markov cluster algorithm (MCL) (I = 1.12, chosen after extensive searches, with a log transformation of the BLASTP E values as edge weights). We identified VirB4s and T4CPs by BLASTP. Because these two families are homologous, a disambiguation step was necessary. To discriminate between VirB4s and T4CPs, we used an expert-annotated data set of these proteins, compiled by the authors, to query the MCL clusters, which resolved the two families at I = 1.16. Phylogenetic analyses were then used to remove spurious hits. To classify T4SSs into archetypes, we analyzed the indicated genes of plasmid F (MPFF) (GenBank accession number NC_002483), plasmid Ti (MPFT) (accession number NC_002377), plasmid R64 (MPFI) (accession number NC_005014), and the genomic loci of ICEHin1056 (MPFG) (accession number AJ627386). We ran PSI-BLAST searches against the plasmid database (maximum of 30 iterations). Genes with convergent PSI-BLAST results were grouped into colocalized systems (defined as sets of genes <50 genes apart), and the distribution of each gene across all systems was calculated. For each system, we defined a set of marker genes that were nearly universally present in plasmids containing T4SSs and absent from plasmids lacking T4SSs (these were called “type-specific” genes and are shown in green to distinguish them from the remaining “nonspecific” genes, shown in gray). The final list of T4SSs resulted from the joint analysis of type-specific T4SS- and VirB4-containing plasmids.
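The colocalization criterion in the legend (genes <50 genes apart) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each hit carries an ordinal gene index along its replicon, that consecutive hits are chained whenever their index difference is below the cutoff, and that replicons are treated as linear (circularity is ignored). The function name `group_colocalized` and the tuple layout are hypothetical.

```python
from collections import defaultdict

MAX_GAP = 50  # cutoff from the legend: genes "<50 genes apart"


def group_colocalized(hits, max_gap=MAX_GAP):
    """Group hits into colocalized systems.

    hits: iterable of (replicon_id, gene_index, gene_name) tuples, where
    gene_index is the ordinal position of the gene on the replicon
    (an assumption; the legend does not specify the coordinate system).
    Returns a list of (replicon_id, [(gene_index, gene_name), ...]).
    """
    by_replicon = defaultdict(list)
    for replicon, index, name in hits:
        by_replicon[replicon].append((index, name))

    systems = []
    for replicon, genes in by_replicon.items():
        genes.sort()
        current = [genes[0]]
        for gene in genes[1:]:
            # Chain the hit onto the current system if it lies fewer than
            # max_gap genes from the previous hit; otherwise start a new one.
            if gene[0] - current[-1][0] < max_gap:
                current.append(gene)
            else:
                systems.append((replicon, current))
                current = [gene]
        systems.append((replicon, current))
    return systems
```

Under these assumptions, two hits at indices 10 and 20 on the same replicon fall into one system, whereas a third hit at index 200 starts a separate system.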
The automatic discovery of relaxases began with iterative BLASTP searches of the 1,730 plasmids, using as queries the 300 N-terminal amino acids of each of the six prototypical relaxases previously described. The E value thresholds used were as follows: TrwC_R388 (MOBF; 1e−8), TraI_R27 (MOBH; 1e−4), TraI_RP4 (MOBP; 1e−4), MobA_RSF1010 (MOBQ; 1e−4), MobM_pMV158 (MOBV; 1e−5), and MobC_CloDF13 (MOBC; 1e−4). After each search was completed, the hits with the lowest E values were used as queries to retrieve more distantly related sequences. We used the resulting set to query the MCL clusters; each relaxase family was unambiguously assigned to a cluster, and the clusters were then manually validated.
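The query preparation and family-specific thresholds described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The threshold values are those quoted in the text; the function names, the tuple layout of a hit, and the inclusive comparison (E value at or below the threshold) are assumptions.

```python
# E value thresholds quoted in the text, keyed by MOB family.
E_VALUE_THRESHOLDS = {
    "MOBF": 1e-8,  # TrwC_R388
    "MOBH": 1e-4,  # TraI_R27
    "MOBP": 1e-4,  # TraI_RP4
    "MOBQ": 1e-4,  # MobA_RSF1010
    "MOBV": 1e-5,  # MobM_pMV158
    "MOBC": 1e-4,  # MobC_CloDF13
}


def n_terminal(sequence, length=300):
    """Return the first `length` residues of a protein sequence."""
    return sequence[:length]


def filter_hits(hits):
    """Keep hits whose E value passes the threshold of their family.

    hits: iterable of (family, subject_id, evalue) tuples (hypothetical
    layout). Whether the original threshold was strict or inclusive is
    not stated; an inclusive comparison is assumed here.
    """
    return [
        (family, subject, evalue)
        for family, subject, evalue in hits
        if evalue <= E_VALUE_THRESHOLDS[family]
    ]
```

For example, a MOBF hit with an E value of 1e-6 would be discarded under the MOBF threshold, although the same E value would pass the thresholds of the other five families.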