Sprinzak et al. 10.1073/pnas.0603352103.

Supporting Information





Table 2. Fusion event parameters

Parameter | Value | Explanation
--- | --- | ---
E value | <0.01 | E value for BLAST.
Query coverage | >0.6 | Fraction of query sequence that is aligned. Computed as the length of the aligned subsequence of the query (without gaps), divided by the total length of the query.
Hit coverage | <0.96 | Fraction of hit sequence that is aligned. Computed as the length of the aligned subsequence of the hit (without gaps), divided by the total length of the hit.
Minimum query identity fraction | >0.25 | Lower limit of identical aligned residues of query. Computed as the total no. of query residues in the alignment that are identical to the hit, divided by the total length of the query.
Maximal identity fraction | <0.96 | Upper limit of identical aligned residues. Computed as the no. of identical residues in the alignment divided by the alignment length. (This is the "identities" field of the BLAST output.)
Fusion proteins overlap | ≤0.3 | Overlap between two fusion candidate proteins, query1 and query2. Computed as the overlap length divided by the length of the shorter query. Overlap is calculated by using the hit as a reference, where the alignment coordinates of query1 and query2 are mapped to the hit.

All alignment parameters are retrieved from the BLAST output and are summed over all high-scoring segment pairs.
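The thresholds of Table 2 can be illustrated with a minimal Python sketch. The `Hsp` record and both function names are hypothetical and are not part of BLAST or of the original analysis; the sketch assumes that per-HSP values have already been parsed from the BLAST output, that hit coverage is normalized by the hit length, and that the footprint of each query on the hit is given as an inclusive coordinate span.

```python
from dataclasses import dataclass

@dataclass
class Hsp:
    """One high-scoring segment pair (hypothetical record, not a BLAST API)."""
    query_aligned: int  # aligned query residues, gaps excluded
    hit_aligned: int    # aligned hit residues, gaps excluded
    identities: int     # identical residues in this HSP
    align_len: int      # alignment length, gaps included

def passes_fusion_filters(hsps, query_len, hit_len, e_value):
    """Apply the Table 2 alignment thresholds, summing over all HSPs."""
    identities = sum(h.identities for h in hsps)
    query_cov = sum(h.query_aligned for h in hsps) / query_len
    hit_cov = sum(h.hit_aligned for h in hsps) / hit_len  # assumed divisor
    query_ident = identities / query_len
    max_ident = identities / sum(h.align_len for h in hsps)
    return (e_value < 0.01 and query_cov > 0.6 and hit_cov < 0.96
            and query_ident > 0.25 and max_ident < 0.96)

def overlap_fraction(span1, span2, len1, len2):
    """Overlap of two query footprints on the hit, relative to the shorter query.

    span1 and span2 are inclusive (start, end) coordinates on the hit.
    """
    start = max(span1[0], span2[0])
    end = min(span1[1], span2[1])
    overlap = max(0, end - start + 1)
    return overlap / min(len1, len2)
```

Under these assumptions, two yeast proteins would be kept as fusion candidates when both pass `passes_fusion_filters` against the same hit and `overlap_fraction` is at most 0.3.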





Table 3. Phylogenetic profile analysis parameters

Parameter | Value | Explanation
--- | --- | ---
E value | >0.01 | E value for BLAST.
Query coverage | ≥0.65 | Fraction of query sequence that is aligned. Computed as the length of the aligned subsequence of the query (without gaps), divided by the total length of the query.
Hit coverage | ≥0.65 | Fraction of hit sequence that is aligned. Computed as the length of the aligned subsequence of the hit (without gaps), divided by the total length of the hit.
Query identity* | ≥0.25 | Fraction of identical aligned residues of the query. Computed as the total no. of query residues in the alignment that are identical to the hit, divided by the total length of the query.
Hit identity* | ≥0.25 | Fraction of identical aligned residues of the hit. Computed as the total no. of hit residues in the alignment that are identical to the query, divided by the total length of the hit.
Multiple occurrences | ≥0.8 | Excludes multiple occurrences of query in hit, or vice versa. Computed as Query identity/Hit identity.
Inclusion identity fraction | ≥0.4 | When one of the aligned sequences is fully included in the other one, we require a minimal identity level to consider the sequences as orthologs.
Inclusion alignment fraction | ≥0.9 | When one of the aligned sequences is fully included in the other one, this parameter indicates the required coverage of the included sequence.
Phylogenetic profile mismatch | ≤2 | No. of allowed mismatches when comparing the two phylogenetic profiles.

All alignment parameters are retrieved from the BLAST output and are summed over all high-scoring segment pairs (HSPs).

* Because we sum over all HSPs, the overall no. of identities could differ between hit and query.
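The last criterion of Table 3 can be sketched in a few lines of Python. This is an illustration only: it assumes that each phylogenetic profile is encoded as a string of 0/1 flags with one position per genome (1 if an ortholog was found), and the function names are hypothetical.

```python
def profile_mismatches(profile_a, profile_b):
    """Count the genomes in which exactly one of the two proteins has an ortholog."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))

def profiles_consistent(profile_a, profile_b, max_mismatch=2):
    """True when the two profiles differ in at most max_mismatch genomes."""
    return profile_mismatches(profile_a, profile_b) <= max_mismatch
```

For example, the profiles `"1100110"` and `"1100011"` differ in two genomes and would be treated as consistent under the threshold of 2.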





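Table 4. Fivefold cross-validation

 | GP1 | GP2 | GP3 | GP4 | GP5 | All data
--- | --- | --- | --- | --- | --- | ---
GP1 | 1 | | | | |
GP2 | 0.99926 | 1 | | | |
GP3 | 0.99968 | 0.99915 | 1 | | |
GP4 | 0.99887 | 0.99712 | 0.9989 | 1 | |
GP5 | 0.99775 | 0.99873 | 0.99838 | 0.99617 | 1 |
All data | 0.99977 | 0.99955 | 0.99989 | 0.99883 | 0.9986 | 1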

Correlations between the sets of LR attribute coefficients obtained in the cross-validation runs. The 1,466 permanent interacting pairs were divided randomly into five groups; each group in turn was excluded, and the other four groups were used for training the LR model (GP1-GP5). The table lists the correlation coefficients calculated between every two sets of attribute LR coefficients. The All-data set contains the attribute LR coefficients calculated using all data. The coefficient of the FE (fusion event) attribute was excluded because this attribute appears rarely in the data, which caused errors in random sampling. All correlation coefficients had a calculated probability below 4.1875e-09.
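The procedure behind Table 4 can be sketched as follows. This is a hedged illustration, not the original code: the function names are hypothetical, the type of correlation is not stated in the text and Pearson correlation is assumed here, and fitting the LR model on each training set is left out.

```python
import math
import random

def five_fold_training_sets(pairs, k=5, seed=0):
    """Split pairs randomly into k groups and return the k leave-one-group-out training sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    groups = [shuffled[i::k] for i in range(k)]
    return [
        [p for j, group in enumerate(groups) if j != i for p in group]
        for i in range(k)
    ]

def pearson(x, y):
    """Pearson correlation between two equally long coefficient vectors (assumed measure)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Each training set would be used to fit one set of LR coefficients, and `pearson` would then be applied to every two coefficient sets to fill the lower triangle of the table.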

Table 5. Logistic-regression attribute coefficients

Abbreviation | Pair attribute | Parameter | Coefficient | Pr > χ²*
--- | --- | --- | --- | ---
 | | Intercept | -4.8567 | <0.0001
DD | Correlated domain signature | X1 | 1.3585 | <0.0001
Fold | Fold combination | X2 | 0.2257 | 0.1153
Loc | Colocalization | X3 | 0.8762 | <0.0001
Proc | Co-cellular role | X4 | 2.5885 | <0.0001
FE | Fusion event | X5 | 1.4438 | 0.0558
PP | Phylogenetic profiles | X6 | 0.0743 | 0.4081
GN | Gene neighborhood | X7 | 0.5297 | 0.0303
Exp | Coexpression | X8 | -0.3493 | 0.0025
Reg | Coregulation | X9 | 0.0145 | 0.9129

* Pr, probability; only Pr < 0.0001 is significant.





Table 6. Interacting protein pairs that were identified experimentally by less reliable methods and were predicted by the LR model

Protein1 | Protein2 | Attribute combinations | Predicted probability
--- | --- | --- | ---
LST8 | SEC13 | 111111000 | 0.84688
RAM1 | RAM2 | 111101000 | 0.56623
SSB1 | SSB2 | 111101000 | 0.56623
UBC4 | UBC1 | 111101000 | 0.56623
YPK1 | YPK2 | 111101000 | 0.56623
CLA4 | STE20 | 111101000 | 0.56623
UBC4 | UBC5 | 111101000 | 0.56623
CCT2 | CCT3 | 111101000 | 0.56623
UBC5 | UBC1 | 111101000 | 0.56623
CDC5 | ELM1 | 111101000 | 0.56623
PRS1 | PRS2 | 111100000 | 0.54790
SSB1 | HBS1 | 111100000 | 0.54790
LYS14 | CHA4 | 111100000 | 0.54790
PKC1 | PPZ1 | 111100000 | 0.54790
SLG1 | BCK1 | 111100000 | 0.54790
HBS1 | SSB2 | 111100000 | 0.54790
PKC1 | BCK1 | 111100000 | 0.54790
PTP2 | PTC1 | 111100000 | 0.54790
ACT1 | SAC7 | 111100000 | 0.54790
PAN1 | GLC7 | 111100000 | 0.54790
PKC1 | SLG1 | 111100000 | 0.54790
PRS1 | PRS3 | 111100000 | 0.54790
YPL133C | YBR239C | 111100000 | 0.54790
PRS1 | PRS5 | 111100000 | 0.54790
CDC15 | CDC5 | 111100000 | 0.54790
PKC1 | SLT2 | 111100000 | 0.54790
RPP0 | RPP1B | 101101001 | 0.51382
SEC4 | BET3 | 101101001 | 0.51382
TPM1 | MYO2 | 001110000 | 0.51295
INP52 | INP53 | 101101000 | 0.51020
TUB1 | TUB3 | 101101000 | 0.51020
RRP4 | MTR4 | 101101000 | 0.51020
CIN8 | DYN1 | 101101000 | 0.51020
PMT1 | PMT4 | 101101000 | 0.51020
SNF2 | GCN5 | 101101000 | 0.51020
PMT2 | PMT4 | 101101000 | 0.51020
INP51 | INP52 | 101101000 | 0.51020
SAC6 | CCT4 | 101101000 | 0.51020
INP51 | INP53 | 101101000 | 0.51020
RVB1 | RVB2 | 101101000 | 0.51020
DYN1 | KIP3 | 101101000 | 0.51020
PMT2 | PMT3 | 101101000 | 0.51020
PMT1 | PMT3 | 101101000 | 0.51020
PMT4 | PMT3 | 101101000 | 0.51020
PMT2 | PMT1 | 101101000 | 0.51020
DBP7 | DBP6 | 101101000 | 0.51020
YPT1 | BET3 | 101101000 | 0.51020
CIN8 | PAC1 | 101101000 | 0.51020
GLK1 | YDR516C | 011111010 | 0.50062
GIC1 | CLA4 | 101100001 | 0.49525
GIC1 | GIC2 | 101100001 | 0.49525
LTE1 | TEM1 | 101100001 | 0.49525
VPS15 | APM3 | 101100000 | 0.49163
GNA1 | SRP1 | 101100000 | 0.49163
TRK1 | BAP2 | 101100000 | 0.49163
ARP2 | END3 | 101100000 | 0.49163
PEX14 | DCI1 | 101100000 | 0.49163
RAD51 | SAP1 | 101100000 | 0.49163
TRM1 | GBP2 | 101100000 | 0.49163
OCA1 | SIW14 | 101100000 | 0.49163
YPT31 | YIP3 | 101100000 | 0.49163
CDC12 | GIC1 | 101100000 | 0.49163
CDC28 | CDC37 | 101100000 | 0.49163
SFB3 | SEC13 | 101100000 | 0.49163
SNF3 | HXT1 | 101100000 | 0.49163
CSG2 | LCB2 | 101100000 | 0.49163
BIK1 | BUB3 | 101100000 | 0.49163
ARF1 | CHC1 | 101100000 | 0.49163
PKC1 | BRO1 | 101100000 | 0.49163
YAK1 | GTS1 | 101100000 | 0.49163
GTS1 | YAP6 | 101100000 | 0.49163
BUD14 | GLC7 | 101100000 | 0.49163
CCT3 | YKE2 | 101100000 | 0.49163
RIM101 | ZAP1 | 101100000 | 0.49163
SNF2 | SPT7 | 101100000 | 0.49163
MYO2 | SRO77 | 101100000 | 0.49163
DAL80 | HDA1 | 101100000 | 0.49163
CDC20 | CIN8 | 101100000 | 0.49163
SNF3 | HXT3 | 101100000 | 0.49163
CDC15 | MOB1 | 101100000 | 0.49163
ACT1 | RVS161 | 101100000 | 0.49163
SUV3 | MSU1 | 101100000 | 0.49163
NSP1 | NUP84 | 101100000 | 0.49163
BRO1 | BCK1 | 101100000 | 0.49163
LYS14 | SPP1 | 101100000 | 0.49163
PRP5 | PRP11 | 101100000 | 0.49163
CSG2 | TSC10 | 101100000 | 0.49163
SNF4 | PRR2 | 101100000 | 0.49163
SPT23 | MGA2 | 101100000 | 0.49163
GCN4 | YPL039W | 101100000 | 0.49163
PEX14 | ECI1 | 101100000 | 0.49163
SNF3 | RGT2 | 101100000 | 0.49163
RAD24 | RFC1 | 101100000 | 0.49163
STE50 | GIC1 | 101100000 | 0.49163
PEX1 | PEX6 | 101100000 | 0.49163
VPS21 | VPS41 | 101100000 | 0.49163
TAF90 | YAP6 | 101100000 | 0.49163
CLN3 | SAP185 | 101100000 | 0.49163
BOI1 | RHO4 | 101100000 | 0.49163
CCT2 | GIM4 | 101100000 | 0.49163
HOG1 | YPD1 | 101100000 | 0.49163
BRO1 | SLT2 | 101100000 | 0.49163
HMT1 | NPL3 | 101100000 | 0.49163
ACA1 | GTS1 | 101100000 | 0.49163
BUD5 | BEM1 | 101100000 | 0.49163
TUB1 | CIN1 | 101100000 | 0.49163
LSM2 | MTR3 | 101100000 | 0.49163
DST1 | RPB2 | 101100000 | 0.49163
YME1 | ATP3 | 101100000 | 0.49163
MRS6 | YPT31 | 101100000 | 0.49163
CDC2 | RFC2 | 101100000 | 0.49163
TUB3 | CIN1 | 101100000 | 0.49163
TUB2 | BIK1 | 101100000 | 0.49163
CLN3 | SAP155 | 101100000 | 0.49163
POL1 | RAD24 | 101100000 | 0.49163
PBS2 | YPD1 | 101100000 | 0.49163
SCO2 | COX17 | 101100000 | 0.49163
LCB1 | CSG2 | 101100000 | 0.49163
VAM7 | YPT7 | 101100000 | 0.49163
APM3 | VPS27 | 101100000 | 0.49163
YCK2 | AKR1 | 101100000 | 0.49163
VAM7 | APG17 | 101100000 | 0.49163
ROK1 | RRP5 | 101100000 | 0.49163
CDC2 | RAD24 | 101100000 | 0.49163
YKT6 | SEC34 | 101100000 | 0.49163
SCO1 | COX17 | 101100000 | 0.49163
ENT2 | ENT1 | 101100000 | 0.49163
RAD9 | CHK1 | 101100000 | 0.49163
GLC7 | FYV11 | 101100000 | 0.49163
MYO2 | SMY1 | 101100000 | 0.49163
SSK2 | YPD1 | 101100000 | 0.49163
YCK1 | AKR1 | 101100000 | 0.49163
MRS11 | TIM18 | 101100000 | 0.49163
CCT3 | GIM4 | 101100000 | 0.49163
SEC22 | SEC34 | 101100000 | 0.49163
SNF3 | HXT4 | 101100000 | 0.49163
PRP9 | PRP5 | 101100000 | 0.49163
PEP7 | YPT52 | 101100000 | 0.49163
CCT2 | YKE2 | 101100000 | 0.49163
CDC20 | CDC15 | 101100000 | 0.49163
RFC3 | RFC4 | 101100000 | 0.49163
TUB1 | YKE2 | 101100000 | 0.49163
BEM1 | RHO3 | 101100000 | 0.49163
PRP16 | DIS3 | 101100000 | 0.49163
TRK2 | BAP2 | 101100000 | 0.49163
SNF3 | HXT2 | 101100000 | 0.49163
FYV11 | YKE2 | 101100000 | 0.49163

The attribute combination order is: DD, Fold, Loc, Proc, FE, PP, GN, Exp, Reg.





Table 7. All possible theoretical attribute combinations with a probability ≥ 0.49

No. | Attribute combinations | Calculated probability based on LR
--- | --- | ---
1 | 010111100 | 0.5013
2 | 010111101 | 0.5049
3 | 001110000 | 0.5129
4 | 001110001 | 0.5166
5 | 011110000 | 0.5689
6 | 011110001 | 0.5725
7 | 001110100 | 0.6414
8 | 001110110 | 0.5578
9 | 001110101 | 0.6447
10 | 001110111 | 0.5614
11 | 011110100 | 0.6915
12 | 011110110 | 0.6125
13 | 011110101 | 0.6946
14 | 011110111 | 0.6160
15 | 001111000 | 0.5315
16 | 001111001 | 0.5351
17 | 011111000 | 0.5871
18 | 011111010 | 0.5006
19 | 011111001 | 0.5906
20 | 011111011 | 0.5042
21 | 001111100 | 0.6583
22 | 001111110 | 0.5760
23 | 001111101 | 0.6616
24 | 001111111 | 0.5796
25 | 011111100 | 0.7071
26 | 011111110 | 0.6300
27 | 011111101 | 0.7101
28 | 011111111 | 0.6334
29 | 100110000 | 0.6304
30 | 100110010 | 0.5461
31 | 100110001 | 0.6338
32 | 100110011 | 0.5497
33 | 110110000 | 0.6813
34 | 110110010 | 0.6012
35 | 110110001 | 0.6845
36 | 110110011 | 0.6047
37 | 100110100 | 0.7434
38 | 100110110 | 0.6714
39 | 100110101 | 0.7462
40 | 100110111 | 0.6746
41 | 110110100 | 0.7841
42 | 110110110 | 0.7191
43 | 110110101 | 0.7865
44 | 110110111 | 0.7221
45 | 100111000 | 0.6476
46 | 100111010 | 0.5644
47 | 100111001 | 0.6509
48 | 100111011 | 0.5680
49 | 110111000 | 0.6972
50 | 110111010 | 0.6189
51 | 110111001 | 0.7003
52 | 110111011 | 0.6223
53 | 100111100 | 0.7573
54 | 100111110 | 0.6876
55 | 100111101 | 0.7600
56 | 100111111 | 0.6907
57 | 110111100 | 0.7964
58 | 110111110 | 0.7339
59 | 110111101 | 0.7987
60 | 110111111 | 0.7367
61 | 101100000 | 0.4916
62 | 101100001 | 0.4953
63 | 111100000 | 0.5479
64 | 111100001 | 0.5515
65 | 101100100 | 0.6216
66 | 101100110 | 0.5367
67 | 101100101 | 0.6250
68 | 101100111 | 0.5403
69 | 111100100 | 0.6730
70 | 111100110 | 0.5921
71 | 111100101 | 0.6762
72 | 111100111 | 0.5956
73 | 101101000 | 0.5102
74 | 101101001 | 0.5138
75 | 111101000 | 0.5662
76 | 111101001 | 0.5698
77 | 101101100 | 0.6389
78 | 101101110 | 0.5551
79 | 101101101 | 0.6422
80 | 101101111 | 0.5587
81 | 111101100 | 0.6892
82 | 111101110 | 0.6099
83 | 111101101 | 0.6923
84 | 111101111 | 0.6133
85 | 101110000 | 0.8038
86 | 101110010 | 0.7429
87 | 101110001 | 0.8061
88 | 101110011 | 0.7456
89 | 111110000 | 0.8370
90 | 111110010 | 0.7836
91 | 111110001 | 0.8390
92 | 111110011 | 0.7860
93 | 101110100 | 0.8744
94 | 101110110 | 0.8307
95 | 101110101 | 0.8759
96 | 101110111 | 0.8327
97 | 111110100 | 0.8971
98 | 111110110 | 0.8601
99 | 111110101 | 0.8985
100 | 111110111 | 0.8619
101 | 101111000 | 0.8153
102 | 101111010 | 0.7568
103 | 101111001 | 0.8174
104 | 101111011 | 0.7595
105 | 111111000 | 0.8469
106 | 111111010 | 0.7959
107 | 111111001 | 0.8487
108 | 111111011 | 0.7983
109 | 101111100 | 0.8823
110 | 101111110 | 0.8409
111 | 101111101 | 0.8838
112 | 101111111 | 0.8428
113 | 111111100 | 0.9038
114 | 111111110 | 0.8688
115 | 111111101 | 0.9050
116 | 111111111 | 0.8705

The attribute combination order is: DD, Fold, Loc, Proc, FE, PP, GN, Exp, Reg. We examined the attribute combinations that led to probabilities above the threshold. For nine attributes with binary values, there are 512 possible attribute combinations. For each such combination, a probability can be computed from the LR coefficients (see Supporting Methods and Table 5). Only 116 attribute combinations had probability values above the threshold of 0.49, and they are listed in this table. Remarkably, all of these combinations included the shared cellular process attribute. Thus, we would not have predicted a protein pair as interacting unless the pair-mates were documented to be involved in the same process.
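The enumeration described above can be reproduced with a short Python sketch. It uses the rounded coefficients printed in Table 5, so probabilities may differ from the listed values in the last digit; the variable and function names are illustrative and are not taken from the original analysis.

```python
import math
from itertools import product

# Coefficient order follows the attribute order:
# DD, Fold, Loc, Proc, FE, PP, GN, Exp, Reg (values as printed in Table 5).
INTERCEPT = -4.8567
COEFFICIENTS = (1.3585, 0.2257, 0.8762, 2.5885, 1.4438,
                0.0743, 0.5297, -0.3493, 0.0145)
THRESHOLD = 0.49

def lr_probability(bits):
    """Probability 1 / (1 + e^-Z) for a tuple of nine 0/1 attribute values."""
    z = INTERCEPT + sum(c * b for c, b in zip(COEFFICIENTS, bits))
    return 1.0 / (1.0 + math.exp(-z))

# Enumerate all 2^9 = 512 combinations and keep those at or above the threshold.
above_threshold = {}
for bits in product((0, 1), repeat=9):
    p = lr_probability(bits)
    if p >= THRESHOLD:
        above_threshold["".join(str(b) for b in bits)] = p

# Check whether every retained combination has the Proc attribute (4th position).
all_have_proc = all(combo[3] == "1" for combo in above_threshold)
print(len(above_threshold), all_have_proc)
```

With these coefficients, even a pair that has every positive attribute except Proc reaches only Z = -0.334, which is why no combination without the shared cellular process attribute passes the threshold.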





Supporting Methods

Derivation of Attributes

1. Domain-domain (DD) signatures.

Each yeast protein was characterized according to the InterPro signatures (domains) database (1), and pairs of overrepresented InterPro signatures in the interacting proteins were identified (2). This analysis was carried out on the whole database of 8,695 pairs. To avoid bias, the identification of correlated domain signatures and the assignment of the protein pairs by them were carried out with 3-fold cross-validation.

The dataset of 8,695 interacting protein pairs was divided into three random sets of equal size, where each set included the same fraction of reliable and less reliable interacting protein pairs. The identification of correlated domain signatures was carried out by using two of the three sets. Domain signature pairs that were found to be at least 2-fold more frequent than expected at random were recorded as correlated domain signatures. Protein pairs in the third set were then examined for these DDs. If they contained a recorded DD, they were assigned as having the attribute. This procedure was repeated three times, each time with a different two sets used for training.
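One round of this procedure can be sketched in Python as follows. The sketch is only an illustration: the function names are hypothetical, and the expected frequency is assumed here to follow from the independent frequencies of the two signatures among the proteins of the training pairs, which may differ from the statistic used in ref. 2.

```python
from collections import Counter
from itertools import product

def correlated_signatures(train_pairs, sigs, min_ratio=2.0):
    """Return unordered signature pairs seen at least min_ratio times more often than expected.

    train_pairs: list of (protein_a, protein_b) from two of the three sets.
    sigs: dict mapping a protein to its set of signature IDs.
    """
    n = len(train_pairs)
    slot_counts = Counter()  # protein slots carrying each signature
    pair_counts = Counter()  # protein pairs showing each signature pair
    for a, b in train_pairs:
        sa, sb = sigs.get(a, frozenset()), sigs.get(b, frozenset())
        slot_counts.update(sa)
        slot_counts.update(sb)
        pair_counts.update({tuple(sorted(st)) for st in product(sa, sb)})
    correlated = set()
    for (s, t), observed in pair_counts.items():
        fs, ft = slot_counts[s] / (2 * n), slot_counts[t] / (2 * n)
        # Assumed null model: signatures occur independently in the two pair-mates.
        expected = n * fs * ft * (1 if s == t else 2)
        if observed / expected >= min_ratio:
            correlated.add((s, t))
    return correlated

def has_dd_attribute(pair, sigs, correlated):
    """Assign the DD attribute to a held-out pair that carries a recorded signature pair."""
    a, b = pair
    candidates = product(sigs.get(a, ()), sigs.get(b, ()))
    return any(tuple(sorted(st)) in correlated for st in candidates)
```

In the 3-fold scheme, `correlated_signatures` would be run on two of the three sets and `has_dd_attribute` applied to the pairs of the third set, rotating the held-out set each time.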

2. Fold combinations.

Whole-genome fold assignment using PSI-BLAST as a fold recognition method was performed by Hegyi et al. (3). We used these published data as a basis for our fold combination analysis. There are fold assignments for only 26% of all yeast proteins. As in the domain signature analysis, overrepresented fold combinations among the interacting protein pairs were detected. For each protein pair, we noted whether it corresponds to an overrepresented fold combination (at least twice as frequent as expected at random). This analysis was also carried out with 3-fold cross-validation to avoid bias, as described above.

3. Co-cellular localization.

The cellular localization annotation for all yeast proteins was based on the YPD database (4) and on the experimental data of Huh et al. (5). The cellular localization is known for 72% of the yeast proteins. When proteins were annotated as localized to more than one cellular compartment, we considered two pair-mates to be colocalized if at least one of the localization compartments was shared.
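The colocalization rule amounts to a non-empty set intersection. A minimal sketch, assuming each protein's annotation is available as a collection of compartment names (the function name and the example compartment labels are illustrative):

```python
def colocalized(compartments_a, compartments_b):
    """True when the two proteins share at least one annotated compartment."""
    return bool(set(compartments_a) & set(compartments_b))
```

For example, a protein annotated to the nucleus and the cytoplasm would count as colocalized with a protein annotated only to the cytoplasm, whereas proteins with no annotation never receive the attribute.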

4. Shared cellular process.

Information about the cellular process was taken from the YPD database (4). The YPD contained 42 different cellular process categories, and the assignment was available for 59% of the yeast proteins.

5. Fusion events.

The analysis for finding a gene fusion event was carried out following Marcotte et al. (6) and Enright et al. (7). To identify fusion events between yeast protein-encoding genes, all yeast protein sequences were compared with the nonredundant database at the National Center for Biotechnology Information (NCBI) using the BLAST search tool (8, 9). Criteria to determine fusion events are detailed in Table 2.

6. Phylogenetic profiles.

Phylogenetic profile analysis was done across 80 genomes, both prokaryotic and eukaryotic. To identify orthologs, each yeast protein sequence was compared with the other genome sequences using the BLAST tool (8, 9). Criteria used to determine orthologs and consistent phylogenetic profiles are detailed in Table 3.

7. Gene adjacency.

We used the published data of von Mering et al. (10), who provide a list of yeast protein pairs whose genes are adjacent on the chromosome and whose adjacency is also conserved in other genomes.

8. Coexpression.

Ihmels et al. (11) recently developed a method for using genomewide mRNA expression data to cluster genes into expression modules according to their correlated mRNA expression patterns. Based on this clustering, we extracted all possible protein pairs with encoding mRNAs clustered into the same module.
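Extracting all pairs that share a module can be sketched as follows, assuming the module assignments are available as a mapping from a module identifier to its member genes (a hypothetical input format; a gene may belong to more than one module):

```python
from itertools import combinations

def coexpressed_pairs(modules):
    """Return the set of unordered gene pairs that occur together in at least one module."""
    pairs = set()
    for genes in modules.values():
        # sorted(set(...)) removes duplicate gene names within a module.
        pairs.update(frozenset(p) for p in combinations(sorted(set(genes)), 2))
    return pairs
```

Storing each pair as a `frozenset` makes the result independent of the order of the two pair-mates, so a pair found in several modules is counted once.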

9. Coregulation.

We compiled data from several sources: small-scale regulation data from the YPD database (4), data from the SCPD database (12), and genomewide location analysis of transcription factors (13-16). Thus, for each yeast protein we have a list of transcription factors that were documented to regulate its gene or at least are known to bind upstream of the gene. Such data were available for only 43.3% of the yeast proteins. Next, we checked for each possible protein pair whether both respective genes were regulated by at least one common transcription factor.

Logistic-Regression (LR) Analysis.

LR is a statistical method suitable for finding the best-fitting and most parsimonious model to describe the relationship between a dichotomous response variable and a set of explanatory variables. The output is the probability for the event to occur, and it is calculated by the logit transformation $1/(1 + e^{-Z})$, where $Z = B_0 + B_1X_1 + B_2X_2 + \dots + B_PX_P$. The coefficients ($B_i$) of the independent variables ($X_i$) are computed to maximize the probability of obtaining the observed set of data. The intercept and the coefficients are estimated using the maximum-likelihood method. The significance of each of the coefficients can be estimated by the Wald statistic (17), distributed as $\chi^2$ with one degree of freedom. We ran the LR using the SAS statistical package (SAS Institute, Inc., Cary, NC).
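As a worked example, consider a protein pair with the attribute combination 101100000 (DD, Loc, and Proc present, all other attributes absent). Using the rounded coefficients printed in Table 5, the calculation can be written out as:

```latex
\begin{align*}
Z &= B_0 + B_1 + B_3 + B_4 \\
  &= -4.8567 + 1.3585 + 0.8762 + 2.5885 = -0.0335, \\
P &= \frac{1}{1 + e^{-Z}} = \frac{1}{1 + e^{0.0335}} \approx 0.4916 .
\end{align*}
```

This agrees with the probability of 0.49163 listed for this combination in Table 6, up to rounding of the coefficients.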

1. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, et al. (2003) Nucleic Acids Res 31:315-318.

2. Sprinzak E, Margalit H (2001) J Mol Biol 311:681-692.

3. Hegyi H, Lin J, Greenbaum D, Gerstein M (2002) Proteins 47:126-141.

4. Csank C, Costanzo MC, Hirschman J, Hodges P, Kranz JE, Mangan M, O'Neill K, Robertson LS, Skrzypek MS, Brooks J, et al. (2002) Methods Enzymol 350:347-373.

5. Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK (2003) Nature 425:686-691.

6. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D (1999) Science 285:751-753.

7. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA (1999) Nature 402:86-90.

8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) J Mol Biol 215:403-410.

9. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Nucleic Acids Res 25:3389-3402.

10. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Nature 417:399-403.

11. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N (2002) Nat Genet 31:370-377.

12. Zhu J, Zhang MQ (1999) Bioinformatics 15:607-611.

13. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. (2000) Science 290:2306-2309.

14. Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakkola TS, et al. (2001) Cell 106:697-708.

15. Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO (2001) Nature 409:533-538.

16. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al. (2002) Science 298:799-804.

17. Hosmer D, Lemeshow S (2000) Applied Logistic Regression (Wiley, New York).