Springer LINK: Lecture Notes in Computer Science
O. Gascuel, M.-F. Sagot (Eds.): Computational Biology First Internatio...

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Springer LINK: Lecture Notes in Computer Science

O. Gascuel, M.-F. Sagot (Eds.): Computational Biology First International Conference on Biology, Informatics, and Mathematics, JOBIM 2000, Montpellier, France, May 3-5, 2000, Selected Papers LNCS 2066 Ordering Information

Table of Contents Title pages in PDF (11 KB) Preface in PDF (26 KB) Conference Organization in PDF (17 KB) Table of Contents in PDF (38 KB) Speeding Up the DIALIGN Multiple Alignment Program by Using the `Greedy Alignment of BIOlogical Sequences LIBrary' (GABIOS-LIB) Saïd Abdeddaïm and Burkhard Morgenstern LNCS 2066, p. 1 ff. Abstract | Full article in PDF (150 KB) GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

http://buffy.lib.unimelb.edu.au:2150/link/service/series/0558/tocs/t2066.htm (1 of 3) [10/18/2002 8:10:49 PM]

Springer LINK: Lecture Notes in Computer Science

G. Broner, B. Spataro, C. Gautier, and F. Rechenmann LNCS 2066, p. 12 ff. Abstract | Full article in PDF (89 KB) Optimal Agreement Supertrees David Bryant LNCS 2066, p. 24 ff. Abstract | Full article in PDF (112 KB) Segmentation by Maximal Predictive Partitioning According to Composition Biases Laurent Guéguen LNCS 2066, p. 32 ff. Abstract | Full article in PDF (176 KB) Can We Have Confidence in a Tree Representation? Alain Guénoche and Henri Garreta LNCS 2066, p. 45 ff. Abstract | Full article in PDF (121 KB) Bayesian Approach to DNA Segmentation into Regions with Different Average Nucleotide Composition Vsevolod Makeev, Vasily Ramensky, Mikhail Gelfand, Mikhail Roytberg, and Vladimir Tumanyan LNCS 2066, p. 57 ff. Abstract | Full article in PDF (141 KB) Exact and Asymptotic Distribution of the Local Score of One i.i.d. Random Sequence Sabine Mercier, Dominique Cellier, François Charlot, and Jean-Jacques Daudin LNCS 2066, p. 74 ff. Abstract | Full article in PDF (147 KB) Phylogenetic Reconstruction Algorithms Based on Weighted 4-Trees Vincent Ranwez and Olivier Gascuel LNCS 2066, p. 84 ff. Abstract | Full article in PDF (207 KB) Computational Complexity of Word Counting Mireille Régnier LNCS 2066, p. 99 ff. Abstract | Full article in PDF (160 KB) EUGÈNE: An Eukaryotic Gene Finder That Combines Several Sources of Evidence Thomas Schiex, Annick Moisan, and Pierre Rouzé LNCS 2066, p. 111 ff. Abstract | Full article in PDF (255 KB) Tree Reconstruction via a Closure Operation on Partial Splits Charles Semple and Mike Steel http://buffy.lib.unimelb.edu.au:2150/link/service/series/0558/tocs/t2066.htm (2 of 3) [10/18/2002 8:10:49 PM]

Springer LINK: Lecture Notes in Computer Science

LNCS 2066, p. 126 ff. Abstract | Full article in PDF (129 KB) InterDB, a Prediction-Oriented Protein Interaction Database for C. elegans Nicolas Thierry-Mieg and Laurent Trilling LNCS 2066, p. 135 ff. Abstract | Full article in PDF (110 KB) Application of Regulatory Sequence Analysis and Metabolic Network Analysis to the Interpretation of Gene Expression Data Jacques van Helden, David Gilbert, Lorenz Wernisch, Michael Schroeder, and Shoshana Wodak LNCS 2066, p. 147 ff. Abstract | Full article in PDF (240 KB) Author Index LNCS 2066, p. 165 Author Index in PDF (14 KB) Online publication: June 28, 2001 [email protected] © Springer-Verlag Berlin Heidelberg 2001

http://buffy.lib.unimelb.edu.au:2150/link/service/series/0558/tocs/t2066.htm (3 of 3) [10/18/2002 8:10:49 PM]

Speeding Up the DIALIGN Multiple Alignment Program by Using the ‘Greedy Alignment of BIOlogical Sequences LIBrary’ (GABIOS-LIB) Sa¨ıd Abdedda¨ım1 and Burkhard Morgenstern2 1

LIFAR - ABISS, Facult´e des Sciences et Techniques, Universit´e de Rouen, 76821 Mont-Saint-Aignan Cedex, France, [email protected] 2 AVENTIS Pharma, Rainham Road South, Essex RM10 7XS, UK. Present address: MIPS, Max-Planck-Institut f¨ ur Biochemie, Am Klopferspitz 18a, 82152 Martinsried Germany [email protected]

Abstract. A sensitive method for multiple sequence alignment should be able to align local motifs that are contained in some but not necessarily in all of the input sequences. In addition, it should be possible to integrate various of such partial local alignments into one single multiple output alignment. This leads to the question of consistency of partial alignments. Based on a new set-theoretical definition of sequence alignment, the consistency problem is discussed theoretically, and a recently developed library of C functions for consistency calculation (GABIOSLIB) is described. GABIOS-LIB has been integrated into the DIALIGN alignment program to carry out consistency tests during the multiple alignment procedure. While the resulting alignments are exactly the same as with the previous version of DIALIGN, the running time of the program has been crucially improved. For large data sets, the new version of DIALIGN is up to 120 times faster than the old version. Availability: http://bibiserv.TechFak.Uni-Bielefeld.DE/dialign/ Keywords: multiple sequence alignment, partial alignment, consistency, consistent equivalence relation, greedy lgorithm.

1

Introduction

Traditionally, there are two different approaches to sequence alignment: global methods that align sequences over their entire length [8,21,26,25] and local methods that try to align the most highly conserved sub-regions of the input sequences [24,23,3,13]. One problem with these approaches is that it is often not known in advance if sequences are globally or only locally related. A versatile alignment tool should align those regions of the input sequences that are sufficiently similar to each other but it would not try to align the non-related parts of the sequences. Thus, such a program would return a global alignment whenever sequences are globally related but a local alignment if only local homology can be O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 1–11, 2001. c Springer-Verlag Berlin Heidelberg 2001

2

S. Abdedda¨ım and B. Morgenstern

detected. One possible way to achieve this is to integrate statistically significant (partial) local alignments P1 , . . . , Pk into one resulting output alignment A. The idea to generate sequence alignments by combining partial alignments of local similarities is not new. Various authors have proposed to generate pairwise local or global alignments by chaining fragment alignments, see Wilbur and Lipman [32], Eppstein et al. [7], and Chao and Miller [4]. These authors have developed time and space efficient fragment-chaining algorithms for near-optimal alignment in the sense of the traditional Needleman-Wunsch [21] and SmithWaterman [24] objective functions. Joseph et al. [11] have proposed a greedy algorithm that is based on statistically significant segment pairs. Algorithms that integrate local alignments have also been proposed for multiple alignment. Here, the problem is to decide whether a collection of local alignments is consistent. Informally, we call a set {P1 , . . . , Pk } of partial alignments consistent if an alignment A of the input sequences exists such that every pair of residues that is aligned by one or several of the alignments Pi is also aligned by A. A formal definition of our notion of consistency will be given in the next section. The question of consistency is easy to decide if each local alignment Pi involves all of the input sequences. Vingron and Argos [30], Vingron and Pevzner [31] and Depiereux et al. [6,5] have proposed multiple alignment methods that search for motifs that are simultaneously contained in all input sequences. In this case, a sufficient condition for consistency is that, for any two local alignments Pi and Pj , either Pi Pj or Pj Pi holds where Pi Pj means that in every sequence, residues aligned by Pi are to the left of residues aligned by Pj . From a biological point of view, however, it is desirable to allow for homologies involving not all but only some of the input sequences. A multiple alignment program that finds only those similarities that are present in all sequences in a given input data set will necessarily miss many biologically important homologies. Recently, we have introduced three heuristics for multiple alignment that integrate partial local alignments not necessarily involving all of the input sequences. These methods generate multiple alignments in a greedy way by incorporating local partial alignments one-by-one into a resulting multiple alignment. SOUNDALIGN [1] assembles multiple alignments from blocks of un-gapped local alignments that may involve two or more sequences, DIALIGN [15,17,18] uses ungapped segment pairs – so-called fragments or diagonals –, and TWOALIGN [2] combines pairwise local alignments in the sense of Smith and Waterman [24] to obtain a final multiple alignment. During the greedy procedures, these three programs have to test new partial alignments for consistency with those alignments that have already been accepted for the final alignment. To this end, they store and update certain data structures that are called transitivity frontiers for SOUNDALIGN and TWOALIGN and consistency bounds for DIALIGN. DIALIGN has been successfully used to detect local homologies in nucleic acid and protein sequences. In a recent study, G¨ ottgens et al. [10] have used DIALIGN to align large genomic sequences from human, mouse and chicken. In the human/mouse alignment, multiple peaks of homology were found, some

Speeding Up the DIALIGN Multiple Alignment Program

3

of which precisely correspond to known enhancers. In addition, the DIALIGN multi-alignment of human, mouse and chicken revealed a new region of homology that was then experimentally shown to be a previously unknown enhancer, see also Miller [14] for a discussion of these results. Thompson et al. [28], have used the BAliBASE of benchmark alignments [27] to systematically compare the most widely used programs for multiple protein alignment. Here, DIALIGN was reported to be the most successful local method. It also performed well on globally related sequence sets though here CLUSTAL W [26], PRRP [9] and SAGA [22] were superior. The paper by Thompson et al., however, addressed also a major weakness of DIALIGN: it is considerably slower than progressive alignment methods. Aligning a set of 89 histone sequences took as much as 13,649 s with DIALIGN compared to 161 s with CLUSTAL. This may not be a serious problem if only a single protein family is studied. However, with the huge amount of genomic sequence data that are now available, automatic alignment of whole data bases of sequence families has become a routine task, see, for example, [12,29]. Here, program running time is a crucial criterion in choosing a suitable alignment method. Test runs have shown that for large sequence sets, the procedure of updating the consistency bounds was by far the most time-consuming step in previous versions of DIALIGN. Abdedda¨ım has recently developed a library of C functions called GABIOS-LIB (Greedy Alignment of BIOlogical Sequences LIBrary) that can be used to efficiently calculate the transitivity frontiers and to consistency-check (partial) local alignments that may have been produced by arbitrary methods. We have integrated GABIOS-LIB into DIALIGN to speed up the consistency check for fragments (segment pairs). In the present paper, the time and space complexity of GABIOS-LIB is analysed theoretically and compared to the method that was previously used in DIALIGN. Experiments with both artificial and real sequence data demonstrate that GABIOS-LIB is far more efficient than the previous procedure. In our test examples, the new version 2.1 of DIALIGN is up to 120 times faster than version 2.0 while the resulting alignments are exactly the same. In addition, GABIOS-LIB has reduced the amount of computer memory used by DIALIGN.

2 2.1

The Consistency Problem for Partial Alignments Definitions and Notations

Let S = {S1 , . . . , SN } be a sequence family and let X be the set of all sites of S where a site x = [i, p] represents the p-th position in the i-th sequence. On X, we define a partial order relation such that for any two sites x = [i, p] and x0 = [i0 , p0 ], x x0 holds if and only if both i = i0 and p ≤ p0 are true. In the language of order theory, is the direct sum of the ‘natural’ linear order relations that are given on the individual sequences. Every binary relation R on X extends the relation to a quasi order relation R = ( ∪R)t on X, where St denotes the transitive closure of a relation S, i.e. the smallest transitive relation containing S, see Fig. 1.

4

S. Abdedda¨ım and B. Morgenstern S1

s

s

S2

s

s

S3

s

u

s

s

s

s

s

s

As

s

wA KA

sA s

A

v

s y

s s

x

s

s

s

s

s

s

z

s

s

s

s

s

s

Fig. 1. A relation R = {(v, w), (x, y)} (represented by arrows) defined on the set X of all sites (black dots) of a sequence family S = {S1 , S2 , S3 } extends the partial order relation on X to a quasi partial ordering R = ( ∪R)t . We have, for example, u v, vRw , w x, xRy, y z and therefore u R z.

We call a relation R on X consistent if the extended relation R preserves the linear order relations on the individual sequences, formally: if all restrictions of R to the individual sequences coincide with their respective ‘natural’ linear order relations. In other words, the requirement is that for any two sites x and y that belong to the same sequence, x R y implies x y. Moreover, we call a set {R1 , . . . , Rn } of relations consistent if the union ∪i Ri is consistent, we say that R1 is consistent with R2 if {R1 , R2 } is consistent, and a pair (x, y) ∈ X 2 is called consistent with a relation R if {(x, y)} is consistent with R. As proposed in [17], a (partial) alignment A of the family S can be defined as a consistent equivalence relation on the set X where we write (x, y) ∈ A or xAy if the sites x and y are either aligned by A or identical. As an equivalence relation, an alignment A partitions X into equivalence classes [x]A = {y ∈ X : (x, y) ∈ A}. It can be shown that an equivalence relation A on X is an alignment in the sense of the above definition if and only if it is possible to introduce gap characters into the sequences Si such that the equivalence classes [x]A , x ∈ X are precisely those sets of sites that are in the same column of the resulting two-dimensional array, see [20] for more details. A common feature of greedy multiple alignment algorithms is that they include partial alignments P1 , . . . , Pk one after the other into a growing multiple alignment – always provided that a new alignment Pi is consistent with those alignments that have been included previously. Formally, a monotonously increasing set A1 ⊂ . . . ⊂ Ak of alignments is defined by A1 = P1 (Ai−1 ∪ Pi )e if Pi is consistent with Ai−1 i = 2, . . . , k. Ai = Ai−1 otherwise,

(1)

A final alignment A is then obtained as the largest alignment A = Ak of this set. Therefore, every greedy alignment approach has to resolve the question of consistency: at any stage of the alignment procedure, it must be known which pairs (x, y) ∈ X 2 are still alignable without leading to inconsistencies with the current alignment Ai , i.e. with those pairs of sites that have already been accepted for the final alignment.

Speeding Up the DIALIGN Multiple Alignment Program

2.2

5

Transitivity Frontiers and Consistency Bounds

An alignment A of a sequence family S imposes for every site x ∈ X and every sequence Si ∈ S a lower bound bA (x, i) and an upper bound bA (x, i) such that a site y = [i, p] ∈ Si is alignable with x without leading to inconsistencies with A if and only if bA (x, i) ≤ p ≤ bA (x, i) holds, see Fig. 2 for an example. Formally, we define bA (x, i) = min{p : (x, [i, p]) consistent with A} and bA (x, i) = max{p : (x, [i, p]) consistent with A}. In order to test fragments for consistency during the greedy procedure, previous versions of DIALIGN calculated and updated these consistency bounds. GABIOS-LIB is using a so-called transitivity frontiers to carry out the same consistency check. Here, the predecessor frontier P redA (x, i) is defined as the position of the right-most site y in sequence Si such that y A x is true, and the successor frontier SuccA (x, i) is defined accordingly as the position of the left-most site y in sequence Si with x A y so we have P redA (x, i) = max{p : [i, p] A x} and SuccA (x, i) = min{p : x A [i, p]}.

1

2

3

4

5

6

S1

s

s

s

s

s

s

S2

s

s

s

s

s

S3

s

s

s

s

s

S4

s

s

s

s

x

s

u

s

7 v

s s

s

s

s

s

8

12

13

s s s s s J J J J J s J s J s J s s J J J s sJ sJ sJ s J J J s s Js Js J s

s

wJ

9

10

11

s s s

Fig. 2. For an alignment A (bold lines) and a site x, the transitivity frontiers with respect to a sequence Si coincide with the corresponding consistency bounds if x is aligned to some site in Si . For example, site u in S2 is aligned with x, so we have bA (x, 2) = SuccA (x, 2) = 6. In sequence S1 , on the other hand, v is the right-most site that can be aligned with x, but w is the left-most site with x A w, so we have bA (x, 1) = 7 but SuccA (x, 1) = 8.

6

S. Abdedda¨ım and B. Morgenstern

The transitivity frontiers are related to the consistency bounds in the following way: if x is already aligned to some site [i, p] in sequence Si , then the predecessor and successor frontiers of x with respect to Si both equal p and they coincide with the consistency bounds, i.e. one has P redA (x, i) = SuccA (x, i) = bA (x, i) = bA (x, i) = p. This is, for example, the case for the frontiers and bounds of x with respect to S2 in Fig. 2. In contrast, if no site in Si is aligned with x, one has P redA (x, i) = bA (x, i) − 1 and

SuccA (x, i) = bA (x, i) + 1

as is the case with the corresponding frontiers and bounds with respect to S1 in Fig. 2. Therefore, if it is known for every site x and every sequence Si whether x is aligned with some site in Si , transitivity frontiers are easily obtained from the consistency bounds and vice versa, so both data structures are equivalent in that they contain the same information about which residue pairs are alignable under the consistency constraints imposed by a given alignment A.

3

GABIOS-LIB

The Greedy Alignment of BIOlogical Sequences LIBrary (GABIOS-LIB) is a set of functions implemented in ANSI C by Abdedda¨ım. These functions can be used by any greedy alignment program in order to test in constant time which sites in a sequence Sj are alignable with a site x of an other sequence Si . Each time two sites are aligned during the greedy procedure, GABIOS-LIB updates the transitivity frontiers using the incremental algorithm EdgeAddition presented in [2]. In addition, GABIOS-LIB uses some ideas first introduced in [1] to further reduce computing time and memory. In this section, we discuss how the successor frontiers SuccA (x, i) for an alignment A are affected if a (partial) alignment P is added to A and how the frontiers SuccB (x, i) for the resulting alignment B = (A ∪ P )e can be calculated. For symmetry reasons, all results apply to the predecessor frontiers as well. 3.1

The Incremental Algorithm EdgeAddition

First of all, a simple but important observation is that if a new partial alignment P is added to an existing alignment A, the frontiers with respect to the new alignment B = (A ∪ P )e need to be calculated only if B is actually different from A which is the case if and only if P is not already a subset of A. EdgeAddition stores for every site x and every sequence Sj the information whether x is already aligned with some residue from sequence Sj ; this information is used to check if a new alignment P is already contained in A.

Speeding Up the DIALIGN Multiple Alignment Program

7

If and only if P 6⊂ A holds, the transitivity frontiers SuccB are different from the frontiers SuccA . In general, however, the frontiers will change not for all but only for some sites, and the computing time can be minimized by identifying those sites. For simplicity, we consider the simplest case where a single pair of sites (x, y) is added to A. Observation 1 Let A be an alignment of a sequence family S and (x, y) a pair of sites that is consistent with A. Let B = (A ∪ {(x, y)})e be the alignment that is obtained by ‘adding’ (x, y) to A. Then for every site u in a sequence Si and for every sequence Sj the successor frontiers of B are min{SuccA (u, j), SuccA (y, j)} if u A x SuccB (u, j) = min{SuccA (u, j), SuccA (x, j)} if u A y otherwise. SuccA (u, j) It follows that the transitivity frontiers can change only for those sites u for which either u A x or u A y is true. 3.2

Further Reduction of Computing Time and Memory

In order to further reduce the computational costs for calculating the consistency frontiers, GABIOS-LIB uses the following two facts. (1) If two sites x and y are aligned by A, i.e. if xAy holds, then they have necessarily the same frontiers SuccA (x, i) = SuccA (y, i) for i = 1, . . . , N . Therefore, rather than processing the transitivity frontiers for all individual sites x, GABIOS-LIB stores and updates the frontiers for those equivalence classes [x]A that consist of more than one single site. (2) Let x = [i, p] be an orphan site in the i-th sequence, i.e. a site that is not aligned with any other site y 6= x. Then the successor frontiers SuccA (x, j) with respect to all sequences Sj 6= Si , coincide with the corresponding frontiers of the left-most non-orphan site y = [i, p0 ] with p0 > p. In Fig. 2, for example, v = [1, 7] is an orphan site in S1 . The left-most non-orphan site [1, p0 ] with p0 > 7, is the site w = [1, 8]. Therefore for j 6= 1, the successor frontiers SuccA (v, j) coincide with the corresponding frontiers for w, i.e. we have SuccA (v, j) = SuccA (w, j) for j = 2, 3, 4. Thus, instead of storing the transitivity frontiers for an orphan site x, the corresponding site y can be stored in a tabular nextClass by defining nextClass[x]= p0 . This way, the frontiers SuccA (x, j) of an orphan site x can be established in constant time each time a new pair of sites is aligned.

4

Time and Space Efficient Multiple Segment-by-Segment Alignment

In order to construct a multiple alignment of a sequence family S, DIALIGN calculates in a first step optimal pairwise alignments for all possible pairs of input sequences as explained in [16]. Optimality, in this context, refers to the segment-based objective function used in DIALIGN as defined in [19,15], i.e. an optimal pairwise alignment is a chain of fragments (gap-free segment pairs)

8

S. Abdedda¨ım and B. Morgenstern

Table 1. Running time t0 and t1 of versions 2.0 and 2.1 of DIALIGN. Version 2.0 is using the old method of calculating the consistency bounds while version 2.1 is calculating the transitivity frontiers using GABIOS-LIB. Sequences are (I) independent and (II) identical random sequences of length 100, (III) Immunoglobulin domain sequences of average length 63.3 and 20% average identity, and (IV) Ribosomal protein L7/L12 C-terminal domain sequences of average length 119.3 and 53% average identity. The running time improvement achieved by GABIOS-LIB strongly increases with the number N of sequences to be aligned. For the two sets of real-world sequences, figures are comparable to the case of identical random sequences where an improvement of up to 120 was achieved. DIALIGN 2.0 t0

t0 /N 3

25 50 (I) 75 100 150 200

22 168 601 1402 5725 14353

1.4 1.3 1.4 1.4 1.6 1.7

−3

10 10−3 10−3 10−3 10−3 10−3

25 50 (II) 75 100 150 200

46 640 3282 10557 52431 177423

2.9 5.1 7.7 1.0 1.5 2.2

25 (III) 50 75 100

25 341 1597 5046

25 (IV) 50 75 100

61 844 4157 11619

N

DIALIGN 2.1

t0 /N 4

t1

2.0/2.1

t1 /N 2

t1 /N 3

t0 /t1

−4

5.6 2.6 1.8 1.4 1.1 8.9

−5

10 10−5 10−5 10−5 10−5 10−6

10 41 100 220 576 1874

0.016 0.016 0.017 0.022 0.025 0.046

6.4 3.3 2.3 2.2 1.7 2.3

10 10−4 10−4 10−4 10−4 10−4

2.2 4.1 6.0 6.8 10.0 7.6

10−3 10−3 10−3 10−2 10−2 10−2

1.1 1.0 1.0 1.0 1.0 1.1

10−4 10−4 10−4 10−4 10−4 10−4

10 43 104 200 555 1429

0.016 0.017 0.018 0.020 0.024 0.035

6.5 3.4 2.4 2.0 1.6 1.7

10−4 10−4 10−4 10−4 10−4 10−4

4.6 14.9 31.6 52.8 94.5 124.2

1.6 2.7 3.7 5.0

10−3 10−3 10−3 10−3

6.4 5.4 5.0 5.0

10−5 10−5 10−5 10−5

7 29 73 147

0.011 0.011 0.012 0.014

4.4 2.3 1.7 1.4

10−4 10−4 10−4 10−4

3.5 11.7 21.8 34.3

3.9 6.7 9.8 1.1

10−3 10−3 10−3 10−2

1.5 1.3 1.3 1.1

10−4 10−4 10−4 10−4

14 63 149 288

0.022 0.025 0.026 0.028

8.9 5.0 3.5 2.8

10−4 10−4 10−4 10−4

4.3 13.3 27.8 40.3

Pk f1 . . . fk such that i=1 w(fi ) is maximal. Here, w is a weighting function defined on the set of all possible fragments, and fi fj means that the end positions of fi are both strictly smaller than the respective beginning positions of fj . Fragments from the respective optimal pairwise alignments are then greedily integrated into a resulting multiple alignment A. Let L be the maximum length of the sequences S1 , . . . , SN . Since O(N 2 ) pairwise alignments are performed each of which consisting of at most L fragments, O(N 2 L) fragments are to be checked for consistency. Updating the consistency bounds takes O(N 2 ) time if a new fragment is accepted for alignment. Since previous versions of DIALIGN calculated the consistency bounds for every fragment that was accepted for alignment, the

Speeding Up the DIALIGN Multiple Alignment Program

9

Table 2. Number of integers allocated by DIALIGN 2.0 and DIALIGN 2.1 for storing consistency bounds (#CB) and transitivity frontiers (#T F ), respectively. Sequence sets are as in Table 1. N 25 50 75 100 25 50 75 100

(I)

(II)

#CB

#T F

#CB #T F

125,000 500,000 1,125,000 2,000,000

5,000 10,000 15,000 20,000

25.0 50.0 75.0 100.0

125,000 500,000 1,125,000 2,000,000

28,050 106,800 204,150 375,000

4.5 4.7 5.5 5.3

(III)

(IV)

#CB

#T F

#CB #T F

85,700 346,600 780,000 1,382,800

11,500 43,900 76,650 142,400

7.5 7.9 10.2 9.7

148,800 609,600 1,356,450 2,403,800

13,300 75,600 145,350 212,200

11.2 8.1 9.3 11.3

worst-case time complexity of the entire greedy procedure was O(N 4 L). Table 1 shows that the running time of DIALIGN 2.0 is in fact proportional to N 4 if identical sequences are aligned where all fragments from the optimal pairwise alignments are necessarily consistent and are therefore integrated into A. Let us consider a consistent set of pairs {(x1 , y1 ), . . . , (xm , ym )} that are successively integrated into a growing set of alignments A1 ⊂ . . . ⊂ Am by defining Ai = {(x1 , y1 ), . . . , (xi , yi )}e . It was shown in [2] that if each pair (xi , yi ) is actually new, i.e. not yet contained in the previous alignment Ai−1 , then GABIOS-LIB takes O(N 2 m + |X|2 ) time to update the transitivity frontiers during the procedure of integrating all m pairs. It was also explained in [2] that there can be at most |X| ‘new’ pairs of sites – every additional pair would either be inconsistent or would already be contained in the current alignment. Therefore, GABIOS-LIB takes O(N 2 |X| + |X|2 ) = O(N 3 L + N 2 L2 ) time to compute the transitivity frontiers while integrating an arbitrary set of pairs. Note that this complexity analysis is a worst-case estimate. As shown in Table 1, the real running time of GABIOS-LIB is better than O(N 3 ). The new version 2.1 of DIALIGN is still slower than the most widely used multi-alignment program CLUSTAL W [26], but the difference in running time is now reduced to a factor of about 10. Parameter optimization should further decrease the running time of DIALIGN. In the previous version of DIALIGN, the consistency bounds were stored for every site x ∈ X. Since for every x, 2N integer values had to be considered – upper and lower bounds with respect to all N sequences –, computer memory had to be allocated for exactly 2N |X| integer values. By contrast, GABIOSLIB stores 2N transitivity frontiers only for non-orphan equivalence classes. For orphan sites, single integers are stored that refer to the next non-orphan site in the same sequence. In the worst case, the number of non-orphan equivalence classes is in the order of |X|, so the space complexity for GABIOS-LIB is upper-

10

S. Abdedda¨ım and B. Morgenstern

bounded by O(|X|N ) = O(N 2 L) which was also the real space complexity of the old version of DIALIGN. Table 2 shows, however, that the actual memory requirement for storing the transitivity frontiers with GABIOS-LIB is far smaller than for storing the consistency bounds in DIALING 2.0.

References 1. S. Abdedda¨ım. Fast and sound two-step algorithms for multiple alignment of nucleic sequences. In Proceedings of the IEEE International Joint Symposia on Intelligence and Systems, pages 4–11, 1996. 2. S. Abdedda¨ım. Incremental computation of transitive closure and greedy alignment. In Proc. of 8-th Annual Symposium on Combinatorial Pattern Matching, volume 1264 of Lecture Notes in Computer Science, pages 167–179, 1997. 3. S. F. Altschul, W. Gish, W. Miller, E. M. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, 1990. 4. K.-M. Chao and W. Miller. Linear-space algorithms that build local alignments from fragments. Algorithmica, 13:106–134, 1995. 5. E. Depiereux, G. Baudoux, P. Briffeuil, I. Reginster, X. D. Boll, C. Vinals, and E. Feytmans. Match-Box server: a multiple sequence alignment tool placing emphasis on reliability. CABIOS, 13:249–256, 1997. 6. E. Depiereux and E. Feytmans. Match-box: a fundamentally new algorithm for the simultaneous alignment of several protein sequences. CABIOS, 8:501–509, 1992. 7. D. Eppstein, Z. Galil, R. Giancarlo, and G. Italiano. Sparse dynamic programming I: Linear cost functions. J. Assoc. Comput. Mach., 39:519–545, 1992. 8. O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162:705–708, 1982. 9. O. Gotoh. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264:823–838, 1996. 10. B. G¨ ottgens, L. Barton, J. Gilbert, A. Bench, M. Sanchez, S. Bahn, S. Mistry, D. Grafham, A. McMurray, M. Vaudin, E. Amaya, D. Bentley, and A. Green. Analysis of vertebrate scl loci identifies conserved enhancers. Nature Biotechnology, 18:181–186, 2000. 11. D. Joseph, J. Meidanis, and P. Tiwari. Determining DNA sequence similarity using maximum independent set algorithms for interval graphs. Lecture Notes in Computer Science, 621:326–337, 1992. 12. A. Krause, P. Nicod`eme, E. Bornberg-Bauer, M. Rehmsmeier, and M. Vingron. Www access to the systers protein sequence cluster set. Bioinformatics, 15:262–263, 1999. 13. C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment. Science, 262(5131):208–14, 1993. 14. W. Miller. So many genomes, so little time. Nature Biotechnology, 18:148–149, 2000. 15. B. Morgenstern. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15:211–218, 1999. 16. B. Morgenstern. A space-efficient algorithm for aligning large genomic sequences. Bioinformatics, in press.

Speeding Up the DIALIGN Multiple Alignment Program

11

17. B. Morgenstern, A. W. M. Dress, and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA, 93:12098–12103, 1996. 18. B. Morgenstern, K. Frech, A. W. M. Dress, and T. Werner. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14:290–294, 1998. 19. B. Morgenstern, K. Hahn, W. R. Atchley, and A. W. M. Dress. Segment-based scores for pairwise and multiple sequence alignments. In J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen, editors, Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology, pages 115– 121, Menlo Parc, CA, 1998. AAAI Press. 20. B. Morgenstern, J. Stoye, and A. W. M. Dress. Consistent equivalence relations: a set-theoretical framework for multiple sequence alignment. Materialien und Preprints 133, University of Bielefeld, 1999. 21. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453, 1970. 22. C. Notredame and D. Higgins. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Research, 24:1515 – 1524, 1996. 23. W. R. Pearson and D. J.Lipman. Improved tools for biological sequence comparison. Proc. Nat. Acad. Sci. USA, 85:2444–2448, 1988. 24. T. F. Smith and M. S. Waterman. Comparison of biosequences. Advances in Applied Mathematics, 2:482–489, 1981. 25. J. Stoye. Multiple sequence alignment with the divide-and-conquer method. Gene, 211:GC45–GC56, 1998. 26. J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673–4680, 1994. 27. J. D. Thompson, F. Plewniak, and O. Poch. BAliBASE: A benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999. 28. J. D. Thompson, F. Plewniak, and O. Poch. A comprehensive comparison of protein sequence alignment programs. Nucleic Acids Research, 27:2682–2690, 1999. 29. J. D. Thompson, F. Plewniak, J.-C. Thierry, and O. Poch. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Research, 28:2919–2926, 2000. 30. M. Vingron and P. Argos. Motif recognition and alignment for many sequences by comparison of dot-matrices. J Mol Biol, 218(1):33–43, 1991. 31. M. Vingron and P. Pevzner. Multiple sequence comparison and consistency on multipartite graphs. Advances in Applied Mathematics, 16:1–22, 1995. 32. J. W. Wilbur and D. J. Lipman. The context dependent comparison of biological sequences. SIAM J. Appl. Math., 44:557–567, 1984.

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes 1,2

1

1

2

Gisèle Bronner, Bruno Spataro, Christian Gautier, and François Rechenmann 1

Laboratoire de Biométrie et Biologie Évolutive, UMR – CNRS 5558, Université Claude Bernard Lyon1 43, bd du 11 novembre 1918, 69622 Villeurbanne cedex, France {bronner,bspataro,cgautier}@biomserv.univ-lyon1.fr 2 INRIA Rhône Alpes 655 av. de l’Europe, 38330 Montbonnot St Martin, France Francois.Rechenmann[email protected]

Abstract. Representing genomic mapping knowledge is a complex process due to the diversity of experimental approaches. The lack of obvious equivalence between different kinds of maps1 makes it difficult to use them simultaneously or in a cooperative way. Furthermore, comparative mapping is not limited to map comparison, it also implies the comparison of the elements they are composed of. Comparison thus implies the combination of different computational objects, corresponding to different levels of information. However, from the user’s point of view, systems devoted to comparative mapping should appear to process queries in a homogeneous manner, whatever the objects they imply. We propose a data model that integrates mapping data, genetical data and sequences and takes into account the specific constraints resulting from the comparison of maps between and within different species.

1

Introduction

Comparative interspecies analysis of nucleic or proteinic sequences is helpful in understanding genome evolution and function. Comparisons can be made at different levels - sequence comparisons: codon usage, alignments, phylogeny; - function comparisons: protein functions, functional domains; - structure comparisons: DNA structures, protein strutures, genome structure. The relative position of interesting genes and markers on chromosomes can be drawn on genomic maps to identify regions where candidate sequences corresponding to the markers should lie. A large number of mapping projects are providing increasing amounts of information on the localizations of biological elements. Such information, that is the position of elements on a chromosome, can be compared as sequences or biological elements. An initial analysis suggests that chromosomic structures are relatively conserved in vertebrates [1], allowing prediction of gene positions among 1

Definitions for map, markers and other biological concepts can be found in the Appendix.

O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 12-23, 2001. © Springer-Verlag Berlin Heidelberg 2001

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

13

species. As a consequence, mutual enrichment of maps from different species greatly improves the mapping approach: reference species can be used to help in the analysis of species of interest. Comparative mapping thus emerges as a powerful tool to gather new knowledge about genomes. Present mapping data management systems allow users to access biological data on single objects, but do not permit a global analysis of biological objects according to their genomic localization. However, the interest of comparative mapping is not limited to comparisons of gene localization. It can also help in resolving biological problems. For example, comparative mapping studying the C. elegans daf-18 ortholog has been used to characterize possible phenotypes associated with mutations of the human PTEN tumor suppressor gene [2]. Because of the lack of adequate tools for comparative mapping, we are developing GeMCore, a knowledge base dedicated to the management and comparison of mapping data, as well as functional and sequence data on mammals. GeMCore is part of the GeM project (for Genomic Mapping) which consists in the development of a core knowledge base (GeMCore) connected to graphical interfaces devoted to particular domains. Parallel to GeMCore, the user interface GeMME (for Genomic Mapping and Molecular Evolution) [3], is being developed. GeMME allows queries on several data types such as gene symbol, localization or expression of genes as well as queries related to sequence composition or evolution. It proposes numerical and graphical representation of data allowing visual display of synteny groups and mapto-map switching between species. GeMME contains several analytical tools devoted to molecular evolution. These tools can, for example, highlight specific spatial patterns of the human genome such as isochore [4] or autocorrelation of the silent mutation rate [5]. In the next sections, we will first present the requirements of any system dedicated to comparative mapping, then will describe the data model used in GeMCore.

2

Requirements of Mapping Data Systems

Comparative genomics involving many species requires the use of many data sources because: - each mapping project has its own official database such as the Genome Data Base (GDB) [6] for the human genome or the Mouse Genome Database (MGD) [7] for the mouse genome; - many databases of particular interest exist for different species (GDB, LDB [8], OMIM [9]). Furthermore, the diversity of experimental genome mapping approaches (genetic, cytogenetic, physical RH maps...) produces different types of maps. These maps are not equivalent, and thus define different points of view of the same object. The diversity and heterogeneity of this information (heterogeneity of the data themselves but also of their storage format) is difficult to manage. Three major problems can be described: Data inconsistency: mapping data comparisons can exhibit important contradictions among data, and sometime real errors. There are two kinds of inconsistency. Inconsistency can result from an erroneous representation of the data: in this case, the information is correct but expressed wrongly. For example, the insulin (Ins1) gene,

14

G. Bronner et al.

which has been assigned to the 11p15 region on the human genome, was drawn around the 11p12-11p14 region in the GDB chromosome 11 integrated map. Inconsistency can also be intrinsic to the data. For example, the order of markers on a map can differ depending on the mapping method that is considered (some mapping methods based on probabilistic inference sometimes infer non-consensual marker order). Here, inconsistency cannot be suppressed by correcting the information. Thus, the system must propose explicit strategies of inconsistency management and integrate new user choices in these strategies. Lack of formal links between data: for the human and mouse genomes, links between localization and sequences, genes and gene products are sometimes approximate and far from complete. For other species, the situation is even worse. There have been some attempts to automatically generate these links through semantic analysis of databank annotations. This led to the Virgil [10] or the LocusLink [11] protocols for the human genome. This strategy can associate the genetic name of a gene (and so its localization) to a genomic sequence, but cannot discriminate between different coding sequences within the same fragment. Moreover, alternative splicing which is very frequent in eukaryotes, makes it even more complex to assign a genetic name to a coding sequence. Data redundancy: parallel gathering of similar data from different sources generates strong redundancy in banks. As an example, the human insulin gene sequence is recorded four times in Genbank. Comparative mapping data management systems must be able to handle data diversity, data sources and constraints inherent to such information. They are not limited to gathering and integrating pre-existing data, but consist in distributing a clarified, completed and organized information set capable of reducing the difficulties described above. This pre-processing implies the development of procedures specific to this stage of knowledge base building. This phase also requires the use of external programs such as Clustal [12] or Blast [13, 14] that should not be integrated in the knowledge base. As the data gathering protocol is designed for a specific problematic, it does not meet our needs for producing a generic knowledge base, although a specific interface devoted to the data gathering protocol has been developed. This interface handles some aspects of data consistency management, redundancy avoidance and data linkage as well reorganization of data gathered from diverse databases into a common format. Because certain procedures such as identification of homologies require manual expertise, the interface is currently only partly automated. This procedure allows the building of an integrated dataset. It will constitute the starting point of the data gathering of the knowledge base. As the GeMCore implementation is still in progress, this dataset is currently used to test the system. The dataset consist in cytogenetical descriptions of human and mouse chromosomes, location data for human and mouse genes, sequences and sequence analysis (such as G+C content, CDS length, alignment length, substitution rates...), expression data, functional homology data, evolutive relations and recombination rates. In this dataset, links between genes and coding sequences are built from the NCBI LocusLink file for human and the MGD MRK_Sequence.rpt file for mouse. The relevance of all associations between a CDS (Coding DNA Sequence) and a gene symbol are evaluated. To discriminate between coding sequences associated to the

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

15

Table 1. Description of the dataset used in the evaluation of GeMCore. data locations

Source human mouse

gene/sequence links

+

7203 genes

locuslink4, Hovergen5 MGD, Hovergen

11040 links

sequence analysis

Hovergen, Gccompute

22883 seq.

CDS analysis

Hovergen, Gccompute

21066 seq.

homology

MGD

4044 cples of genes

orthology – substitution rates

3998 cples

human

Orthogen, JaDis6 LDB

mouse

MGD

20 chr.

cytogenetic maps

human

file size 4964 genes

LDB 2 MGD3

mouse

14172 links

23 chr.

same gene symbol, we use similarity information expressed through the family and redundancy numbers of the Hovergen database [15]. Hovergen redundancy numbers group under a common annotation redundant sequences from Genbank. Two sequences are said to be redundant if they have less than 1% divergence and at most 3 nucleotides of difference in length. The family numbers group homologous sequences (orthologous or paralogous genes resulting from duplication anterior to the divergence of vertebrates) under the same identifier. For each locus symbol, LocusLink provides a list of accession numbers, that is a list of genomic fragments of Genbank. A method has been developed to generate a list of relevant couples of gene symbol / gene sequence (CDS) extracted from these fragments. First, all links between a gene symbol and a CDS are listed. For each symbol, if all CDS have the same redundancy number, they are assumed to be several entries of the same gene and are all associated to the locus symbol. If several redundancy numbers are associated to the same gene symbol, we search for a CDS having the locus symbol cited in the corresponding fields of Genbank annotations. If found, only the CDS with this redundancy number are associated to the locus symbol. To keep most of the LocusLink information, we introduce two “weak” links between locus symbols and CDS belonging to the genomic fragment associated in LocusLink: -

“alternative” CDS have a different redundancy number than the one associated to the linked CDS but belong to the same family “putative” CDS correspond to cases where all CDS belong to the same family, but have different redundancy number and no locus symbol in Genbank annotations.

2 Location DataBase - http://cedar.genetics.soton.ac.uk/public_html/ldb.html 3 Mouse Genome Database - http://www.informatics.jax.org 4 LocusLink - http://ncbi.nlm.nih.gov/LocusLink/index.html 5 Homologous Vertebrate Gene Database http://pbil.univ-lyon1.fr/hovergen.html 6 JaDis - http://pbil.univ-lyon1.fr/software/jadis.html

16

G. Bronner et al.

Other cases, and particularly those where inconsistencies are present are also carefully noted. Presently, this data gathering procedure applies exclusively to genes, but we intend to extend it to other genomic elements, first of all to anonymous markers that are mapped in many map types (this would allow the user to directly compare different maps types), but also biologically informative sites such as satellites, breakpoints, CpG islands etc.

3

What Kind of System for Comparative Mapping?

A system managing comparative mapping knowledge must allow comparisons of marker position in several maps, but also the mapping of biological properties (G+C content, expression…) and their inter specific comparisons. This implies constraints both on data representation and querying capabilities. Very few databases are dedicated to comparative mapping. Present systems offer different levels of integration of mapping information. We can mention the Animal Genome database in Japan [16], which focuses on comparative gene mapping among pig, cattle, mouse and human. This database contains information concerning genes related to phenotypic traits and any genetic and cytogenetic localization known. Graphical map display is also available. It allows a one locus access, though precluding global analysis. Homology relations between genes are stipulated in the database, however interspecific links between data are not always symmetrical. There is also MICADO [17], a database that manages bacterial genomic information. MICADO links eubacterial and archaebacterial sequences, genetic maps and information on mutants. A graphical interface allows the parallel comparison of genomes. Hyperlinks are created as a complement, to reference other general and specialized related information resources. Databases based on the ACE model [18] such as AceDB [18] or AtDB [20] allow the comparison of different maps within a species. To a lesser extent, GDB, MGD or Bovmap [21] can also be cited. All of them are in fact specific databases, though they should not be considered as comparative mapping oriented databases. However, in addition to links to other data resources, they contain modules to compare maps among species. Most of present systems thus use a simplified representation of mapping data; queries are mainly limited to intuitive navigation through information. The more complete system (ACE) is the only one allowing recursive representation of mapping (a map becomes a marker in a larger map) but remains monospecific. Examining the querying capabilities is the simplest way to point out limitations of present systems. None of them are able to answer “simple” requests such as: " list pairs of homologous genes localized on the murin chromosome 2 and the human chromosome 1" or " get all human, complete genes sequences with a known orthologous gene in mouse, which has a sequenced neighboring gene at less than 1 cM" Moreover, user-controlled inferences have to be applied to generate the complete answer to these questions. For example, reasoning can enrich one type of map from another to produce a more complete answer (see Fig. 1). Reasoning can also apply to

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

17

inference of markers themselves, for example repetitive elements. User control is also necessary here, particularly when no consensual definition exists, e.g. CpG islands. Finally, the user should find analysis tools that will enable him to answer biological questions. Let us take an example: the study of regional recombination rates. The idea is to exhibit the features of chromosomal regions that present variations in the frequency of the recombination events. To answer this question, genetical distances are to be expressed as a function of physical position in order to obtain a recombination profile along the chromosome. Presently both physical and genetical localization are not associated to genes, but to anonymous fragments of DNA. To associate genic information to recombination frequencies, mapped genes have to be added to the recombination profile. Finally, the analysis of sequences associated to genes makes it possible to study the variations of recombination levels along the chromosome in relation to genic contextual features. a)

cM

cM

b)

Ma. 4

Ma. 3

? Ma. 1

Ma. 2

Seq. 3 Seq. 2

Seq. 1

Fig. 1. Non explicit information can be extracted from positional data. (a) - Both physical (bottom) and genetic (top) localizations may be available for some genes or markers, but when there is only a physical localization, it can be used to infer the genetic localization. Because the relation between genetic and physical maps is not linear, such inference is not obvious. In order to make it more clear, a preliminary modeling operation is required to establish the relation between physical and genetic mapping information. (b) - Local connection of nucleotide sequences can also be used to infer element localization, although this implies taking into consideration orthologous elements. Consequently, considering mapping information, as well as evolutive relations, is a prerequisite to inferring positional information from sequence appendings.

4

What Kind of Model for Comparative Mapping?

As seen above, complex information of very different types must be managed. The complexity results from the diversity of the information, but also from the huge number of interrelations between this information. The chosen model must take into account both type complexity and the central role played by relations. RDBMS (Relational Data Base Management Systems) manage relations efficiently, so they can produce strong and reliable data models. Nevertheless, the relational model is not adapted to complex data schemes. Conceived for tabulated data, it cannot express the complexity of inter-relations of biological objects (see the evolution of the GDB data model [22]). ACEDB is a model dedicated to genomic mapping in one species. As we have seen, its query capabilities and information representation are not sufficient for com-

18

G. Bronner et al.

parative mapping. More generally, the developments needed appear to be too large to construct a new dedicated model, even if these models are always more efficient in the context of their development (see ACEDB or ACNUC [23] for sequence management). OODBMS (Object Oriented Data Base Management Systems) particularly fit for representations of biological concepts: they allow a coupling of the data and the available operators, thus producing a relatively intuitive model of the biological entities represented. Computational object manipulation mimics biological behavior. As genomics is evolving due to huge genome mapping and sequencing projects, data management systems must be flexible and capable of easily integrating new fields of knowledge. Object systems have such capabilities, allowing the construction of modular and hierarchical structures, and thus providing real adaptability to complex data. Nevertheless, relations cannot be clearly defined. In order to express relations within an object oriented system, we designed a conceptual data scheme based on UML (Unified Modeling Language) definitions. This is an object oriented data model that integrates and expresses relations as peculiar objects. Ongoing research at INRIA Rhône-Alpes on AROM [24], an implementation of an UML-like modeling language as a knowledge representation system, is being used for GeMCore development.

5

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

Avoiding redundancy is crucial because redundancy produces inconsistency (when the information to be corrected is recorded many times, some recordings may go uncorrected). Inconsistency management is even more difficult if this information lies in different data structures. Hierarchical structuring of data gives a unique representation of information shared by different elements of the system, thus avoiding the multiplication of storage spaces. Furthermore, when peculiar structures associated to specific information are defined, it is easier to evaluate inconsistency than if the same type of data were stored in heterogeneous structures. In the GeMCore data model, three major hierarchy classes have been defined (Fig. 2): - the Element hierarchy: it structures a collection of genomic elements such as genes, satellites or transposable elements. This hierarchy separates the elements that have a biological reality from artificial elements produced by genome projects (e.g. clones or expressed sequence tags - EST). Biological elements are defined according to their function. Non-informative elements such as anonymous DNA are separated from functional elements. Genes are distinguished from functional elements that are not transcribed. Pseudogenes which are defined at the same level as Genes, inherit from Genes_Related elements. -

the Map hierarchy: three sub-types are defined. Two of them (Physical_map and Genetical_map) correspond exactly to the biological concepts they represent. The third type named Set concerns maps that contain sets of objects that do not have any exact localization. This last group of maps is particularly useful to represent

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

Composed Sequence

19

Ordered Not-ordered

Elementary

SEQUENCES Gene Gene Related

Is Sequenced Genomic Element

Transposable

Elements

EST G.P.Element

Clone

ELEMENTS Is Mapped Physical Map Map

Genetic Map

Pseudogene

Chromosome Element A

ClasseI

Is Syntenic

Is Homolog

Element B

Element C

Continuous Contig Absolute

Map A Allen

Relative Set

Cytog.

Map B

RH MAPS

Fig. 2. Simplified scheme of the GeMCore data model. An entity, for example a gene, can be seen as a functional element, a marker on a map or a sequence depending on the objective of the user. In GeMCore, these different levels of information are described in three hierarchy classes: Sequence, Map and Element (left). To express the links existing between these three kinds of objects, two associations (inserts) were defined. Is_Mapped and Is_sequenced join maps and sequences to the biological element respectively. Associations specific to some data levels are also defined (right). Evolutive relations (Is_Homolog) are defined for biological elements. The Is_syntenic association expresses that elements are on the same chromosome, whereas Allen association allows reasoning on the relative position of maps

cytogenetic bands or hybrid radiation segments. It also allows users to define their own sets of elements with regard to the question they want to answer (for example, the map of disease genes). -

the Sequence hierarchy: it organizes sequences through a hierarchy that distinguishes elementary from non elementary sequences. Elementary sequences are made of a unique nucleic acid fragment, while Composed sequences correspond to entries that describe collections of fragments that are more or less ordered (currently the unfinished genome sequences from the academic human genome project consortium). Some complementary classes such as Species, Protein or Information, have been defined in addition to the hierarchies described above. Basically, two kinds of relations are defined in the model: -

Comparison relations express evolutive relations (Fig. 3) or similarity relations depending on the objects that are linked (biological objects or nucleotide sequences).

20

G. Bronner et al.

Homology Orthology

Paralogy

Fig. 3. Homology relations can be specialized. Orthology, which links two genes that result from a speciation event, can be distinguished from paralogy that links elements deriving from a duplication event. Both naturally inherit from the homology relation.

- Positional relations defined by Allen algebra [25]. Allen introduced a formalism in order to represent intervals by their ends (that is two coordinates on one axis) and suggested the use of constraint satisfaction to reason on these relations. In GeMCore, these relations are structured in a hierarchy (Fig. 4).

Fig. 4. (a) - According to the Allen formalism, for the A gene to be before the B gene, it is sufficient that the upper coordinate of B be bigger than the lower coordinate of A. (b) - With his formalism, Allen defines thirteen binary relations between interval pairs. In the GeMCore Model, these relations are summarized and structured into a hierarchy where the lower the hierarchy level, the more precise the features of the relative positions of the related elements.

As for classes, certain relations lie outside the hierarchies (codes_for, is_product_of, belongs_to_species...). The three hierarchy classes have been defined to facilitate legibility, making it easier to detect inconsistencies and make updates. Since data are separated between three independent structures, one hierarchy can be updated independently of other data structures. This is particularly interesting when data come from different sources that are not updated at the same time. However, the model splits information concerning a given biological entity into three hierarchies, so that it no longer exists as a single entity within the system. The biological notions, the sequence data and the localization relative to a gene belong to a unique entity, which has to be expressed in the model. This is done by "corresponding" relations defined between the biological object, the map and the sequence (Fig. 5).

6

Conclusion

It highly important to associate all the biological data available in order to extract new information. Our goal was at first to define a generic data model allowing not only

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

Sequence Is Sequenced Elements Is Mapped

21

Fig. 5. GenomicElement, Sequence and Map classes are tightly linked: as for a sequence, one or more map can be associated to a gene. Depending on the goals of the user, the gene can be seen through its biological properties, its sequence or its localization. Thus it is necessary to express the links existing between all the objects describing the same biological entity. These links are modeled in GeMCore through the relations Is_Sequenced and Is_Mapped.

Map

data management, but also comparative analysis of mapping of biological and molecular data. GeMCore is an entity-association model designed to perform object comparisons, as well as evaluation and comparison of the relations between the objects in the database. This can be done thanks to the explicit representation of these relations. Our model takes into account the constraints inherent to the domain of comparative mapping and includes many data types (maps, location data, sequence phenotypes, homology…), thus making it valuable for many fields of biology (molecular biology, molecular evolution, medicine, agronomy). A concrete confrontation of the model with real data and analysis will be required for validation. We are thus working on a sub-model in the laboratory of Biométrie et Biologie Evolutive in Lyon using the AROM knowledge representation system. This knowledge base contains human and mouse data. It will be coupled to GeMME and should include other mammalians in the future. Our problematic deals with the comparison of mapping data, and appears to be more complex than simple construction of data management systems. The goal is to produce a tool that will be able to manage profusely diverse data, while keeping in mind the biological nature of the information contained in the data to be processed. This goal implies the use of consensual terms and methodologies. When building the GeMCore data model we had to precisely define many concepts used in the field of comparative mapping and, more broadly, in methodology dedicated to the management and processing of data produced by genome projects. The GeMCore data scheme could also be helpful in defining an ontology dedicated to comparative genomic mapping.

References 1.

2.

Andersson, L., Archibald, A., Ashburner, M., Audun, S., Barendse, W., Bitgood, J., Bottema, C., Broad, T., Brown, S., Burt, D., Charlier, C., Copeland, N., Davis, S., Davisson, M., Edwards, J., Eggen, A., Elgar, G., Eppig, JT., Franklin, I., Grewe, P., Gill, RD T. 3 , Graves, J.A., Hawken, R., Hetzel, J., Womack, J., et al.: Comparative genome organization of vertebrates. The first international workshop on comparative genome organization. Mamm. Genome 7 (1996) 717-734 Rouault, J.P., Kuwabara, P.E., Sinilnikova, O.M., Duret, L., Thierry-Mieg, D., Billaud, M.: Regulation of dauer larva development in Caenorhabditis elegans by daf-18, a homologue of the tumour suppressor PTEN. Current Biology 9 (1999) 329-332

22 3.

4. 5. 6. 7.

8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.

21. 22. 23.

G. Bronner et al. Spataro, B., Gautier, C.: Analyse comparative des genomes humains et murins avec l’interface GeMME. Recueil des actes, 1ères Journées Ouvertes : Biologie, Mathématiques, Informatique, 3-5 mai 2000, Montpellier - G. Caraux, O. Gascuel, M.F. Sagot (eds.) (2000) 55-64 Bernardi, G.: Isochores and the evolutionary genomics of vertebrates. Gene 241 (2000) 317 Matassi, G., Sharp, P., Gautier, C.: Chromosomal location effects on gene sequence evolution in mammals. Current Biology 9 (1999) 786-791 Letovsky, S.: GDB: integrating genomic maps. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 85-98 Eppig, J.T., Richardson, J.E., Blake, J.A., Davisson, M.T., Kadin, J.A., Rinwald, M.: The mouse genome database and the gene expression database: genotype to phenotype. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 119-128 Collins, A., Frezal, J., Teague, J., Morton, N.E.: A metric map of humans:23,500 loci in 850 bands. Proc. Natl. Acad. Sci. USA 93 (1996) 14771-14775 Scott, A.F., Amberger, J., Brylawski, B., McKusick, V.A.: OMIM: online mendelian. inheritance in man. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 77-84 Achard, F., Cussat-Blanc, C., Viara, E., Barillot, E.: The new Virgil database: a service of rich links. Bioinformatics 14 (1998) 342-348 Pruitt, K.D., Katz, K.S., Sicotte, H., Maglott, D.R.: Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet. 16 (2000) 44-47 Thompson, J.D., Higgins, D.G., T.J., Gibson.: CLUSTALW , improving the sensitivity of progressive multiple alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acid. Res 22 (1994) 4673-4680 Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J Mol Biol. 215 (1990) 403-410 Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (1997) 3389-3402 Duret, L., Perrière, G., Gouy, M.: HOVERGEN: comparative analysis of homologous vertebrate genes. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 21-36 Wada, Y., Yasue, H.: Development of an animal genome database and its search system Comput. Applied Biosci. 12 (1996) 231-235 Biaudet, V., Samson, F., Bessieres, P.: Micado – a network oriented database for microbial genomes – Comput. Applied Biosci., 13 (1997) 431-438 Thierry-Mieg, J., Thierry-Mieg, D., and Stein, L.: ACEDB: The ACE Database Manager. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 265-278 Durbin, R., Thierry-Mieg, J.: The ACeDB genome database. In: Suhai, S. (ed.) Computational Methods in Genome Research, Plenum Press,New-York (1994) 45-55 Flanders, D.J., Weng, S., Petel, F.X., Cherry, J.M.: AtDB, the Arbidopsis thaliana database, and graphical web display of progress by the Arbidopsis genome initiative. Nucleic Acids Res. 26 (1998) 80-84 http:// locus.jouy.inra.fr/cgi-bin/bovmap/intro2.pl Fasman, K.H., Letovsky, S.I., Cottingham, R.W., Kingsbury, D.T.: Improvements to the GDB human genome database - Nucleic Acids Res. 24 (1996) 57-63 Gouy, M., Gautier, C., attimonelli, M., Lanave, C., di Paola, G.: ACNUC – a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. CABIOS 1 (1985) 167-172

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

23

24. Page, M., Gensel, J., Capponi, C., Bruley, C., Genoud, P., Ziebelin, D.: Representation des connaissances au moyen de classes et d’associations: le système AROM. Conf. langage et modèles à objet (LMO’00), Mt Saint-Hilaire, CA (2000) 91-106 25. Allen, J.F.: Maintaining knowledge about temporal intervals – Comm. of the ACM 26 (1983) 832–843

Appendix: Markers, Maps, and Recombination The genome is made of several chromosomes. Each of them includes a DNA molecule, which is a linear polymer of four nucleotides (adenine, guanine, cytosine, thymine). Genetic information is coded in these nucleotide sequences. Proteins are molecules that ensure most of the cell functions. The information necessary to build proteins is stored in protein genes. A gene is a subsequence of the genome. Expression of its information content requires two successive processes: transcription that generates a mRNA molecule (the messenger) and translation that converts this RNA molecule into protein. Unexpectedly, only a small fraction of the genome contains genes. In mammals, about 95% of genome has no genetic information content. A marker is any genomic object on the genome that can be identified non ambiguously: a gene is a marker as well as an anonymous sequence of a genomic fragment. Genome maps appear mainly as an ordered series of markers. A set of markers located on the same chromosome is said to be syntenic. Several distances between markers can be determined. When the complete genome is known, the number of nucleotides between two markers is known. This is the physical distance between the two markers. When the sequence is not known, the physical distance can be estimated from experimental data. Another important distance is the genetic distance. Each individual has two sets of homologous chromosomes, one from his father, the other from his mother. During gamete production, pairs of homologous chromosomes cross over, exchanging sequences. The genetic distance can be determined from the expected number of cross-overs between two markers (the recombination rate). Comparing physical and genetical distance is useful to address important biological points in various fields of biology including population genetics, medical research, and evolution.

Optimal Agreement Supertrees David Bryant LIRMM, 161 rue Ada, 34392 Montpellier, France Cedex 5 [email protected]

Abstract. An agreement supertree of a collection of unrooted phylogenetic trees {T1 , T2 , . . . , Tk } with leaf sets L(T1 ), L(T2 ), . . . , L(Tk ) is an unrooted tree T with leaf set L(T1 ) ∪ · · · ∪ L(Tk ) such that each tree Ti is an induced subtree of T . In some cases, there may be no possible agreement supertrees of a set of trees, in other cases there may be exponentially many. We present polynomial time algorithms for computing an optimal agreement supertree, if one exists, of a bounded number of binary trees. The criteria of optimality can be one of four standard phylogenetic criteria: binary character compatibility; maximum summed quartet weight; ordinary least squares; and minimum evolution. The techniques can be used to search an exponentially large number of trees in polynomial time.

1

Introduction

The general phylogenetic supertree problem is how to combine phylogenies for different sets of species into one large “super”-phylogeny that classifies all of the species. The problem is motivated by physical, computational, and statistical constraints on the construction of large scale phylogenies. In this paper we present a supertree method for combining a bounded number of binary, unrooted, phylogenetic trees. We search for supertrees such that every input tree is a restriction of the supertree to a subset of the set of species. When multiple supertrees exist we choose the supertree that is optimal with respect to one of four standard phylogenetic optimization criteria. Even though the number of possible supertrees can be exponentially large, we can determine an optimal supertree in polynomial time. 1.1

Agreement Supertrees

We begin with a few definitions. An unrooted phylogenetic tree T is a finite, acyclic connected graph with no internal vertices of degree two and degree one vertices (leaves) labelled uniquely by members of a finite leaf set L(T ). A tree T 0 is an induced subtree of T if T 0 can be obtained from T by deleting vertices and edges and supressing vertices of degree two. Intuitively T 0 represents the restriction of T to a subset of the leaf set. If T 0 is an induced subtree of T with leaf set L0 ⊆ L(T ) then we write T 0 = T |L0 . An agreement supertree of a collection of trees {T1 , T2 , . . . , Tk } with leaf sets L(T1 ), L(T2 ), . . . , L(Tk ) is an O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 24–31, 2001. c Springer-Verlag Berlin Heidelberg 2001

Optimal Agreement Supertrees

25

unrooted tree T with leaf set L(T1 ) ∪ · · · ∪ L(Tk ) such that each tree Ti is an induced subtree of T . The agreement supertree is the dual of the agreement subtree, recalling that S is an agreement subtree of trees T1 , . . . , Tk if S is an induced subtree of each of T1 , . . . , Tk (c.f [3,6]). A collection of trees T1 , . . . , Tk on leaf set L are agreement supertrees of a collection of trees S1 , . . . , Sl on subsets of L if and only if S1 , . . . , Sl are agreement subtrees of T1 , . . . , Tk , and every leaf in L appears in at least one Si . There can be exponentially many agreement supertrees for a collection of trees T , even when the number of trees in T is bounded [4]. When there is more than one agreement supertree we select one that is optimal according to one of four standard phylogenetic optimization criteria. The choice of the four optimization criteria is determined by algorithmic constraints - we discuss the possible extension to other criteria in Sect. 6. Note that the optimization criteria require global information for the entire set of taxa present in the input trees. Hence the algorithms are better tailored to divide and conquer applications, rather than the assembly of different data sets. Though the number of agreement supertrees can be exponentially large, the algorithms determine an optimal agreement supertree in polynomial time, provided that the number of input trees is bounded. The agreement supertree methods presented here are practical for a small number of large input trees, rather than a large number of small input trees. Finally we note that agreement supertree methods do not handle conflict in the set of input trees. If two input trees resolve the relationship between species in a different way then there will be no agreement supertrees for the collection. There are numerous possible ad hoc solutions to this shortcoming (e.g. removing leaves involved in conflicts, modifying input trees). However the more general problem of how to systematically resolve conflict in supertrees is still hotly debated within phylogenetics. At this early stage we present our results as tools to be incorporated into practical supertree methods once some of the fundamental systematic questions have been answered. Despite the false modesty, the algorithmic results presented in this paper have several direct applications, a few of which we discuss in Sect. 5. In particular we stress the ability to search over an exponentially large number of trees in polynomial time.

2

Four Standard Optimization Criteria

In this section we describe the phylogenetic criteria used for optimization. Discussion of the motivations, use, and characteristics of the various criteria can be found, for example, in [10]. Here we give only sufficient detail to describe the problems.

26

2.1

D. Bryant

Binary Character Weights

Given a phylogenetic tree T , removing an edge of T induces a bipartition of the leaf set of T , called a split of T . The set of all splits of T is denoted splits(T ), and a single split is written as A|B, using the standard notation for partitions. A split can be viewed as a binary character where each leaf is assigned a value of 0 or 1 depending on which block of the split it is in. Given a weighted collection of binary characters we can score a tree T by summing the weight of the characters corresponding to splits in T . The total summed weight is called the binary character compatibility score of T . We wish to find T that maximizes the binary character compatibility score. We will assume that all binary characters have non-negative weights. 2.2

Quartet Criteria

A (resolved) quartet is an unrooted binary tree on four leaves. There are three possible quartets for each set of four leaves. For any tree T the quartet set of T , denoted q(T ), is the set of all quartets that are induced subtrees of T . Suppose that we have assigned a weight w(ab|cd) for every possible quartet ab|cd such that {a, b, c, d} ⊆ L(T ). The quartet score of a tree T is the sum of the weights w(ab|cd) over all ab|cd ∈ q(T ). We want to find a tree with maximum quartet score. We will assume that all quartets have non-negative weights. 2.3

Least Squares (OLS)

Suppose that T is an unrooted tree and the edges of T are given real valued weights. The leaf to leaf distance between any two leaves of T is the sum of the edge weights along the path between the two leaves. In this way we can construct a distance matrix p containing all of the leaf to leaf distances between leaves in T . The sum of squares distance between p and a given distance matrix d for L(T ) is defined X X (dxy − pxy )2 . (1) kp − dk2 = x∈L(T ) y∈L(T )

For a fixed tree T the edge weights for T that give a distance matrix p minimizing kp − dk2 are called the OLS edge length estimates for T . The ordinary least squares (OLS) score of a tree T is the value kp − dk2 given by the OLS edge length estimates. We want to find a tree T with minimum OLS score. 2.4

Minimum Evolution (ME)

The minimum evolution criteria is closely related to OLS. Given an unrooted tree T and a distance matrix d we first calculate the OLS edge estimates for T

Optimal Agreement Supertrees

27

from d. The minimum evolution (ME) score for T is then the sum of these edge weights. The minimum evolution score is usually only evaluated for binary trees1 . We want to find a tree with minimum ME score.

3

Split Constrained Phylogenetic Optimization

In this section we describe the results of [1,2,5,7] that provide the machinery for the optimal agreement supertree algorithms. The results all follow the same theme: the introduction of split based constraints that make phylogenetic reconstruction polynomial time solvable. More formally we have: INSTANCE: Set S of splits on leaf set L. Degree bound d. PROBLEM: Is there a tree T with degree bound d such that splits(T ) ⊆ S? If so, find the tree(s) with degree bound d and splits(T ) ⊆ S with 1. maximum binary compatibility score (with respect to a given weighting of splits in S). [1] 2. maximum quartet score (for a given quartet weighting). [7] 3. minimum OLS score (for a given distance matrix). [5] 4. minimum ME score (for a given distance matrix). [2,5] All of these versions of the problem can be solved in polynomial time (in the number of splits and leaves), when d is bounded by a constant.

4

Optimal Agreement Supertree Algorithms

4.1

Split Set Constraints

We show how the split constrained optimization algorithms can be used to construct optimal agreement supertrees. Let T = {T1 , . . . , Tk } be a collection of trees. Put Li = L(Ti ) for each i and L = L1 ∪ · · · ∪ Lk . Define A |B ∈ splits(Ti ) ∪ {∅|Li }, i = 1, 2, . . . , k f. S(T ) = A1 ∪ · · · ∪ Ak |B1 ∪ · · · Bk : i i (A1 ∪ · · · ∪ Ak ) ∩ (B1 ∪ · · · Bk ) = ∅ (2) Note that we will assume Ai |Bi ∈ splits(Ti ) implies Bi |Ai ∈ splits(Ti ). Put n = |L|. There are at most 2n − 3 splits in each tree so |S(T )| is O(2k nk ). Furthermore Theorem 1. Let T be an unrooted phylogenetic tree with leaf set L. If each tree Ti ∈ T is an induced subtree of T then splits(T ) ⊆ S(T ). 1

It would be interesting to determine whether or not optimal minimum evolution trees can be assumed to be binary

28

D. Bryant

Proof. Suppose that each tree Ti ∈ T is an induced subtree of T . Thus Ti = T |Li and splits(Ti ) = splits(T |Li ) = {A ∩ Li |B ∩ Li : A|B ∈ splits(T )} − {Li |∅}.

(3) (4)

Hence for each A|B ∈ splits(T ) we have A ∩ Li |B ∩ Li ∈ splits(Ti ) ∪ {Li |∅} .

(5)

A|B = (A ∩ L1 ) ∪ · · · ∪ (A ∩ Lk )|(B ∩ L1 ) ∪ · · · ∪ (B ∩ Lk )

(6)

It follows that

and A|B belongs to S(T ).

t u

We now take advantage of the fact that the input trees are all binary. Lemma 1. If T contains only binary trees and there is an agreement supertree T of T then there is a binary agreement supertree T 0 of T such that splits(T ) ⊆ splits(T 0 ). Proof. Suppose that T is an agreement supertree of T that is not binary. Clearly there exists a binary tree T 0 such that splits(T ) ⊆ splits(T 0 ). We will show that T 0 is also an agreement supertree for T . For each Ti ∈ T we have Ti = T |Li and, since splits(T ) ⊆ splits(T 0 ), we also have splits(T |Li ) ⊆ splits(T 0 |Li ). The tree Ti is binary, so splits(Ti ) = t u splits(T |Li ) ⊆ splits(T 0 |Li ) implies Ti = T 0 |Li , completing the proof. We now have all the machinery needed for the main result. Theorem 2. Let T = {T1 , . . . , Tk } be a collection of binary trees with leaf sets L1 , . . . , Lk . We can determine whether an agreement supertree of T exists in O(4k n2k+1 ) time. If there is an agreement supertree for T then we can find the agreement supertree with optimum 1. binary character compatibility score (with respect to a given weighting of splits) in O(4k n2k+1 ) time. 2. summed quartet weight (for a given quartet weighting) in O(2k nk+4 +4k n2k+1 ) time. 3. least squares score (for a given distance matrix) in O(8k n3k ) time. 4. ME score (for a given distance matrix) in O(8k n3k ) time. Proof. By Theorem 1 we have that any agreement supertree T for T satisfies splits(T ) ⊆ S(T ). With all of the optimization criteria (except perhaps ME, which is usually defined only for binary trees) a tree T 0 with splits(T ) ⊆ splits(T 0 ) will have at least as good score as T . Hence, using Lemma 1, we can assume that the optimal agreement supertrees are binary. The problem therefore reduces to finding an optimal binary tree contained within a given collection of splits. Applying the methods stated in Sect. 3 is is possible to obtain the given complexities. t u

Optimal Agreement Supertrees

29

We note that, when the number of trees k is bounded by a constant, we have polynomial time algorithms for determining optimal agreement supertrees. When k is unbounded, the problem is NP-complete (by a reduction from QUARTET COMPATIBILITY [9]).

5

Applications to Optimization

To illustrate how these methods may be used during optimization we describe two applications. 5.1

Divide and Conquer

Given two (or more) binary trees we now have a method for combining them in a way that optimizes phylogenetic criteria. The most obvious application of these algorithms is as the basis of a divide and conquer algorithm. There are several possibilities for ways to subdivide the data set. The DCM method [8] uses distance data to divide the sequences into closely related clusters. However the merge algorithms we describe here will work well even if all of the input trees contain leaves that are widely scattered throughout the combined tree. For this reason, the subdivision process appears less critical. We propose the use of a random subdivision, biased to produce subsets that have close to equal sizes. 5.2

Testing Stability

The standard technique for phylogenetic optimization is to conduct a tree search: an initial tree is constructed; Neighbouring trees are examined, where ’neighbouring’ means within one branch swap or NNI exchange of the original tree; If a better tree is found, this is chosen as the next tree and the search continues. At each step at most O(n) or O(n2 ) trees are searched. We can use the merge algorithms presented in this paper to search an exponentially large number of trees without a dramatic increase in complexity. We randomly divide the set of leaves into two parts, calculate the two induced subtrees, and then calculate an optimal agreement supertree for these two trees. An agreement supertree will always exist since the leaf sets of the subtrees are disjoint. If we have found a global optimum then this procedure will always produce the original tree. If it finds a better tree we take this new tree and repeat the process.

6

Discussion

In this abstract we have outlined a method for constructing an optimal agreement supertree of a set of binary trees, if such an supertree exists. The algorithm can search an exponentially large collection of trees (the possible agreement supertrees) in polynomial time. Our results lead to several directions of further research:

30

6.1

D. Bryant

Extension to Other Criteria

It would be useful to determine whether the algorithms can be extended to optimize over other phylogenetic criteria such as parsimony score or likelihood. In both cases we are pessimistic. With parsimony, the structure of one part of a tree can alter how best to resolve the structure of another part of the tree, making parsimony less suitable for dynamic programming algorithms such as those we use here. With maximum likelihood it is unknown whether the likelihood of a single tree can be computed in polynomial time! In both cases, the above results can be used to make a rough search of the tree space that would precede a more detailed, and computationally expensive, local search. 6.2

Extension to Non-binary Trees

The next direction of future research would be to determine if the algorithms can be extended to handle non-binary trees. The problem is that we require a degree bound in order to use the algorithms outlined in Sect. 3. We suspect that an appropriate modification of the algorithms will give a polynomial time algorithm (for a bounded number of trees) even without a degree bound. We have derived the required modifications for the special case when the leaf sets of the input trees are disjoint, but the the complexity of the general problem remains open. 6.3

Trees with No Agreement Supertree

At the moment the algorithms will simply terminate if there is no agreement supertree possible. This occurs when, for example, there are conflicts between the input trees. It would be interesting to extend the algorithm to derive a consensus supertree method capable of handling input trees with conflicts. First, however, we must address the question of how to handle conflict between different trees. The subtree merger algorithm of [8] simply contracts edges causing problems. This will, in general, lead to poorly resolved trees. Once suitable criteria for combining subtrees are defined, it should be possible to extend the optimal agreement supertree algorithms above to give a general optimal subtree merger technique.

Acknowledgements. I thank the anonymous referees for their helpful comments and suggestions. The paper was written while I was a postdoctoral fellow based in the laboratory of Olivier Gascuel.

References 1. Bryant, D.: Hunting for trees in binary character sets. Journal of Computational Biology 3(2), (1996) 275-288.

Optimal Agreement Supertrees

31

2. Bryant, D.: Hunting for trees, building trees and comparing trees: theory and method in phylogenetic analysis. Ph.D. thesis, Dept. Mathematics, University of Canterbury (1997). 3. Finden, C.R. and Gordon. A.D. Obtaining common pruned trees. Journal of Classification, 2 (1986) 255–276. 4. Gordon, A.D. Consensus supertrees: the synthesis of rooted trees containing overlapping sets of leaves. Journal of Classification, 3 (1986) 335–348. 5. Bryant, D. Fast algorithms for constructing optimal trees with respect to OLS and minimum evolution criteria. In preparation. 6. Goddard, W., Kubicka, E., Kubicki, G., and McMorris, F.R.. (1994). The agreement metric for labelled binary trees. Mathematical Biosciences 123 215– 226. 7. Bryant, D. and Steel, M. Constructing optimal trees from quartets. Journal of Algorithms (to appear) 8. D.H. Huson, S. Nettles, and T. Warnow: Obtaining highly accurate topology estimates of evolutionary trees from very short sequences. Proc. 3rd Annual Int. Conf. Comp. Mol. Biol. (RECOMB), (1998) 198–209. 9. M. Steel: The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 9 (1992) 91-116. 10. Swofford, D. L., G. J. Olsen, P. J. Waddell and D. M. Hillis: Phylogenetic Inference. in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular Systematics, second edition. Sinauer, Sunderland, Mass (1996) 407–514.

Segmentation by Maximal Predictive Partitioning According to Composition Biases Laurent Gu´eguen CEB-LIS – UPMC Paris VI [email protected]

Abstract. We present a method for segmenting qualitative sequences, according to a type of composition criteria whose definition and evaluation are founded on the notion of predictors and additive prediction. Given a set of predictors, a partition of a sequence can be precisely evaluated. We present a language for the declaration of predictors. One of the problems is to optimize the partition of a sequence into a given number of segments. The other problem is to obtain a suitable number of segments for the partitioning of the sequence. We present an algorithm which, given a sequence and a set of predictors, can successively compute the optimal partitions of the sequence for growing numbers of segments. The time- and space-complexity of the algorithm are linear for the length of sequence and number of predictors. Experimentally, the computed partitions are highly stable regard to the number of segments, and we present an application of this approach to the determination of the origins of replication of bacterial chromosomes.

Broadly speaking, the aim of the classification process is to optimize the dividingup a set of objects into classes, in line with given criteria, so as to identify a structure, if any such exist. Once a set is organized into classes, the next aim may be to obtain a description of these classes, which would provide a high-level redescription of the set. This description could be set either after or during the classification process. But it may be very difficult to obtain “good” descriptions of existing classes, so the approach we adopted was to simultaneously describe and construct classes, and to evaluate the quality of the classification according to the accuracy of the descriptions. In this paper, we outline a method for partitioning sequences into described segments in such a way that the descriptions could be used to evaluate the quality of the corresponding segments. We first present the overall approach, and then an efficient, optimal partitioning algorithm. Finally, we link this method to the problem of partitioning genomic sequences according to relevant composition biases. O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 32–44, 2001. c Springer-Verlag Berlin Heidelberg 2001

Segmentation by Maximal Predictive Partitioning

1 1.1

33

Prediction Prediction of Sequences

We use Gower’s concept of prediction [3]: the quality of a class is measured by the ability of an “optimal” predictor belonging to a given set of predictors to predict the properties of the elements of the class in question. For sequences of letters, a property of a given position is the letter or word that is readable at this position. For example, to say that the predictor of a sequence of letters is the letter A means that each letter of this sequence should be an A, and the resulting high-level description of this sequence is: “We can read an A at each position in the sequence”. This description will be more or less accurate, according to the proportion of A in the sequence. To say that the predictor of a sequence is the word ‘AT’means that the word ‘AT’ should be readable at each position in the sequence, but it is clear that such a prediction can be correct for, at most, half the positions in the sequence. The notion of prediction can be extended to entire sequences. For example, to say that the predictor of a class is A with a factor 0.3, and T with a factor 0.7, means that we predict: “We can read an A at 30 % and a T at 70 % of the positions of the sequence”. We call this predictor [A(0.3)T(0.7)] (see Sect. 1.2). Let δ be a predictor, S a sequence, and si a position in this sequence. The prediction of δ at si is termed πδ (si ) ∈ R. For δ = A, if the letter at position si is A, πδ (si ) is 1, otherwise it is 0. For δ = ‘AT’, if the word at position si si+1 is AT, πδ (si ) is 1, otherwise 0. For δ = [A(0.3)T(0.7)], if the letter at si is A, πδ (si ) is 0.3; if the letter is T, πδ (si ) is 0.7; otherwise, it is 0. The prediction of δ on a segment S is the sum of the predictions of δ for all the positions on this segment. We call this prediction πδ (S): X πδ (si ) πδ (S) = si ∈S

In a geometrical representation, predictions for a sequence may be considered as a scalar product. In Fig. 1, for instance, we represent the sequence as a walk on the orthonormal basis (eA , eT ), starting at the origin. If we read an A (resp. T) at the first position, the first step in the walk is eA (resp. eT ). In a given representation, the vector corresponding to sequence S is called the sequence vector of S, noted as vS . In the same way, a predictor δ is also represented by a predictor vector, vδ . In this context, the prediction of the predictor δ on the sequence S is the scalar product of vδ and vS , termed vδ .vS . For a set of predictors D, the optimal predictors of S are those that provide the best prediction on S, i.e. those whose vectors provide the best scalar product with the sequence vector of S. The corresponding optimal prediction is πD (S): πD (S) = max πδ (S) δ∈D

34

L. Gu´eguen A

affine representation of sequence S

S= ATAAATTAATATTTTATTTT

predictor vector of A predictor vector of [A(0.3)T(0.7)]

sequence vector of S

predictor vector ofT T

Fig. 1. Vectorial representation of a sequence, and of predictors A, T and [A(0.3)T(0.7)].

There is competition between the different predictors of D. Following the geometrical representation, and in order to ensure that the predictor chosen among the set D is the one whose vector is the most colinear with the sequence vector, we have to normalize the different predictor vectors. For example, if D is the set {A, T, [A(0.3)T(0.7)]}, the normalization is carried out by multiplying the predictions of [A(0.3)T(0.7)] by 1.313 (see Fig. 2). This predictor is termed [A(0.3)T(0.7)](1.313).

A

S= ATAAATTAATATTTTATTTT

A [A(0.3)T(0.7)](1.313) T representation of normalized predictors T

Fig. 2. Vectorial representation of normalized predictors. [A(0.3)T(0.7)](1.313) is the one that best fits the sequence S.

The

predictor

Segmentation by Maximal Predictive Partitioning

1.2

35

Language of Predictors

We constructed a language that would allow us to declare different predictors. We use the word “lexicon” to designate a set of words belonging to this language which determines the set D of the available predictors. In a lexicon, the words are separated by spaces. Each word corresponds to a predictor, according to the rules given below. Now, the two terms “predictor” and “word” are equivalent. – If α is a letter in S, α is in the language, and its prediction function is the characteristic function, i.e.1 : πα (si ) = 1lα=si =

1 if the letter in position si is α, otherwise 0;

– The sign ’?’ is in the language, its prediction function is: π? (si ) = 1 – If p is a predictor and x ∈ R, p(x) is a predictor belonging to the language and: πp(x) = x.πp – If p1 , . . . , pk are predictors, [p1 p2 . . . pk ] is a predictor belonging to the language, and: k X πp i π[p1 p2 ...pk ] = i=1

– If p1 , . . . , pk are predictors, ‘p1 p2 . . . pk ’ is a predictor belonging to the language, and: π‘p1 p2 ...pk ’ (si ) = πp1 (si ).1l(πp2 (si+1 )>0)∧...∧(πpk (si+k−1 )>0) Combinations within the framework of this syntax make possible the declaration of a very large range of predictors. For instance, the word ‘A?C’(2) stands for a predictor whose prediction is 2 at positions where the letter is A and the letter which occurs two positions after it is C, 0 at all other positions. The predictor [‘A[CA]G’(-2)‘CA’](3) gives a prediction of −6 at positions where the word is ACG or AAG, 3 where it is CA, otherwise 0.

2

Predictive Partitioning of Sequences

2.1

Predictive Partitions

A partition of a sequence is a division of the sequence into separate segments such that their union equals the whole sequence. We call caesura the junction 1

1ltrue = 1 and 1lfalse = 0

36

L. Gu´eguen

of two consecutive segments. We call a k-partitions a partition into k segments, and the set of these k-partitions is termed Pk (S). The prediction of a partition follows the same logic as the prediction of a sequence: our aim is to optimize the prediction of the composition of segments, and the prediction of a partition is the sum of the optimal predictions of its segments. Then, if the k-partition P is made up of segments S1 , S2 , . . . , Sk , its prediction by D is: πD (P ) =

k X

πD (Sj ) =

j=1

k X j=1

max πδ (Sj ) δ∈D

For example, given the sequence ATAAATTAATATTTTATTTT and the lexicon A T [A(0.3)T(0.7)](1.313), then for the 3-partition ATAAATTA ATATT TTATTTT these segments have successively A, [A(0.3)T(0.7)](1.313) and T as optimal predictors, and the prediction of this 3-partition is 5 + 3.55 + 6 = 14.5451. The larger the prediction of a k-partition, the better the predictions that can be made about its segments. Given a set of predictors and a number of classes k, our goal is to obtain the optimal division into k segments, i.e. to find the optimal partitions among the entire set of k-partitions. The evaluation of the maximal k-partitions of S according to the predictors set D is termed MkD (S). MkD (S) =

max

(S1 ,... ,Sk )∈Pk (S)

k X j=1

max πδ (Sj ) δ∈D

In the example given above, an optimal 3-partition is ATAAA TTAATATTTTA TTTT These segments have successively A, [A(0.3)T(0.7)](1.313) and T as optimal predictors, and the corresponding prediction is 16.0093. In terms of geometrical representation, an optimal k-partition is one in which the vectors representing the segments are most colinear to their best predictors among D (see Fig. 3). Finding a “good” partition means finding a compromise between very precise descriptions of only a few of the k segments and a good average of all the k segments. Because the prediction is additive in the neighborhood of a junction between two segments, the caesura is set in such a way that the prediction of the maximum elements is relevant. Given a set of predictors, and considering this geometrical representation, we can say that a sequence has a structure when its affine representation can be divided up into segments whose vectors are distinctly colinear with one of the predictors.

Segmentation by Maximal Predictive Partitioning

37

A

S= ATAAATTAATATTTTATTTT 3−partition: ATAAA TTAATATTTTA TTTT A [A(0.3)T(0.7)](1.313) T T

Fig. 3. Vectorial representation of a 3-partition. Dashed arrows represent the vectors of the segments of this partition. We can see that the vector of predictor A is the most colinear with the first arrow, [A(0.3)T(0.7)](1.313) with the second, and T with the third.

2.2

Number of Segments

A priori, it is reasonnable to ask an experimenter to define a problem in terms of a sequence and a set of predictors2 . But the precise number of segments which will successfully partition a sequence cannot be known in advance. So, from the point of view of classification, and with the hypothesis of a structure in a sequence, it may be useful to estimate the number of segments that “best” discloses this structure. And if there is no structure at all, it should be possible to demonstrate this. In fact, more than one class may be relevant, as in the case of nested clusters. There are numerous ways to estimate the number(s) of classes in a set, and we will not list them all, but a number of them can be found in [2]. However, none of them is perfect. A “good” summary of the composition of the sequence implies the “right” number of classes. Yet by definition, the optimal prediction does not decrease with the number of segments, and the global maximum is reached when all the elements have been separated, which is not a very interesting solution. The “right” number of classes must be found: if there are too few, the high-level redescription obtained will be too vague; if there are too many, this redescription will be too complicated. One way to assess the relative usefulness of the different numbers of classes is to compute the optimal k-partitions for successive k between 1 and a given value. We call this process partitioning, and we give the same name to the result of such a process. A partitioning can be used to study the optimal prediction in terms of the number of classes. Qualitatively, a sharp bend in the curve of the prediction is the sign of a relevant number of segments because, for a lower number of 2

We shall discuss the choice of this set in the Sect. 4.

38

L. Gu´eguen

segments, the prediction will be too vague, and for a higher number, the gains in prediction will be outweighed by the increase in the numbers of classes. Such a point corresponds to a good balance between summarization and description. A further advantage of partitioning a sequence is that it may reveal another structure of the sequence, i.e. the successive revelation of segments according to their prediction by a “good” predictor, the structure becoming more and more detailed with the number of classes. Moreover, the composition of the segments is not subject to a priori limitations, such as that of a threshold. We next present an algorithm for computing the partitioning of a sequence into a given number of segments given a sequence and a set of criteria [4].

3

Algorithm

There already exists an algorithm for computing the optimal partitionings of sequences, for any kind of evaluation function of the classes [5]. This algorithm is based on Fisher’s lemma [1]): if {S1 , S2 , . . . , Sk } is an optimal k-partition of ∪ki=1 Si , then {S1 , S2 , . . . , Sk−1 } is an optimal k − 1-partition of ∪k−1 i=1 Si . The time-complexity of this algorithm is quadratic with the size of the sequence, and this is critical for very long sequences, such as genomic sequences. In the context of prediction, the algorithm we are introducing here uses the previous lemma and another observation: given a set of predictors D, all the positions of a class are predicted by the same predictor of D (whether this prediction is correct or not). The time- and space-complexity of the algorithm are linear with the length of the sequence, which makes possible the partitioning of very long sequences. Let s1 , s2 , . . . , sn be the elements of sequence S, whose size is n. Given i > j, we call hi, ji the set {i, i + 1, . . . , j − 1, j} and shi,ji the subsequence si . . . sj . For each δ ∈ D, and for i > k, if we compute, for all the k-partitions of the subsequence sh1,ii , the prediction such that the segment containing si is predicted – forcibly – by the predictor δ, we call Mkδ (sh1,ii ) the maximum of these predictions. The aim is to compute MkD (sh1,ni ) for each k, and to backtrack the corresponding optimal partitions. In an optimal k-partition, the segment containing si is predicted by an optimal predictor: MkD (sh1,ii ) = max Mkδ (sh1,ii ) δ∈D

Fisher’s lemma is used to compute Mkδ (sh1,ii ). An optimal k-partition of sh1,ii for which the optimal predictor of the segment containing si is δ conforms to at least one of the two possibilities:

Segmentation by Maximal Predictive Partitioning

39

— either si−1 and si belong to the same segment, in which case the segment containing si−1 is predicted by δ,

. . . si−1 si ...δ δ

s 1 s2 . . . and the prediction is:

Mkδ (sh1,ii ) = πδ (si ) + Mkδ (sh1,i−1i )

— or si−1 and si belong to two different segments, in which case the kth segment is {si }, and the other k − 1 segments together constitute an optimal (k − 1)partition of sh1,i−1i ,

. . . si−1 si . . . δ0 δ

s 1 s2 . . . and the prediction is:

D Mkδ (sh1,ii ) = πδ (si ) + Mk−1 (sh1,i−1i ).

The maximum of these two formulae is:

D (sh1,i−1i ), Mkδ (sh1,i−1i ) , Mkδ (sh1,ii ) = πδ (si ) + max Mk−1

and, as there is no k-partition of sh1,k−1i , D (sh1,k−1i ) Mkδ (sh1,ki ) = πδ (sk ) + Mk−1

To sum up: ∀i, M1D (sh1,ii ) = maxδ∈D πδ (sh1,ii ) D (sh1,k−1i ) ∀δ, ∀k, Mkδ (sh1,ki ) = πδ (sk ) + Mk−1

D (sh1,i−1i ), Mkδ (sh1,i−1i ) ∀δ, ∀k, ∀i > k, Mkδ (sh1,ii ) = πδ (si ) + max Mk−1 ∀k, ∀i > k, MkD (sh1,ii ) = maxδ∈D Mkδ (sh1,ii ) D So if, for a given k, the set {Mk−1 (sh1,ii ), i > k − 1} is preserved, we can compute successively, for i > k, the set of the MkD (sh1,ii ) using all the Mkδ (sh1,ii ). We then compute successively the partitioning M1D (S), M2D (S), . . . , MkD (S). Backtracking this computation gives an optimal k-partition for each k. The computation of these optimal partitions is linear over time: for each e ∈ h2, ki, for each i ∈ he, ni, for each δ ∈ D, Meδ (sh1,ii ) is computed in a constant D (sh1,i−1i ). Moreover, for each i ∈ he, ni, the time using Meδ (sh1,i−1i ) and Me−1 D computation of Me (sh1,ii ) requires a comparison between |D| elements. Then, at eachPiteration of the number of segments, the computing time approximan tes to i=e |D| = |D|.(n − e + 1). In conclusion, MkD (S) is computed with a time-complexity of O(k.|S|.|D|). This algorithm can thus be used on very long sequences.

40

L. Gu´eguen

For a given k, there may in fact be several maximally predictive k-partitions, and it may be useful to calculate all these, especially if they have very different caesuras. With adjacency lists, it is possible to find the set of all the equally optimal k-partitions without any increase in complexity.

4 4.1

Applications Composition Biases in Sequences

Genetic sequences are subject to numerous functional and structural constraints, and there are composition biases between their different parts. For us, “composition bias” means that there is a significant difference between different parts of a sequence for a measurable criterion, e.g. the comparison between the proportion of G and the proportion of C on a DNA string. In a number of bacteria the expression of this C--G bias is related to the replication process, due to the fact that these proportions are very different between the leading and the lagging part of a DNA double-strand sequence [6]. In [7], a linear discriminant analysis method is used to find, and especially, to evaluate the relative merits of different criteria for distinguishing between the leading and lagging parts of a bacterial chromosome; these criteria can be based on the C--G bias or the frequency of codons in genes. The aim of detecting such biases is to facilitate the study of biological data, though the explanation of those biases raises new problems for biological research. We plan to link the notion of composition bias to that of prediction. To arrive at an objective criterion of composition bias, we use the notion of prediction, and that of language as defined in Sect. 1.2. Our second aim is to compute such partitionings, using the algorithm presented in Sect. 3. If a composition bias can be translated into a prediction, using this approach, we can partition the sequence into segments that optimally mirror this bias. Taking a geometrical point of view, we consider a composition bias B such that, in an affine representation of a sequence, which has been set up according to the parameters of B, each trend τi of the bias is related to a transect Tτi of the space. This means that if the vector of a segment is in Tτi , the segment follows τi . At the boundary between two transects, the segment follows the two trends to the same extent. Looking at all the trends in a bias, we have a partitioning of the space into transects {Tτ1 , . . . , Tτp } (as in Fig. 4). Looking at a set of predictors {δ1 , . . . , δn }, for each i let Tδi be the set of vectors v such that ∀j 6= i, v.vδj < v.vδi . Tδi is a transect of the space. In order to express bias B using the prediction, we have to find p predictors δi such that {Tδ1 , . . . , Tδp } = {Tτ1 , . . . , Tτp }. This problem can be solved easily by using the fact that at the boundary between two transects, the prediction of a segment is the same for the two corresponding predictors. The algorithm and the declaration of the language used to compute optimal predictive partitionings have been programmed. The computation of partitionings turns out to be very efficient: for example, on a Gateway computer with a

Segmentation by Maximal Predictive Partitioning

41

A 1111111111111111111111111 0000000000000000000000000 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111

1111111111111111111111111 0000000000000000000000000 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 A(2) 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 [TA(1.6)] 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 [T(1.85)A(0.8)] 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 T(2) 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 T 11111111111111111111111111111111

Fig. 4. Predominance transects of the predictors of the lexicon A(2) [TA(1.6)] [T(1.85)A(0.8)] T(2). The striped areas are the transects, which are related to the trends in the studied bias.

Pentium III processor and 256 Mb of RAM, taking a sequence of 2.106 letters and four predictors, the construction of an optimal partition into 100 segments takes less than ten minutes. In this way, it has been possible to study the partitioning process extensively. 4.2

Experiments

The discovered segments are very stable with the numbers of segments. For instance, the partitioning of the sequence shown in the Fig. 5 almost gives the impression of a hierarchical structure. This is not in fact the case, but it reinforces the impression that partitioning reveals hidden structures, since it seems to bring out increasingly detailed features of the composition of the sequence. Visually, such figures do indeed give a thorough view of the intricacy of the expression of the different tendencies of a bias within a given sequence. Moreover, the stability means that the determination of a “suitable” number of segments becomes less critical, because the caesuras remain, along with several numbers of classes. We partitioned several bacterial genomic sequences according to the C--G imbalances in different parts of these sequences. We wanted to check whether the bias noted by Lobry [6] made it possible to determine the origin and terminus of replication. To express this bias, we used the C G lexicon . The chromosomes of the bacteria studied are circular, so we dicided them up into linear sequences. Figure 6 gives the result of partitioning Haemophilus Influenzae into ten segments. We see that the leading and lagging parts are the first to be extracted. Figure 7 shows a very sharp bend in the prediction curve in terms of the number of classes, at 3 segments, revealing that the most significant expression

42

L. Gu´eguen

Fig. 5. Example of the partitioning of a sequence. The sequence is the line at the bottom. Each class within a partition is represented by an arc. The number of classes increases from top to bottom.

of the G--C bias is exactly correlated with the leading and lagging parts of the chromosome. Table 1 shows, for a number of bacteria and archae-bacteria, the limits of the segments of the 2- and 3-partitions, using the same C G lexicon, for which there is a sharp bend in the prediction curve. The selected number of classes is either 2 (which means that the origin of replication is at position 0, i.e. at the position where the circular chromosome has been split) or 3. It seems clearly that this method could be extensively used for determining the composition of sequences such as genomes and proteins. The large range of criteria according to which partitionings can be carried out may be useful tool in biological research, e.g. in determining of the relation between the replication process and other criteria (cf [7]), the study of isochores, or that of variations in codon usage inside a chromosome. And, as this method is precise and objective, it is a good way to evaluate the relevance of particular criteria, relating partitions to knowledge about the properties of particular sequences.

Segmentation by Maximal Predictive Partitioning

43

C 1 C

G

2 C

G

C

C

G

C

3 G

4 C

G

C

G

C

C

G

C

G

C

5 G

6 C

G

C

G

C

G

C

C

G

C

G

C

G

C

7 G

8 C

G

C

G

C

G

C

G

C

C

G

C

G

C

G

C

G

C

9 G

10

1000000

1830135

Fig. 6. Partitioning of H. influenzae into 10 classes, using a C G lexicon. The chromosome, which is circular, is represented as a linear sequence (the graduated line). The black triangles represent the locations of the origin (pointing upwards) and terminus (pointing downwards) of replication.

369341

360000

350723 1

3

10

20

Fig. 7. Prediction in terms of the number of classes of partitioning into twenty classes of H. influenzae, using the C G lexicon.

44

L. Gu´eguen

Table 1. Origins and termini of replication and with the caesurae found by partitioning, in sequences for which an increase in prediction shows a sharp bend. The positions are given as percentages of the overall length of the sequence. “Caesura of C towards G” means that the predictor C (resp. G) is on the left (resp. right) of the caesura. Organism B. burgdorferi B. subtilis C. trachomatis E. coli H. influenzae H. pylori M. tuberculosis M. pneumoniae M. genitalium M. thermoautotrophicum P. horikoshii R. prowazekii T. pallidum

Origin Terminus 50.1 0 69 84.6 32.9 96 0 25.1 0 75 ? 0 0

0 47.8 19 34.1 80.1 47 49 73 ? 24 ? 56 48

Caesura of Caesura of C towards G G towards C 50.3 0 0 46.1 69.1 18.7 84.6 33.4 33.2 80.1 98 48.8 0 46.3 25 74.1 0 50.8 74.9 23.3 0 43.6 0 56.5 0 48.6

References 1. W.D. Fisher. On grouping for maximal homogeneity. Journal of the American Statistical Association, 53:789–798, 1958. 2. A.D. Gordon. Cluster validation. In C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.H. Bock, and Y. Baba, editors, Studies in Classification, Data Analysis, and Knowledge Organization: Data Science, Classification, and Related Methods, pages 22–39, Kobe, March 1996. IFCS, Springer-Verlag. http://www-solar.dcs.st-and.ac.uk/˜allan/. 3. J.C. Gower. Maximal predictive classification. Biometrics, 30:643–654, 1974. 4. L. Gu´eguen, R. Vignes, and J. Lebbe. Maximal predictive clustering with order constraint: a linear and optimal algorithm. In A. Rizzi, M. Vichi, and H. Bock, editors, Advances in Data Science and Classification, pages 137–144. IFCS, Springer Verlag, July 1998. 5. D.M. Hawkins and D.F. Merriam. Optimal zonation of digitized sequential data. Mathematical Geology, 5(4):389–395, 1973. 6. J.R. Lobry. Asymmetric substitution patterns in the two dna strands of bacteria. Mol. Biol. Evol., 13(5):660–665, 1996. 7. E.P.C. Rocha, A. Danchin, and A. Viari. Universal replication biases in bacteria. Molecular Microbiology, 32(1):11–16, 1999.

Can We Have Confidence in a Tree Representation? Alain Gu´enoche1 and Henri Garreta2 1

IML - CNRS LIM - Universit´e de la M´editerran´ee, 163 Av. de Luminy, 13288 Marseille Cedex 9 2

Abstract. A tree representation distance method, applied to any dissimilarity array, always gives a valued tree, even if the tree model is not appropriate. In the first part, we propose some criteria to evaluate the quality of the computed tree. Some of them are metric; their values depend on the edge’s lengths. The other ones only depend on the tree topology. In the second part, we calculate the average and the critical values of these criteria, according to parameters. Three models of distance are tested using simulations. On the one hand, the tree model, and on the other hand, euclidean distances, and boolean distances. In each case, we select at random distances fitting these models and add some noise. We show that the criteria values permit one to differentiate the tree model from the others. Finally, we analyze a distance between proteins and its tree representation that is valid according to the criteria values.

1

Introduction

In this paper we consider aspects of the tree representation of a given proximity measure on a finite set X. This measure can be a distance, satisfying the metricity conditions, or a simple dissimilarity evaluated from any description of the X elements. Briefly speaking, we only have an array D, with dimension n = |X|, that is symmetrical with null values on the diagonal. The tree reconstruction methods define an X-tree having X as set of leaves and its internal nodes connect exactly three edges with non negative length. [Barth´elemy, Gu´enoche 1991]. When X is a set of homologous biological sequences and when the distance values are proportional to the number of mutations, the tree model is justified by the evolution theory of Darwin. But when this distance measures something else, such as a structural similarity or a functional proximity, it is not certified that an underlying tree exists and that the D values permit to infer this tree. There are many other domains where X-trees are calculated from a proximity measure, such as Cognitive Psychology, for studies about perception - noises, smells, etc. In Phylogeny, the tree nodes represent taxa or common ancestors, but in Psychology, these nodes correspond to categories and the reality of a tree organization of these categories is a hypothesis that must be verified and justified. O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 45–56, 2001. c Springer-Verlag Berlin Heidelberg 2001

46

A. Gu´enoche and H. Garreta

The aim of this paper is to study the appropriateness of the tree model to represent a given distance D. To do that, we apply a tree reconstruction method to get a tree A and its associated distance Da . Then, we measure the gap between D and Da using metric and topological criteria. To evaluate how large this gap is, we realize simulations to estimate the observed values of the criteria when the initial distance is close to a tree distance and to determine confidence intervals. If the criteria values for D belong to these intervals, A and the tree model are accepted. But if not, one can try another reconstruction method, but if none of these gives an acceptable tree, the model will be finally abandoned. Then one can check if other classical models, such as euclidean or boolean models are suited. Similar studies have been realized previously. We refer to Prutzansky, Tversky and Caroll [1982] who try to determine if a given distance is better fitted with a hierarchical clustering method or with a multidimensional scaling one, that is if D is closer to the ultrametric model than to the euclidean one. This study extends their work, dealing with the tree model that is more general, adding the boolean model, and offering the possible conclusion that none of these models is appropriate. In Sect. 2, we recall some criteria to evaluate the quality of an X-tree compared to a distance on X. There are two kinds of criteria, metric and typological. The first ones come from Statistics, evaluating differences between the D and Da values. The second ones are less classical. They only depend on the tree topology and are independent from the edges length. These criteria also permit one to compare different trees obtained by several reconstruction methods. Even if they do not agree, they give quantitative arguments to prefer one representation. In Sect. 3, we describe our procedures to generate random distances considering three models: on the one hand the tree distances, and on the other hand, the euclidean distances and the boolean distances established from binary attributes. For the tree distances we fix the ratio between the lengths of the internal and external edges. For the other distances, several dimensions of the spaces are tested. Realizing simulations, we evaluate average and critical values. The comparisons of the corresponding tables show that the tree model cannot be confused with the other distance models. In Sect. 4, we study a TAT proteins family (Twin Arginine Pathway) for which we build a tree that is finally validated by the method described here, according to the tables.

2

Criteria to Evaluate X-Trees

Let D be a n × n dissimilarity array such that for all {x, y} ∈ X 2 , D(x, y) = D(y, x) and for all x ∈ X, D(x, x) = 0. Let A be an X-tree given by any reconstruction method. The tree A, together with an edge-weighting allows one to calculate a tree distance such that for all {x, y} ∈ X 2 , Da (x, y) is equal to the length of the path in A between x and y.

Can We Have Confidence in a Tree Representation?

47

To evaluate the quality of A, we compare the two distances D and Da . This is realized using a program Qualitree working with two files: one for the tree, registered according to the Newick format and the other one for the distance D having the Phylip format. This program can be obtained at http://biol10.biol.umontreal.ca/guenoche/. 2.1

Metric Criteria

Among all the criteria proposed in Statistics, we have selected: - the distortion, which is the average of the percentage of differences: Dis =

X |D(x, y) − Da (x, y)| 2 n(n − 1) D(x, y) x 50) reduces the sensitivity of the programs. The reason is that the larger clusters are less likely to be regulated by a single factor, and might contain a mixture of different signals. The effect of mixing together sequence that contains a given signal with sequences that do not contain it is also to reduce the signal-to-noise ratio. The programs are able, to some extent, to extract multiple signals from a single analysis, but the highest efficiency is clearly obtained by selecting clusters of genes that are likely to be all regulated by the same transcription factor. The choice of the clustering method is thus crucial.

5 Metabolic Network Analysis We focus now on the second question, namely the functional interpretation of clusters of co-regulated genes.

156

J. van Helden et al.

Table 1. Patterns extracted by dyad-analysis with the MET family. Legends: obs occ: observed occurrences; exp occ: expected occurrences; proba: binomial probability; sig: significance index. All patterns with significance value higher than 1 were selected. Some patterns can be grouped together on the basis of sequence similarities, and assembled into larger patterns (contigs). The first group corresponds to the sequence recognized by the protein complex Met4p/Met28p/Cbf1p. The second group describes the site bound by Met31p and its homologue Met32p. These factors are those known to regulate methionine biosynthesis in the yeast Saccharomyces cerevisiae. To our knowledge, the third group and the isolated patterns do not show any obvious similarity to known binding sites, and could reveal new regulatory patterns pattern GTC..GTG.. .TCA.GTG.. .TCACGT... ..CACGTG.. ..CAC.TGA. ..CAC..GAC ...ACGTGA. GTCACGTGAC

reverse complementary ..CAC..GAC ..CAC.TGA. ...ACGTGA. ..CACGTG.. .TCA.GTG.. GTC..GTG.. .TCACGT... GTCACGTGAC

CGCCAC.... .GCCACA... ..CCA..GTT ..CCACAG.. ..CCA.AGT. ...CACAGT. ...CAC.GTT CGCCACAGTT

....GTGGCG ...TGTGGC. AAC..TGG.. ..CTGTGG.. .ACT.TGG.. .ACTGTG... AAC.GTG... AACTGTGGCG

14 21 23 24 21 24 24

2.32 4.06 5.86 3.53 4.59 5.41 5.79

.CCA................GGT ACC................TGG. .CCA...............TGG. ACCA...............TGGT

15 15 22

2.9 5.10E-07 1.7 2.9 5.10E-07 1.7 3.16 1.10E-06 1.3

assembly

ACC................TGG. .CCA................GGT .CCA...............TGG. ACCA...............TGGT

isolated

CAG...TGG

CCA...CTG

17

3.12 4.60E-08 2.7

group 1

assembly

group 2

assembly

group 3

obs occ exp occ proba sig 17 2.61 3.60E-09 3.8 23 5.12 8.50E-09 3.4 21 4.75 4.60E-08 2.7 38 3.37 0 20 23 5.12 8.50E-09 3.4 17 2.61 3.60E-09 3.8 21 4.75 4.60E-08 2.7

2.00E-07 3.30E-09 9.10E-08 2.30E-12 2.60E-08 5.20E-09 1.90E-08

2.1 3.8 2.4 7 2.9 3.6 3.1

Representating Metabolic Pathways as Graphs The set of all possible metabolic reactions can be seen as a graph, with two types of nodes (metabolites and reactions respectively). Arcs represent substrate-reaction and reaction-product relationships. A graph containing all known metabolic reactions 4 would include of the order of 10 nodes and as many arcs. The connectivity is very high for some particular compounds (ATP, Adenosyl-Methionine), but besides these “pool metabolites”, the vast majority of compounds are involved in a very limited number of reactions. Reactions have between 1 and 6 substrates (2 on average) and as many products. The complexity of such a graph is huge and the number of possible pathways is virtually infinite.

Application of Regulatory Sequence Analysis and Metabolic Network Analysis

157

However, only a very restricted number of these possible pathways are effectively followed in living organisms. For instance, the database EcoCyc, which holds the most comprehensive information about Escherichia coli metabolism, only contains 159 distinct pathways. E. coli has been, for several decades, the preferred model organism for biochemists, and even though some parts of its metabolism certainly remain to be discovered, the number of pathways is not expected to increase significantly for this organism. Metabolic Pathway Discovery L-Aspartate

S.cerevisiae

E.coli

2.7.2.4

L-aspartyl-4-P 1.2.1.11

L-aspartic semialdehyde 1.1.1.3

L-Homoserine 2.3.1.31

2.3.1.46

Alpha-succinyl-L-Homoserine 4.2.99.9

O-acetyl-homoserine

Cystathionine 4.2.99.10

4.4.1.8

Homocysteine 2.1.1.14

L-Methionine 2.5.1.6

S-Adenosyl-L-Methionine

Fig. 4. Alternative pathways for methionine biosynthesis in E. coli and S. cerevisiae

Many pathways however remain be discovered in other organisms. Indeed, it is common to observe that different organisms follow distinct pathways for the biosynthesis or degradation of the same molecule. For example, in E. coli, methionine is synthesized in 7 steps from aspartate, whereas the yeast Saccharomyces cerevisiae performs this transformation in 6 steps, 4 of which are common with E. coli (Fig. 4). The case of lysine is more extreme, since E. coli and S. cerevisiae follow completely different pathways to synthesize this metabolite.

158

J. van Helden et al.

In addition, many parts of the metabolism remain largely unexplored, for example the mechanisms of toxic molecule degradation or resistance to extreme conditions observed in some bacteria. In summary, among all the pathways that could be followed in the graph of metabolic reactions, only a very restricted fraction corresponds to already described pathways. Another part corresponds to pathways that are not yet described but might appear to be effectively used by some organisms in response to some conditions. Finally, a vast majority of these pathways might be devoid of any biological relevance. As illustrated below, measuring the transcriptional response of all the genes of an organism could be one way to select those pathways that are most likely to correspond to biological processes. Metabolism and Gene Expression Living organisms can rapidly modify their internal concentration of small molecules (metabolites) via enzymatic catalysis. Controlling metabolite fluxes is essential to cell viability, in that it allows the cell to maintain biochemical compounds in stationary concentrations (homeostasis), in spite of fluctuations of their external availability and internal consumption rates. Several molecular mechanisms are involved in metabolic regulation. Enzymes and transporters are regulated at different levels: transcription rate, RNA stability, translation rate, protein activity, intracellular location, protein degradation. Several of these mechanisms can be combined for the control of the same metabolic pathway. Enzymes and transporters participating in a common metabolic pathway are often co-regulated at the transcriptional level. Thus, when the culture medium is modified by depleting (or adding) a given metabolite, it is expected that the genes that participate in the biosynthesis (or degradation) of the molecule will respond at the transcriptional level. DNA chip and related technologies can be used to unravel the set of genes that respond to a given perturbation of the external conditions (addition/removal of a metabolite) in a given organism. The question is thus to discover, from this set of genes, which particular pathway could be catalyzed. Applying Graph Analysis for a Functional Interpretation of Gene Expression Data The first step is to select, among the set of co-regulated genes, those that code for enzymes, and identify the reactions they could catalyze. These reactions correspond to a subset of nodes in the graph of all possible metabolic reactions (Fig. 5A). The method consists in trying to interconnect all these reactions in a meaningful way (Fig. 5B), in order to extract a sub-graph (Fig. 5C) corresponding to one or several putative metabolic pathways (Fig. 5D). The algorithms for subgraph extraction and maximal path enumeration will be described elsewhere (in prep.), and we will only summarize their principle.

Application of Regulatory Sequence Analysis and Metabolic Network Analysis

A. Seed reactions

Compound Reaction Seed Reaction

C. Subgraph extraction

159

B. Reaction linking

Direct link Intercalated reaction

D. Linear Path Enumeration

Fig. 5. Conceptual schema of the subgraph extraction. A: graph representation of metabolic pathways. Two types of nodes are used to represent metabolites (circles) and reactions (rectangles) respectively. Arcs represent the relationships between reactions and their substrates and products. Filled rectangles represent seed reactions, i.e. those that are catalyzed by genes belonging to the co-regulated cluster. B: A direct link (dark gray) can be established between two reactions when the first one produces a metabolite that is used as substrate by the second one. Optionally, one can decide to allow intercalating reactions (ligt gray) that were not part of the initial seeds, if this improves the connectivity. C: Connected components are then extracted from the graph. The extracted subgraph represents a metabolic pathway that is potentially catalyzed by the cluster of co-regulated genes. D: When the subgraph contains branches, it can be decomposed into non-redundant elmentary paths, which highlight potential endpoints of the metabolic pathways

The simplest way to interconnect reactions is to identify compounds that are produced by one reaction and used as substrate by another one. In a second step, linking can be improved by intercalating reactions that were not part of the initial set. Several reasons could be invoked to justify such an intercalation. Firstly, some genes could be involved in the metabolic pathway without being regulated at the transcriptional level. Secondly, microarray technologies are still limited in reproducibility and some regulations might have escaped detection. A third possibility would be that the gene is present on the chip and its expression level has been measured correctly, but this gene has not been annotated as an enzyme yet. Indeed, for newly sequenced genomes, gene function is usually predicted by sequence similarities, and many genes remain unan-

160

J. van Helden et al.

notated. In such a case, the best candidates to ensure the missing enzymatic catalysis are the genes that belong to the initial cluster itself, but have no assigned function yet. Comparison of Extracted Pathways with Known Pathways Once the subgraph has been extracted, the putative pathway can be compared to the set of known metabolic pathways stored in some metabolic pathway database [8, 9, 19, 20]. In some cases the pathway extracted from the gene cluster will correspond to some previously characterized pathway. For such cases, a simple matching of the set of reactions against a database of metabolic pathways would have provided the same answer. In other cases, one might observe only a partial match with a known pathways. The subgraph extraction might thus reveal an alternative to the pathway followed in the model organism. The method could be applied to study the metabolism of newly sequenced organisms, whose metabolism has been poorly characterized. Finally, in some cases, one should be able to extract completely novel pathways. The co-regulation of the enzyme-coding genes would provide a good support to indicate that this pathway is biologically relevant. An interesting field of application would be to discover metabolic pathways involved in largely unexplored processes, such as resistance to toxic compounds or extreme conditions. Another application is to reveal which metabolic pathways are affected by a new drug. Application of Pathway Analysis to the Study Case We applied the above procedure to the 20 genes belonging to the MET cluster defined by Spellman and co-workers. Seven of these genes code for enzymes, which can catalyze 8 distinct reactions. Subgraph extraction and maximal path enumeration resulted in a linear pathway including 6 of the initial reactions (Fig. 6). In this case, the linear path was obtained without intercalating any reaction that was not part of the initial set. The pathway shows partial matches with two distinct metabolic pathways: the 4 initial steps match the sulfur assimilation pathway, and perform a progressive reduction of sulfate into sulfide. The two last steps match the methionine biosynthesis pathway, and correspond to the incorporation of sulfur into homocysteine, and the transformation of the latter into methionine. Sulfur assimilation and methionine biosynthesis are intrinsically related in the yeast Saccharomyces cerevisiae, since in this organism sulfur amino acids are all derived from the methionine biosynthesis pathway (this differs from Escherichia coli, where sulfur is incorporated into cysteine and then transferred to methionine). It makes thus sense to have a coordinated transcriptional regulation of all the metabolic steps from sulfate to methionine. Indeed, these genes are all known targets of the methionine-regulating transcription factors described above [21].

Application of Regulatory Sequence Analysis and Metabolic Network Analysis

Genes

Enzymes

161

Matching pathways

Pathway extracted Sulfate

MET14

Adenylyl sulfate kinase

MET16

3’-phosphoadenylylsulfate reductase

MET5

Putative Sulfite reductase

1.8.1.2

MET10

Sulfite reductase (NADPH)

sulfide

MET17

O-acetylhomoserine (thiol)-lyase

4.2.99.10

MET6

Methionine synthase (vit B12-independent)

2.7.7.4

ATP PPi

Adenylyl sulfate (APS) 2.7.1.25

ATP ADP

3’-phosphoadenylylsulfate (PAPS) 1.8.99.4

NADPH NADP+; AMP; 3’-phosphate (PAP); H+

sulfite 3 NADPH; 5H+ 3 NADP+; 3 H2O

O-acetyl-homoserine

Homocysteine 2.1.1.14

L-Methionine

5-methyltetrahydropteroyltri-L-glutamate 5-tetrahydropteroyltri-L-glutamate

Methionine biosynthesis

Sulfate adenylyl transferase

Sulfur assimilation

MET3

Fig. 6. Result obtained by pathway analysis with the cell-cycle regulated gene cluster MET from Spellman [10]. Six of the reactions potentially catalyzed by enzymes-coding genes from the cluster can be assembled into a single linear pathway, without need to intercalate any additional reaction. The pathway extracted is the way used by yeast to incorporate sulfur into amino acids, by reduction of sulfate into sulfide, which is incorporated in homocyteine. This pathway matches two distinct pathways from the database: the first 4 steps correspond to sulfur assimilation, whereas the two last steps are part of the methionine biosynthetic pathway

In summary, starting from an unordered set of reactions, the program was able to build a linear metabolic pathway, which correspond to our expectation for the study case. In this particular case, the pathway was already well characterized and a similar result would have been obtained by matching the seed reactions against a database of metabolic pathways like KEGG. However, since this pathway was re-discovered by the program without any a priori information about how reactions do assemble into pathways in the yeast (the matching with known pathways was only done a posteriori), one can hope that the same method will also provide a means of discovering novel pathway. We are currently optimizing the program and evaluating it performances in different conditions, on the basis of well characterized pathways. The optimized program will then be used to provide an interpretation of gene expression data in terms of metabolic pathways.

162

J. van Helden et al.

6 Conclusions In the context of genomic approaches, coding sequence analysis is often insufficient to systematically assign a function to each gene. The function depends not only on the structure of the encoded protein, but also on the context in which this protein exerts its activity. Functional predictions thus require the integration of different levels of information. The possibility to measure the transcriptional response at a genome scale offers exciting perspectives for the discovery of gene function, taking into account the ways genes are associated in functional clusters. By combining regulatory sequence analysis and metabolic pathway analysis, one could obtain two independent and complementary sources of information for these clusters of co-regulated genes. The same methods also apply to clusters of genes obtained from other functional genomics approaches, such as phylogenetic profiles [22] and gene fusion/fission analysis [23-25].

7 Availability Regulatory Sequence Analysis tools are available on the web [26] at the URL http://ucmb.ulb.ac.be/bioinformatics/rsa-tools/. The home page for the EBI project of database on Protein Function and Biochemical Pathways is at http://www.ebi.ac.uk/research/pfbp/. A prototype version of the patwhay analysis tools can be accessed from this site. A prototype version of the visualization tools is available at http://www.soi.city.ac.uk/~msch/. Acknowledgements. Jacques van Helden was funded by the European Commission Contract N0: QLRI-CT-1999-01333.

References [1] DeRisi, J.L., Iyer, V.R. & Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-6 (1997). [2] Brown, P.O. & Botstein, D. Exploring the new world of the genome with DNA microarrays. Nat Genet 21, 33-7 (1999). [3] Eisen, M.B. & Brown, P.O. DNA arrays for analysis of gene expression. Methods Enzymol 303, 179-205 (1999). [4] Tamayo, P. et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96, 2907-12 (1999). [5] Vilo, J., Brazma, A., Jonassen, I. & Ukkonen, E. Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data. ISMB (2000). [6] Wingender, E. et al. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 28, 316-319 (2000).

Application of Regulatory Sequence Analysis and Metabolic Network Analysis

163

[7] Salgado, H. et al. RegulonDB (version 3.0): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 28, 65-67 (2000). [8] van Helden, J. et al. From molecular activities and processes to biological function. Briefings in Bioinformatics in press(2001). [9] van Helden, J. et al. Representing and analysing molecular and cellular function using the computer. Biol Chem 381, 921-35 (2000). [10] Spellman, P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97 (1998). [11] Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998). [12] Gilbert, D., Schroeder, M. & van Helden, J. Interactive visualization and exploration of relationships between biological objects. Trends in Biotechnology 18, 487-495 (2000). [13] van Helden, J., Andre, B. & Collado-Vides, J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281, 827-42 (1998). [14] van Helden, J., Rios, A.F. & Collado-Vides, J. Discovering regulatory elements in noncoding sequences by analysis of spaced dyads. Nucleic Acids Res 28, 1808-18 (2000). [15] Brazma, A., Jonassen, I., Vilo, J. & Ukkonen, E. Predicting gene regulatory elements in silico on a genomic scale. Genome Res 8, 1202-15 (1998). [16] Graber, J.H., Cantor, C.R., Mohr, S.C. & Smith, T.F. Genomic detection of new yeast premRNA 3’-end-processing signals. Nucleic Acids Res 27, 888-94 (1999). [17] Reinert, G. & Schbath, S. Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J Comput Biol 5, 223-53 (1998). [18] van Helden, J., del Olmo, M. & Perez-Ortin, J.E. Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res 28, 1000-10 (2000). [19] Karp, P.D. et al. The EcoCyc and MetaCyc databases. Nucleic Acids Res 28, 56-59 (2000). [20] Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28, 27-30 (2000). [21] Thomas, D. & Surdin-Kerjan, Y. Metabolism of sulfur amino acids in Saccharomyces cerevisiae. Microbiol Mol Biol Rev 61, 503-32 (1997). [22] Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. & Yeates, T.O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96, 4285-8 (1999). [23] Marcotte, E.M. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751-3 (1999). [24] Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O. & Eisenberg, D. A combined algorithm for genome-wide prediction of protein function [see comments]. Nature 402, 83-6 (1999). [25] Enright, A.J., Iliopoulos, I., Kyrpides, N.C. & Ouzounis, C.A. Protein interaction maps for complete genomes based on gene fusion events [see comments]. Nature 402, 86-90 (1999). [26] van Helden, J., Andre, B. & Collado-Vides, J. A web site for the computational analysis of yeast regulatory sequences. Yeast 16, 177-87 (2000).

O. Gascuel, M.-F. Sagot (Eds.): Computational Biology First International Conference on Biology, Informatics, and Mathematics, JOBIM 2000, Montpellier, France, May 3-5, 2000, Selected Papers LNCS 2066 Ordering Information

Table of Contents Title pages in PDF (11 KB) Preface in PDF (26 KB) Conference Organization in PDF (17 KB) Table of Contents in PDF (38 KB) Speeding Up the DIALIGN Multiple Alignment Program by Using the `Greedy Alignment of BIOlogical Sequences LIBrary' (GABIOS-LIB) Saïd Abdeddaïm and Burkhard Morgenstern LNCS 2066, p. 1 ff. Abstract | Full article in PDF (150 KB) GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

http://buffy.lib.unimelb.edu.au:2150/link/service/series/0558/tocs/t2066.htm (1 of 3) [10/18/2002 8:10:49 PM]

Springer LINK: Lecture Notes in Computer Science

G. Broner, B. Spataro, C. Gautier, and F. Rechenmann LNCS 2066, p. 12 ff. Abstract | Full article in PDF (89 KB) Optimal Agreement Supertrees David Bryant LNCS 2066, p. 24 ff. Abstract | Full article in PDF (112 KB) Segmentation by Maximal Predictive Partitioning According to Composition Biases Laurent Guéguen LNCS 2066, p. 32 ff. Abstract | Full article in PDF (176 KB) Can We Have Confidence in a Tree Representation? Alain Guénoche and Henri Garreta LNCS 2066, p. 45 ff. Abstract | Full article in PDF (121 KB) Bayesian Approach to DNA Segmentation into Regions with Different Average Nucleotide Composition Vsevolod Makeev, Vasily Ramensky, Mikhail Gelfand, Mikhail Roytberg, and Vladimir Tumanyan LNCS 2066, p. 57 ff. Abstract | Full article in PDF (141 KB) Exact and Asymptotic Distribution of the Local Score of One i.i.d. Random Sequence Sabine Mercier, Dominique Cellier, François Charlot, and Jean-Jacques Daudin LNCS 2066, p. 74 ff. Abstract | Full article in PDF (147 KB) Phylogenetic Reconstruction Algorithms Based on Weighted 4-Trees Vincent Ranwez and Olivier Gascuel LNCS 2066, p. 84 ff. Abstract | Full article in PDF (207 KB) Computational Complexity of Word Counting Mireille Régnier LNCS 2066, p. 99 ff. Abstract | Full article in PDF (160 KB) EUGÈNE: An Eukaryotic Gene Finder That Combines Several Sources of Evidence Thomas Schiex, Annick Moisan, and Pierre Rouzé LNCS 2066, p. 111 ff. Abstract | Full article in PDF (255 KB) Tree Reconstruction via a Closure Operation on Partial Splits Charles Semple and Mike Steel http://buffy.lib.unimelb.edu.au:2150/link/service/series/0558/tocs/t2066.htm (2 of 3) [10/18/2002 8:10:49 PM]

Springer LINK: Lecture Notes in Computer Science

LNCS 2066, p. 126 ff. Abstract | Full article in PDF (129 KB) InterDB, a Prediction-Oriented Protein Interaction Database for C. elegans Nicolas Thierry-Mieg and Laurent Trilling LNCS 2066, p. 135 ff. Abstract | Full article in PDF (110 KB) Application of Regulatory Sequence Analysis and Metabolic Network Analysis to the Interpretation of Gene Expression Data Jacques van Helden, David Gilbert, Lorenz Wernisch, Michael Schroeder, and Shoshana Wodak LNCS 2066, p. 147 ff. Abstract | Full article in PDF (240 KB) Author Index LNCS 2066, p. 165 Author Index in PDF (14 KB) Online publication: June 28, 2001 [email protected] © Springer-Verlag Berlin Heidelberg 2001

http://buffy.lib.unimelb.edu.au:2150/link/service/series/0558/tocs/t2066.htm (3 of 3) [10/18/2002 8:10:49 PM]

Speeding Up the DIALIGN Multiple Alignment Program by Using the ‘Greedy Alignment of BIOlogical Sequences LIBrary’ (GABIOS-LIB) Sa¨ıd Abdedda¨ım1 and Burkhard Morgenstern2 1

LIFAR - ABISS, Facult´e des Sciences et Techniques, Universit´e de Rouen, 76821 Mont-Saint-Aignan Cedex, France, [email protected] 2 AVENTIS Pharma, Rainham Road South, Essex RM10 7XS, UK. Present address: MIPS, Max-Planck-Institut f¨ ur Biochemie, Am Klopferspitz 18a, 82152 Martinsried Germany [email protected]

Abstract. A sensitive method for multiple sequence alignment should be able to align local motifs that are contained in some but not necessarily in all of the input sequences. In addition, it should be possible to integrate various of such partial local alignments into one single multiple output alignment. This leads to the question of consistency of partial alignments. Based on a new set-theoretical definition of sequence alignment, the consistency problem is discussed theoretically, and a recently developed library of C functions for consistency calculation (GABIOSLIB) is described. GABIOS-LIB has been integrated into the DIALIGN alignment program to carry out consistency tests during the multiple alignment procedure. While the resulting alignments are exactly the same as with the previous version of DIALIGN, the running time of the program has been crucially improved. For large data sets, the new version of DIALIGN is up to 120 times faster than the old version. Availability: http://bibiserv.TechFak.Uni-Bielefeld.DE/dialign/ Keywords: multiple sequence alignment, partial alignment, consistency, consistent equivalence relation, greedy lgorithm.

1

Introduction

Traditionally, there are two different approaches to sequence alignment: global methods that align sequences over their entire length [8,21,26,25] and local methods that try to align the most highly conserved sub-regions of the input sequences [24,23,3,13]. One problem with these approaches is that it is often not known in advance if sequences are globally or only locally related. A versatile alignment tool should align those regions of the input sequences that are sufficiently similar to each other but it would not try to align the non-related parts of the sequences. Thus, such a program would return a global alignment whenever sequences are globally related but a local alignment if only local homology can be O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 1–11, 2001. c Springer-Verlag Berlin Heidelberg 2001

2

S. Abdedda¨ım and B. Morgenstern

detected. One possible way to achieve this is to integrate statistically significant (partial) local alignments P1 , . . . , Pk into one resulting output alignment A. The idea to generate sequence alignments by combining partial alignments of local similarities is not new. Various authors have proposed to generate pairwise local or global alignments by chaining fragment alignments, see Wilbur and Lipman [32], Eppstein et al. [7], and Chao and Miller [4]. These authors have developed time and space efficient fragment-chaining algorithms for near-optimal alignment in the sense of the traditional Needleman-Wunsch [21] and SmithWaterman [24] objective functions. Joseph et al. [11] have proposed a greedy algorithm that is based on statistically significant segment pairs. Algorithms that integrate local alignments have also been proposed for multiple alignment. Here, the problem is to decide whether a collection of local alignments is consistent. Informally, we call a set {P1 , . . . , Pk } of partial alignments consistent if an alignment A of the input sequences exists such that every pair of residues that is aligned by one or several of the alignments Pi is also aligned by A. A formal definition of our notion of consistency will be given in the next section. The question of consistency is easy to decide if each local alignment Pi involves all of the input sequences. Vingron and Argos [30], Vingron and Pevzner [31] and Depiereux et al. [6,5] have proposed multiple alignment methods that search for motifs that are simultaneously contained in all input sequences. In this case, a sufficient condition for consistency is that, for any two local alignments Pi and Pj , either Pi Pj or Pj Pi holds where Pi Pj means that in every sequence, residues aligned by Pi are to the left of residues aligned by Pj . From a biological point of view, however, it is desirable to allow for homologies involving not all but only some of the input sequences. A multiple alignment program that finds only those similarities that are present in all sequences in a given input data set will necessarily miss many biologically important homologies. Recently, we have introduced three heuristics for multiple alignment that integrate partial local alignments not necessarily involving all of the input sequences. These methods generate multiple alignments in a greedy way by incorporating local partial alignments one-by-one into a resulting multiple alignment. SOUNDALIGN [1] assembles multiple alignments from blocks of un-gapped local alignments that may involve two or more sequences, DIALIGN [15,17,18] uses ungapped segment pairs – so-called fragments or diagonals –, and TWOALIGN [2] combines pairwise local alignments in the sense of Smith and Waterman [24] to obtain a final multiple alignment. During the greedy procedures, these three programs have to test new partial alignments for consistency with those alignments that have already been accepted for the final alignment. To this end, they store and update certain data structures that are called transitivity frontiers for SOUNDALIGN and TWOALIGN and consistency bounds for DIALIGN. DIALIGN has been successfully used to detect local homologies in nucleic acid and protein sequences. In a recent study, G¨ ottgens et al. [10] have used DIALIGN to align large genomic sequences from human, mouse and chicken. In the human/mouse alignment, multiple peaks of homology were found, some

Speeding Up the DIALIGN Multiple Alignment Program

3

of which precisely correspond to known enhancers. In addition, the DIALIGN multi-alignment of human, mouse and chicken revealed a new region of homology that was then experimentally shown to be a previously unknown enhancer, see also Miller [14] for a discussion of these results. Thompson et al. [28], have used the BAliBASE of benchmark alignments [27] to systematically compare the most widely used programs for multiple protein alignment. Here, DIALIGN was reported to be the most successful local method. It also performed well on globally related sequence sets though here CLUSTAL W [26], PRRP [9] and SAGA [22] were superior. The paper by Thompson et al., however, addressed also a major weakness of DIALIGN: it is considerably slower than progressive alignment methods. Aligning a set of 89 histone sequences took as much as 13,649 s with DIALIGN compared to 161 s with CLUSTAL. This may not be a serious problem if only a single protein family is studied. However, with the huge amount of genomic sequence data that are now available, automatic alignment of whole data bases of sequence families has become a routine task, see, for example, [12,29]. Here, program running time is a crucial criterion in choosing a suitable alignment method. Test runs have shown that for large sequence sets, the procedure of updating the consistency bounds was by far the most time-consuming step in previous versions of DIALIGN. Abdedda¨ım has recently developed a library of C functions called GABIOS-LIB (Greedy Alignment of BIOlogical Sequences LIBrary) that can be used to efficiently calculate the transitivity frontiers and to consistency-check (partial) local alignments that may have been produced by arbitrary methods. We have integrated GABIOS-LIB into DIALIGN to speed up the consistency check for fragments (segment pairs). In the present paper, the time and space complexity of GABIOS-LIB is analysed theoretically and compared to the method that was previously used in DIALIGN. Experiments with both artificial and real sequence data demonstrate that GABIOS-LIB is far more efficient than the previous procedure. In our test examples, the new version 2.1 of DIALIGN is up to 120 times faster than version 2.0 while the resulting alignments are exactly the same. In addition, GABIOS-LIB has reduced the amount of computer memory used by DIALIGN.

2 2.1

The Consistency Problem for Partial Alignments Definitions and Notations

Let S = {S1 , . . . , SN } be a sequence family and let X be the set of all sites of S where a site x = [i, p] represents the p-th position in the i-th sequence. On X, we define a partial order relation such that for any two sites x = [i, p] and x0 = [i0 , p0 ], x x0 holds if and only if both i = i0 and p ≤ p0 are true. In the language of order theory, is the direct sum of the ‘natural’ linear order relations that are given on the individual sequences. Every binary relation R on X extends the relation to a quasi order relation R = ( ∪R)t on X, where St denotes the transitive closure of a relation S, i.e. the smallest transitive relation containing S, see Fig. 1.

4

S. Abdedda¨ım and B. Morgenstern S1

s

s

S2

s

s

S3

s

u

s

s

s

s

s

s

As

s

wA KA

sA s

A

v

s y

s s

x

s

s

s

s

s

s

z

s

s

s

s

s

s

Fig. 1. A relation R = {(v, w), (x, y)} (represented by arrows) defined on the set X of all sites (black dots) of a sequence family S = {S1 , S2 , S3 } extends the partial order relation on X to a quasi partial ordering R = ( ∪R)t . We have, for example, u v, vRw , w x, xRy, y z and therefore u R z.

We call a relation R on X consistent if the extended relation R preserves the linear order relations on the individual sequences, formally: if all restrictions of R to the individual sequences coincide with their respective ‘natural’ linear order relations. In other words, the requirement is that for any two sites x and y that belong to the same sequence, x R y implies x y. Moreover, we call a set {R1 , . . . , Rn } of relations consistent if the union ∪i Ri is consistent, we say that R1 is consistent with R2 if {R1 , R2 } is consistent, and a pair (x, y) ∈ X 2 is called consistent with a relation R if {(x, y)} is consistent with R. As proposed in [17], a (partial) alignment A of the family S can be defined as a consistent equivalence relation on the set X where we write (x, y) ∈ A or xAy if the sites x and y are either aligned by A or identical. As an equivalence relation, an alignment A partitions X into equivalence classes [x]A = {y ∈ X : (x, y) ∈ A}. It can be shown that an equivalence relation A on X is an alignment in the sense of the above definition if and only if it is possible to introduce gap characters into the sequences Si such that the equivalence classes [x]A , x ∈ X are precisely those sets of sites that are in the same column of the resulting two-dimensional array, see [20] for more details. A common feature of greedy multiple alignment algorithms is that they include partial alignments P1 , . . . , Pk one after the other into a growing multiple alignment – always provided that a new alignment Pi is consistent with those alignments that have been included previously. Formally, a monotonously increasing set A1 ⊂ . . . ⊂ Ak of alignments is defined by A1 = P1 (Ai−1 ∪ Pi )e if Pi is consistent with Ai−1 i = 2, . . . , k. Ai = Ai−1 otherwise,

(1)

A final alignment A is then obtained as the largest alignment A = Ak of this set. Therefore, every greedy alignment approach has to resolve the question of consistency: at any stage of the alignment procedure, it must be known which pairs (x, y) ∈ X 2 are still alignable without leading to inconsistencies with the current alignment Ai , i.e. with those pairs of sites that have already been accepted for the final alignment.

Speeding Up the DIALIGN Multiple Alignment Program

2.2

5

Transitivity Frontiers and Consistency Bounds

An alignment A of a sequence family S imposes for every site x ∈ X and every sequence Si ∈ S a lower bound bA (x, i) and an upper bound bA (x, i) such that a site y = [i, p] ∈ Si is alignable with x without leading to inconsistencies with A if and only if bA (x, i) ≤ p ≤ bA (x, i) holds, see Fig. 2 for an example. Formally, we define bA (x, i) = min{p : (x, [i, p]) consistent with A} and bA (x, i) = max{p : (x, [i, p]) consistent with A}. In order to test fragments for consistency during the greedy procedure, previous versions of DIALIGN calculated and updated these consistency bounds. GABIOS-LIB is using a so-called transitivity frontiers to carry out the same consistency check. Here, the predecessor frontier P redA (x, i) is defined as the position of the right-most site y in sequence Si such that y A x is true, and the successor frontier SuccA (x, i) is defined accordingly as the position of the left-most site y in sequence Si with x A y so we have P redA (x, i) = max{p : [i, p] A x} and SuccA (x, i) = min{p : x A [i, p]}.

1

2

3

4

5

6

S1

s

s

s

s

s

s

S2

s

s

s

s

s

S3

s

s

s

s

s

S4

s

s

s

s

x

s

u

s

7 v

s s

s

s

s

s

8

12

13

s s s s s J J J J J s J s J s J s s J J J s sJ sJ sJ s J J J s s Js Js J s

s

wJ

9

10

11

s s s

Fig. 2. For an alignment A (bold lines) and a site x, the transitivity frontiers with respect to a sequence Si coincide with the corresponding consistency bounds if x is aligned to some site in Si . For example, site u in S2 is aligned with x, so we have bA (x, 2) = SuccA (x, 2) = 6. In sequence S1 , on the other hand, v is the right-most site that can be aligned with x, but w is the left-most site with x A w, so we have bA (x, 1) = 7 but SuccA (x, 1) = 8.

6

S. Abdedda¨ım and B. Morgenstern

The transitivity frontiers are related to the consistency bounds in the following way: if x is already aligned to some site [i, p] in sequence Si , then the predecessor and successor frontiers of x with respect to Si both equal p and they coincide with the consistency bounds, i.e. one has P redA (x, i) = SuccA (x, i) = bA (x, i) = bA (x, i) = p. This is, for example, the case for the frontiers and bounds of x with respect to S2 in Fig. 2. In contrast, if no site in Si is aligned with x, one has P redA (x, i) = bA (x, i) − 1 and

SuccA (x, i) = bA (x, i) + 1

as is the case with the corresponding frontiers and bounds with respect to S1 in Fig. 2. Therefore, if it is known for every site x and every sequence Si whether x is aligned with some site in Si , transitivity frontiers are easily obtained from the consistency bounds and vice versa, so both data structures are equivalent in that they contain the same information about which residue pairs are alignable under the consistency constraints imposed by a given alignment A.

3

GABIOS-LIB

The Greedy Alignment of BIOlogical Sequences LIBrary (GABIOS-LIB) is a set of functions implemented in ANSI C by Abdedda¨ım. These functions can be used by any greedy alignment program in order to test in constant time which sites in a sequence Sj are alignable with a site x of an other sequence Si . Each time two sites are aligned during the greedy procedure, GABIOS-LIB updates the transitivity frontiers using the incremental algorithm EdgeAddition presented in [2]. In addition, GABIOS-LIB uses some ideas first introduced in [1] to further reduce computing time and memory. In this section, we discuss how the successor frontiers SuccA (x, i) for an alignment A are affected if a (partial) alignment P is added to A and how the frontiers SuccB (x, i) for the resulting alignment B = (A ∪ P )e can be calculated. For symmetry reasons, all results apply to the predecessor frontiers as well. 3.1

The Incremental Algorithm EdgeAddition

First of all, a simple but important observation is that if a new partial alignment P is added to an existing alignment A, the frontiers with respect to the new alignment B = (A ∪ P )e need to be calculated only if B is actually different from A which is the case if and only if P is not already a subset of A. EdgeAddition stores for every site x and every sequence Sj the information whether x is already aligned with some residue from sequence Sj ; this information is used to check if a new alignment P is already contained in A.

Speeding Up the DIALIGN Multiple Alignment Program

7

If and only if P 6⊂ A holds, the transitivity frontiers SuccB are different from the frontiers SuccA . In general, however, the frontiers will change not for all but only for some sites, and the computing time can be minimized by identifying those sites. For simplicity, we consider the simplest case where a single pair of sites (x, y) is added to A. Observation 1 Let A be an alignment of a sequence family S and (x, y) a pair of sites that is consistent with A. Let B = (A ∪ {(x, y)})e be the alignment that is obtained by ‘adding’ (x, y) to A. Then for every site u in a sequence Si and for every sequence Sj the successor frontiers of B are min{SuccA (u, j), SuccA (y, j)} if u A x SuccB (u, j) = min{SuccA (u, j), SuccA (x, j)} if u A y otherwise. SuccA (u, j) It follows that the transitivity frontiers can change only for those sites u for which either u A x or u A y is true. 3.2

Further Reduction of Computing Time and Memory

In order to further reduce the computational costs for calculating the consistency frontiers, GABIOS-LIB uses the following two facts. (1) If two sites x and y are aligned by A, i.e. if xAy holds, then they have necessarily the same frontiers SuccA (x, i) = SuccA (y, i) for i = 1, . . . , N . Therefore, rather than processing the transitivity frontiers for all individual sites x, GABIOS-LIB stores and updates the frontiers for those equivalence classes [x]A that consist of more than one single site. (2) Let x = [i, p] be an orphan site in the i-th sequence, i.e. a site that is not aligned with any other site y 6= x. Then the successor frontiers SuccA (x, j) with respect to all sequences Sj 6= Si , coincide with the corresponding frontiers of the left-most non-orphan site y = [i, p0 ] with p0 > p. In Fig. 2, for example, v = [1, 7] is an orphan site in S1 . The left-most non-orphan site [1, p0 ] with p0 > 7, is the site w = [1, 8]. Therefore for j 6= 1, the successor frontiers SuccA (v, j) coincide with the corresponding frontiers for w, i.e. we have SuccA (v, j) = SuccA (w, j) for j = 2, 3, 4. Thus, instead of storing the transitivity frontiers for an orphan site x, the corresponding site y can be stored in a tabular nextClass by defining nextClass[x]= p0 . This way, the frontiers SuccA (x, j) of an orphan site x can be established in constant time each time a new pair of sites is aligned.

4

Time and Space Efficient Multiple Segment-by-Segment Alignment

In order to construct a multiple alignment of a sequence family S, DIALIGN calculates in a first step optimal pairwise alignments for all possible pairs of input sequences as explained in [16]. Optimality, in this context, refers to the segment-based objective function used in DIALIGN as defined in [19,15], i.e. an optimal pairwise alignment is a chain of fragments (gap-free segment pairs)

8

S. Abdedda¨ım and B. Morgenstern

Table 1. Running time t0 and t1 of versions 2.0 and 2.1 of DIALIGN. Version 2.0 is using the old method of calculating the consistency bounds while version 2.1 is calculating the transitivity frontiers using GABIOS-LIB. Sequences are (I) independent and (II) identical random sequences of length 100, (III) Immunoglobulin domain sequences of average length 63.3 and 20% average identity, and (IV) Ribosomal protein L7/L12 C-terminal domain sequences of average length 119.3 and 53% average identity. The running time improvement achieved by GABIOS-LIB strongly increases with the number N of sequences to be aligned. For the two sets of real-world sequences, figures are comparable to the case of identical random sequences where an improvement of up to 120 was achieved. DIALIGN 2.0 t0

t0 /N 3

25 50 (I) 75 100 150 200

22 168 601 1402 5725 14353

1.4 1.3 1.4 1.4 1.6 1.7

−3

10 10−3 10−3 10−3 10−3 10−3

25 50 (II) 75 100 150 200

46 640 3282 10557 52431 177423

2.9 5.1 7.7 1.0 1.5 2.2

25 (III) 50 75 100

25 341 1597 5046

25 (IV) 50 75 100

61 844 4157 11619

N

DIALIGN 2.1

t0 /N 4

t1

2.0/2.1

t1 /N 2

t1 /N 3

t0 /t1

−4

5.6 2.6 1.8 1.4 1.1 8.9

−5

10 10−5 10−5 10−5 10−5 10−6

10 41 100 220 576 1874

0.016 0.016 0.017 0.022 0.025 0.046

6.4 3.3 2.3 2.2 1.7 2.3

10 10−4 10−4 10−4 10−4 10−4

2.2 4.1 6.0 6.8 10.0 7.6

10−3 10−3 10−3 10−2 10−2 10−2

1.1 1.0 1.0 1.0 1.0 1.1

10−4 10−4 10−4 10−4 10−4 10−4

10 43 104 200 555 1429

0.016 0.017 0.018 0.020 0.024 0.035

6.5 3.4 2.4 2.0 1.6 1.7

10−4 10−4 10−4 10−4 10−4 10−4

4.6 14.9 31.6 52.8 94.5 124.2

1.6 2.7 3.7 5.0

10−3 10−3 10−3 10−3

6.4 5.4 5.0 5.0

10−5 10−5 10−5 10−5

7 29 73 147

0.011 0.011 0.012 0.014

4.4 2.3 1.7 1.4

10−4 10−4 10−4 10−4

3.5 11.7 21.8 34.3

3.9 6.7 9.8 1.1

10−3 10−3 10−3 10−2

1.5 1.3 1.3 1.1

10−4 10−4 10−4 10−4

14 63 149 288

0.022 0.025 0.026 0.028

8.9 5.0 3.5 2.8

10−4 10−4 10−4 10−4

4.3 13.3 27.8 40.3

Pk f1 . . . fk such that i=1 w(fi ) is maximal. Here, w is a weighting function defined on the set of all possible fragments, and fi fj means that the end positions of fi are both strictly smaller than the respective beginning positions of fj . Fragments from the respective optimal pairwise alignments are then greedily integrated into a resulting multiple alignment A. Let L be the maximum length of the sequences S1 , . . . , SN . Since O(N 2 ) pairwise alignments are performed each of which consisting of at most L fragments, O(N 2 L) fragments are to be checked for consistency. Updating the consistency bounds takes O(N 2 ) time if a new fragment is accepted for alignment. Since previous versions of DIALIGN calculated the consistency bounds for every fragment that was accepted for alignment, the

Speeding Up the DIALIGN Multiple Alignment Program

9

Table 2. Number of integers allocated by DIALIGN 2.0 and DIALIGN 2.1 for storing consistency bounds (#CB) and transitivity frontiers (#T F ), respectively. Sequence sets are as in Table 1. N 25 50 75 100 25 50 75 100

(I)

(II)

#CB

#T F

#CB #T F

125,000 500,000 1,125,000 2,000,000

5,000 10,000 15,000 20,000

25.0 50.0 75.0 100.0

125,000 500,000 1,125,000 2,000,000

28,050 106,800 204,150 375,000

4.5 4.7 5.5 5.3

(III)

(IV)

#CB

#T F

#CB #T F

85,700 346,600 780,000 1,382,800

11,500 43,900 76,650 142,400

7.5 7.9 10.2 9.7

148,800 609,600 1,356,450 2,403,800

13,300 75,600 145,350 212,200

11.2 8.1 9.3 11.3

worst-case time complexity of the entire greedy procedure was O(N 4 L). Table 1 shows that the running time of DIALIGN 2.0 is in fact proportional to N 4 if identical sequences are aligned where all fragments from the optimal pairwise alignments are necessarily consistent and are therefore integrated into A. Let us consider a consistent set of pairs {(x1 , y1 ), . . . , (xm , ym )} that are successively integrated into a growing set of alignments A1 ⊂ . . . ⊂ Am by defining Ai = {(x1 , y1 ), . . . , (xi , yi )}e . It was shown in [2] that if each pair (xi , yi ) is actually new, i.e. not yet contained in the previous alignment Ai−1 , then GABIOS-LIB takes O(N 2 m + |X|2 ) time to update the transitivity frontiers during the procedure of integrating all m pairs. It was also explained in [2] that there can be at most |X| ‘new’ pairs of sites – every additional pair would either be inconsistent or would already be contained in the current alignment. Therefore, GABIOS-LIB takes O(N 2 |X| + |X|2 ) = O(N 3 L + N 2 L2 ) time to compute the transitivity frontiers while integrating an arbitrary set of pairs. Note that this complexity analysis is a worst-case estimate. As shown in Table 1, the real running time of GABIOS-LIB is better than O(N 3 ). The new version 2.1 of DIALIGN is still slower than the most widely used multi-alignment program CLUSTAL W [26], but the difference in running time is now reduced to a factor of about 10. Parameter optimization should further decrease the running time of DIALIGN. In the previous version of DIALIGN, the consistency bounds were stored for every site x ∈ X. Since for every x, 2N integer values had to be considered – upper and lower bounds with respect to all N sequences –, computer memory had to be allocated for exactly 2N |X| integer values. By contrast, GABIOSLIB stores 2N transitivity frontiers only for non-orphan equivalence classes. For orphan sites, single integers are stored that refer to the next non-orphan site in the same sequence. In the worst case, the number of non-orphan equivalence classes is in the order of |X|, so the space complexity for GABIOS-LIB is upper-

10

S. Abdedda¨ım and B. Morgenstern

bounded by O(|X|N ) = O(N 2 L) which was also the real space complexity of the old version of DIALIGN. Table 2 shows, however, that the actual memory requirement for storing the transitivity frontiers with GABIOS-LIB is far smaller than for storing the consistency bounds in DIALING 2.0.

References 1. S. Abdedda¨ım. Fast and sound two-step algorithms for multiple alignment of nucleic sequences. In Proceedings of the IEEE International Joint Symposia on Intelligence and Systems, pages 4–11, 1996. 2. S. Abdedda¨ım. Incremental computation of transitive closure and greedy alignment. In Proc. of 8-th Annual Symposium on Combinatorial Pattern Matching, volume 1264 of Lecture Notes in Computer Science, pages 167–179, 1997. 3. S. F. Altschul, W. Gish, W. Miller, E. M. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, 1990. 4. K.-M. Chao and W. Miller. Linear-space algorithms that build local alignments from fragments. Algorithmica, 13:106–134, 1995. 5. E. Depiereux, G. Baudoux, P. Briffeuil, I. Reginster, X. D. Boll, C. Vinals, and E. Feytmans. Match-Box server: a multiple sequence alignment tool placing emphasis on reliability. CABIOS, 13:249–256, 1997. 6. E. Depiereux and E. Feytmans. Match-box: a fundamentally new algorithm for the simultaneous alignment of several protein sequences. CABIOS, 8:501–509, 1992. 7. D. Eppstein, Z. Galil, R. Giancarlo, and G. Italiano. Sparse dynamic programming I: Linear cost functions. J. Assoc. Comput. Mach., 39:519–545, 1992. 8. O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162:705–708, 1982. 9. O. Gotoh. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264:823–838, 1996. 10. B. G¨ ottgens, L. Barton, J. Gilbert, A. Bench, M. Sanchez, S. Bahn, S. Mistry, D. Grafham, A. McMurray, M. Vaudin, E. Amaya, D. Bentley, and A. Green. Analysis of vertebrate scl loci identifies conserved enhancers. Nature Biotechnology, 18:181–186, 2000. 11. D. Joseph, J. Meidanis, and P. Tiwari. Determining DNA sequence similarity using maximum independent set algorithms for interval graphs. Lecture Notes in Computer Science, 621:326–337, 1992. 12. A. Krause, P. Nicod`eme, E. Bornberg-Bauer, M. Rehmsmeier, and M. Vingron. Www access to the systers protein sequence cluster set. Bioinformatics, 15:262–263, 1999. 13. C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment. Science, 262(5131):208–14, 1993. 14. W. Miller. So many genomes, so little time. Nature Biotechnology, 18:148–149, 2000. 15. B. Morgenstern. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15:211–218, 1999. 16. B. Morgenstern. A space-efficient algorithm for aligning large genomic sequences. Bioinformatics, in press.

Speeding Up the DIALIGN Multiple Alignment Program

11

17. B. Morgenstern, A. W. M. Dress, and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA, 93:12098–12103, 1996. 18. B. Morgenstern, K. Frech, A. W. M. Dress, and T. Werner. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14:290–294, 1998. 19. B. Morgenstern, K. Hahn, W. R. Atchley, and A. W. M. Dress. Segment-based scores for pairwise and multiple sequence alignments. In J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen, editors, Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology, pages 115– 121, Menlo Parc, CA, 1998. AAAI Press. 20. B. Morgenstern, J. Stoye, and A. W. M. Dress. Consistent equivalence relations: a set-theoretical framework for multiple sequence alignment. Materialien und Preprints 133, University of Bielefeld, 1999. 21. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453, 1970. 22. C. Notredame and D. Higgins. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Research, 24:1515 – 1524, 1996. 23. W. R. Pearson and D. J.Lipman. Improved tools for biological sequence comparison. Proc. Nat. Acad. Sci. USA, 85:2444–2448, 1988. 24. T. F. Smith and M. S. Waterman. Comparison of biosequences. Advances in Applied Mathematics, 2:482–489, 1981. 25. J. Stoye. Multiple sequence alignment with the divide-and-conquer method. Gene, 211:GC45–GC56, 1998. 26. J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673–4680, 1994. 27. J. D. Thompson, F. Plewniak, and O. Poch. BAliBASE: A benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999. 28. J. D. Thompson, F. Plewniak, and O. Poch. A comprehensive comparison of protein sequence alignment programs. Nucleic Acids Research, 27:2682–2690, 1999. 29. J. D. Thompson, F. Plewniak, J.-C. Thierry, and O. Poch. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Research, 28:2919–2926, 2000. 30. M. Vingron and P. Argos. Motif recognition and alignment for many sequences by comparison of dot-matrices. J Mol Biol, 218(1):33–43, 1991. 31. M. Vingron and P. Pevzner. Multiple sequence comparison and consistency on multipartite graphs. Advances in Applied Mathematics, 16:1–22, 1995. 32. J. W. Wilbur and D. J. Lipman. The context dependent comparison of biological sequences. SIAM J. Appl. Math., 44:557–567, 1984.

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes 1,2

1

1

2

Gisèle Bronner, Bruno Spataro, Christian Gautier, and François Rechenmann 1

Laboratoire de Biométrie et Biologie Évolutive, UMR – CNRS 5558, Université Claude Bernard Lyon1 43, bd du 11 novembre 1918, 69622 Villeurbanne cedex, France {bronner,bspataro,cgautier}@biomserv.univ-lyon1.fr 2 INRIA Rhône Alpes 655 av. de l’Europe, 38330 Montbonnot St Martin, France Francois.Rechenmann[email protected]

Abstract. Representing genomic mapping knowledge is a complex process due to the diversity of experimental approaches. The lack of obvious equivalence between different kinds of maps1 makes it difficult to use them simultaneously or in a cooperative way. Furthermore, comparative mapping is not limited to map comparison, it also implies the comparison of the elements they are composed of. Comparison thus implies the combination of different computational objects, corresponding to different levels of information. However, from the user’s point of view, systems devoted to comparative mapping should appear to process queries in a homogeneous manner, whatever the objects they imply. We propose a data model that integrates mapping data, genetical data and sequences and takes into account the specific constraints resulting from the comparison of maps between and within different species.

1

Introduction

Comparative interspecies analysis of nucleic or proteinic sequences is helpful in understanding genome evolution and function. Comparisons can be made at different levels - sequence comparisons: codon usage, alignments, phylogeny; - function comparisons: protein functions, functional domains; - structure comparisons: DNA structures, protein strutures, genome structure. The relative position of interesting genes and markers on chromosomes can be drawn on genomic maps to identify regions where candidate sequences corresponding to the markers should lie. A large number of mapping projects are providing increasing amounts of information on the localizations of biological elements. Such information, that is the position of elements on a chromosome, can be compared as sequences or biological elements. An initial analysis suggests that chromosomic structures are relatively conserved in vertebrates [1], allowing prediction of gene positions among 1

Definitions for map, markers and other biological concepts can be found in the Appendix.

O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 12-23, 2001. © Springer-Verlag Berlin Heidelberg 2001

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

13

species. As a consequence, mutual enrichment of maps from different species greatly improves the mapping approach: reference species can be used to help in the analysis of species of interest. Comparative mapping thus emerges as a powerful tool to gather new knowledge about genomes. Present mapping data management systems allow users to access biological data on single objects, but do not permit a global analysis of biological objects according to their genomic localization. However, the interest of comparative mapping is not limited to comparisons of gene localization. It can also help in resolving biological problems. For example, comparative mapping studying the C. elegans daf-18 ortholog has been used to characterize possible phenotypes associated with mutations of the human PTEN tumor suppressor gene [2]. Because of the lack of adequate tools for comparative mapping, we are developing GeMCore, a knowledge base dedicated to the management and comparison of mapping data, as well as functional and sequence data on mammals. GeMCore is part of the GeM project (for Genomic Mapping) which consists in the development of a core knowledge base (GeMCore) connected to graphical interfaces devoted to particular domains. Parallel to GeMCore, the user interface GeMME (for Genomic Mapping and Molecular Evolution) [3], is being developed. GeMME allows queries on several data types such as gene symbol, localization or expression of genes as well as queries related to sequence composition or evolution. It proposes numerical and graphical representation of data allowing visual display of synteny groups and mapto-map switching between species. GeMME contains several analytical tools devoted to molecular evolution. These tools can, for example, highlight specific spatial patterns of the human genome such as isochore [4] or autocorrelation of the silent mutation rate [5]. In the next sections, we will first present the requirements of any system dedicated to comparative mapping, then will describe the data model used in GeMCore.

2

Requirements of Mapping Data Systems

Comparative genomics involving many species requires the use of many data sources because: - each mapping project has its own official database such as the Genome Data Base (GDB) [6] for the human genome or the Mouse Genome Database (MGD) [7] for the mouse genome; - many databases of particular interest exist for different species (GDB, LDB [8], OMIM [9]). Furthermore, the diversity of experimental genome mapping approaches (genetic, cytogenetic, physical RH maps...) produces different types of maps. These maps are not equivalent, and thus define different points of view of the same object. The diversity and heterogeneity of this information (heterogeneity of the data themselves but also of their storage format) is difficult to manage. Three major problems can be described: Data inconsistency: mapping data comparisons can exhibit important contradictions among data, and sometime real errors. There are two kinds of inconsistency. Inconsistency can result from an erroneous representation of the data: in this case, the information is correct but expressed wrongly. For example, the insulin (Ins1) gene,

14

G. Bronner et al.

which has been assigned to the 11p15 region on the human genome, was drawn around the 11p12-11p14 region in the GDB chromosome 11 integrated map. Inconsistency can also be intrinsic to the data. For example, the order of markers on a map can differ depending on the mapping method that is considered (some mapping methods based on probabilistic inference sometimes infer non-consensual marker order). Here, inconsistency cannot be suppressed by correcting the information. Thus, the system must propose explicit strategies of inconsistency management and integrate new user choices in these strategies. Lack of formal links between data: for the human and mouse genomes, links between localization and sequences, genes and gene products are sometimes approximate and far from complete. For other species, the situation is even worse. There have been some attempts to automatically generate these links through semantic analysis of databank annotations. This led to the Virgil [10] or the LocusLink [11] protocols for the human genome. This strategy can associate the genetic name of a gene (and so its localization) to a genomic sequence, but cannot discriminate between different coding sequences within the same fragment. Moreover, alternative splicing which is very frequent in eukaryotes, makes it even more complex to assign a genetic name to a coding sequence. Data redundancy: parallel gathering of similar data from different sources generates strong redundancy in banks. As an example, the human insulin gene sequence is recorded four times in Genbank. Comparative mapping data management systems must be able to handle data diversity, data sources and constraints inherent to such information. They are not limited to gathering and integrating pre-existing data, but consist in distributing a clarified, completed and organized information set capable of reducing the difficulties described above. This pre-processing implies the development of procedures specific to this stage of knowledge base building. This phase also requires the use of external programs such as Clustal [12] or Blast [13, 14] that should not be integrated in the knowledge base. As the data gathering protocol is designed for a specific problematic, it does not meet our needs for producing a generic knowledge base, although a specific interface devoted to the data gathering protocol has been developed. This interface handles some aspects of data consistency management, redundancy avoidance and data linkage as well reorganization of data gathered from diverse databases into a common format. Because certain procedures such as identification of homologies require manual expertise, the interface is currently only partly automated. This procedure allows the building of an integrated dataset. It will constitute the starting point of the data gathering of the knowledge base. As the GeMCore implementation is still in progress, this dataset is currently used to test the system. The dataset consist in cytogenetical descriptions of human and mouse chromosomes, location data for human and mouse genes, sequences and sequence analysis (such as G+C content, CDS length, alignment length, substitution rates...), expression data, functional homology data, evolutive relations and recombination rates. In this dataset, links between genes and coding sequences are built from the NCBI LocusLink file for human and the MGD MRK_Sequence.rpt file for mouse. The relevance of all associations between a CDS (Coding DNA Sequence) and a gene symbol are evaluated. To discriminate between coding sequences associated to the

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

15

Table 1. Description of the dataset used in the evaluation of GeMCore. data locations

Source human mouse

gene/sequence links

+

7203 genes

locuslink4, Hovergen5 MGD, Hovergen

11040 links

sequence analysis

Hovergen, Gccompute

22883 seq.

CDS analysis

Hovergen, Gccompute

21066 seq.

homology

MGD

4044 cples of genes

orthology – substitution rates

3998 cples

human

Orthogen, JaDis6 LDB

mouse

MGD

20 chr.

cytogenetic maps

human

file size 4964 genes

LDB 2 MGD3

mouse

14172 links

23 chr.

same gene symbol, we use similarity information expressed through the family and redundancy numbers of the Hovergen database [15]. Hovergen redundancy numbers group under a common annotation redundant sequences from Genbank. Two sequences are said to be redundant if they have less than 1% divergence and at most 3 nucleotides of difference in length. The family numbers group homologous sequences (orthologous or paralogous genes resulting from duplication anterior to the divergence of vertebrates) under the same identifier. For each locus symbol, LocusLink provides a list of accession numbers, that is a list of genomic fragments of Genbank. A method has been developed to generate a list of relevant couples of gene symbol / gene sequence (CDS) extracted from these fragments. First, all links between a gene symbol and a CDS are listed. For each symbol, if all CDS have the same redundancy number, they are assumed to be several entries of the same gene and are all associated to the locus symbol. If several redundancy numbers are associated to the same gene symbol, we search for a CDS having the locus symbol cited in the corresponding fields of Genbank annotations. If found, only the CDS with this redundancy number are associated to the locus symbol. To keep most of the LocusLink information, we introduce two “weak” links between locus symbols and CDS belonging to the genomic fragment associated in LocusLink: -

“alternative” CDS have a different redundancy number than the one associated to the linked CDS but belong to the same family “putative” CDS correspond to cases where all CDS belong to the same family, but have different redundancy number and no locus symbol in Genbank annotations.

2 Location DataBase - http://cedar.genetics.soton.ac.uk/public_html/ldb.html 3 Mouse Genome Database - http://www.informatics.jax.org 4 LocusLink - http://ncbi.nlm.nih.gov/LocusLink/index.html 5 Homologous Vertebrate Gene Database http://pbil.univ-lyon1.fr/hovergen.html 6 JaDis - http://pbil.univ-lyon1.fr/software/jadis.html

16

G. Bronner et al.

Other cases, and particularly those where inconsistencies are present are also carefully noted. Presently, this data gathering procedure applies exclusively to genes, but we intend to extend it to other genomic elements, first of all to anonymous markers that are mapped in many map types (this would allow the user to directly compare different maps types), but also biologically informative sites such as satellites, breakpoints, CpG islands etc.

3

What Kind of System for Comparative Mapping?

A system managing comparative mapping knowledge must allow comparisons of marker position in several maps, but also the mapping of biological properties (G+C content, expression…) and their inter specific comparisons. This implies constraints both on data representation and querying capabilities. Very few databases are dedicated to comparative mapping. Present systems offer different levels of integration of mapping information. We can mention the Animal Genome database in Japan [16], which focuses on comparative gene mapping among pig, cattle, mouse and human. This database contains information concerning genes related to phenotypic traits and any genetic and cytogenetic localization known. Graphical map display is also available. It allows a one locus access, though precluding global analysis. Homology relations between genes are stipulated in the database, however interspecific links between data are not always symmetrical. There is also MICADO [17], a database that manages bacterial genomic information. MICADO links eubacterial and archaebacterial sequences, genetic maps and information on mutants. A graphical interface allows the parallel comparison of genomes. Hyperlinks are created as a complement, to reference other general and specialized related information resources. Databases based on the ACE model [18] such as AceDB [18] or AtDB [20] allow the comparison of different maps within a species. To a lesser extent, GDB, MGD or Bovmap [21] can also be cited. All of them are in fact specific databases, though they should not be considered as comparative mapping oriented databases. However, in addition to links to other data resources, they contain modules to compare maps among species. Most of present systems thus use a simplified representation of mapping data; queries are mainly limited to intuitive navigation through information. The more complete system (ACE) is the only one allowing recursive representation of mapping (a map becomes a marker in a larger map) but remains monospecific. Examining the querying capabilities is the simplest way to point out limitations of present systems. None of them are able to answer “simple” requests such as: " list pairs of homologous genes localized on the murin chromosome 2 and the human chromosome 1" or " get all human, complete genes sequences with a known orthologous gene in mouse, which has a sequenced neighboring gene at less than 1 cM" Moreover, user-controlled inferences have to be applied to generate the complete answer to these questions. For example, reasoning can enrich one type of map from another to produce a more complete answer (see Fig. 1). Reasoning can also apply to

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

17

inference of markers themselves, for example repetitive elements. User control is also necessary here, particularly when no consensual definition exists, e.g. CpG islands. Finally, the user should find analysis tools that will enable him to answer biological questions. Let us take an example: the study of regional recombination rates. The idea is to exhibit the features of chromosomal regions that present variations in the frequency of the recombination events. To answer this question, genetical distances are to be expressed as a function of physical position in order to obtain a recombination profile along the chromosome. Presently both physical and genetical localization are not associated to genes, but to anonymous fragments of DNA. To associate genic information to recombination frequencies, mapped genes have to be added to the recombination profile. Finally, the analysis of sequences associated to genes makes it possible to study the variations of recombination levels along the chromosome in relation to genic contextual features. a)

cM

cM

b)

Ma. 4

Ma. 3

? Ma. 1

Ma. 2

Seq. 3 Seq. 2

Seq. 1

Fig. 1. Non explicit information can be extracted from positional data. (a) - Both physical (bottom) and genetic (top) localizations may be available for some genes or markers, but when there is only a physical localization, it can be used to infer the genetic localization. Because the relation between genetic and physical maps is not linear, such inference is not obvious. In order to make it more clear, a preliminary modeling operation is required to establish the relation between physical and genetic mapping information. (b) - Local connection of nucleotide sequences can also be used to infer element localization, although this implies taking into consideration orthologous elements. Consequently, considering mapping information, as well as evolutive relations, is a prerequisite to inferring positional information from sequence appendings.

4

What Kind of Model for Comparative Mapping?

As seen above, complex information of very different types must be managed. The complexity results from the diversity of the information, but also from the huge number of interrelations between this information. The chosen model must take into account both type complexity and the central role played by relations. RDBMS (Relational Data Base Management Systems) manage relations efficiently, so they can produce strong and reliable data models. Nevertheless, the relational model is not adapted to complex data schemes. Conceived for tabulated data, it cannot express the complexity of inter-relations of biological objects (see the evolution of the GDB data model [22]). ACEDB is a model dedicated to genomic mapping in one species. As we have seen, its query capabilities and information representation are not sufficient for com-

18

G. Bronner et al.

parative mapping. More generally, the developments needed appear to be too large to construct a new dedicated model, even if these models are always more efficient in the context of their development (see ACEDB or ACNUC [23] for sequence management). OODBMS (Object Oriented Data Base Management Systems) particularly fit for representations of biological concepts: they allow a coupling of the data and the available operators, thus producing a relatively intuitive model of the biological entities represented. Computational object manipulation mimics biological behavior. As genomics is evolving due to huge genome mapping and sequencing projects, data management systems must be flexible and capable of easily integrating new fields of knowledge. Object systems have such capabilities, allowing the construction of modular and hierarchical structures, and thus providing real adaptability to complex data. Nevertheless, relations cannot be clearly defined. In order to express relations within an object oriented system, we designed a conceptual data scheme based on UML (Unified Modeling Language) definitions. This is an object oriented data model that integrates and expresses relations as peculiar objects. Ongoing research at INRIA Rhône-Alpes on AROM [24], an implementation of an UML-like modeling language as a knowledge representation system, is being used for GeMCore development.

5

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

Avoiding redundancy is crucial because redundancy produces inconsistency (when the information to be corrected is recorded many times, some recordings may go uncorrected). Inconsistency management is even more difficult if this information lies in different data structures. Hierarchical structuring of data gives a unique representation of information shared by different elements of the system, thus avoiding the multiplication of storage spaces. Furthermore, when peculiar structures associated to specific information are defined, it is easier to evaluate inconsistency than if the same type of data were stored in heterogeneous structures. In the GeMCore data model, three major hierarchy classes have been defined (Fig. 2): - the Element hierarchy: it structures a collection of genomic elements such as genes, satellites or transposable elements. This hierarchy separates the elements that have a biological reality from artificial elements produced by genome projects (e.g. clones or expressed sequence tags - EST). Biological elements are defined according to their function. Non-informative elements such as anonymous DNA are separated from functional elements. Genes are distinguished from functional elements that are not transcribed. Pseudogenes which are defined at the same level as Genes, inherit from Genes_Related elements. -

the Map hierarchy: three sub-types are defined. Two of them (Physical_map and Genetical_map) correspond exactly to the biological concepts they represent. The third type named Set concerns maps that contain sets of objects that do not have any exact localization. This last group of maps is particularly useful to represent

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

Composed Sequence

19

Ordered Not-ordered

Elementary

SEQUENCES Gene Gene Related

Is Sequenced Genomic Element

Transposable

Elements

EST G.P.Element

Clone

ELEMENTS Is Mapped Physical Map Map

Genetic Map

Pseudogene

Chromosome Element A

ClasseI

Is Syntenic

Is Homolog

Element B

Element C

Continuous Contig Absolute

Map A Allen

Relative Set

Cytog.

Map B

RH MAPS

Fig. 2. Simplified scheme of the GeMCore data model. An entity, for example a gene, can be seen as a functional element, a marker on a map or a sequence depending on the objective of the user. In GeMCore, these different levels of information are described in three hierarchy classes: Sequence, Map and Element (left). To express the links existing between these three kinds of objects, two associations (inserts) were defined. Is_Mapped and Is_sequenced join maps and sequences to the biological element respectively. Associations specific to some data levels are also defined (right). Evolutive relations (Is_Homolog) are defined for biological elements. The Is_syntenic association expresses that elements are on the same chromosome, whereas Allen association allows reasoning on the relative position of maps

cytogenetic bands or hybrid radiation segments. It also allows users to define their own sets of elements with regard to the question they want to answer (for example, the map of disease genes). -

the Sequence hierarchy: it organizes sequences through a hierarchy that distinguishes elementary from non elementary sequences. Elementary sequences are made of a unique nucleic acid fragment, while Composed sequences correspond to entries that describe collections of fragments that are more or less ordered (currently the unfinished genome sequences from the academic human genome project consortium). Some complementary classes such as Species, Protein or Information, have been defined in addition to the hierarchies described above. Basically, two kinds of relations are defined in the model: -

Comparison relations express evolutive relations (Fig. 3) or similarity relations depending on the objects that are linked (biological objects or nucleotide sequences).

20

G. Bronner et al.

Homology Orthology

Paralogy

Fig. 3. Homology relations can be specialized. Orthology, which links two genes that result from a speciation event, can be distinguished from paralogy that links elements deriving from a duplication event. Both naturally inherit from the homology relation.

- Positional relations defined by Allen algebra [25]. Allen introduced a formalism in order to represent intervals by their ends (that is two coordinates on one axis) and suggested the use of constraint satisfaction to reason on these relations. In GeMCore, these relations are structured in a hierarchy (Fig. 4).

Fig. 4. (a) - According to the Allen formalism, for the A gene to be before the B gene, it is sufficient that the upper coordinate of B be bigger than the lower coordinate of A. (b) - With his formalism, Allen defines thirteen binary relations between interval pairs. In the GeMCore Model, these relations are summarized and structured into a hierarchy where the lower the hierarchy level, the more precise the features of the relative positions of the related elements.

As for classes, certain relations lie outside the hierarchies (codes_for, is_product_of, belongs_to_species...). The three hierarchy classes have been defined to facilitate legibility, making it easier to detect inconsistencies and make updates. Since data are separated between three independent structures, one hierarchy can be updated independently of other data structures. This is particularly interesting when data come from different sources that are not updated at the same time. However, the model splits information concerning a given biological entity into three hierarchies, so that it no longer exists as a single entity within the system. The biological notions, the sequence data and the localization relative to a gene belong to a unique entity, which has to be expressed in the model. This is done by "corresponding" relations defined between the biological object, the map and the sequence (Fig. 5).

6

Conclusion

It highly important to associate all the biological data available in order to extract new information. Our goal was at first to define a generic data model allowing not only

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

Sequence Is Sequenced Elements Is Mapped

21

Fig. 5. GenomicElement, Sequence and Map classes are tightly linked: as for a sequence, one or more map can be associated to a gene. Depending on the goals of the user, the gene can be seen through its biological properties, its sequence or its localization. Thus it is necessary to express the links existing between all the objects describing the same biological entity. These links are modeled in GeMCore through the relations Is_Sequenced and Is_Mapped.

Map

data management, but also comparative analysis of mapping of biological and molecular data. GeMCore is an entity-association model designed to perform object comparisons, as well as evaluation and comparison of the relations between the objects in the database. This can be done thanks to the explicit representation of these relations. Our model takes into account the constraints inherent to the domain of comparative mapping and includes many data types (maps, location data, sequence phenotypes, homology…), thus making it valuable for many fields of biology (molecular biology, molecular evolution, medicine, agronomy). A concrete confrontation of the model with real data and analysis will be required for validation. We are thus working on a sub-model in the laboratory of Biométrie et Biologie Evolutive in Lyon using the AROM knowledge representation system. This knowledge base contains human and mouse data. It will be coupled to GeMME and should include other mammalians in the future. Our problematic deals with the comparison of mapping data, and appears to be more complex than simple construction of data management systems. The goal is to produce a tool that will be able to manage profusely diverse data, while keeping in mind the biological nature of the information contained in the data to be processed. This goal implies the use of consensual terms and methodologies. When building the GeMCore data model we had to precisely define many concepts used in the field of comparative mapping and, more broadly, in methodology dedicated to the management and processing of data produced by genome projects. The GeMCore data scheme could also be helpful in defining an ontology dedicated to comparative genomic mapping.

References 1.

2.

Andersson, L., Archibald, A., Ashburner, M., Audun, S., Barendse, W., Bitgood, J., Bottema, C., Broad, T., Brown, S., Burt, D., Charlier, C., Copeland, N., Davis, S., Davisson, M., Edwards, J., Eggen, A., Elgar, G., Eppig, JT., Franklin, I., Grewe, P., Gill, RD T. 3 , Graves, J.A., Hawken, R., Hetzel, J., Womack, J., et al.: Comparative genome organization of vertebrates. The first international workshop on comparative genome organization. Mamm. Genome 7 (1996) 717-734 Rouault, J.P., Kuwabara, P.E., Sinilnikova, O.M., Duret, L., Thierry-Mieg, D., Billaud, M.: Regulation of dauer larva development in Caenorhabditis elegans by daf-18, a homologue of the tumour suppressor PTEN. Current Biology 9 (1999) 329-332

22 3.

4. 5. 6. 7.

8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20.

21. 22. 23.

G. Bronner et al. Spataro, B., Gautier, C.: Analyse comparative des genomes humains et murins avec l’interface GeMME. Recueil des actes, 1ères Journées Ouvertes : Biologie, Mathématiques, Informatique, 3-5 mai 2000, Montpellier - G. Caraux, O. Gascuel, M.F. Sagot (eds.) (2000) 55-64 Bernardi, G.: Isochores and the evolutionary genomics of vertebrates. Gene 241 (2000) 317 Matassi, G., Sharp, P., Gautier, C.: Chromosomal location effects on gene sequence evolution in mammals. Current Biology 9 (1999) 786-791 Letovsky, S.: GDB: integrating genomic maps. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 85-98 Eppig, J.T., Richardson, J.E., Blake, J.A., Davisson, M.T., Kadin, J.A., Rinwald, M.: The mouse genome database and the gene expression database: genotype to phenotype. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 119-128 Collins, A., Frezal, J., Teague, J., Morton, N.E.: A metric map of humans:23,500 loci in 850 bands. Proc. Natl. Acad. Sci. USA 93 (1996) 14771-14775 Scott, A.F., Amberger, J., Brylawski, B., McKusick, V.A.: OMIM: online mendelian. inheritance in man. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 77-84 Achard, F., Cussat-Blanc, C., Viara, E., Barillot, E.: The new Virgil database: a service of rich links. Bioinformatics 14 (1998) 342-348 Pruitt, K.D., Katz, K.S., Sicotte, H., Maglott, D.R.: Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet. 16 (2000) 44-47 Thompson, J.D., Higgins, D.G., T.J., Gibson.: CLUSTALW , improving the sensitivity of progressive multiple alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acid. Res 22 (1994) 4673-4680 Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J Mol Biol. 215 (1990) 403-410 Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (1997) 3389-3402 Duret, L., Perrière, G., Gouy, M.: HOVERGEN: comparative analysis of homologous vertebrate genes. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 21-36 Wada, Y., Yasue, H.: Development of an animal genome database and its search system Comput. Applied Biosci. 12 (1996) 231-235 Biaudet, V., Samson, F., Bessieres, P.: Micado – a network oriented database for microbial genomes – Comput. Applied Biosci., 13 (1997) 431-438 Thierry-Mieg, J., Thierry-Mieg, D., and Stein, L.: ACEDB: The ACE Database Manager. In: S. Letovsky (ed.) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, Dortrecht, London (1999) 265-278 Durbin, R., Thierry-Mieg, J.: The ACeDB genome database. In: Suhai, S. (ed.) Computational Methods in Genome Research, Plenum Press,New-York (1994) 45-55 Flanders, D.J., Weng, S., Petel, F.X., Cherry, J.M.: AtDB, the Arbidopsis thaliana database, and graphical web display of progress by the Arbidopsis genome initiative. Nucleic Acids Res. 26 (1998) 80-84 http:// locus.jouy.inra.fr/cgi-bin/bovmap/intro2.pl Fasman, K.H., Letovsky, S.I., Cottingham, R.W., Kingsbury, D.T.: Improvements to the GDB human genome database - Nucleic Acids Res. 24 (1996) 57-63 Gouy, M., Gautier, C., attimonelli, M., Lanave, C., di Paola, G.: ACNUC – a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. CABIOS 1 (1985) 167-172

GeMCore, a Knowledge Base Dedicated to Mapping Mammalian Genomes

23

24. Page, M., Gensel, J., Capponi, C., Bruley, C., Genoud, P., Ziebelin, D.: Representation des connaissances au moyen de classes et d’associations: le système AROM. Conf. langage et modèles à objet (LMO’00), Mt Saint-Hilaire, CA (2000) 91-106 25. Allen, J.F.: Maintaining knowledge about temporal intervals – Comm. of the ACM 26 (1983) 832–843

Appendix: Markers, Maps, and Recombination The genome is made of several chromosomes. Each of them includes a DNA molecule, which is a linear polymer of four nucleotides (adenine, guanine, cytosine, thymine). Genetic information is coded in these nucleotide sequences. Proteins are molecules that ensure most of the cell functions. The information necessary to build proteins is stored in protein genes. A gene is a subsequence of the genome. Expression of its information content requires two successive processes: transcription that generates a mRNA molecule (the messenger) and translation that converts this RNA molecule into protein. Unexpectedly, only a small fraction of the genome contains genes. In mammals, about 95% of genome has no genetic information content. A marker is any genomic object on the genome that can be identified non ambiguously: a gene is a marker as well as an anonymous sequence of a genomic fragment. Genome maps appear mainly as an ordered series of markers. A set of markers located on the same chromosome is said to be syntenic. Several distances between markers can be determined. When the complete genome is known, the number of nucleotides between two markers is known. This is the physical distance between the two markers. When the sequence is not known, the physical distance can be estimated from experimental data. Another important distance is the genetic distance. Each individual has two sets of homologous chromosomes, one from his father, the other from his mother. During gamete production, pairs of homologous chromosomes cross over, exchanging sequences. The genetic distance can be determined from the expected number of cross-overs between two markers (the recombination rate). Comparing physical and genetical distance is useful to address important biological points in various fields of biology including population genetics, medical research, and evolution.

Optimal Agreement Supertrees David Bryant LIRMM, 161 rue Ada, 34392 Montpellier, France Cedex 5 [email protected]

Abstract. An agreement supertree of a collection of unrooted phylogenetic trees {T1 , T2 , . . . , Tk } with leaf sets L(T1 ), L(T2 ), . . . , L(Tk ) is an unrooted tree T with leaf set L(T1 ) ∪ · · · ∪ L(Tk ) such that each tree Ti is an induced subtree of T . In some cases, there may be no possible agreement supertrees of a set of trees, in other cases there may be exponentially many. We present polynomial time algorithms for computing an optimal agreement supertree, if one exists, of a bounded number of binary trees. The criteria of optimality can be one of four standard phylogenetic criteria: binary character compatibility; maximum summed quartet weight; ordinary least squares; and minimum evolution. The techniques can be used to search an exponentially large number of trees in polynomial time.

1

Introduction

The general phylogenetic supertree problem is how to combine phylogenies for different sets of species into one large “super”-phylogeny that classifies all of the species. The problem is motivated by physical, computational, and statistical constraints on the construction of large scale phylogenies. In this paper we present a supertree method for combining a bounded number of binary, unrooted, phylogenetic trees. We search for supertrees such that every input tree is a restriction of the supertree to a subset of the set of species. When multiple supertrees exist we choose the supertree that is optimal with respect to one of four standard phylogenetic optimization criteria. Even though the number of possible supertrees can be exponentially large, we can determine an optimal supertree in polynomial time. 1.1

Agreement Supertrees

We begin with a few definitions. An unrooted phylogenetic tree T is a finite, acyclic connected graph with no internal vertices of degree two and degree one vertices (leaves) labelled uniquely by members of a finite leaf set L(T ). A tree T 0 is an induced subtree of T if T 0 can be obtained from T by deleting vertices and edges and supressing vertices of degree two. Intuitively T 0 represents the restriction of T to a subset of the leaf set. If T 0 is an induced subtree of T with leaf set L0 ⊆ L(T ) then we write T 0 = T |L0 . An agreement supertree of a collection of trees {T1 , T2 , . . . , Tk } with leaf sets L(T1 ), L(T2 ), . . . , L(Tk ) is an O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 24–31, 2001. c Springer-Verlag Berlin Heidelberg 2001

Optimal Agreement Supertrees

25

unrooted tree T with leaf set L(T1 ) ∪ · · · ∪ L(Tk ) such that each tree Ti is an induced subtree of T . The agreement supertree is the dual of the agreement subtree, recalling that S is an agreement subtree of trees T1 , . . . , Tk if S is an induced subtree of each of T1 , . . . , Tk (c.f [3,6]). A collection of trees T1 , . . . , Tk on leaf set L are agreement supertrees of a collection of trees S1 , . . . , Sl on subsets of L if and only if S1 , . . . , Sl are agreement subtrees of T1 , . . . , Tk , and every leaf in L appears in at least one Si . There can be exponentially many agreement supertrees for a collection of trees T , even when the number of trees in T is bounded [4]. When there is more than one agreement supertree we select one that is optimal according to one of four standard phylogenetic optimization criteria. The choice of the four optimization criteria is determined by algorithmic constraints - we discuss the possible extension to other criteria in Sect. 6. Note that the optimization criteria require global information for the entire set of taxa present in the input trees. Hence the algorithms are better tailored to divide and conquer applications, rather than the assembly of different data sets. Though the number of agreement supertrees can be exponentially large, the algorithms determine an optimal agreement supertree in polynomial time, provided that the number of input trees is bounded. The agreement supertree methods presented here are practical for a small number of large input trees, rather than a large number of small input trees. Finally we note that agreement supertree methods do not handle conflict in the set of input trees. If two input trees resolve the relationship between species in a different way then there will be no agreement supertrees for the collection. There are numerous possible ad hoc solutions to this shortcoming (e.g. removing leaves involved in conflicts, modifying input trees). However the more general problem of how to systematically resolve conflict in supertrees is still hotly debated within phylogenetics. At this early stage we present our results as tools to be incorporated into practical supertree methods once some of the fundamental systematic questions have been answered. Despite the false modesty, the algorithmic results presented in this paper have several direct applications, a few of which we discuss in Sect. 5. In particular we stress the ability to search over an exponentially large number of trees in polynomial time.

2

Four Standard Optimization Criteria

In this section we describe the phylogenetic criteria used for optimization. Discussion of the motivations, use, and characteristics of the various criteria can be found, for example, in [10]. Here we give only sufficient detail to describe the problems.

26

2.1

D. Bryant

Binary Character Weights

Given a phylogenetic tree T , removing an edge of T induces a bipartition of the leaf set of T , called a split of T . The set of all splits of T is denoted splits(T ), and a single split is written as A|B, using the standard notation for partitions. A split can be viewed as a binary character where each leaf is assigned a value of 0 or 1 depending on which block of the split it is in. Given a weighted collection of binary characters we can score a tree T by summing the weight of the characters corresponding to splits in T . The total summed weight is called the binary character compatibility score of T . We wish to find T that maximizes the binary character compatibility score. We will assume that all binary characters have non-negative weights. 2.2

Quartet Criteria

A (resolved) quartet is an unrooted binary tree on four leaves. There are three possible quartets for each set of four leaves. For any tree T the quartet set of T , denoted q(T ), is the set of all quartets that are induced subtrees of T . Suppose that we have assigned a weight w(ab|cd) for every possible quartet ab|cd such that {a, b, c, d} ⊆ L(T ). The quartet score of a tree T is the sum of the weights w(ab|cd) over all ab|cd ∈ q(T ). We want to find a tree with maximum quartet score. We will assume that all quartets have non-negative weights. 2.3

Least Squares (OLS)

Suppose that T is an unrooted tree and the edges of T are given real valued weights. The leaf to leaf distance between any two leaves of T is the sum of the edge weights along the path between the two leaves. In this way we can construct a distance matrix p containing all of the leaf to leaf distances between leaves in T . The sum of squares distance between p and a given distance matrix d for L(T ) is defined X X (dxy − pxy )2 . (1) kp − dk2 = x∈L(T ) y∈L(T )

For a fixed tree T the edge weights for T that give a distance matrix p minimizing kp − dk2 are called the OLS edge length estimates for T . The ordinary least squares (OLS) score of a tree T is the value kp − dk2 given by the OLS edge length estimates. We want to find a tree T with minimum OLS score. 2.4

Minimum Evolution (ME)

The minimum evolution criteria is closely related to OLS. Given an unrooted tree T and a distance matrix d we first calculate the OLS edge estimates for T

Optimal Agreement Supertrees

27

from d. The minimum evolution (ME) score for T is then the sum of these edge weights. The minimum evolution score is usually only evaluated for binary trees1 . We want to find a tree with minimum ME score.

3

Split Constrained Phylogenetic Optimization

In this section we describe the results of [1,2,5,7] that provide the machinery for the optimal agreement supertree algorithms. The results all follow the same theme: the introduction of split based constraints that make phylogenetic reconstruction polynomial time solvable. More formally we have: INSTANCE: Set S of splits on leaf set L. Degree bound d. PROBLEM: Is there a tree T with degree bound d such that splits(T ) ⊆ S? If so, find the tree(s) with degree bound d and splits(T ) ⊆ S with 1. maximum binary compatibility score (with respect to a given weighting of splits in S). [1] 2. maximum quartet score (for a given quartet weighting). [7] 3. minimum OLS score (for a given distance matrix). [5] 4. minimum ME score (for a given distance matrix). [2,5] All of these versions of the problem can be solved in polynomial time (in the number of splits and leaves), when d is bounded by a constant.

4

Optimal Agreement Supertree Algorithms

4.1

Split Set Constraints

We show how the split constrained optimization algorithms can be used to construct optimal agreement supertrees. Let T = {T1 , . . . , Tk } be a collection of trees. Put Li = L(Ti ) for each i and L = L1 ∪ · · · ∪ Lk . Define A |B ∈ splits(Ti ) ∪ {∅|Li }, i = 1, 2, . . . , k f. S(T ) = A1 ∪ · · · ∪ Ak |B1 ∪ · · · Bk : i i (A1 ∪ · · · ∪ Ak ) ∩ (B1 ∪ · · · Bk ) = ∅ (2) Note that we will assume Ai |Bi ∈ splits(Ti ) implies Bi |Ai ∈ splits(Ti ). Put n = |L|. There are at most 2n − 3 splits in each tree so |S(T )| is O(2k nk ). Furthermore Theorem 1. Let T be an unrooted phylogenetic tree with leaf set L. If each tree Ti ∈ T is an induced subtree of T then splits(T ) ⊆ S(T ). 1

It would be interesting to determine whether or not optimal minimum evolution trees can be assumed to be binary

28

D. Bryant

Proof. Suppose that each tree Ti ∈ T is an induced subtree of T . Thus Ti = T |Li and splits(Ti ) = splits(T |Li ) = {A ∩ Li |B ∩ Li : A|B ∈ splits(T )} − {Li |∅}.

(3) (4)

Hence for each A|B ∈ splits(T ) we have A ∩ Li |B ∩ Li ∈ splits(Ti ) ∪ {Li |∅} .

(5)

A|B = (A ∩ L1 ) ∪ · · · ∪ (A ∩ Lk )|(B ∩ L1 ) ∪ · · · ∪ (B ∩ Lk )

(6)

It follows that

and A|B belongs to S(T ).

t u

We now take advantage of the fact that the input trees are all binary. Lemma 1. If T contains only binary trees and there is an agreement supertree T of T then there is a binary agreement supertree T 0 of T such that splits(T ) ⊆ splits(T 0 ). Proof. Suppose that T is an agreement supertree of T that is not binary. Clearly there exists a binary tree T 0 such that splits(T ) ⊆ splits(T 0 ). We will show that T 0 is also an agreement supertree for T . For each Ti ∈ T we have Ti = T |Li and, since splits(T ) ⊆ splits(T 0 ), we also have splits(T |Li ) ⊆ splits(T 0 |Li ). The tree Ti is binary, so splits(Ti ) = t u splits(T |Li ) ⊆ splits(T 0 |Li ) implies Ti = T 0 |Li , completing the proof. We now have all the machinery needed for the main result. Theorem 2. Let T = {T1 , . . . , Tk } be a collection of binary trees with leaf sets L1 , . . . , Lk . We can determine whether an agreement supertree of T exists in O(4k n2k+1 ) time. If there is an agreement supertree for T then we can find the agreement supertree with optimum 1. binary character compatibility score (with respect to a given weighting of splits) in O(4k n2k+1 ) time. 2. summed quartet weight (for a given quartet weighting) in O(2k nk+4 +4k n2k+1 ) time. 3. least squares score (for a given distance matrix) in O(8k n3k ) time. 4. ME score (for a given distance matrix) in O(8k n3k ) time. Proof. By Theorem 1 we have that any agreement supertree T for T satisfies splits(T ) ⊆ S(T ). With all of the optimization criteria (except perhaps ME, which is usually defined only for binary trees) a tree T 0 with splits(T ) ⊆ splits(T 0 ) will have at least as good score as T . Hence, using Lemma 1, we can assume that the optimal agreement supertrees are binary. The problem therefore reduces to finding an optimal binary tree contained within a given collection of splits. Applying the methods stated in Sect. 3 is is possible to obtain the given complexities. t u

Optimal Agreement Supertrees

29

We note that, when the number of trees k is bounded by a constant, we have polynomial time algorithms for determining optimal agreement supertrees. When k is unbounded, the problem is NP-complete (by a reduction from QUARTET COMPATIBILITY [9]).

5

Applications to Optimization

To illustrate how these methods may be used during optimization we describe two applications. 5.1

Divide and Conquer

Given two (or more) binary trees we now have a method for combining them in a way that optimizes phylogenetic criteria. The most obvious application of these algorithms is as the basis of a divide and conquer algorithm. There are several possibilities for ways to subdivide the data set. The DCM method [8] uses distance data to divide the sequences into closely related clusters. However the merge algorithms we describe here will work well even if all of the input trees contain leaves that are widely scattered throughout the combined tree. For this reason, the subdivision process appears less critical. We propose the use of a random subdivision, biased to produce subsets that have close to equal sizes. 5.2

Testing Stability

The standard technique for phylogenetic optimization is to conduct a tree search: an initial tree is constructed; Neighbouring trees are examined, where ’neighbouring’ means within one branch swap or NNI exchange of the original tree; If a better tree is found, this is chosen as the next tree and the search continues. At each step at most O(n) or O(n2 ) trees are searched. We can use the merge algorithms presented in this paper to search an exponentially large number of trees without a dramatic increase in complexity. We randomly divide the set of leaves into two parts, calculate the two induced subtrees, and then calculate an optimal agreement supertree for these two trees. An agreement supertree will always exist since the leaf sets of the subtrees are disjoint. If we have found a global optimum then this procedure will always produce the original tree. If it finds a better tree we take this new tree and repeat the process.

6

Discussion

In this abstract we have outlined a method for constructing an optimal agreement supertree of a set of binary trees, if such an supertree exists. The algorithm can search an exponentially large collection of trees (the possible agreement supertrees) in polynomial time. Our results lead to several directions of further research:

30

6.1

D. Bryant

Extension to Other Criteria

It would be useful to determine whether the algorithms can be extended to optimize over other phylogenetic criteria such as parsimony score or likelihood. In both cases we are pessimistic. With parsimony, the structure of one part of a tree can alter how best to resolve the structure of another part of the tree, making parsimony less suitable for dynamic programming algorithms such as those we use here. With maximum likelihood it is unknown whether the likelihood of a single tree can be computed in polynomial time! In both cases, the above results can be used to make a rough search of the tree space that would precede a more detailed, and computationally expensive, local search. 6.2

Extension to Non-binary Trees

The next direction of future research would be to determine if the algorithms can be extended to handle non-binary trees. The problem is that we require a degree bound in order to use the algorithms outlined in Sect. 3. We suspect that an appropriate modification of the algorithms will give a polynomial time algorithm (for a bounded number of trees) even without a degree bound. We have derived the required modifications for the special case when the leaf sets of the input trees are disjoint, but the the complexity of the general problem remains open. 6.3

Trees with No Agreement Supertree

At the moment the algorithms will simply terminate if there is no agreement supertree possible. This occurs when, for example, there are conflicts between the input trees. It would be interesting to extend the algorithm to derive a consensus supertree method capable of handling input trees with conflicts. First, however, we must address the question of how to handle conflict between different trees. The subtree merger algorithm of [8] simply contracts edges causing problems. This will, in general, lead to poorly resolved trees. Once suitable criteria for combining subtrees are defined, it should be possible to extend the optimal agreement supertree algorithms above to give a general optimal subtree merger technique.

Acknowledgements. I thank the anonymous referees for their helpful comments and suggestions. The paper was written while I was a postdoctoral fellow based in the laboratory of Olivier Gascuel.

References 1. Bryant, D.: Hunting for trees in binary character sets. Journal of Computational Biology 3(2), (1996) 275-288.

Optimal Agreement Supertrees

31

2. Bryant, D.: Hunting for trees, building trees and comparing trees: theory and method in phylogenetic analysis. Ph.D. thesis, Dept. Mathematics, University of Canterbury (1997). 3. Finden, C.R. and Gordon. A.D. Obtaining common pruned trees. Journal of Classification, 2 (1986) 255–276. 4. Gordon, A.D. Consensus supertrees: the synthesis of rooted trees containing overlapping sets of leaves. Journal of Classification, 3 (1986) 335–348. 5. Bryant, D. Fast algorithms for constructing optimal trees with respect to OLS and minimum evolution criteria. In preparation. 6. Goddard, W., Kubicka, E., Kubicki, G., and McMorris, F.R.. (1994). The agreement metric for labelled binary trees. Mathematical Biosciences 123 215– 226. 7. Bryant, D. and Steel, M. Constructing optimal trees from quartets. Journal of Algorithms (to appear) 8. D.H. Huson, S. Nettles, and T. Warnow: Obtaining highly accurate topology estimates of evolutionary trees from very short sequences. Proc. 3rd Annual Int. Conf. Comp. Mol. Biol. (RECOMB), (1998) 198–209. 9. M. Steel: The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 9 (1992) 91-116. 10. Swofford, D. L., G. J. Olsen, P. J. Waddell and D. M. Hillis: Phylogenetic Inference. in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular Systematics, second edition. Sinauer, Sunderland, Mass (1996) 407–514.

Segmentation by Maximal Predictive Partitioning According to Composition Biases Laurent Gu´eguen CEB-LIS – UPMC Paris VI [email protected]

Abstract. We present a method for segmenting qualitative sequences, according to a type of composition criteria whose definition and evaluation are founded on the notion of predictors and additive prediction. Given a set of predictors, a partition of a sequence can be precisely evaluated. We present a language for the declaration of predictors. One of the problems is to optimize the partition of a sequence into a given number of segments. The other problem is to obtain a suitable number of segments for the partitioning of the sequence. We present an algorithm which, given a sequence and a set of predictors, can successively compute the optimal partitions of the sequence for growing numbers of segments. The time- and space-complexity of the algorithm are linear for the length of sequence and number of predictors. Experimentally, the computed partitions are highly stable regard to the number of segments, and we present an application of this approach to the determination of the origins of replication of bacterial chromosomes.

Broadly speaking, the aim of the classification process is to optimize the dividingup a set of objects into classes, in line with given criteria, so as to identify a structure, if any such exist. Once a set is organized into classes, the next aim may be to obtain a description of these classes, which would provide a high-level redescription of the set. This description could be set either after or during the classification process. But it may be very difficult to obtain “good” descriptions of existing classes, so the approach we adopted was to simultaneously describe and construct classes, and to evaluate the quality of the classification according to the accuracy of the descriptions. In this paper, we outline a method for partitioning sequences into described segments in such a way that the descriptions could be used to evaluate the quality of the corresponding segments. We first present the overall approach, and then an efficient, optimal partitioning algorithm. Finally, we link this method to the problem of partitioning genomic sequences according to relevant composition biases. O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 32–44, 2001. c Springer-Verlag Berlin Heidelberg 2001

Segmentation by Maximal Predictive Partitioning

1 1.1

33

Prediction Prediction of Sequences

We use Gower’s concept of prediction [3]: the quality of a class is measured by the ability of an “optimal” predictor belonging to a given set of predictors to predict the properties of the elements of the class in question. For sequences of letters, a property of a given position is the letter or word that is readable at this position. For example, to say that the predictor of a sequence of letters is the letter A means that each letter of this sequence should be an A, and the resulting high-level description of this sequence is: “We can read an A at each position in the sequence”. This description will be more or less accurate, according to the proportion of A in the sequence. To say that the predictor of a sequence is the word ‘AT’means that the word ‘AT’ should be readable at each position in the sequence, but it is clear that such a prediction can be correct for, at most, half the positions in the sequence. The notion of prediction can be extended to entire sequences. For example, to say that the predictor of a class is A with a factor 0.3, and T with a factor 0.7, means that we predict: “We can read an A at 30 % and a T at 70 % of the positions of the sequence”. We call this predictor [A(0.3)T(0.7)] (see Sect. 1.2). Let δ be a predictor, S a sequence, and si a position in this sequence. The prediction of δ at si is termed πδ (si ) ∈ R. For δ = A, if the letter at position si is A, πδ (si ) is 1, otherwise it is 0. For δ = ‘AT’, if the word at position si si+1 is AT, πδ (si ) is 1, otherwise 0. For δ = [A(0.3)T(0.7)], if the letter at si is A, πδ (si ) is 0.3; if the letter is T, πδ (si ) is 0.7; otherwise, it is 0. The prediction of δ on a segment S is the sum of the predictions of δ for all the positions on this segment. We call this prediction πδ (S): X πδ (si ) πδ (S) = si ∈S

In a geometrical representation, predictions for a sequence may be considered as a scalar product. In Fig. 1, for instance, we represent the sequence as a walk on the orthonormal basis (eA , eT ), starting at the origin. If we read an A (resp. T) at the first position, the first step in the walk is eA (resp. eT ). In a given representation, the vector corresponding to sequence S is called the sequence vector of S, noted as vS . In the same way, a predictor δ is also represented by a predictor vector, vδ . In this context, the prediction of the predictor δ on the sequence S is the scalar product of vδ and vS , termed vδ .vS . For a set of predictors D, the optimal predictors of S are those that provide the best prediction on S, i.e. those whose vectors provide the best scalar product with the sequence vector of S. The corresponding optimal prediction is πD (S): πD (S) = max πδ (S) δ∈D

34

L. Gu´eguen A

affine representation of sequence S

S= ATAAATTAATATTTTATTTT

predictor vector of A predictor vector of [A(0.3)T(0.7)]

sequence vector of S

predictor vector ofT T

Fig. 1. Vectorial representation of a sequence, and of predictors A, T and [A(0.3)T(0.7)].

There is competition between the different predictors of D. Following the geometrical representation, and in order to ensure that the predictor chosen among the set D is the one whose vector is the most colinear with the sequence vector, we have to normalize the different predictor vectors. For example, if D is the set {A, T, [A(0.3)T(0.7)]}, the normalization is carried out by multiplying the predictions of [A(0.3)T(0.7)] by 1.313 (see Fig. 2). This predictor is termed [A(0.3)T(0.7)](1.313).

A

S= ATAAATTAATATTTTATTTT

A [A(0.3)T(0.7)](1.313) T representation of normalized predictors T

Fig. 2. Vectorial representation of normalized predictors. [A(0.3)T(0.7)](1.313) is the one that best fits the sequence S.

The

predictor

Segmentation by Maximal Predictive Partitioning

1.2

35

Language of Predictors

We constructed a language that would allow us to declare different predictors. We use the word “lexicon” to designate a set of words belonging to this language which determines the set D of the available predictors. In a lexicon, the words are separated by spaces. Each word corresponds to a predictor, according to the rules given below. Now, the two terms “predictor” and “word” are equivalent. – If α is a letter in S, α is in the language, and its prediction function is the characteristic function, i.e.1 : πα (si ) = 1lα=si =

1 if the letter in position si is α, otherwise 0;

– The sign ’?’ is in the language, its prediction function is: π? (si ) = 1 – If p is a predictor and x ∈ R, p(x) is a predictor belonging to the language and: πp(x) = x.πp – If p1 , . . . , pk are predictors, [p1 p2 . . . pk ] is a predictor belonging to the language, and: k X πp i π[p1 p2 ...pk ] = i=1

– If p1 , . . . , pk are predictors, ‘p1 p2 . . . pk ’ is a predictor belonging to the language, and: π‘p1 p2 ...pk ’ (si ) = πp1 (si ).1l(πp2 (si+1 )>0)∧...∧(πpk (si+k−1 )>0) Combinations within the framework of this syntax make possible the declaration of a very large range of predictors. For instance, the word ‘A?C’(2) stands for a predictor whose prediction is 2 at positions where the letter is A and the letter which occurs two positions after it is C, 0 at all other positions. The predictor [‘A[CA]G’(-2)‘CA’](3) gives a prediction of −6 at positions where the word is ACG or AAG, 3 where it is CA, otherwise 0.

2

Predictive Partitioning of Sequences

2.1

Predictive Partitions

A partition of a sequence is a division of the sequence into separate segments such that their union equals the whole sequence. We call caesura the junction 1

1ltrue = 1 and 1lfalse = 0

36

L. Gu´eguen

of two consecutive segments. We call a k-partitions a partition into k segments, and the set of these k-partitions is termed Pk (S). The prediction of a partition follows the same logic as the prediction of a sequence: our aim is to optimize the prediction of the composition of segments, and the prediction of a partition is the sum of the optimal predictions of its segments. Then, if the k-partition P is made up of segments S1 , S2 , . . . , Sk , its prediction by D is: πD (P ) =

k X

πD (Sj ) =

j=1

k X j=1

max πδ (Sj ) δ∈D

For example, given the sequence ATAAATTAATATTTTATTTT and the lexicon A T [A(0.3)T(0.7)](1.313), then for the 3-partition ATAAATTA ATATT TTATTTT these segments have successively A, [A(0.3)T(0.7)](1.313) and T as optimal predictors, and the prediction of this 3-partition is 5 + 3.55 + 6 = 14.5451. The larger the prediction of a k-partition, the better the predictions that can be made about its segments. Given a set of predictors and a number of classes k, our goal is to obtain the optimal division into k segments, i.e. to find the optimal partitions among the entire set of k-partitions. The evaluation of the maximal k-partitions of S according to the predictors set D is termed MkD (S). MkD (S) =

max

(S1 ,... ,Sk )∈Pk (S)

k X j=1

max πδ (Sj ) δ∈D

In the example given above, an optimal 3-partition is ATAAA TTAATATTTTA TTTT These segments have successively A, [A(0.3)T(0.7)](1.313) and T as optimal predictors, and the corresponding prediction is 16.0093. In terms of geometrical representation, an optimal k-partition is one in which the vectors representing the segments are most colinear to their best predictors among D (see Fig. 3). Finding a “good” partition means finding a compromise between very precise descriptions of only a few of the k segments and a good average of all the k segments. Because the prediction is additive in the neighborhood of a junction between two segments, the caesura is set in such a way that the prediction of the maximum elements is relevant. Given a set of predictors, and considering this geometrical representation, we can say that a sequence has a structure when its affine representation can be divided up into segments whose vectors are distinctly colinear with one of the predictors.

Segmentation by Maximal Predictive Partitioning

37

A

S= ATAAATTAATATTTTATTTT 3−partition: ATAAA TTAATATTTTA TTTT A [A(0.3)T(0.7)](1.313) T T

Fig. 3. Vectorial representation of a 3-partition. Dashed arrows represent the vectors of the segments of this partition. We can see that the vector of predictor A is the most colinear with the first arrow, [A(0.3)T(0.7)](1.313) with the second, and T with the third.

2.2

Number of Segments

A priori, it is reasonnable to ask an experimenter to define a problem in terms of a sequence and a set of predictors2 . But the precise number of segments which will successfully partition a sequence cannot be known in advance. So, from the point of view of classification, and with the hypothesis of a structure in a sequence, it may be useful to estimate the number of segments that “best” discloses this structure. And if there is no structure at all, it should be possible to demonstrate this. In fact, more than one class may be relevant, as in the case of nested clusters. There are numerous ways to estimate the number(s) of classes in a set, and we will not list them all, but a number of them can be found in [2]. However, none of them is perfect. A “good” summary of the composition of the sequence implies the “right” number of classes. Yet by definition, the optimal prediction does not decrease with the number of segments, and the global maximum is reached when all the elements have been separated, which is not a very interesting solution. The “right” number of classes must be found: if there are too few, the high-level redescription obtained will be too vague; if there are too many, this redescription will be too complicated. One way to assess the relative usefulness of the different numbers of classes is to compute the optimal k-partitions for successive k between 1 and a given value. We call this process partitioning, and we give the same name to the result of such a process. A partitioning can be used to study the optimal prediction in terms of the number of classes. Qualitatively, a sharp bend in the curve of the prediction is the sign of a relevant number of segments because, for a lower number of 2

We shall discuss the choice of this set in the Sect. 4.

38

L. Gu´eguen

segments, the prediction will be too vague, and for a higher number, the gains in prediction will be outweighed by the increase in the numbers of classes. Such a point corresponds to a good balance between summarization and description. A further advantage of partitioning a sequence is that it may reveal another structure of the sequence, i.e. the successive revelation of segments according to their prediction by a “good” predictor, the structure becoming more and more detailed with the number of classes. Moreover, the composition of the segments is not subject to a priori limitations, such as that of a threshold. We next present an algorithm for computing the partitioning of a sequence into a given number of segments given a sequence and a set of criteria [4].

3

Algorithm

There already exists an algorithm for computing the optimal partitionings of sequences, for any kind of evaluation function of the classes [5]. This algorithm is based on Fisher’s lemma [1]): if {S1 , S2 , . . . , Sk } is an optimal k-partition of ∪ki=1 Si , then {S1 , S2 , . . . , Sk−1 } is an optimal k − 1-partition of ∪k−1 i=1 Si . The time-complexity of this algorithm is quadratic with the size of the sequence, and this is critical for very long sequences, such as genomic sequences. In the context of prediction, the algorithm we are introducing here uses the previous lemma and another observation: given a set of predictors D, all the positions of a class are predicted by the same predictor of D (whether this prediction is correct or not). The time- and space-complexity of the algorithm are linear with the length of the sequence, which makes possible the partitioning of very long sequences. Let s1 , s2 , . . . , sn be the elements of sequence S, whose size is n. Given i > j, we call hi, ji the set {i, i + 1, . . . , j − 1, j} and shi,ji the subsequence si . . . sj . For each δ ∈ D, and for i > k, if we compute, for all the k-partitions of the subsequence sh1,ii , the prediction such that the segment containing si is predicted – forcibly – by the predictor δ, we call Mkδ (sh1,ii ) the maximum of these predictions. The aim is to compute MkD (sh1,ni ) for each k, and to backtrack the corresponding optimal partitions. In an optimal k-partition, the segment containing si is predicted by an optimal predictor: MkD (sh1,ii ) = max Mkδ (sh1,ii ) δ∈D

Fisher’s lemma is used to compute Mkδ (sh1,ii ). An optimal k-partition of sh1,ii for which the optimal predictor of the segment containing si is δ conforms to at least one of the two possibilities:

Segmentation by Maximal Predictive Partitioning

39

— either si−1 and si belong to the same segment, in which case the segment containing si−1 is predicted by δ,

. . . si−1 si ...δ δ

s 1 s2 . . . and the prediction is:

Mkδ (sh1,ii ) = πδ (si ) + Mkδ (sh1,i−1i )

— or si−1 and si belong to two different segments, in which case the kth segment is {si }, and the other k − 1 segments together constitute an optimal (k − 1)partition of sh1,i−1i ,

. . . si−1 si . . . δ0 δ

s 1 s2 . . . and the prediction is:

D Mkδ (sh1,ii ) = πδ (si ) + Mk−1 (sh1,i−1i ).

The maximum of these two formulae is:

D (sh1,i−1i ), Mkδ (sh1,i−1i ) , Mkδ (sh1,ii ) = πδ (si ) + max Mk−1

and, as there is no k-partition of sh1,k−1i , D (sh1,k−1i ) Mkδ (sh1,ki ) = πδ (sk ) + Mk−1

To sum up: ∀i, M1D (sh1,ii ) = maxδ∈D πδ (sh1,ii ) D (sh1,k−1i ) ∀δ, ∀k, Mkδ (sh1,ki ) = πδ (sk ) + Mk−1

D (sh1,i−1i ), Mkδ (sh1,i−1i ) ∀δ, ∀k, ∀i > k, Mkδ (sh1,ii ) = πδ (si ) + max Mk−1 ∀k, ∀i > k, MkD (sh1,ii ) = maxδ∈D Mkδ (sh1,ii ) D So if, for a given k, the set {Mk−1 (sh1,ii ), i > k − 1} is preserved, we can compute successively, for i > k, the set of the MkD (sh1,ii ) using all the Mkδ (sh1,ii ). We then compute successively the partitioning M1D (S), M2D (S), . . . , MkD (S). Backtracking this computation gives an optimal k-partition for each k. The computation of these optimal partitions is linear over time: for each e ∈ h2, ki, for each i ∈ he, ni, for each δ ∈ D, Meδ (sh1,ii ) is computed in a constant D (sh1,i−1i ). Moreover, for each i ∈ he, ni, the time using Meδ (sh1,i−1i ) and Me−1 D computation of Me (sh1,ii ) requires a comparison between |D| elements. Then, at eachPiteration of the number of segments, the computing time approximan tes to i=e |D| = |D|.(n − e + 1). In conclusion, MkD (S) is computed with a time-complexity of O(k.|S|.|D|). This algorithm can thus be used on very long sequences.

40

L. Gu´eguen

For a given k, there may in fact be several maximally predictive k-partitions, and it may be useful to calculate all these, especially if they have very different caesuras. With adjacency lists, it is possible to find the set of all the equally optimal k-partitions without any increase in complexity.

4 4.1

Applications Composition Biases in Sequences

Genetic sequences are subject to numerous functional and structural constraints, and there are composition biases between their different parts. For us, “composition bias” means that there is a significant difference between different parts of a sequence for a measurable criterion, e.g. the comparison between the proportion of G and the proportion of C on a DNA string. In a number of bacteria the expression of this C--G bias is related to the replication process, due to the fact that these proportions are very different between the leading and the lagging part of a DNA double-strand sequence [6]. In [7], a linear discriminant analysis method is used to find, and especially, to evaluate the relative merits of different criteria for distinguishing between the leading and lagging parts of a bacterial chromosome; these criteria can be based on the C--G bias or the frequency of codons in genes. The aim of detecting such biases is to facilitate the study of biological data, though the explanation of those biases raises new problems for biological research. We plan to link the notion of composition bias to that of prediction. To arrive at an objective criterion of composition bias, we use the notion of prediction, and that of language as defined in Sect. 1.2. Our second aim is to compute such partitionings, using the algorithm presented in Sect. 3. If a composition bias can be translated into a prediction, using this approach, we can partition the sequence into segments that optimally mirror this bias. Taking a geometrical point of view, we consider a composition bias B such that, in an affine representation of a sequence, which has been set up according to the parameters of B, each trend τi of the bias is related to a transect Tτi of the space. This means that if the vector of a segment is in Tτi , the segment follows τi . At the boundary between two transects, the segment follows the two trends to the same extent. Looking at all the trends in a bias, we have a partitioning of the space into transects {Tτ1 , . . . , Tτp } (as in Fig. 4). Looking at a set of predictors {δ1 , . . . , δn }, for each i let Tδi be the set of vectors v such that ∀j 6= i, v.vδj < v.vδi . Tδi is a transect of the space. In order to express bias B using the prediction, we have to find p predictors δi such that {Tδ1 , . . . , Tδp } = {Tτ1 , . . . , Tτp }. This problem can be solved easily by using the fact that at the boundary between two transects, the prediction of a segment is the same for the two corresponding predictors. The algorithm and the declaration of the language used to compute optimal predictive partitionings have been programmed. The computation of partitionings turns out to be very efficient: for example, on a Gateway computer with a

Segmentation by Maximal Predictive Partitioning

41

A 1111111111111111111111111 0000000000000000000000000 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111

1111111111111111111111111 0000000000000000000000000 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 A(2) 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 [TA(1.6)] 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 [T(1.85)A(0.8)] 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 T(2) 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 11111111111111111111111111111111 000000000 111111111 0000000000000000000000000 1111111111111111111111111 0000000000000000000000000000000 1111111111111111111111111111111 00000000000000000000000000000000 T 11111111111111111111111111111111

Fig. 4. Predominance transects of the predictors of the lexicon A(2) [TA(1.6)] [T(1.85)A(0.8)] T(2). The striped areas are the transects, which are related to the trends in the studied bias.

Pentium III processor and 256 Mb of RAM, taking a sequence of 2.106 letters and four predictors, the construction of an optimal partition into 100 segments takes less than ten minutes. In this way, it has been possible to study the partitioning process extensively. 4.2

Experiments

The discovered segments are very stable with the numbers of segments. For instance, the partitioning of the sequence shown in the Fig. 5 almost gives the impression of a hierarchical structure. This is not in fact the case, but it reinforces the impression that partitioning reveals hidden structures, since it seems to bring out increasingly detailed features of the composition of the sequence. Visually, such figures do indeed give a thorough view of the intricacy of the expression of the different tendencies of a bias within a given sequence. Moreover, the stability means that the determination of a “suitable” number of segments becomes less critical, because the caesuras remain, along with several numbers of classes. We partitioned several bacterial genomic sequences according to the C--G imbalances in different parts of these sequences. We wanted to check whether the bias noted by Lobry [6] made it possible to determine the origin and terminus of replication. To express this bias, we used the C G lexicon . The chromosomes of the bacteria studied are circular, so we dicided them up into linear sequences. Figure 6 gives the result of partitioning Haemophilus Influenzae into ten segments. We see that the leading and lagging parts are the first to be extracted. Figure 7 shows a very sharp bend in the prediction curve in terms of the number of classes, at 3 segments, revealing that the most significant expression

42

L. Gu´eguen

Fig. 5. Example of the partitioning of a sequence. The sequence is the line at the bottom. Each class within a partition is represented by an arc. The number of classes increases from top to bottom.

of the G--C bias is exactly correlated with the leading and lagging parts of the chromosome. Table 1 shows, for a number of bacteria and archae-bacteria, the limits of the segments of the 2- and 3-partitions, using the same C G lexicon, for which there is a sharp bend in the prediction curve. The selected number of classes is either 2 (which means that the origin of replication is at position 0, i.e. at the position where the circular chromosome has been split) or 3. It seems clearly that this method could be extensively used for determining the composition of sequences such as genomes and proteins. The large range of criteria according to which partitionings can be carried out may be useful tool in biological research, e.g. in determining of the relation between the replication process and other criteria (cf [7]), the study of isochores, or that of variations in codon usage inside a chromosome. And, as this method is precise and objective, it is a good way to evaluate the relevance of particular criteria, relating partitions to knowledge about the properties of particular sequences.

Segmentation by Maximal Predictive Partitioning

43

C 1 C

G

2 C

G

C

C

G

C

3 G

4 C

G

C

G

C

C

G

C

G

C

5 G

6 C

G

C

G

C

G

C

C

G

C

G

C

G

C

7 G

8 C

G

C

G

C

G

C

G

C

C

G

C

G

C

G

C

G

C

9 G

10

1000000

1830135

Fig. 6. Partitioning of H. influenzae into 10 classes, using a C G lexicon. The chromosome, which is circular, is represented as a linear sequence (the graduated line). The black triangles represent the locations of the origin (pointing upwards) and terminus (pointing downwards) of replication.

369341

360000

350723 1

3

10

20

Fig. 7. Prediction in terms of the number of classes of partitioning into twenty classes of H. influenzae, using the C G lexicon.

44

L. Gu´eguen

Table 1. Origins and termini of replication and with the caesurae found by partitioning, in sequences for which an increase in prediction shows a sharp bend. The positions are given as percentages of the overall length of the sequence. “Caesura of C towards G” means that the predictor C (resp. G) is on the left (resp. right) of the caesura. Organism B. burgdorferi B. subtilis C. trachomatis E. coli H. influenzae H. pylori M. tuberculosis M. pneumoniae M. genitalium M. thermoautotrophicum P. horikoshii R. prowazekii T. pallidum

Origin Terminus 50.1 0 69 84.6 32.9 96 0 25.1 0 75 ? 0 0

0 47.8 19 34.1 80.1 47 49 73 ? 24 ? 56 48

Caesura of Caesura of C towards G G towards C 50.3 0 0 46.1 69.1 18.7 84.6 33.4 33.2 80.1 98 48.8 0 46.3 25 74.1 0 50.8 74.9 23.3 0 43.6 0 56.5 0 48.6

References 1. W.D. Fisher. On grouping for maximal homogeneity. Journal of the American Statistical Association, 53:789–798, 1958. 2. A.D. Gordon. Cluster validation. In C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.H. Bock, and Y. Baba, editors, Studies in Classification, Data Analysis, and Knowledge Organization: Data Science, Classification, and Related Methods, pages 22–39, Kobe, March 1996. IFCS, Springer-Verlag. http://www-solar.dcs.st-and.ac.uk/˜allan/. 3. J.C. Gower. Maximal predictive classification. Biometrics, 30:643–654, 1974. 4. L. Gu´eguen, R. Vignes, and J. Lebbe. Maximal predictive clustering with order constraint: a linear and optimal algorithm. In A. Rizzi, M. Vichi, and H. Bock, editors, Advances in Data Science and Classification, pages 137–144. IFCS, Springer Verlag, July 1998. 5. D.M. Hawkins and D.F. Merriam. Optimal zonation of digitized sequential data. Mathematical Geology, 5(4):389–395, 1973. 6. J.R. Lobry. Asymmetric substitution patterns in the two dna strands of bacteria. Mol. Biol. Evol., 13(5):660–665, 1996. 7. E.P.C. Rocha, A. Danchin, and A. Viari. Universal replication biases in bacteria. Molecular Microbiology, 32(1):11–16, 1999.

Can We Have Confidence in a Tree Representation? Alain Gu´enoche1 and Henri Garreta2 1

IML - CNRS LIM - Universit´e de la M´editerran´ee, 163 Av. de Luminy, 13288 Marseille Cedex 9 2

Abstract. A tree representation distance method, applied to any dissimilarity array, always gives a valued tree, even if the tree model is not appropriate. In the first part, we propose some criteria to evaluate the quality of the computed tree. Some of them are metric; their values depend on the edge’s lengths. The other ones only depend on the tree topology. In the second part, we calculate the average and the critical values of these criteria, according to parameters. Three models of distance are tested using simulations. On the one hand, the tree model, and on the other hand, euclidean distances, and boolean distances. In each case, we select at random distances fitting these models and add some noise. We show that the criteria values permit one to differentiate the tree model from the others. Finally, we analyze a distance between proteins and its tree representation that is valid according to the criteria values.

1

Introduction

In this paper we consider aspects of the tree representation of a given proximity measure on a finite set X. This measure can be a distance, satisfying the metricity conditions, or a simple dissimilarity evaluated from any description of the X elements. Briefly speaking, we only have an array D, with dimension n = |X|, that is symmetrical with null values on the diagonal. The tree reconstruction methods define an X-tree having X as set of leaves and its internal nodes connect exactly three edges with non negative length. [Barth´elemy, Gu´enoche 1991]. When X is a set of homologous biological sequences and when the distance values are proportional to the number of mutations, the tree model is justified by the evolution theory of Darwin. But when this distance measures something else, such as a structural similarity or a functional proximity, it is not certified that an underlying tree exists and that the D values permit to infer this tree. There are many other domains where X-trees are calculated from a proximity measure, such as Cognitive Psychology, for studies about perception - noises, smells, etc. In Phylogeny, the tree nodes represent taxa or common ancestors, but in Psychology, these nodes correspond to categories and the reality of a tree organization of these categories is a hypothesis that must be verified and justified. O. Gascuel, M.-F. Sagot (Eds.): JOBIM 2000, LNCS 2066, pp. 45–56, 2001. c Springer-Verlag Berlin Heidelberg 2001

46

A. Gu´enoche and H. Garreta

The aim of this paper is to study the appropriateness of the tree model to represent a given distance D. To do that, we apply a tree reconstruction method to get a tree A and its associated distance Da . Then, we measure the gap between D and Da using metric and topological criteria. To evaluate how large this gap is, we realize simulations to estimate the observed values of the criteria when the initial distance is close to a tree distance and to determine confidence intervals. If the criteria values for D belong to these intervals, A and the tree model are accepted. But if not, one can try another reconstruction method, but if none of these gives an acceptable tree, the model will be finally abandoned. Then one can check if other classical models, such as euclidean or boolean models are suited. Similar studies have been realized previously. We refer to Prutzansky, Tversky and Caroll [1982] who try to determine if a given distance is better fitted with a hierarchical clustering method or with a multidimensional scaling one, that is if D is closer to the ultrametric model than to the euclidean one. This study extends their work, dealing with the tree model that is more general, adding the boolean model, and offering the possible conclusion that none of these models is appropriate. In Sect. 2, we recall some criteria to evaluate the quality of an X-tree compared to a distance on X. There are two kinds of criteria, metric and typological. The first ones come from Statistics, evaluating differences between the D and Da values. The second ones are less classical. They only depend on the tree topology and are independent from the edges length. These criteria also permit one to compare different trees obtained by several reconstruction methods. Even if they do not agree, they give quantitative arguments to prefer one representation. In Sect. 3, we describe our procedures to generate random distances considering three models: on the one hand the tree distances, and on the other hand, the euclidean distances and the boolean distances established from binary attributes. For the tree distances we fix the ratio between the lengths of the internal and external edges. For the other distances, several dimensions of the spaces are tested. Realizing simulations, we evaluate average and critical values. The comparisons of the corresponding tables show that the tree model cannot be confused with the other distance models. In Sect. 4, we study a TAT proteins family (Twin Arginine Pathway) for which we build a tree that is finally validated by the method described here, according to the tables.

2

Criteria to Evaluate X-Trees

Let D be a n × n dissimilarity array such that for all {x, y} ∈ X 2 , D(x, y) = D(y, x) and for all x ∈ X, D(x, x) = 0. Let A be an X-tree given by any reconstruction method. The tree A, together with an edge-weighting allows one to calculate a tree distance such that for all {x, y} ∈ X 2 , Da (x, y) is equal to the length of the path in A between x and y.

Can We Have Confidence in a Tree Representation?

47

To evaluate the quality of A, we compare the two distances D and Da . This is realized using a program Qualitree working with two files: one for the tree, registered according to the Newick format and the other one for the distance D having the Phylip format. This program can be obtained at http://biol10.biol.umontreal.ca/guenoche/. 2.1

Metric Criteria

Among all the criteria proposed in Statistics, we have selected: - the distortion, which is the average of the percentage of differences: Dis =

X |D(x, y) − Da (x, y)| 2 n(n − 1) D(x, y) x 50) reduces the sensitivity of the programs. The reason is that the larger clusters are less likely to be regulated by a single factor, and might contain a mixture of different signals. The effect of mixing together sequence that contains a given signal with sequences that do not contain it is also to reduce the signal-to-noise ratio. The programs are able, to some extent, to extract multiple signals from a single analysis, but the highest efficiency is clearly obtained by selecting clusters of genes that are likely to be all regulated by the same transcription factor. The choice of the clustering method is thus crucial.

5 Metabolic Network Analysis We focus now on the second question, namely the functional interpretation of clusters of co-regulated genes.

156

J. van Helden et al.

Table 1. Patterns extracted by dyad-analysis with the MET family. Legends: obs occ: observed occurrences; exp occ: expected occurrences; proba: binomial probability; sig: significance index. All patterns with significance value higher than 1 were selected. Some patterns can be grouped together on the basis of sequence similarities, and assembled into larger patterns (contigs). The first group corresponds to the sequence recognized by the protein complex Met4p/Met28p/Cbf1p. The second group describes the site bound by Met31p and its homologue Met32p. These factors are those known to regulate methionine biosynthesis in the yeast Saccharomyces cerevisiae. To our knowledge, the third group and the isolated patterns do not show any obvious similarity to known binding sites, and could reveal new regulatory patterns pattern GTC..GTG.. .TCA.GTG.. .TCACGT... ..CACGTG.. ..CAC.TGA. ..CAC..GAC ...ACGTGA. GTCACGTGAC

reverse complementary ..CAC..GAC ..CAC.TGA. ...ACGTGA. ..CACGTG.. .TCA.GTG.. GTC..GTG.. .TCACGT... GTCACGTGAC

CGCCAC.... .GCCACA... ..CCA..GTT ..CCACAG.. ..CCA.AGT. ...CACAGT. ...CAC.GTT CGCCACAGTT

....GTGGCG ...TGTGGC. AAC..TGG.. ..CTGTGG.. .ACT.TGG.. .ACTGTG... AAC.GTG... AACTGTGGCG

14 21 23 24 21 24 24

2.32 4.06 5.86 3.53 4.59 5.41 5.79

.CCA................GGT ACC................TGG. .CCA...............TGG. ACCA...............TGGT

15 15 22

2.9 5.10E-07 1.7 2.9 5.10E-07 1.7 3.16 1.10E-06 1.3

assembly

ACC................TGG. .CCA................GGT .CCA...............TGG. ACCA...............TGGT

isolated

CAG...TGG

CCA...CTG

17

3.12 4.60E-08 2.7

group 1

assembly

group 2

assembly

group 3

obs occ exp occ proba sig 17 2.61 3.60E-09 3.8 23 5.12 8.50E-09 3.4 21 4.75 4.60E-08 2.7 38 3.37 0 20 23 5.12 8.50E-09 3.4 17 2.61 3.60E-09 3.8 21 4.75 4.60E-08 2.7

2.00E-07 3.30E-09 9.10E-08 2.30E-12 2.60E-08 5.20E-09 1.90E-08

2.1 3.8 2.4 7 2.9 3.6 3.1

Representating Metabolic Pathways as Graphs The set of all possible metabolic reactions can be seen as a graph, with two types of nodes (metabolites and reactions respectively). Arcs represent substrate-reaction and reaction-product relationships. A graph containing all known metabolic reactions 4 would include of the order of 10 nodes and as many arcs. The connectivity is very high for some particular compounds (ATP, Adenosyl-Methionine), but besides these “pool metabolites”, the vast majority of compounds are involved in a very limited number of reactions. Reactions have between 1 and 6 substrates (2 on average) and as many products. The complexity of such a graph is huge and the number of possible pathways is virtually infinite.

Application of Regulatory Sequence Analysis and Metabolic Network Analysis

157

However, only a very restricted number of these possible pathways are effectively followed in living organisms. For instance, the database EcoCyc, which holds the most comprehensive information about Escherichia coli metabolism, only contains 159 distinct pathways. E. coli has been, for several decades, the preferred model organism for biochemists, and even though some parts of its metabolism certainly remain to be discovered, the number of pathways is not expected to increase significantly for this organism. Metabolic Pathway Discovery L-Aspartate

S.cerevisiae

E.coli

2.7.2.4

L-aspartyl-4-P 1.2.1.11

L-aspartic semialdehyde 1.1.1.3

L-Homoserine 2.3.1.31

2.3.1.46

Alpha-succinyl-L-Homoserine 4.2.99.9

O-acetyl-homoserine

Cystathionine 4.2.99.10

4.4.1.8

Homocysteine 2.1.1.14

L-Methionine 2.5.1.6

S-Adenosyl-L-Methionine

Fig. 4. Alternative pathways for methionine biosynthesis in E. coli and S. cerevisiae

Many pathways however remain be discovered in other organisms. Indeed, it is common to observe that different organisms follow distinct pathways for the biosynthesis or degradation of the same molecule. For example, in E. coli, methionine is synthesized in 7 steps from aspartate, whereas the yeast Saccharomyces cerevisiae performs this transformation in 6 steps, 4 of which are common with E. coli (Fig. 4). The case of lysine is more extreme, since E. coli and S. cerevisiae follow completely different pathways to synthesize this metabolite.

158

J. van Helden et al.

In addition, many parts of the metabolism remain largely unexplored, for example the mechanisms of toxic molecule degradation or resistance to extreme conditions observed in some bacteria. In summary, among all the pathways that could be followed in the graph of metabolic reactions, only a very restricted fraction corresponds to already described pathways. Another part corresponds to pathways that are not yet described but might appear to be effectively used by some organisms in response to some conditions. Finally, a vast majority of these pathways might be devoid of any biological relevance. As illustrated below, measuring the transcriptional response of all the genes of an organism could be one way to select those pathways that are most likely to correspond to biological processes. Metabolism and Gene Expression Living organisms can rapidly modify their internal concentration of small molecules (metabolites) via enzymatic catalysis. Controlling metabolite fluxes is essential to cell viability, in that it allows the cell to maintain biochemical compounds in stationary concentrations (homeostasis), in spite of fluctuations of their external availability and internal consumption rates. Several molecular mechanisms are involved in metabolic regulation. Enzymes and transporters are regulated at different levels: transcription rate, RNA stability, translation rate, protein activity, intracellular location, protein degradation. Several of these mechanisms can be combined for the control of the same metabolic pathway. Enzymes and transporters participating in a common metabolic pathway are often co-regulated at the transcriptional level. Thus, when the culture medium is modified by depleting (or adding) a given metabolite, it is expected that the genes that participate in the biosynthesis (or degradation) of the molecule will respond at the transcriptional level. DNA chip and related technologies can be used to unravel the set of genes that respond to a given perturbation of the external conditions (addition/removal of a metabolite) in a given organism. The question is thus to discover, from this set of genes, which particular pathway could be catalyzed. Applying Graph Analysis for a Functional Interpretation of Gene Expression Data The first step is to select, among the set of co-regulated genes, those that code for enzymes, and identify the reactions they could catalyze. These reactions correspond to a subset of nodes in the graph of all possible metabolic reactions (Fig. 5A). The method consists in trying to interconnect all these reactions in a meaningful way (Fig. 5B), in order to extract a sub-graph (Fig. 5C) corresponding to one or several putative metabolic pathways (Fig. 5D). The algorithms for subgraph extraction and maximal path enumeration will be described elsewhere (in prep.), and we will only summarize their principle.

Application of Regulatory Sequence Analysis and Metabolic Network Analysis

A. Seed reactions

Compound Reaction Seed Reaction

C. Subgraph extraction

159

B. Reaction linking

Direct link Intercalated reaction

D. Linear Path Enumeration

Fig. 5. Conceptual schema of the subgraph extraction. A: graph representation of metabolic pathways. Two types of nodes are used to represent metabolites (circles) and reactions (rectangles) respectively. Arcs represent the relationships between reactions and their substrates and products. Filled rectangles represent seed reactions, i.e. those that are catalyzed by genes belonging to the co-regulated cluster. B: A direct link (dark gray) can be established between two reactions when the first one produces a metabolite that is used as substrate by the second one. Optionally, one can decide to allow intercalating reactions (ligt gray) that were not part of the initial seeds, if this improves the connectivity. C: Connected components are then extracted from the graph. The extracted subgraph represents a metabolic pathway that is potentially catalyzed by the cluster of co-regulated genes. D: When the subgraph contains branches, it can be decomposed into non-redundant elmentary paths, which highlight potential endpoints of the metabolic pathways

The simplest way to interconnect reactions is to identify compounds that are produced by one reaction and used as substrate by another one. In a second step, linking can be improved by intercalating reactions that were not part of the initial set. Several reasons could be invoked to justify such an intercalation. Firstly, some genes could be involved in the metabolic pathway without being regulated at the transcriptional level. Secondly, microarray technologies are still limited in reproducibility and some regulations might have escaped detection. A third possibility would be that the gene is present on the chip and its expression level has been measured correctly, but this gene has not been annotated as an enzyme yet. Indeed, for newly sequenced genomes, gene function is usually predicted by sequence similarities, and many genes remain unan-

160

J. van Helden et al.

notated. In such a case, the best candidates to ensure the missing enzymatic catalysis are the genes that belong to the initial cluster itself, but have no assigned function yet. Comparison of Extracted Pathways with Known Pathways Once the subgraph has been extracted, the putative pathway can be compared to the set of known metabolic pathways stored in some metabolic pathway database [8, 9, 19, 20]. In some cases the pathway extracted from the gene cluster will correspond to some previously characterized pathway. For such cases, a simple matching of the set of reactions against a database of metabolic pathways would have provided the same answer. In other cases, one might observe only a partial match with a known pathways. The subgraph extraction might thus reveal an alternative to the pathway followed in the model organism. The method could be applied to study the metabolism of newly sequenced organisms, whose metabolism has been poorly characterized. Finally, in some cases, one should be able to extract completely novel pathways. The co-regulation of the enzyme-coding genes would provide a good support to indicate that this pathway is biologically relevant. An interesting field of application would be to discover metabolic pathways involved in largely unexplored processes, such as resistance to toxic compounds or extreme conditions. Another application is to reveal which metabolic pathways are affected by a new drug. Application of Pathway Analysis to the Study Case We applied the above procedure to the 20 genes belonging to the MET cluster defined by Spellman and co-workers. Seven of these genes code for enzymes, which can catalyze 8 distinct reactions. Subgraph extraction and maximal path enumeration resulted in a linear pathway including 6 of the initial reactions (Fig. 6). In this case, the linear path was obtained without intercalating any reaction that was not part of the initial set. The pathway shows partial matches with two distinct metabolic pathways: the 4 initial steps match the sulfur assimilation pathway, and perform a progressive reduction of sulfate into sulfide. The two last steps match the methionine biosynthesis pathway, and correspond to the incorporation of sulfur into homocysteine, and the transformation of the latter into methionine. Sulfur assimilation and methionine biosynthesis are intrinsically related in the yeast Saccharomyces cerevisiae, since in this organism sulfur amino acids are all derived from the methionine biosynthesis pathway (this differs from Escherichia coli, where sulfur is incorporated into cysteine and then transferred to methionine). It makes thus sense to have a coordinated transcriptional regulation of all the metabolic steps from sulfate to methionine. Indeed, these genes are all known targets of the methionine-regulating transcription factors described above [21].

Application of Regulatory Sequence Analysis and Metabolic Network Analysis

Genes

Enzymes

161

Matching pathways

Pathway extracted Sulfate

MET14

Adenylyl sulfate kinase

MET16

3’-phosphoadenylylsulfate reductase

MET5

Putative Sulfite reductase

1.8.1.2

MET10

Sulfite reductase (NADPH)

sulfide

MET17

O-acetylhomoserine (thiol)-lyase

4.2.99.10

MET6

Methionine synthase (vit B12-independent)

2.7.7.4

ATP PPi

Adenylyl sulfate (APS) 2.7.1.25

ATP ADP

3’-phosphoadenylylsulfate (PAPS) 1.8.99.4

NADPH NADP+; AMP; 3’-phosphate (PAP); H+

sulfite 3 NADPH; 5H+ 3 NADP+; 3 H2O

O-acetyl-homoserine

Homocysteine 2.1.1.14

L-Methionine

5-methyltetrahydropteroyltri-L-glutamate 5-tetrahydropteroyltri-L-glutamate

Methionine biosynthesis

Sulfate adenylyl transferase

Sulfur assimilation

MET3

Fig. 6. Result obtained by pathway analysis with the cell-cycle regulated gene cluster MET from Spellman [10]. Six of the reactions potentially catalyzed by enzymes-coding genes from the cluster can be assembled into a single linear pathway, without need to intercalate any additional reaction. The pathway extracted is the way used by yeast to incorporate sulfur into amino acids, by reduction of sulfate into sulfide, which is incorporated in homocyteine. This pathway matches two distinct pathways from the database: the first 4 steps correspond to sulfur assimilation, whereas the two last steps are part of the methionine biosynthetic pathway

In summary, starting from an unordered set of reactions, the program was able to build a linear metabolic pathway, which correspond to our expectation for the study case. In this particular case, the pathway was already well characterized and a similar result would have been obtained by matching the seed reactions against a database of metabolic pathways like KEGG. However, since this pathway was re-discovered by the program without any a priori information about how reactions do assemble into pathways in the yeast (the matching with known pathways was only done a posteriori), one can hope that the same method will also provide a means of discovering novel pathway. We are currently optimizing the program and evaluating it performances in different conditions, on the basis of well characterized pathways. The optimized program will then be used to provide an interpretation of gene expression data in terms of metabolic pathways.

162

J. van Helden et al.

6 Conclusions In the context of genomic approaches, coding sequence analysis is often insufficient to systematically assign a function to each gene. The function depends not only on the structure of the encoded protein, but also on the context in which this protein exerts its activity. Functional predictions thus require the integration of different levels of information. The possibility to measure the transcriptional response at a genome scale offers exciting perspectives for the discovery of gene function, taking into account the ways genes are associated in functional clusters. By combining regulatory sequence analysis and metabolic pathway analysis, one could obtain two independent and complementary sources of information for these clusters of co-regulated genes. The same methods also apply to clusters of genes obtained from other functional genomics approaches, such as phylogenetic profiles [22] and gene fusion/fission analysis [23-25].

7 Availability Regulatory Sequence Analysis tools are available on the web [26] at the URL http://ucmb.ulb.ac.be/bioinformatics/rsa-tools/. The home page for the EBI project of database on Protein Function and Biochemical Pathways is at http://www.ebi.ac.uk/research/pfbp/. A prototype version of the patwhay analysis tools can be accessed from this site. A prototype version of the visualization tools is available at http://www.soi.city.ac.uk/~msch/. Acknowledgements. Jacques van Helden was funded by the European Commission Contract N0: QLRI-CT-1999-01333.

References [1] DeRisi, J.L., Iyer, V.R. & Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-6 (1997). [2] Brown, P.O. & Botstein, D. Exploring the new world of the genome with DNA microarrays. Nat Genet 21, 33-7 (1999). [3] Eisen, M.B. & Brown, P.O. DNA arrays for analysis of gene expression. Methods Enzymol 303, 179-205 (1999). [4] Tamayo, P. et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96, 2907-12 (1999). [5] Vilo, J., Brazma, A., Jonassen, I. & Ukkonen, E. Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data. ISMB (2000). [6] Wingender, E. et al. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 28, 316-319 (2000).

Application of Regulatory Sequence Analysis and Metabolic Network Analysis

163

[7] Salgado, H. et al. RegulonDB (version 3.0): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 28, 65-67 (2000). [8] van Helden, J. et al. From molecular activities and processes to biological function. Briefings in Bioinformatics in press(2001). [9] van Helden, J. et al. Representing and analysing molecular and cellular function using the computer. Biol Chem 381, 921-35 (2000). [10] Spellman, P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97 (1998). [11] Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998). [12] Gilbert, D., Schroeder, M. & van Helden, J. Interactive visualization and exploration of relationships between biological objects. Trends in Biotechnology 18, 487-495 (2000). [13] van Helden, J., Andre, B. & Collado-Vides, J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281, 827-42 (1998). [14] van Helden, J., Rios, A.F. & Collado-Vides, J. Discovering regulatory elements in noncoding sequences by analysis of spaced dyads. Nucleic Acids Res 28, 1808-18 (2000). [15] Brazma, A., Jonassen, I., Vilo, J. & Ukkonen, E. Predicting gene regulatory elements in silico on a genomic scale. Genome Res 8, 1202-15 (1998). [16] Graber, J.H., Cantor, C.R., Mohr, S.C. & Smith, T.F. Genomic detection of new yeast premRNA 3’-end-processing signals. Nucleic Acids Res 27, 888-94 (1999). [17] Reinert, G. & Schbath, S. Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J Comput Biol 5, 223-53 (1998). [18] van Helden, J., del Olmo, M. & Perez-Ortin, J.E. Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res 28, 1000-10 (2000). [19] Karp, P.D. et al. The EcoCyc and MetaCyc databases. Nucleic Acids Res 28, 56-59 (2000). [20] Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28, 27-30 (2000). [21] Thomas, D. & Surdin-Kerjan, Y. Metabolism of sulfur amino acids in Saccharomyces cerevisiae. Microbiol Mol Biol Rev 61, 503-32 (1997). [22] Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. & Yeates, T.O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96, 4285-8 (1999). [23] Marcotte, E.M. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751-3 (1999). [24] Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O. & Eisenberg, D. A combined algorithm for genome-wide prediction of protein function [see comments]. Nature 402, 83-6 (1999). [25] Enright, A.J., Iliopoulos, I., Kyrpides, N.C. & Ouzounis, C.A. Protein interaction maps for complete genomes based on gene fusion events [see comments]. Nature 402, 86-90 (1999). [26] van Helden, J., Andre, B. & Collado-Vides, J. A web site for the computational analysis of yeast regulatory sequences. Yeast 16, 177-87 (2000).

Our partners will collect data and use cookies for ad personalization and measurement. Learn how we and our ad partner Google, collect and use data. Agree & close