Pattern Discovery from Biosequences

Jaak Vilo

Department of Computer Science
P.O. Box 26, FIN-00014 University of Helsinki, Finland

PhD Thesis, Series of Publications A, Report A-2002-3, Helsinki, November 2002, 149 pages

ISSN 1238-8645
ISBN 952-10-0792-3 (paperback)
ISBN 952-10-0819-9 (PDF)


In this thesis we have developed novel methods for analyzing biological data: the primary sequences of DNA and proteins, microarray-based gene expression data, and other functional genomics data. The main contribution is the pattern discovery algorithm SPEXS, accompanied by several practical applications to real biological problems. To perform these biological studies, which integrate different types of biological data, we have developed a comprehensive web-based biological data analysis environment, Expression Profiler.

Biosequences, i.e., the primary sequences of DNA, RNA, and protein molecules, represent the most basic type of biological information. Features of these sequences that are reused by nature help us to better understand the basic mechanisms of gene structure, function, and regulation. The SPEXS algorithm has been developed for the discovery of biologically relevant features that can be represented in the form of sequence patterns. SPEXS is a fast exhaustive search algorithm for the class of generalized regular patterns. This class is essentially the same as that used in the PROSITE pattern database, i.e., it allows patterns to consist of fixed character positions, group character positions (ambiguities), and wildcards of variable length. The biological relevance of the patterns can be estimated according to several different mathematical criteria, which have to be chosen according to the application.
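To make the pattern class concrete, the following is a minimal sketch that translates a simplified PROSITE-style pattern into a regular expression and counts how many sequences contain a match. It is not the SPEXS algorithm, which discovers patterns rather than matching a given one. The function names `prosite_to_regex` and `support`, the supported syntax subset (fixed characters, `[...]` groups, `x` wildcards with optional `(n)` or `(m,n)` lengths), and the example sequences are assumptions made for illustration.

```python
import re

# One pattern element: a wildcard "x", a fixed character, or a character
# group "[...]", optionally followed by a length "(n)" or a range "(m,n)".
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = match.groups()
        regex = "." if symbol == "x" else symbol
        if low is not None:
            # "(n)" becomes "{n}", "(m,n)" becomes "{m,n}".
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


def support(pattern: str, sequences: list[str]) -> int:
    """Count the sequences that contain at least one match of the pattern."""
    compiled = re.compile(prosite_to_regex(pattern))
    return sum(1 for seq in sequences if compiled.search(seq))


if __name__ == "__main__":
    pattern = "C-x(2,4)-[ST]-G"  # fixed C, 2 to 4 wildcards, S or T, fixed G
    print(prosite_to_regex(pattern))
    print(support(pattern, ["ACAATGA", "CASG", "TTCAAAASGT"]))
```

A count of matching sequences such as the one returned by `support` is the kind of raw quantity on which a relevance criterion could be built, for example by comparing the count in a set of interest against a background set; which criterion is appropriate depends on the application.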

We have used SPEXS for the analysis of real biological problems, where we have been able to find biologically meaningful patterns in a variety of applications. For example, we have studied gene regulation mechanisms by systematically predicting transcription factor binding sites and other signals in the DNA. To find genes that potentially share common regulatory mechanisms, we have used microarray-based gene expression data to extract sets of coexpressed genes.
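As an illustration of what extracting a set of coexpressed genes can look like, the sketch below selects the genes whose expression profiles correlate strongly with that of a query gene. The use of Pearson correlation, the threshold of 0.9, the function names, and the toy profiles are all assumptions chosen for this example; the abstract does not state which similarity measure or selection procedure was used.

```python
from math import sqrt


def pearson(a: list[float], b: list[float]) -> float:
    """Pearson correlation of two equally long expression profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_a * var_b)


def coexpressed_with(query: str,
                     profiles: dict[str, list[float]],
                     threshold: float = 0.9) -> list[str]:
    """Return the genes whose profile correlates with the query's profile."""
    reference = profiles[query]
    return sorted(
        gene
        for gene, profile in profiles.items()
        if gene != query and pearson(reference, profile) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical expression levels measured under four conditions.
    profiles = {
        "g1": [1.0, 2.0, 3.0, 4.0],
        "g2": [2.0, 4.0, 6.0, 8.0],
        "g3": [4.0, 3.0, 2.0, 1.0],
        "g4": [1.0, 1.0, 1.0, 1.0],
    }
    print(coexpressed_with("g1", profiles))
```

The sequences associated with the genes in such a set, for instance their upstream regions, would then be natural input for a pattern search.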

We have also demonstrated that it is possible to predict the type of interaction between a G-protein coupled receptor (GPCR) and its respective G-protein, a mechanism widely used by cells in signaling pathways. Although GPCRs have been studied for decades, primarily because of their immense value to the pharmaceutical industry, such a prediction had been thought unlikely from the primary sequence of the GPCR alone.

The tools developed for various practical analysis tasks have been integrated into a web-based data mining environment, Expression Profiler, hosted at the European Bioinformatics Institute (EBI). With the tools in Expression Profiler it is possible to analyze a range of different types of data, such as sequences, numerical gene expression data, functional annotations, or protein-protein interaction data, as well as to combine these analyses.

Computing Reviews (1998) Categories and Subject Descriptors:

F.2.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity - Nonnumerical Algorithms and Problems

H.2.8 [Information Systems]: Database Management - Database Applications

H.3.5 [Information Systems]: Information Storage and Retrieval - Online Information Services

I.5.3 [Computing Methodologies]: Pattern Recognition - Clustering

J.3 [Computer Applications]: Life and Medical Sciences

General Terms:
Algorithms, Biology and Genetics, Bioinformatics

Additional Key Words and Phrases:
Pattern Discovery, Data Mining, Gene Expression Data Analysis, Functional Genomics, Scientific Visualization