Since high-throughput sequencing tools became available, the number of known protein sequences has been growing at an unprecedented rate. Information about the structure or function of these proteins, on the other hand, is extremely sparse. Biologists who study proteins make extensive use of protein search engines to find homologous sequences whose structure or function is known. One well-known measure of sequence similarity is the Smith-Waterman (SW) alignment score. Because calculating the SW score is computationally expensive, various approximations for finding homologous sequences have been suggested; of these, the current de facto standards for protein searching are the BLAST and PSI-BLAST methods of Altschul et al. While BLAST is an efficient approximation of the optimal SW alignment, it is still, from a computer science standpoint, a very inefficient method, as it compares the query sequence to each and every sequence in the database. We present a method for indexing and searching proteins using amino acid patterns. As a source of patterns, we use the BLOCKS library of Henikoff and Henikoff. Position-specific scoring matrices (PSSMs) are used to identify pattern occurrences, and the PSSMs are refined iteratively. Each iteration consists of a "scan" stage, in which we identify all statistically significant pattern occurrences in the sequence set, and a refinement stage, in which we use the identified occurrences to define better PSSMs. The final refined PSSMs are then used to index proteins in the UniProt Knowledgebase (UniProtKB), creating an efficient and accurate tool for searching protein homologues.
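The scan-and-refine iteration described above can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the implementation described in the report: the function names (`build_pssm`, `scan`, `refine`), the uniform background distribution, the pseudocount, and the fixed score threshold are all hypothetical choices made here. In particular, the fixed threshold stands in for the statistical significance test, which the abstract does not specify.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(occurrences, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length pattern occurrences.

    Assumes a uniform background over the 20 amino acids (an
    illustrative simplification, not taken from the report).
    """
    width = len(occurrences[0])
    background = 1.0 / len(AMINO_ACIDS)
    total = len(occurrences) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(occ[col] for occ in occurrences)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(pssm, sequence, threshold):
    """Return (offset, score) for every window scoring at least threshold.

    The fixed threshold is a hypothetical stand-in for a significance test.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(aa not in AMINO_ACIDS for aa in window):
            continue  # skip windows containing non-standard residues
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def refine(sequences, seed_occurrences, threshold, iterations=3):
    """Alternate scan and refinement stages for a fixed number of iterations."""
    pssm = build_pssm(list(seed_occurrences))
    width = len(pssm)
    for _ in range(iterations):
        # Scan stage: collect every window that passes the threshold.
        found = [
            seq[start:start + width]
            for seq in sequences
            for start, _ in scan(pssm, seq, threshold)
        ]
        if not found:
            break  # nothing to refine from; keep the previous PSSM
        # Refinement stage: rebuild the PSSM from the occurrences found.
        pssm = build_pssm(found)
    return pssm
```

For example, `refine(["GGACDGG", "TTACETT"], ["ACD", "ACD", "ACE"], threshold=3.0)` seeds a three-column PSSM from the given occurrences and re-estimates it from the windows found in the two sequences. An index built on such PSSMs would store the hits returned by `scan` for each database sequence; how the report organizes that index is not stated in the abstract.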
The authors of these documents have submitted their reports to this technical report series for the purpose of non-commercial dissemination of scientific work. The reports are copyrighted by the authors, and their existence in electronic format does not imply that the authors have relinquished any rights. You may copy a report for scholarly, non-commercial purposes, such as research or instruction, provided that you agree to respect the author's copyright. For information concerning the use of this document for other than research or instructional purposes, contact the authors. Other information concerning this technical report series can be obtained from the Computer Science and Engineering Department at the University of California at San Diego, email@example.com.