Large scale identification and categorization of protein sequences using structured logistic regression

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Documents

Large Scale Identification and Categorization of Protein ...
Final published version, 890 KB, PDF document

Bjørn Panella Pedersen
Georgiana Ifrim
Poul Liboriussen
Kristian B Axelsen
Michael Broberg Palmgren
Poul Nissen
Carsten Wiuf
Christian N S Pedersen

Abstract

Background
Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. The current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences, and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem.

Results
Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. We analysed 9.3 million sequences in the UniProtKB, representing the bulk of currently known sequences, and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps across organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases.

Conclusions
Using the SLR-based classification tool, we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem, and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.

Original language	English
Article number	e85139
Journal	PLOS ONE
Volume	9
Issue number	1
Number of pages	11
ISSN	1932-6203
DOI	https://doi.org/10.1371/journal.pone.0085139
Status	Published - 20 Jan. 2014

ID: 100977334

Department of Mathematical Sciences