The method of CKSAAP_OGlySite

Datasets

The experimentally validated mucin-type O-glycosylation sites of mammalian proteins were extracted from the Swiss-Prot database. The final dataset contains 103 proteins covering 116 S and 212 T sites. All S and T residues in these protein sequences that carry no annotation related to O-glycosylation were selected as negative sites.
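The selection of positive and negative sites can be sketched as follows. This is a minimal illustration, not the server's actual code: the function name `extract_sites`, the 1-based `glyco_positions` input, and the use of the symbol O to pad fragments that run past a sequence terminus are all assumptions. The fragment length follows the 2n+1 window (n = 9) described under Feature construction.

```python
def extract_sites(sequence, glyco_positions, n=9, pad="O"):
    """Split the S/T residues of one protein into positive and negative fragments.

    sequence        -- protein sequence in one-letter code
    glyco_positions -- set of 1-based positions annotated as O-glycosylated
                       (hypothetical input format)
    n               -- half window size; each fragment has 2n+1 residues
    pad             -- symbol used beyond the termini (assumed to be 'O')
    Returns (positives, negatives), each a list of (position, fragment).
    """
    # Pad both ends so every S/T residue gets a full-length fragment.
    padded = pad * n + sequence + pad * n
    positives, negatives = [], []
    for i, residue in enumerate(sequence):
        if residue not in "ST":
            continue
        # sequence[i] sits at padded[i + n], so this slice is centred on it.
        fragment = padded[i:i + 2 * n + 1]
        target = positives if (i + 1) in glyco_positions else negatives
        target.append((i + 1, fragment))
    return positives, negatives
```

For example, `extract_sites("MSTA", {2}, n=2)` treats the S at position 2 as a positive site and the T at position 3 as a negative site.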

Feature construction

A new feature construction method, termed CKSAAP (composition of k-spaced amino acid pairs) encoding, was employed. The detailed procedure is described as follows.

Generally, a sequence fragment of 2n+1 amino acids (i.e. the window size is equal to 2n+1; n was set to 9, giving 19 residues) is used to define a glycosylation site. For k-spaced amino acid pairs (i.e. pairs that are separated by k other amino acids) within this sequence fragment, there are 21*21 = 441 possible types (AA, AC, AD, ..., OO). A feature vector of that size is then used to represent the composition of these pairs, which can be described as

(cAA, cAC, cAD, ..., cOO)

The value of each feature denotes the composition of the corresponding amino acid pair in the fragment. For instance, if an AD pair occurs m times in this fragment, the corresponding value in the vector (i.e. cAD) is equal to m. The amino acid pairs for k=0, 1,...,kmax are considered together in this study, so the total dimension of the proposed feature vector is 441*(kmax+1).
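The encoding described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the 21-symbol alphabet is taken to be the 20 standard amino acids in one-letter alphabetical order followed by O, the names `ALPHABET`, `PAIR_INDEX` and `cksaap` are hypothetical, and pairs containing any other symbol are simply skipped.

```python
from itertools import product

# Assumed alphabet: 20 standard residues plus 'O', ordered so that the
# pair list runs AA, AC, AD, ..., OO as in the text.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYO"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 441 pair types
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def cksaap(fragment, k_max=0):
    """Count k-spaced amino acid pairs for k = 0..k_max.

    Returns a list of length 441 * (k_max + 1); block k holds the counts
    of pairs separated by exactly k other residues.
    """
    vector = [0] * (len(PAIRS) * (k_max + 1))
    for k in range(k_max + 1):
        offset = k * len(PAIRS)
        # Residues i and i + k + 1 have k residues between them.
        for i in range(len(fragment) - k - 1):
            index = PAIR_INDEX.get(fragment[i] + fragment[i + k + 1])
            if index is not None:  # skip symbols outside the alphabet
                vector[offset + index] += 1
    return vector
```

For the toy fragment ADAD with kmax = 1, the k = 0 block records AD twice and DA once, and the k = 1 block records AA and DD once each, so the vector has 882 entries that sum to 5.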

To benchmark the proposed CKSAAP encoding, prediction based on the binary encoding was also carried out. In this encoding scheme, each amino acid is represented by a 21-dimensional binary vector, e.g. A (100000000000000000000), C (010000000000000000000), ..., O (000000000000000000001). For a query O-glycosylation site represented by a fragment of 2n+1 residues, the central residue is always S or T and therefore does not need to be encoded. The total dimension of the binary feature vector is thus 21*2n.
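A minimal sketch of the binary encoding follows. The alphabet order (20 standard residues alphabetically by one-letter code, then O) is inferred from the A, C, ..., O examples above, and the function name `binary_encode` is hypothetical.

```python
# Assumed alphabet order: A -> position 0, C -> position 1, ..., O -> position 20.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYO"


def binary_encode(fragment):
    """One-hot encode every residue of a 2n+1 fragment except the centre.

    Returns a list of length 21 * 2n.
    """
    centre = len(fragment) // 2
    vector = []
    for i, residue in enumerate(fragment):
        if i == centre:
            continue  # the central S/T carries no information
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(residue)] = 1
        vector.extend(one_hot)
    return vector
```

With n = 9 this yields 21*18 = 378 features per site, each flanking residue contributing exactly one non-zero entry.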

Support Vector Machine (SVM)

The SVM is a machine-learning algorithm for two-class classification whose goal is to find a rule that best maps each member of the training set to its correct class. It has been widely used in the field of protein bioinformatics. The implementation of the SVM algorithm used here was SVM-Light (http://svmlight.joachims.org/).
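SVM-Light reads training examples as text lines of the form `<target> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The helper below is a sketch of how a feature vector might be written in that format; the function name `to_svmlight_line` is hypothetical, and the kernel and parameters used for the server are not stated here, so none are shown.

```python
def to_svmlight_line(label, vector):
    """Format one example as an SVM-Light input line.

    label  -- +1 for a glycosylation site, -1 for a negative site
    vector -- dense list of feature values; zeros are omitted (sparse format)
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    # Feature indices are 1-based and must appear in increasing order.
    features = " ".join(
        f"{index}:{value}"
        for index, value in enumerate(vector, start=1)
        if value != 0
    )
    return f"{label:+d} {features}".rstrip()
```

A file of such lines can then be passed to the `svm_learn` program of the SVM-Light distribution to train a model, and to `svm_classify` to score new examples.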

Server construction

This site was developed on a Linux platform with CGI scripts. To train the prediction model, the ratio of O-glycosylation sites to non-glycosylation sites in the training dataset was set to 1:5.
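One way to obtain a 1:5 training set is to draw negative sites at random until there are five for every positive site. How the negatives were actually chosen is not stated, so the random sampling, the fixed seed, and the function name `subsample_negatives` below are assumptions made for illustration.

```python
import random


def subsample_negatives(positives, negatives, ratio=5, seed=0):
    """Randomly pick `ratio` negatives per positive, without replacement.

    Random selection is an assumption; the text only fixes the 1:5 ratio.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    wanted = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, wanted)
```

With the 328 positive sites reported above (116 S plus 212 T), this would select 1640 negative sites, provided at least that many are available.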