SCOOP - Simple Comparison Of Outputs Program

SCOOP (Simple Comparison Of Outputs Program) compares families of proteins. It provides an alternative to profile-profile comparison: the method is conceptually simple, yet its results are comparable with those of other state-of-the-art tools.

Information

If you find this software or results useful please cite the following paper:

  • SCOOP: a simple method for identification of novel protein superfamily relationships.

    Bateman A and Finn RD

    Bioinformatics (Oxford, England) 2007;23;7;809-14

What's new in SCOOP_0.02

The latest version of SCOOP may be downloaded from here.

Please read the README for installation instructions.

scoop.pl now has two scoring modes: the original method and the weighted method, which takes into account the E-values of the overlapping regions to the HMMs. Both methods may be used at the same time; see perldoc scoop.pl for details.

SCOOP may also be run with an E-value threshold. If the threshold is set to, for example, 700, then any pair of overlapping regions with a mean E-value greater than 700 is not counted. Using a lower E-value threshold, e.g. 500, increases specificity at the cost of sensitivity. Recommended thresholds to try are 700, 500 and 350.
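
The thresholding rule above can be illustrated with a minimal Python sketch. It is not taken from scoop.pl: the function name count_overlaps, the representation of each overlap as a pair of E-values, and the use of the arithmetic mean are all assumptions made for illustration, and the E-values are invented.

```python
from statistics import mean

def count_overlaps(overlaps, threshold=None):
    """Count pairs of overlapping regions, skipping pairs above the threshold.

    overlaps  -- iterable of (evalue_a, evalue_b) tuples, one per pair of
                 overlapping regions (hypothetical representation)
    threshold -- E-value threshold; None means no threshold is applied
    """
    kept = 0
    for evalue_a, evalue_b in overlaps:
        # Assumption: "mean E-value" is the arithmetic mean of the two regions.
        if threshold is not None and mean((evalue_a, evalue_b)) > threshold:
            continue  # pair is not counted
        kept += 1
    return kept

# Invented E-values, for illustration only.
pairs = [(0.001, 12.0), (500.0, 700.0), (300.0, 450.0), (900.0, 950.0)]

for threshold in (None, 700, 500, 350):
    print(threshold, count_overlaps(pairs, threshold))
```

With these invented values, each step down in threshold drops one more pair, which is the specificity and sensitivity trade-off described above.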

There is also a new program called pairwise_scoop.pl. It takes two HMMER2 output files and compares them in the same way as scoop.pl does.

SCOOP results

SCOOP results for release 23 of Pfam

The results of running SCOOP with no E-value threshold, against all families in the current release of Pfam (release 23), may be found here: Pfam release 23.0.

Only matches with a normalised score of 10 or more are included to keep the file size reasonable.

This file also includes information about clan membership in the final column. Pairs of families in the same clan are labelled TRUE, whilst pairs that belong to different clans are labelled FALSE. The label UNKNOWN is used if one or both families do not belong to a clan. Some high-scoring pairs of families are labelled NESTED, as one Pfam domain may be found nested within another. Note that this labelling is a post-processing step and is not included in the normal SCOOP output.
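
The labelling rules just described could be sketched as follows. This is an illustrative Python sketch, not the actual post-processing script: the function clan_label, the clan_of mapping, the nested_pairs set and the clan names are hypothetical, and giving NESTED precedence over the clan comparison is an assumption.

```python
def clan_label(fam1, fam2, clan_of, nested_pairs=frozenset()):
    """Return TRUE, FALSE, UNKNOWN or NESTED for a pair of families.

    clan_of      -- dict mapping family identifier to clan (hypothetical input)
    nested_pairs -- set of frozenset({fam_a, fam_b}) for nested domains
                    (hypothetical input)
    """
    # Assumption: a nested pair is labelled NESTED whatever its clans are.
    if frozenset((fam1, fam2)) in nested_pairs:
        return "NESTED"
    clan1 = clan_of.get(fam1)
    clan2 = clan_of.get(fam2)
    # UNKNOWN when one or both families are not in any clan.
    if clan1 is None or clan2 is None:
        return "UNKNOWN"
    return "TRUE" if clan1 == clan2 else "FALSE"

# Clan names below are placeholders, not real Pfam clan accessions.
clan_of = {"Kazal_1": "clan_A", "Kazal_2": "clan_A", "SAM_1": "clan_B"}
nested = {frozenset(("RNA_pol_Rpb2_2", "RNA_pol_Rpb2_1"))}

print(clan_label("Kazal_1", "Kazal_2", clan_of, nested))
print(clan_label("Kazal_1", "SAM_1", clan_of, nested))
print(clan_label("Kazal_1", "MazG", clan_of, nested))
print(clan_label("RNA_pol_Rpb2_2", "RNA_pol_Rpb2_1", clan_of, nested))
```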

A snippet of the results file is shown here:

      792.669 2000.552 Fe-ADH 3914 DHQ_synthase 2595 TRUE
      734.837 2498.263 RNA_pol_Rpb2_2 1821 RNA_pol_Rpb2_1 1729 NESTED
      718.251 1833.564 SecA_PP_bind 1346 SecA_DEAD 1034 NESTED
      699.076 1771.645 MutS_IV 1215 MutS_III 1080 NESTED
      668.824 2272.494 Kazal_1 1793 Kazal_2 1566 TRUE
      660.666 2139.904 Laminin_G_1 1636 Laminin_G_2 1516 TRUE
      638.108 2146.351 Amidohydro_1 7421 Amidohydro_3 5911 TRUE
      635.529 1819.528 Flavodoxin_2 3618 FMN_red 2443 TRUE
      633.738 2132.938 SAM_1 2084 SAM_2 1636 TRUE
      626.493 1203.871 Ala_racemase_N 4062 Orn_Arg_deC_N 1452 TRUE
      624.429 2776.300 RNA_pol_A_bac 2042 RNA_pol_L 1843 NESTED
      619.758 990.697 HWE_HK 1396 HisKA_2 762 TRUE
      619.551 1079.223 PRA-PH 1949 MazG 1143 TRUE

The results columns are as follows:

  1. SCOOP normalised score. Scores greater than 30 are very likely to be true matches.
  2. SCOOP raw score weighted using E-values. Scores greater than 100 are very likely to be true matches.
  3. Pfam family 1 identifier.
  4. Number of matches in total in family 1.
  5. Pfam family 2 identifier.
  6. Number of matches in total in family 2.
  7. Number of common matches between family 1 and family 2.
  8. Whether the relationship is TRUE, FALSE, UNKNOWN or NESTED, as predicted by Pfam clan membership.
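
To work with this file programmatically, a small parser along the following lines may help. It is a sketch under assumptions: fields are taken to be whitespace-separated, the names ScoopHit and parse_hit are hypothetical, and because the rows in the snippet carry seven fields while the column list has eight entries, the common-matches column is treated as optional.

```python
from typing import NamedTuple, Optional

class ScoopHit(NamedTuple):
    normalised_score: float
    weighted_raw_score: float
    family1: str
    matches1: int
    family2: str
    matches2: int
    common: Optional[int]  # None when the row has no common-matches field
    label: str

def parse_hit(line: str) -> ScoopHit:
    """Parse one whitespace-separated row of the results file."""
    fields = line.split()
    if len(fields) not in (7, 8):
        raise ValueError(f"expected 7 or 8 fields, got {len(fields)}: {line!r}")
    common = int(fields[6]) if len(fields) == 8 else None
    return ScoopHit(
        normalised_score=float(fields[0]),
        weighted_raw_score=float(fields[1]),
        family1=fields[2],
        matches1=int(fields[3]),
        family2=fields[4],
        matches2=int(fields[5]),
        common=common,
        label=fields[-1],
    )

# Rows copied from the snippet of the results file.
SAMPLE = """\
      792.669 2000.552 Fe-ADH 3914 DHQ_synthase 2595 TRUE
      734.837 2498.263 RNA_pol_Rpb2_2 1821 RNA_pol_Rpb2_1 1729 NESTED
      619.758 990.697 HWE_HK 1396 HisKA_2 762 TRUE
"""

hits = [parse_hit(line) for line in SAMPLE.splitlines() if line.strip()]

# Keep pairs above the normalised-score level described as very likely true.
likely = [h for h in hits if h.normalised_score > 30]
for h in likely:
    print(h.family1, h.family2, h.normalised_score, h.label)
```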

Previous results

SCOOP results for previous releases of Pfam

The SCOOP results for the previous releases were created with the original version of SCOOP. Therefore, the output file is a little different. See below for details.

As before, only matches with a score of 10 or more are included to keep the file size reasonable. The file also includes information about which families are in the same (TRUE) or different (FALSE) clans in the final column.

An example of the original output is shown here:

      15.441800713964 7tm_1 31589 7tm_2 4405 Both 1949 FALSE
      11.1314824652934 7tm_1 31589 ER_lumen_recept 2928 Both 627 LINKED_1
      12.1693723225486 7tm_1 31589 7tm_4 31064 Both 10640 TRUE
      13.5441035547411 7tm_1 31589 Frizzled 541 Both 152 FALSE
      26.5288908365378 7tm_1 31589 Serpentine_recp 10080 Both 7162 TRUE
      14.2477346712845 7tm_1 31589 DUF131 10132 Both 2742 LINKED_1
      11.4159042357542 7tm_1 31589 Sra 1984 Both 611 FALSE
      14.1400325551833 7tm_1 31589 Srg 6008 Both 2201 LINKED_1
      23.0041659690462 7tm_1 31589 Srb 2042 Both 962 FALSE

The columns of the output are as follows:

  1. SCOOP normalised score. Scores greater than 50 are very likely to be true matches.
  2. Family 1 identifier (Pfam identifier in this case).
  3. Number of matches in total in family 1.
  4. Family 2 identifier (Pfam identifier in this case).
  5. Number of matches in total in family 2.
  6. Number of common matches between family 1 and family 2 (preceded in the file by the literal label Both).
  7. Whether relationship is TRUE or FALSE as judged by Pfam clan membership.
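
The original format can be read in a similar way. The sketch below assumes whitespace-separated fields and the literal token Both before the common-match count, as in the example rows; the function name parse_original_hit and the dictionary keys are hypothetical. The final field is kept as an uninterpreted string, since the example also contains labels other than TRUE and FALSE.

```python
def parse_original_hit(line):
    """Parse one row of the original SCOOP output into a dict."""
    fields = line.split()
    # Example rows have 8 fields, with the literal "Both" in sixth position.
    if len(fields) != 8 or fields[5] != "Both":
        raise ValueError(f"unexpected row layout: {line!r}")
    return {
        "normalised_score": float(fields[0]),
        "family1": fields[1],
        "matches1": int(fields[2]),
        "family2": fields[3],
        "matches2": int(fields[4]),
        "common": int(fields[6]),
        "label": fields[7],  # kept as-is, not interpreted
    }

# Rows copied from the example of the original output.
rows = [
    "15.441800713964 7tm_1 31589 7tm_2 4405 Both 1949 FALSE",
    "26.5288908365378 7tm_1 31589 Serpentine_recp 10080 Both 7162 TRUE",
]

for row in rows:
    hit = parse_original_hit(row)
    print(hit["family1"], hit["family2"], hit["common"], hit["label"])
```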

Contact Alex Bateman for help with any aspect of SCOOP.