We were interested in examining the AIRE-1 sequence for additional domains. Blast2 searches of the NRDB (non-redundant database with >285,000 entries)13 with AIRE-1 residues 1-298 (excluding the PHD fingers) detected several nuclear proteins in the related nuclear phosphoprotein 41/75 and Sp100/Sp140 groups14-16. Particularly noteworthy were the variants Sp140 (LySp100B) and NucP75, each of which also contains a PHD finger14-16. The reported probabilities, with a best match to Sp100B (probability of a match by chance, P = 1.5e-4), were of borderline significance, but the presence of two regions in AIRE-1 (in addition to the shared PHD fingers) that independently detected the Sp100 proteins warranted more careful investigation.
With borderline hits of low sequence similarity, it is often helpful to undertake reciprocal searches with the top matching sequences since, if they are related, the matching signal may be expected to be above that of random scores (Bork and Gibson)17. An alignment of the Sp100 group was used to prepare profiles for searches of the NRDB with SearchWise18 or EMBL's Bioccelerator31 (Compugen Ltd., Israel19). A profile was prepared from the highly conserved N-terminal region of the alignment using ProfileWeight32 with sequence weighting and the Gonnet Pam250 matrix33. A search of the NRDB yielded AIRE-1 as the top non-self hit (expected frequency of false positives, E = 8e-2). No other matching sequences were found for this domain. The sequence alignment is shown in Fig. 1A, together with a secondary structure prediction20 suggesting that the domain is predominantly α-helical.
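To illustrate what a sequence-weighted profile involves, the following is a minimal sketch only. It assumes position-based weighting in the style of Henikoff and Henikoff, which is a stand-in chosen for brevity and not the ProfileWeight procedure or the Gonnet Pam250 scoring used above. The function names and the four-column toy alignment are hypothetical and do not represent Sp100 sequences.

```python
from collections import Counter, defaultdict


def henikoff_weights(alignment):
    """Position-based sequence weights, normalised to sum to 1.

    Assumption: all rows have equal length; a gap ('-') is counted
    as just another symbol when weighting.
    """
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        distinct = len(counts)  # number of different symbols in this column
        for i, residue in enumerate(column):
            # Sequences carrying a rare residue receive more weight.
            weights[i] += 1.0 / (distinct * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def weighted_profile(alignment, weights):
    """Per-column weighted residue frequencies (gaps are skipped)."""
    profile = []
    for col in range(len(alignment[0])):
        freqs = defaultdict(float)
        for seq, w in zip(alignment, weights):
            if seq[col] != "-":
                freqs[seq[col]] += w
        profile.append(dict(freqs))
    return profile


# Hypothetical toy alignment: two identical sequences and one divergent one.
toy_alignment = ["MKLV", "MKLV", "MRIV"]
toy_weights = henikoff_weights(toy_alignment)
toy_profile = weighted_profile(toy_alignment, toy_weights)
```

In this toy case the divergent third sequence receives a larger weight than either of the two identical ones, so closely related sequences do not dominate the column frequencies. A real search profile would additionally convert these frequencies into scores against a substitution matrix.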
Reciprocal profile searches with the central conserved region of Sp100 confirmed the relationship with the nuclear phosphoproteins and also detected the Drosophila DEAF-1 transcription factor21 and its vertebrate homologue, termed suppressin22, with good statistical support (E = 6.5e-12). The AIRE-1 match was present, but weakly supported (E = 8.9e-2). While the AIRE-1 sequence is the most divergent of the set and is not well supported statistically, reciprocal detections by three different domains in independent database searches suggest that it is genuine. The colinear order of the domains in AIRE-1 and Sp140/LySp100B suggests that AIRE-1, though highly diverged, shares common ancestry with the Sp100 protein group and may therefore function similarly.
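The argument from independent weak detections can be made concrete with a rough calculation. The sketch below assumes the standard Poisson relation between an expectation value and a probability, P = 1 - exp(-E), and treats the two profile searches as independent; both are assumptions made here for illustration, and the naive product is not a formal significance test (a method such as Fisher's would be needed for that). The function and variable names are hypothetical.

```python
import math


def evalue_to_pvalue(e_value):
    """Probability of at least one chance hit, assuming Poisson statistics."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


# E-values reported in the text for the two AIRE-1 profile matches.
reported_e_values = {
    "N-terminal domain profile": 8e-2,
    "central region profile": 8.9e-2,
}

p_values = {name: evalue_to_pvalue(e) for name, e in reported_e_values.items()}

# Assumption: the searches are independent, so the chance of both
# matches arising by accident is roughly the product of the two.
joint_chance = math.prod(p_values.values())

for name, p in p_values.items():
    print(f"{name}: P = {p:.3f}")
print(f"naive joint probability = {joint_chance:.4f}")
```

Each match alone has a chance probability of several percent, whereas the naive joint value falls below one percent, which is the intuition behind treating concordant borderline hits as mutually reinforcing.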
Database searches with a profile prepared after adding the new sequences also detected four predicted ORFs from the C. elegans genome sequencing project23 and two ESTs24 that did not correspond to known proteins. No additional matching sequences were found after further search permutations: the domain may be restricted to animal phyla, as we were unable to find any evidence for it in the yeast genome. These proteins have in common a conserved sequence of ~80 residues (Fig. 1B), for which we suggest the term SAND domain after Sp100, AIRE-1, NucP41/75 and DEAF-1/suppressin. Although SAND domain similarities have been reported before, the domain has not been characterised in detail, e.g. with regard to domain boundaries and secondary structure.
Conserved hydrophobic residues imply that the SAND domain has a globular fold, while several well-conserved positively-charged residues may be functionally important (Fig. 1B). Secondary structure prediction20 suggests that SAND has an all-β structure with approximately eight β-strands. There are three subgroups of SAND domain sequences typified by Sp100, DEAF-1 and the C. elegans ORFs (Fig. 1C). H_Est1 (a composite of two overlapping entries AA148980, AA279407) clearly belongs with the C. elegans group, yet the nematode sequences are most closely related to each other, suggesting that lineage-specific gene duplication has occurred. As expected, the AIRE-1 sequence joins the tree adjacent to the Sp100 group. The SAND domain occurs in different modular contexts, including the bromodomain26, the PHD finger and the MYND finger shared by DEAF-1/suppressin, mtg8 and nervy21,22 (Fig. 1D).
The SAND domain adds to the burgeoning set of domains present in modular
chromatin-associated proteins. The functions of most of these domains are
poorly understood, and elucidating them will be one key to
understanding how chromatin is assembled and regulated. The SAND domain appears
in various nuclear contexts. Sp100/Sp140 are found in recently described
nuclear bodies or dots, discrete structures within the nucleus that do not yet
have known functions16,17. The best clue to function is therefore
supplied by the DEAF-1 DNA-binding transcription factor21.
Many small intracellular modules function in protein-protein interaction,
building up higher order complexes. However, the conserved positive charges in
SAND domains imply negatively charged ligands and, within the region of DEAF-1
which binds DNA, the SAND domain is the only motif which is also conserved in
the homologous vertebrate suppressins. Additional positive charges are found in
adjacent sequence for most of the SAND domains, well positioned for
non-specific interactions with DNA phosphate groups. Thus SAND seems likely to
be a DNA-binding domain22 despite the predicted all-β
structure, which is rare but not unknown for a DNA-binding domain, being found
for example in the transcription factors NF-κB and NFAT26,27.
A DNA-binding role would be quite unusual, as PHD fingers and bromodomains are not usually
found in combination with DNA-binding domains (although the PHD finger is found
in some plant homeodomain proteins28). Confirmation of SAND domain
DNA-binding function in DEAF-1 would suggest that all these proteins
are DNA-binding transcription factors. The SAND proteins which also contain PHD
fingers, such as AIRE-1, are likely to regulate gene expression through the
modulation of chromatin structure.
The typical symptoms of APECED differ in the Finnish and Iranian Jewish populations (e.g. the former, but not the latter, typically show chronic candidiasis), presumably because different mutations are predominant in each population. Therefore, as more APECED mutations are revealed, it will be important to determine whether there is a correlation between particular symptoms, the mutation site and the domain topology of AIRE-1.