Quick Start
EvolClustDB is a public and browsable database of evolutionarily conserved gene clusters pre-computed
with the EvolClust algorithm (Marcet-Houben M and Gabaldón T (2020)).
This algorithm identifies groups of neighboring genes whose proximity is significantly more conserved across evolution than the genome average. The prediction is performed pairwise, in an all-against-all comparison of the genomes of interest, and the predicted gene clusters are then grouped into multi-species families based on shared homologous gene content. Visualization allows inspecting gene order per species in a phylogenetic context (species or taxonomy tree), enabling inference of potential evolutionary events. EvolClustDB provides gene cluster visualization through the PhyD3
FAQS
EvolClustDB citation
EvolClustDB: Exploring eukaryotic gene clusters with evolutionarily conserved genomic neighbourhoods
Marina Marcet-Houben, Ismael Collado-Cala, Diego Fuentes-Palacios, Alicia D. Gómez, Manuel Molina, Andrés Garisoain-Zafra, Uciel Chorostecki, Toni Gabaldón
PMID: 36806474; doi: 10.1016/j.jmb.2023.168013
EvolClust method citation
EvolClust: automated inference of evolutionary conserved gene clusters in eukaryotes
Marina Marcet-Houben, Toni Gabaldón
Bioinformatics Feb 15;36(4):1265-1266. doi: 10.1093/bioinformatics/btz706.
We define a cluster family as a group of homologous gene clusters conserved in at least two different species.
We define a gene cluster as a group of homologous proteins that are found sharing a similar gene neighbourhood in at least two different
genomes and which are more conserved than what is expected for the pair of genomes.
EvolClust predicts conserved gene clusters from pairwise genome comparisons and infers families of related clusters from multiple (all versus all) genome comparisons.
The algorithm is described in the following publication:
Marina Marcet-Houben, Toni Gabaldón, EvolClust: automated inference of evolutionary conserved gene clusters in eukaryotes, Bioinformatics, Volume 36, Issue 4, 15 February 2020, Pages 1265–1266, https://doi.org/10.1093/bioinformatics/btz706
A summary of the algorithm is also available here:
GitHub
To search for existing clusters in a proteome not included in EvolClustDB, we provide HMM profiles. Profiles are built for each protein family found in the cluster and are joined into a single file to simplify searching for the cluster in a proteome of interest. To perform the search you need a single proteome for a given species in FASTA format. hmmsearch can then be used to obtain a table of the hits the profiles have in the proteome. Gene order information is then needed to extract potential clusters. A script is provided in the GitHub repository of EvolClust to help with this search. For it to work, the FASTA header of each protein has to encode the position of the protein within the genome following the EvolClust format: species code + "_" + number indicating the contig + "_" + number indicating the protein, such that consecutive numbers indicate proteins located next to each other (e.g. YEAST_002_00234). The script first performs the hmmsearch step, then filters the results and supplies a list of clusters (for additional details see here).
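The header convention above can be illustrated with a short Python sketch. This is not the script from the EvolClust repository; it only shows how identifiers such as YEAST_002_00234 could be split into species, contig and position, and how a list of hit proteins could be grouped into runs of neighbouring genes. The function names, the assumption that the species code contains no underscore, and the `max_gap` and `min_hits` thresholds are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

# Assumed layout: SPECIES_CONTIG_POSITION, e.g. YEAST_002_00234.
# Assumes the species code itself contains no underscore.
HEADER_RE = re.compile(r"^([A-Za-z0-9]+)_(\d+)_(\d+)$")


def parse_header(protein_id):
    """Split an EvolClust-style id into (species, contig, position)."""
    match = HEADER_RE.match(protein_id)
    if match is None:
        raise ValueError(f"not in SPECIES_CONTIG_POSITION format: {protein_id!r}")
    species, contig, position = match.groups()
    return species, int(contig), int(position)


def candidate_regions(hit_ids, max_gap=3, min_hits=2):
    """Group hit proteins into runs of nearby genes on the same contig.

    max_gap and min_hits are illustrative thresholds, not EvolClust defaults.
    """
    by_contig = defaultdict(set)
    for protein_id in hit_ids:
        species, contig, position = parse_header(protein_id)
        by_contig[(species, contig)].add(position)

    regions = []
    for (species, contig), positions in sorted(by_contig.items()):
        run = []
        for position in sorted(positions):
            # Close the current run when the next hit is too far away.
            if run and position - run[-1] > max_gap:
                if len(run) >= min_hits:
                    regions.append((species, contig, run))
                run = []
            run.append(position)
        if len(run) >= min_hits:
            regions.append((species, contig, run))
    return regions


if __name__ == "__main__":
    hits = ["YEAST_002_00234", "YEAST_002_00235", "YEAST_002_00237",
            "YEAST_002_00300", "YEAST_003_00001"]
    print(candidate_regions(hits))
```

In practice the list of hit identifiers would come from the hmmsearch table, and the provided script remains the reference for how hits are actually filtered into clusters.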
Each triangle represents a different gene found in the gene cluster. Homologous genes share the same colour. White triangles represent non-homologous genes.
The download section contains the compressed profiles that can help find cluster families in a new genome, together with a detailed account of each cluster family. This account includes the FASTA sequences for all genes in the cluster family, a detailed list of those genes with their predicted functions, and the conservation score heatmap for the cluster family. Finally, the download section includes a file with the complete list of cluster families.
Help
Here is an example of the output for a cluster family (CF_002354). In this example, 11 species have the cluster (tree on the left). The gene cluster is represented as a row of triangles, one per gene. Genes belonging to the same gene family (homologous group) share a color; genes that are not homologous to any other gene in the cluster family are shown in white. Triangles point right or left to indicate gene orientation (i.e. genes encoded on opposite strands point in opposite directions). This representation makes it easy to spot a family with gene loss in some species, marked in the figure in the column “Homologous genes (with gene loss)”.
Note that for particularly large trees and clusters (more than 500 genes), there might be some delay when loading the image. You can zoom in and out to check the complete tree.
EvolClustDB allows browsing through the clusters in different ways: searching by protein, species, cluster or sequence.
- If you search by protein, the input is a gene id. Several identifier types are accepted (UniProt, NCBI, and TAIR, among others).
- If you search by species, the input is a species name. You can use a keyword and the box will suggest different possibilities.
- If you search by cluster, you should use the EvolclustDB cluster identifier. (Ex: CF_002354)
Another way to search is by similarity. In this type of search, the input sequences are accepted in FASTA or multi-FASTA format:
- The line containing the name and/or the description of the sequence starts with a ">" (this is optional for one sequence input)
- The words following the ">" are interpreted as the sequence id
- The following line reports the amino acid sequence
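The input rules above can be sketched as a small Python parser. This is only an illustration of the accepted format, not the code used by the EvolClustDB server. It assumes that the whole text after ">" is taken as the identifier, that a sequence may span several lines, and that a single sequence pasted without a header receives the placeholder id "query"; the function name and that placeholder are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text.

    A header line is optional when the input holds a single sequence;
    in that case the placeholder identifier "query" is used.
    """
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if seq_id is not None or chunks:
                records.append((seq_id or "query", "".join(chunks)))
            seq_id = line[1:].strip() or "query"
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None or chunks:
        records.append((seq_id or "query", "".join(chunks)))
    return records


if __name__ == "__main__":
    print(parse_fasta(">protA\nMKVLAA\n>protB\nMSTNPK"))
    print(parse_fasta("MKVLAA"))
```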
We use an identifier system to assign each cluster to a clade. The cluster identifier starts with the clade code (Ex: CF for a fungal cluster). Here is the complete list of correspondences between code identifiers and clades:
CF Fungi
CI Insects
CPr Protists
CM Metazoans
CPl Plants
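As a minimal sketch, the code list can be turned into a lookup that returns the clade for a given cluster identifier. The dictionary mirrors the list above; the function name is hypothetical, and the sketch assumes that identifiers for every clade follow the same CODE_NUMBER pattern as the CF_002354 example.

```python
# Clade codes as listed in the documentation.
CLADE_CODES = {
    "CF": "Fungi",
    "CI": "Insects",
    "CPr": "Protists",
    "CM": "Metazoans",
    "CPl": "Plants",
}


def clade_of(cluster_id):
    """Return the clade name encoded in a cluster identifier.

    Assumes the CODE_NUMBER layout seen in CF_002354.
    """
    prefix, separator, _number = cluster_id.partition("_")
    if not separator or prefix not in CLADE_CODES:
        raise ValueError(f"unrecognised cluster identifier: {cluster_id!r}")
    return CLADE_CODES[prefix]


if __name__ == "__main__":
    print(clade_of("CF_002354"))
```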