This software is a fork and extension of [Gecko3](https://bio.informatik.uni-jena.de/software/gecko3/), open-source software for finding gene clusters. Besides a few general improvements, it was extended to support structural similarity analysis by means of the local DCJ similarity.

If you use this software, please cite:

Diego P. Rubert, Fábio V. Martinez, Jens Stoye and Daniel Doerr
[*Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants*](https://doi.org/10.1186/s12864-020-6609-x)

-------
Authors of this extension
-------

[Diego P. Rubert](https://www.facom.ufms.br/~diego) <diego at facom dot ufms dot br>
Faculdade de Computação, Universidade Federal de Mato Grosso do Sul
Campo Grande/MS, Brazil

For authors of the original Gecko3, check its [webpage](https://bio.informatik.uni-jena.de/software/gecko3/).

-------
License
-------

Copyright (C) 2019 Diego Rubert

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

----------
About Gecko3-DCJ
----------

Gecko3-DCJ (technically `Gecko3-DCJ-v3.1.dcj.<version>`) is an extension of Gecko3. The original Gecko3 finds [approximate gene clusters](https://doi.org/10.1093/nar/gkw843), and this extension supports structural similarity analysis by means of the [local DCJ similarity](https://doi.org/???). A few other things were also improved; for a comprehensive list, check the [changelog](#changelog) below.
Gecko3-DCJ can be found here: [https://gitlab.ub.uni-bielefeld.de/gi/gecko-dcj](https://gitlab.ub.uni-bielefeld.de/gi/gecko-dcj)

The software first discovers reference-based approximate common intervals in genomes. Each discovered set of intervals gives rise to a set of pairs of substrings between the reference and the remaining genomes, for which local rearrangement scores are calculated. If using the GUI, the results can be browsed in an easy-to-use interface and analysed in detail. With or without the GUI, results can be exported in a number of formats.

----------
The local DCJ similarity
----------

Let $`S`$ and $`T`$ be two substrings associated with one of these pairs of approximate common intervals. First, pairs of sequences $`S', T'`$ are identified such that (i) $`S'`$ is a subsequence of $`S`$, and $`T'`$ of $`T`$, (ii) $`S'`$ and $`T'`$ are balanced, and (iii) for each marker $`g`$ in $`\mathcal G(S')`$ it holds that $`m_{S'}(g) = \min(m_S(g), m_T(g))`$. The last constraint ensures maximality of the balanced subsequences. Sequences $`S'`$ and $`T'`$ are then subjected to a second procedure that finds one-to-one assignments between all markers of the two sequences, thus further refining them to non-duplicated balanced sequences $`S''`$ and $`T''`$. Finally, those pairs of balanced sequences $`S''`$ and $`T''`$ are identified that maximize the following formula:

```math
s_\text{\tiny DCJ}(S'', T'') = \sum_{C \in \mathcal{C}}{f(|C|)} + \frac{1}{2} \left (\sum_{O \in \mathcal{O}}{f(|O|+1)} + \sum_{E \in \mathcal{E}}{f(|E|+2)}\right) - d \cdot p\:,
```

where $`\mathcal{C}`$, $`\mathcal{O}`$ and $`\mathcal{E}`$ are the sets of cycles, odd paths, and even paths in the constructed adjacency graph of $`S''`$ and $`T''`$, $`d := |S|+|T|-(|S''|+|T''|)`$ is the number of deleted markers and $`p`$ is the deletion penalty. Function $`f : 2\mathbb{N} \rightarrow \mathbb{R}`$ scores each cycle and path proportional to its length.
Because short cycles and paths are indicators of similarity, whereas long cycles and paths suggest the opposite, Gecko3-DCJ uses a simple realization of $`f`$ that works well in general:

```math
f(l) = \frac{2 - l}{b - 2} + 1\:.
```

The function $`f`$ used in Gecko3-DCJ makes use of a constant $`b`$, a length threshold that separates short from long cycles and paths, called "borderline cycle length" in the software.

----------
How to use
----------

Gecko3-DCJ has both GUI and command line interfaces. By default, Gecko3-DCJ uses a maximum heap size of 6GB (`-Xmx6G` Java option), set in the start scripts Gecko3-DCJ.bat (for Windows) and Gecko3-DCJ (for Linux/Mac) under the bin folder. If you need (and have) more memory, open the start script in any text editor and change `-Xmx6G` to an appropriate value (e.g. `-Xmx12G`). For the Linux/Mac script, this means changing:

```
DEFAULT_JVM_OPTS='"-Xmx6G"'
to
DEFAULT_JVM_OPTS='"-Xmx12G"'
```

For the Windows script, this means changing:

```
set DEFAULT_JVM_OPTS="-Xmx6G"
to
set DEFAULT_JVM_OPTS="-Xmx12G"
```

Below you'll find some details on the usage of the software, based mostly on the original Gecko3 Readme.txt file.

###### Input files

For this software, the basic requirement is that genomes are given as sequences in which each element represents a gene family containing at least one gene. All genes in a family should be homologs performing the same (or a very similar) function.
Genome input files have the file extension .cog and have to be organized as follows:

```
<Genome>
Empty Line
<Genome>
Empty Line
```

With the `<Genome Data>` part of each `<Genome>` being:

```
GenomeName Descriptive Text
Descriptive Text (ignored)
```

Where in the `<Genome Content>` part of each `<Genome>` each line contains information about the family and function of a single gene, in the order of occurrence in the genome, in one of two different formats (if unique locus tags are available, the second format is preferred):

```
<Homology> Strand (+ or -) functional category Gene Name functional annotation
```

or

```
<Homology> Strand (+ or -) functional category Gene Name functional annotation Locus Tag product
```

`<Homology>` can be any word or number not containing "," (comma). All genes with the same entry will be in one homology family. All genes with the empty string or 0 will be treated as having no homologs. Multiple gene families can be assigned to one gene (see Example 2, second gene) as comma-separated entries. For visualisation and computation, Gecko will split a gene with multiple gene families into multiple genes with one gene family each, in the order given in the .cog file. Locus Tag should be a unique tag for each gene in the data set.

Example 1 for the format without locus tag:

```
Aquifex aeolicus, complete genome - 0..1551335
1529 proteins
0480 + J fusA elongation factor EF-G
0050 + J tufA1 elongation factor EF-Tu
0051 + J rpsJ ribosomal protein S10
0459 + O mopA GroEL
0000 - - ---- putative protein
0612 - R ymxG processing protease
```

Example 2 for the format with locus tag:

```
Escherichia coli O127:H6 str. E2348/69 chromosome, complete genome.
4552 proteins
0 + ? thrL involved in threonine biosynthesis; controls the expression of the thrLABC operon E2348C_0001 unknown
COG0527,COG0460 + ? thrA multifunctional homotetrameric enzyme that catalyzes the phosphorylation of aspartate to form aspartyl-4-phosphate as well as conversion of aspartate semialdehyde to homoserine; functions in a number of amino acid biosynthetic pathways E2348C_0002 unknown
COG0083 + ? thrB catalyzes the formation of O-phospho-L-homoserine from L-homoserine in threonine biosynthesis from asparate E2348C_0003 unknown
COG0498 + ? thrC catalyzes the formation of L-threonine from O-phospho-L-homoserine E2348C_0004 unknown
NOG76743 + ? yaaX hypothetical protein E2348C_0005 unknown
COG3022 - ? yaaA hypothetical protein E2348C_0006 unknown
```

###### Importing data

After an input file is selected via `File -> Open session or genome file`, Gecko3-DCJ determines automatically from the file ending whether to load a genome file (.cog), a stored session (.gck) or a gzipped stored session (.gckz).

If a genome file is selected, it is parsed and all chromosomes found are listed in a table. By ticking the check boxes next to the chromosomes in the table, one can choose the chromosomes that should be part of the search for approximate gene clusters. Different chromosomes of one genome can be marked and grouped by clicking on the "Group" button. Gecko3-DCJ suggests a grouping of chromosomes based on chromosome names. This can be reverted by marking the grouped chromosomes and clicking on the "Ungroup" button. Genome selection is finished by clicking on the "OK" button. The genomes are then visualized in a genome browser, which allows inspecting the genomes, the genes they contain, and the gene annotations.

###### Cluster detection

When clicking the "start computation" button, the user is asked to select a search mode, as well as global and model-dependent parameters, before the actual search begins. In the simple "Single Distance" mode the minimum cluster size and the maximum distance have to be set.
The distance threshold determines the maximum pairwise distance between the reference set and each approximate occurrence. The minimum size gives the minimum number of genes a gene cluster has to contain to be reported. As an alternative, in "Distance Table" mode, a maximum number of gene losses, a maximum number of gene insertions and a maximum sum of losses and insertions can be set for each cluster size. A right click in the table allows adding or deleting rows, or resetting the table to parameters we used in different publications.

For both modes, the minimum number of genomes a cluster has to appear in (quorum parameter) can be set. By default, this value is set to the number of selected genomes, so that only gene clusters with an approximate occurrence in all genomes are reported.

Additionally, one can choose between three sub-modes. In the "all against all" mode, gene clusters are predicted using each input genome in turn as the reference genome. In the "fixed genome" mode, only one genome is used as reference. It can be chosen from a drop-down list containing the previously selected genomes. You can filter the list by typing in the text field. In the "manual cluster" mode, a sequence of genes can be typed in manually, or pasted, e.g. after copying a cluster from the result list of a previous run of Gecko3-DCJ. When the option "Search Ref. in Ref." is checked, occurrences of each cluster are also searched for within its reference genome. When using "manual cluster", a gene sequence with the specified genes is created as a new genome, so the orientation of genes can also be defined in this window (genes in reverse orientation are prefixed with a minus sign `-`).

Besides cluster detection options, parameters for the local DCJ similarity can also be set in this dialog (see [The local DCJ similarity](#the-local-dcj-similarity) section). Other related options can also be set. "Compute for subintervals" causes the local DCJ similarity to be computed for all subintervals of the clusters' sequences, to check whether there is a better-scoring subinterval. If this option is checked, the "Extend beyond sequence limits" box is enabled, allowing the user to define the number of positions by which subintervals can be extended to the left and to the right of the original intervals when searching for a better-scoring subinterval. If the option "Pre-compute" is checked, the local DCJ similarity is calculated for all clusters as soon as they are found; otherwise, the score is computed only when a cluster is selected in the user interface. Finally, since the computation of the local DCJ similarity requires all one-to-one assignments between genes inside each family to be computed (and the number of combinations can be huge), the option "Use heuristics" allows Gecko3-DCJ to use heuristics to fix the most promising one-to-one gene associations and lower the number of combinations to a reasonable value.

After all parameters are set, computation can be started by clicking the "OK" button.

###### Graphical evaluation

After computations are complete, results are shown in tabular form below the genome browser. The table contains the list of all predicted gene clusters, listing a unique id, the number of genes, the number of included genomes, the score of the best occurrence combination (negative logarithm of p-value, and negative logarithm of FDR-corrected p-value), the average local DCJ similarity over all cluster occurrences, and a list of the gene families of the genes in the reference occurrence. By default, the gene cluster list is sorted by decreasing score, but clicking on a table header will sort (or reverse-sort) the clusters according to the selected column.
A gene cluster can be selected with a double-click on its entry: its best occurrence will then be visualized by the genome browser, and details about the cluster will be displayed in an information area. Additional (and, if enabled by the user, also suboptimal) occurrences can be selected with the navigation buttons next to the genomes. On top of the genome browser, one can choose to hide un-clustered genomes.

By default, a chosen cluster will be centered on the screen. To manually align a cluster, you can scroll or drag and drop each genome browser to the left or right. SHIFT + double click on a genome browser will invert the genome. A double click on a single gene will align the cluster using this gene family as an anchor.

The visualization of a selected gene cluster has been optimized for easy inspection: the genome browser shows the neighborhood on each genome, mouse-over tooltips provide the annotation data available for genes or chromosomes, and the information area allows for a more detailed inspection of the search result.

###### Filtering and searching

Below the table, different filter modes for the results can be chosen. Either all gene clusters are shown ("showAll"); among overlapping occurrences, only the one with the best p-value score is reported ("showFiltered"); among overlapping occurrences, the one with the best p-value score and the one with the best average local DCJ similarity are shown ("showFiltered+"); or only selected clusters are shown. Clusters can be selected by right clicking on the table columns and choosing "Add to selection" or "Add all in list to selection".

In the top right corner of the GUI, it is possible to filter for clusters containing individual genes or functional gene annotations by typing the respective information into the "Search" field above the genome browser.

In front of each genome in the genome browser, one can choose between "None", "Include" and "Exclude". "None" means no additional filtering. With "Include", the gene cluster table will only contain clusters that include the genome; with "Exclude", the table will only contain clusters that do not contain the genome.

By right clicking on a gene cluster in the table, one can choose "Show similar clusters". This will automatically enter all gene families from this cluster into the search field.

###### Saving session

The results of a Gecko3-DCJ session can be stored in a file with ending ".gck" or ".gckz" (compressed) via `File -> Save session`. That way you don't need to recompute all results after closing the program.

###### Exporting clusters

Results can be exported via `File -> Export results` in different data formats.

1. "clusterData": similar to the information in the GUI.
2. "clusterStatistics": general statistics about all the clusters.
3. "table": table of cluster information, as used in Jahn et al, Statistics for approximate gene clusters, BMC Bioinformatics, 2013.
4. "latexTable": same as above, only LaTeX-ready.
5. "geneNameTable": table of all gene names in the reference occurrence and additional info.
6. "clusterConservation": information about the gene order and additional genes for each cluster.
7. "clusterGenomeInformation": in which genome the cluster occurs.
8. "referenceClusterTags": the locus_tags of all genes in the reference occurrence.
9. "pdf": all clusters as a single pdf picture (not all pdf viewers will be able to open this, due to the size).
10. "multiPdf": a zip file containing one pdf picture for each cluster.

The "clusterData" format is the only one that includes the local DCJ similarity scores.

If you right click on a single cluster in the table, you can select "Export gene cluster" to export this single cluster as a picture (.pdf, .jpg or .png).

###### More details

For more details, check the Readme.txt file (under src/dist in the source code) or the original [Gecko3 webpage](https://bio.informatik.uni-jena.de/software/gecko3/).

----------
Known bugs
----------

- When opening big genomes on Linux systems, the genome browser window presents a very annoying bug: gene labels overlap, because all labels after some point are drawn again multiple times, shifted to the left (e.g. the gene labels after position 10000 of the genome are also written in sequence starting from indexes -2000, 3000 and 7000, overlapping the correct labels). I've tried my best to fix this, but as far as I could trace the bug, there is no logical explanation for it. My guess is that it is a bug in the graphical interface middleware or in the implementation of the Java virtual machine for Linux. The bug doesn't happen on Windows or Mac systems, and the authors of the original Gecko3 were probably not aware of it. Finding a workaround would be easier with deep Java GUI knowledge, which I don't have, but I hope to find some time to look at this bug again in the not-so-near future.