In Your Own Words, What Is the Application of MSA?
DNA Restriction and Modification: Type III Enzymes
D.N. Rao, S. Bheemanaik, in Encyclopedia of Biological Chemistry (Second Edition), 2013
Domain Organization in Restriction Subunit
Multiple sequence alignment of all known and putative Res subunits suggests a modular structure (Figure 1(b)). The C-terminus contains the PD(x)n…(D/E)XK endonuclease motif that is commonly present in the catalytic center of restriction endonucleases. Sequence analysis of the Res subunit of EcoP1I and several putative Res subunits revealed the so-called DEAD box motif that is present in helicase superfamily II. The members of the DEAD family of helicases have seven conserved motifs (motifs I, IA, and II–VI). The first motif of this family resembles the Walker A domain commonly present in ATPases. Mutational analysis of motif I resulted in a loss of DNA cleavage and ATP hydrolysis, while that of motif II significantly decreased ATP hydrolysis but had no effect on DNA cleavage. These motifs therefore clearly play a role in ATP hydrolysis. Mutations in motif VI abolished both the DNA cleavage and ATPase activities, while mutations in the putative endonuclease motif abolished DNA cleavage, but not ATP hydrolysis.
URL:
https://www.sciencedirect.com/science/article/pii/B9780123786302002437
Algorithms for Structure Comparison and Analysis: Homology Modelling of Proteins
Marco Wiltgen, in Encyclopedia of Bioinformatics and Computational Biology, 2019
Multiple alignments
A multiple sequence alignment is the alignment of three or more amino acid (or nucleic acid) sequences (Wallace et al., 2005; Notredame, 2007). Multiple sequence alignments provide more information than pairwise alignments because they show the conserved regions within a protein family, which are of structural and functional importance.
Fig. 7 shows the target sequence arranged in a multiple alignment with the template OMP-decarboxylases from V. cholerae, L. acidophilus, and C. burnetii. The alignment was made with the MULTALIN multiple alignment tool (Corpet, 1988). The sequence alignment is used to determine the equivalent residues in the target and the template proteins.
The corresponding superposition of the template structures is shown in Fig. 8. After a successful alignment has been found, the actual model building can start.
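To make the idea of conserved regions concrete, the following Python sketch scores each column of a small alignment by the fraction of sequences that share its most common residue. The toy sequences and the function name `column_conservation` are illustrative assumptions, not data or methods from the chapter, and real analyses normally use substitution-aware scores instead of simple identity.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each column by the fraction of sequences sharing its most common residue.

    Gaps ('-') are never counted as the shared residue, but they still count
    in the denominator, so gappy columns score low.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical three-sequence alignment (not taken from Fig. 7).
toy_alignment = [
    "MKV-LDG",
    "MKIALDG",
    "MRV-LEG",
]
scores = column_conservation(toy_alignment)
# 0-based indices of columns where every sequence has the same residue.
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Columns that score 1.0 are the fully conserved positions. In a homology-modelling workflow these would be the first places to check that target and template residues are equivalent.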
URL:
https://www.sciencedirect.com/science/article/pii/B9780128096338204846
Protocol for Protein Structure Modelling
Amara Jabeen, ... Shoba Ranganathan, in Encyclopedia of Bioinformatics and Computational Biology, 2019
Template Selection
If only one template is identified, then the model should be built with this template; otherwise, appropriate template(s) may be selected from the pool of templates. For the selection of templates, a number of factors need to be considered, including:
Sequence similarity
Multiple sequence alignment of all the domains of potential templates and target sequence can be performed and utilized in building a phylogenetic tree which can then depict the most similar structure that can be used for model building.
Similarity in secondary structure
A template with similar secondary structure as the predicted secondary structure of the target can be prioritised.
Bound ligands
If the ultimate goal of structure prediction is the study of receptor-ligand interactions, then a template that binds the same ligand (if any) as the protein of interest should be preferred.
Resolution of template structure
High-resolution X-ray crystallography (XRC) structures are preferred over low-resolution structures. Similarly, XRC structures are preferred over NMR structures. If only NMR structures are available, the "average" NMR structure must be chosen for model building, because NMR protein structures are often submitted to the PDB as ensembles of models.
Presence of missing residues
Information on missing residues (if any) is given in REMARK 465 in the header section of a PDB file. Missing residues can also be seen in most PDB viewers. If missing residues are present, then an alternate template should be selected; otherwise, more than one template should be used to fill in the gaps. If the ultimate goal of structure prediction is to study protein interactions, then no residues should be missing from the binding sites of the template.
Multiple templates
Optimal use of multiple templates can increase the accuracy of the developed model. If every template provides unique information and the templates have minimal overlapping residues, they can be selected for building a model. Superimposing Cα atoms can reveal the unique information each template contributes. Other features to look at are insertions, variation in the length of secondary structure elements, and different loop conformations. A few programs, such as Modeller, IntFOLD and M4T, can build a model using multiple templates (Fiser, 2010). If sequence identity falls below 40%, the use of multiple templates is favorable, but above this identity a single template will usually suffice (Fiser, 2010).
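The 40% identity rule of thumb quoted above can be sketched in Python as follows. Sequence identity has several competing definitions, so this sketch assumes one of them (identical residues divided by the number of aligned, gap-free column pairs). The function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def template_strategy(identity, threshold=40.0):
    """Apply the rule of thumb: below the threshold, favour multiple templates."""
    return "single template" if identity >= threshold else "multiple templates"


# Hypothetical target/template pair taken from a pairwise alignment.
identity = percent_identity("ACDE-G", "ACDF-G")
print(identity, template_strategy(identity))
```

The threshold is a parameter because the source presents 40% as a guideline ("will usually suffice"), not a hard cut-off.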
URL:
https://www.sciencedirect.com/science/article/pii/B9780128096338204779
Microbial Globins - Status and Opportunities
Serge N. Vinogradov, ... David Hoogewijs, in Advances in Microbial Physiology, 2013
5.2 Multiple sequence alignments and phylogenetic analysis
Multiple sequence alignments were carried out using MUSCLE (Edgar, 2004), Clustal Omega (Sievers et al., 2011), MAFFT employing the L-INS-i option (Katoh, Kuma, Toh, & Miyata, 2005), ProbCons (Do, Mahabhashyam, Brudno, & Batzoglou, 2005) and T-Coffee (Di Tommaso et al., 2011). To eliminate ambiguous alignments, we used the online version of Gblocks 0.91b (Castresana, 2000) with the 'less stringent selection' parameter set (www.phylogeny.fr). The quality of the alignments was assessed by MUMSA (Lassmann & Sonnhammer, 2005). Maximum likelihood (ML) analyses were carried out using RAxML (Stamatakis et al., 2008). Neighbour-joining (NJ) analyses were performed using MEGA version 5.05 (Tamura et al., 2011). Distances were corrected for superimposed events using the Poisson method. All positions containing alignment gaps and missing data were eliminated only in pairwise sequence comparisons (pairwise deletion option). The reliability of the branching pattern was tested by bootstrap analysis with 1000 replications. Bayesian inference trees were obtained employing MrBayes version 3.1.2 (Ronquist & Huelsenbeck, 2003), assuming the WAG model of amino acid substitution and a gamma distribution of evolutionary rates, as determined by the substitution model testing option in MEGA 5.05. Two parallel runs, each consisting of four chains, were run simultaneously for at least 8 × 10^6 generations, and trees were sampled every 1000 generations, generating a total of at least 8000 trees. The final average standard deviations of split frequencies were stationary in all analyses, and posterior probabilities were estimated on the final 60–80% of the trees. The CIPRES web portal was used for the Bayesian analyses (Miller, Pfeiffer, & Schwartz, 2010) and MEGA version 5.05 was used to visualize radial trees. With the exception of Fig. 9.3, we employed as outgroups two non-haem, globin-like stress response regulators, RsbR from Bacillus subtilis and B. amyloliquefaciens (NP_388348.1 and YP_00391940.1). Although they have a globin-like secondary structure, their G and H helices are bent inwards, eliminating the haem-binding cavity (Murray, Delumeau, & Lewis, 2005). Phylogenetic trees were also constructed employing SSU rRNA sequences (Guillou et al., 2013). In compiling the lists of sequences, we identified each sequence by the first three letters of the two portions of the binomial, the number of residues, one or more three-letter abbreviations of the taxon, followed by the identifier. Subcellular localization was identified using PSORT II (Nakai & Horton, 1999).
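The Poisson correction with pairwise deletion mentioned above can be illustrated with a short Python sketch. It uses the standard Poisson-corrected distance d = −ln(1 − p), where p is the proportion of differing sites among the sites compared. The function name, the choice of `-` and `?` as gap and missing-data symbols, and the toy sequences are assumptions, and this is not MEGA's implementation.

```python
import math


def poisson_distance(seq_a, seq_b, skip="-?"):
    """Poisson-corrected distance between two aligned protein sequences.

    Pairwise deletion: a site is ignored only when one of these two
    sequences has a gap or missing-data symbol there.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in skip or b in skip:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = differing / compared
    if p >= 1.0:
        return math.inf  # saturated: the correction is undefined
    return -math.log(1.0 - p)


# Hypothetical aligned pair: 5 comparable sites, 1 difference, so p = 0.2.
d = poisson_distance("AC-EFG", "ACDQFG")
print(round(d, 4))
```

With complete deletion, the gapped column would instead be dropped from every comparison in the data set. Pairwise deletion keeps more sites per comparison, but each distance is then based on a different set of sites.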
URL:
https://www.sciencedirect.com/science/article/pii/B9780124076938000091
Bioinformatics: Concepts, Methods, and Data
Scott W. Robinson, ... David P. Leader, in Handbook of Pharmacogenomics and Stratified Medicine, 2014
13.3.1 Tree Terminology
Multiple sequence alignments, as explained in Section 13.2.4, help identify homology and reconstruct evolutionary history. Alternatively, it can be said that variation between sequences is used to infer phylogeny. We depict phylogenetic relationships using various types of trees (Figure 13.3). Evolutionary trees are graphs consisting of nodes and branches, which in mathematics are often referred to as vertices and edges. Trees may be either rooted or unrooted, where the root is a node that depicts an ancestor common to all of the other nodes in the graph. Intuitively, then, rooted trees depict directionality in evolution, whereas unrooted trees merely show evolutionary distances. Rooted trees are commonly drawn with the root on the left and the terminal nodes (or tips)—for which we have sequence data—on the right. In between the root and the terminal nodes, there may be internal nodes that depict hypothetical genes or organisms—for which we have no experimental data.
The lengths of the lines that constitute the branches can mean different things depending on the type of tree. Cladograms (Figure 13.3(a)) show only the orders of branching; the lengths of the branches are meaningless. In phylograms (as shown in Figure 13.3(b) and also known as additive trees), the horizontal branch lengths correspond to evolutionary changes whereas the vertical distances are meaningless and only used for separation and clarity. Ultrametric trees (also known as dendrograms) are similar to phylograms, but the root must be equidistant to each terminal node. In this case, the distance represents either actual time or "time" on a molecular clock. The concept of a molecular clock is that sequences evolve at an approximately constant rate, and so the genetic difference between two species or genes is proportional to the amount of time since the speciation or gene duplication event. It should be noted, however, that there is no one universal molecular clock—differences between genomes of different species, and throughout different genomic regions of the same species, lead to different rates of mutation/evolution.
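As a minimal illustration of the difference between a phylogram and an ultrametric tree, the Python sketch below sums branch lengths from the root to every tip and checks whether all tips are equidistant. The nested-tuple tree encoding `(name, branch_length, children)` and both example trees are assumptions made for this sketch.

```python
def tip_depths(node, depth=0.0):
    """Return {tip_name: root-to-tip distance} for a (name, length, children) tree."""
    name, length, children = node
    depth += length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, depth))
    return depths


def is_ultrametric(root, tol=1e-9):
    """True when every terminal node is the same distance from the root."""
    depths = tip_depths(root).values()
    return max(depths) - min(depths) <= tol


# Hypothetical trees: same topology, different branch lengths.
clock_like = ("root", 0.0, [
    ("A", 3.0, []),
    ("n1", 1.0, [("B", 2.0, []), ("C", 2.0, [])]),
])
phylogram = ("root", 0.0, [
    ("A", 3.0, []),
    ("n1", 1.0, [("B", 2.0, []), ("C", 0.5, [])]),
])
print(is_ultrametric(clock_like), is_ultrametric(phylogram))
```

In the second tree, tip C sits closer to the root than A and B, which is what a faster or slower rate of change on one lineage looks like when no molecular clock is enforced.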
It is often useful to describe groups of nodes and how they relate to one another. Clades are groups of nodes united by ancestry, in which every descendant of every node is present; in other words, a clade is an entire branch. This must not be confused with grades, which are nodes grouped by characteristics such as morphology. Clades are also described as monophyletic groups. Two concepts that contrast with monophyly are paraphyly and polyphyly (Figure 13.3(b)). Paraphyletic groups are made up of consecutive nodes that share common ancestry but in which not all descendants of all nodes are present. Polyphyletic groups do not include a common ancestor of all of their members. An alternative to cladistic classification is phenetic classification, which is based on overall sequence similarity rather than ancestry. Phenetic classification gives a different result from cladistic classification particularly when different rates of evolution occur in different branches of the tree.
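The definition of a clade can be checked mechanically: a set of terminal nodes is monophyletic when some node of the tree has exactly that set as its descendants. The Python sketch below assumes a simple parent-to-children dictionary and hypothetical taxon names. It tests only monophyly and does not try to separate paraphyletic from polyphyletic groups.

```python
def tips_below(tree, node):
    """Set of terminal nodes descended from `node` (a tip returns itself)."""
    children = tree.get(node, [])
    if not children:
        return {node}
    tips = set()
    for child in children:
        tips |= tips_below(tree, child)
    return tips


def is_monophyletic(tree, root, group):
    """True when `group` is exactly the set of tips under some node of the tree."""
    group = set(group)
    stack = [root]
    while stack:
        node = stack.pop()
        if tips_below(tree, node) == group:
            return True
        stack.extend(tree.get(node, []))
    return False


# Hypothetical rooted tree: ((A,B),(C,(D,E)))
tree = {
    "root": ["n1", "n2"],
    "n1": ["A", "B"],
    "n2": ["C", "n3"],
    "n3": ["D", "E"],
}
print(is_monophyletic(tree, "root", {"C", "D", "E"}))  # a whole branch
print(is_monophyletic(tree, "root", {"C", "D"}))       # leaves out E
```

The group {C, D} fails because its common ancestor n2 also has E as a descendant, which is the situation the text describes as paraphyly.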
Tree diagrams represent genes, species, or both. They therefore can describe paralogy, orthology, or the two of them. In other words, depending on the type of information in the diagram, a branching can represent either a gene duplication event or a speciation event. Trees are similar to hanging mobiles (as in kinetic art) in that the branches can be rotated without altering the meaning of the tree or the relationships between terminal nodes. The specific order in which the terminal nodes are represented is merely stylistic and not inherently meaningful.
Trees allow us to distinguish between ancestral and "derived" (newly and separately occurring) character states or sequence variations. A plesiomorphy is an ancestral character state, whereas an apomorphy is a derived state. If an apomorphy occurs in one sequence only, it is described as autapomorphic, whereas if it occurs in several sequences it is described as synapomorphic. Homology describes two identical character states in multiple terminal nodes inherited from a common ancestor. In contrast, homoplasy describes two identical character states in multiple terminal nodes that have occurred independently.
The degree of a node is the number of branches that are connected to it. A tree is said to be fully resolved if the root has only two children and none of the internal nodes has a degree of more than three (that is, one parent node, or ancestor, and two child nodes, or descendants). A node having a greater degree than this is described as a polytomy (Figure 13.3(a)). A tree with no polytomies is therefore fully resolved, whereas a completely unresolved tree is said to be a "star tree." Polytomies are described as "hard" if they indicate events of simultaneous divergence for all descendants involved. They may be described as "soft" if they represent an uncertainty in the order of divergence. Other related terms here are bifurcation (divergence into two branches) and trifurcation (divergence into three branches).
URL:
https://www.sciencedirect.com/science/article/pii/B978012386882400013X
Future Perspectives: High-Performance Computing
Julie Dawn Thompson, in Statistics for Bioinformatics, 2016
8.3 MSA in the cloud
Another possible solution to the infrastructure challenge comes in the form of "cloud computing", a model where computation and storage exist as virtual resources, accessed via the Internet, which can be dynamically allocated and released as needed. Where previously acquisition of large amounts of computing power required significant initial and ongoing costs, the cloud model radically alters this by allowing computing resources and services to be acquired and paid for on demand. Importantly, cloud resources can provide storage and computation at far less cost than dedicated resources for certain use cases. Cloud resources have become quite popular in the form of public clouds (e.g. Amazon Web Services [AWS], HP Cloud, Google Compute Engine) where we pay only for the resources consumed. Over the past few years, there has been an increasing trend toward cloud resources also becoming available as research infrastructures, for example the Open Science Data Cloud (www.opensciencedatacloud.org) or the EGI Federated Cloud (www.egi.eu/infrastructure/cloud).
For multiple sequence alignment construction, the effective use of cloud resources has been demonstrated by the porting of the T-Coffee tool onto the cloud [DI 10]. In general, utilization of bioinformatics tools on such delocalized systems requires technical expertise to achieve robust operation and intended performance gains. In trying to address this, significant work has been undertaken to develop workflow environments that attempt to alleviate the need for scientists to write their own scripts or programs. There are several engines that give users the ability to design and execute workflows, including multiple sequence alignment among many other applications. Each engine was created to address certain problems of a specific community, therefore each one has its advantages and shortcomings. Some of the environments that integrate aligners in the Cloud include the following:
- Galaxy Cloud [AFG 11] allows a user to run a private installation of the Galaxy framework in the cloud. Galaxy (galaxyproject.org) is a popular open-source platform designed to make complex analyses available to researchers using nothing more than a web browser. It allows the construction of complex workflows, and allows the results to be documented, shared and published, guaranteeing transparency and reproducibility. An active community of developers ensures that the latest tools (mainly for NGS data analysis) are wrapped and made available through the Galaxy Tool Shed. Galaxy Cloud exactly replicates the functionality of the main Galaxy site in the cloud. Currently, Galaxy Cloud is deployed on the AWS cloud, although it should be compatible with other clouds. Galaxy Cloud's deployment is achieved by coupling the Galaxy framework to CloudMan [AFG 12], which automates management of the underlying cloud infrastructure resources, including resource acquisition, configuration and data persistence.
- BioNode [PRI 12] allows a bioinformatics workflow to be modeled and executed in virtual machines (VMs) in different cloud environments. BioNode is based on Debian Linux and can be deployed on several operating systems (Windows, OSX, Linux) and architectures, as well as in the cloud. Approximately 200 bioinformatics programs, mostly related to evolutionary analyses, are included. Examples of representative software implemented in BioNode are Muscle and MAFFT for multiple sequence alignment, or PAML and MrBayes for phylogenetic tree construction. In addition, the BioNode configuration allows scripts to parallelize these bioinformatics tools.
- Cloud BioLinux [KRA 12] is a publicly accessible VM that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have access to a range of preconfigured command line and graphical software applications, including over 100 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing and phylogeny. The VM is deployed on the Amazon EC2 cloud, but it is also compatible with other clouds.
- Tavaxy [ABO 13] is a system for modeling and executing bioinformatics workflows based on the integration of the Taverna (www.taverna.org.uk) and Galaxy (galaxyproject.org) workflow environments. Tavaxy supports execution in a sequential environment or on HPC infrastructures and cloud computing systems. It offers a set of features that simplify the development of sequence analysis applications, covering several areas of bioinformatics such as NGS data analysis, metagenomics, proteomics or comparative genomics. Tavaxy can be downloaded or directly used as a service in clouds.
- Yabi [HUN 12] provides a workflow environment that can create and reuse workflows as well as manage large amounts of raw and processed data in a secure and flexible way across geographically distributed HPC resources. It includes a frontend web application that provides the main user interface; middleware that is responsible for process management, tool configuration, analysis audit trails and user management; and a resource manager that provides data and compute services, including a list of bioinformatics tools running in various execution environments. In this way, Yabi gives researchers access to HPC power without requiring specialized computing knowledge.
The use of cloud computing for biological sequence analysis is only going to increase, but as the datasets grow in size, the resources underpinning the analysis environment must be low cost, scalable and easily accessible for the whole community. Ideally, minimal preparations and resources should be necessary before obtaining access to an analysis environment that is genuinely useful. Nevertheless, to provide flexibility and performance, the user must have some control of the data resources and software tools via a lightweight programming or graphical interface. High-speed transfer technologies are also critical for moving large amounts of data in and out of the cloud. Although larger organizations with enough computational expertise may prefer to develop and maintain their own in-house systems, cloud computing will save resources for smaller groups, allowing them to concentrate on the main task, namely the biological interpretation of the analysis results.
URL:
https://www.sciencedirect.com/science/article/pii/B9781785482168500105
Synthetic Biology and Metabolic Engineering in Plants and Microbes Part B: Metabolism in Plants
P. Fan, ... R.L. Last, in Methods in Enzymology, 2016
3.3.2 Protocol Used to Identify Key Polymorphisms Contributing to Variation in ASAT3 Acyl-CoA Substrate Specificity
1. Multiple sequence alignment of ASAT3 protein variants is done with MEGA version 5 (Tamura et al., 2011), using the default settings. Next, substrate specificities of the enzymes are aligned with the amino acid sequences (Fig. 5A). The amino acid residues that correlate with the ASAT3 enzyme activities are candidates for those that contribute to ASAT3 activity changes (see green typeface in Fig. 5A).
2. Structural homology modeling of Sl-ASAT3 was performed using a web-based protein homology/analogy recognition engine (http://www.sbg.bio.ic.ac.uk/phyre2) (Kelley, Mezulis, Yates, Wass, & Sternberg, 2015; Kelley & Sternberg, 2009). The predicted 3D structure was overlaid with the crystal structure of the BAHD acyltransferase trichothecene 3-O-acetyltransferase with its acyl donor ligands (Protein Data Bank ID: 3B2S) (Fig. 5B). The relative distance of the candidate residues to the putative acyl-CoA binding pocket was determined using PyMOL (version 1.7.4 Schrödinger, https://www.pymol.org). The working hypothesis is that residues close to the substrate-binding pocket are more likely to affect the enzyme activity, and these are nominated for experimental testing.
3. To test the importance of specific residues, amino acid substitutions are made using the PCR-based Q5-Site-Directed Mutagenesis Kit (NEB) (Fisher & Pei, 1997). The primers to introduce the desired mutation are designed using the web-based software NEBaseChanger (version 1.2.2, NEB, http://nebasechanger.neb.com). The presence of a mutation is verified by DNA sequencing. Protein expression of the Sl-ASAT3 mutants, ASAT3 enzyme assays, and LC/MS analysis are performed as described earlier (Sections 2.1.2 and 3.2.1).
4. Wild-type and mutagenized Sl-ASAT3 enzyme kinetics are determined for the various acyl-CoA substrates. To measure the apparent Km and Ki of nC12-CoA, purified S2:10 (5,5) is used as the acyl acceptor substrate, and the concentration of nC12-CoA is varied from 0 to 200 μM. All enzyme reactions are done in triplicate at 30°C for 5 min and stopped using the enzyme stop solution described in Section 3.2.1. After LC/MS analysis, the enzyme assay product peak areas divided by the internal standard peak area (normalized peak response) are plotted for each concentration of nC12-CoA. Apparent Km and Ki are calculated by the nonlinear regression model in the GraphPad Prism 5 software (http://www.graphpad.com/scientific-software/prism/).
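Step 1 of the protocol, finding residues that correlate with enzyme activity, can be sketched as a simple column scan. The Python example below reports alignment positions where each residue is seen with only one activity class and more than one class is represented. The variant names, sequences, and activity labels are invented for illustration and are not ASAT3 data. The published protocol identifies candidates by inspecting the alignment, so this automated rule is only an assumption about how one might mimic that step.

```python
def candidate_positions(variants):
    """Return 1-based alignment positions whose residue tracks the activity label.

    `variants` maps a variant name to (aligned_sequence, activity_label).
    """
    sequences = [seq for seq, _ in variants.values()]
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    candidates = []
    for i in range(length):
        labels_by_residue = {}
        for seq, label in variants.values():
            labels_by_residue.setdefault(seq[i], set()).add(label)
        all_labels = set().union(*labels_by_residue.values())
        one_label_each = all(len(labels) == 1 for labels in labels_by_residue.values())
        # Keep the column only if residues separate the activity classes cleanly.
        if len(all_labels) > 1 and one_label_each:
            candidates.append(i + 1)
    return candidates


# Hypothetical variants and activity classes (not real ASAT3 sequences).
variants = {
    "variant_1": ("MATLF", "short-chain"),
    "variant_2": ("MSTLF", "short-chain"),
    "variant_3": ("MATVF", "long-chain"),
    "variant_4": ("MSTVF", "long-chain"),
}
print(candidate_positions(variants))
```

Position 2 varies but does not track activity, so only position 4 is reported. Candidates found this way would still need the structural filtering and mutagenesis described in steps 2 to 4.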
URL:
https://www.sciencedirect.com/science/article/pii/S0076687916001075
Methods in Protein Design
Venuka Durani, Thomas J. Magliery, in Methods in Enzymology, 2013
5.1 Protocol for calculating mutual information values
Step 1. MSA and frequency table: The MSA and amino acid frequency table can be copied from the RE calculation workbook (worksheets "seq" and "RE" from Section 4.1).
Step 2. Reference distribution: The reference distribution is calculated using the formula B28 = $C3*VLOOKUP(B$27,$B$3:$D$22,3,FALSE) copied over all the cells in table B28:U47. Range C3:C22 has the amino acid frequencies for position 1, and range D3:D22 contains amino acid frequencies for position 2. One-letter codes for amino acids are listed in B3:B22, B27:U27, and A28:A47.
Step 3. Observed distribution: The observed distribution is calculated using the formula B52 = SUM(IF(OFFSET(pos1,0,$C$1-4)=$A52,1,0)*IF(OFFSET(pos1,0,$D$1-4)=B$51,1,0))/$V$51 followed by Ctrl+Shift+Return and copied over all the cells of table B52:U71. Cell C1 contains the position number of one position, and D1 contains the position number of the other position in question. Cell V51 contains the frequency of cooccurrence of amino acids in the two positions and is calculated using the formula V51 = SUM((IF(OFFSET(pos1,0,$C$1-4)<>"-",1,0)*IF(OFFSET(pos1,0,$D$1-4)<>"-",1,0))) followed by Ctrl+Shift+Return.
Step 4. MI calculation: MI is calculated using the formula B75 = IF(B52=0,0,B52*LN(B52/B28)) copied over all the cells in table B75:U94 and then summed over the whole table. In this worksheet, range B52:U71 is the table of observed frequencies of cooccurrence (from step 3) and range B28:U47 is the table of expected frequencies of cooccurrence assuming mutual independence (from step 2).
Step 5. Repeating the calculation for all pairs of positions: While calculating correlation values, the same set of calculations needs to be repeated for each pair of positions. Once the Excel spreadsheet is set up to calculate one iteration of the calculation, the following macro is used to iteratively calculate these values and tabulate them. Since MI is symmetric, calculating values on one side of the diagonal is sufficient.
Sub MI()
    '
    'This script calculates MI values for each pair and tabulates them
    'There are 53 positions in this alignment.
    'B96 is the cell that contains MI value calculated in each iteration
    'Row 100 and column A have position numbers
    'The MI values are tabulated in B101:BB153
    '
    Dim ColumnCounter As Integer
    Dim RowCounter As Integer
    Application.ScreenUpdating = False
    For ColumnCounter = 2 To 54
        'Start one row below the diagonal, since MI is symmetric
        For RowCounter = 100 + ColumnCounter To 153
            Worksheets("MI").Activate
            Range("C1") = Cells(RowCounter, 1)
            Range("D1") = Cells(100, ColumnCounter)
            Calculate
            Cells(RowCounter, ColumnCounter) = Range("B96")
        Next RowCounter
    Next ColumnCounter
    Application.ScreenUpdating = True
    ActiveWorkbook.Save
End Sub
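For readers working outside Excel, steps 2 to 5 can be sketched in Python. The sketch below follows the spreadsheet's logic of counting only sequences with no gap at either position, but it makes one simplifying assumption: the per-position (marginal) frequencies are taken from those same gap-free sequences, not from a separate frequency table. The function names and toy alignments are hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """MI (natural log) between 0-based columns i and j of an aligned set of sequences."""
    pairs = [(s[i], s[j]) for s in msa if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        observed = count / n                       # observed cooccurrence frequency
        expected = (left[a] / n) * (right[b] / n)  # expected under independence
        mi += observed * math.log(observed / expected)
    return mi


def mi_table(msa):
    """MI for every pair of positions (1-based), one side of the diagonal only."""
    length = len(msa[0])
    return {
        (i + 1, j + 1): mutual_information(msa, i, j)
        for i in range(length)
        for j in range(i + 1, length)
    }


# Hypothetical alignments: perfectly coupled columns vs. independent columns.
coupled = ["AC", "AC", "GT", "GT"]
independent = ["AC", "AT", "GC", "GT"]
print(mutual_information(coupled, 0, 1), mutual_information(independent, 0, 1))
```

Pairs with zero observed frequency never enter the sum, which plays the same role as the `IF(B52=0,0,...)` guard in the step 4 formula.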
Step 6. Converting a table into a list: When correlation values are calculated, the output format is in a matrix or table form. In order to sort the values, it is more convenient to format them into a list. This Excel macro converts a 53 × 53 MI table into a list. This macro takes the values from only half of the matrix (below the diagonal). The matrix is located in a worksheet titled "matrix" and starts from cell A1. The first row and first column contain position numbers. Another blank worksheet called "list" needs to be created before the macro is run. The RowCounter and ColumnCounter values can be edited if the table in question is of a different size.
Sub matrix_list()
    '
    'This script converts a 53×53 table in "matrix" worksheet
    'into a list in "list" worksheet
    'In the "matrix" worksheet, Row 1 and column A have data labels
    '
    Application.ScreenUpdating = False
    Dim ColumnCounter As Integer
    Dim RowCounter As Integer
    Dim MyCounter As Long
    MyCounter = 2
    For ColumnCounter = 2 To 54
        'Rows below the diagonal only, matching the half filled in by Sub MI
        For RowCounter = (ColumnCounter + 1) To 54
            Worksheets("list").Cells(MyCounter, 1).Value = Worksheets("matrix").Cells(RowCounter, 1).Value
            Worksheets("list").Cells(MyCounter, 2).Value = Worksheets("matrix").Cells(1, ColumnCounter).Value
            Worksheets("list").Cells(MyCounter, 3).Value = Worksheets("matrix").Cells(RowCounter, ColumnCounter).Value
            MyCounter = MyCounter + 1
        Next RowCounter
    Next ColumnCounter
    Application.ScreenUpdating = True
    ActiveWorkbook.Save
End Sub
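The same table-to-list conversion can be sketched in Python. The example reads only the cells below the diagonal of a small labelled matrix and returns the pairs sorted from highest to lowest value. The 3 × 3 matrix and its values are invented for illustration, and the function name is an assumption.

```python
def lower_triangle_to_list(labels, matrix):
    """Flatten the below-diagonal half of a square matrix into sorted (row, col, value) rows."""
    rows = []
    for r in range(len(labels)):
        for c in range(r):  # c < r, so only cells below the diagonal are read
            rows.append((labels[r], labels[c], matrix[r][c]))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical MI values for three positions; unused cells are None.
labels = [1, 2, 3]
matrix = [
    [None, None, None],
    [0.10, None, None],
    [0.40, 0.25, None],
]
pair_list = lower_triangle_to_list(labels, matrix)
print(pair_list)
```

Sorting as part of the conversion puts the most strongly correlated position pairs at the top of the list.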
Step 7. Calculating noise level: In order to calculate noise level for a correlation calculation, a randomized MSA is created where each column is scrambled. This keeps the consensus information the same while scrambling the correlations. In order to randomize the MSA, the RAND, RANK, and INDEX functions of Microsoft Excel are used. For the BPTI database, the worksheet containing the original MSA (table C2:BC881) was named "seq" and a table of the same size was created in another worksheet named "rand" where each cell of the table was = RAND() and hence contained a random number between 0 and 1. In a third worksheet named "scramble," another table of equal size was created where the cell C2 was = INDEX(seq!C$2:C$881,RANK(rand!C2,rand!C$2:C$881)). This formula was copied over the whole table C2:BC881. Every time the worksheet was refreshed/recalculated (F9 on manual calculation mode), a new column-randomized MSA was generated. MI calculation was carried out for the randomized dataset as per the procedure described in steps 1–5, and the maximum MI value obtained from the column-randomized MSA was accepted as the noise level.
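The column-scrambling idea can be sketched in Python as follows. Each column is shuffled independently, so per-position composition is preserved while correlations between positions are destroyed. The `noise_level` helper takes the pair-scoring function as a parameter, so any MI implementation can be plugged in. Both function names and the toy alignment are assumptions made for this sketch.

```python
import random


def shuffle_columns(msa, rng):
    """Return a copy of the alignment with every column shuffled independently."""
    columns = [list(col) for col in zip(*msa)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]


def noise_level(msa, pair_score, rng):
    """Maximum pair score over all position pairs of one column-shuffled alignment.

    `pair_score(msa, i, j)` is any function scoring 0-based columns i and j,
    for example a mutual information calculation.
    """
    shuffled = shuffle_columns(msa, rng)
    length = len(shuffled[0])
    return max(
        pair_score(shuffled, i, j)
        for i in range(length)
        for j in range(i + 1, length)
    )


# Hypothetical alignment; a seeded generator keeps the shuffle reproducible.
msa = ["ACD", "GTE", "ACD", "GTE"]
shuffled = shuffle_columns(msa, random.Random(0))
print(shuffled)
```

A single shuffle gives one sample of the noise. Repeating the shuffle several times and taking the largest value would give a more conservative threshold, although the protocol above describes a single randomized MSA.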
URL:
https://www.sciencedirect.com/science/article/pii/B9780123942920000114
New Approaches to Prokaryotic Systematics
Radhey S. Gupta, in Methods in Microbiology, 2014
4.1 Creation of multiple sequence alignments
The creation of multiple sequence alignments (MSAs) for protein homologues is the first step in the identification of CSIs. These alignments should contain representative organisms from the group of interest as well as a number of outgroup species. Although one would think that it would be useful to include sequences for as many species as possible in these initial sequence alignments, from a practical standpoint this is very time consuming, and it can also make many useful CSIs difficult to identify. Sequence alignments from diverse organisms often contain more than one CSI within the same region, which are of different lengths and show different species specificities. The presence of these multiple CSIs in the same region can make it difficult to identify CSIs that are specific for the group of interest. Additionally, the inclusion of homologues from distantly related taxa in the alignments will reduce the overall sequence conservation, which can also adversely affect identification of some CSIs. Due to these considerations, the initial alignments created for identification of CSIs generally contain sequences for about 15–25 species, including those from the outgroup taxa.
The selection of taxa whose sequences are included in the initial alignments is based upon the overall objective of the project. For example, if one is interested in identifying CSIs that are specific for a group that contains only a limited number of species (e.g. phylum Thermotogae or Aquificae), the initial MSAs could include information for most or all of the sequenced species from these groups. The outgroup species in these cases should include at least two to three species each from two or more phyla. On the other hand, if one is interested in identifying CSIs that are specific for a larger taxonomic group such as Gammaproteobacteria or Actinobacteria then, due to the large number of sequences that are available from these taxa at multiple phylogenetic levels, it is difficult to include all of the sequences in the MSAs. The task of identifying CSIs for these large groups involves the creation of larger MSAs, which should include representatives from different classes and orders covering the phylogenetic diversity of these groups; in addition, these MSAs should also contain representatives from a number of phyla of bacteria. The identification of CSIs for these larger taxonomic groups was much easier, when sequence information was limited (Gao & Gupta, 2005; Gupta, 1998, 2000, 2005); however, with the large increase in the number of genome sequences that are now available for these groups, the task of identifying CSIs for these larger groups has become more difficult. For Gammaproteobacteria some CSIs have been identified at the class level, as well as many others that are specific for some of the orders (viz. Pasteurellales, Xanthomonadales and Enterobacteriales), and distinct subgroups within them (Cutino-Jimenez et al., 2010; Gao, Mohan, & Gupta, 2009; Gupta, 2000; Naushad & Gupta, 2012, 2013; Naushad, Lee, & Gupta, 2014).
For creation of MSAs, the genome sequences for one or two species from the group of interest are chosen. Blastp searches are performed on most proteins (ORFs) from these genomes against the NCBI non-redundant (nr) database. Protein sequences which are < 75 aa long are often omitted from Blast searches, as very few CSIs have been detected in them. For example, in our work on identification of CSIs that are specific for the phylum Aquificae, Blastp searches were performed on different proteins from the genome of Aquifex aeolicus (Deckert et al., 1998). Based on these searches, 10–20 high-scoring homologues (preferably with E value < 1e−20) for different proteins are retrieved in the FASTA format from different Aquificae species as well as a limited number (6–8) of the outgroup species. It is not necessary to have all the sequenced Aquificae species in every alignment, and the outgroup species can also vary. For proteins whose homologues are not found in other species, or which are present in only a limited number of species (generally < 6), MSAs are not created.
The MSAs of the protein homologues are created using the Clustal_X 2.1 program (Larkin et al., 2007). However, other programs such as Mega or MUSCLE can also be used for this purpose (Chun & Hong, 2010; Edgar, 2004; Kumar, Nei, Dudley, & Tamura, 2008). In sequence alignments created with Clustal_X 2.1 from FASTA files of downloaded sequences (see Figure 2A), the alignment files show only the GI (GenBank identification) numbers and the accession numbers of the different homologues (see Figure 2B). No information is displayed for the species names, which is essential for determining the taxon specificity of any CSI that may be present in an alignment. To rectify this problem, the sequences in the FASTA file should be edited so that the species name appears first on the information line. The edited species names also should not contain any space between the genus and the species names; otherwise only the genus names will be displayed in the created alignments. This can be problematic if multiple species from the same genus are present in an MSA. The names can also be abbreviated at this stage, if necessary. We carry out processing of sequence names prior to creation of MSAs using a program 'SEQ_RENAME' that we have developed for this purpose (see Table 1; available from the Gupta Lab Evolutionary Analysis Software (Gleans) Webpage, www.gleans.net). The input and output files for this program for a representative set of sequences are shown in Figure 2A and C, respectively. If two or more sequences in the output file have the same name, they should be edited at this stage. The Clustal_X alignment for the output file generated by the 'SEQ_RENAME' program is shown in Figure 2D.
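The renaming step can be illustrated with a short Python sketch. It is not the authors' renaming program, whose internals are not described here. It assumes that each information line ends with the organism name in square brackets, as in NCBI-style FASTA downloads, and that duplicate names can be resolved by appending a counter; the function names `rename_header` and `rename_fasta` are hypothetical.

```python
import re

# Assumption: the organism name is the last [bracketed] field on the line.
_SPECIES = re.compile(r"\[([^\[\]]+)\]\s*$")


def rename_header(header: str) -> str:
    """Move 'Genus_species' (no spaces) to the front of a FASTA header."""
    body = header[1:].strip() if header.startswith(">") else header.strip()
    match = _SPECIES.search(body)
    if match is None:
        return ">" + body  # no organism field found: leave unchanged
    words = match.group(1).split()
    name = "_".join(words[:2])  # genus and species joined by an underscore
    rest = body[:match.start()].strip()
    return f">{name} {rest}".rstrip()


def rename_fasta(text: str) -> str:
    """Rename every header in a FASTA string, keeping names unique."""
    seen = {}
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            new = rename_header(line)
            name = new[1:].split(" ", 1)[0]
            seen[name] = seen.get(name, 0) + 1
            if seen[name] > 1:
                # Assumption: disambiguate repeated names with a counter.
                new = ">" + f"{name}_{seen[name]}" + new[1 + len(name):]
            out.append(new)
        else:
            out.append(line)
    return "\n".join(out)
```

Because alignment programs typically take the first whitespace-delimited token of the information line as the sequence name, placing `Genus_species` first is what makes the species visible in the alignment.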
Table 1. Descriptions of the Software Programs Used for the Creation of Signature Files
Name of the Program | Program Description | Input/Output | Availability |
---|---|---|---|
SIG_RENAME | Renaming of the FASTA sequences so that the species names can be recognised in an alignment file | See Figure 2A and C | These programs and the instructions for their usage are available from the Gupta Lab Evolutionary Analysis Software (Gleans) Webpage, www.gleans.net |
SIG_CREATE | This program extracts information from the Blast output for the species name, accession numbers and sequences of the proteins from target species | See Figure 4A and B | |
SIG_STYLE | This program converts all amino acids identical to those on the top line to dashes (−) and all sequence gaps into blank spaces for easier visualisation of the sequence conservation | See Figure 4B and C and text | |
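The display convention described for SIG_STYLE (residues identical to the top line shown as dashes, gaps shown as blanks) can be sketched as follows. This is a minimal illustration, not the SIG_STYLE program itself: the function name `style_alignment`, the set of gap characters and the plain ASCII hyphen used for the dash are assumptions.

```python
from typing import List


def style_alignment(rows: List[str], gap_chars: str = "-.") -> List[str]:
    """Restyle aligned rows relative to the first (reference) row.

    Residues identical to the reference become '-', alignment gaps become
    blank spaces, and differing residues are kept as they are.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    reference = rows[0]
    # The reference row keeps its residues; only its gaps are blanked.
    styled = ["".join(" " if c in gap_chars else c for c in reference)]
    for row in rows[1:]:
        chars = []
        for ref_c, c in zip(reference, row):
            if c in gap_chars:
                chars.append(" ")   # gap in this sequence
            elif c.upper() == ref_c.upper():
                chars.append("-")   # identical to the top line
            else:
                chars.append(c)     # substitution or inserted residue
        styled.append("".join(chars))
    return styled
```

With this convention, an insertion shared by one group of species stands out as a block of residues opposite blank spaces in all other rows.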
URL:
https://www.sciencedirect.com/science/article/pii/S058095171400004X
Phylogeny of Tec Family Kinases: Identification of a Premetazoan Origin of Btk, Bmx, Itk, Tec, Txk, and the Btk Regulator SH3BP5
Csaba Ortutay , ... C.I. Edvard Smith , in Advances in Genetics, 2008
C Phylogenetic analysis
Based on the multiple sequence alignment, a bootstrap analysis was performed using maximum parsimony as the criterion for searching for the optimal tree (Fig. 3.1). The six protein groups are clearly separated on the tree, and we can draw the phylogeny of these proteins as follows. The ancestor of all the TFKs was present in early eukaryotes prior to the formation of metazoans. The sequences from S. domuncula and M. brevicollis are orthologs of the ancestor. After the divergence of deuterostomia and protostomia the descendants of the ancestor further diverged. In protostomia, now insects, TFKs developed in the form of the Btk29A protein group.
In deuterostomia a descendant of the single gene became the ancestor for the five chordata-specific protein groups. The TFK in E. burgeri is a direct ascendant to that. After the formation of craniata, but before the formation of vertebrata, the ancestor went through multiple duplications. First, it was divided into the Btk/Bmx and Tec/Txk/Itk groups. Then both groups duplicated until all the five protein groups appeared. These events took place before the emergence of vertebrates. The lack of sequences in fishes and some other genomes is likely due to deletion events rather than duplications after the emergence of the vertebrates, since all the sequences within the groups are more similar to each other than to any of the fish or frog sequences.
Another view on evolution is based upon the analysis of genomic organization. Recently the amphioxus, Branchiostoma floridae, genome was published (Putnam et al., 2008). The authors partly reconstructed the genomic organization of the last common chordate ancestor and described two genome-wide duplications and subsequent reorganizations in the vertebrate lineage. Interestingly, number 8 of the 17 reconstructed ancestral chordate linkage groups contains regions corresponding to the location of all TFKs in the human genome.
Since we identified only a single frog TFK, the genomic region was analyzed. According to Xenbase (http://www.xenbase.org/), the Cyfip2 and Med7 genes are both located in the vicinity of the Tec gene. This is surprising since both these genes are in close proximity to the Itk gene in zebrafish, mouse and man. The Itk and Tec genes in these three species are on different chromosomes. The Txk gene, which is absent from the zebrafish, is in very close proximity to the Tec gene in humans and mice. However, there is no doubt from the sequence alignment that the frog TFK should be classified as Tec. It is possible that a recombination event in an ancestor has transferred the Tec gene from its original location into the position of an Itk homolog, which was simultaneously lost. An alternative explanation is that the gene has evolved so that it is currently closer to Tec than to Itk. The X. laevis Tec has similarity to Tec from other species throughout the sequence, suggesting that if a recombination occurred, the entire gene was replaced.
URL:
https://www.sciencedirect.com/science/article/pii/S0065266008008031
Source: https://www.sciencedirect.com/topics/medicine-and-dentistry/multiple-sequence-alignment