# Sequence Alignment Module

The `align` module performs multiple sequence alignment using MAFFT and alignment trimming using TrimAl. It works with both standalone sequences and database-integrated alignments for phylogenetic analysis.

```{image} _static/figures/fig_module_phylogenetic.png
:width: 800
:class: no-scaled-link
:align: center
```

## Overview

The alignment module is essential for downstream phylogenetic analysis and serves as the foundation for:

- Phylogenetic tree construction
- Recombination detection
- Motif analysis
- Comparative genomics studies

## Workflow

The `align` module follows a structured workflow:

1. **Input Validation**: Validates FASTA format and sequence integrity
2. **Database Integration** (optional): Merges input sequences with reference database sequences
3. **Metadata Generation**: Creates comprehensive metadata for all sequences
4. **Multiple Sequence Alignment**: Uses MAFFT with optimized parameters for protein/nucleotide sequences
5. **Alignment Trimming**: Removes poorly aligned regions using TrimAl

## Usage

### Basic Alignment

Align sequences without database integration:

```bash
cressent align \
    --threads 24 \
    --input_fasta sequences.fasta \
    -o output/alignment
```

### Database-Integrated Alignment

Align sequences together with a viral family database for enhanced phylogenetic context:

```bash
cressent align \
    --threads 24 \
    --input_fasta sequences_reps.faa \
    --db_family "Naryaviridae" \
    --protein_type reps \
    --db_path databases/ \
    -o output/alignment_with_db
```

### Custom Database Alignment

Use your own custom database for alignment:

```bash
cressent align \
    --threads 24 \
    --input_fasta sequences.fasta \
    --db_family "custom" \
    --custom_aa custom_sequences.faa \
    -o output/custom_alignment
```

### How do I build custom reference databases?

Use the `db_builder` module. The taxonomic list is available [here](https://github.com/ricrocha82/cressent/blob/main/DB/taxonomy_accession_number.csv).
```bash
cressent db_builder \
    -t taxonomy_file.csv \
    -l Genus \
    -s "YourVirusGenus" \
    -o custom_database \
    -e your.email@example.com
```

The final database will contain:

```
custom_database/YourVirusGenus/
├── annotated/
│   ├── caps/          # Capsid proteins by cluster (if sequences were found)
│   └── reps/          # Replication proteins by cluster (if sequences were found)
├── unannotated/       # Unclassified ORFs (if sequences were found)
├── cd_hit/            # CD-HIT output
├── db_builder.log     # Log file
├── diamond/           # DIAMOND output
├── mcl/               # MCL output
└── raw_aa             # Family-specific raw protein sequences
```

Ensure your taxonomy file contains proper ICTV classifications and accession numbers.

## Parameters

### Required Parameters

- `-i, --input_fasta`: Input FASTA file containing sequences to align
- `-o, --output`: Output directory for alignment results

### Optional Parameters

- `-t, --threads`: Number of CPU threads (default: 1)
- `--mafft_ep`: MAFFT alignment accuracy parameter (default: 0.123)
- `--gap_threshold`: TrimAl gap threshold for trimming (default: 0.2)

### Database Parameters

- `--db_family`: Viral family name(s) for database selection, or 'all' for the complete database
- `--db_path`: Path to the database directory
- `--protein_type`: Specify 'reps' or 'caps' for protein-specific databases
- `--custom_aa`: Path to a custom amino acid database file

## Output Files

The `align` module generates several important output files. Judging by the example workflow at the end of this page, the alignment file names are prefixed with the input file's base name (for example, `capsid_proteins_aligned_trimmed_sequences.fasta`).

### Primary Outputs

- `_aligned_sequences.fasta`: Raw MAFFT alignment
- `_aligned_trimmed_sequences.fasta`: Trimmed alignment ready for phylogenetic analysis
- `metadata.csv`: Comprehensive sequence metadata, including family assignments

### Metadata Structure

The metadata file contains the following columns:

| Column | Description |
|:-------|:------------|
| protein_id | Unique sequence identifier |
| protein_description | Full sequence description |
| family | Assigned viral family |
| scientific_name | Source organism name |
| protein_name | Protein function/name |
| source | Origin (input or database) |

## Best Practices

### Sequence Preparation

1. **Ensure sequence quality**: Remove sequences with excessive ambiguous nucleotides
2. **Check sequence orientation**: All sequences should be in the same orientation
3. **Validate functional domains**: For proteins, ensure sequences contain the expected functional domains

### Parameter Optimization

1. **Thread usage**: Use the available CPU cores, but monitor memory usage
2. **Gap threshold**: Use lower values (0.1-0.3) for conserved sequences and higher values (0.4-0.6) for divergent sequences
3. **Database selection**: Use family-specific databases when available for better phylogenetic signal

### Quality Control

After alignment, check:

- **Alignment length**: Should retain sufficient positions for phylogenetic analysis
- **Sequence coverage**: Most sequences should span the majority of the alignment
- **Conserved regions**: Key functional domains should be well-aligned

## Integration with Other Modules

The `align` module outputs are directly compatible with:

- [build_tree](phylogenetic_analysis.md): For phylogenetic tree construction
- [recombination](recombination.md): For recombination detection analysis
- [motif](motif.md): For motif discovery and analysis
- [plot_tree](viz.md): For tree visualization with alignment context

## Troubleshooting

### Common Issues

**Memory Errors**

Use fewer threads or reduce the dataset size. Consider clustering sequences first.

**Poor Alignment Quality**

Adjust the `--mafft_ep` parameter or check input sequence quality.

**Database Integration Failures**

Verify the database path and family names. Ensure the database files exist.

**Empty Output**

Check the input file format and sequence validity. Review the log files for specific errors.

### Performance Tips

1. **Large datasets**: Use sequence clustering before alignment
2. **Memory optimization**: Reduce the thread count if memory is limited
3. **Speed optimization**: Use family-specific databases instead of 'all'

## Example Workflow

Here's a complete example for capsid protein alignment:

```bash
# Basic alignment for tree building
cressent align \
    --threads 24 \
    --input_fasta capsid_proteins.faa \
    -o analysis/caps_align

# Database-integrated alignment for comprehensive phylogeny
cressent align \
    --threads 24 \
    --input_fasta capsid_proteins.faa \
    --db_family "Circoviridae" "Genomoviridae" \
    --protein_type caps \
    --db_path /path/to/databases \
    -o analysis/caps_align_with_db

# Build tree from alignment
cressent build_tree \
    -i analysis/caps_align_with_db/capsid_proteins_aligned_trimmed_sequences.fasta \
    -o analysis/caps_tree
```
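Before building a tree from the trimmed alignment, the alignment-length and sequence-coverage checks listed under Quality Control can be approximated with a small shell helper. This is only a sketch: `alignment_qc` is a hypothetical function, not part of `cressent`, and it assumes that "coverage" means the fraction of non-gap (`-`) columns per sequence in an aligned FASTA file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of cressent): summarize an aligned FASTA file.
# Prints the alignment length (taken from the first sequence), then one line
# per sequence with its ID and the fraction of non-gap positions.
alignment_qc() {
    LC_ALL=C awk '
        # Emit the coverage line for the record collected so far.
        function report() {
            if (id == "") return
            len = length(seq)
            residues = seq
            gsub(/-/, "", residues)              # drop gap characters
            if (!printed_len) {
                printf "alignment_length\t%d\n", len
                printed_len = 1
            }
            printf "%s\t%.2f\n", id, (len ? length(residues) / len : 0)
        }
        # Header line: flush the previous record and start a new one.
        /^>/ { report(); id = substr($1, 2); seq = ""; next }
        # Sequence line: strip whitespace and append (handles wrapped FASTA).
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END { report() }
    ' "$1"
}
```

With the function defined in the current shell, running `alignment_qc analysis/caps_align_with_db/capsid_proteins_aligned_trimmed_sequences.fasta` prints a tab-separated summary. Sequences with low coverage values are candidates for removal before re-running the alignment; what counts as "low" depends on the dataset.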