Sequence Alignment Module

The align module performs multiple sequence alignment with MAFFT and trims the result with TrimAl. It can align standalone input sequences or first merge them with reference database sequences, producing alignments for phylogenetic analysis.

[Figure: _images/fig_module_phylogenetic.png]

Overview

The alignment module is essential for downstream phylogenetic analysis and serves as the foundation for:

  • Phylogenetic tree construction

  • Recombination detection

  • Motif analysis

  • Comparative genomics studies

Workflow

The align module follows a structured workflow:

  1. Input Validation: Validates FASTA format and sequence integrity

  2. Database Integration (optional): Merges input sequences with reference database sequences

  3. Metadata Generation: Creates comprehensive metadata for all sequences

  4. Multiple Sequence Alignment: Uses MAFFT with optimized parameters for protein/nucleotide sequences

  5. Alignment Trimming: Removes poorly aligned regions using TrimAl
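The input validation step can be pictured with a minimal sketch. This is not CRESSENT's actual validation code: `parse_fasta` is a hypothetical helper, and the checks it applies (non-empty headers, no sequence data before the first header, no empty records, unique identifiers) are assumptions about what "sequence integrity" might cover.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Raises ValueError for the structural problems checked below.
    Hypothetical helper, not part of CRESSENT.
    """
    records = []
    header = None
    chunks = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            if not header:
                raise ValueError(f"line {line_no}: empty header")
            chunks = []
        else:
            if header is None:
                raise ValueError(f"line {line_no}: sequence data before first header")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    if not records:
        raise ValueError("no FASTA records found")

    seen = set()
    for record_header, seq in records:
        seq_id = record_header.split()[0]  # identifier = first word of the header
        if not seq:
            raise ValueError(f"record {seq_id!r} has no sequence")
        if seq_id in seen:
            raise ValueError(f"duplicate identifier {seq_id!r}")
        seen.add(seq_id)
    return records
```

Running a check like this before calling the module gives a clearer error message than an empty alignment would.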

Usage

Basic Alignment

Align sequences without database integration:

cressent align \
    --threads 24 \
    --input_fasta sequences.fasta \
    -o output/alignment

Database-Integrated Alignment

Align input sequences together with a viral family database for broader phylogenetic context:

cressent align \
    --threads 24 \
    --input_fasta sequences_reps.faa \
    --db_family "Naryaviridae" \
    --protein_type reps \
    --db_path databases/ \
    -o output/alignment_with_db

Custom Database Alignment

Use your own custom database for alignment:

cressent align \
    --threads 24 \
    --input_fasta sequences.fasta \
    --db_family "custom" \
    --custom_aa custom_sequences.faa \
    -o output/custom_alignment

How do I build custom reference databases?

Use the db_builder module (the taxonomic list is here):

cressent db_builder \
    -t taxonomy_file.csv \
    -l Genus \
    -s "YourVirusGenus" \
    -o custom_database \
    -e your.email@example.com

The resulting database contains:

custom_database/YourVirusGenus/
├── annotated/
│   ├── caps/           # Capsid proteins by cluster (if these sequences were found)
│   └── reps/           # Replication proteins by cluster (if these sequences were found)
├── unannotated/        # Unclassified ORFs (if these sequences were found)
├── cd_hit/             # CD-HIT output
├── db_builder.log      # Log file
├── diamond/            # DIAMOND output
├── mcl/                # MCL output
└── raw_aa              # Family-specific raw protein sequences

Ensure your taxonomy file contains proper ICTV classifications and accession numbers.

Parameters

Required Parameters

  • -i, --input_fasta: Input FASTA file containing sequences to align

  • -o, --output: Output directory for alignment results

Optional Parameters

  • -t, --threads: Number of CPU threads (default: 1)

  • --mafft_ep: MAFFT offset value (MAFFT's --ep option, which works like a gap extension penalty; default: 0.123)

  • --gap_threshold: TrimAl gap threshold for trimming (default: 0.2)

Database Parameters

  • --db_family: Viral family name(s) for database selection or ‘all’ for complete database

  • --db_path: Path to the database directory

  • --protein_type: Specify ‘reps’ or ‘caps’ for protein-specific databases

  • --custom_aa: Path to custom amino acid database file
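When the module is driven from a script, the command line can be assembled from the parameters listed above. The sketch below uses only the flags documented in this section; `build_align_command` is a hypothetical helper, not part of CRESSENT, and its ordering of arguments is an arbitrary choice.

```python
import shlex


def build_align_command(input_fasta, output, threads=1, mafft_ep=None,
                        gap_threshold=None, db_family=None, db_path=None,
                        protein_type=None, custom_aa=None):
    """Return the argv list for a `cressent align` call (hypothetical helper)."""
    if protein_type is not None and protein_type not in ("reps", "caps"):
        raise ValueError("protein_type must be 'reps' or 'caps'")

    cmd = ["cressent", "align",
           "--input_fasta", str(input_fasta),
           "--output", str(output),
           "--threads", str(threads)]
    # Optional flags are added only when set, so the tool's defaults apply otherwise.
    if mafft_ep is not None:
        cmd += ["--mafft_ep", str(mafft_ep)]
    if gap_threshold is not None:
        cmd += ["--gap_threshold", str(gap_threshold)]
    if db_family:
        # Accept a single family name or several, as in the multi-family example.
        families = [db_family] if isinstance(db_family, str) else list(db_family)
        cmd += ["--db_family", *families]
    if db_path is not None:
        cmd += ["--db_path", str(db_path)]
    if protein_type is not None:
        cmd += ["--protein_type", protein_type]
    if custom_aa is not None:
        cmd += ["--custom_aa", str(custom_aa)]
    return cmd


command = build_align_command(
    "capsid_proteins.faa", "analysis/caps_align_with_db", threads=24,
    db_family=["Circoviridae", "Genomoviridae"], protein_type="caps",
    db_path="/path/to/databases",
)
print(shlex.join(command))  # pass `command` to subprocess.run(...) to execute it
```

Building the call as a list avoids shell quoting problems with paths that contain spaces.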

Output Files

The align module generates several important output files:

Primary Outputs

  • <prefix>_aligned_sequences.fasta: Raw MAFFT alignment

  • <prefix>_aligned_trimmed_sequences.fasta: Trimmed alignment ready for phylogenetic analysis

  • metadata.csv: Comprehensive sequence metadata including family assignments
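For scripted pipelines it helps to compute the expected output paths up front. The sketch below assumes that <prefix> is the input file name without its extension, which is consistent with the example workflow in this document (capsid_proteins.faa yields capsid_proteins_aligned_trimmed_sequences.fasta) but is not stated explicitly. `expected_outputs` is a hypothetical helper.

```python
from pathlib import Path


def expected_outputs(input_fasta, output_dir):
    """Return the output paths the align module is expected to write.

    Assumption: <prefix> equals the input file name without its extension.
    """
    prefix = Path(input_fasta).stem
    out = Path(output_dir)
    return {
        "aligned": out / f"{prefix}_aligned_sequences.fasta",
        "trimmed": out / f"{prefix}_aligned_trimmed_sequences.fasta",
        "metadata": out / "metadata.csv",
    }


paths = expected_outputs("capsid_proteins.faa", "analysis/caps_align_with_db")
for name, path in paths.items():
    # Report which of the expected files are present after a run.
    print(name, path, "exists" if path.exists() else "missing")
```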

Metadata Structure

The metadata file contains the following columns:

Column               Description
protein_id           Unique sequence identifier
protein_description  Full sequence description
family               Assigned viral family
scientific_name      Source organism name
protein_name         Protein function/name
source               Origin (input or database)
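A quick way to inspect the metadata is to count sequences per source and per family. The sketch below relies only on the column names documented here; the sample rows are invented for illustration, and `summarize_metadata` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = [
    "protein_id", "protein_description", "family",
    "scientific_name", "protein_name", "source",
]


def summarize_metadata(handle):
    """Count rows per source and per family in a metadata.csv-style file."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    by_source = Counter()
    by_family = Counter()
    for row in reader:
        by_source[row["source"]] += 1
        by_family[row["family"]] += 1
    return by_source, by_family


# Invented sample rows; a real run would open the generated metadata.csv instead.
sample = io.StringIO(
    "protein_id,protein_description,family,scientific_name,protein_name,source\n"
    "query_1,query_1 putative Rep,Circoviridae,Example virus A,Rep,input\n"
    "ref_1,ref_1 Rep protein,Circoviridae,Example virus B,Rep,database\n"
    "ref_2,ref_2 Rep protein,Genomoviridae,Example virus C,Rep,database\n"
)
by_source, by_family = summarize_metadata(sample)
print(dict(by_source))
print(dict(by_family))
```

For a real file, pass `open("path/to/metadata.csv", newline="")` as the handle.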

Best Practices

Sequence Preparation

  1. Ensure sequence quality: Remove sequences with a high proportion of ambiguous characters (for example N in nucleotide sequences or X in protein sequences)

  2. Check sequence orientation: All sequences should be in the same orientation

  3. Validate functional domains: For proteins, ensure sequences contain expected functional domains
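The first point can be automated with a small filter. This is a minimal sketch, not a CRESSENT feature: `ambiguous_fraction` and `filter_ambiguous` are hypothetical helpers, and the 5% cutoff is an arbitrary assumption that should be tuned to the dataset. Pass "X" as the ambiguity character for proteins or "N" for nucleotides.

```python
def ambiguous_fraction(sequence, ambiguous_chars):
    """Fraction of characters in `sequence` that are in `ambiguous_chars`."""
    seq = sequence.upper()
    if not seq:
        return 1.0  # treat an empty sequence as fully ambiguous
    return sum(ch in ambiguous_chars for ch in seq) / len(seq)


def filter_ambiguous(records, ambiguous_chars="X", max_fraction=0.05):
    """Split (id, sequence) records into kept and dropped lists.

    The default cutoff of 0.05 is an assumption, not a recommended value.
    """
    kept, dropped = [], []
    for seq_id, seq in records:
        if ambiguous_fraction(seq, ambiguous_chars) <= max_fraction:
            kept.append((seq_id, seq))
        else:
            dropped.append((seq_id, seq))
    return kept, dropped


records = [("clean", "MKTLLVAGSR"), ("noisy", "MKXXXXAGSR")]
kept, dropped = filter_ambiguous(records, ambiguous_chars="X", max_fraction=0.05)
print([seq_id for seq_id, _ in kept], [seq_id for seq_id, _ in dropped])
```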

Parameter Optimization

  1. Thread usage: Use available CPU cores but monitor memory usage

  2. Gap threshold: Lower values (0.1-0.3) for conserved sequences, higher (0.4-0.6) for divergent sequences

  3. Database selection: Use family-specific databases when available for better phylogenetic signal
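Before settling on a gap threshold, it can help to see how gappy the untrimmed alignment is. The sketch below computes the fraction of gaps in each column and counts the columns above a chosen cutoff. It only describes the alignment and does not reproduce TrimAl's trimming logic; how --gap_threshold is passed to TrimAl is not specified here. Both function names are hypothetical.

```python
def column_gap_fractions(aligned_seqs):
    """Return the fraction of '-' characters in each alignment column."""
    if not aligned_seqs:
        raise ValueError("no sequences given")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences differ in length; input is not an alignment")
    n = len(aligned_seqs)
    return [sum(seq[i] == "-" for seq in aligned_seqs) / n for i in range(length)]


def count_columns_above(fractions, cutoff):
    """Number of columns whose gap fraction is strictly greater than `cutoff`."""
    return sum(f > cutoff for f in fractions)


alignment = ["AC-T", "A--T", "ACGT", "A-GT"]
fractions = column_gap_fractions(alignment)
print(fractions)
print(count_columns_above(fractions, 0.2))
```

Comparing this count with the length of the trimmed alignment shows roughly how much a given setting removes.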

Quality Control

After alignment, check:

  • Alignment length: Should retain sufficient positions for phylogenetic analysis

  • Sequence coverage: Most sequences should span the majority of the alignment

  • Conserved regions: Key functional domains should be well-aligned
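The first two checks can be scripted. The sketch below reports the alignment length and the share of non-gap positions per sequence, and flags sequences below a coverage cutoff. `alignment_report` is a hypothetical helper, and the 0.5 cutoff is an assumption rather than a CRESSENT default.

```python
def alignment_report(records, min_coverage=0.5):
    """Summarize an alignment given as a list of (id, aligned_sequence).

    Coverage is the fraction of non-gap positions in each sequence.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("expected one shared alignment length")
    length = lengths.pop()
    if length == 0:
        raise ValueError("alignment has zero columns")
    coverage = {
        seq_id: (length - seq.count("-")) / length for seq_id, seq in records
    }
    low = sorted(seq_id for seq_id, cov in coverage.items() if cov < min_coverage)
    return {"alignment_length": length, "coverage": coverage, "low_coverage": low}


records = [("s1", "MKT-LV"), ("s2", "M----V"), ("s3", "MKTALV")]
report = alignment_report(records, min_coverage=0.5)
print(report["alignment_length"], report["low_coverage"])
```

Sequences flagged here are candidates for removal before tree building, since mostly-gap rows contribute little phylogenetic signal.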

Integration with Other Modules

The align module outputs are directly compatible with:

  • build_tree: For phylogenetic tree construction

  • recombination: For recombination detection analysis

  • motif: For motif discovery and analysis

  • plot_tree: For tree visualization with alignment context

Troubleshooting

Common Issues

Memory Errors: Use fewer threads or reduce dataset size. Consider clustering sequences first.

Poor Alignment Quality: Adjust the --mafft_ep parameter or check input sequence quality.

Database Integration Failures: Verify the database path and family names. Ensure the database files exist.

Empty Output: Check input file format and sequence validity. Review log files for specific errors.

Performance Tips

  1. Large datasets: Use sequence clustering before alignment

  2. Memory optimization: Reduce thread count if memory is limited

  3. Speed optimization: Use family-specific databases instead of ‘all’
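As a crude first step toward the clustering suggested above, exact duplicates can be collapsed before alignment. This sketch is a stand-in for a real clustering tool (such as CD-HIT, which groups sequences at a similarity threshold) and only merges identical sequences. `collapse_identical` is a hypothetical helper.

```python
def collapse_identical(records):
    """Keep the first record for each distinct sequence.

    Returns (representatives, members), where `members` maps each
    representative id to the ids of all records sharing its sequence.
    """
    representatives = []
    members = {}
    first_id_for_seq = {}
    for seq_id, seq in records:
        key = seq.upper()  # compare case-insensitively
        if key in first_id_for_seq:
            members[first_id_for_seq[key]].append(seq_id)
        else:
            first_id_for_seq[key] = seq_id
            members[seq_id] = [seq_id]
            representatives.append((seq_id, seq))
    return representatives, members


records = [("a", "MKTLLV"), ("b", "mktllv"), ("c", "MKTALV")]
representatives, members = collapse_identical(records)
print(len(records), "->", len(representatives), "sequences")
```

Keeping the `members` mapping makes it possible to relate tree tips back to the sequences they represent.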

Example Workflow

Here’s a complete example for capsid protein alignment:

# Basic alignment for tree building
cressent align \
    --threads 24 \
    --input_fasta capsid_proteins.faa \
    -o analysis/caps_align

# Database-integrated alignment for comprehensive phylogeny
cressent align \
    --threads 24 \
    --input_fasta capsid_proteins.faa \
    --db_family "Circoviridae" "Genomoviridae" \
    --protein_type caps \
    --db_path /path/to/databases \
    -o analysis/caps_align_with_db

# Build tree from alignment
cressent build_tree \
    -i analysis/caps_align_with_db/capsid_proteins_aligned_trimmed_sequences.fasta \
    -o analysis/caps_tree