# Motif Discovery CRESSENT provides comprehensive motif discovery capabilities through two complementary approaches: regex-based pattern matching for known motifs and de novo motif discovery using MEME. Additionally, it integrates with the Prosite database for functional annotation of discovered motifs. ```{image} _static/figures/fig_module_phylogenetic.png :width: 800 :class: no-scaled-link :align: center ``` ## Overview The motif discovery module combines: - **Pattern-based searching** using regex patterns with seqkit - **De novo motif discovery** using MEME for finding unknown patterns - **Functional annotation** via ScanProsite for protein sequences - **Visualization** through sequence logos and motif maps ## Regex-based Pattern Matching ### Basic Usage Search for specific motif patterns in your sequences: ```bash cressent motif -i sequences.fasta -p "TAGTATTAC" -o output_dir ``` ### Key Features - **Flexible pattern matching** using regex syntax - **Gap handling** with optional gap removal - **Position tracking** with detailed coordinate information - **Sequence splitting** at motif positions ### Parameters | Parameter | Description | Default | |-----------|-------------|---------| | `-p, --pattern` | Sequence pattern (regex) for motif searching | Required | | `-n, --table_name` | Name of output table file | `pattern_positions.txt` | | `--remove-gaps` | Remove gaps before searching | False | | `--split-sequences` | Split sequences at motif positions | False | ### Example: CRESS Virus Nonanucleotide ```bash # Search for nonanucleotide motif in CRESS viruses cressent motif -i cress_genomes.fasta \ -p "TAGTATTAC" \ -n nona_positions.txt \ -o motif_analysis/ ``` ### Output Files - **Position table**: Tab-delimited file with motif locations - **Split sequences**: Optional FASTA files split at motif positions - **Log file**: Detailed analysis log ## De Novo Motif Discovery ### MEME Integration Discover unknown motifs using the MEME suite: ```bash cressent motif_disc -i sequences.fasta \ -o meme_output/ \ -nmotifs 3 \ -minw 6 \ -maxw 12 ``` ### Parameters | Parameter | Description | Default | |-----------|-------------|---------| | `-nmotifs` | Number of motifs to find | 1 | | `-minw` | Minimum motif width | 5 | | `-maxw` | Maximum motif width | 10 | | `--meme_extra` | Additional MEME arguments | None | | `--scanprosite` | Run ScanProsite analysis | False | ### MEME Output Processing The module automatically processes MEME results to generate: 1. **Consensus table** (`consensus_table.csv`) 2. **Detailed motif table** (`motif_table.csv`) with: - Sequence IDs and matched regions - Motif positions and orientations - Regular expressions for each motif 3. **EPS visualization files** (organized in `eps_files/`) ### Example Output Structure ``` motif_discovery_output/ ├── meme.html # MEME results webpage ├── meme.xml # Machine-readable results ├── consensus_table.csv # Motif consensus sequences ├── motif_table.csv # Detailed motif matches └── eps_files/ # MEME visualization files ├── logo1.eps └── logo2.eps ``` ## ScanProsite Integration ### Functional Annotation For protein sequences, automatically annotate motifs with known functions: ```bash cressent motif_disc -i proteins.fasta \ -o output/ \ --scanprosite ``` ### Features - **Automatic sequence type detection** (DNA vs protein) - **Prosite database querying** for functional annotations - **Rate limiting** to respect server resources - **Comprehensive results** with functional descriptions ### ScanProsite Output - **Results table** (`scanprosite_results.csv`) with: - Signature accessions and descriptions - Pattern positions and scores - Functional annotations ## Combined Workflow ### Complete Motif Analysis ```bash # Step 1: Search for known patterns cressent motif -i sequences.fasta \ -p "TAGTATTAC" \ --generate-logo \ -o analysis/ # Step 2: Discover novel motifs cressent motif_disc -i sequences.fasta \ -o analysis/meme/ \ -nmotifs 5 \ -minw 8 \ -maxw 15 # Step 3: Functional annotation (for proteins) cressent motif_disc -i proteins.fasta \ -o analysis/prosite/ \ --scanprosite ``` ## Visualization Integration ### Sequence Logos Generate publication-ready sequence logos: ```bash cressent motif -i sequences.fasta \ -p "MOTIF_PATTERN" \ --generate-logo \ --plot-title "CRESS Nonanucleotide" \ --width 12 \ --height 8 \ -o output/ ``` ### Motif Mapping Create genome-wide motif distribution maps: ```bash cressent motif_map_viz -f motif_table.csv \ -o visualization/ \ --format auto ``` ## Advanced Features ### Custom MEME Parameters ```bash cressent motif_disc -i sequences.fasta \ -o output/ \ --meme_extra -mod "zoops" -revcomp -dna ``` ### Grouped Analysis Analyze motifs by sequence groups: ```bash cressent motif -i sequences.fasta \ -p "PATTERN" \ --generate-logo \ --split-logo \ --metadata groups.csv \ --group-label "virus_family" \ --ncol 2 \ -o grouped_analysis/ ``` ## Best Practices ### Pattern Design 1. **Use IUPAC codes** for ambiguous positions 2. **Test patterns** on known sequences first 3. **Consider reverse complements** for DNA sequences ### De Novo Discovery 1. **Optimize motif number** based on sequence complexity 2. **Adjust width parameters** for expected motif sizes 3. **Use appropriate background models** for your sequence type ### Performance Tips 1. **Remove gaps** if not biologically relevant 2. **Filter sequences** by length or quality 3. **Use smaller datasets** for initial parameter testing ## Troubleshooting ### Common Issues **No motifs found:** - Check pattern syntax - Verify sequence format - Consider case sensitivity **MEME fails:** - Ensure sequences are aligned if needed - Check minimum sequence requirements - Verify MEME installation **ScanProsite timeout:** - Reduce sequence number - Check internet connection - Retry with rate limiting ### Error Messages The module provides detailed logging to help diagnose issues: ``` INFO - Processing sequence example_seq: nucleotide sequence (length: 2847) WARNING - Skipping sequence example_seq2: motif not found ERROR - MEME command failed: insufficient sequences ```