Motif Discovery
CRESSENT supports motif discovery through two complementary approaches: regex-based pattern matching for known motifs and de novo discovery with MEME. It also integrates with the Prosite database to functionally annotate discovered motifs.
Overview
The motif discovery module combines:
Pattern-based searching using regex patterns with seqkit
De novo motif discovery using MEME for finding unknown patterns
Functional annotation via ScanProsite for protein sequences
Visualization through sequence logos and motif maps
Regex-based Pattern Matching
Basic Usage
Search for specific motif patterns in your sequences:
cressent motif -i sequences.fasta -p "TAGTATTAC" -o output_dir
Key Features
Flexible pattern matching using regex syntax
Gap handling with optional gap removal
Position tracking with detailed coordinate information
Sequence splitting at motif positions
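The features listed above can be illustrated with a minimal Python sketch. This is not CRESSENT's implementation (the overview states that the tool relies on seqkit); the function name find_motif and its arguments are hypothetical, and the 1-based inclusive coordinates are an assumption about how positions might be reported.

```python
import re

def find_motif(seq, pattern, remove_gaps=False, ignore_case=False):
    """Return (start, end, matched_text) for each non-overlapping regex match.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    if remove_gaps:
        # Drop alignment gap characters so they cannot interrupt a motif.
        seq = seq.replace("-", "")
    flags = re.IGNORECASE if ignore_case else 0
    return [(m.start() + 1, m.end(), m.group())
            for m in re.finditer(pattern, seq, flags=flags)]

# A gap inside the motif hides it unless gaps are removed first.
print(find_motif("ACGTAG--TATTACGG", "TAGTATTAC"))                    # []
print(find_motif("ACGTAG--TATTACGG", "TAGTATTAC", remove_gaps=True))  # [(4, 12, 'TAGTATTAC')]
```

Note that removing gaps changes the coordinate system: positions then refer to the ungapped sequence, not to alignment columns.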
Parameters
| Parameter | Description | Default |
|---|---|---|
| -p | Sequence pattern (regex) for motif searching | Required |
| -n | Name of output table file | |
| | Remove gaps before searching | False |
| | Split sequences at motif positions | False |
Example: CRESS Virus Nonanucleotide
# Search for nonanucleotide motif in CRESS viruses
cressent motif -i cress_genomes.fasta \
    -p "TAGTATTAC" \
    -n nona_positions.txt \
    -o motif_analysis/
Output Files
Position table: Tab-delimited file with motif locations
Split sequences: Optional FASTA files split at motif positions
Log file: Detailed analysis log
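To make the position table and split sequences more concrete, here is a small sketch of how such outputs could be produced. The column layout (sequence ID, start, end, matched text) and the choice to cut immediately before each match are assumptions made for illustration; the actual CRESSENT output may differ. The helper names are hypothetical.

```python
import re

def split_at_motif(seq, pattern):
    """Cut seq immediately before each match start and return the pieces."""
    cuts = [m.start() for m in re.finditer(pattern, seq)]
    # A match at position 0 does not create an extra (empty) leading piece.
    bounds = [0] + [c for c in cuts if c > 0] + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

def position_rows(seq_id, seq, pattern):
    """Build tab-delimited rows: id, 1-based start, end, matched text (hypothetical columns)."""
    return ["\t".join([seq_id, str(m.start() + 1), str(m.end()), m.group()])
            for m in re.finditer(pattern, seq)]

genome = "GGGTAGTATTACCC"
print(split_at_motif(genome, "TAGTATTAC"))       # ['GGG', 'TAGTATTACCC']
print(position_rows("seq1", genome, "TAGTATTAC"))
```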
De Novo Motif Discovery
MEME Integration
Discover unknown motifs using the MEME suite:
cressent motif_disc -i sequences.fasta \
    -o meme_output/ \
    -nmotifs 3 \
    -minw 6 \
    -maxw 12
Parameters
| Parameter | Description | Default |
|---|---|---|
| -nmotifs | Number of motifs to find | 1 |
| -minw | Minimum motif width | 5 |
| -maxw | Maximum motif width | 10 |
| --meme_extra | Additional MEME arguments | None |
| --scanprosite | Run ScanProsite analysis | False |
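As a rough illustration of how these parameters could map onto a MEME command line, the sketch below assembles an argument list without running anything. How CRESSENT actually invokes MEME is not documented here, so build_meme_command is a hypothetical helper; the flags it emits (-oc, -nmotifs, -minw, -maxw) are standard MEME options, and the defaults mirror the table above.

```python
import shlex

def build_meme_command(fasta, out_dir, nmotifs=1, minw=5, maxw=10, extra=None):
    """Assemble a MEME argument list (a sketch; not CRESSENT's own code)."""
    if minw > maxw:
        raise ValueError("minw must not exceed maxw")
    cmd = ["meme", fasta, "-oc", out_dir,   # -oc: write to out_dir, overwriting it
           "-nmotifs", str(nmotifs),
           "-minw", str(minw),
           "-maxw", str(maxw)]
    cmd += list(extra or [])                # pass-through arguments, e.g. ["-dna", "-revcomp"]
    return cmd

cmd = build_meme_command("sequences.fasta", "meme_output",
                         nmotifs=3, minw=6, maxw=12, extra=["-dna", "-revcomp"])
print(shlex.join(cmd))
# meme sequences.fasta -oc meme_output -nmotifs 3 -minw 6 -maxw 12 -dna -revcomp
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later passed to subprocess.run.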
MEME Output Processing
The module automatically processes MEME results to generate:
Consensus table (consensus_table.csv)
Detailed motif table (motif_table.csv) with:
  Sequence IDs and matched regions
  Motif positions and orientations
  Regular expressions for each motif
EPS visualization files (organized in eps_files/)
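One way to think about the per-motif regular expressions is as a column-wise summary of the aligned motif sites. The sketch below uses a simplified rule (every residue observed in a column goes into a character class); MEME's own regular expressions apply a frequency threshold, so results may differ. The function name sites_to_regex is hypothetical.

```python
def sites_to_regex(sites):
    """Collapse equal-length motif sites into a regex, one column at a time."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    parts = []
    for column in zip(*sites):
        letters = sorted(set(column))
        # A conserved column stays a literal; a variable one becomes a character class.
        parts.append(letters[0] if len(letters) == 1 else "[" + "".join(letters) + "]")
    return "".join(parts)

print(sites_to_regex(["TAGTATTAC", "TAATATTAC", "TAGTGTTAC"]))  # TA[AG]T[AG]TTAC
```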
Example Output Structure
motif_discovery_output/
├── meme.html # MEME results webpage
├── meme.xml # Machine-readable results
├── consensus_table.csv # Motif consensus sequences
├── motif_table.csv # Detailed motif matches
└── eps_files/ # MEME visualization files
    ├── logo1.eps
    └── logo2.eps
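A quick sanity check after a run is to confirm that the expected files are present. The sketch below only uses the file names shown in the tree above; the helper missing_outputs is hypothetical and not part of CRESSENT.

```python
from pathlib import Path

# Names taken from the example output structure above.
EXPECTED_OUTPUTS = ["meme.html", "meme.xml", "consensus_table.csv",
                    "motif_table.csv", "eps_files"]

def missing_outputs(out_dir):
    """Return the expected output names that do not exist under out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).exists()]

missing = missing_outputs("motif_discovery_output")
print("All outputs present" if not missing else f"Missing: {missing}")
```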
ScanProsite Integration
Functional Annotation
For protein sequences, automatically annotate motifs with known functions:
cressent motif_disc -i proteins.fasta \
    -o output/ \
    --scanprosite
Features
Automatic sequence type detection (DNA vs protein)
Prosite database querying for functional annotations
Rate limiting to respect server resources
Comprehensive results with functional descriptions
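Automatic sequence type detection is typically a composition heuristic. The sketch below is one plausible version, not CRESSENT's actual rule: it calls a sequence nucleotide when at least 90% of its letters are A, C, G, T, U, or N. The function name and the threshold are assumptions.

```python
def guess_sequence_type(seq, threshold=0.9):
    """Classify a sequence as 'nucleotide' or 'protein' from its letter composition."""
    letters = [c for c in seq.upper() if c.isalpha()]  # ignores gaps and stop symbols
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    return "nucleotide" if nucleotide_like / len(letters) >= threshold else "protein"

print(guess_sequence_type("TAGTATTACGGC"))     # nucleotide
print(guess_sequence_type("MKVLAAGIVGLLLAQ"))  # protein
```

Short peptides composed mostly of A, C, G, and T residues would be misclassified by any heuristic of this kind, so very short inputs deserve a manual check.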
ScanProsite Output
Results table (scanprosite_results.csv) with:
  Signature accessions and descriptions
  Pattern positions and scores
  Functional annotations
Combined Workflow
Complete Motif Analysis
# Step 1: Search for known patterns
cressent motif -i sequences.fasta \
    -p "TAGTATTAC" \
    --generate-logo \
    -o analysis/

# Step 2: Discover novel motifs
cressent motif_disc -i sequences.fasta \
    -o analysis/meme/ \
    -nmotifs 5 \
    -minw 8 \
    -maxw 15

# Step 3: Functional annotation (for proteins)
cressent motif_disc -i proteins.fasta \
    -o analysis/prosite/ \
    --scanprosite
Visualization Integration
Sequence Logos
Generate publication-ready sequence logos:
cressent motif -i sequences.fasta \
    -p "MOTIF_PATTERN" \
    --generate-logo \
    --plot-title "CRESS Nonanucleotide" \
    --width 12 \
    --height 8 \
    -o output/
Motif Mapping
Create genome-wide motif distribution maps:
cressent motif_map_viz -f motif_table.csv \
    -o visualization/ \
    --format auto
Advanced Features
Custom MEME Parameters
Pass additional arguments through to MEME with --meme_extra:
cressent motif_disc -i sequences.fasta \
    -o output/ \
    --meme_extra -mod "zoops" -revcomp -dna
Grouped Analysis
Analyze motifs by sequence groups:
cressent motif -i sequences.fasta \
    -p "PATTERN" \
    --generate-logo \
    --split-logo \
    --metadata groups.csv \
    --group-label "virus_family" \
    --ncol 2 \
    -o grouped_analysis/
Best Practices
Pattern Design
Use IUPAC codes for ambiguous positions
Test patterns on known sequences first
Consider reverse complements for DNA sequences
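The first and third tips can be combined when preparing a pattern by hand. The sketch below expands IUPAC nucleotide codes into regex character classes and computes the reverse complement of a motif. It is a standalone helper under the assumption that the pattern is ultimately matched as a plain regex; whether CRESSENT expands IUPAC codes itself is not stated here, and the function names are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC_TO_CLASS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def iupac_to_regex(motif):
    """Expand each IUPAC code into a character class (raises KeyError on unknown codes)."""
    return "".join(IUPAC_TO_CLASS[base] for base in motif.upper())

def reverse_complement(motif):
    """Reverse-complement a DNA motif, preserving IUPAC ambiguity codes."""
    return motif.upper().translate(COMPLEMENT)[::-1]

print(iupac_to_regex("TAGTRYTAC"))                      # TAGT[AG][CT]TAC
print(reverse_complement("TAGTATTAC"))                  # GTAATACTA
print(iupac_to_regex(reverse_complement("TAGTRYTAC")))  # pattern for the opposite strand
```

Searching with both the forward pattern and the reverse-complement pattern covers motifs on either strand.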
De Novo Discovery
Choose the number of motifs (-nmotifs) based on sequence complexity
Set the width range (-minw, -maxw) to bracket the expected motif size
Use appropriate background models for your sequence type
Performance Tips
Remove gaps if not biologically relevant
Filter sequences by length or quality
Use smaller datasets for initial parameter testing
Troubleshooting
Common Issues
No motifs found:
Check pattern syntax
Verify sequence format
Consider case sensitivity
MEME fails:
Ensure sequences are aligned if needed
Check minimum sequence requirements
Verify MEME installation
ScanProsite timeout:
Reduce the number of sequences
Check internet connection
Retry with rate limiting
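If you script around a remote service yourself, retrying with an increasing pause is a common pattern for timeouts. The sketch below is generic: call_with_retries is a hypothetical helper, the request argument stands in for whatever function performs the query, and the delay values are illustrative rather than CRESSENT's actual rate limits.

```python
import time

def call_with_retries(request, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request(); on timeout or connection errors wait and retry with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return request()
        except (TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x ... the base delay
    raise last_error

# Usage with a stand-in request that fails twice before succeeding.
state = {"calls": 0}
def flaky_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("server did not respond")
    return "ok"

print(call_with_retries(flaky_request, base_delay=0.01))  # ok
```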
Error Messages
The module provides detailed logging to help diagnose issues:
INFO - Processing sequence example_seq: nucleotide sequence (length: 2847)
WARNING - Skipping sequence example_seq2: motif not found
ERROR - MEME command failed: insufficient sequences
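Log lines in the "LEVEL - message" form shown above are easy to summarize with a short script. The sketch below assumes that this form holds for every line of interest and tolerates an arbitrary prefix (such as a timestamp) before the level; the real log layout may include more fields. The function name summarize_log is hypothetical.

```python
import re
from collections import Counter

# Matches "LEVEL - message", allowing any prefix before the level.
LOG_LINE = re.compile(r"\b(?P<level>INFO|WARNING|ERROR) - (?P<message>.*)$")

def summarize_log(lines):
    """Count lines per level and collect the WARNING and ERROR messages."""
    counts = Counter()
    problems = []
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        counts[match["level"]] += 1
        if match["level"] != "INFO":
            problems.append((match["level"], match["message"]))
    return counts, problems

log = [
    "INFO - Processing sequence example_seq: nucleotide sequence (length: 2847)",
    "WARNING - Skipping sequence example_seq2: motif not found",
    "ERROR - MEME command failed: insufficient sequences",
]
counts, problems = summarize_log(log)
print(dict(counts))
for level, message in problems:
    print(f"{level}: {message}")
```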