Motif Discovery

CRESSENT provides comprehensive motif discovery capabilities through two complementary approaches: regex-based pattern matching for known motifs and de novo motif discovery using MEME. Additionally, it integrates with the Prosite database for functional annotation of discovered motifs.

_images/fig_module_phylogenetic.png

Overview

The motif discovery module combines:

  • Pattern-based searching using regex patterns with seqkit

  • De novo motif discovery using MEME for finding unknown patterns

  • Functional annotation via ScanProsite for protein sequences

  • Visualization through sequence logos and motif maps

Regex-based Pattern Matching

Basic Usage

Search for specific motif patterns in your sequences:

cressent motif -i sequences.fasta -p "TAGTATTAC" -o output_dir

Key Features

  • Flexible pattern matching using regex syntax

  • Gap handling with optional gap removal

  • Position tracking with detailed coordinate information

  • Sequence splitting at motif positions

Parameters

Parameter

Description

Default

-p, --pattern

Sequence pattern (regex) for motif searching

Required

-n, --table_name

Name of output table file

pattern_positions.txt

--remove-gaps

Remove gaps before searching

False

--split-sequences

Split sequences at motif positions

False

Example: CRESS Virus Nonanucleotide

# Search for nonanucleotide motif in CRESS viruses
cressent motif -i cress_genomes.fasta \
               -p "TAGTATTAC" \
               -n nona_positions.txt \
               -o motif_analysis/

Output Files

  • Position table: Tab-delimited file with motif locations

  • Split sequences: Optional FASTA files split at motif positions

  • Log file: Detailed analysis log

De Novo Motif Discovery

MEME Integration

Discover unknown motifs using the MEME suite:

cressent motif_disc -i sequences.fasta \
                        -o meme_output/ \
                        -nmotifs 3 \
                        -minw 6 \
                        -maxw 12

Parameters

Parameter

Description

Default

-nmotifs

Number of motifs to find

1

-minw

Minimum motif width

5

-maxw

Maximum motif width

10

--meme_extra

Additional MEME arguments

None

--scanprosite

Run ScanProsite analysis

False

MEME Output Processing

The module automatically processes MEME results to generate:

  1. Consensus table (consensus_table.csv)

  2. Detailed motif table (motif_table.csv) with:

    • Sequence IDs and matched regions

    • Motif positions and orientations

    • Regular expressions for each motif

  3. EPS visualization files (organized in eps_files/)

Example Output Structure

motif_discovery_output/
├── meme.html                 # MEME results webpage
├── meme.xml                  # Machine-readable results
├── consensus_table.csv       # Motif consensus sequences
├── motif_table.csv          # Detailed motif matches
└── eps_files/               # MEME visualization files
    ├── logo1.eps
    └── logo2.eps

ScanProsite Integration

Functional Annotation

For protein sequences, automatically annotate motifs with known functions:

cressent motif_disc -i proteins.fasta \
                        -o output/ \
                        --scanprosite

Features

  • Automatic sequence type detection (DNA vs protein)

  • Prosite database querying for functional annotations

  • Rate limiting to respect server resources

  • Comprehensive results with functional descriptions

ScanProsite Output

  • Results table (scanprosite_results.csv) with:

    • Signature accessions and descriptions

    • Pattern positions and scores

    • Functional annotations

Combined Workflow

Complete Motif Analysis

# Step 1: Search for known patterns
cressent motif -i sequences.fasta \
               -p "TAGTATTAC" \
               --generate-logo \
               -o analysis/

# Step 2: Discover novel motifs
cressent motif_disc -i sequences.fasta \
                        -o analysis/meme/ \
                        -nmotifs 5 \
                        -minw 8 \
                        -maxw 15

# Step 3: Functional annotation (for proteins)
cressent motif_disc -i proteins.fasta \
                        -o analysis/prosite/ \
                        --scanprosite

Visualization Integration

Sequence Logos

Generate publication-ready sequence logos:

cressent motif -i sequences.fasta \
               -p "MOTIF_PATTERN" \
               --generate-logo \
               --plot-title "CRESS Nonanucleotide" \
               --width 12 \
               --height 8 \
               -o output/

Motif Mapping

Create genome-wide motif distribution maps:

cressent motif_map_viz -f motif_table.csv \
                      -o visualization/ \
                      --format auto

Advanced Features

Custom MEME Parameters

cressent motif_disc -i sequences.fasta \
                        -o output/ \
                        --meme_extra -mod "zoops" -revcomp -dna

Grouped Analysis

Analyze motifs by sequence groups:

cressent motif -i sequences.fasta \
               -p "PATTERN" \
               --generate-logo \
               --split-logo \
               --metadata groups.csv \
               --group-label "virus_family" \
               --ncol 2 \
               -o grouped_analysis/

Best Practices

Pattern Design

  1. Use IUPAC codes for ambiguous positions

  2. Test patterns on known sequences first

  3. Consider reverse complements for DNA sequences

De Novo Discovery

  1. Optimize motif number based on sequence complexity

  2. Adjust width parameters for expected motif sizes

  3. Use appropriate background models for your sequence type

Performance Tips

  1. Remove gaps if not biologically relevant

  2. Filter sequences by length or quality

  3. Use smaller datasets for initial parameter testing

Troubleshooting

Common Issues

No motifs found:

  • Check pattern syntax

  • Verify sequence format

  • Consider case sensitivity

MEME fails:

  • Ensure sequences are aligned if needed

  • Check minimum sequence requirements

  • Verify MEME installation

ScanProsite timeout:

  • Reduce sequence number

  • Check internet connection

  • Retry with rate limiting

Error Messages

The module provides detailed logging to help diagnose issues:

INFO - Processing sequence example_seq: nucleotide sequence (length: 2847)
WARNING - Skipping sequence example_seq2: motif not found
ERROR - MEME command failed: insufficient sequences