# Secondary Structure Detection

CRESSENT provides specialized tools for detecting and analyzing secondary structures in CRESS DNA viruses, focusing on two critical elements: stem-loop structures and iterons. These features are essential for understanding viral replication mechanisms and genomic organization.

```{image} _static/figures/fig_module_2ry_str.png
:width: 800
:class: no-scaled-link
:align: center
```

## Overview

The secondary structure module includes:
- **Stem-loop detection** using RNA folding algorithms and motif conservation
- **Iteron identification** with CRUISE integration for replication origins
- **Structural validation** through energy minimization and pattern recognition
- **Annotation output** in standard GFF3 format

## Stem-Loop Detection

### Purpose

Identify hairpin structures with conserved motifs, particularly important for viral replication and packaging signals.

### Basic Usage

```bash
cressent sl_finder -i sequences.fasta \
                  --gff_in annotations.gff \
                  --out_gff stemloops.gff \
                  --output sl_analysis/
```

### Algorithm Overview

The stem-loop finder uses a multi-step approach:

1. **Motif searching** for conserved sequences (e.g., nonanucleotides)
2. **RNA folding** using ViennaRNA package
3. **Structure parsing** to identify stem-loop patterns
4. **Validation** based on stem/loop length criteria
5. **Scoring** using structural and sequence features

### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--motif` | Conserved motif pattern | nantantan |
| `--family` | CRESS viral family | None |
| `--idealstemlen` | Ideal stem length | 11 |
| `--ideallooplen` | Ideal loop length | 11 |
| `--frame` | Bases around motif for folding | 15 |
| `--csv_out` | Output CSV filename | None |

### Viral Family Support

Pre-defined motifs for major CRESS virus families:

| Family | Motif Pattern |
|--------|---------------|
| Geminiviridae | TRAKATTRC |
| Circoviridae | TAGTATTAC |
| Cycloviridae | TAATATTAC |
| Genomoviridae | TAWWDHWAN |
| Smacoviridae | NAKWRTTAC |
| General | TAGTATTAC |

### Example Usage by Family

```bash
# Geminivirus analysis
cressent sl_finder -i gemini_genomes.fasta \
                  --gff_in annotations.gff \
                  --family geminiviridae \
                  --out_gff gemini_stemloops.gff \
                  --output gemini_analysis/

# Custom motif analysis
cressent sl_finder -i sequences.fasta \
                  --gff_in annotations.gff \
                  --motif "TAGTATTAC" \
                  --idealstemlen 12 \
                  --ideallooplen 8 \
                  --out_gff custom_stemloops.gff \
                  --output custom_analysis/
```

### Output Files

- **GFF annotations** (`stemloops.gff`): Stem-loop and nonanucleotide features
- **CSV results** (`results.csv`): Detailed structural information
- **Log file** (`sl_finder.log`): Processing details and statistics

### GFF Output Format

```gff3
##gff-version 3
sequence_01    sl_finder    stem_loop         150    201    .    +    .    Name=stem-loop
sequence_01    sl_finder    nonanucleotide    165    173    .    +    .    Name=nona
sequence_02    sl_finder    stem_loop         89     145    .    -    .    Name=stem-loop
sequence_02    sl_finder    nonanucleotide    110    118    .    -    .    Name=nona
```

### CSV Output Format


|   seqID   |matched    |motif_start    |   stem_start |   stem_end   |score  |   folded_structure
|:--------------|:------------|:-----------|:------------|:-------|:------------|:-------|
|   sequence_01|    TAGTATTAC|  165 |    150 |    201 |    8.5 |    (((((.......))))).....  |
|   sequence_02|    TAGTATTAC|  110 |   89  |    145  |    6.2  |   (((((........))))).....    |


## Iteron Detection

### Purpose

Identify iterons (direct repeats) that serve as replication origins in CRESS DNA viruses, using the CRUISE algorithm.

### Basic Usage

```bash
cressent run_cruise --input_fasta sequences.fasta \
                   --inputGFF annotations.gff \
                   --output cruise_analysis/ \
                   --outputGFF iterons.gff
```

### CRUISE Algorithm

CRUISE (CRUcivirus Iteron SEarch) implements:

1. **Substring enumeration** within specified length ranges
2. **Distance analysis** between repeat occurrences
3. **Stem-loop context** validation
4. **Scoring systems** based on repeat quality and distribution
5. **Known iteron** database comparison

### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--minLength` | Minimum iteron length | 5 |
| `--maxLength` | Maximum iteron length | 12 |
| `--range` | Search range around nonanucleotide | 65 |
| `--rank` | Use ranking system | True |
| `--numberTopIterons` | Number of top iterons to report | 5 |
| `--maxScore` | Maximum score for non-ranked mode | 40 |
| `--wiggle` | Length-distance tolerance | 5 |
| `--goodLength` | Optimal iteron length | 11 |
| `--maxDist` | Maximum distance between iterons | 20 |
| `--bestDist` | Optimal distance between iterons | 10 |

### Known Iterons Database

CRUISE includes a database of characterized iterons:

```python
# Examples of known iterons
TTGTCCAC  # RDHV
AGTGGGA   # Various circoviruses
GCCACCC   # Begomovirus
GGGGA     # Mastrevirus
TCTGA     # Curtovirus
```

### Advanced Parameters

```bash
cressent run_cruise --input_fasta sequences.fasta \
                   --inputGFF annotations.gff \
                   --output detailed_analysis/ \
                   --minLength 8 \
                   --maxLength 15 \
                   --range 100 \
                   --numberTopIterons 10 \
                   --maxDist 25 \
                   --bestDist 12 \
                   --scoreRange 30
```

### Output Files

- **Annotated GFF** (`iterons.gff`): Original GFF with iteron annotations
- **Processing log** (`cruise.log`): Detailed analysis log
- **Statistics summary**: Iteron discovery rates and patterns

### Iteron Annotation Types

The CRUISE module annotates several types of features:

| Feature Type | Description |
|--------------|-------------|
| `iteron` | Standard direct repeats |
| `stem_loop_repeats` | Iterons associated with stem-loops |
| `tagIteron` | Known/characterized iterons |
| `known_iteron` | Database-matched sequences |

## Combined Secondary Structure Analysis

### Integrated Workflow

```bash
# Step 1: Identify stem-loops
cressent sl_finder -i cress_genomes.fasta \
                  --gff_in base_annotations.gff \
                  --family circoviridae \
                  --out_gff stemloops.gff \
                  --csv_out stemloop_details.csv \
                  --output structure_analysis/

# Step 2: Find iterons in context
cressent run_cruise --input_fasta cress_genomes.fasta \
                   --inputGFF structure_analysis/stemloops.gff \
                   --output structure_analysis/ \
                   --outputGFF complete_structures.gff \
                   --minLength 6 \
                   --maxLength 14 \
                   --numberTopIterons 8
```

### Quality Control Pipeline

```bash
# Conservative stem-loop detection
cressent sl_finder -i sequences.fasta \
                  --gff_in annotations.gff \
                  --idealstemlen 10 \
                  --ideallooplen 9 \
                  --frame 20 \
                  --out_gff conservative_stemloops.gff \
                  --output qc_analysis/

# Comprehensive iteron search
cressent run_cruise --input_fasta sequences.fasta \
                   --inputGFF qc_analysis/conservative_stemloops.gff \
                   --output qc_analysis/ \
                   --minLength 4 \
                   --maxLength 16 \
                   --range 80 \
                   --numberTopIterons 15 \
                   --maxDist 30
```

## Validation and Quality Assessment

### Structural Validation

1. **Energy scoring** using ViennaRNA folding energies
2. **Length criteria** for biologically relevant structures
3. **Motif conservation** across related sequences
4. **Distance constraints** for iteron spacing

### Quality Metrics

```bash
# High-confidence stem-loops
- Stem length: 8-15 bp
- Loop length: 6-20 bp
- Folding score: < 10
- Motif match: exact or 1 mismatch

# Validated iterons
- Length: 6-12 bp
- Occurrences: ≥ 2
- Distance: 8-15 bp
- Context: within 65 bp of nonanucleotide
```

## Biological Interpretation

### Stem-Loop Functions

- **Replication origins** (often contain nonanucleotides)
- **Packaging signals** for viral DNA
- **Regulatory elements** for gene expression
- **Structural domains** in viral genomes

### Iteron Significance

- **Replication initiation** sites
- **Rep protein binding** domains
- **Copy number control** elements
- **Evolutionary markers** for virus classification

## Best Practices

### Input Preparation

1. **Quality sequences** with minimal gaps or ambiguities
2. **Proper annotations** in GFF3 format
3. **Family-specific parameters** when known
4. **Consistent naming** for sequence identifiers

### Parameter Optimization

1. **Family-appropriate motifs** for stem-loop detection
2. **Realistic length ranges** for iteron searches
3. **Contextual distances** based on genome organization
4. **Scoring thresholds** appropriate for data quality

### Validation Strategies

1. **Cross-reference** with known structures
2. **Phylogenetic consistency** across related sequences
3. **Experimental validation** when possible
4. **Literature comparison** for characterized viruses

## Troubleshooting

### Common Issues

**No stem-loops detected:**
- Verify motif pattern accuracy
- Check sequence quality and format
- Adjust folding parameters
- Consider alternative viral families

**Excessive iteron predictions:**
- Increase scoring stringency
- Reduce maximum distance parameters
- Use ranking mode for top candidates
- Filter by stem-loop context

**ViennaRNA errors:**
- Check sequence format (DNA/RNA)
- Verify installation completeness
- Monitor memory usage for large sequences
- Consider sequence length limitations

### Performance Optimization

**Large genomes:**
- Process in smaller segments
- Increase memory allocation
- Use parallel processing where available
- Consider preliminary filtering

**Parameter tuning:**
- Start with default values
- Adjust based on known biology
- Validate with characterized examples
- Document parameter choices

### Error Diagnostics

```bash
# Detailed logging output
INFO - Processing record: sequence_01
INFO - Found stem-loop at 150-201 in sequence_01
WARNING - No iterons found in sequence_02
ERROR - ViennaRNA folding failed: sequence too short
```