Recombination Detection Module

The recombination module detects recombination events in DNA sequences using multiple computational methods integrated through OpenRDP. It provides comprehensive recombination analysis using up to seven different detection algorithms.

Overview

Recombination is a crucial evolutionary process in viral genomes that can:

Generate genetic diversity
Create new viral strains
Influence host range evolution
Affect vaccine and therapeutic development

The recombination module implements multiple detection methods to ensure robust identification of recombination events.

Detection Methods

CRESSENT integrates seven recombination detection methods:

Primary Methods

RDP: Identifies recombination by detecting unusual phylogenetic relationships
GENECONV: Detects gene conversion events using statistical analysis
Bootscan: Uses bootstrap values to identify recombination breakpoints
MaxChi: Maximum chi-square method for breakpoint detection
Chimaera: Detects recombination using multiple reference sequences
3Seq: Three-sequence method for recombination detection
Siscan: Sister-scanning method for recombination identification

Method Reliability

Events detected by multiple methods (≥3) are considered highly reliable:

Algorithm Workflow

Sequence Preprocessing: Validates alignment quality and sequence integrity
Method Execution: Runs selected recombination detection algorithms
Statistical Analysis: Calculates p-values for detected events
Result Integration: Combines outputs from multiple methods
Significance Filtering: Applies statistical thresholds for reliable detection

Parameters

.ini configuration file defines the default OpenRDP parameters used by CRESSENT’s recombination detection module.

General Settings

circular_genome = False — ssDNA viruses often have circular genomes, but alignments are linearized before recombination scanning to avoid false breakpoints at the artificial sequence junctions. This flag disables circular wraparound scanning by default.
comparison_correction = bonferroni — applies Bonferroni multiple-testing correction to maintain conservative significance thresholds across many pairwise tests, minimizing false positives in small viral datasets.

Permutation Options

num_permutations = 0 — disables permutation-based empirical p-value estimation, favoring analytical p-values for faster runtime on large datasets. Users can increase this number for stricter validation when small sample sizes permit.

Data Processing

min_num_detecting_events = 1 — requires that at least one independent method support a recombination event for it to be reported. Raising this threshold (e.g., 2 or 3) increases confidence but reduces sensitivity.

RDP Method

max_pvalue = 0.05 — classical significance threshold.
window_size = 30 — scans triplets of sequences over 30-nt windows, appropriate for short ssDNA genomes (≈1–3 kb).
min_identity / max_identity = 0–100 — allows detection across full diversity range rather than limiting to closely related sequences.
reference_sequence = None — indicates that all sequences are treated symmetrically (no fixed “reference” genome).

GENECONV Method

Detects unusually long identical fragments between sequences.
indels_as_polymorphisms = True — treats small indels as informative events rather than missing data.
mismatch_penalty = 1, min_len = 1, min_poly = 2, min_score = 2 — set liberal thresholds to detect short conversion tracts common in compact viral genomes.
max_num = 1 — limits redundant event reporting for the same region.

Bootscan Method

Performs sliding-window phylogenetic reconstruction.
win_size = 200, step_size = 20 — windows of 200 bp shifted every 20 bp balance signal strength and resolution for 1–3 kb genomes.
num_replicates = 100 — bootstrap replicates per window.
cutoff_percentage = 0.7 — requires ≥70 % bootstrap support to accept a topology switch.
model = Jukes–Cantor — simplest substitution model suitable for short alignments with limited divergence.
p_value_calculation = binomial — uses binomial significance testing for breakpoint validation.

MaxChi and Chimaera Methods

Both use chi-square tests on substitution patterns.
win_size = 100–200 and num_var_sites = 60–70 — define the number of polymorphic sites per window.
strip_gaps = False — retains indel positions since compact viral genomes often contain informative indels.
max_pvalue = 0.05 — standard significance cutoff.

SiScan Method

Uses similarity profiles between sequences to detect topological shifts.
win_size = 200, step_size = 20 — same window scheme as Bootscan for consistent resolution.
pvalue_perm_num = 1100, scan_perm_num = 100 — defines permutation counts for empirical p-value estimation and scanning.
strip_gaps = True — removes gap-rich regions to avoid spurious similarity spikes.
fourth_seq_sel = outlier — uses the most divergent sequence as the outgroup for normalization, which improves detection sensitivity across heterogeneous viral families.

Usage

Run All Methods

Detect recombination using all available methods:

cressent recombination \
    -i sequences.fasta \
    -o output/recombination \
    -f recombination_results.csv \
    --all

Run Specific Methods

Run selected methods for targeted analysis:

cressent recombination \
    -i sequences.fasta \
    -o output/recombination \
    -f recombination_results.csv \
    -rdp -bootscan -maxchi

Custom Configuration

Use custom parameters via configuration file:

cressent recombination \
    -i sequences.fasta \
    -o output/recombination \
    -f recombination_results.csv \
    -c custom_config.ini \
    --all

Parameters

Required Parameters

-i, --input: Input alignment file in FASTA format
-o, --output: Output directory for results
-f, --output_file: Output CSV file name for results

Method Selection

-rdp: Run RDP method
-threeseq: Run 3Seq method
-geneconv: Run GENECONV method
-maxchi: Run MaxChi method
-chimaera: Run Chimaera method
-bootscan: Run Bootscan method
-siscan: Run Siscan method
-all: Run all available methods

Optional Parameters

-c, --config: Configuration file for method parameters
-quiet: Suppress console output
-verbose: Enable detailed logging

Output Format

The recombination analysis produces a comprehensive CSV file with the following structure:

Column	Description
Method	Detection method used
Recombinant	Sequence identified as recombinant
Major_Parent	Predicted major parent sequence
Minor_Parent	Predicted minor parent sequence
Breakpoint_Start	Start position of recombination region
Breakpoint_End	End position of recombination region
Pvalue	Statistical significance of detection
Multiple_Comparisons	Corrected p-value

Example Output

Method	Recombinant	Major_Parent	Minor_Parent	Breakpoint_Start	Breakpoint_End	Pvalue	Multiple_Comparisons
RDP	Sequence_A	Sequence_B	Sequence_C	245	678	0.0023	0.0156
Bootscan	Sequence_A	Sequence_B	Sequence_C	240	685	0.0034	0.0204
MaxChi	Sequence_A	Sequence_B	Sequence_C	250	670	0.0019	0.0133

Result Interpretation

Statistical Significance

P-value < 0.05: Statistically significant recombination event
P-value < 0.01: Highly significant recombination event
Multiple methods: Events detected by ≥3 methods are most reliable

Breakpoint Analysis

Examine breakpoint positions to understand:

Recombination hotspots: Regions with frequent breakpoints
Functional domains: Impact on protein function
Phylogenetic implications: Effect on evolutionary relationships

Data Analysis Workflow

Python Analysis Example

import pandas as pd
import matplotlib.pyplot as plt

# Load recombination results
df = pd.read_csv("recombination_results.csv")

# Filter significant events
significant = df[df['Pvalue'] < 0.05]

# Count methods per recombinant
method_counts = significant.groupby('Recombinant')['Method'].value_counts().unstack(fill_value=0)
method_counts['Total_Methods'] = method_counts.sum(axis=1)

# Identify reliable events (≥3 methods)
reliable_events = method_counts[method_counts['Total_Methods'] >= 3]

print("Reliable recombination events:")
print(reliable_events)

# Plot breakpoint distribution
plt.figure(figsize=(10, 6))
plt.hist(significant['Breakpoint_Start'], bins=20, alpha=0.7)
plt.xlabel('Breakpoint Position')
plt.ylabel('Frequency')
plt.title('Distribution of Recombination Breakpoints')
plt.show()

Integration with Phylogenetic Analysis

Recombination detection should be performed before phylogenetic analysis:

# 1. Detect recombination
cressent recombination \
    -i sequences.fasta \
    -o recombination/ \
    -f recomb_results.csv \
    --all

# 2. Align sequences
cressent align \
    --input_fasta recombination/cleaned_sequences.fasta \
    -o alignment/

# 3. Analyze results and potentially remove recombinant sequences
# 4. Build phylogenetic trees with cleaned dataset
cressent build_tree \
    -i alignment/cleaned_alignment.fasta \
    -o phylogeny/

Best Practices

Input Preparation

High-quality alignment: Ensure proper sequence alignment before analysis
Sufficient diversity: Include adequate sequence diversity for detection
Appropriate length: Sequences should be long enough to detect meaningful events

Method Selection

All methods: Use all methods for comprehensive analysis
Cross-validation: Require detection by multiple methods for reliability
Statistical thresholds: Apply appropriate p-value cutoffs

Result Validation

Manual inspection: Examine alignments around detected breakpoints
Phylogenetic analysis: Compare trees before/after recombinant removal
Functional analysis: Consider impact on protein function

Common Applications

Viral Evolution Studies

Identify recombination hotspots in viral genomes
Track recombination patterns across viral families
Study impact on virulence and host adaptation

Outbreak Investigation

Trace recombination events in epidemic strains
Identify parent strains in recombinant viruses
Understand transmission dynamics

Vaccine Development

Identify stable genomic regions for vaccine targets
Assess recombination risk in vaccine strains
Monitor vaccine escape variants

Troubleshooting

Common Issues

No Events Detected Check alignment quality and sequence diversity. Ensure adequate evolutionary distance.

Binary Compilation Errors The module automatically compiles required binaries. Check system compatibility and dependencies.

Memory Issues Reduce dataset size or increase available memory. Some methods are computationally intensive.

Statistical Significance Adjust p-value thresholds based on study requirements and multiple testing considerations.

Performance Considerations

Dataset size: Larger datasets require more computational time
Sequence length: Longer sequences provide more power but increase runtime
Method selection: Running all methods increases accuracy but computational cost
Parallel processing: Some methods can utilize multiple CPU cores

Example Complete Analysis

#!/bin/bash

# Complete recombination analysis workflow
echo "Starting recombination analysis..."

# 1. Prepare alignment
cressent align \
    --threads 24 \
    --input_fasta viral_genomes.fasta \
    -o analysis/alignment

# 2. Run comprehensive recombination detection
cressent recombination \
    -i analysis/alignment/viral_genomes_aligned_trimmed_sequences.fasta \
    -o analysis/recombination \
    -f comprehensive_recombination.csv \
    --all \
    --verbose

# 3. Generate summary report
echo "Analysis complete. Results in analysis/recombination/"
echo "Check comprehensive_recombination.csv for detailed results"