Recombination Detection Module

The recombination module detects recombination events in DNA sequences using multiple computational methods integrated through OpenRDP. It provides comprehensive recombination analysis using up to seven different detection algorithms.

_images/fig_module_recomb_detection.png

Overview

Recombination is a crucial evolutionary process in viral genomes that can:

  • Generate genetic diversity

  • Create new viral strains

  • Influence host range evolution

  • Affect vaccine and therapeutic development

The recombination module implements multiple detection methods to ensure robust identification of recombination events.

Detection Methods

CRESSENT integrates seven recombination detection methods:

Primary Methods

  • RDP: Identifies recombination by detecting unusual phylogenetic relationships

  • GENECONV: Detects gene conversion events using statistical analysis

  • Bootscan: Uses bootstrap values to identify recombination breakpoints

  • MaxChi: Maximum chi-square method for breakpoint detection

  • Chimaera: Variant of the MaxChi approach that applies chi-square tests to sequence triplets

  • 3Seq: Three-sequence method for recombination detection

  • Siscan: Sister-scanning method for recombination identification

Method Reliability

Events detected by multiple methods (≥3) are considered highly reliable.

Algorithm Workflow

  1. Sequence Preprocessing: Validates alignment quality and sequence integrity

  2. Method Execution: Runs selected recombination detection algorithms

  3. Statistical Analysis: Calculates p-values for detected events

  4. Result Integration: Combines outputs from multiple methods

  5. Significance Filtering: Applies statistical thresholds for reliable detection

Parameters

The .ini configuration file defines the default OpenRDP parameters used by CRESSENT’s recombination detection module.

General Settings

  • circular_genome = False — ssDNA viruses often have circular genomes, but alignments are linearized before recombination scanning to avoid false breakpoints at the artificial sequence junctions. This flag disables circular wraparound scanning by default.

  • comparison_correction = bonferroni — applies Bonferroni multiple-testing correction to maintain conservative significance thresholds across many pairwise tests, minimizing false positives in small viral datasets.
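
As a rough illustration of what the bonferroni setting implies, the sketch below multiplies each raw p-value by the number of tests and caps the result at 1. The function name and the example p-values are hypothetical, and how OpenRDP counts the number of comparisons internally is not covered here.

```python
def bonferroni(p_values, num_tests=None):
    """Return Bonferroni-adjusted p-values (raw p times number of tests, capped at 1)."""
    if num_tests is None:
        num_tests = len(p_values)
    return [min(1.0, p * num_tests) for p in p_values]


# Hypothetical raw p-values from 4 pairwise tests
raw = [0.001, 0.02, 0.04, 0.5]
adjusted = bonferroni(raw)
significant = [p <= 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

With four tests, only the first p-value stays below 0.05 after correction, which shows why the correction becomes more conservative as the number of compared sequences grows.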

Permutation Options

  • num_permutations = 0 — disables permutation-based empirical p-value estimation, favoring analytical p-values for faster runtime on large datasets. Users can increase this number for stricter validation when small sample sizes permit.
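
For readers unfamiliar with permutation-based p-values, the generic estimator can be sketched as follows. This is not OpenRDP's code: the function name is hypothetical, and the random numbers merely stand in for statistics that would be recomputed on shuffled alignments.

```python
import random


def permutation_p_value(observed, permuted_stats):
    """Empirical p-value: share of permuted statistics at least as large as the observed one.

    The +1 in numerator and denominator counts the observed statistic itself,
    so the estimate is never exactly zero.
    """
    exceed = sum(1 for s in permuted_stats if s >= observed)
    return (exceed + 1) / (len(permuted_stats) + 1)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
null_stats = [rng.random() for _ in range(999)]  # stand-in for statistics from shuffled data
p = permutation_p_value(2.0, null_stats)  # observed value exceeds every null statistic
print(p)
```

The smallest p-value this estimator can return is 1 / (number of permutations + 1), so the permutation count bounds the attainable significance as well as the runtime.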

Data Processing

  • min_num_detecting_events = 1 — requires that at least one independent method support a recombination event for it to be reported. Raising this threshold (e.g., 2 or 3) increases confidence but reduces sensitivity.

RDP Method

  • max_pvalue = 0.05 — classical significance threshold.

  • window_size = 30 — scans triplets of sequences over 30-nt windows, appropriate for short ssDNA genomes (≈1–3 kb).

  • min_identity / max_identity = 0–100 — allows detection across full diversity range rather than limiting to closely related sequences.

  • reference_sequence = None — indicates that all sequences are treated symmetrically (no fixed “reference” genome).

GENECONV Method

  • Detects unusually long identical fragments between sequences.

  • indels_as_polymorphisms = True — treats small indels as informative events rather than missing data.

  • mismatch_penalty = 1, min_len = 1, min_poly = 2, min_score = 2 — set liberal thresholds to detect short conversion tracts common in compact viral genomes.

  • max_num = 1 — limits redundant event reporting for the same region.

Bootscan Method

  • Performs sliding-window phylogenetic reconstruction.

  • win_size = 200, step_size = 20 — windows of 200 bp shifted every 20 bp balance signal strength and resolution for 1–3 kb genomes.

  • num_replicates = 100 — bootstrap replicates per window.

  • cutoff_percentage = 0.7 — requires ≥70 % bootstrap support to accept a topology switch.

  • model = Jukes–Cantor — simplest substitution model suitable for short alignments with limited divergence.

  • p_value_calculation = binomial — uses binomial significance testing for breakpoint validation.
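
To make the window scheme concrete, the sketch below enumerates window coordinates for win_size = 200 and step_size = 20. The helper name is hypothetical, and it assumes only windows that fit entirely inside a linear alignment are scanned; OpenRDP's handling of the alignment ends may differ.

```python
def sliding_windows(alignment_length, win_size=200, step_size=20):
    """Yield (start, end) half-open coordinates of windows that fit fully inside the alignment."""
    for start in range(0, alignment_length - win_size + 1, step_size):
        yield start, start + win_size


windows = list(sliding_windows(2000))
print(len(windows), windows[0], windows[-1])
```

Under these assumptions a 2,000-column alignment yields 91 windows, each of which would need its own set of bootstrap replicates.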

MaxChi and Chimaera Methods

  • Both use chi-square tests on substitution patterns.

  • win_size = 100–200 and num_var_sites = 60–70 — set the scanning window size and the number of variable (polymorphic) sites per window.

  • strip_gaps = False — retains indel positions since compact viral genomes often contain informative indels.

  • max_pvalue = 0.05 — standard significance cutoff.
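
The chi-square idea behind these methods can be sketched for a single pair of sequences. This is a simplified illustration, not the OpenRDP implementation: it compares all aligned sites rather than only variable sites, uses a fixed half-window, and the function names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so the table carries no signal
    return n * (a * d - b * c) ** 2 / denom


def maxchi_scan(seq1, seq2, half_window):
    """Slide a split window along two aligned sequences and return (position, chi2) at the peak."""
    matches = [x == y for x, y in zip(seq1, seq2)]
    best = (None, 0.0)
    for mid in range(half_window, len(matches) - half_window + 1):
        left = matches[mid - half_window:mid]
        right = matches[mid:mid + half_window]
        # Rows: left and right half of the window; columns: matching and mismatching sites
        chi2 = chi_square_2x2(sum(left), len(left) - sum(left),
                              sum(right), len(right) - sum(right))
        if chi2 > best[1]:
            best = (mid, chi2)
    return best


# Toy pair: identical for the first 20 sites, completely different for the last 20
seq1 = "A" * 40
seq2 = "A" * 20 + "C" * 20
position, chi2 = maxchi_scan(seq1, seq2, half_window=10)
print(position, chi2)
```

The statistic peaks where the proportion of matching sites differs most between the two halves of the window, which is the candidate breakpoint.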

SiScan Method

  • Uses similarity profiles between sequences to detect topological shifts.

  • win_size = 200, step_size = 20 — same window scheme as Bootscan for consistent resolution.

  • pvalue_perm_num = 1100, scan_perm_num = 100 — defines permutation counts for empirical p-value estimation and scanning.

  • strip_gaps = True — removes gap-rich regions to avoid spurious similarity spikes.

  • fourth_seq_sel = outlier — uses the most divergent sequence as the outgroup for normalization, which improves detection sensitivity across heterogeneous viral families.

Usage

Run All Methods

Detect recombination using all available methods:

cressent recombination \
    -i sequences.fasta \
    -o output/recombination \
    -f recombination_results.csv \
    --all

Run Specific Methods

Run selected methods for targeted analysis:

cressent recombination \
    -i sequences.fasta \
    -o output/recombination \
    -f recombination_results.csv \
    -rdp -bootscan -maxchi

Custom Configuration

Use custom parameters via configuration file:

cressent recombination \
    -i sequences.fasta \
    -o output/recombination \
    -f recombination_results.csv \
    -c custom_config.ini \
    --all
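
One way to prepare such a file is to load the default configuration, override a few values, and write the result under a new name. The sketch below uses Python's standard configparser; the section names (MethodA, MethodB) and file names in the comments are placeholders rather than OpenRDP's actual layout, so the helper looks the key up in whichever sections already define it.

```python
import configparser
import io


def override_option(config, key, value):
    """Set `key` to `value` in every section that already defines it; return the section names."""
    touched = []
    for section in config.sections():
        if config.has_option(section, key):
            config.set(section, key, str(value))
            touched.append(section)
    return touched


# Stand-in for the default .ini; the section names are placeholders, not OpenRDP's
default_ini = """
[MethodA]
max_pvalue = 0.05
win_size = 200

[MethodB]
max_pvalue = 0.05
"""

config = configparser.ConfigParser()
config.read_string(default_ini)  # for a real file: config.read("default_config.ini")
changed = override_option(config, "max_pvalue", 0.01)

buffer = io.StringIO()
config.write(buffer)  # for a real file: open("custom_config.ini", "w") and config.write(fh)
print(changed)
print(buffer.getvalue())
```

Note that configparser does not preserve comments when writing, so any annotations in the default file are lost in the copy.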

Command-Line Parameters

Required Parameters

  • -i, --input: Input alignment file in FASTA format

  • -o, --output: Output directory for results

  • -f, --output_file: Output CSV file name for results

Method Selection

  • -rdp: Run RDP method

  • -threeseq: Run 3Seq method

  • -geneconv: Run GENECONV method

  • -maxchi: Run MaxChi method

  • -chimaera: Run Chimaera method

  • -bootscan: Run Bootscan method

  • -siscan: Run Siscan method

  • -all: Run all available methods

Optional Parameters

  • -c, --config: Configuration file for method parameters

  • -quiet: Suppress console output

  • -verbose: Enable detailed logging

Output Format

The recombination analysis produces a comprehensive CSV file with the following structure:

Column                Description
--------------------  ---------------------------------------
Method                Detection method used
Recombinant           Sequence identified as recombinant
Major_Parent          Predicted major parent sequence
Minor_Parent          Predicted minor parent sequence
Breakpoint_Start      Start position of recombination region
Breakpoint_End        End position of recombination region
Pvalue                Statistical significance of detection
Multiple_Comparisons  Corrected p-value

Example Output

Method    Recombinant  Major_Parent  Minor_Parent  Breakpoint_Start  Breakpoint_End  Pvalue  Multiple_Comparisons
--------  -----------  ------------  ------------  ----------------  --------------  ------  --------------------
RDP       Sequence_A   Sequence_B    Sequence_C    245               678             0.0023  0.0156
Bootscan  Sequence_A   Sequence_B    Sequence_C    240               685             0.0034  0.0204
MaxChi    Sequence_A   Sequence_B    Sequence_C    250               670             0.0019  0.0133
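
When several methods report the same event with slightly different coordinates, as in the three example rows, it can help to collapse them into one consensus entry. The sketch below does this with the standard library only. It assumes a comma-delimited file whose header matches the column names listed above, and the choice of the median as the consensus position is an illustrative one, not something the module prescribes.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# The three example rows, written as CSV text
example_csv = """Method,Recombinant,Major_Parent,Minor_Parent,Breakpoint_Start,Breakpoint_End,Pvalue,Multiple_Comparisons
RDP,Sequence_A,Sequence_B,Sequence_C,245,678,0.0023,0.0156
Bootscan,Sequence_A,Sequence_B,Sequence_C,240,685,0.0034,0.0204
MaxChi,Sequence_A,Sequence_B,Sequence_C,250,670,0.0019,0.0133
"""

# Group rows that name the same recombinant and the same parents
groups = defaultdict(list)
for row in csv.DictReader(io.StringIO(example_csv)):
    key = (row["Recombinant"], row["Major_Parent"], row["Minor_Parent"])
    groups[key].append(row)

# Summarize each group: supporting methods and median breakpoint positions
summary = {}
for key, rows in groups.items():
    summary[key] = {
        "methods": sorted({r["Method"] for r in rows}),
        "start": median(int(r["Breakpoint_Start"]) for r in rows),
        "end": median(int(r["Breakpoint_End"]) for r in rows),
    }

for key, info in summary.items():
    print(key, info)
```

For the example rows this gives a single event for Sequence_A supported by three methods, with median breakpoints at 245 and 678.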

Result Interpretation

Statistical Significance

  • P-value < 0.05: Statistically significant recombination event

  • P-value < 0.01: Highly significant recombination event

  • Multiple methods: Events detected by ≥3 methods are most reliable

Breakpoint Analysis

Examine breakpoint positions to understand:

  • Recombination hotspots: Regions with frequent breakpoints

  • Functional domains: Impact on protein function

  • Phylogenetic implications: Effect on evolutionary relationships

Data Analysis Workflow

Python Analysis Example

import pandas as pd
import matplotlib.pyplot as plt

# Load recombination results
df = pd.read_csv("recombination_results.csv")

# Filter significant events
significant = df[df['Pvalue'] < 0.05]

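# Count events per method for each recombinant
method_counts = significant.groupby('Recombinant')['Method'].value_counts().unstack(fill_value=0)
# Number of distinct methods that detected each recombinant (not the total number of events)
method_counts['Total_Methods'] = (method_counts > 0).sum(axis=1)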

# Identify reliable events (≥3 methods)
reliable_events = method_counts[method_counts['Total_Methods'] >= 3]

print("Reliable recombination events:")
print(reliable_events)

# Plot breakpoint distribution
plt.figure(figsize=(10, 6))
plt.hist(significant['Breakpoint_Start'], bins=20, alpha=0.7)
plt.xlabel('Breakpoint Position')
plt.ylabel('Frequency')
plt.title('Distribution of Recombination Breakpoints')
plt.show()

Integration with Phylogenetic Analysis

Recombination detection should be performed before phylogenetic analysis:

# 1. Detect recombination
cressent recombination \
    -i sequences.fasta \
    -o recombination/ \
    -f recomb_results.csv \
    --all

# 2. Align sequences
cressent align \
    --input_fasta recombination/cleaned_sequences.fasta \
    -o alignment/

# 3. Analyze results and potentially remove recombinant sequences
# 4. Build phylogenetic trees with cleaned dataset
cressent build_tree \
    -i alignment/cleaned_alignment.fasta \
    -o phylogeny/
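
Step 3 above (removing recombinant sequences) is left to the user. A minimal way to do it in Python is sketched below, under two assumptions that should be checked against real output: that the Recombinant column holds the FASTA identifier (the header text up to the first space), and that filtering on Pvalue < 0.05 is the desired criterion. The function names and the in-memory test data are hypothetical.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into a list of (header, sequence) tuples, keeping input order."""
    records, header, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_recombinants(records, results_handle, max_pvalue=0.05):
    """Remove sequences flagged as recombinant with Pvalue below max_pvalue."""
    flagged = {row["Recombinant"] for row in csv.DictReader(results_handle)
               if float(row["Pvalue"]) < max_pvalue}
    return [(h, s) for h, s in records if h.split(" ", 1)[0] not in flagged]


# In-memory stand-ins for the alignment and the results CSV
fasta = io.StringIO(">Sequence_A\nACGT\n>Sequence_B\nACGA\n>Sequence_C\nTCGA\n")
results = io.StringIO("Method,Recombinant,Pvalue\nRDP,Sequence_A,0.0023\n")
kept = drop_recombinants(read_fasta(fasta), results)
print([h for h, _ in kept])
```

Whether to drop a whole recombinant sequence or only mask the recombinant region depends on the study; this sketch covers only the simpler case.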

Best Practices

Input Preparation

  1. High-quality alignment: Ensure proper sequence alignment before analysis

  2. Sufficient diversity: Include adequate sequence diversity for detection

  3. Appropriate length: Sequences should be long enough to detect meaningful events
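
These checks can be partly automated before running the module. The sketch below verifies that all aligned sequences share one length and reports pairwise identity as a rough diversity measure. The function names and toy records are hypothetical, and which identity range is adequate for detection is left to the user.

```python
from itertools import combinations


def check_alignment(records):
    """Raise ValueError unless all aligned sequences have the same length."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return lengths.pop()


def pairwise_identity(seq1, seq2, gap="-"):
    """Fraction of identical sites among columns where neither sequence has a gap."""
    compared = [(a, b) for a, b in zip(seq1.upper(), seq2.upper()) if a != gap and b != gap]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)


# Toy aligned records as (name, sequence) pairs
records = [("s1", "ACGTACGT"), ("s2", "ACGTACGA"), ("s3", "AC-TTCGA")]
length = check_alignment(records)
identities = {(n1, n2): pairwise_identity(q1, q2)
              for (n1, q1), (n2, q2) in combinations(records, 2)}
print(length, identities)
```

Pairs with identity close to 1.0 across the whole dataset suggest there may be too few variable sites for any method to detect recombination.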

Method Selection

  1. All methods: Use all methods for comprehensive analysis

  2. Cross-validation: Require detection by multiple methods for reliability

  3. Statistical thresholds: Apply appropriate p-value cutoffs

Result Validation

  1. Manual inspection: Examine alignments around detected breakpoints

  2. Phylogenetic analysis: Compare trees before/after recombinant removal

  3. Functional analysis: Consider impact on protein function

Common Applications

Viral Evolution Studies

  • Identify recombination hotspots in viral genomes

  • Track recombination patterns across viral families

  • Study impact on virulence and host adaptation

Outbreak Investigation

  • Trace recombination events in epidemic strains

  • Identify parent strains in recombinant viruses

  • Understand transmission dynamics

Vaccine Development

  • Identify stable genomic regions for vaccine targets

  • Assess recombination risk in vaccine strains

  • Monitor vaccine escape variants

Troubleshooting

Common Issues

No Events Detected: Check alignment quality and sequence diversity. Ensure adequate evolutionary distance.

Binary Compilation Errors: The module automatically compiles required binaries. Check system compatibility and dependencies.

Memory Issues: Reduce dataset size or increase available memory. Some methods are computationally intensive.

Statistical Significance: Adjust p-value thresholds based on study requirements and multiple testing considerations.

Performance Considerations

  • Dataset size: Larger datasets require more computational time

  • Sequence length: Longer sequences provide more power but increase runtime

  • Method selection: Running all methods increases accuracy but also increases computational cost

  • Parallel processing: Some methods can utilize multiple CPU cores

Example Complete Analysis

#!/bin/bash

# Complete recombination analysis workflow
echo "Starting recombination analysis..."

# 1. Prepare alignment
cressent align \
    --threads 24 \
    --input_fasta viral_genomes.fasta \
    -o analysis/alignment

# 2. Run comprehensive recombination detection
cressent recombination \
    -i analysis/alignment/viral_genomes_aligned_trimmed_sequences.fasta \
    -o analysis/recombination \
    -f comprehensive_recombination.csv \
    --all \
    --verbose

# 3. Generate summary report
echo "Analysis complete. Results in analysis/recombination/"
echo "Check comprehensive_recombination.csv for detailed results"