Frequently Asked Questions

Here you will find answers to common questions about CRESSENT. If you can’t find an answer to your problem here, please open an issue in the GitHub repository.

General Usage

How can I speed up CRESSENT analysis?

If you want to speed up CRESSENT execution, several options are available:

Use sequence clustering: Run the cluster module before phylogenetic analysis to remove redundant sequences
Optimize thread usage: Set the --threads parameter to match the number of CPU cores available on your system
Disable time-intensive modules: Skip recombination analysis if not needed for your research
Use simpler models: Choose faster evolutionary models (e.g., -m GTR+G4 instead of -m MFP) for quick analysis

Please note that some optimizations may reduce analysis depth or accuracy.
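As a rough guide for picking a --threads value, the sketch below reads the core count from the operating system and leaves one core free. The function name suggest_threads and the "reserve one core" rule are illustrative assumptions, not CRESSENT behavior.

```python
import os


def suggest_threads(reserve=1, cap=None):
    """Suggest a thread count: available cores minus `reserve`, never below 1."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    threads = max(1, cores - reserve)
    if cap is not None:
        # An optional upper bound, e.g. to limit memory use on shared machines.
        threads = max(1, min(threads, cap))
    return threads


if __name__ == "__main__":
    print(f"Suggested value for --threads: {suggest_threads()}")
```

On shared servers or clusters, the number of cores granted to your job may be lower than what the operating system reports, so treat the result as an upper bound.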

What input formats does CRESSENT accept?

CRESSENT accepts the following input formats:

FASTA files: Nucleotide and protein sequences (required)
Compressed files: .gz, .bz2, and .xz compression supported
GFF files: For structural analysis modules like stem loops and iterons
CSV files: For database construction and metadata integration

All sequence files must be in valid FASTA format with unique sequence identifiers.
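Before running an analysis, it can help to check the identifier requirement yourself. The following is a minimal sketch that uses only the Python standard library; the helper names (open_text, find_fasta_problems) are hypothetical and not part of CRESSENT, and the identifier is assumed to be the first whitespace-delimited word of each header.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Map file suffixes to the matching stdlib opener; other files use open().
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path):
    """Open a plain or compressed file in text mode."""
    opener = OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, "rt")


def find_fasta_problems(lines):
    """Return a list of problems found in an iterable of FASTA lines."""
    problems = []
    seen = set()
    current_id = None
    current_len = 0
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current_id is not None and current_len == 0:
                problems.append(f"record '{current_id}' has no sequence")
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            current_len = 0
            if not current_id:
                problems.append(f"line {number}: header without identifier")
            elif current_id in seen:
                problems.append(f"line {number}: duplicate identifier '{current_id}'")
            seen.add(current_id)
        elif current_id is None:
            problems.append(f"line {number}: sequence data before first header")
        else:
            current_len += len(line)
    if current_id is not None and current_len == 0:
        problems.append(f"record '{current_id}' has no sequence")
    if not seen:
        problems.append("no FASTA records found")
    return problems
```

To check a file, pass an open handle, for example: with open_text("sequences.fasta.gz") as handle: print(find_fasta_problems(handle)). An empty list means no problems were found by these checks.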

How do I interpret phylogenetic tree support values?

Bootstrap support values in CRESSENT trees indicate confidence in tree topology:

≥95%: Very strong support, high confidence in the relationship
70-94%: Strong support, generally reliable
50-69%: Moderate support, interpret with caution
<50%: Poor support, relationship is uncertain

For critical evolutionary inferences, focus on relationships with ≥70% bootstrap support.
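The thresholds above can be written as a small lookup, which is convenient when summarizing many nodes at once. This is only a sketch of the categories listed in this FAQ; the function name classify_bootstrap is an assumption for illustration.

```python
def classify_bootstrap(support):
    """Map a bootstrap percentage (0-100) to the support category used in this FAQ."""
    if not 0 <= support <= 100:
        raise ValueError(f"bootstrap support must be between 0 and 100, got {support}")
    if support >= 95:
        return "very strong"
    if support >= 70:
        return "strong"
    if support >= 50:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Hypothetical node labels and support values, for illustration only.
    for node, value in {"node_1": 98, "node_2": 72, "node_3": 41}.items():
        print(node, value, classify_bootstrap(value))
```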

Module-Specific Questions

Why is my alignment empty or of poor quality?

Poor alignment quality can result from several issues:

Input sequence problems:

Sequences are not homologous
Poor sequence quality with many ambiguous nucleotides
Sequences of very different lengths

Solutions:

Verify that input sequences represent the same gene/protein
Remove sequences with >10% ambiguous characters
Use cluster to remove highly divergent sequences
Adjust --mafft_ep parameter for better alignment sensitivity
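The 10% ambiguity filter mentioned above could be applied with a short script like the following. It is a sketch for nucleotide sequences only: any character other than A, C, G, T, or U counts as ambiguous, gap characters are ignored, and the function names are hypothetical.

```python
UNAMBIGUOUS_NUCLEOTIDES = set("ACGTU")
GAP_CHARACTERS = set("-.")


def ambiguous_fraction(sequence):
    """Fraction of non-gap characters that are not A, C, G, T or U."""
    residues = [c for c in sequence.upper() if c not in GAP_CHARACTERS]
    if not residues:
        return 1.0  # treat empty or all-gap sequences as fully ambiguous
    ambiguous = sum(1 for c in residues if c not in UNAMBIGUOUS_NUCLEOTIDES)
    return ambiguous / len(residues)


def filter_ambiguous(records, max_fraction=0.10):
    """Keep records (name -> sequence) whose ambiguous fraction is at most max_fraction."""
    return {
        name: seq
        for name, seq in records.items()
        if ambiguous_fraction(seq) <= max_fraction
    }
```

For protein sequences the set of unambiguous characters would need to be replaced with the amino acid alphabet.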

Why does tree construction fail with “insufficient data” errors?

Tree construction can fail when:

Too few informative sites: Sequences are too similar or too short
Identical sequences: Duplicate sequences add no phylogenetic signal; remove them using cluster
Poor alignment: Insufficient overlap between sequences after trimming

Solutions:

Check alignment quality and length
Use longer sequences or more divergent taxa
Adjust trimming parameters (--gap_threshold)
Try simpler evolutionary models
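To see whether an alignment is likely to trigger this error, you can count parsimony-informative columns and look for identical sequences before building a tree. The sketch below assumes a nucleotide alignment held as a dictionary of name to aligned sequence; the function names and the set of ignored characters are illustrative assumptions.

```python
from collections import Counter

# Characters ignored when counting states (assumption: nucleotide alignment).
IGNORED = set("-?N")


def count_informative_sites(alignment):
    """Count columns with at least two states that each occur in at least two sequences."""
    sequences = list(alignment.values())
    if not sequences:
        return 0
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    informative = 0
    for column in zip(*sequences):
        counts = Counter(c.upper() for c in column if c.upper() not in IGNORED)
        # A column is parsimony-informative when two or more states are each seen twice or more.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


def group_identical(alignment):
    """Return lists of identifiers that share exactly the same aligned sequence."""
    groups = {}
    for name, seq in alignment.items():
        groups.setdefault(seq.upper(), []).append(name)
    return [names for names in groups.values() if len(names) > 1]
```

An alignment with very few informative columns, or with many groups of identical sequences, is a likely cause of the error.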

How can I improve recombination detection sensitivity?

To enhance recombination detection:

Use high-quality alignments: Ensure proper sequence alignment
Include diverse sequences: Adequate evolutionary distance improves detection
Run all methods: Use --all flag for comprehensive analysis
Verify manually: Inspect alignment around detected breakpoints
Apply statistical filters: Focus on events with p-values <0.05

Why are my sequence logos poorly resolved?

Poor sequence logo quality often indicates:

Insufficient sequence conservation: Not enough conserved positions
Poor motif definition: Pattern may be too broad or too specific
Alignment gaps: Excessive gaps in the motif region

Solutions:

Increase the number of sequences in the analysis
Refine motif search patterns
Use the --remove-gaps option in the motif module
Check alignment quality in the motif region

Technical Issues

Why am I getting memory errors?

Memory errors typically occur with:

Large datasets (>1000 sequences)
Long alignments (>10,000 positions)
Database integration with large reference sets

Solutions:

Reduce dataset size: Use cluster for sequence reduction
Decrease thread count: Lower --threads parameter to reduce memory usage
Increase system RAM: Consider upgrading hardware for large analyses
Process in batches: Split large datasets into smaller chunks
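Batch processing can be prepared with a small script that reads FASTA records lazily and groups them into fixed-size chunks. This is a generic sketch, not a CRESSENT feature; the function names and the batch size are assumptions you would adapt to your data.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_into_batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Because both functions are generators, only one batch is held in memory at a time. Each batch can then be written to its own FASTA file and analyzed separately.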

Why does execution appear stuck during tree building?

Tree construction can be slow due to:

Large datasets: Many sequences or long alignments
Complex models: Model selection (-m MFP) is computationally intensive
Underused CPU: Running single-threaded on a multi-core system

Solutions:

Use --threads parameter to utilize multiple cores
Choose specific models (-m GTR+G4) instead of model selection
Monitor system resources to identify bottlenecks
Consider using a high-performance computing cluster

Why do binary compilation errors occur?

Some modules compile required binaries automatically, and this step can fail because of:

System compatibility: Binaries may not work on all systems
Missing dependencies: Required compilers or libraries missing
Permission issues: Insufficient write permissions

Solutions:

Ensure gcc and make are installed
Check system compatibility (Linux/Unix preferred)
Run with appropriate permissions
Install missing system dependencies
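A quick way to confirm that the build tools named above are on your PATH is shown below. The sketch relies on shutil.which from the Python standard library; the function name missing_tools is an assumption for illustration.

```python
import shutil


def missing_tools(tools=("gcc", "make")):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("gcc and make were found on the PATH")
```

This only confirms that the executables exist; it does not verify compiler versions or required libraries.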

Database and Integration

Why does database integration fail?

Database integration issues often result from:

Missing database files: The database path is wrong or expected files do not exist
Incompatible family names: The family name is misspelled or not available in the database
Corrupted downloads: Database files were damaged or truncated during download

Solutions:

Verify --db_path points to correct directory
Check available families in database documentation
Re-run database construction if files are corrupted

How do I handle large-scale comparative analyses?

For large comparative studies:

Use hierarchical clustering: Reduce dataset complexity first
Employ database integration: Add reference sequences for context
Optimize computational resources: Use high-performance computing
Process by viral family: Analyze related viruses separately
Validate key findings: Focus detailed analysis on important relationships

Results Interpretation

How reliable are the recombination predictions?

Recombination reliability depends on:

Multiple method agreement: Events detected by ≥3 methods are most reliable
Statistical significance: p-values <0.01 indicate strong evidence
Biological plausibility: Results should make biological sense
Manual validation: Visual inspection of alignments confirms events
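The first two criteria can be applied programmatically once events are in a table. The sketch below assumes each event is a dictionary with a list of detecting methods and a p-value; this record layout, the field names, and the placeholder method names are hypothetical and do not describe CRESSENT's actual output format.

```python
def is_reliable(event, min_methods=3, max_p_value=0.01):
    """True when enough distinct methods agree and the p-value is below the cutoff."""
    return len(set(event["methods"])) >= min_methods and event["p_value"] < max_p_value


def reliable_events(events, min_methods=3, max_p_value=0.01):
    """Return the events that pass both filters, ordered by ascending p-value."""
    kept = [e for e in events if is_reliable(e, min_methods, max_p_value)]
    return sorted(kept, key=lambda e: e["p_value"])


# Hypothetical events; method names are placeholders, not real tool output.
EXAMPLE_EVENTS = [
    {"id": "event_1", "methods": ["method_a", "method_b", "method_c"], "p_value": 0.001},
    {"id": "event_2", "methods": ["method_a"], "p_value": 0.0005},
    {"id": "event_3", "methods": ["method_a", "method_b", "method_c", "method_d"], "p_value": 0.03},
]

if __name__ == "__main__":
    for event in reliable_events(EXAMPLE_EVENTS):
        print(event["id"], event["p_value"])
```

Events that pass these filters still need the biological plausibility check and manual inspection described above.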

What do tanglegram comparisons reveal?

Tanglegrams show congruence between phylogenetic trees:

Parallel lines: Congruent phylogenies, similar evolutionary history
Crossed lines: Incongruent relationships, possible recombination
RF scores: Robinson-Foulds distance quantifies tree differences

High congruence suggests vertical inheritance; incongruence may indicate recombination or different evolutionary pressures.
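To illustrate what a Robinson-Foulds score measures, the sketch below computes the rooted variant of the distance: the number of clades present in one tree but not the other. Trees are written as nested tuples of leaf names, a simplification chosen for this example. The standard unrooted distance works on bipartitions instead, and this code is not how CRESSENT computes its scores.

```python
def clades(tree):
    """Return (non-root clades, all leaves) for a rooted tree given as nested tuples."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])  # a leaf is a trivial clade and is not recorded
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree on these taxa
    return found, all_leaves


def rooted_rf_distance(tree_a, tree_b):
    """Count clades that appear in exactly one of the two trees."""
    clades_a, leaves_a = clades(tree_a)
    clades_b, leaves_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must contain the same taxa")
    return len(clades_a ^ clades_b)


if __name__ == "__main__":
    tree_1 = ((("A", "B"), "C"), "D")
    tree_2 = ((("A", "C"), "B"), "D")
    print(rooted_rf_distance(tree_1, tree_2))
```

A distance of 0 means the two trees contain the same clades; larger values mean more conflicting groupings, which correspond to crossed lines in the tanglegram.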

How do I validate phylogenetic results?

Validate phylogenetic analyses through:

Bootstrap assessment: Focus on well-supported relationships (≥70%)
Model adequacy: Check model fit statistics in log files
Biological consistency: Ensure results match known biology
Alternative methods: Compare with other phylogenetic approaches
Literature comparison: Verify consistency with published studies

Best Practices

What is the recommended workflow for new users?

For new users, we recommend:

Start with tutorial data: Practice with provided example datasets
Begin with simple analyses: Single-gene phylogenies before complex workflows
Validate each step: Check intermediate outputs for quality
Use default parameters: Optimize parameters only after understanding basics
Consult documentation: Read module-specific documentation thoroughly

How should I cite CRESSENT in publications?

Please cite CRESSENT as:

[Authors]. CRESSENT: A comprehensive toolkit for CRESS DNA virus analysis. [Journal] [Year]. DOI: [DOI].

Additionally, cite specific tools used within CRESSENT modules (MAFFT, IQ-TREE, etc.) as appropriate for your analysis.

What are common analysis pitfalls to avoid?

Avoid these common mistakes:

Skipping quality control: Always validate input data quality
Ignoring statistical support: Don’t over-interpret poorly supported results
Using inappropriate models: Choose evolutionary models suitable for your data
Mixing sequence types: Don’t combine nucleotide and protein sequences inappropriately
Inadequate taxon sampling: Include sufficient diversity for robust analysis

How do I optimize CRESSENT for different viral families?

Different viral families may require specific approaches:

Small genomes (e.g., Circoviridae):

Use complete genome sequences when possible
Consider codon-based alignments for coding sequences
Pay attention to overlapping genes

Large genomes (e.g., some ssDNA viruses):

Focus on conserved gene regions
Use gene-specific databases when available
Consider computational resource requirements

Highly divergent families:

Use protein sequences instead of nucleotides
Employ more sensitive alignment parameters
Consider domain-specific analysis

If you have additional questions not covered here, please check the individual module documentation or contact us through the GitHub repository.