Preprocessing
CRESSENT's preprocessing module prepares sequence data for downstream analysis. It covers dereplication, decontamination, and sequence adjustment, so that later steps start from non-redundant, contaminant-screened, consistently oriented input.
Overview
The preprocessing module offers three key functionalities:
Dereplication using CD-HIT and clustering algorithms
Decontamination via BLAST-based screening against contaminant databases
Sequence adjustment for motif-based sequence standardization
Dereplication and Clustering
Purpose
Remove redundant sequences and group similar sequences to reduce computational burden and improve analysis quality.
Basic Usage
cressent cluster -i sequences.fasta \
-o clustering_output/ \
-t 8 \
--min_ani 95.0 \
--min_tcov 85.0
Algorithm Workflow
Sequence preprocessing with name sanitization
BLAST database creation (nucleotide or protein auto-detected)
All-vs-all BLAST search
ANI calculation using anicalc
Clustering with aniclust
Representative selection and output generation
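The clustering step in the workflow above can be pictured with a minimal sketch. It assumes a greedy, longest-first centroid scheme over pairwise ANI results; the source does not state that aniclust works exactly this way, and `greedy_cluster`, its arguments, and the tuple layout are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def greedy_cluster(
    lengths: Dict[str, int],
    hits: Iterable[Tuple[str, str, float, float]],
    min_ani: float = 95.0,
    min_tcov: float = 85.0,
) -> Dict[str, List[str]]:
    """Group sequences around representatives (assumed greedy scheme).

    hits holds (query, target, ani, target_coverage) tuples, as might be
    derived from an all-vs-all comparison. This layout is an assumption.
    """
    # Keep only pairs that pass both thresholds, indexed by query.
    neighbours: Dict[str, List[str]] = defaultdict(list)
    for query, target, ani, tcov in hits:
        if query != target and ani >= min_ani and tcov >= min_tcov:
            neighbours[query].append(target)

    clusters: Dict[str, List[str]] = {}
    assigned = set()
    # Visit sequences from longest to shortest; ties are broken by name
    # so the result is deterministic.
    for name in sorted(lengths, key=lambda n: (-lengths[n], n)):
        if name in assigned:
            continue
        clusters[name] = [name]  # an unassigned sequence becomes a representative
        assigned.add(name)
        for target in neighbours.get(name, []):
            if target not in assigned:
                clusters[name].append(target)
                assigned.add(target)
    return clusters


lengths = {"seq_001": 3000, "seq_045": 2900, "seq_002": 2800, "seq_003": 2500}
hits = [
    ("seq_001", "seq_045", 98.0, 95.0),
    ("seq_045", "seq_001", 98.0, 92.0),
    ("seq_002", "seq_003", 90.0, 99.0),  # fails the ANI threshold
]
clusters = greedy_cluster(lengths, hits)
```

Under this scheme, raising `min_ani` or `min_tcov` can only split clusters, never merge them, which is why stricter thresholds yield more representatives.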
Parameters
| Parameter | Description | Default |
|---|---|---|
| -t | Number of CPU threads | 1 |
| --min_ani | Minimum average nucleotide identity (%) | 95.0 |
| --min_tcov | Minimum target coverage (%) | 85.0 |
| --min_qcov | Minimum query coverage (%) | 0.0 |
| | Keep only first word of sequence IDs | False |
Output Files
Clusters table (clusters.tsv): Representative sequences and members
Representative sequences (cluster_sequences.fa): FASTA with cluster representatives
Name mapping (*_name_table.tsv): Original to sanitized name mapping
BLAST results (blast_results.tsv): All-vs-all comparison results
ANI results (ani_results.tsv): Average nucleotide identity calculations
Example Output
clustering_output/
├── clusters.tsv
├── cluster_sequences.fa
├── blast_results.tsv
├── ani_results.tsv
├── renamed_sequences.fasta
├── sequences_name_table.tsv
└── clustering.log
Cluster Table Format
Representative_Sequence Sequences
seq_001 seq_001,seq_045,seq_123
seq_002 seq_002,seq_067
seq_003 seq_003
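For downstream scripting, the cluster table can be loaded into a dictionary. The sketch below assumes the file is tab-separated with one header row and comma-separated members, as the example suggests; `read_clusters` is a hypothetical helper, not part of CRESSENT.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_clusters(handle: Iterable[str]) -> Dict[str, List[str]]:
    """Map each representative to its member list (assumed TSV layout)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    clusters: Dict[str, List[str]] = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        representative, members = row[0], row[1]
        clusters[representative] = members.split(",")
    return clusters


# In practice the handle would come from open("clustering_output/clusters.tsv").
example = io.StringIO(
    "Representative_Sequence\tSequences\n"
    "seq_001\tseq_001,seq_045,seq_123\n"
    "seq_002\tseq_002,seq_067\n"
    "seq_003\tseq_003\n"
)
table = read_clusters(example)
singletons = [rep for rep, members in table.items() if len(members) == 1]
```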
Decontamination
Purpose
Screen and remove potential contaminant sequences using BLAST against known contaminant databases.
Basic Usage
cressent detect_contamination -i input_sequences.fasta \
--db contaminant_database.fasta \
-o decontamination_output/ \
--output-name clean_sequences
Features
Automatic sequence type detection (nucleotide vs protein)
Flexible BLAST parameters for different stringency levels
Comprehensive statistics and contamination reports
Cross-platform compatibility with robust error handling
Parameters
| Parameter | Description | Default |
|---|---|---|
| --db | Contaminant database FASTA file | Required |
| | Sequence type (nucl/prot) | Auto-detect |
| | BLAST E-value threshold | 1e-10 |
| --identity | Minimum percent identity | 90.0 |
| --coverage | Minimum query coverage | 50.0 |
| | Number of CPU threads | 1 |
| | Keep temporary BLAST files | False |
Algorithm Steps
Input validation and sequence type detection
BLAST database creation from contaminant sequences
BLAST search (BLASTN or BLASTP)
Results filtering by identity and coverage thresholds
Sequence filtering and clean sequence output
Statistics generation and reporting
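The filtering step can be illustrated with a short sketch that flags a query as a contaminant when a hit passes both the identity and the query coverage thresholds. It is a simplification under stated assumptions: coverage is computed from a single alignment as aligned length over query length, and the `Hit` fields and function names are hypothetical rather than CRESSENT's internals.

```python
from typing import Iterable, List, NamedTuple, Set


class Hit(NamedTuple):
    query: str       # query sequence ID
    pident: float    # percent identity of the alignment
    aln_len: int     # aligned length on the query
    query_len: int   # full length of the query


def flag_contaminants(
    hits: Iterable[Hit],
    min_identity: float = 90.0,
    min_coverage: float = 50.0,
) -> Set[str]:
    """Return IDs of queries with at least one hit passing both cutoffs."""
    flagged: Set[str] = set()
    for hit in hits:
        coverage = 100.0 * hit.aln_len / hit.query_len
        if hit.pident >= min_identity and coverage >= min_coverage:
            flagged.add(hit.query)
    return flagged


def keep_clean(seq_ids: Iterable[str], flagged: Set[str]) -> List[str]:
    """Preserve input order while dropping flagged sequences."""
    return [seq_id for seq_id in seq_ids if seq_id not in flagged]


hits = [
    Hit("seq_001", 97.5, 900, 1000),  # passes both thresholds
    Hit("seq_002", 99.0, 100, 1000),  # coverage too low (10%)
    Hit("seq_042", 85.0, 950, 1000),  # identity too low
]
flagged = flag_contaminants(hits)
clean = keep_clean(["seq_001", "seq_002", "seq_042"], flagged)
```

Lowering either threshold flags more sequences, so the defaults trade sensitivity to divergent contaminants against the risk of discarding genuine sequences.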
Output Files
Clean sequences (clean_sequences.fasta): Filtered sequences
Statistics (clean_sequences_stats.txt): Contamination summary
BLAST results (clean_sequences_blast.tsv): Optional detailed results
Log file (clean_sequences_decontamination.log): Process log
Statistics Report
Total sequences: 1000
Identified contaminants: 45
Clean sequences: 955
Contamination rate: 4.50%
Contaminant sequence IDs:
seq_001
seq_042
...
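The figures in the report are related by simple arithmetic: clean = total − contaminants, and rate = contaminants / total × 100. A small sketch (the function name is hypothetical) reproduces the example values:

```python
def contamination_summary(total: int, contaminants: int) -> str:
    """Format the headline numbers of a contamination report."""
    clean = total - contaminants
    # Guard against an empty input file.
    rate = 100.0 * contaminants / total if total else 0.0
    return (
        f"Total sequences: {total}\n"
        f"Identified contaminants: {contaminants}\n"
        f"Clean sequences: {clean}\n"
        f"Contamination rate: {rate:.2f}%"
    )


summary = contamination_summary(1000, 45)
print(summary)
```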
Building Contaminant Databases
Create custom contamination databases from accession lists:
cressent build_contaminant_db --accession-csv accessions.csv \
-o database_output/ \
--output-name viral_contaminants \
--email your.email@domain.com
Features
NCBI integration for sequence download
Batch processing with rate limiting
Protein extraction from nucleotide records
Metadata generation for traceability
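Batch processing with rate limiting can be sketched independently of any download library. In the example below, `fetch` is a hypothetical callback standing in for whatever performs the NCBI request, and the 0.34 s pause reflects NCBI's general guideline of at most three requests per second without an API key; how CRESSENT batches internally is not stated in the source.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_in_batches(
    accessions: Sequence[str],
    fetch: Callable[[List[str]], List[str]],
    batch_size: int = 100,
    min_interval: float = 0.34,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call `fetch` once per batch, pausing between consecutive requests."""
    results: List[str] = []
    for index, batch in enumerate(batched(accessions, batch_size)):
        if index:
            sleep(min_interval)  # no pause is needed before the first request
        results.extend(fetch(batch))
    return results


# Dry run with a fake fetcher and a recording sleep, so nothing touches the network.
pauses: List[float] = []
records = fetch_in_batches(
    ["A", "B", "C", "D", "E"],
    fetch=lambda batch: [acc.lower() for acc in batch],
    batch_size=2,
    sleep=pauses.append,
)
```

Passing `sleep` as a parameter keeps the helper testable without real delays.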
Sequence Adjustment
Purpose
Standardize sequence start positions based on conserved motifs, particularly useful for circular genomes.
Basic Usage
cressent adjust_seq -i sequences.fasta \
-m "TAGTATTAC" \
-o adjusted_output/
Features
Automatic sequence type detection (DNA/RNA/protein)
Flexible motif patterns using regex syntax
Circular genome support with sequence rotation
Comprehensive logging and statistics
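Sequence type detection from character composition can be approximated as follows. This is a sketch under assumptions: the 0.9 threshold, the alphabets, and the function name are illustrative choices, not values taken from CRESSENT.

```python
NUCLEOTIDES = set("ACGTUN")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYX")  # 20 standard residues plus X


def guess_sequence_type(seq: str, threshold: float = 0.9) -> str:
    """Classify a sequence as 'nucleotide', 'protein', or 'unknown'."""
    # Ignore whitespace and gap characters before counting.
    chars = [c for c in seq.upper() if not c.isspace() and c != "-"]
    if not chars:
        return "unknown"
    nucleotide_fraction = sum(c in NUCLEOTIDES for c in chars) / len(chars)
    if nucleotide_fraction >= threshold:
        return "nucleotide"
    protein_fraction = sum(c in AMINO_ACIDS for c in chars) / len(chars)
    if protein_fraction >= threshold:
        return "protein"
    return "unknown"


kinds = [guess_sequence_type(s) for s in ("TAGTATTACGGC", "MKTLLVAGWRSEPQ", "12345")]
```

The nucleotide test runs first because every nucleotide letter is also a valid amino acid code, so a protein-first check would misclassify DNA.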
Parameters
| Parameter | Description | Default |
|---|---|---|
| -m | Motif pattern for adjustment | TAGTATTAC |
| -o | Output directory | Current directory |
Algorithm
Sequence type detection based on character composition
Motif searching using regex patterns
Sequence rotation to start at motif position
Quality control and validation
Output generation with adjusted sequences
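The motif search and rotation steps can be sketched for a circular genome as follows. Searching a doubled copy of the sequence lets the motif be found even when it spans the point where the linear record was cut. The sketch searches the forward strand only, and `rotate_to_motif` is a hypothetical name rather than CRESSENT's actual function.

```python
import re
from typing import Optional


def rotate_to_motif(seq: str, motif: str = "TAGTATTAC") -> Optional[str]:
    """Rotate a circular sequence so it starts at the first motif match.

    Returns None when the motif (a regex pattern) is not found.
    """
    seq = seq.upper()
    if not seq:
        return None
    # Doubling the sequence exposes matches that wrap around the junction.
    match = re.search(motif, seq + seq)
    if match is None or match.start() >= len(seq):
        return None
    start = match.start()
    return seq[start:] + seq[:start]


internal = rotate_to_motif("GGGTAGTATTACAAA")    # motif inside the record
wrapped = rotate_to_motif("TTACAAAGGGTAGTA")     # motif split across the ends
degenerate = rotate_to_motif("CCNAGTATTACGG", "[ATN]AGTATTAC")  # regex motif
```

Sequences for which the function returns None correspond to the "skipped" category in the processing summary; a reverse-complement search would be a natural extension when strand orientation is unknown.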
Output Files
Adjusted sequences (
*_motif_adj.fa): Sequences starting at motifLog file (
adjust_seq.log): Processing details and statistics
Processing Summary
Processing Summary:
Total sequences processed: 100
Sequences adjusted: 87
Sequences skipped: 13
Nucleotide sequences: 95
Protein sequences: 5
Unknown sequences: 0
Combined Preprocessing Workflow
Complete Pipeline
# Step 1: Sequence adjustment
cressent adjust_seq -i raw_sequences.fasta \
-m "TAGTATTAC" \
-o preprocessing/
# Step 2: Decontamination
cressent detect_contamination -i preprocessing/*_motif_adj.fa \
--db viral_contaminants.fasta \
-o preprocessing/ \
--output-name decontaminated
# Step 3: Dereplication
cressent cluster -i preprocessing/decontaminated.fasta \
-o preprocessing/ \
--min_ani 95.0 \
--min_tcov 85.0 \
-t 8
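When the pipeline is run repeatedly, the three steps can be driven from a script. The sketch below only uses the commands and flags shown above; the helper names are hypothetical, and it assumes that `adjust_seq` leaves exactly one `*_motif_adj.fa` file in the output directory.

```python
import glob
import subprocess
from typing import List


def adjust_cmd(fasta: str, outdir: str, motif: str = "TAGTATTAC") -> List[str]:
    return ["cressent", "adjust_seq", "-i", fasta, "-m", motif, "-o", outdir]


def decontam_cmd(fasta: str, db: str, outdir: str, name: str = "decontaminated") -> List[str]:
    return ["cressent", "detect_contamination", "-i", fasta,
            "--db", db, "-o", outdir, "--output-name", name]


def cluster_cmd(fasta: str, outdir: str, min_ani: float = 95.0,
                min_tcov: float = 85.0, threads: int = 8) -> List[str]:
    return ["cressent", "cluster", "-i", fasta, "-o", outdir,
            "--min_ani", str(min_ani), "--min_tcov", str(min_tcov),
            "-t", str(threads)]


def run_pipeline(raw_fasta: str, db: str, outdir: str = "preprocessing") -> None:
    """Run the three steps in order, stopping at the first failure."""
    subprocess.run(adjust_cmd(raw_fasta, outdir), check=True)
    adjusted = sorted(glob.glob(f"{outdir}/*_motif_adj.fa"))
    if len(adjusted) != 1:  # assumption: one adjusted file per run
        raise RuntimeError(f"expected one adjusted FASTA, found {len(adjusted)}")
    subprocess.run(decontam_cmd(adjusted[0], db, outdir), check=True)
    subprocess.run(cluster_cmd(f"{outdir}/decontaminated.fasta", outdir), check=True)


# Example call (requires cressent on PATH):
# run_pipeline("raw_sequences.fasta", "viral_contaminants.fasta")
```

Passing arguments as lists avoids shell quoting issues, and `check=True` prevents a failed step from feeding stale files into the next one.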
Quality Control Workflow
# Conservative decontamination
cressent detect_contamination -i sequences.fasta \
--db contaminants.fasta \
--identity 85.0 \
--coverage 70.0 \
-o qc_output/ \
--output-name clean_sequences
# Strict clustering
cressent cluster -i qc_output/clean_sequences.fasta \
--min_ani 98.0 \
--min_tcov 90.0 \
--min_qcov 10.0 \
-o qc_output/
Best Practices
Sequence Preparation
Quality filtering before preprocessing
Appropriate motif selection for sequence adjustment
Comprehensive contaminant databases for screening
Parameter Optimization
ANI thresholds based on taxonomic level of interest
Coverage parameters depending on sequence quality
Identity cutoffs appropriate for contamination type
Computational Considerations
Thread allocation based on available resources
Memory requirements for large datasets
Temporary file management for disk space
Troubleshooting
Common Issues
No sequences adjusted:
Verify motif pattern syntax
Check sequence format and encoding
Consider motif orientation
High contamination rates:
Review contamination database completeness
Adjust identity/coverage thresholds
Validate input sequence quality
Clustering failures:
Check sequence format compatibility
Verify BLAST installation
Monitor memory usage
Performance Optimization
Large datasets:
Increase thread count
Use sequence splitting for memory management
Consider preliminary filtering
Slow BLAST searches:
Reduce database size if possible
Optimize E-value thresholds
Use faster BLAST variants
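Sequence splitting, mentioned above for memory management, can be done with a small stand-alone helper. The sketch below is independent of CRESSENT and its function names are hypothetical. Note that splitting suits per-sequence steps such as decontamination; for all-vs-all clustering, sequences placed in different chunks are never compared, so results would differ from a single run.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Parse FASTA text line by line without loading the whole file."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most `chunk_size` entries."""
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


lines = [">a", "ACGT", "AC", ">b", "GG", ">c", "TT"]
chunks = list(chunk_records(read_fasta(lines), 2))
```

Each chunk can then be written to its own FASTA file and processed separately.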
Error Diagnostics
The preprocessing modules provide detailed logging:
INFO - Processing FASTA with keep_names=False
INFO - Sanitized 1000 sequences to output.fasta
INFO - Detected nucleotide sequences (blastn will be used)
WARNING - Binary compatibility issue detected: GLIBC version
ERROR - BLAST search failed: insufficient memory
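To triage a long log, warnings and errors can be pulled out with a short script. The sketch assumes each line contains a `LEVEL - message` pair as in the excerpt above, optionally preceded by other fields such as a timestamp; `collect_problems` is a hypothetical helper.

```python
import re
from typing import Iterable, List, Tuple

LOG_LINE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL) - (.*)$")
PROBLEM_LEVELS = {"WARNING", "ERROR", "CRITICAL"}


def collect_problems(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (level, message) pairs for warnings and errors, in log order."""
    problems: List[Tuple[str, str]] = []
    for line in lines:
        match = LOG_LINE.search(line.rstrip("\n"))
        if match and match.group(1) in PROBLEM_LEVELS:
            problems.append((match.group(1), match.group(2)))
    return problems


log = [
    "INFO - Processing FASTA with keep_names=False",
    "INFO - Sanitized 1000 sequences to output.fasta",
    "INFO - Detected nucleotide sequences (blastn will be used)",
    "WARNING - Binary compatibility issue detected: GLIBC version",
    "ERROR - BLAST search failed: insufficient memory",
]
problems = collect_problems(log)
```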