# Clumpy compression
`--clumpify` is an opt-in mode that reorders reads inside each gzip member of the trimmed output so that reads sharing a canonical 16-mer minimizer land adjacent on disk. gzip's 32 KB sliding window then finds long redundant runs of similar sequences, shrinking the `.fq.gz` by 15–55% depending on the data type, with no information loss: only the on-disk order of records changes.

`--compression <N>` sets the gzip level independently (1–9, default 1). Combine the two for maximum effect: `--clumpify --compression 9` reorders reads and runs gzip at its slowest, smallest-output level.
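To see why order matters, here is a small, self-contained Rust sketch (using the `flate2` crate; an illustration of the mechanism, not Trim Galore's code). It gzips the same synthetic records twice: once with duplicate families clustered together, once with the copies scattered further apart than gzip's 32 KB window can see.

```rust
use flate2::{write::GzEncoder, Compression};
use std::io::Write;

/// gzip the records at the given level and return the compressed size.
fn gz_size(records: &[String], level: u32) -> usize {
    let mut enc = GzEncoder::new(Vec::new(), Compression::new(level));
    for r in records {
        enc.write_all(r.as_bytes()).unwrap();
    }
    enc.finish().unwrap().len()
}

/// Deterministic pseudo-random 48-base sequence for family `f` (LCG-based).
fn family(f: u64) -> String {
    let mut x = f.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    (0..48)
        .map(|_| {
            x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            b"ACGT"[((x >> 33) % 4) as usize] as char
        })
        .collect()
}

fn main() {
    // 1000 duplicate families, 8 near-identical 51-byte records each.
    // Clustered: copies of a family sit next to each other (what minimizer
    // sorting produces), so deflate finds long matches at distance ~51 B.
    let clustered: Vec<String> = (0..1000u64)
        .flat_map(|f| (0..8).map(move |c| format!("{}{:02}\n", family(f), c)))
        .collect();
    // Scattered: copies of the same family are ~51 KB apart (outside the
    // 32 KB window), so the redundancy is invisible to deflate.
    let scattered: Vec<String> = (0..8)
        .flat_map(|c| (0..1000u64).map(move |f| format!("{}{:02}\n", family(f), c)))
        .collect();

    println!("clustered: {} bytes", gz_size(&clustered, 1));
    println!("scattered: {} bytes", gz_size(&scattered, 1));
}
```

The clustered order comes out markedly smaller at every gzip level; real libraries are less extreme than this toy, which is where the 15–55% range above comes from.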
## When to use it

### Clumpify

- ✅ Low-complexity data: yes (ATAC-seq, ChIP-seq, Ribo-seq, RNA-seq, RRBS, high sequencing depth WES)
- ❌ High-complexity data: no (whole-genome sequencing, WGBS); may have a detrimental impact (the natural flowcell ordering compresses better)
- ❌ Long reads: no (Oxford Nanopore); has no effect
- ❌ Unusual paired-end formats: no; can have a deleterious effect
### Compression

Whether to raise the `--compression` level depends on what the trimmed FASTQ is used for:

- Pipeline intermediates (trimmed FASTQ is ephemeral, deleted after the pipeline finishes)
  - Leave the compression level at 1, but you can still use `--clumpify`.
  - The reorder is essentially free (1.0–1.4× slowdown on most data) and the smaller output makes the next step (typically an aligner) read less from disk: a net I/O win for the whole pipeline.
- Long-term storage or disk-constrained workdirs
  - Add `--compression 6` (or `--compression 9` for archival). `--clumpify --compression 6` can roughly halve output file sizes (15–50% smaller) but makes the run 4–6× slower.
  - `--compression` can be specified without `--clumpify`, but on redundant data types (ATAC-seq, Ribo-seq) it typically runs faster with clumpify on, because deflate finds matches more cheaply on sorted runs.
## Data types

| Data type | Typical saving (`--clumpify --compression 9`) | Recommendation |
|---|---|---|
| ATAC-seq (paired) | ~50% | ✅ Strong yes: Tn5 insertion bias creates very high fragment redundancy |
| Ribo-seq (paired) | ~45% | ✅ Strong yes: short ribosome-protected fragments are highly clustered |
| MiSeq amplicon / CRISPR / sgRNA | 30–37% | ✅ Strong yes: explicit amplification produces lots of duplicates |
| RRBS (paired) | ~24% at default `--memory 1G`; up to ~31% at `--memory 4G+` | ✅ Yes: MspI cut sites concentrate reads at fragment ends; the minimizer co-clusters them. A bigger `--memory` budget gives substantial extra saving, which is atypical for paired-end data (most types saturate at default memory) |
| WGBS (paired) | ~9% (plain `--compression 6` alone gets ~19%) | ❌ No: coverage-diverse reads, no fragment-level clustering. R2 disruption beats the R1 win at every gzip level (same mechanism as 10x scRNA-seq). Use `--compression 6` without `--clumpify` for ~19% saving |
| ChIP-seq (single-end) | ~24% | ✅ Yes: peaks generate clustered reads |
| RNA-seq (paired) | 16–30% | ✅ Yes: highly-expressed transcripts create dense clusters; bigger savings at higher gzip levels |
| WES / WGS (paired) | 6–22% | 🟡 Modest: diverse coverage gives less clustering |
| scRNA-seq (10x Chromium) | negative (output grows) | ❌ No: R1 (cell barcode + UMI) reorders cleanly, but R2 (cDNA) follows R1's order to preserve pair lockstep and ends up scrambled relative to the natural flowcell-cluster order. R2 disruption beats the R1 win. Use `--compression 6` without `--clumpify` for ~17% saving |
| Long-read (ONT, PacBio) | ~0% | ❌ No: long reads are mostly unique fragments; clumpify doesn't help and adds wall time |
| Variable-length / mixed amplicon | ~0% | ❌ Skip: diversity defeats minimizer clustering |
## How to use it

```bash
# Reorder reads, default gzip level (1 — fastest)
trim_galore --clumpify <input>

# Maximum compression (slowest, smallest output)
trim_galore --clumpify --compression 9 <input>

# Compose for archival storage: max compression with extra memory
trim_galore --clumpify --compression 9 --memory 4G <input>

# Higher gzip without reordering (gzip-only win, no clumping cost)
trim_galore --compression 6 <input>
```

`--clumpify` requires `--cores >= 2` (it feeds the existing parallel worker pool with binned batches) and gzip output (`--dont_gzip` is rejected).

`--compression` is independent: it works with or without `--clumpify`, and applies to the regular trimming pipeline too.
## Performance and compression considerations

### Wall-time cost

Reordering itself is essentially free; the dominant cost at higher gzip levels is gzip's CPU time. Decoupling the two flags lets you pick the trade-off you actually want:
| Mode | Wall time vs plain |
|---|---|
| `--clumpify` (compression 1, default) | ~1.0–1.4× plain (essentially free) |
| `--clumpify --compression 6` | ~1.5–6.4× plain |
| `--clumpify --compression 9` | ~5–10× plain |
| `--compression 6` (no clumpify) | ~1.6–6.4× plain |
| `--compression 9` (no clumpify) | ~5–8× plain |
The minimizer computation uses 2-bit packed integer ops (one O(1) bitwise step per read position), and the per-bin sort is O(n log n) on small bins; both run in the parallel worker pool alongside trim+filter, so they overlap with I/O.
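As a concrete illustration of that inner loop, here is a minimal Rust sketch of a canonical 16-mer minimizer using 2-bit packing: one shift-and-mask step per base on both strands, taking the smaller of the forward and reverse-complement encodings so the result is strand-independent. This follows the standard minimizer technique; it is not Trim Galore's exact code.

```rust
const K: usize = 16;
const MASK: u64 = (1u64 << (2 * K)) - 1; // low 32 bits hold one 16-mer

/// 2-bit code for a base; None resets the rolling window (e.g. on N).
fn base_code(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Smallest canonical 16-mer in `seq`, or None if no valid window exists.
fn minimizer(seq: &[u8]) -> Option<u64> {
    let (mut fwd, mut rev) = (0u64, 0u64);
    let mut valid = 0usize; // bases seen since the last ambiguous base
    let mut best: Option<u64> = None;
    for &b in seq {
        let Some(c) = base_code(b) else {
            valid = 0;
            continue;
        };
        // One bitwise step per position: shift the new base into both strands.
        fwd = ((fwd << 2) | c) & MASK;
        rev = (rev >> 2) | ((3 - c) << (2 * (K - 1)));
        valid += 1;
        if valid >= K {
            let canon = fwd.min(rev); // strand-independent representative
            best = Some(best.map_or(canon, |m| m.min(canon)));
        }
    }
    best
}
```

For paired-end input, only R1 would feed a function like this, since R2 follows R1's order to preserve pair lockstep (see the 10x scRNA-seq note in the data-types table).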
### Memory

`--memory` (default 1G) is a Trim Galore-wide memory budget. The clumpify dispatcher sizes the per-bin sort runs against it, choosing bin count and run size so that the predicted peak RSS stays ≤ `--memory`. With `--cores 8 --memory 1G` you get 32 bins × 12 MB; with `--cores 8 --memory 4G`, 32 bins × 66 MB. The binary prints those resolved values at startup.

In theory, a bigger budget means bigger per-gzip-member sort runs and therefore better compression. In practice, increasing memory makes little difference in our tests (RRBS, which keeps gaining up to `--memory 4G+`, is the notable exception; see the data-types table above).
### Below-floor behaviour

`--clumpify` needs a minimum of ~535 MiB to run (mostly a fixed 512 MiB reservation for FastQC, allocator, and runtime overhead). The exact floor varies slightly with `--cores` but stays in the 535–730 MiB range for any sensible core count.

If `--memory` is below the floor, Trim Galore prints a warning and falls back to plain mode:

```
WARNING: --memory 100M is too small for --clumpify at --cores 6 (need ≥ 552 MiB).
Falling back to plain mode (no read reordering).
Increase --memory or drop --clumpify to silence this warning.
```

The trim itself proceeds normally; only the read-reordering step is skipped. Memory usage without `--clumpify` is typically much lower, around the 100 MB mark.
## What doesn't change

- Trimming algorithm and per-record output bytes: clumpify only changes the order of records.
- All `*_trimming_report.txt`/`.json` numbers are byte-identical between plain and clumpify runs (filter and stats code is order-independent).
- Multi-member gzip is RFC 1952 valid; `zcat`, `seqkit`, `samtools fastq`, and `MultiGzDecoder` all handle it transparently (see the sketch below).
- Pair lockstep is preserved: R1[i] and R2[i] are still mates after clumpify reorders them.
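In Rust, for example, the multi-member point matters when picking a decoder: `flate2::read::GzDecoder` stops after the first gzip member, while `MultiGzDecoder` keeps decoding across member boundaries. A minimal sketch (the file name is hypothetical):

```rust
use flate2::read::MultiGzDecoder;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    // MultiGzDecoder transparently concatenates all gzip members;
    // a plain GzDecoder would silently stop after the first one.
    let file = File::open("trimmed.fq.gz")?; // hypothetical path
    let reader = BufReader::new(MultiGzDecoder::new(file));
    let lines = reader.lines().map_while(Result::ok).count();
    println!("{} FASTQ records", lines / 4); // 4 lines per record
    Ok(())
}
```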
## Downstream BAM compression

The read-clustering effect carries through into downstream unsorted BAM files at essentially full strength. Because aligners typically stream reads out in the same order they came in, and BAM uses gzip-based (BGZF) compression internally, the same clumping improvements hold through alignment.

Here's an example using ATAC-seq data (31 M paired-end reads), aligned to a minimal index (chr22 only):
| BAM stage | Saving (clumpify vs plain) |
|---|---|
| Trimmed FASTQ (gzip level 1) | −34.8% |
| `samtools import` → uBAM (no alignment) | −36.4% |
| STAR 2.7.11b chr22 alignment → unsorted, aligned BAM | −34.2% |
Note that only unsorted BAMs benefit. If your pipeline coordinate-sorts the BAM immediately after alignment (e.g. `samtools sort` or STAR `--outSAMtype BAM SortedByCoordinate`), reads are rearranged by genomic position and the input-order signal is erased: a coordinate-sorted BAM's size is determined by the genomic distribution of reads, not by clumpify's clustering.
## Benchmark results

Real-world numbers from a benchmark on a MacBook Pro (Apple Silicon, 16 GiB RAM, `--cores 6 --memory 1G`, all defaults). Each dataset has three bars:

- `--clumpify` (level 1, default)
- `--compression 6` (no clumpify)
- `--clumpify --compression 6`

The plots show compression savings (how much smaller the resulting FASTQ files are than the regular run) and the wall-time effect (how much slower the run was; 1× is the regular run).
Datasets covered:

- MiSeq amplicon (CRISPR): 4.4M SE, 500 MB plain output — `ERR16944282`
- ChIP-seq (Illumina SE): 28.6M SE, 1.5 GB — `SRX747791`
- WES (Illumina SE): 105M SE, 9.2 GB — `SRR7890918`
- Long-read (ONT): 100K SE, 558 MB — `SRR37915503`
- ATAC-seq (Illumina PE): 31.5M PE, 2.9 GB — `SRX2717909`
- Ribo-seq (Illumina PE): ~30M PE, 4.0 GB — `SRX11780879` (`SRR15480782`)
- RNA-seq (Illumina PE): 93M PE, 17.0 GB — `SRX1603629`
- scRNA-seq 10x Chromium (PE): 392M PE, 39.6 GB — `pbmc8k_v2` (10x Genomics public dataset)
## Comparison with other tools

If you've used BBMap's `clumpify.sh` or `stevekm/squish`, `--clumpify` produces compression results in the same ballpark on most data:
- On amplicon-type data, all three tools converge to ~37–38% saving at gzip level 9.
- On diverse data (WES, WGS), all three give ~20–30% saving — none of them works miracles on inherently diverse libraries.
The advantage of `--clumpify` over running a separate tool is that it's part of the trim pass, so there's no extra read-and-rewrite cycle. For an X GB input, a separate clumpify step would mean reading the trimmed output, sorting it, and writing it back: typically +3–5× the trim wall time as well as double the disk I/O. `--clumpify` does it in one pass.