Summary. Deduplicate lines across one or more input files while preserving structure: every unique line is kept; every duplicate line is kept only at its first occurrence; and for each unique line we keep up to N lines of context (the lines immediately before it). Blank or whitespace-only lines are never deduplicated. Output order is the same as input order. Useful for compressing multi-page HTML→markdown where repeated headers/nav would be removed by block dedup, but we still want one copy of shared content and local context around unique content.
node compress.js [options] file1 [file2 ...]
Output is written to stdout. A summary is written to stderr.
| Option | Description |
|---|---|
| `-n N`, `--context N` | Number of lines to keep before each unique line (default: 4). |
| `--drop` | Omit dropped lines (default). |
| `--prepend` | Keep dropped lines but prepend them with `COMPRESS:` (for testing). |
| `--help`, `-h` | Show usage and exit. |
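For illustration, the options above could be parsed roughly as follows. This is a minimal sketch, not the tool's actual code: it assumes Node.js 18.3 or later (for `util.parseArgs`), and `parseCli` and the shape of its return value are hypothetical names chosen for this example.

```javascript
const { parseArgs } = require("node:util");

// Hypothetical helper: turn an argv-style array into a settings object.
function parseCli(argv) {
  const { values, positionals } = parseArgs({
    args: argv,
    allowPositionals: true, // file1 [file2 ...]
    options: {
      context: { type: "string", short: "n" },
      drop: { type: "boolean" },
      prepend: { type: "boolean" },
      help: { type: "boolean", short: "h" },
    },
  });
  return {
    // Default context size of 4, as stated in the options table.
    context: Number.parseInt(values.context ?? "4", 10),
    // --drop is the default, so only --prepend changes behavior here.
    prepend: Boolean(values.prepend),
    help: Boolean(values.help),
    files: positionals,
  };
}

console.log(parseCli(["-n", "2", "--prepend", "a.md", "b.md"]));
```

In a script this would typically be called as `parseCli(process.argv.slice(2))`. How the real tool resolves `--drop` and `--prepend` given together is not stated here; the sketch simply lets `--prepend` win.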
Example (default: dropped lines are omitted):
Summary:
Input files: 3
Input lines: 17
Input chars: 117
Output lines: 13 (13 kept, 4 dropped)
Output chars: 89
With `--prepend`, the summary instead shows (13 kept, 4 with COMPRESS: prefix), and the output includes those 4 lines with the `COMPRESS:` prefix.
Input: A stream of lines (e.g. from one or more files concatenated in order).
Parameters: Context size N (number of lines to keep before each unique line).
Output: The same stream with duplicates removed under the rules below; dropped lines are omitted by default, or prefixed with COMPRESS: when using --prepend.
- Read all input into an array `lines[0..L-1]` in order (preserving line boundaries and content).
- Build a frequency map: for each distinct line string `s`, count how many indices `i` have `lines[i] === s`. Call this `freq(s)`.
- Build "first index of": for each distinct line `s`, store the smallest `i` such that `lines[i] === s`. Call this `first(s)`.
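The two maps can be built in a single pass. The sketch below is one possible way to do it, assuming the input has already been split into an array of strings; `buildIndex` is a hypothetical name, not something defined by the tool.

```javascript
// Hypothetical helper: compute freq(s) and first(s) for every distinct line.
function buildIndex(lines) {
  const freq = new Map();  // line string -> number of occurrences
  const first = new Map(); // line string -> smallest index where it appears
  lines.forEach((line, i) => {
    freq.set(line, (freq.get(line) || 0) + 1);
    if (!first.has(line)) first.set(line, i); // only the first sighting is recorded
  });
  return { freq, first };
}

const { freq, first } = buildIndex(["nav", "intro", "nav", "body"]);
console.log(freq.get("nav"), first.get("nav"));   // 2 0
console.log(freq.get("body"), first.get("body")); // 1 3
```

A `Map` is used rather than a plain object so that lines such as `constructor` or `__proto__` cannot collide with inherited property names.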
Maintain a set (or boolean array) `keep` of size `L`, initially empty/false.
- Unique lines: For each index `i`: if `freq(lines[i]) === 1`, set `keep[i] = true`.
- Context for unique lines: For each index `i` where `lines[i]` is unique: set `keep[j] = true` for all `j` in `[max(0, i - N), i - 1]` (the `N` lines immediately before `i`; fewer if `i < N`).
- First occurrence of duplicates: For each distinct line `s` with `freq(s) > 1`: set `keep[first(s)] = true`.
- Blank/whitespace-only lines: For each index `i` where `lines[i]` is empty or contains only whitespace: set `keep[i] = true`. These lines are never deduplicated.
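The marking rules above might be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual code: `markKeep` is a hypothetical name, "whitespace-only" is interpreted as `line.trim() === ""`, and a blank line that happens to occur exactly once is treated as unique (so it also pulls in context), which is a literal reading of the rules.

```javascript
// Hypothetical helper: return a boolean array saying which indices are kept.
function markKeep(lines, n) {
  const freq = new Map();
  const first = new Map();
  lines.forEach((line, i) => {
    freq.set(line, (freq.get(line) || 0) + 1);
    if (!first.has(line)) first.set(line, i);
  });

  const keep = new Array(lines.length).fill(false);
  lines.forEach((line, i) => {
    if (freq.get(line) === 1) {
      keep[i] = true; // unique line
      // context: the n lines immediately before i (fewer if i < n)
      for (let j = Math.max(0, i - n); j < i; j++) keep[j] = true;
    } else if (first.get(line) === i) {
      keep[i] = true; // first occurrence of a duplicate
    }
    if (line.trim() === "") keep[i] = true; // blank or whitespace-only
  });
  return keep;
}

const lines = ["nav", "home", "intro", "nav", "home", "body"];
console.log(markKeep(lines, 1));
// [ true, true, true, false, true, true ]
```

In this example the second `home` (index 4) is a repeated line, yet it is kept because it falls in the one-line context window before the unique line `body`. The second `nav` (index 3) is in no context window and is dropped.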
- For `i = 0` to `L - 1`:
  - If `keep[i]` then output `lines[i]`.
  - Else: output nothing (default), or `COMPRESS:` + `lines[i]` (`--prepend`).
- Preserve original line endings / newlines between lines as in the input.
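The output pass could look roughly like this. It is a sketch only: `emit` is a hypothetical name, the lines are assumed to be already split without their terminators, and the exact spacing after the `COMPRESS:` prefix is an assumption, so the prefix is left as a parameter.

```javascript
// Hypothetical helper: produce the output lines from lines[] and keep[].
function emit(lines, keep, prepend = false, prefix = "COMPRESS:") {
  const out = [];
  lines.forEach((line, i) => {
    if (keep[i]) {
      out.push(line);          // kept lines pass through unchanged
    } else if (prepend) {
      out.push(prefix + line); // --prepend: mark the line instead of dropping it
    }
    // default (--drop): nothing is emitted for this index
  });
  return out;
}

const lines = ["nav", "intro", "nav", "body"];
const keep = [true, true, false, true];
console.log(emit(lines, keep));       // [ 'nav', 'intro', 'body' ]
console.log(emit(lines, keep, true)); // [ 'nav', 'intro', 'COMPRESS:nav', 'body' ]
```

Joining the result with `"\n"` is enough when the input uses a single newline style. Preserving mixed line endings would require storing each line's original terminator alongside its content and writing it back on output.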
| Line type | When it's kept |
|---|---|
| Unique (`freq === 1`) | At every occurrence (there is only one). |
| Duplicate (`freq > 1`) | Only at its first occurrence (`i === first(lines[i])`). |
| Blank/whitespace-only | Always (every occurrence). |
| Any line | If it lies in the context window (within the last `N` lines before some unique line), it is kept there as well. |
Overlaps (e.g. a duplicate that is also in the context of a unique line, or the first occurrence of a duplicate) are handled by the single keep set; each index is kept at most once in the output.