@digitalWestie
Last active March 9, 2026 17:30
Context-aware line deduplication

Summary. Deduplicate lines across one or more input files while preserving structure: every unique line is kept; every duplicate line is kept only at its first occurrence; and each unique line keeps up to N lines of context (the lines immediately before it). Blank or whitespace-only lines are never deduplicated, and output order matches input order. This is useful for compressing multi-page HTML→Markdown conversions, where repeated headers and navigation would be removed by block-level dedup but we still want one copy of shared content plus local context around unique content.


Usage

node compress.js [options] file1 [file2 ...]

Output is written to stdout. A summary is written to stderr.

Options

Option              Description
-n N, --context N   Number of lines to keep before each unique line (default: 4).
--drop              Omit dropped lines (default).
--prepend           Keep dropped lines but prepend them with "COMPRESS: " (for testing).
--help, -h          Show usage and exit.

Summary (stderr)

Example (default: dropped lines are omitted):

Summary:
  Input files:     3
  Input lines:     17
  Input chars:     117
  Output lines:    13 (13 kept, 4 dropped)
  Output chars:    89

With --prepend the summary shows (13 kept, 4 with COMPRESS: prefix) and the output includes those lines prefixed with "COMPRESS: ".


Algorithm

Input: A stream of lines (e.g. from one or more files concatenated in order).
Parameters: Context size N (number of lines to keep before each unique line).
Output: The same stream with duplicates removed under the rules below; dropped lines are omitted by default, or prefixed with COMPRESS: when using --prepend.


Phase 1 — Read and count

  1. Read all input into an array lines[0..L-1] in order (preserving line boundaries and content).
  2. Build a frequency map: for each distinct line string s, count how many indices i have lines[i] === s. Call this freq(s).
  3. Build “first index of”: for each distinct line s, store the smallest i such that lines[i] === s. Call this first(s).
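A minimal sketch of Phase 1 (the sample lines are illustrative; the variable names mirror the full script below):

```javascript
// Phase 1 sketch: count occurrences and record the first index per distinct line.
const lines = ['nav', 'Welcome', 'nav', 'nav', 'Contact'];

const freq = new Map();   // line text -> occurrence count
const first = new Map();  // line text -> smallest index with that text
for (let i = 0; i < lines.length; i++) {
  const s = lines[i];
  freq.set(s, (freq.get(s) ?? 0) + 1);
  if (!first.has(s)) first.set(s, i);
}
console.log(freq.get('nav'), first.get('nav')); // 3 0
```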

Phase 2 — Mark indices to keep

Maintain a set (or boolean array) keep of size L, initially empty/false.

  1. Unique lines For each index i: if freq(lines[i]) === 1, set keep[i] = true.

  2. Context for unique lines For each index i where lines[i] is unique: set keep[j] = true for all j in [max(0, i - N), i - 1] (the N lines immediately before i; fewer if i < N).

  3. First occurrence of duplicates For each distinct line s with freq(s) > 1: set keep[first(s)] = true.

  4. Blank/whitespace-only lines For each index i where lines[i] is empty or contains only whitespace: set keep[i] = true. These lines are never deduplicated.
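The four marking rules can be sketched as follows, with `freq`/`first` hard-coded as Phase 1 would produce them for the sample input, and a context window of N = 1:

```javascript
// Phase 2 sketch: mark the keep set for a small sample input.
const lines = ['nav', 'Welcome', 'nav', 'nav', 'Contact'];
const contextN = 1;
const freq = new Map([['nav', 3], ['Welcome', 1], ['Contact', 1]]);
const first = new Map([['nav', 0], ['Welcome', 1], ['Contact', 4]]);

const keep = new Set();
for (let i = 0; i < lines.length; i++) {
  if (freq.get(lines[i]) === 1) {
    keep.add(i);                                                     // 1. unique line
    for (let j = Math.max(0, i - contextN); j < i; j++) keep.add(j); // 2. its context
  }
}
for (const [s, count] of freq) {
  if (count > 1) keep.add(first.get(s));                             // 3. first occurrence of a duplicate
}
for (let i = 0; i < lines.length; i++) {
  if (lines[i].trim() === '') keep.add(i);                           // 4. blank lines always kept
}
console.log([...keep].sort()); // indices 0, 1, 3, 4 — index 2 is dropped
```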


Phase 3 — Emit output (preserve order)

  1. For i = 0 to L - 1:
    • If keep[i] then output lines[i].
    • Else: output nothing (default), or COMPRESS: + lines[i] (--prepend).
  2. Preserve original line endings / newlines between lines as in the input.
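The emit step, sketched with a hand-made keep set (assumed to come from Phase 2) and `--prepend` mode enabled:

```javascript
// Phase 3 sketch: emit kept lines in order; prefix dropped lines in prepend mode.
const lines = ['nav', 'Welcome', 'nav'];
const keep = new Set([0, 1]);  // assume index 2 is a dropped duplicate
const prepend = true;          // --prepend mode; false for the default --drop

const out = [];
for (let i = 0; i < lines.length; i++) {
  if (keep.has(i)) out.push(lines[i]);
  else if (prepend) out.push('COMPRESS: ' + lines[i]);
}
console.log(out); // [ 'nav', 'Welcome', 'COMPRESS: nav' ]
```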

Summary of what is kept

Line type               When it's kept
Unique (freq === 1)     At every occurrence (there is only one).
Duplicate (freq > 1)    Only at its first occurrence (i === first(lines[i])).
Blank/whitespace-only   Always (every occurrence).
Any line                Also wherever it falls in a context window (within the last N lines before some unique line).

Overlaps (e.g. a duplicate that is also in the context of a unique line, or the first occurrence of a duplicate) are handled by the single keep set; each index is kept at most once in the output.
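Putting the rules together on a tiny input (N = 1; the sample lines are illustrative). Note that the third line, a duplicate, survives because it is context for the unique line after it:

```javascript
// End-to-end sketch: dedupe with context N = 1.
const lines = ['header', 'alpha', 'header', 'beta', 'header'];
const N = 1;

const freq = new Map(), first = new Map();
lines.forEach((s, i) => {
  freq.set(s, (freq.get(s) ?? 0) + 1);
  if (!first.has(s)) first.set(s, i);
});

const keep = new Set();
lines.forEach((s, i) => {
  if (freq.get(s) === 1) {
    keep.add(i);                                           // unique line
    for (let j = Math.max(0, i - N); j < i; j++) keep.add(j); // its context
  }
  if (s.trim() === '') keep.add(i);                        // blanks always kept
});
for (const [s, c] of freq) if (c > 1) keep.add(first.get(s)); // first duplicate

const out = lines.filter((_, i) => keep.has(i));
console.log(out); // [ 'header', 'alpha', 'header', 'beta' ]
```

The final `'header'` (index 4) is the only dropped line: it is a repeat, not a first occurrence, and not in any unique line's context window.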

#!/usr/bin/env node
//
// Context-aware line deduplication.
//
// Usage:
//   node compress.js [options] file1 [file2 ...]  → stdout
//
// Options:
//   -n N, --context N  Number of lines to keep before each unique line (default: 4)
//   --drop             Omit dropped lines (default)
//   --prepend          Keep dropped lines but prepend with "COMPRESS: " (for testing)
const fs = require('fs');

const DROP_MODES = Object.freeze({ drop: 'drop', prepend: 'prepend' });

function parseArgs(argv) {
  const args = argv.slice(2);
  let contextN = 4;
  let dropMode = DROP_MODES.drop;
  const files = [];
  for (let i = 0; i < args.length; i++) {
    const a = args[i];
    if (a === '-n' || a === '--context') {
      contextN = parseInt(args[++i], 10);
      if (!Number.isInteger(contextN) || contextN < 0) {
        console.error('compress: -n/--context must be a non-negative integer');
        process.exit(1);
      }
    } else if (a === '--drop') {
      dropMode = DROP_MODES.drop;
    } else if (a === '--prepend') {
      dropMode = DROP_MODES.prepend;
    } else if (a === '--help' || a === '-h') {
      console.error(`Usage: node compress.js [options] file1 [file2 ...]
  -n N, --context N  Lines of context before each unique line (default: 4)
  --drop             Omit dropped lines (default)
  --prepend          Keep dropped lines, prepend with "COMPRESS: " (for testing)
  --help, -h         Show this help`);
      process.exit(0);
    } else {
      files.push(a);
    }
  }
  return { contextN, dropMode, files };
}
function readLines(files) {
  const lines = [];
  for (const file of files) {
    if (!fs.existsSync(file)) {
      console.error(`compress: File not found: ${file}`);
      process.exit(1);
    }
    const content = fs.readFileSync(file, 'utf8');
    // Split on \n or \r\n; strip a stray trailing \r from a file that ends mid-CRLF.
    const fileLines = content.split(/\r?\n/).map((line) => line.replace(/\r$/, ''));
    // If the file didn't end with a newline, the last element may be ''; keep it as-is.
    lines.push(...fileLines);
  }
  return lines;
}
function run(lines, contextN, dropMode) {
  const L = lines.length;
  if (L === 0) return [];
  const isBlank = (line) => line.trim() === '';

  // Phase 1: frequency and first occurrence
  const freq = new Map();
  const first = new Map();
  for (let i = 0; i < L; i++) {
    const s = lines[i];
    freq.set(s, (freq.get(s) ?? 0) + 1);
    if (!first.has(s)) first.set(s, i);
  }

  // Phase 2: mark indices to keep
  const keep = new Set();

  // 2a. Unique lines
  for (let i = 0; i < L; i++) {
    if (freq.get(lines[i]) === 1) keep.add(i);
  }

  // 2b. Context before each unique line
  for (let i = 0; i < L; i++) {
    if (freq.get(lines[i]) !== 1) continue;
    const start = Math.max(0, i - contextN);
    for (let j = start; j < i; j++) keep.add(j);
  }

  // 2c. First occurrence of each duplicate
  for (const [s, count] of freq) {
    if (count > 1) keep.add(first.get(s));
  }

  // 2d. Never deduplicate blank/whitespace-only lines: always keep them
  for (let i = 0; i < L; i++) {
    if (isBlank(lines[i])) keep.add(i);
  }

  // Phase 3: emit in order (--drop: skip dropped lines; --prepend: prefix them)
  const out = [];
  for (let i = 0; i < L; i++) {
    if (keep.has(i)) {
      out.push(lines[i]);
    } else if (dropMode === DROP_MODES.prepend) {
      out.push('COMPRESS: ' + lines[i]);
    }
  }
  return out;
}
const { contextN, dropMode, files } = parseArgs(process.argv);
if (files.length === 0) {
  console.error('Usage: node compress.js [options] file1 [file2 ...]\n  Use --help for options.');
  process.exit(1);
}

const lines = readLines(files);
const result = run(lines, contextN, dropMode);

// In --prepend mode the output has one entry per input line, so a kept line is
// exactly one whose text is unchanged. (Checking for the "COMPRESS: " prefix
// would miscount input lines that already start with it.)
const keptCount =
  dropMode === DROP_MODES.prepend
    ? result.filter((l, i) => l === lines[i]).length
    : result.length;
const droppedCount = lines.length - keptCount;
const inputChars = lines.join('\n').length;
const outputChars = (result.join('\n') + (result.length ? '\n' : '')).length;
const summaryDetail =
  dropMode === DROP_MODES.prepend
    ? `(${keptCount} kept, ${droppedCount} with COMPRESS: prefix)`
    : `(${keptCount} kept, ${droppedCount} dropped)`;

console.error('Summary:');
console.error(`  Input files:     ${files.length}`);
console.error(`  Input lines:     ${lines.length}`);
console.error(`  Input chars:     ${inputChars}`);
console.error(`  Output lines:    ${result.length} ${summaryDetail}`);
console.error(`  Output chars:    ${outputChars}`);
console.error('');

process.stdout.write(result.join('\n') + (result.length ? '\n' : ''));
process.exit(0);