Redact-a-bit documentation

Redact-a-bit swaps the personal details in a document for realistic fakes, so the file still reads normally and the numbers still add up. Everything runs on your own machine.

Getting started

There are two ways to run Redact-a-bit today.

Python tool (works now)

The engine runs anywhere Python does, with a browser UI or straight from the command line.

pip install gradio pymupdf
python redactabit.py                 # opens the UI at http://localhost:7860
python redactabit.py statement.pdf -l 2 -m fake   # or run it headless

Desktop app

The one-click desktop app bundles the redactor and the optional models, with no Python needed. Prebuilt installers are coming to the Releases page; until then you can build it from source:

git clone https://github.com/prPMDev/redactabit.git
cd redactabit
npm install && npm run tauri build   # installers land in src-tauri/target/release/bundle/

Your first redaction

Drop in a PDF or text file, choose how much to take out (a level) and how to replace it (a mode), then download the clean copy. Nothing is uploaded.

How it works

A level decides how much Redact-a-bit removes; a mode decides what the removed values are replaced with.

Levels

Standard is the default — it removes what identifies you and keeps the figures, so the document still makes sense.

Level | Removes | Keeps
Light | The critical IDs: SSNs, ITINs, bank accounts, routing and card numbers, passport / visa / USCIS numbers | Names, contact details, amounts, dates, zips
Standard (default) | Everything in Light, plus names, email, phone, street address, date of birth, EIN | Dollar amounts, dates, zip codes
Heavy | Everything in Standard, plus dollar amounts and zip codes | Little beyond non-identifying text
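
One way to picture the levels is as cumulative sets, where each level adds categories to the one before it. The sketch below is illustrative only: the category names, the LEVELS mapping, and is_removed are hypothetical and are not taken from Redact-a-bit's code.

```python
# Hypothetical category names, grouped the way the levels are described.
LIGHT = {"ssn", "itin", "bank_account", "routing_number", "card_number",
         "passport", "visa", "uscis"}
STANDARD = LIGHT | {"name", "email", "phone", "street_address",
                    "date_of_birth", "ein"}
HEAVY = STANDARD | {"dollar_amount", "zip_code"}

LEVELS = {"light": LIGHT, "standard": STANDARD, "heavy": HEAVY}


def is_removed(category: str, level: str = "standard") -> bool:
    """Return True if the given category is removed at the given level."""
    return category in LEVELS[level]
```

Because each set is built from the previous one, anything removed at Light is also removed at Standard and Heavy.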

Modes

  • Fake — a realistic stand-in (a different name, a different SSN). Best for AI, because the document still reads naturally.
  • Mask — only the last few characters show, like XXX-XX-6789. Good for confirming you have the right document.
  • Label — a plain tag like [SSN REDACTED]. Best for formal or legal sharing.

In the Python command line the third mode is named redact rather than Label.
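
To make the difference between the modes concrete, here is a minimal sketch of how a single SSN might be replaced under each one. The function replace_ssn and its fixed fake value are hypothetical; the real tool generates its own stand-ins and handles many more value types.

```python
def replace_ssn(ssn: str, mode: str, fake: str = "812-34-5678") -> str:
    """Replace one SSN string according to the chosen mode.

    `fake` is a placeholder stand-in used only for this illustration.
    """
    if mode == "fake":
        return fake
    if mode == "mask":
        # Keep only the last four digits visible.
        return "XXX-XX-" + ssn[-4:]
    if mode == "label":
        return "[SSN REDACTED]"
    raise ValueError(f"unknown mode: {mode}")
```

Called with "123-45-6789", the three modes give a different full number, XXX-XX-6789, and [SSN REDACTED] respectively.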

What it detects

Built-in rules catch structured, US-style identifiers:

  • SSNs, ITINs, and EINs
  • Bank account and routing numbers, credit-card numbers
  • Phone numbers, email addresses, and street addresses
  • Dates of birth; passport, visa, and USCIS numbers; Indian PAN; labeled IDs (Reference #, Envelope #, and the like)
  • At Heavy, also dollar amounts and zip codes

Anything you add as a custom term — a name, an employer, a project — is always removed, at every level. For names and addresses that don't follow a fixed format, the optional models help.
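
As a rough illustration of rule-based detection plus custom terms, the sketch below uses two simplified regular expressions and literal matching for user-supplied terms. The patterns, the PATTERNS table, and find_matches are assumptions made for this example; Redact-a-bit's actual rules are not shown here and are likely stricter.

```python
import re

# Simplified, hypothetical patterns; production rules need more care.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_matches(text: str, custom_terms: tuple[str, ...] = ()) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs in order of appearance."""
    hits = []
    for category, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), category, m.group()))
    for term in custom_terms:
        # Custom terms are matched literally, ignoring case.
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((m.start(), "custom", m.group()))
    hits.sort()  # order by position in the text
    return [(category, value) for _, category, value in hits]
```

Fixed-format values such as SSNs suit this kind of matching well; free-form names and addresses do not, which is where the optional models come in.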

The built-in rules are US-centric. Non-US documents are partly covered today; see the multilingual model under Models.

Fake data

Fake replacements are deterministic: the same input and the same seed produce the same fake every run, so a document stays consistent with itself. Fake SSNs use the 800–899 range, so they're never a real SSN or ITIN, and dollar amounts are kept in proportion, so the numbers still behave for analysis. The seed just picks which set of fakes you get — it isn't a password, and it can't reverse a redaction.
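
The sketch below shows one way deterministic fakes could work: hash the seed together with the original value and map the result into the 800-899 area range. It also shows one reading of "kept in proportion", namely multiplying every amount by a single factor. Both fake_ssn and scale_amounts are hypothetical, and neither is claimed to be the method Redact-a-bit uses.

```python
import hashlib


def fake_ssn(original: str, seed: str) -> str:
    """Derive a stable fake SSN with an area number in 800-899."""
    digest = hashlib.sha256(f"{seed}:{original}".encode()).digest()
    number = int.from_bytes(digest[:8], "big")
    area = 800 + number % 100              # 800..899
    group = 1 + (number // 100) % 99       # 01..99
    serial = 1 + (number // 9900) % 9999   # 0001..9999
    return f"{area:03d}-{group:02d}-{serial:04d}"


def scale_amounts(amounts: list[float], factor: float) -> list[float]:
    """Multiply every amount by one factor so ratios are preserved."""
    return [round(a * factor, 2) for a in amounts]
```

The same original value and seed always give the same fake, so a value that appears several times in a document is replaced consistently.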

Models

Detection has two layers. Built-in rules are always on and need nothing extra. Optional models add machine learning for what rules miss — mainly names and addresses — and they run entirely on your machine.

Built-in rules

Always on, 0 MB, no download. These are the pattern matchers listed under What it detects: fast and reliable on structured IDs like SSNs, cards, and account numbers.

GLiNER PII base (default)

Trained specifically on personal data: names, addresses, accounts, IDs. About 197 MB, downloaded once, then it works offline.

GLiNER multi PII

For non-English documents — French, German, Spanish, Italian, Portuguese. About 349 MB, downloaded once.

A model downloads from the catalog the first time you pick it, then runs offline. Picking one is optional — the built-in rules work on their own.

Privacy

Your document stays on your computer. Redact-a-bit doesn't upload your file, ask you to sign in, or phone home. The Python UI listens on 127.0.0.1 only, so nothing else on your network can reach it.
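
To show what listening on 127.0.0.1 means in practice, here is a minimal sketch using only Python's standard library. It is not Redact-a-bit's own server code (the tool's UI is built on Gradio); make_local_server is a hypothetical helper for illustration.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


def make_local_server(port: int = 0) -> HTTPServer:
    """Create an HTTP server bound to the loopback interface only.

    Port 0 asks the operating system for any free port.
    """
    return HTTPServer(("127.0.0.1", port), SimpleHTTPRequestHandler)
```

A server bound this way accepts connections only from the same machine. Binding to 0.0.0.0 instead would expose it to other devices on the network.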

The optional models download once, and after that the whole thing runs with the Wi-Fi off — the airplane-mode test. Redact-a-bit is open source (AGPL-3.0), so you can read the code and confirm all of this for yourself.

FAQ

Does it really work offline?

Yes. The redaction engine needs no network at all. The optional models download once; after that you can pull the plug and Redact-a-bit still works.

Does it read scanned documents?

Not yet. Redact-a-bit needs a real text layer, so a scanned PDF with no embedded text won't work. OCR support is planned.

Which files can I use?

PDF files and plain-text files (.txt, .md).

Is any of my data uploaded?

No. Nothing about your document leaves your machine.

What about non-US documents?

The built-in rules are US-centric. The multilingual model helps with French, German, Spanish, Italian, and Portuguese; broader coverage is planned.

Can the fake data be reversed?

No. The seed only selects which fakes you get — it isn't a key, and it can't restore the original values.