Strip UTF-8 BOMs from strings, buffers, streams, and files

When it’s useful

Files from Windows editors, exports, or legacy tools sometimes start with an invisible UTF-8 byte order mark. That single prefix can break string equality, JSON or CSV parsers, and integrations that expect a clean first character. You want one small utility that handles the common shapes of input without pulling in a larger stack.
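
To make the failure concrete, here is a small standard-library sketch (it does not use the strip-bom package, and the sample payload is invented for illustration). It shows how a plain UTF-8 decode keeps the BOM as a leading U+FEFF character, which then defeats string equality and makes `json.loads` reject the text.

```python
import json

BOM = "\ufeff"  # U+FEFF: what the bytes EF BB BF become under a plain UTF-8 decode

# Hypothetical payload, as a BOM-writing editor might save it.
raw = b'\xef\xbb\xbf{"name": "demo"}'
text = raw.decode("utf-8")  # the plain "utf-8" codec keeps the BOM character

starts_with_bom = text.startswith(BOM)
equal_to_clean = text == '{"name": "demo"}'  # False: invisible prefix differs

try:
    json.loads(text)
    json_failed = False
except json.JSONDecodeError:
    # json.loads refuses str input that starts with U+FEFF
    json_failed = True

# Dropping the single leading character makes the text parse again.
cleaned = text[len(BOM):] if starts_with_bom else text
parsed = json.loads(cleaned)
```

The two strings look identical when printed, which is why this class of bug is easy to miss.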

What you can do

Normalize Unicode strings by removing a leading UTF-8 BOM when present.
Clean bytes or bytearray values with a buffer helper that validates UTF-8 first; input that is not valid UTF-8 is left unchanged, so binary-like payloads are not silently corrupted.
Stream large inputs chunk by chunk so you are not forced to load whole files into memory.
Read files in text or binary mode and get back content with the BOM removed when it was there.
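
The four input shapes above can be sketched with the standard library alone. This is a rough illustration of the behavior described, not the strip-bom package's actual API: every function name below (`strip_bom_str`, `strip_bom_bytes`, `strip_bom_chunks`, `read_file_text`, `read_file_bytes`) is hypothetical, so check the repository for the real function list.

```python
import codecs
from typing import Iterable, Iterator, Union

BOM_CHAR = "\ufeff"          # the BOM as a decoded character
BOM_BYTES = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def strip_bom_str(text: str) -> str:
    """Remove one leading BOM character from a string, if present."""
    return text[1:] if text.startswith(BOM_CHAR) else text


def strip_bom_bytes(data: Union[bytes, bytearray]) -> Union[bytes, bytearray]:
    """Remove a leading UTF-8 BOM only when the whole buffer is valid UTF-8."""
    if not data.startswith(BOM_BYTES):
        return data
    try:
        bytes(data).decode("utf-8")
    except UnicodeDecodeError:
        return data  # not valid UTF-8: hand it back untouched
    return data[len(BOM_BYTES):]


def strip_bom_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield chunks with a leading BOM removed.

    Buffers only until the first three bytes are available, so the BOM is
    found even if it is split across chunks. Unlike strip_bom_bytes, this
    sketch does not validate UTF-8, since that would need the whole stream.
    """
    it = iter(chunks)
    head = b""
    for chunk in it:
        head += chunk
        if len(head) >= len(BOM_BYTES):
            break
    if head.startswith(BOM_BYTES):
        head = head[len(BOM_BYTES):]
    if head:
        yield head
    yield from it  # remaining chunks pass through unchanged


def read_file_text(path: str) -> str:
    """Read a text file; the utf-8-sig codec drops a leading BOM if present."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


def read_file_bytes(path: str) -> bytes:
    """Read a file in binary mode and strip a leading UTF-8 BOM."""
    with open(path, "rb") as f:
        return strip_bom_bytes(f.read())
```

The chunked version holds back at most the first few chunks, so memory use stays bounded by chunk size rather than file size.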

Limits and fit

The library targets UTF-8 BOM handling as described in its docs and tests, not every legacy encoding or non-UTF-8 BOM variant. For the buffer helpers, the “leave invalid UTF-8 alone” rule is a deliberate safety measure; the helpers are not a general-purpose transcoder. The library requires Python 3.8+ and installs from PyPI as strip-bom. Examples and the full function list are in the GitHub repository.
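
If your inputs may carry other BOMs, one option is to detect them up front and route those files to a different tool. The sketch below uses only the `codecs` constants from the standard library; `detect_bom` is a hypothetical helper written for this illustration and is not part of strip-bom.

```python
import codecs
from typing import Optional

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result other than "utf-8" or None signals input that falls outside the UTF-8 scope described above.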

