Strip UTF-8 BOMs from strings, buffers, streams, and files
When it’s useful
Files from Windows editors, exports, or legacy tools sometimes start with an invisible UTF-8 byte order mark. That single prefix can break string equality, JSON or CSV parsers, and integrations that expect a clean first character. You want one small utility that handles the common shapes of input without pulling in a larger stack.
What you can do
- Normalize Unicode strings by removing a leading UTF-8 BOM when present.
- Clean
bytes/bytearraywith a buffer helper that validates UTF-8 first; invalid sequences are left unchanged so you do not silently corrupt binary-ish payloads. - Stream large inputs chunk by chunk so you are not forced to load whole files into memory.
- Read files in text or binary mode and get back content with the BOM removed when it was there.
Limits and fit
The library targets UTF-8 BOM handling as described in the docs and tests, not every legacy encoding or non-UTF-8 BOM variant. For buffer helpers, the “leave invalid UTF-8 alone” rule is deliberate safety, not a general-purpose transcoder. Python 3.8+; install from PyPI (strip-bom). Examples and the full function list are in the GitHub repository.



