Strip BOM

Strip UTF-8 byte order mark (BOM) from strings, bytes, streams, and files

Gallery image 1

Strip UTF-8 BOMs from strings, buffers, streams, and files

When it’s useful

Files from Windows editors, exports, or legacy tools sometimes start with an invisible UTF-8 byte order mark. That single prefix can break string equality, JSON or CSV parsers, and integrations that expect a clean first character. You want one small utility that handles the common shapes of input without pulling in a larger stack.

What you can do

  • Normalize Unicode strings by removing a leading UTF-8 BOM when present.
  • Clean bytes / bytearray with a buffer helper that validates UTF-8 first; invalid sequences are left unchanged so you do not silently corrupt binary-ish payloads.
  • Stream large inputs chunk by chunk so you are not forced to load whole files into memory.
  • Read files in text or binary mode and get back content with the BOM removed when it was there.

Limits and fit

The library targets UTF-8 BOM handling as described in the docs and tests, not every legacy encoding or non-UTF-8 BOM variant. For buffer helpers, the “leave invalid UTF-8 alone” rule is deliberate safety, not a general-purpose transcoder. Python 3.8+; install from PyPI (strip-bom). Examples and the full function list are in the GitHub repository.

You might also like

Explore All Blogs