Unicode Normalizer & Zero-Width Cleaner

Normalize Unicode text and remove invisible zero-width characters.

Privacy First

This tool runs entirely in your browser. No data is sent to any server. Your input remains completely private.

Understanding Unicode Normalization

Unicode is the universal character encoding standard that allows text in any language to be represented digitally. However, Unicode allows some characters to be represented in multiple ways. For example, the character "é" can be stored as a single precomposed code point (U+00E9) or as "e" followed by a combining acute accent (U+0065 U+0301). Both render identically but are stored as different code point sequences.

Unicode normalization is the process of converting text to a consistent, canonical form. This ensures that equivalent characters are stored the same way, which is essential for text comparison, searching, and data processing.
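The difference between the two representations is easy to see in code. The following is a minimal sketch using Python's standard unicodedata module; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False

# After NFC normalization both strings hold the same code points.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                  # True
```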

Why Unicode Normalization Matters

String Comparison

Two strings that look identical might not be equal in code if they use different Unicode representations. Normalizing both strings before comparison ensures accurate matching.

Database Consistency

When storing user input in databases, normalization ensures consistent storage. Without it, the same apparent text might create duplicate entries.
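As a rough illustration, duplicates can be avoided by normalizing each value before it is used as a key. The sketch below assumes NFC as the storage form and uses an in-memory set in place of a real database; dedupe_by_nfc is a hypothetical helper name.

```python
import unicodedata

def dedupe_by_nfc(values):
    """Return NFC-normalized values with duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for value in values:
        key = unicodedata.normalize("NFC", value)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

# "Zoë" appears twice: once precomposed, once as "e" + combining diaeresis.
names = ["Zo\u00eb", "Zoe\u0308", "Zoe"]
print(len(dedupe_by_nfc(names)))  # 2
```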

Password Security

Users might enter passwords with different Unicode representations of the same characters, causing login failures. Normalizing passwords during registration and login prevents this issue.
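One way to apply this is to normalize the password immediately before hashing, in both the registration and the login path. The sketch below assumes NFC and PBKDF2 from Python's standard library; the function names and the iteration count are illustrative assumptions, not recommendations.

```python
import hashlib
import hmac
import os
import unicodedata

ITERATIONS = 10_000  # illustrative value only; choose per your own security policy

def hash_password(password: str, salt: bytes) -> bytes:
    # Normalize first so equivalent Unicode input yields the same bytes.
    normalized = unicodedata.normalize("NFC", password)
    return hashlib.pbkdf2_hmac("sha256", normalized.encode("utf-8"), salt, ITERATIONS)

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    # Constant-time comparison of the derived keys.
    return hmac.compare_digest(hash_password(password, salt), expected)

salt = os.urandom(16)
stored = hash_password("caf\u00e9-secret", salt)           # registered with precomposed "é"
print(verify_password("cafe\u0301-secret", salt, stored))  # True: decomposed input still matches
```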

Search Functionality

Search queries and indexed content should use the same normalization form to ensure searches find all matching results.
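A small sketch of the idea: the same function is applied to documents at index time and to queries at search time. NFKC is chosen here as an assumption, and search_key, index, and search are hypothetical names for a toy in-memory word index.

```python
import unicodedata

def search_key(text: str) -> str:
    # Used for both indexing and querying so the forms always match.
    return unicodedata.normalize("NFKC", text)

documents = [
    "Cafe\u0301 menu",                         # decomposed "é"
    "\uff28\uff45\uff4c\uff4c\uff4f world",    # full-width "Hello"
    "Plain text",
]

index = {}
for doc_id, doc in enumerate(documents):
    for word in search_key(doc).split():
        index.setdefault(word, set()).add(doc_id)

def search(query: str):
    return sorted(index.get(search_key(query), set()))

print(search("Caf\u00e9"))  # [0]
print(search("Hello"))      # [1]
```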

The Four Normalization Forms

NFC (Canonical Decomposition, followed by Canonical Composition)

The most commonly used form. Characters are decomposed and then recomposed into their precomposed (single-character) form when possible. Recommended for most use cases.

NFD (Canonical Decomposition)

Characters are decomposed into their constituent parts but not recomposed. Accented characters become base character + combining marks.

NFKC (Compatibility Decomposition, followed by Canonical Composition)

Like NFC, but also normalizes "compatibility characters" (e.g., full-width letters, circled numbers) to their standard equivalents.

NFKD (Compatibility Decomposition)

Like NFD, but also decomposes compatibility characters. Useful for aggressive text matching and search.
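To compare the four forms side by side, the sketch below normalizes one sample string and prints the resulting code points. The sample string and the code_points helper are illustrative choices.

```python
import unicodedata

# Sample: "fi" ligature, precomposed "é", full-width "A", circled "1"
sample = "\ufb01\u00e9\uff21\u2460"

def code_points(text: str) -> str:
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

results = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in results.items():
    print(f"{form:<5}{code_points(text)}")

# NFC  U+FB01 U+00E9 U+FF21 U+2460
# NFD  U+FB01 U+0065 U+0301 U+FF21 U+2460
# NFKC U+0066 U+0069 U+00E9 U+0041 U+0031
# NFKD U+0066 U+0069 U+0065 U+0301 U+0041 U+0031
```

Only the two compatibility forms replace the ligature, the full-width letter, and the circled digit; the canonical forms differ only in how the accented letter is stored.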

Zero-Width and Invisible Characters

Zero-width characters are Unicode characters that have no visible representation but affect text processing:

  • Zero Width Space (U+200B): Invisible space that allows line breaks
  • Zero Width Non-Joiner (U+200C): Prevents characters from joining
  • Zero Width Joiner (U+200D): Forces characters to join (used in emoji)
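  • Byte Order Mark (U+FEFF): Marks byte order at the start of a file; elsewhere it acts as an invisible zero width no-break space, and it can show up as garbage when text is decoded with the wrong encoding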
  • Soft Hyphen (U+00AD): Invisible hyphen that appears only at line breaks
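Removing the characters listed above can be sketched with a translation table. This is a minimal example, not this tool's actual implementation; INVISIBLE and strip_invisible are hypothetical names.

```python
# The five characters from the list above.
INVISIBLE = "\u200b\u200c\u200d\ufeff\u00ad"
REMOVE_TABLE = str.maketrans("", "", INVISIBLE)

def strip_invisible(text: str) -> str:
    """Delete every character listed in INVISIBLE."""
    return text.translate(REMOVE_TABLE)

dirty = "He\u200bl\u200blo"
clean = strip_invisible(dirty)
print(len(dirty), len(clean))  # 7 5
print(clean)                   # Hello
```

Blanket removal is not always appropriate: the zero width joiner is part of many emoji sequences, and joiners and non-joiners carry meaning in some scripts, so a real cleaner may want to make each character optional.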

Common Problems Caused by Invisible Characters

Copy-Paste Bugs

Copying text from websites or documents often includes invisible characters that break code, configurations, or comparisons.

Password Issues

Invisible characters in passwords cause login failures because the stored password differs from what appears to be typed.

Text Comparison Failures

Strings that look identical fail equality checks due to hidden characters.
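When two strings look the same but compare unequal, listing the hidden characters with their positions usually explains why. The sketch below checks only the five characters discussed earlier; SUSPECTS and find_hidden are hypothetical names.

```python
import unicodedata

SUSPECTS = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"}

def find_hidden(text: str):
    """Return (index, code point, Unicode name) for each suspect character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in SUSPECTS
    ]

for hit in find_hidden("pass\u200bword\ufeff"):
    print(hit)

# (4, 'U+200B', 'ZERO WIDTH SPACE')
# (9, 'U+FEFF', 'ZERO WIDTH NO-BREAK SPACE')
```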

Privacy and Security

All normalization and character removal happens entirely in your browser. Your text never leaves your computer, making this tool safe for processing sensitive content.

Common Use Cases

Database Data Cleanup

Normalize user-submitted text before storing in databases to ensure consistent data and prevent duplicate entries.

Password Processing

Normalize passwords during registration and login to prevent authentication failures caused by different Unicode representations.

Search Optimization

Normalize both search queries and indexed content to ensure comprehensive search results regardless of input method.

Copy-Paste Debugging

Remove invisible characters from text copied from websites, documents, or terminals that cause unexpected behavior.

Code Cleanup

Remove zero-width characters from source code that cause mysterious compilation or runtime errors.

Data Import Preparation

Clean and normalize text data before importing into systems that require consistent encoding.

Worked Examples

NFC Normalization

Input

cafe\u0301 (decomposed e + combining accent)

Output

café (single precomposed character)

The "e" followed by combining acute accent (U+0301) is normalized to a single precomposed "é" character (U+00E9).

Remove Zero-Width Characters

Input

Hello (with hidden U+200B between letters)

Output

Hello (clean, 5 characters)

Zero-width spaces that were invisibly embedded between letters are detected and removed, leaving clean text.

Frequently Asked Questions

Which normalization form should I use?

NFC is recommended for most use cases. It typically produces a compact representation while preserving semantic meaning. Use NFKC if you also want to normalize compatibility characters like full-width letters.

Why does my text look the same but fail equality checks?

The text likely contains different Unicode representations of the same characters or invisible characters. Use this tool to detect hidden characters and normalize the text to a consistent form.

Will normalization change my text visually?

NFC and NFD preserve visual appearance. NFKC and NFKD may change appearance by converting compatibility characters (e.g., full-width to half-width, circled numbers to regular numbers).

How do invisible characters get into my text?

They often come from copy-pasting from web pages, Word documents, PDFs, or other rich text sources. Some websites intentionally add them for tracking. Text editors may also insert them.

Is my text sent to any server?

No, all processing happens locally in your browser using JavaScript. Your text never leaves your device, making it safe to process sensitive content.

Can invisible characters cause security issues?

Yes. Invisible characters can make two different strings look identical, which attackers can exploit for phishing (in the same spirit as homograph attacks), to bypass security filters, or to create misleading URLs and usernames.
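One possible defense is to flag identifiers that contain format characters before accepting them. The sketch below relies on the Unicode general category "Cf" (format), which covers the zero-width characters discussed on this page; has_format_chars is a hypothetical helper, and whether to reject or merely strip such input is a policy decision.

```python
import unicodedata

def has_format_chars(text: str) -> bool:
    """True if any character has the Unicode general category Cf (format)."""
    return any(unicodedata.category(ch) == "Cf" for ch in text)

registered = "admin"
lookalike = "ad\u200bmin"   # renders like "admin" but contains a zero width space

print(registered == lookalike)       # False
print(has_format_chars(registered))  # False
print(has_format_chars(lookalike))   # True
```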