Free Byte Counter

Paste text and see its byte size in UTF-8, UTF-16, and ASCII. Great for checking database column limits.

How It Works

  1. Enter or paste text: Type or paste any text into the input field.
  2. View byte counts: The tool instantly shows the byte count in UTF-8, UTF-16, ASCII, and other encodings side by side.
  3. Check limits: Compare the byte count against common limits (SMS: 160 chars, HTTP headers: 8 KB, database fields, etc.) to see if your content fits.

Why Use Byte Counter?

Character count and byte count are not the same. A single emoji can be 4 bytes in UTF-8. Arabic letters take 2 bytes each and Chinese characters take 3. Many systems enforce byte limits, not character limits, including MySQL index keys and row sizes, Redis values, HTTP headers, SMS messages, and cloud storage object names. The Byte Counter reveals the actual byte size of your text in each encoding so you can stay within system constraints.

Frequently Asked Questions

Why is my byte count larger than my character count?

Many characters take more than 1 byte in UTF-8. ASCII characters (A-Z, 0-9, punctuation) are 1 byte each. Latin-extended characters (accented letters), Greek, Cyrillic, Arabic, and Hebrew letters are 2 bytes. Chinese, Japanese, and Korean characters are typically 3 bytes. Emoji are usually 4 bytes.

What encoding do most web systems use?

UTF-8 is the dominant encoding for web content, APIs, JSON, and databases. MySQL 8.0 and later default to utf8mb4, and PostgreSQL databases are typically created with the UTF8 encoding. When checking byte limits, use the UTF-8 column unless your system specifies otherwise.

Why do SMS messages have a 160-character limit?

Traditional SMS uses 7-bit GSM encoding, which allows 160 characters per segment. When you include any non-GSM character (like a smart quote, emoji, or non-Latin letter), the message switches to UCS-2 encoding, which drops the limit to 70 characters per segment.

What Is a Byte, Really?

A byte is 8 bits, which can hold 256 distinct values. In text, those 256 values map to characters via an encoding, a rule book that says "this byte sequence equals this character." The same string of bytes can mean completely different text under different encodings: the byte 0xE9 is "é" in Latin-1, the start of a 3-byte sequence in UTF-8, or part of a UTF-16 code unit. The encoding is the whole story.

When you save text to disk, send it over the wire, or store it in a database, what's actually persisted is bytes, not characters. The character count you see in a text editor is computed at display time, after the bytes are decoded. Mismatch the encoding on either side and you get mojibake: text decoded as the wrong encoding shows up as gibberish (the classic Ã© instead of é when UTF-8 bytes are read as Windows-1252).
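The mismatch is easy to reproduce. The following is a minimal Python sketch (the variable names are illustrative, not part of this tool) that encodes "é" as UTF-8 and then decodes the same bytes with the wrong codec:

```python
# Encode "é" as UTF-8, then decode the bytes with the wrong codec.
original = "é"
utf8_bytes = original.encode("utf-8")        # two bytes: C3 A9
misread = utf8_bytes.decode("cp1252")        # each byte becomes its own character
print(utf8_bytes.hex(" "), "->", misread)

# The single byte 0xE9 is a complete character in Latin-1 ...
print(bytes([0xE9]).decode("latin-1"))

# ... but in UTF-8 it only announces a 3-byte sequence, so alone it is invalid.
try:
    bytes([0xE9]).decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8 on its own:", err.reason)
```

The bytes never change; only the rule book used to read them does.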

Byte counting is what database column limits, HTTP header buffers, SMS payloads, and cloud storage object keys all measure, regardless of what the text "looks like." This counter reports byte size in the four encodings you're most likely to care about: UTF-8 (the modern default), UTF-16 (the Windows / Java / JavaScript internal format), ASCII (only valid for English Latin text), and Latin-1 (a single-byte legacy fallback). The character count alongside is given for reference.
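As a rough illustration of what such a report involves, here is a small Python sketch. It is not this tool's implementation (the tool runs in the browser); the function name byte_counts and the choice to return None for unrepresentable text are assumptions made for the example.

```python
def byte_counts(text: str) -> dict:
    """Return the encoded size of `text` in bytes for several encodings.

    ASCII and Latin-1 report None when the text cannot be represented.
    """
    def size(encoding):
        try:
            return len(text.encode(encoding))
        except UnicodeEncodeError:
            return None

    return {
        "characters": len(text),       # Unicode code points, not visible graphemes
        "utf-8": size("utf-8"),
        "utf-16": size("utf-16-le"),   # no BOM; add 2 bytes if one is written
        "ascii": size("ascii"),
        "latin-1": size("latin-1"),
    }


print(byte_counts("héllo"))
print(byte_counts("日本"))
```

Returning None rather than a number for ASCII or Latin-1 makes it obvious when a column is not meaningful for the given input.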

UTF-8: The Story

UTF-8 was sketched out by Ken Thompson and Rob Pike at Bell Labs on the night of 2 September 1992, reportedly on a placemat at a New Jersey diner, after the Plan 9 team needed an ASCII-compatible variable-length encoding for Unicode. The design carries three properties almost nothing else has at once: ASCII text is also valid UTF-8 (1 byte per character, identical bytes), the encoding is self-synchronising (any byte's high bits tell you whether it starts a new character or continues an existing one), and there's no byte-order ambiguity. Those three properties together explain why UTF-8 displaced every competing encoding on the web.

It was first standardised as RFC 2044 in October 1996, revised as RFC 2279 in January 1998, and replaced by the current RFC 3629 (November 2003), which capped UTF-8 at 4 bytes per character to match Unicode's code-point ceiling of U+10FFFF. W3Techs has tracked encoding usage on the public web continuously since 2010; UTF-8 went from 56% of websites in 2011 to roughly 98% in 2026. The HTML standard requires UTF-8 for new documents, and RFC 8259 mandates UTF-8 for JSON exchanged between systems that are not part of a closed ecosystem. HTTP headers are the notable holdout: HPACK and QPACK in HTTP/2 and HTTP/3 compress header bytes without assigning them an encoding, and RFC 9110 recommends keeping field values to ASCII. If you have to pick one encoding for everything, the answer for the last 15 years has been UTF-8 and the answer for the next 15 will be the same.

UTF-8 is variable-length, 1 to 4 bytes per character:

Code-point range      Bytes  Typical content
U+0000 – U+007F       1      ASCII letters, digits, common punctuation
U+0080 – U+07FF       2      Latin-extended (é, ñ), Greek, Cyrillic, Arabic, Hebrew
U+0800 – U+FFFF       3      Most CJK ideographs, Devanagari, Thai, Hangul, € symbol
U+10000 – U+10FFFF    4      Emoji, supplementary CJK, historical scripts

A practical consequence: English text in UTF-8 averages ~1 byte per character; Chinese ~3 bytes; an emoji-heavy message can hit 4 bytes per visible character, and combined emoji (family ZWJ sequences) easily reach 20-30 bytes for what looks like one character.
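The code-point ranges translate directly into a length function. This Python sketch (the function names are illustrative) computes the UTF-8 size from code points alone and can be checked against Python's own encoder:

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes needed for one Unicode code point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond the Unicode range")


def utf8_size(text: str) -> int:
    """Total UTF-8 size of a string, summed code point by code point."""
    return sum(utf8_length(ord(ch)) for ch in text)


# Family emoji: four emoji code points joined by three zero-width joiners.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
for sample in ["A", "é", "€", "\U0001F600", family]:
    print(repr(sample), utf8_size(sample), len(sample.encode("utf-8")))
```

Both columns agree for every sample, including the family sequence, whose size is the sum of four 4-byte emoji and three 3-byte joiners.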

UTF-16 and the Surrogate Trap

UTF-16 was the encoding of choice for Windows NT (1993), Java 1.0 (1996), JavaScript (1995), .NET, and Mac OS X Cocoa NSString. It uses 2 bytes for every character in the Basic Multilingual Plane (U+0000 – U+FFFF), and surrogate pairs for anything outside it: a high surrogate (D800–DBFF) plus a low surrogate (DC00–DFFF), 4 bytes total. UTF-16 needs a byte-order mark (BOM) on disk to disambiguate big-endian (UTF-16BE, FE FF) from little-endian (UTF-16LE, FF FE); Windows defaults to little-endian.
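The surrogate arithmetic is short enough to write out. A minimal Python sketch, assuming a hypothetical helper name to_surrogate_pair:

```python
import codecs


def to_surrogate_pair(code_point: int):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)      # U+1F600 is the grinning-face emoji
print(f"{high:04X} {low:04X}")

# Python's encoder produces the same two code units (big-endian, no BOM).
print("\U0001F600".encode("utf-16-be").hex(" "))

# The byte-order marks for the two on-disk variants.
print(codecs.BOM_UTF16_BE.hex(" "), codecs.BOM_UTF16_LE.hex(" "))
```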

The trap: in JavaScript, "😀".length === 2. MDN states this directly: the length property "contains the length of the string in UTF-16 code units." That's why a single emoji like 😀 reports a length of 2 (it lives in the supplementary plane and needs a surrogate pair), and the family ZWJ sequence 👨‍👩‍👧‍👦 reports a length of 11 (four 2-code-unit emoji plus three zero-width joiners). The same one-character family emoji counts as 11 in JavaScript, 7 in Python 3 (four emoji code points plus three joiners), and 1 in Swift, depending on each language's string model. For correct visible-character counts in JavaScript, use Intl.Segmenter with grapheme granularity (shipped in Chrome and Safari by 2021 and in Firefox in 2024).
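The different counts can be reproduced from a single string. In this Python sketch the UTF-16 code-unit count (what JavaScript's .length reports) is derived by encoding to UTF-16 and dividing by two. Counting grapheme clusters is left out because the Python standard library has no segmenter.

```python
# Family emoji: man, woman, girl, boy joined by zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

utf16_code_units = len(family.encode("utf-16-le")) // 2   # JavaScript's .length
code_points = len(family)                                 # Python 3's len()
utf8_bytes = len(family.encode("utf-8"))                  # size on the wire

print(utf16_code_units, code_points, utf8_bytes)
```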

ASCII, Latin-1, and the Pre-Unicode Mess

ASCII (American Standard Code for Information Interchange) was standardised as ASA X3.4-1963, revised as X3.4-1968 and again as ANSI X3.4-1986. It is a 7-bit code with 128 characters: 95 printable plus 33 control. The control characters include teletype legacies such as BEL, BS, and DEL, along with several that survive in modern protocols (NUL, TAB, LF, CR, ESC). ASCII is a strict subset of UTF-8, which is why "plain ASCII text" is also valid UTF-8 and why the migration to UTF-8 was painless for English-only systems.

Latin-1 / ISO-8859-1 (1987) was a single-byte 256-character extension that added accented Western European letters, currency symbols, and common punctuation. It was the de-facto encoding for Western web content from 1995 until UTF-8 displaced it around 2008. Windows-1252 is Microsoft's superset of Latin-1, adding "smart quotes", em-dashes, and the euro sign in the C1 control range (0x80-0x9F). When CSV files are emailed between Mac and Windows, this is the source of the classic Ã© mojibake, which appears when one side reads UTF-8 bytes as Windows-1252.

The MySQL "utf8" Trap

MySQL has had a notorious character-set wart since version 4.1: the utf8 charset alias is not actually UTF-8. It's a 3-byte-maximum subset that cannot represent characters above U+FFFF, which means it cannot store emoji or supplementary-plane characters. Inserting "🎉" into a utf8 column produces "?" or an error depending on sql_mode. The fix is utf8mb4, added in MySQL 5.5.3 (March 2010); MySQL 8.0 (April 2018) made utf8mb4 the new default. But schemas created before 8.0 often still default to the 3-byte version. If you see emoji silently dropping from user input, this is almost always the cause. PostgreSQL has no equivalent trap: its UTF8 encoding accepts the full 4-byte range.
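One way to catch the problem before it reaches the database is to check input for code points above U+FFFF. A minimal Python sketch; the function names are hypothetical and the check covers only the 3-byte ceiling, not column length:

```python
def supplementary_characters(text: str) -> list:
    """Characters above U+FFFF, which need 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_mysql_utf8mb3(text: str) -> bool:
    """True if every character fits MySQL's 3-byte utf8 charset."""
    return not supplementary_characters(text)


print(fits_mysql_utf8mb3("café 日本"))      # accented Latin and CJK fit in 3 bytes
print(fits_mysql_utf8mb3("party 🎉"))       # the emoji is above U+FFFF
print(supplementary_characters("party 🎉"))
```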

SMS, GSM-7, and the 140-Byte Payload

The 160-character SMS limit traces back to a 1985 calculation by Friedhelm Hillebrand, an engineer on the GSM Working Party who reportedly sat at his typewriter, typed out random sentences, and counted that "most messages can be expressed in 160 characters or less." The 160 was then back-derived to fit a 140-byte payload using a 7-bit alphabet (140 × 8 ÷ 7 = 160). The encoding details are formalised in 3GPP TS 23.038 (originally GSM 03.38), and they still govern SMS billing today.

In bytes: a single SMS is 140 bytes on the wire. With GSM-7 that's 160 characters; with UCS-2 (a 2-byte fixed-width encoding used for anything outside the GSM-7 alphabet) it's 70. Multi-part messages lose 7 GSM-7 characters or 3 UCS-2 characters per segment to a User Data Header used for re-assembly, so long messages cap at 153 GSM-7 chars per segment or 67 UCS-2 chars per segment. One smart quote, em-dash, or emoji switches the whole message to UCS-2 and cuts the per-segment limit by more than half. Twilio's "Smart Encoding" automatically replaces curly quotes with straight ones to keep marketing campaigns in the cheaper encoding.
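The segment arithmetic can be sketched in Python as follows. This is an approximation under stated assumptions: it uses the default GSM-7 alphabet and its basic extension table only (no national language shift tables), and it ignores the rule that an escape pair or a surrogate pair must not be split across segments, so counts right at a segment boundary can be one too low. The name sms_segments is illustrative.

```python
import math

# Default GSM-7 alphabet (3GPP TS 23.038); each character costs one septet.
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table: sent as ESC + character, so each costs two septets.
GSM7_EXTENDED = set("^{}\\[~]|€\f")


def sms_segments(text: str):
    """Return (encoding, units used, segment count) for a message."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in text):
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)
        single, multi, encoding = 160, 153, "GSM-7"
    else:
        units = len(text.encode("utf-16-be")) // 2   # 16-bit code units
        single, multi, encoding = 70, 67, "UCS-2"
    segments = 1 if units <= single else math.ceil(units / multi)
    return encoding, units, segments


print(140 * 8 // 7)                       # septets that fit in a 140-byte payload
print(sms_segments("a" * 160))
print(sms_segments("a" * 161))
print(sms_segments("\u201chi\u201d"))     # curly quotes force UCS-2
```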

Where Byte Limits Actually Bite

Three categories where byte (not character) limits will catch you out:

HTTP request headers. There is no formal maximum in the spec, but every server enforces one. Apache LimitRequestFieldSize defaults to 8 KB per header; Nginx's large_client_header_buffers defaults to 4 × 8 KB; IIS defaults to 16 KB; AWS Application Load Balancer accepts 16 KB per header and 60 KB total; Cloudflare allows 32 KB. JWTs with bloated claim sets routinely exceed Apache's 8 KB default, which is the most common production failure mode for token-based auth.

Cloud object storage keys. S3 and GCS both cap object keys at 1024 bytes of UTF-8. Azure Blob Storage caps blob names at 1024 characters (UTF-16 internal). For S3, a CJK-heavy filename (3 bytes per char) tops out at ~341 characters; an emoji-heavy one (4 bytes per char) at ~256, well before the developer expects.
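When a name has to be cut down to a byte budget, slicing the string by character count is not enough, and slicing the bytes blindly can cut a character in half. A minimal Python sketch of a byte-aware truncation (truncate_utf8 is a hypothetical helper; it keeps code points whole but may still split a multi-code-point emoji sequence):

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing may leave an incomplete sequence at the end; "ignore" drops it.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


cjk_key = truncate_utf8("日" * 400, 1024)
emoji_key = truncate_utf8("😀" * 300, 1024)
print(len(cjk_key), len(cjk_key.encode("utf-8")))
print(len(emoji_key), len(emoji_key.encode("utf-8")))
```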

Database row and index limits. MySQL InnoDB has a 65,535-byte row size and a 3072-byte index-key-prefix limit on DYNAMIC row format (767 on older COMPACT). A VARCHAR(255) utf8mb4 column needs 1020 bytes (255 × 4) of index space, fine on DYNAMIC, broken on COMPACT. MongoDB BSON documents cap at 16 MB. DynamoDB items cap at 400 KB (including attribute names). Redis values cap at 512 MB.
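The index arithmetic for MySQL can be checked with a few lines of Python. The limits below are the ones quoted in the paragraph above; the function names are illustrative, and the calculation uses the worst-case bytes per character that the index must reserve.

```python
# Worst-case bytes per character for some MySQL character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}
# InnoDB index key prefix limits in bytes, by row format.
INDEX_PREFIX_LIMIT = {"DYNAMIC": 3072, "COMPACT": 767}


def index_bytes(length: int, charset: str) -> int:
    """Index space reserved for a VARCHAR(length) column."""
    return length * MAX_BYTES_PER_CHAR[charset]


def fits_index(length: int, charset: str, row_format: str) -> bool:
    return index_bytes(length, charset) <= INDEX_PREFIX_LIMIT[row_format]


def max_indexable_chars(charset: str, row_format: str) -> int:
    """Longest VARCHAR that can be fully indexed."""
    return INDEX_PREFIX_LIMIT[row_format] // MAX_BYTES_PER_CHAR[charset]


print(index_bytes(255, "utf8mb4"))
print(fits_index(255, "utf8mb4", "DYNAMIC"), fits_index(255, "utf8mb4", "COMPACT"))
print(max_indexable_chars("utf8mb4", "COMPACT"))
```

The last figure explains why VARCHAR(191) appears in so many older schemas: it is the longest utf8mb4 column that fits under the 767-byte limit.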

Common Mistakes

  1. Trusting JavaScript's .length for byte size. .length returns UTF-16 code units, not bytes and not characters. For UTF-8 bytes, use new TextEncoder().encode(text).length; for visible characters, use Intl.Segmenter.
  2. Assuming MySQL utf8 is actually UTF-8. It's a 3-byte subset that cannot store emoji. Always use utf8mb4 (with the utf8mb4_0900_ai_ci collation on MySQL 8.0 and later, or utf8mb4_unicode_ci on older servers) on any column that touches user-submitted text.
  3. Assuming one emoji equals one byte. Most emoji are 4 bytes in UTF-8 and 4 bytes in UTF-16 (a surrogate pair). The family ZWJ sequence 👨‍👩‍👧‍👦 takes 25 bytes in UTF-8 for what looks like one character.
  4. Counting a UTF-8 BOM as content. The three-byte UTF-8 BOM EF BB BF at the start of a file is metadata, not text. Most CLI tools (awk, head, sed) treat it as part of the first field, which is the source of many "why does my first column name have a weird character" bugs.
  5. Reporting an "ASCII bytes" count for non-ASCII text. ASCII cannot represent characters above U+007F. This counter warns when the input contains non-ASCII so you know the ASCII column is not meaningful.
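Mistakes 4 and 5 are straightforward to guard against in code. A minimal Python sketch, with hypothetical helper names, that strips a leading UTF-8 BOM from raw bytes and lists the characters that make an ASCII count meaningless:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte-order mark (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def non_ascii_characters(text: str) -> list:
    """Characters above U+007F, which ASCII cannot represent."""
    return [ch for ch in text if ord(ch) > 0x7F]


raw = b"\xef\xbb\xbfname,age"
print(len(raw), len(strip_utf8_bom(raw)))

# The "utf-8-sig" codec does the same while decoding.
print(raw.decode("utf-8-sig"))

print(non_ascii_characters("naïve"))
```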

More Frequently Asked Questions

Why is one emoji 4 bytes when text characters are only 1?

UTF-8 uses 1 byte for ASCII (U+0000 to U+007F), 2 bytes for Latin-extended / Greek / Cyrillic / Arabic / Hebrew (U+0080 to U+07FF), 3 bytes for most CJK and Indic scripts (U+0800 to U+FFFF), and 4 bytes for emoji and supplementary-plane characters (U+10000 to U+10FFFF). A typical emoji like 😀 (U+1F600) is in the supplementary plane and costs 4 bytes. Combined emoji (e.g. family 👨‍👩‍👧‍👦) are built from several base emoji glued together with zero-width joiners; each base emoji is 4 bytes, each joiner is 3 bytes, so a family of 4 takes 4×4 + 3×3 = 25 bytes for what looks like one character.

What does MySQL utf8 actually mean?

In MySQL, the charset alias utf8 is a 3-byte-maximum subset of real UTF-8. It can encode every character in the Unicode Basic Multilingual Plane but cannot store emoji or any character above U+FFFF. Real 4-byte UTF-8 in MySQL is utf8mb4, available since MySQL 5.5.3 (March 2010), default since MySQL 8.0 (April 2018). If you can change the schema, always use utf8mb4 with the utf8mb4_0900_ai_ci collation (or utf8mb4_unicode_ci on older servers).

Does this counter include a UTF-8 byte-order mark?

No, the counter never adds one. The UTF-8 byte-order mark is the three bytes EF BB BF that Excel on Windows looks for at the start of a CSV file to detect UTF-8. The counter measures exactly the text you paste in, so if that text happens to start with a BOM, those three bytes are counted as content. To measure only the content, paste the body of the file without the BOM.

Why does my Chinese text show 3 bytes per character in UTF-8?

Almost all commonly used CJK ideographs sit in the Unicode range U+4E00 to U+9FFF (the CJK Unified Ideographs block), which UTF-8 encodes as 3 bytes each. A 100-character Chinese sentence is therefore 300 UTF-8 bytes. In UTF-16 the same text is 200 bytes (2 bytes per character), so UTF-16 is more compact for predominantly-CJK content. UTF-8 wins for mixed Latin-and-CJK content because Latin characters cost 1 byte each instead of 2.
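The trade-off is easy to measure. A short Python sketch (encoded_sizes is an illustrative name; UTF-16 is counted without a BOM):

```python
def encoded_sizes(text: str):
    """Return (UTF-8 bytes, UTF-16 bytes without BOM) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print(encoded_sizes("中" * 100))      # pure CJK: UTF-16 is smaller
print(encoded_sizes("Hello 世界"))    # mixed text: UTF-8 is smaller
```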

Is my text uploaded anywhere?

No. The byte counter runs entirely in your browser. UTF-8 byte counts come from the standard TextEncoder API (every modern browser supports it), UTF-16 and Latin-1 counts come from simple loops. There is no network request, no server call, no logging. Once the page is loaded, the tool works offline. Safe for inspecting API tokens, internal data, or anything you would not paste into a third-party text counter.
