
UTF-8

What is UTF-8?

UTF-8 is a character encoding used to digitally store and exchange text. It is compatible with the Unicode standard and can represent virtually all the world's written characters. Its efficient storage and wide adoption make it the most widely used encoding on the Internet and in software applications.

UTF-8 is designed as a variable-length encoding, meaning that some characters take up fewer bytes than others. This makes it compatible with older systems that support ASCII, while allowing it to accommodate a wide range of special and international characters.

Why is UTF-8 important?

The emergence of Unicode and UTF-8 has made software and websites more globally accessible. Without a universal encoding such as UTF-8, systems would experience problems correctly representing different languages and symbols.

Some reasons why UTF-8 is the preferred choice for text encoding:

- It can represent every Unicode character, so a single encoding covers all languages.
- It is backward compatible with ASCII, so existing ASCII text is already valid UTF-8.
- Its variable length keeps storage compact for the most commonly used characters.
- It is the default encoding on the web and in most programming languages, databases, and operating systems.

The relationship between UTF-8 and Unicode

Unicode is a standard that assigns unique numeric values (code points) to characters from different languages and symbol sets. UTF-8 is a way of storing these codepoints in a computer-friendly format.

For example:

- The letter A has Unicode code point U+0041 and is stored in UTF-8 as the single byte 0x41.
- The euro sign € has code point U+20AC and is stored as the three bytes 0xE2 0x82 0xAC.

This makes UTF-8 a flexible and scalable solution for modern software and Web development.
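The distinction between a code point and its UTF-8 bytes is easy to see in Python; a minimal sketch using only the standard library:

cp = ord("€")                 # 8364, the Unicode code point
print(hex(cp))                # 0x20ac -> written as U+20AC
print("€".encode("utf-8"))    # b'\xe2\x82\xac' -> the three UTF-8 bytes that store it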

History of UTF-8

UTF-8 was developed in 1992 by Ken Thompson and Rob Pike, two engineers at Bell Labs. They designed this character encoding as a more efficient way to store and process Unicode characters, with a focus on compatibility with ASCII and saving space.

The original idea was to create a variable-length encoding that:

- remained fully compatible with existing ASCII text,
- used as few bytes as possible for the most common characters, and
- could still represent every Unicode code point.

UTF-8 was first presented publicly in 1993 and specified in RFC 2044 in 1996, the same year it was incorporated into the Unicode 2.0 standard. In 2003 it was definitively specified in RFC 3629, which limited the valid code points to the range 0 through 0x10FFFF (1,114,112 possible code points).

Why was UTF-8 needed?

Before the advent of Unicode and UTF-8, there were many character encodings such as ISO 8859-1 (Latin-1), Shift-JIS, and Windows-1252. This caused major compatibility problems in international communication and file exchange.

Problems with older encodings:

- Each encoding covered only a small set of characters, usually for one language or region.
- Text saved in one encoding was often displayed as garbled characters (mojibake) when read with another.
- No single encoding could combine multiple scripts in one document.

Unicode offered a solution by introducing one universal character set. However, the first Unicode encodings such as UTF-16 and UTF-32 used 2 or 4 bytes per character, which was inefficient for English and other Latin-based texts.

How has UTF-8 evolved?

Since its introduction, UTF-8 has spread rapidly and become the dominant character encoding on the Internet and in software applications.

Key developments:

- UTF-8 overtook other encodings to become the most common encoding on the web around 2008, and it is now used by the vast majority of websites.
- HTML5, JSON, and most modern APIs treat UTF-8 as their default or required encoding.
- Operating systems, programming languages, and databases have adopted UTF-8 as their default or recommended encoding.

Thanks to its wide support and efficiency, UTF-8 is the most widely used character encoding in the world.

How does UTF-8 work?

UTF-8 is a variable-length character encoding, meaning that some characters require fewer bytes than others. This ensures that the encoding is efficient while remaining compatible with ASCII.

The basic principle of variable-length encoding

Each Unicode code point is stored in UTF-8 as a sequence of one to four bytes.

Here is an overview of how characters are stored in UTF-8:

| Code point (hex) | Code point (dec) | Bytes in UTF-8 | Byte representation |
| --- | --- | --- | --- |
| 0000 - 007F | 0 - 127 | 1 byte | 0xxxxxxx |
| 0080 - 07FF | 128 - 2047 | 2 bytes | 110xxxxx 10xxxxxx |
| 0800 - FFFF | 2048 - 65535 | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
| 010000 - 10FFFF | 65536 - 1114111 | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |

This means that commonly used characters such as letters and numbers take up little storage space, while rare symbols or non-Latin characters require more bytes.
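To make the table concrete, here is a small Python sketch that builds the 3-byte sequence for the euro sign (U+20AC) by hand, following the 1110xxxx 10xxxxxx 10xxxxxx pattern, and checks the result against the built-in encoder:

cp = ord("€")                           # 0x20AC, which falls in the 3-byte range 0800 - FFFF
b1 = 0b11100000 | (cp >> 12)            # 1110xxxx: top 4 bits of the code point
b2 = 0b10000000 | ((cp >> 6) & 0x3F)    # 10xxxxxx: middle 6 bits
b3 = 0b10000000 | (cp & 0x3F)           # 10xxxxxx: lowest 6 bits
print(bytes([b1, b2, b3]))              # b'\xe2\x82\xac'
print("€".encode("utf-8"))              # b'\xe2\x82\xac' -> same result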

Examples of UTF-8 encoding

Let's look at some Unicode characters and see how they are stored in UTF-8:

| Unicode character | Unicode code point | UTF-8 bytes |
| --- | --- | --- |
| A | U+0041 | 0x41 |
| € | U+20AC | 0xE2 0x82 0xAC |
| 😀 | U+1F600 | 0xF0 0x9F 0x98 0x80 |

As you can see, the letter A uses only 1 byte, the euro sign uses 3 bytes, and an emoji such as 😀 uses 4 bytes.

ASCII compatibility of UTF-8

A major advantage of UTF-8 is that ASCII characters remain exactly the same. This means that a file with only ASCII characters in UTF-8 will be read the same by older systems that understand only ASCII.

For example, the ASCII string:

Hello

is stored the same in UTF-8 as it is in ASCII:

0x48 0x65 0x6C 0x6C 0x6F

But if the string contains a non-ASCII character, for example Hellé with an é (U+00E9), the encoding changes to:

0x48 0x65 0x6C 0x6C 0xC3 0xA9

Here you can see that é is stored as 2 bytes (0xC3 0xA9), while the rest remains unchanged.
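You can verify this behaviour in Python; a minimal sketch:

text = "Hellé"
print(text.encode("utf-8"))                   # b'Hell\xc3\xa9' -> ASCII bytes unchanged, é becomes 0xC3 0xA9
print(len(text), len(text.encode("utf-8")))   # 5 characters, 6 bytes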

Overlong encodings

Overlong encodings are invalid UTF-8 representations of characters that use more bytes than necessary.

For example, the ASCII character A is correctly encoded as 0x41 (1 byte). In an overlong encoding, the same character could be stored as 11000001 10000001 (0xC1 0x81, 2 bytes), which is both wasteful and unsafe.

Why are overlong encodings a problem?

If the same character can be written as more than one byte sequence, it can be smuggled past security checks that only look for the shortest form (for example, hiding a / or a NUL byte from an input filter). For this reason, the UTF-8 specification requires decoders to reject overlong sequences.
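Modern decoders enforce this rule. A minimal Python sketch showing that the overlong form above is rejected:

print(b"\x41".decode("utf-8"))      # 'A' -> the correct 1-byte form decodes fine
try:
    b"\xc1\x81".decode("utf-8")     # overlong 2-byte form of 'A'
except UnicodeDecodeError as err:
    print(err)                      # 'utf-8' codec can't decode byte 0xc1: invalid start byte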

Error handling in UTF-8

What happens when an application encounters an invalid UTF-8 byte?

There are three possible approaches:

- Reject the input and raise an error (strict handling).
- Replace the invalid bytes with the replacement character � (U+FFFD).
- Ignore the invalid bytes and continue with the rest of the text.

Example:

If a byte is missing from a 2-byte character, the software may replace it with � to indicate that the text is corrupted.
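Python exposes all three strategies through the errors parameter of bytes.decode(); a minimal sketch with a truncated 2-byte sequence:

data = b"Hell\xc3"                              # 0xC3 starts a 2-byte character, but the second byte is missing
print(data.decode("utf-8", errors="replace"))   # 'Hell�' -> replacement character
print(data.decode("utf-8", errors="ignore"))    # 'Hell'  -> invalid byte dropped
try:
    data.decode("utf-8")                        # default is errors="strict"
except UnicodeDecodeError:
    print("strict decoding raises UnicodeDecodeError")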

Surrogates and Byte Order Mark (BOM)

Surrogates are special code points (U+D800 through U+DFFF) used in UTF-16 to encode characters outside the Basic Multilingual Plane. In UTF-8, surrogates are invalid and must not appear in encoded text.

The Byte Order Mark (BOM) is an optional marker (U+FEFF, encoded as the bytes 0xEF 0xBB 0xBF) that some systems place at the beginning of a file to signal that the content is encoded in UTF-8. Because UTF-8 has no byte-order ambiguity, the BOM is not required and is often omitted.
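A minimal Python sketch showing the BOM bytes and the utf-8-sig codec, which strips a leading BOM when decoding:

print("\ufeff".encode("utf-8"))     # b'\xef\xbb\xbf' -> the UTF-8 BOM
data = b"\xef\xbb\xbfHello"
print(data.decode("utf-8-sig"))     # 'Hello' -> BOM removed
print(data.decode("utf-8"))         # '\ufeffHello' -> BOM kept as an invisible character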

Difference between UTF-8 and other encodings

Several character encodings are available, but UTF-8 has set the standard because of its flexibility and broad support. Nevertheless, other encodings such as UTF-16 and UTF-32 are still used in specific situations. Let's look at the main differences.

UTF-8 vs. UTF-16 vs. UTF-32

| Feature | UTF-8 | UTF-16 | UTF-32 |
| --- | --- | --- | --- |
| Byte size per character | 1-4 bytes (variable) | 2 or 4 bytes (depending on character) | Always 4 bytes |
| ASCII compatible? | Yes (ASCII remains 1 byte) | No (even ASCII characters take up 2 bytes) | No (always 4 bytes per character) |
| Efficient for | English-language texts and mixed languages | Asian languages and systems that use 2 bytes per character | Systems that need quick access to characters |
| Storage efficiency | Highly efficient due to variable length | Inefficient for ASCII text, but good for Chinese, Japanese, and Korean characters | Wasteful for storage, but simple in terms of processing |
| Web use | Standard in HTML, databases, and APIs | Less common | Rarely used for storage, more often used internally in certain software |
| When to use? | Almost always recommended | Only when compatibility with older systems is required | Only when performance is more important than storage space |

When do you use which encoding?

In practice, UTF-8 is the right choice for almost everything: files, web pages, APIs, and databases. UTF-16 mainly appears where a platform already uses it internally (for example, Windows APIs and Java strings), and UTF-32 is only worth considering when fixed-width code points simplify processing and storage size does not matter.

Comparison with older encodings such as ISO 8859-1 and Windows-1252

Before Unicode became popular, many computers and systems used locale-specific encodings, such as ISO 8859-1 (Latin-1) and Windows-1252.

| Feature | UTF-8 | ISO 8859-1 | Windows-1252 |
| --- | --- | --- | --- |
| Number of characters supported | 1.1 million (Unicode) | 256 (Latin characters only) | 256 (with additional characters, such as €) |
| Supports multiple languages? | Yes | No (Western European languages only) | Limited (additional characters, but not Unicode) |
| Compatible with modern software? | Yes | No | No |
| Usage nowadays | Globally the standard | Obsolete | Still in use in legacy systems |

Why is UTF-8 better than older encodings?

UTF-8 can represent every Unicode character instead of a fixed set of 256, so one encoding works for all languages. Text can be exchanged between systems without guessing which code page was used, which eliminates the garbled characters that were common with ISO 8859-1 and Windows-1252.

Advantages of UTF-8

UTF-8 has emerged as the most widely used character encoding in the world, mainly due to its versatility and efficiency. Here are the main advantages of UTF-8:

Universal compatibility

UTF-8 supports all Unicode characters, meaning it can be used for any script in the world. This makes it ideal for international communications, websites and software.

Compatibility with ASCII

One of the biggest advantages of UTF-8 is that all ASCII characters (0-127) remain the same. This means that:

- Any file that contains only ASCII characters is already valid UTF-8.
- Older tools and protocols that only understand ASCII can still read the ASCII portion of UTF-8 text.
- Existing ASCII data does not need to be converted when switching to UTF-8.

For example, the string:

Hello

is stored the same in both ASCII and UTF-8:

0x48 0x65 0x6C 0x6C 0x6F

This avoids compatibility issues with older systems.

Efficient storage and processing

Because UTF-8 uses variable length, it takes up less space for commonly used characters than other Unicode encodings such as UTF-16 or UTF-32.

Storage size comparison:

| Text | ASCII (bytes) | UTF-8 (bytes) | UTF-16 (bytes) | UTF-32 (bytes) |
| --- | --- | --- | --- | --- |
| "Hello" | 5 | 5 | 10 | 20 |
| "€" | N/A | 3 | 2 | 4 |
| "你好" | N/A | 6 | 4 | 8 |

- ASCII characters take up only 1 byte.
- Chinese, Japanese, and Arabic characters may require 2 to 4 bytes.
- UTF-16 and UTF-32 always take up more space for English-language texts.
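These numbers are easy to reproduce in Python (using the -le variants of UTF-16 and UTF-32 so that no byte order mark is added to the count):

for text in ("Hello", "€", "你好"):
    print(text,
          len(text.encode("utf-8")),       # UTF-8 bytes
          len(text.encode("utf-16-le")),   # UTF-16 bytes
          len(text.encode("utf-32-le")))   # UTF-32 bytes
# Hello 5 10 20
# €     3 2  4
# 你好  6 4  8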

Standard encoding on the web

According to W3Techs, more than 95% of all websites today are encoded in UTF-8. This is because:

- HTML5 and modern browsers treat UTF-8 as the standard encoding for web pages.
- One encoding works for every language, so sites do not need a separate code page per locale.
- UTF-8 stays compact for the ASCII-heavy markup (tags, attributes, scripts) that makes up most of a page.

Support in databases and programming languages

Most databases and programming languages support UTF-8 by default:

- MySQL (as utf8mb4), PostgreSQL, and SQLite store text as UTF-8.
- Python, Go, Ruby, PHP, and JavaScript tooling assume UTF-8 for source files and text I/O.

The broad support makes UTF-8 the safest choice for text storage and data exchange.

Disadvantages of UTF-8

Although UTF-8 has become the standard for character encoding, it also has some disadvantages, especially in specific situations. Here are the main limitations:

Higher storage requirements for certain characters

Although UTF-8 is efficient for ASCII characters (1 byte per character), some Unicode characters can take up more space.

Comparison of characters in UTF-8 vs. other encodings:

| Character | Unicode code point | UTF-8 (bytes) | UTF-16 (bytes) | UTF-32 (bytes) |
| --- | --- | --- | --- | --- |
| A | U+0041 | 1 | 2 | 4 |
| € | U+20AC | 3 | 2 | 4 |
| 😀 | U+1F600 | 4 | 4 | 4 |

Processing complexity

Because characters have variable length (1 to 4 bytes), it can be more difficult to work with UTF-8-encoded text in programming languages and databases.

Examples of complications:

- The number of characters in a string is not the same as the number of bytes, so length checks and size limits must be defined carefully.
- Indexing or slicing by byte position can cut a multi-byte character in half and produce invalid UTF-8.
- Random access to the n-th character requires scanning from the start, because characters have no fixed width.

The sketch below illustrates the first two points.
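A minimal Python sketch of the character-count versus byte-count mismatch, and of what happens when a byte slice splits a character:

text = "café"
data = text.encode("utf-8")
print(len(text), len(data))                        # 4 characters, 5 bytes (é takes 2 bytes)
chopped = data[:4]                                 # cuts the 2-byte é in half
print(chopped.decode("utf-8", errors="replace"))   # 'caf�'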

Problems with older systems

Although UTF-8 is the standard in modern systems, old programs and devices may still expect ISO 8859-1 or Windows-1252. This can lead to:

- Garbled characters (mojibake), for example é being displayed as Ã© when UTF-8 bytes are read as Latin-1 or Windows-1252.
- Data corruption when text is converted back and forth between encodings.
- Characters outside the legacy code page being silently replaced or dropped.

Overhead with binary-oriented formats

Some binary-oriented protocols and file formats do not work as smoothly with UTF-8. For example:

- Formats with fixed-size character fields cannot assume one byte per character.
- Length fields must count bytes rather than characters, which is easy to get wrong.

No fixed byte size per character

Unlike UTF-32, where each character is always 4 bytes, the size in UTF-8 varies. This can lead to:

- Slower random access, because finding the n-th character means scanning from the beginning of the string.
- More complex calculations for string length, truncation, and column widths.

Advantages outweigh the disadvantages

Although UTF-8 has some drawbacks, its advantages clearly outweigh them: it is universal, backward compatible with ASCII, compact for common text, and supported virtually everywhere.

For most applications, UTF-8 is the best choice. Other encodings such as UTF-16 and UTF-32 are used only in very specific cases.

Use of UTF-8 in practice

UTF-8 is used in virtually all modern technologies. From websites and databases to programming languages and operating systems, UTF-8 is the standard encoding because of its flexibility and broad support. Here are some of the main applications.

Use of UTF-8 in web development

The Web runs on UTF-8. HTML, CSS, and JavaScript files are UTF-8 encoded by default, and modern browsers expect this encoding.

How do you set UTF-8 in HTML?

To ensure that a Web page correctly uses UTF-8, add the following meta tag in the <head> section of your HTML document:

<meta charset="UTF-8">

This will display characters correctly regardless of language.

Why is this important?

Without an explicit charset declaration, a browser has to guess the encoding, and accented letters, non-Latin scripts, and emojis can end up as garbled characters. Declaring UTF-8 in the meta tag (or in the HTTP header Content-Type: text/html; charset=utf-8) ensures that every character is interpreted correctly.

Use of UTF-8 in databases

Modern databases such as MySQL, PostgreSQL, and SQLite support UTF-8 as a standard.

Why is UTF-8 important in databases?

- Names, addresses, and free text can be stored in any language without choosing a code page per column.
- Data exchanged with UTF-8 applications (websites, APIs) is stored without loss or corruption.
- Emojis and other supplementary characters can be stored, provided a 4-byte character set such as utf8mb4 is used.

Configuring MySQL for UTF-8

When creating a database or table in MySQL, you can set UTF-8 as the default:

CREATE DATABASE my_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Why utf8mb4 and not utf8?

MySQL's older utf8 implementation does not support 4-byte Unicode characters such as emojis (😀). Always use utf8mb4 for full Unicode support.
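The difference is easy to see from the byte lengths in Python: the legacy utf8 character set in MySQL stores at most 3 bytes per character, while an emoji needs 4:

print(len("é".encode("utf-8")))    # 2 bytes -> fits in MySQL's legacy utf8
print(len("你".encode("utf-8")))   # 3 bytes -> still fits
print(len("😀".encode("utf-8")))   # 4 bytes -> requires utf8mb4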

Use of UTF-8 in programming languages

Most modern programming languages support UTF-8 as a standard.

| Programming language | UTF-8 support |
| --- | --- |
| Python | Strings are Unicode; source files and str.encode() default to UTF-8 |
| JavaScript | Source files and JSON default to UTF-8 |
| Java | Strings are internally UTF-16, but UTF-8 is often used for I/O |
| C# (.NET) | UTF-8 is recommended for text files and APIs |
| PHP | Default encoding in HTML and databases |
| Go | Strings are UTF-8 by default |

Example: UTF-8 strings in Python

text = "Hello, world! 🌍"
print(text.encode("utf-8"))  # Output: b'Hello, world! \xf0\x9f\x8c\x8d'

Here the Unicode emoji 🌍 is correctly converted to UTF-8 bytes.

Using UTF-8 in operating systems

Operating systems such as Windows, macOS, and Linux support UTF-8 for file names, terminal display and applications.

Windows and UTF-8

Older Windows systems used Windows-1252 or UTF-16, but modern versions fully support UTF-8 in cmd and PowerShell.

To enable UTF-8 in Windows terminal:

chcp 65001

This switches the terminal to UTF-8 mode, displaying special characters correctly.

Use of UTF-8 in APIs and data exchange

Almost all modern APIs, JSON files, and XML files use UTF-8 as the standard encoding.

Example: JSON with UTF-8

{
    "naam": "Jörg Müller",
    "stad": "München",
    "emoji": "😀"
}

JSON is exchanged as UTF-8 (RFC 8259 requires it), making it easy to exchange data globally without character corruption.
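In Python, json.dumps escapes non-ASCII characters by default; passing ensure_ascii=False keeps the UTF-8 characters as-is. A minimal sketch:

import json

record = {"name": "Jörg Müller", "city": "München", "emoji": "😀"}
print(json.dumps(record))                       # {"name": "J\u00f6rg M\u00fcller", ...} -> ASCII escapes
print(json.dumps(record, ensure_ascii=False))   # {"name": "Jörg Müller", "city": "München", "emoji": "😀"}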

UTF-8 is everywhere

UTF-8 is the most versatile and efficient encoding for modern technologies and remains the best choice for any application.

Standards and support for UTF-8

UTF-8 has not just become popular; it is an officially recognized and widely supported standard within various industries. From international organizations to programming languages and operating systems, UTF-8 is used almost everywhere.

Standards and specifications

UTF-8 is officially enshrined in several standards and specifications.

| Standard | Description |
| --- | --- |
| Unicode Standard | Maintained by the Unicode Consortium; specifies how characters are encoded. |
| RFC 3629 | The official description of UTF-8 within the Internet Engineering Task Force (IETF). |
| ISO/IEC 10646 | International standard that defines Unicode and UTF-8. |
| W3C web standards | HTML, CSS, and XML specify UTF-8 as the standard character encoding. |
| IANA character sets | UTF-8 is officially registered with the Internet Assigned Numbers Authority (IANA). |

These standards ensure that UTF-8 is consistently used worldwide in software, hardware, and network protocols.

Support in programming languages and frameworks

Virtually all modern programming languages support UTF-8 directly or provide native support for Unicode.

| Programming language | UTF-8 support |
| --- | --- |
| Python | Unicode strings; UTF-8 is the default source encoding since Python 3 |
| JavaScript | JSON and source files are UTF-8 by default |
| Java | Support for UTF-8 in String and Charset |
| PHP | Default in mbstring functions |
| Go | Strings are always UTF-8 |
| Ruby | UTF-8 is the default encoding |
| C# (.NET) | UTF-8 is recommended for text files and APIs |
| Swift | Strings are processed in UTF-8 by default |

Many frameworks such as Django, React, Angular, and Node.js use UTF-8 by default to ensure compatibility.

Support in databases

UTF-8 is the recommended encoding for text in databases because it is compatible with multiple languages and prevents character loss.

| Database | Default encoding |
| --- | --- |
| MySQL | utf8mb4 recommended |
| PostgreSQL | UTF-8 standard |
| SQLite | UTF-8 by default |
| MongoDB | Supports UTF-8 in BSON |
| Microsoft SQL Server | Supports UTF-8 since SQL Server 2019 |

Note: in MySQL, always choose utf8mb4 rather than the legacy utf8 character set, because the latter cannot store 4-byte characters such as emojis.

Support in operating systems

Operating systems support UTF-8 to correctly process file names, text input and applications.

| Operating system | UTF-8 support |
| --- | --- |
| Windows | Supported in modern versions (cmd, PowerShell, Notepad) |
| macOS | Default UTF-8 support in files and terminal |
| Linux | Uses UTF-8 as the standard in most distributions |
| Android | Supports UTF-8 for storage and UI display |
| iOS | UTF-8 by default for apps and file names |

Windows users sometimes need to manually switch to UTF-8, for example in the terminal with:

chcp 65001

However, since Windows 10, UTF-8 is better supported by default.

Support in web browsers

All modern web browsers support UTF-8 and use it by default for web pages.

| Browser | UTF-8 support |
| --- | --- |
| Chrome | UTF-8 by default |
| Firefox | UTF-8 by default |
| Safari | UTF-8 by default |
| Edge | UTF-8 by default |
| Opera | UTF-8 by default |

Web pages without a specific encoding setting are usually interpreted as UTF-8 by browsers, indicating how universal the standard is.

Support in network protocols and files

Many network and file formats support UTF-8 to ensure global compatibility.

| Protocol / file format | UTF-8 support |
| --- | --- |
| JSON | Always UTF-8 encoded |
| XML | UTF-8 recommended as the standard |
| HTML | UTF-8 by default |
| CSV | Usually UTF-8 for international characters |
| SMTP (e-mail) | Supports UTF-8 for e-mail headers and content |
| HTTP and REST APIs | UTF-8 is standard in JSON bodies and headers |

Important for developers: declare the encoding explicitly wherever possible, for example with Content-Type: text/html; charset=utf-8 in HTTP responses and <meta charset="UTF-8"> in HTML, so that clients and servers never have to guess how bytes map to characters.

UTF-8 is the global standard

UTF-8 is the most widely used character encoding in the world. Its flexibility, efficiency, and universal compatibility have made it the standard for Web development, databases, programming languages and operating systems.

Due to its universal adoption and efficient storage, UTF-8 remains the best choice for text processing, storage, and data exchange.

Frequently Asked Questions
What is UTF-8 encoding?

UTF-8 is a character encoding that can store all Unicode characters at 1 to 4 bytes per character. It is the standard encoding for the Web and modern software.


What is the UTF-8 code?

A character's UTF-8 code is its binary or hexadecimal representation. For example, the A has the UTF-8 code 0x41, and € has 0xE2 0x82 0xAC.


How many UTF-8 characters are there?

UTF-8 can encode all 1,114,112 Unicode code points (U+0000 through U+10FFFF), although not all of them have been assigned characters yet.


What are non-UTF-8 characters?

All Unicode characters can be encoded in UTF-8. "Non-UTF-8 characters" usually refers to bytes from other encodings, such as Windows-1252, that are not valid UTF-8 and therefore cause problems when read as UTF-8.

