Intl.Segmenter: Locale-Sensitive Text Segmentation in JavaScript

Break text into meaningful segments--graphemes, words, or sentences--that respect locale-specific rules. The built-in solution for global web applications.

What is Intl.Segmenter?

Intl.Segmenter is part of JavaScript's Internationalization API (ECMAScript Internationalization API Specification), designed to break text into meaningful segments based on locale-specific rules. Unlike simple string splitting methods, Intl.Segmenter understands how different languages structure words, sentences, and character boundaries.

The API addresses a fundamental challenge in text processing: determining where one meaningful unit ends and another begins. For English text with spaces between words, this seems trivial--but consider Japanese, where words run together without spaces, or emoji sequences that should be treated as single units. Intl.Segmenter handles these complexities automatically.

Modern web applications serve global audiences, and handling text correctly across languages is part of delivering a polished user experience. Whether you're building a text editor, implementing search functionality, or displaying word counts, Intl.Segmenter provides the foundation for proper international text handling. This matters most for teams whose web development projects target multiple markets.

Why Basic String Splitting Fails

If we use String.prototype.split(" ") to segment text in languages that don't use whitespace between words, we won't get the correct result. Japanese, Chinese, Thai, Lao, Khmer, and Myanmar all present challenges for naive string splitting approaches.

// Basic split fails for languages without spaces
const japaneseText = "吾輩は猫である。名前はたぬき。";
const splitResult = japaneseText.split(" ");
// Result: ["吾輩は猫である。名前はたぬき。"]
// The entire text is treated as one "word"

Intl.Segmenter correctly handles these cases:

const segmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const segments = segmenter.segment(japaneseText);
// Correctly segments: 吾輩, は, 猫, で, ある, etc.

Constructor and Options

The Intl.Segmenter constructor accepts two parameters: a locale or array of locales, and an options object.

const segmenter = new Intl.Segmenter(locale, options);

Locale Parameter

The locale parameter specifies the language or language-region combination to use for segmentation rules. You can pass a single string like "en-US" or "ja-JP", or an array of locales for fallback behavior. The API uses the first locale that has available segmentation rules.
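
To see which locale a segmenter actually selected from a fallback list, you can inspect resolvedOptions(). The following is a minimal sketch: the helper name resolveSegmenterOptions is made up for illustration, and the resolved locale depends on the locale data shipped with your runtime.

```javascript
// Hypothetical helper (not part of the API): reports the options a
// segmenter ended up with after locale negotiation.
function resolveSegmenterOptions(locales, granularity = "word") {
  const segmenter = new Intl.Segmenter(locales, { granularity });
  // resolvedOptions() returns the locale and granularity in effect
  return segmenter.resolvedOptions();
}

const resolved = resolveSegmenterOptions(["ja-JP", "en-US"]);
console.log(resolved.locale);      // the locale the runtime chose
console.log(resolved.granularity); // "word"
```

Logging the resolved locale during development is a cheap way to confirm that the rules you expect are the ones being applied.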

Granularity Option

The granularity option determines what type of segments to produce. Three values are supported:

"grapheme" (default): Splits at user-perceived character boundaries, keeping emoji sequences and combining characters together as single units
"word": Splits at word boundaries, respecting locale-specific rules about what constitutes a "word"
"sentence": Splits at sentence boundaries, based on sentence-ending punctuation and the text that follows it

Before creating a segmenter, you can check which locales are supported using Intl.Segmenter.supportedLocalesOf() to avoid falling back to the runtime's default locale.

Creating Intl.Segmenter Instances

// Single locale
const wordSegmenter = new Intl.Segmenter("en-US", { granularity: "word" });

// Multiple locales for fallback
const segmenter = new Intl.Segmenter(["zh-CN", "zh-TW", "ja-JP"], {
  granularity: "word"
});

// Different granularity levels
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });

// Check supported locales
const supportedLocales = Intl.Segmenter.supportedLocalesOf(["ja-JP", "zh-CN", "en-US"]);
// Locales the runtime cannot support without falling back to its default are omitted
console.log(supportedLocales); // e.g. ["ja-JP", "zh-CN", "en-US"]

Granularity Levels

Intl.Segmenter offers three granularity levels, each serving different use cases:

Grapheme Level: Splits at user-perceived character boundaries. This is the default granularity and handles complex Unicode sequences correctly: emoji sequences and combining characters are treated as single units. For example, the family emoji 👨‍👩‍👧‍👦 is one grapheme even though it contains multiple Unicode code points.
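
The difference between code units, code points, and graphemes is easiest to see side by side. This is a small sketch; countGraphemes is an illustrative helper name, not part of the API.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) count++;
  return count;
}

// Family emoji: four emoji joined by three zero-width joiners (U+200D)
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);              // 11 UTF-16 code units
console.log([...family].length);         // 7 code points
console.log(countGraphemes(family));     // 1 grapheme
console.log(countGraphemes("e\u0301"));  // 1 ("e" plus a combining accent)
```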

Word Level: Splits at locale-specific word boundaries. Each segment includes an isWordLike property indicating whether the segment contains actual word content (excluding punctuation and whitespace). This granularity is essential for word counting, text editors, and search functionality.

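Sentence Level: Splits at sentence boundaries using locale-aware rules for terminal punctuation. Abbreviation handling depends on the runtime's locale data: an abbreviation such as "Dr." followed by a capitalized word may still be reported as a sentence boundary, so test with representative text before relying on it.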

Granularity	Use Case	Example Output
grapheme	Character counting, emoji handling	"👨‍👩‍👧‍👦" as 1 unit
word	Word counting, text editors	"Hello world" → ["Hello", " ", "world"]
sentence	NLP, text analysis	"Hello! How are you?" → ["Hello! ", "How are you?"]

The segment() Method

The segment() method is the primary interface for extracting segments from text. It returns an iterable Segments object that you can loop over to access individual segments.

Each segment object contains:

segment: The actual text of the segment
index: The starting position of the segment in the original string
input: A reference to the original input string
isWordLike (word granularity only): Boolean indicating whether the segment is word-like content (excludes punctuation, spaces)

The Segments iterator is lazy--it doesn't create an array upfront, which is memory-efficient for large texts. You can convert it to an array using Array.from() when needed.

Using segment() Method

const segmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const text = "吾輩は猫である。名前はたぬき。";

const segments = segmenter.segment(text);

for (const { segment, index, isWordLike } of segments) {
  console.log({ segment, index, isWordLike });
}
// Output (exact boundaries depend on the runtime's locale data):
// { segment: '吾輩', index: 0, isWordLike: true }
// ...
// Punctuation such as '。' is reported with isWordLike: false

// Convert to array
const segmentArray = Array.from(segmenter.segment(text));
console.log(segmentArray[0].segment); // "吾輩"

Practical Use Cases

Word Count for International Text

A common requirement is displaying word counts that work correctly for any language. By using the isWordLike property, you can filter out punctuation and whitespace to get accurate word counts:

function countWords(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // Skip whitespace and punctuation segments
    if (segment.isWordLike) count++;
  }
  return count;
}

console.log(countWords("Hello world", "en")); // 2
// Counts for unspaced scripts depend on the runtime's dictionary data
console.log(countWords("吾輩は猫である", "ja")); // 5 when segmented as 吾輩, は, 猫, で, ある
console.log(countWords("Hello 世界", "en")); // 2 when 世界 is kept as one word

Text Truncation with Proper Character Handling

When truncating text, you need to respect grapheme boundaries to avoid cutting emoji or accented characters:

// maxLength is measured in graphemes (user-perceived characters),
// not in UTF-16 code units
function truncateText(text, maxLength, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let result = "";
  let count = 0; // graphemes kept so far
  for (const { segment } of segmenter.segment(text)) {
    if (count >= maxLength) break;
    result += segment;
    count++;
  }
  return result.length < text.length ? result + "..." : result;
}

console.log(truncateText("Hello 👨‍👩‍👧‍👦 World", 7, "en"));
// "Hello 👨‍👩‍👧‍👦..." (proper truncation, no broken emoji)

Natural Language Processing Pipeline

For applications processing text input--like chatbots, search engines, or content analysis tools--proper sentence segmentation is essential. When building AI automation solutions that involve natural language processing, integrating Intl.Segmenter into your pipeline ensures accurate text analysis across all languages your application supports:

function extractSentences(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  // Each segment keeps its trailing whitespace; call trim() if you don't want it
  return Array.from(segmenter.segment(text)).map(s => s.segment);
}

const paragraph = "This is the first sentence! And this is the second? Yes.";
console.log(extractSentences(paragraph, "en"));
// ["This is the first sentence! ", "And this is the second? ", "Yes."]

Common Use Cases

Real-world applications for locale-sensitive text segmentation

International Word Counts

Accurate word counting across languages that don't use spaces between words, essential for editors and content management systems.

Smart Text Truncation

Truncate text at character boundaries without breaking emoji or accented characters, perfect for previews and excerpts.

NLP Pipeline Integration

Extract sentences for natural language processing, enabling features like summarization, sentiment analysis, and search.

Text Editor Functionality

Implement word navigation, selection, and editing features that work correctly in any language.

Performance Considerations

Reuse Segmenter Instances

Creating Intl.Segmenter instances has overhead, so reuse instances when processing multiple texts. Cache segmenters at the module or class level rather than creating new ones for each operation.
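
One way to apply this when the locale is only known at runtime is a small module-level cache. This is a sketch under assumptions: getSegmenter and segmenterCache are illustrative names, and the locale is assumed to be a single string rather than an array.

```javascript
// Module-level cache keyed by locale and granularity (illustrative names).
const segmenterCache = new Map();

function getSegmenter(locale = "en", granularity = "grapheme") {
  const key = `${locale}|${granularity}`;
  let segmenter = segmenterCache.get(key);
  if (!segmenter) {
    // Construct once per locale/granularity pair, then reuse
    segmenter = new Intl.Segmenter(locale, { granularity });
    segmenterCache.set(key, segmenter);
  }
  return segmenter;
}

const first = getSegmenter("en", "word");
const second = getSegmenter("en", "word");
console.log(first === second); // true: the second call reuses the cached instance
```

A Map grows with the number of distinct locale and granularity pairs, which is usually small; if locales come from untrusted input, consider bounding the cache.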

Processing Large Texts

For very large texts, consider processing in chunks to manage memory efficiently, taking care that a chunk boundary does not fall in the middle of a word or grapheme. The iterator-based approach means segments are generated on-demand rather than all at once.

Memory Efficiency

The Segments iterator is lazy--it doesn't create an array upfront, which is memory-efficient for large texts. However, converting to an array with Array.from() will load all segments into memory. Use the iterator directly when possible, and only convert to an array when you need random access to segments.
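
As an illustration of using the iterator directly, the sketch below stops as soon as it has collected enough words instead of building an array of every segment first. The helper name firstWords is an assumption for this example.

```javascript
// Hypothetical helper: returns up to `limit` word-like segments from the
// start of the text without materializing all segments.
function firstWords(text, limit, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (words.length >= limit) break; // stop iterating once we have enough
    if (isWordLike) words.push(segment); // skip whitespace and punctuation
  }
  return words;
}

console.log(firstWords("The quick brown fox jumps over the lazy dog", 3));
// ["The", "quick", "brown"]
```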

Performance Best Practices

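// Good: Reuse the segmenter instance
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

function processTexts(texts) {
  for (const text of texts) {
    for (const segment of wordSegmenter.segment(text)) {
      // Process each segment
    }
  }
}

// Bad: Creating a new segmenter for each text
function processTextsBad(texts) {
  for (const text of texts) {
    const segmenter = new Intl.Segmenter("en", { granularity: "word" });
    for (const segment of segmenter.segment(text)) {
      // Process each segment
    }
  }
}

// Processing large texts in chunks. A fixed-size slice can cut a word in two,
// so the last segment of each chunk is carried over into the next chunk, and
// indexes are shifted so they are relative to the full text.
function processLargeText(text, segmenter, chunkSize = 10000) {
  const results = [];
  let offset = 0;
  while (offset < text.length) {
    const end = Math.min(offset + chunkSize, text.length);
    const segments = Array.from(segmenter.segment(text.slice(offset, end)));
    const isLastChunk = end === text.length;
    // Hold back the final, possibly incomplete segment unless nothing follows
    const keep = isLastChunk || segments.length === 1 ? segments.length : segments.length - 1;
    for (const { segment, index, isWordLike } of segments.slice(0, keep)) {
      results.push({ segment, index: offset + index, isWordLike });
    }
    offset = keep === segments.length ? end : offset + segments[keep].index;
  }
  return results;
}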

Browser Support and Fallbacks

Current Browser Support

Intl.Segmenter reached Baseline status in April 2024, meaning it works across the latest devices and browser versions. The API is now supported in Chrome, Firefox, Safari, and Edge without any configuration needed.

For applications needing to support older browsers, the FormatJS library provides a polyfill. The polyfill adds approximately 15KB to your bundle, so consider whether browser support requirements justify the additional weight.

Feature Detection

Use feature detection to determine whether Intl.Segmenter is available and provide fallbacks when necessary:

function isSegmenterSupported() {
  try {
    new Intl.Segmenter("en", { granularity: "word" });
    return true;
  } catch {
    return false;
  }
}

if (!isSegmenterSupported()) {
  // Load polyfill or use fallback implementation
  await import("@formatjs/intl-segmenter/polyfill");
}
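
When loading a polyfill is not an option, a degraded fallback can keep a feature working. The sketch below is one possible policy, not a recommendation: countWordsWithFallback is an illustrative name, and the whitespace fallback is knowingly inaccurate for languages that do not separate words with spaces.

```javascript
// Sketch of a word counter that degrades to whitespace splitting
// when Intl.Segmenter is unavailable (assumed fallback policy).
function countWordsWithFallback(text, locale = "en") {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    let count = 0;
    for (const { isWordLike } of segmenter.segment(text)) {
      if (isWordLike) count++;
    }
    return count;
  }
  // Fallback: split on runs of whitespace
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

console.log(countWordsWithFallback("Hello, wide world!")); // 3
```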

Browser Support Status

Baseline year: 2024
Supported in: Chrome, Firefox, Safari, and Edge
Polyfill size (if needed): approximately 15KB

Advanced Techniques

Combining with Regular Expressions

Intl.Segmenter can complement regular expressions for complex text processing. While the segmenter handles linguistic boundaries, regex can identify patterns within segments:

function highlightKeywords(text, keywords, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);

  let result = "";
  for (const segment of segments) {
    if (keywords.includes(segment.segment.toLowerCase()) && segment.isWordLike) {
      result += `<mark>${segment.segment}</mark>`;
    } else {
      result += segment.segment;
    }
  }
  return result;
}

This approach lets you build sophisticated text analysis tools that understand both linguistic structure and pattern matching.
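
As a sketch of the division of labor described here, the segmenter can supply word boundaries while a regular expression tests each word. The helper name findMatchingWords is illustrative, and the pattern should be passed without the g flag so that test() stays stateless.

```javascript
// Hypothetical helper: returns word-like segments that match a pattern.
function findMatchingWords(text, pattern, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const matches = [];
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    // The segmenter supplies the boundaries; the regex tests each word
    if (isWordLike && pattern.test(segment)) {
      matches.push({ word: segment, index });
    }
  }
  return matches;
}

// Words that start with an uppercase letter (Unicode-aware)
const capitalized = findMatchingWords("Alice met Bob in Paris", /^\p{Lu}/u);
console.log(capitalized);
// [ { word: "Alice", index: 0 }, { word: "Bob", index: 10 }, { word: "Paris", index: 17 } ]
```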

Building a Word Navigator

class WordNavigator {
  constructor(locale = "en") {
    this.segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  }

  // Start index of the first word that begins after `position`
  moveToNextWord(text, position) {
    const segments = this.segmenter.segment(text);
    for (const segment of segments) {
      if (segment.index > position && segment.isWordLike) {
        return segment.index;
      }
    }
    return text.length;
  }

  // Start index of the last word that begins before `position`
  moveToPreviousWord(text, position) {
    const segments = Array.from(this.segmenter.segment(text));
    for (let i = segments.length - 1; i >= 0; i--) {
      const segment = segments[i];
      if (segment.index < position && segment.isWordLike) {
        return segment.index;
      }
    }
    return 0;
  }

  // The word covering `position`, or null if it falls on whitespace or punctuation
  getWordAt(text, position) {
    const segments = Array.from(this.segmenter.segment(text));
    for (const segment of segments) {
      if (segment.index <= position &&
          segment.index + segment.segment.length > position &&
          segment.isWordLike) {
        return { word: segment.segment, index: segment.index };
      }
    }
    return null;
  }
}

// Usage in a text editor
// (named wordNavigator to avoid shadowing the browser's global navigator)
const wordNavigator = new WordNavigator("en");
const text = "The quick brown fox";
console.log(wordNavigator.moveToNextWord(text, 4)); // 10 (start of "brown")
console.log(wordNavigator.getWordAt(text, 4)); // { word: "quick", index: 4 }
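
For single lookups, the Segments object also provides a containing() method that returns the segment covering a given code unit index, which avoids scanning every segment. The sketch below shows how a lookup like getWordAt might be written with it; wordAt is an illustrative name.

```javascript
// Sketch: find the word at a position with Segments.prototype.containing().
function wordAt(text, position, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // containing() returns undefined when position is outside the string
  const found = segmenter.segment(text).containing(position);
  if (!found || !found.isWordLike) return null;
  return { word: found.segment, index: found.index };
}

console.log(wordAt("The quick brown fox", 6));  // { word: "quick", index: 4 }
console.log(wordAt("The quick brown fox", 3));  // null (position 3 is a space)
console.log(wordAt("The quick brown fox", 99)); // null (out of range)
```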

Best Practices

Key recommendations for effective text segmentation

Choose Right Granularity

Use 'grapheme' for character operations, 'word' for text processing, 'sentence' for paragraph analysis.

Reuse Segmenter Instances

Creating instances has overhead--cache them at module or class level for repeated use.

Handle Locale Fallbacks

Use Intl.Segmenter.supportedLocalesOf() to check availability before creating segmenters.

Test with Real Content

Always test segmentation with actual content in your target languages--linguistic rules vary significantly.

Conclusion

Intl.Segmenter fills a critical gap in JavaScript's text processing capabilities. By providing locale-aware segmentation, it enables applications to handle international text correctly--a requirement for any global web application. The API is well-designed, performant, and now widely supported across browsers.

The API's simplicity--create a segmenter, call segment(), iterate over results--belies its power. Under the hood, it handles the complex linguistic rules that make proper text segmentation so challenging. Whether you're building a text editor, implementing search functionality, or simply displaying word counts, Intl.Segmenter should be your go-to solution for breaking text into meaningful pieces.

By leveraging this built-in capability, you can deliver better experiences for users worldwide. The combination of proper character handling, locale awareness, and excellent performance makes Intl.Segmenter an essential tool for modern web applications targeting international audiences. When implementing SEO services for multilingual websites, proper text segmentation ensures accurate content analysis and improved search visibility across all supported languages.
