What is Intl.Segmenter?
Intl.Segmenter is part of JavaScript's Internationalization API (ECMAScript Internationalization API Specification), designed to break text into meaningful segments based on locale-specific rules. Unlike simple string splitting methods, Intl.Segmenter understands how different languages structure words, sentences, and character boundaries.
The API addresses a fundamental challenge in text processing: determining where one meaningful unit ends and another begins. For English text with spaces between words, this seems trivial--but consider Japanese, where words run together without spaces, or emoji sequences that should be treated as single units. Intl.Segmenter handles these complexities automatically.
Modern web applications serve global audiences, and handling text correctly across languages is part of a polished user experience. Whether you're building a text editor, implementing search functionality, or displaying word counts, Intl.Segmenter provides the foundation for proper international text handling, which makes it worth learning for teams whose web development projects target multiple markets.
Constructor and Options
The Intl.Segmenter constructor accepts two parameters, both optional: a locale or array of locales, and an options object.
const segmenter = new Intl.Segmenter(locale, options);
Locale Parameter
The locale parameter specifies the language or language-region combination to use for segmentation rules. You can pass a single string like "en-US" or "ja-JP", or an array of locales for fallback behavior. The API uses the first locale that has available segmentation rules.
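To see which locale a segmenter actually resolved to, you can inspect resolvedOptions(). The following is a minimal sketch: the locale list and variable names are only illustrative, and the resolved value depends on the locale data shipped with the runtime.

```javascript
// Request locales in order of preference; the runtime picks the best match it supports.
const preferredSegmenter = new Intl.Segmenter(["ja-JP", "en-US"], { granularity: "word" });

// resolvedOptions() reports the locale and granularity actually in use.
const resolved = preferredSegmenter.resolvedOptions();
console.log(resolved.locale);      // a BCP 47 language tag chosen by the runtime
console.log(resolved.granularity); // "word"
```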
Granularity Option
The granularity option determines what type of segments to produce. Three values are supported:
- "grapheme" (default): Splits at user-perceived character boundaries, so emoji sequences and letters with combining marks stay intact
- "word": Splits at word boundaries, respecting locale-specific rules about what constitutes a "word"
- "sentence": Splits at sentence boundaries based on terminal punctuation; how abbreviations are handled varies by runtime
Before creating a segmenter, you can check which locales are supported using Intl.Segmenter.supportedLocalesOf() to avoid falling back to the runtime's default locale.
// Single locale
const wordSegmenter = new Intl.Segmenter("en-US", { granularity: "word" });

// Multiple locales for fallback
const segmenter = new Intl.Segmenter(["zh-CN", "zh-TW", "ja-JP"], {
  granularity: "word"
});

// Different granularity levels
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });

// Check supported locales
const supportedLocales = Intl.Segmenter.supportedLocalesOf(["ja-JP", "zh-CN", "en-US"]);
console.log(supportedLocales); // the subset the runtime supports, e.g. ["ja-JP", "zh-CN", "en-US"]

Granularity Levels
Intl.Segmenter offers three granularity levels, each serving different use cases:
Grapheme Level
Splits at user-perceived character boundaries. This is the default granularity and handles complex Unicode sequences correctly: emoji sequences and characters with combining marks are treated as single units. For example, the family emoji 👨‍👩‍👧‍👦 is one grapheme even though it is built from seven Unicode code points (four emoji joined by three zero-width joiners).
Word Level
Splits at locale-specific word boundaries. Each segment includes an isWordLike property indicating whether the segment contains actual word content (excluding punctuation and whitespace). This granularity is essential for word counting, text editors, and search functionality.
Sentence Level
Splits at sentence boundaries, following locale-specific rules for terminal punctuation. Abbreviation handling varies between runtimes: a segmenter may still break after "Dr." in "Dr. Smith arrived.", so test with representative text before relying on it.
| Granularity | Use Case | Example Output |
|---|---|---|
| grapheme | Character counting, emoji handling | "👨👩👧👦" as 1 unit |
| word | Word counting, text editors | "Hello world" → ["Hello", " ", "world"] |
| sentence | NLP, text analysis | "Hello! How are you?" → ["Hello! ", "How are you?"] |
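The rows in the table can be reproduced with a small helper. This is a sketch only: splitBy is a hypothetical name, not part of the API, and the family emoji is written with escapes so the zero-width joiners (U+200D) are visible.

```javascript
// Hypothetical helper: split `text` at the given granularity and return the segment strings.
function splitBy(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);                              // 11 (UTF-16 code units)
console.log(splitBy(family, "grapheme").length);         // 1 (one user-perceived character)
console.log(splitBy("Hello world", "word"));             // ["Hello", " ", "world"]
console.log(splitBy("Hello! How are you?", "sentence")); // ["Hello! ", "How are you?"]
```

The gap between family.length and the grapheme count is why string length is a poor proxy for "number of characters" in user-facing text.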
The segment() Method
The segment() method is the primary interface for extracting segments from text. It returns an iterable Segments object that you can loop over to access individual segments.
Each segment object contains:
- segment: The actual text of the segment
- index: The starting position of the segment in the original string
- input: A reference to the original input string
- isWordLike (word granularity only): Boolean indicating whether the segment is word-like content (excludes punctuation, spaces)
The Segments object is iterable and produces segments lazily: it doesn't create an array upfront, which is memory-efficient for large texts. You can convert it to an array using Array.from() when needed.
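For a single lookup, converting to an array is not necessary: the Segments object also has a containing(index) method that returns the segment covering a given code-unit index. A minimal sketch, with illustrative variable names:

```javascript
const lookupSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const lookupSegments = lookupSegmenter.segment("Hello world");

// containing(i) returns the segment data for the segment that covers index i.
const hit = lookupSegments.containing(7);
console.log(hit.segment, hit.index, hit.isWordLike); // world 6 true

// An index outside the string yields undefined rather than throwing.
console.log(lookupSegments.containing(100)); // undefined
```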
const segmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const text = "吾輩は猫である。名前はたぬき。";

const segments = segmenter.segment(text);

for (const segment of segments) {
  console.log(segment.segment, segment.index, segment.isWordLike);
}
// Output begins:
// 吾輩 0 true
// The remaining segments depend on the runtime's dictionary-based segmentation.

// Convert to array
const segmentArray = Array.from(segmenter.segment(text));
console.log(segmentArray[0].segment); // "吾輩"

Practical Use Cases
Word Count for International Text
A common requirement is displaying word counts that work correctly for any language. By using the isWordLike property, you can filter out punctuation and whitespace to get accurate word counts:
function countWords(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // Skip whitespace and punctuation segments
    if (segment.isWordLike) count++;
  }
  return count;
}
console.log(countWords("Hello world", "en")); // 2
console.log(countWords("吾輩は猫である", "ja")); // count depends on the runtime's dictionary-based segmentation
console.log(countWords("Hello 世界", "en")); // typically 2
Text Truncation with Proper Character Handling
When truncating text, you need to respect grapheme boundaries to avoid cutting emoji or accented characters:
function truncateText(text, maxLength, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let result = "";
  let count = 0; // graphemes kept so far (not UTF-16 code units)
  for (const { segment } of segmenter.segment(text)) {
    if (count === maxLength) {
      return result + "..."; // more text remains, so mark the cut
    }
    result += segment;
    count++;
  }
  return result; // the whole text fit within maxLength graphemes
}
console.log(truncateText("Hello 👨‍👩‍👧‍👦 World", 7, "en"));
// "Hello 👨‍👩‍👧‍👦..." (7 graphemes kept, no broken emoji)
Natural Language Processing Pipeline
For applications that process text input, such as chatbots, search engines, or content analysis tools, sentence segmentation is a common first step. In AI automation solutions that involve natural language processing, Intl.Segmenter gives the pipeline locale-aware sentence boundaries without an extra dependency, though results should still be checked for each language your application supports:
function extractSentences(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  return Array.from(segmenter.segment(text)).map(s => s.segment);
}
const paragraph = "This is the first sentence! And this is the second? Yes.";
console.log(extractSentences(paragraph, "en"));
// ["This is the first sentence! ", "And this is the second? ", "Yes."]
Real-world applications for locale-sensitive text segmentation
International Word Counts
Accurate word counting across languages that don't use spaces between words, essential for editors and content management systems.
Smart Text Truncation
Truncate text at character boundaries without breaking emoji or accented characters, perfect for previews and excerpts.
NLP Pipeline Integration
Extract sentences for natural language processing, enabling features like summarization, sentiment analysis, and search.
Text Editor Functionality
Implement word navigation, selection, and editing features that work correctly in any language.
Performance Considerations
Reuse Segmenter Instances
Creating Intl.Segmenter instances has overhead, so reuse instances when processing multiple texts. Cache segmenters at the module or class level rather than creating new ones for each operation.
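One way to cache at the module level is a small lookup keyed by locale and granularity. This is a sketch under that assumption; segmenterCache and getSegmenter are hypothetical names, not part of the API.

```javascript
// Hypothetical module-level cache keyed by locale and granularity.
const segmenterCache = new Map();

function getSegmenter(locale = "en", granularity = "grapheme") {
  const key = `${locale}|${granularity}`;
  let cached = segmenterCache.get(key);
  if (!cached) {
    // Construct once per (locale, granularity) pair, then reuse.
    cached = new Intl.Segmenter(locale, { granularity });
    segmenterCache.set(key, cached);
  }
  return cached;
}

// Repeated calls with the same arguments return the same instance.
console.log(getSegmenter("en", "word") === getSegmenter("en", "word")); // true
```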
Processing Large Texts
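For very large texts, consider processing in chunks to manage memory efficiently. Because each chunk is segmented independently, a chunk boundary that falls inside a word, sentence, or grapheme cluster produces wrong segments at the seam, so hold back the trailing segment of each chunk and prepend it to the next. The iterator-based approach means segments are generated on demand rather than all at once.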
Memory Efficiency
Iterating a Segments object produces segments on demand. Converting it with Array.from(), however, loads all segments into memory. Use the iterator directly when possible, and only convert to an array when you need random access to segments.
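Iterating directly also lets you stop early. The sketch below collects only the first few word-like segments and then breaks out of the loop; firstWords is a hypothetical helper name.

```javascript
// Hypothetical helper: return the first `limit` word-like segments
// without building an array of every segment in the text.
function firstWords(text, limit, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  if (limit <= 0) return words;
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (!isWordLike) continue;          // skip spaces and punctuation
    words.push(segment);
    if (words.length === limit) break;  // stop iterating once we have enough
  }
  return words;
}

console.log(firstWords("The quick brown fox jumps", 3)); // ["The", "quick", "brown"]
```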
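// Good: Reuse the segmenter instance
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

function processTexts(texts) {
  for (const text of texts) {
    for (const segment of wordSegmenter.segment(text)) {
      // Process each segment
    }
  }
}

// Bad: Creating a new segmenter for each text
function processTextsBad(texts) {
  for (const text of texts) {
    const segmenter = new Intl.Segmenter("en", { granularity: "word" });
    for (const segment of segmenter.segment(text)) {
      // Process each segment
    }
  }
}

// Processing large texts in chunks.
// The last segment of each chunk is held back and prepended to the next chunk,
// because a slice boundary may fall in the middle of a segment.
function processLargeText(text, segmenter, chunkSize = 10000) {
  const results = [];
  let carry = "";     // trailing segment held back from the previous chunk
  let carryStart = 0; // offset in `text` where the current chunk (including carry) begins
  for (let i = 0; i < text.length; i += chunkSize) {
    const chunk = carry + text.slice(i, i + chunkSize);
    const segments = Array.from(segmenter.segment(chunk));
    const isLastChunk = i + chunkSize >= text.length;
    const last = isLastChunk ? null : segments.pop();
    for (const s of segments) {
      // Report indices relative to the full text, not the chunk
      results.push({ segment: s.segment, index: carryStart + s.index, isWordLike: s.isWordLike });
    }
    carry = last ? last.segment : "";
    carryStart = last ? carryStart + last.index : 0;
  }
  return results;
}

Browser Support and Fallbacks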
Current Browser Support
Intl.Segmenter reached Baseline status in April 2024, meaning it works across the latest devices and browser versions. The API is now supported in Chrome, Firefox, Safari, and Edge without any configuration needed.
For applications needing to support older browsers, the FormatJS library provides a polyfill. The polyfill adds approximately 15KB to your bundle, so consider whether browser support requirements justify the additional weight.
Feature Detection
Use feature detection to determine whether Intl.Segmenter is available and provide fallbacks when necessary:
function isSegmenterSupported() {
  try {
    new Intl.Segmenter("en", { granularity: "word" });
    return true;
  } catch {
    return false;
  }
}
// Top-level await requires an ES module
if (!isSegmenterSupported()) {
  // Load polyfill or use fallback implementation
  await import("@formatjs/intl-segmenter/polyfill");
}
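If loading a polyfill is not an option, a degraded fallback is sometimes acceptable. The sketch below is one possible approach, not a full replacement: splitGraphemes is a hypothetical helper that falls back to code-point iteration, which keeps surrogate pairs together but still splits combining sequences and joined emoji.

```javascript
// Hypothetical helper: prefer Intl.Segmenter, fall back to code points.
function splitGraphemes(text, locale = "en") {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    return Array.from(segmenter.segment(text), (s) => s.segment);
  }
  // Fallback: string iteration yields code points, so surrogate pairs stay
  // intact, but combining marks and ZWJ emoji sequences are split apart.
  return Array.from(text);
}

console.log(splitGraphemes("a\u{1F600}b")); // ["a", "😀", "b"]
```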
Browser Support Status
- Baseline year: 2024
- Major browser engines: 3
- Polyfill size (if needed): approximately 15KB
Advanced Techniques
Combining with Regular Expressions
Intl.Segmenter can complement regular expressions for complex text processing. While the segmenter handles linguistic boundaries, regex can identify patterns within segments:
function highlightKeywords(text, keywords, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const segments = segmenter.segment(text);
  let result = "";
  for (const segment of segments) {
    // `keywords` is expected to contain lowercase strings
    if (segment.isWordLike && keywords.includes(segment.segment.toLowerCase())) {
      // Escape segment text first if `text` is untrusted and the result is rendered as HTML
      result += `<mark>${segment.segment}</mark>`;
    } else {
      result += segment.segment;
    }
  }
  return result;
}
This approach lets you build sophisticated text analysis tools that understand both linguistic structure and pattern matching.
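To make the regex half of that combination concrete, here is a small sketch in which the segmenter finds word boundaries and a regular expression then classifies each word. The helper name and the choice of pattern (filtering out purely numeric tokens) are illustrative assumptions.

```javascript
// Matches segments that consist only of Unicode number characters.
const NUMERIC_ONLY = /^\p{N}+$/u;

// Hypothetical helper: return word-like segments that are not purely numeric.
function extractNonNumericWords(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // The segmenter decides what a word is; the regex inspects its content.
    if (isWordLike && !NUMERIC_ONLY.test(segment)) {
      words.push(segment);
    }
  }
  return words;
}

console.log(extractNonNumericWords("Order 66 shipped in 2024"));
// ["Order", "shipped", "in"]
```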
class WordNavigator {
  constructor(locale = "en") {
    this.segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  }

  // Start index of the first word that begins after `position`
  moveToNextWord(text, position) {
    const segments = this.segmenter.segment(text);
    for (const segment of segments) {
      if (segment.index > position && segment.isWordLike) {
        return segment.index;
      }
    }
    return text.length;
  }

  // Start index of the closest word that begins before `position`
  moveToPreviousWord(text, position) {
    const segments = Array.from(this.segmenter.segment(text));
    for (let i = segments.length - 1; i >= 0; i--) {
      const segment = segments[i];
      if (segment.index < position && segment.isWordLike) {
        return segment.index;
      }
    }
    return 0;
  }

  // The word covering `position`, or null if the position is not inside a word
  getWordAt(text, position) {
    const segments = Array.from(this.segmenter.segment(text));
    for (const segment of segments) {
      if (segment.index <= position &&
          segment.index + segment.segment.length > position &&
          segment.isWordLike) {
        return { word: segment.segment, index: segment.index };
      }
    }
    return null;
  }
}

// Usage in a text editor
const wordNavigator = new WordNavigator("en");
const text = "The quick brown fox";
console.log(wordNavigator.moveToNextWord(text, 4)); // 10 (start of "brown")
console.log(wordNavigator.getWordAt(text, 4)); // { word: "quick", index: 4 }

Key recommendations for effective text segmentation
Choose Right Granularity
Use 'grapheme' for character operations, 'word' for text processing, 'sentence' for paragraph analysis.
Reuse Segmenter Instances
Creating instances has overhead--cache them at module or class level for repeated use.
Handle Locale Fallbacks
Use Intl.Segmenter.supportedLocalesOf() to check availability before creating segmenters.
Test with Real Content
Always test segmentation with actual content in your target languages--linguistic rules vary significantly.
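The locale-fallback recommendation can be sketched as a small factory. createSegmenter and its fallbackLocale parameter are hypothetical names; the sketch assumes that falling back to a fixed locale is acceptable for your application.

```javascript
// Hypothetical helper: build a segmenter from the first requested locale the
// runtime supports, or from `fallbackLocale` if none of them are supported.
function createSegmenter(requestedLocales, granularity, fallbackLocale = "en") {
  const supported = Intl.Segmenter.supportedLocalesOf(requestedLocales);
  const locale = supported.length > 0 ? supported[0] : fallbackLocale;
  return new Intl.Segmenter(locale, { granularity });
}

const sentenceSplitter = createSegmenter(["ja-JP", "en-US"], "sentence");
console.log(sentenceSplitter.resolvedOptions().granularity); // "sentence"
```

Making the fallback explicit keeps the behavior predictable, instead of silently depending on the runtime's default locale.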
Conclusion
Intl.Segmenter fills a critical gap in JavaScript's text processing capabilities. By providing locale-aware segmentation, it enables applications to handle international text correctly--a requirement for any global web application. The API is well-designed, performant, and now widely supported across browsers.
The API's simplicity--create a segmenter, call segment(), iterate over results--belies its power. Under the hood, it handles the complex linguistic rules that make proper text segmentation so challenging. Whether you're building a text editor, implementing search functionality, or simply displaying word counts, Intl.Segmenter should be your go-to solution for breaking text into meaningful pieces.
By using this built-in capability, you can deliver better experiences for users worldwide. The combination of proper character handling, locale awareness, and good performance makes Intl.Segmenter a valuable tool for modern web applications targeting international audiences. For multilingual websites, including SEO work on them, proper text segmentation also supports more accurate content analysis across the languages you target.