Robert A Decker Programming Repository


Notes and articles that will reduce the pain

Escaping emojis

Robert Decker - Thursday, November 16, 2017

While working with emojis in Java you have to deal with surrogate pairs: an emoji appears as a single character in a UI, for example, but in the underlying String it is actually two char values (UTF-16 code units).
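
To make the "two characters in the background" point concrete, here is a minimal sketch that compares the char count with the code point count for one emoji. The class and method names (SurrogateDemo, charCount, codePointCount) are made up for illustration; the calls on String are standard and available in Java 7.

```java
class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE02"; // U+1F602 written as its surrogate pair
        System.out.println(charCount(emoji));      // prints 2
        System.out.println(codePointCount(emoji)); // prints 1
    }
}
```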

Java's String class lets you pull out code points, which can be single characters or surrogate pairs. However, in Java 7 there's no convenient built-in way to iterate through these code points (Java 8 adds a codePoints() method that returns an IntStream you can iterate through).
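
For comparison, a manual loop over code points is possible in Java 7 by combining String.codePointAt with Character.charCount. This is only a sketch, and the names CodePointWalker and codePoints are hypothetical. Unlike a BreakIterator, it steps one code point at a time and does not group combining sequences.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointWalker {
    // Collects every code point of s, advancing by 1 or 2 chars per step.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            // charCount is 2 for supplementary code points, 1 otherwise.
            i += Character.charCount(cp);
        }
        return result;
    }
}
```

On Java 8 and later, str.codePoints() produces the same values as an IntStream.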

With Java 7 you can use a character BreakIterator to iterate over each character, letting you extract plain characters and surrogate pairs. In the following code I escape the surrogate pairs into HTML entities.
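		// Requires: import java.text.BreakIterator;
		StringBuilder message = new StringBuilder();
		String str = "🤯😂ab春♞aáéí";
		// A character BreakIterator stops at character boundaries, so a surrogate pair is covered in one step.
		BreakIterator ci = BreakIterator.getCharacterInstance(java.util.Locale.ENGLISH);
		ci.setText(str);
		int start = ci.first();
		for (int end = ci.next(); end != BreakIterator.DONE; start = end, end = ci.next()) {
			// Two or more chars between boundaries: emit a numeric HTML entity; otherwise copy the char as-is.
			message.append(end - start >= 2 ? "&#" + str.codePointAt(start) + ";" : str.charAt(start));
		}
		_log.debug(message.toString());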
Output:
&#129327;&#128514;ab春♞aáéí


On Stack Overflow you'll see suggestions to use the InEmoticons Unicode block, or to match a range of code points. However, I couldn't find a good combination of regexes that catches all of these surrogate pairs. For example, both of the following miss the first emoji above (🤯 is U+1F92F, which lies outside the Emoticons block, U+1F600 to U+1F64F):
Pattern emoticons = Pattern.compile("\\p{InEmoticons}");
Pattern emoticons = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");
You'll also see suggestions to use the EmojiParser library, but again, that misses the first emoticon above.
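
The block-based pattern's miss can be checked directly. The sketch below tests the InEmoticons pattern against both emojis and, as one possible alternative rather than a recommendation from this article, a pattern covering every supplementary code point (U+10000 to U+10FFFF), which is exactly the set that needs a surrogate pair. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class EmoticonBlockCheck {
    // Matches only code points in the Emoticons block (U+1F600 to U+1F64F).
    private static final Pattern EMOTICONS = Pattern.compile("\\p{InEmoticons}");

    // Matches any code point above the Basic Multilingual Plane.
    private static final Pattern SUPPLEMENTARY = Pattern.compile("[\\x{10000}-\\x{10FFFF}]");

    static boolean inEmoticonsBlock(String s) {
        return EMOTICONS.matcher(s).find();
    }

    static boolean hasSupplementary(String s) {
        return SUPPLEMENTARY.matcher(s).find();
    }

    public static void main(String[] args) {
        String explodingHead = "\uD83E\uDD2F"; // U+1F92F, the first emoji above
        String tearsOfJoy = "\uD83D\uDE02";    // U+1F602, the second emoji above
        System.out.println(inEmoticonsBlock(explodingHead)); // prints false
        System.out.println(inEmoticonsBlock(tearsOfJoy));    // prints true
        System.out.println(hasSupplementary(explodingHead)); // prints true
    }
}
```

The broader pattern also matches supplementary characters that are not emojis, so it fits the "escape every surrogate pair" goal better than an "emojis only" goal.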