Robert A Decker Programming Repository


Notes and articles that will reduce the pain

Counting Bytes and Chars for sending SMS

Robert Decker - Tuesday, October 31, 2017

Here I present code snippets and information on sending SMS messages through a tier-1 SMS provider, particularly information on char/byte counting in Java and JavaScript.

1. Introduction

SMS (Short Message Service) is one of the most common means of communication on the planet and is available on virtually every mobile phone. By using a tier-1 SMS provider that has agreements with mobile telephony companies in every part of the world you can reach nearly every cellphone on the planet, which means 80% of people in Africa and nearly 100% of people everywhere else, all without developing an app that users must download.

2. SMS Message Sizes and Content

A single SMS message is limited to 1120 bits. If you send a multi-part SMS, each part must also carry header information that tells the handset how to stitch the parts together, which reduces the space left for the message body. There are three ways to encode the content of an SMS: GSM, which is a limited set of 7-bit characters; Binary, which is 8-bit; and Unicode, which is 16-bit.

(It's actually a bit more complicated than this - for example, there are different character sets in the GSM standard for different parts of the world, so you're not working with the same character set if you send to Portugal vs sending to the USA - but that's a different topic...)


Mode     Type        Bits  Character size  Header size*  Characters
GSM      single sms  1120  7-bit           0 bits        1120/7 = 160
GSM      multi-part  1120  7-bit           48 bits       (1120-48)/7 = 153
Binary   single sms  1120  8-bit           0 bits        1120/8 = 140
Binary   multi-part  1120  8-bit           48 bits       (1120-48)/8 = 134
Unicode  single sms  1120  16-bit          0 bits        1120/16 = 70
Unicode  multi-part  1120  16-bit          48 bits       (1120-48)/16 = 67
* header size can vary but 48 bits is the minimum
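
The arithmetic in the table can be sketched in a few lines of Java. The class name SmsCapacity and the method charsPerMessage are made up for this illustration, and the sketch assumes the minimum 48-bit header; a larger header would give smaller multi-part sizes.

```java
public class SmsCapacity {
    static final int SMS_BITS = 1120;       // payload bits in one SMS
    static final int MIN_HEADER_BITS = 48;  // assumed minimum multi-part header

    // characters that fit in one message (or one part) for a given character width
    static int charsPerMessage(int bitsPerChar, boolean multiPart) {
        int available = multiPart ? SMS_BITS - MIN_HEADER_BITS : SMS_BITS;
        return available / bitsPerChar; // integer division drops any partial character
    }

    public static void main(String[] args) {
        int[] widths = {7, 8, 16}; // GSM, Binary, Unicode
        for (int width : widths) {
            System.out.println(width + "-bit: single=" + charsPerMessage(width, false)
                    + " multi-part=" + charsPerMessage(width, true));
        }
    }
}
```

Running main prints the same six sizes that appear in the last column of the table.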

The GSM 7-bit characters used in this article are a subset of the ISO-8859-1 (Latin-1) character set, whose codes run from 0x00 to 0xFF in hex. In the following table the cells marked -- are not available in this subset. By using this subset we are able to assign 7 bits per character rather than 8 bits.

ISO-8859-1 Hex Codes With Valid GSM Characters
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
-- -- -- -- -- -- -- -- -- -- LF -- -- CR -- --
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
SP !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
@  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
P  Q  R  S  T  U  V  W  X  Y  Z  -- -- -- -- _
60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
-- a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
p  q  r  s  t  u  v  w  x  y  z  -- -- -- -- --
80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
-- ¡  -- £  ¤  ¥  -- §  -- -- -- -- -- -- -- --
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- ¿
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF
-- -- -- -- Ä  Å  Æ  Ç  -- É  -- -- -- -- -- --
D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF
-- Ñ  -- -- -- -- Ö  -- Ø  -- -- -- Ü  -- -- ß
E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
à  -- -- -- ä  å  æ  -- è  é  -- -- ì  -- -- --
F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
-- ñ  ò  -- -- -- ö  -- ø  ù  -- -- ü  -- -- --


3. Building SMS Messages


First, we have to define some constants and enums.

ISO88591_SUBSET is a byte array of each character in the 7-bit GSM subset.

The enum SmsFormat defines the three modes of sending an SMS and their sizes, both full message size and the reduced size when sending a multi-part SMS.

java:
// these are the valid ISO-8859-1 subset characters that can be sent as 7-bit characters in a GSM SMS
public static final byte[] ISO88591_SUBSET = new byte[] {
	0x0A, 0x0D,
	0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
	0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F,
	0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F,
	0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 0x5F,
	0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F,
	0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A,
	(byte) 0xA1, (byte) 0xA3, (byte) 0xA4, (byte) 0xA5, (byte) 0xA7,
	(byte) 0xBF,
	(byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, (byte) 0xC9,
	(byte) 0xD1, (byte) 0xD6, (byte) 0xD8, (byte) 0xDC, (byte) 0xDF,
	(byte) 0xE0, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE8, (byte) 0xE9, (byte) 0xEC,
	(byte) 0xF1, (byte) 0xF2, (byte) 0xF6, (byte) 0xF8, (byte) 0xF9, (byte) 0xFC
};

// SMS Message sizes
public static final int SMS_7BIT_SIZE = 160;
public static final int SMS_8BIT_SIZE = 140;
public static final int SMS_16BIT_SIZE = 70;
public static final int SMS_7BIT_SIZE_SPLIT = 153;
public static final int SMS_8BIT_SIZE_SPLIT = 134;
public static final int SMS_16BIT_SIZE_SPLIT = 67;

// an enum for sms message formats
public static enum SmsFormat {
	// create the enums
	TEXT ("Text", SMS_7BIT_SIZE, SMS_7BIT_SIZE_SPLIT),
	BINARY ("Binary", SMS_8BIT_SIZE, SMS_8BIT_SIZE_SPLIT),
	UNICODE ("Unicode", SMS_16BIT_SIZE, SMS_16BIT_SIZE_SPLIT)
	;
    private final String formatName;
    private final int messageSize;
    private final int splitMessageSize;

    SmsFormat(String formatName, int messageSize, int splitMessageSize) {
        this.formatName = formatName;
        this.messageSize = messageSize;
        this.splitMessageSize = splitMessageSize;
    }
    public String formatName() {
        return this.formatName;
    }	    
    public int messageSize() {
        return this.messageSize;
    }	    
    public int splitMessageSize() {
        return this.splitMessageSize;
    }
}
javascript:

In javascript, we define the subset characters. There is no need for the enum.

// these are the valid ISO-8859-1 subset characters that can be sent as 7-bit characters in a GSM SMS
var basic_chars_hex = [0x0A, 0x0D,
                  0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
                  0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F,
                  0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F,
                  0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 0x5F,
                  0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F,
                  0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A,
                  0xA1, 0xA3, 0xA4, 0xA5, 0xA7,
                  0xBF,
                  0xC4, 0xC5, 0xC6, 0xC7, 0xC9,
                  0xD1, 0xD6, 0xD8, 0xDC, 0xDF,
                  0xE0, 0xE4, 0xE5, 0xE6, 0xE8, 0xE9, 0xEC,
                  0xF1, 0xF2, 0xF6, 0xF8, 0xF9, 0xFC];

When we have a string to send as SMS we need to determine the mode in which to send it. If possible, we send it in the GSM character set because we can fit more characters in each message. If not, we have to send it as Unicode, which has a limit of 70 characters per message (67 per part once the message is split). A 160-character message that fits in a single GSM SMS needs three Unicode parts (67+67+26=160), which could potentially triple your SMS charges.
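
To make the cost difference concrete, here is a minimal sketch that counts how many SMS parts a message of a given length needs. The class SmsPartCount and the method partsNeeded are hypothetical names, and the sketch assumes every character occupies exactly one slot in its mode (no surrogate pairs) and that the header is the minimum 48 bits.

```java
public class SmsPartCount {
    // number of SMS parts needed for a message of `length` characters,
    // given the single-message size and the per-part size when split
    static int partsNeeded(int length, int singleSize, int splitSize) {
        if (length <= singleSize) {
            return 1; // fits in one message, no multi-part header needed
        }
        return (length + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        int length = 160;
        System.out.println("GSM parts: " + partsNeeded(length, 160, 153));
        System.out.println("Unicode parts: " + partsNeeded(length, 70, 67));
    }
}
```

For a 160-character message this reports one part in GSM mode and three parts in Unicode mode.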

We need two methods for this. The first checks whether a given byte is in the character subset. The second examines each character (byte) in the string using the first method and returns the correct format based on the characters it finds.

java:

	public static boolean isValidSMSSubsetByte(byte b) {
		for (byte aByte : ISO88591_SUBSET) {
			if (b == aByte) {
				return true;
			}
		}
		return false;
	}

	public static SmsFormat smsCharacterSetSmsFormat(String str) {
		// examine all characters, but if we hit a unicode character then return right away
		for (char c : str.toCharArray()) {
			if (c > 255 || (! SMSUtilities.isValidSMSSubsetByte((byte)c))) {
				// character is outside of the ISO-8859-1 character set or it is in the character set but not the subset
				return SmsFormat.UNICODE;
			}
		}
		return SmsFormat.TEXT;
	}

javascript:

We have the same two methods in javascript, one method to check if a character is in the GSM subset, and the next to examine a string to see what length message we're allowed.

This is where I stop with the javascript code. I'm only using javascript to give feedback to the users while they build the SMS message, to let them know if the message can be sent in a single SMS or if it has to be split.
function isBasicChar(code) {
	for (var j = 0; j < basic_chars_hex.length; j++) {
		if (basic_chars_hex[j] == code) {
			return true;
		}
	}
	return false;
}

function smsCharacterSetMaxSize(str) {
	var SMS_ISO8859_SUBSET_SIZE = 160;
	var SMS_UNICODE_SIZE = 70;

	// examine all characters, but if we hit a unicode character then return right away
	var i = str.length;
	while (i--) {
		var code = str.charCodeAt(i); // uses javascript charCodeAt string method
		if (code > 255 || !isBasicChar(code)) {
			return SMS_UNICODE_SIZE;
		}
	}
	return SMS_ISO8859_SUBSET_SIZE;
}

3.1 Counting Unicode Bytes

When we send a Unicode SMS we're limited to 70 characters because we're sending 2-byte characters. SMS uses the UCS-2 encoding which encodes 65,536 characters (up to FFFFh). UTF-16 encodes up to 1,114,112 characters (up to 10FFFFh) and so converting between the two isn't completely straightforward, but we can ignore the extra characters encoded in UTF-16 for now.

UCS-2 doesn't have a byte order mark (BOM) and so is always big endian, and UCS-2 does not support surrogate pairs.

Not every character in a Java string has a length of 1 when you query the string, because characters outside the 16-bit range are stored as surrogate pairs. For example, a Java string holding the single emoji FACE WITH TEARS OF JOY (😂) has a length() of 2, not 1, because the character is composed of a high surrogate and a low surrogate. This single character ends up taking 4 bytes in UTF-16, while 春 is a single char, not a surrogate pair, and takes 2 bytes in UTF-16 but 3 bytes in a UTF-8 encoded string.
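
The difference between chars, code points and bytes can be checked with a short sketch. The class CharVsByteCount and its helper methods are names invented for this example; the string methods and charsets are standard Java. The two characters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class CharVsByteCount {
    // number of Unicode code points (what a reader would call characters)
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // number of bytes when encoded as big-endian UTF-16 without a byte order mark
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // number of bytes when encoded as UTF-8
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE02"; // FACE WITH TEARS OF JOY (U+1F602), a surrogate pair
        String chun = "\u6625";        // 春 (U+6625), a single char
        for (String s : new String[] {emoji, chun}) {
            System.out.println("length:" + s.length() + " codePoints:" + codePoints(s)
                    + " utf16Bytes:" + utf16Bytes(s) + " utf8Bytes:" + utf8Bytes(s));
        }
    }
}
```

The emoji reports a length of 2, one code point and 4 bytes in both encodings; 春 reports a length of 1, one code point, 2 bytes in UTF-16 and 3 bytes in UTF-8.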

The UCS-2 character set doesn't exist in Java but by using the character set UTF-16BE (16-bit big endian byte order) we get pretty close to the same thing, except that surrogate pair characters are two separate characters and no longer linked, and we don't have access to the entire UTF-16 character set.
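
One practical reason to ask for UTF-16BE by name rather than plain UTF-16 is the byte order mark: when encoding, Java's UTF-16 charset writes a 2-byte big-endian BOM in front of the text, which would throw off any byte count. The sketch below shows the difference; the class Utf16BomSketch and the method encodedLength are hypothetical names used only here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16BomSketch {
    // number of bytes produced when encoding s with the given charset
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "a";
        // "UTF-16" writes a byte order mark (FE FF) before the 2-byte code unit
        System.out.println("UTF-16:   " + encodedLength(s, StandardCharsets.UTF_16));
        // "UTF-16BE" writes only the 2-byte code unit
        System.out.println("UTF-16BE: " + encodedLength(s, StandardCharsets.UTF_16BE));
    }
}
```

The first line reports 4 bytes and the second 2 bytes for the same one-character string.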

Although the limit is stated in characters, we should really be counting the bytes of each character in the string to stay under the Unicode limit. And if we have to split the string we should not split in the middle of a surrogate pair (although this probably doesn't matter when sending an SMS).
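
As a small illustration of not splitting inside a surrogate pair, the following sketch adjusts a proposed split index so that it never lands between a high and a low surrogate. The class SurrogateSafeSplit and the method safeEnd are hypothetical, and the sketch assumes the requested index is not negative. It is a simpler alternative to iterating over the whole string, not a replacement for it.

```java
public class SurrogateSafeSplit {
    // Returns an index <= wanted at which s can be cut without separating
    // a high surrogate from the low surrogate that follows it.
    static int safeEnd(String s, int wanted) {
        int end = Math.min(wanted, s.length());
        if (end > 0 && end < s.length()
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // back up so the whole pair goes into the next piece
        }
        return end;
    }

    public static void main(String[] args) {
        String s = "ab\uD83D\uDE02c"; // a, b, FACE WITH TEARS OF JOY, c (length 5)
        System.out.println(safeEnd(s, 2)); // cut before the emoji
        System.out.println(safeEnd(s, 3)); // would split the pair, so it backs up
        System.out.println(safeEnd(s, 4)); // cut after the emoji
    }
}
```

Note that if the string starts with a surrogate pair and the wanted index is 1, the method returns 0, so a caller that loops on this value needs to handle an empty piece.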

To count characters, instead of using the obvious methods on Java's String class we use a character BreakIterator to iterate over what people would normally consider the characters in the String. With the character BreakIterator we can extract both halves of a surrogate pair, such as 😂, together.

 

java:

The following code runs through the String mixStr and splits it into segments of at most 5 bytes.
		// imports needed: java.nio.charset.Charset, java.nio.charset.StandardCharsets,
		// java.text.BreakIterator, java.util.ArrayList, java.util.List
		Charset UTF16BE = StandardCharsets.UTF_16BE;
		String mixStr = "a😂b春♞aáéí";
		System.out.println("\"" + mixStr + "\"" + " java String.length:" + mixStr.length() + " #bytes:" + mixStr.getBytes(UTF16BE).length);
		// create the BreakIterator and set the text we want to examine
		BreakIterator ci = BreakIterator.getCharacterInstance(java.util.Locale.ENGLISH);
		ci.setText(mixStr);

		int bytesLimit = 5; // byte limit of each string we're creating
		int byteCount = 0;
		StringBuilder currentPiece = new StringBuilder();
		List<String> strings = new ArrayList<>(); // substrings of bytesLimit bytes or fewer
		int start = ci.first();
		for (int end = ci.next(); end != BreakIterator.DONE; start = end, end = ci.next()) {
			String piece = mixStr.substring(start, end); // one user-visible character, possibly a surrogate pair
			byte[] bytes = piece.getBytes(UTF16BE); // the bytes this character takes in UTF-16BE
			System.out.println("start:" + start + " end:" + end + " str:" + piece + " length:" + piece.length() + " #bytes:" + bytes.length);
			if (byteCount + bytes.length > bytesLimit && currentPiece.length() > 0) {
				// adding this character would go beyond our byte limit, so we save the
				// current builder as a string and start a new one
				strings.add(currentPiece.toString());
				currentPiece = new StringBuilder();
				byteCount = 0;
			}
			// append the character to the builder that we're working with
			currentPiece.append(piece);
			// byte count of the current string is increased
			byteCount = byteCount + bytes.length;
		}
		// get any stragglers
		if (currentPiece.length() > 0) {
			strings.add(currentPiece.toString());
		}

		// debugging:
		for (String aString : strings) {
			System.out.println(aString.length() + ":" + aString.getBytes(UTF16BE).length + ":" + aString);
		}
Output:
"a😂b春♞aáéí" java String.length:10 #bytes:20
start:0 end:1 str:a length:1 #bytes:2
start:1 end:3 str:😂 length:2 #bytes:4
start:3 end:4 str:b length:1 #bytes:2
start:4 end:5 str:春 length:1 #bytes:2
start:5 end:6 str:♞ length:1 #bytes:2
start:6 end:7 str:a length:1 #bytes:2
start:7 end:8 str:á length:1 #bytes:2
start:8 end:9 str:é length:1 #bytes:2
start:9 end:10 str:í length:1 #bytes:2
1:2:a
2:4:😂
2:4:b春
2:4:♞a
2:4:áé
1:2:í

Looking at the output in more detail:

1) "a😂b春♞aáéí" java String.length:10 #bytes:20
This shows that the initial string, which looks like it's 9 characters, is actually 10 Java chars and 20 bytes.


2) We then iterate through each character in the Character BreakIterator
start:0 end:1 str:a length:1 #bytes:2
start:1 end:3 str:😂 length:2 #bytes:4
start:3 end:4 str:b length:1 #bytes:2
start:4 end:5 str:春 length:1 #bytes:2
start:5 end:6 str:♞ length:1 #bytes:2
start:6 end:7 str:a length:1 #bytes:2
start:7 end:8 str:á length:1 #bytes:2
start:8 end:9 str:é length:1 #bytes:2
start:9 end:10 str:í length:1 #bytes:2

The second line shows that the emoji is actually 2 Java chars (a high and a low surrogate) and 4 bytes. The rest are 2 bytes each.

3) And finally, these are the strings that we built that are 5 or fewer bytes:
1:2:a
2:4:😂
2:4:b春
2:4:♞a
2:4:áé
1:2:í


4. Conclusion

When sending SMS messages programmatically you first try to send the message using the GSM character set, a subset of the ISO-8859-1 character set that can be encoded as 7-bit characters, giving you a limit of 160 characters in your message.

If you must send the message as a Unicode SMS you need to do more than just count characters to stay under the 70 character limit. You also need to count bytes, or at least account for surrogate pairs, and you should try not to split your message in the middle of a Java surrogate pair.