While testing our SMPP functionality I sent a text message containing all emojis in the range U+1F600-U+1F64F using UCS-2 encoding.
As emojis are outside the 16-bit range, they are encoded as surrogate pairs (i.e. two 16-bit chars per emoji).
The first surrogate char is always 0xD83D and the second one is in the range 0xDE00-0xDE4F.
When looking at the resulting SMS, one of the emojis is displayed as an unknown symbol. This seems to happen because the message is split in the middle of a surrogate pair, leaving two invalid Unicode characters: the last char of the first message part and the first char of the second message part are lone surrogates, so each is replaced with the replacement char '\uFFFD' (65533 in decimal).
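To make the arithmetic concrete, here is a minimal sketch of how a code point above U+FFFF maps to its surrogate pair, plus a check for whether a cut position falls inside a pair. The class and method names (`SurrogateDemo`, `ToSurrogatePair`, `SplitsSurrogatePair`) are made up for illustration and are not part of the SMPP library.

```csharp
using System;

public static class SurrogateDemo
{
    // Computes the UTF-16 surrogate pair for a supplementary code point
    // (U+10000..U+10FFFF) using the standard formula.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    // Returns true if cutting `text` right before `index` would separate
    // a high surrogate from the low surrogate that follows it.
    public static bool SplitsSurrogatePair(string text, int index)
    {
        return index > 0 && index < text.Length
            && char.IsHighSurrogate(text[index - 1])
            && char.IsLowSurrogate(text[index]);
    }
}
```

For U+1F621, `ToSurrogatePair` gives 0xD83D / 0xDE21 (55357 / 56865 in decimal), which is exactly the pair at indices 66 and 67 of my message. With a first part of 67 chars the cut lands before index 67, and `SplitsSurrogatePair(text, 67)` returns true for that message.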
Code:
ISubmitSmBuilder builder = SMS.ForSubmit()
    .From(sourceAddress, sourceTon, sourceNpi)
    .To(destinationAddress, destinationTon, destinationNpi)
    .Coding(dataCoding)
    .Text(text)
    .ExpireIn(GatewayConfig.SmsExpiry);
Text of the first message part as a char array:
[0] 55357 '☐' char
[1] 56832 '☐' char
[2] 55357 '☐' char
[3] 56833 '☐' char
[4] 55357 '☐' char
......
[61] 56862 '☐' char
[62] 55357 '☐' char
[63] 56863 '☐' char
[64] 55357 '☐' char
[65] 56864 '☐' char
[66] 65533 '�' char <= Was first half of surrogate pair, is now replacement char.
Second message part:
[0] 65533 '�' char <= Was second half of surrogate pair, is now replacement char.
[1] 55357 '☐' char
[2] 56866 '☐' char
[3] 55357 '☐' char
[4] 56867 '☐' char
Original message:
[60] 55357 '☐' char
[61] 56862 '☐' char
[62] 55357 '☐' char
[63] 56863 '☐' char
[64] 55357 '☐' char
[65] 56864 '☐' char
[66] 55357 '☐' char <= This pair gets split
[67] 56865 '☐' char <=
[68] 55357 '☐' char
[69] 56866 '☐' char
[70] 55357 '☐' char
This is obviously an edge case, but it would be great if the splitting took surrogate pairs into account, especially as all emoji characters use surrogates.
Multi-part UCS-2 messages which don't use surrogate pairs work fine, by the way.
Does anyone have an idea how this can be fixed?
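For reference, this is roughly the behaviour I would expect from the splitting, as a minimal sketch only. `SurrogateSafeSplitter` is a hypothetical helper, not something the library provides, and I am assuming the usual limit of 67 UCS-2 chars per concatenated part (which matches the 67-char first part in the dump above). When a cut would land between a high and a low surrogate, it is moved one char earlier so the whole pair goes into the next part.

```csharp
using System;
using System.Collections.Generic;

public static class SurrogateSafeSplitter
{
    // Splits text into chunks of at most maxChars UTF-16 code units.
    // If a chunk would end on a high surrogate whose low surrogate
    // follows, the chunk is shortened by one so the pair stays together.
    public static List<string> Split(string text, int maxChars)
    {
        if (text == null)
            throw new ArgumentNullException(nameof(text));
        if (maxChars < 2)
            throw new ArgumentOutOfRangeException(
                nameof(maxChars), "A part must be able to hold a full surrogate pair.");

        var parts = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            int length = Math.Min(maxChars, text.Length - start);
            int end = start + length;

            // Back off by one code unit if the cut falls inside a pair.
            if (end < text.Length
                && char.IsHighSurrogate(text[end - 1])
                && char.IsLowSurrogate(text[end]))
            {
                length--;
            }

            parts.Add(text.Substring(start, length));
            start += length;
        }
        return parts;
    }
}
```

With my 160-char test message and `maxChars` set to 67, this yields parts of 66, 66 and 28 chars instead of cutting at index 67, so no part starts or ends with a lone surrogate. How the resulting parts would then be handed to the builder as a concatenated message depends on the library and is not shown here.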
Thanks and best regards,
Stefan