A few weeks ago, a friend asked me to set up LoRa hardware so we could communicate with each other from across the city. The use case is to "set up comms when the internet stops working", even though in the event of such a "disaster" we'll probably just wait until it works again.
The LoRa layer we're using is called Meshtastic. Meshtastic allows nodes to send text messages of about 200-250 bytes. I wanted to find a way to pack as much information as possible into these messages, which led me to research language and compression. Armed with that new knowledge, I created a small protocol and implemented it. In this article, we'll walk through the search, and at the end I'll give a small overview of the protocol's workings and a link to the source code.
TL;DR: Domain-specific dictionary compression always outperforms general-purpose compression. It's such a superior strategy that examples of it can be found everywhere, in both the analogue and the digital space.
Here are requirements to keep in mind as we search for a solution:
- We want to send text messages with a maximum of about 200 bytes.
- It will be used 'when the internet stops working', so most messages will be friendly messages ("How are you doing?"), emergency broadcasts ("SOS - wounded, need help"), or planning / logistics based ("Let's meet at x" / "Bring supplies to y").
- Compression can be lossy as long as it doesn't result in semantically different messages.
The image below shows a map with Meshtastic nodes in the Utrecht area. See the full map at map.meshnet.nl.
Languages
Toki Pona - the language of good - is a language made up of 140 words. This allows us to reference every word with just one byte. But translating to Toki Pona can cause a message to require more words. Take the following translation.
"Hello my name is Spip and I like to flap my arms and hug my face." "kule! mi nimi li Spip. mi wile tawa e palisa mi en palisa mi wawa e kute mi."
It's an interesting fit for compression research because many words and letters repeat, but it would also require a translator program on both sides. I'm worried that original words may get lost in translation, so we'll skip this one for now.
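Still, the one-byte-per-word idea is easy to make concrete. Below is a minimal sketch in Python, assuming both sides ship the exact same fixed word list (the list shown is a tiny made-up excerpt, not the full vocabulary):

```python
# Minimal sketch: one byte per Toki Pona word. Both sides must share
# the exact same word list; the excerpt below is hypothetical.
WORDS = ["toki", "mi", "sina", "li", "e", "nimi", "wile", "tawa", "pona"]
INDEX = {word: i for i, word in enumerate(WORDS)}

def encode(text: str) -> bytes:
    # One byte per word; raises KeyError for anything outside the list,
    # so names like "Spip" would need an escape mechanism.
    return bytes(INDEX[word] for word in text.lower().split())

def decode(blob: bytes) -> str:
    return " ".join(WORDS[b] for b in blob)

assert decode(encode("mi wile tawa")) == "mi wile tawa"  # 3 bytes on the wire
```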
Instead of using fewer words, we might use words that pack more information per letter. The language Ithkuil is "intended to express deeper levels of human cognition more overtly, logically, and precisely than natural languages". That makes it sound like an ideal candidate for a text messaging language, since the flat nature of text tends to cause miscommunication. Its design allows for crazy text density, but Ithkuil is incredibly complex, which makes it hard to create and/or maintain a translator program. A translator I found on GitHub only translates Ithkuil to English.
Here are three examples of the language. Notice how few words are required to generate sentences.
"rrala udklälia" -> "a cat for use as a projectile weapon" "Wežduihá sstilomke" -> "The alien device (has) emitted a string of random short and long beeping sounds" "Hlurmiô-igulotruxröxḑuökfái" -> "Go run outside along the wall until you reach the back outside corner of the house"
Even though this word expansion works quite well, the density comes from exact logical constructs that we don't normally use, so there might be little added benefit if we're going to translate it back to English anyway.
Aside from regular languages, we can use memes and emojis to convey complex constructs that are hard to express with words. Images like memes and emojis seem to fill the gaps and blind spots of natural languages, just like art, symbols and story plots do. To save the bandwidth of sending actual images, we can store pre-defined memes and reference them by index.
Shortening words
People already type compressed messages using clipped words ("influenza" -> "flu"), acronyms ("OMG"), numbers ("l8r"), omitted vowels ("pls hlp") and phonetics ("u r"). We have evidence of people compressing written language since at least the mid-4th century BC. Using symbols to quickly jot something down is called shorthand. In recent times it has been used by journalists, healthcare professionals, the police and secretaries. There is an entire world of shorthand languages. It's a rabbit hole I can recommend.
A shorthand machine – or stenograph – is a special keyboard on which you press all the letters of a word at once. They also use abbreviations - like 'stren' for 'strength' - to minimize the number of keys that need to be pressed. People who type on these keyboards are called stenographers. Generally they are court reporters and closed captioners (live subtitle writers). They are the fastest typists in the world and can reach over 300 words per minute. To put that in perspective, seeing someone type 80 WPM on a regular keyboard already looks machine-like.
The Open Steno Project has created open source hardware and software to bring the stenograph to the public. Check out the software Plover if you want to start typing in steno today! Electronic stenography is mainly used for transcribing court cases, but it can be used for anything. Here someone is programming in steno. It looks surprisingly effective.
Encoding entire phrases
Telegram codes
The stenograph is an important link between shorthand in analogue text processing and efficiency in digital communication. A similar link can be found in telegram codes. Back when people communicated with telegrams, they paid per word, so they had every incentive to compress their text. Codebooks were created to link codewords to entire phrases. The significance and efficiency of this technique cannot be overstated, as codebooks were invented for just about every profession. For example, in a code book for a shipping business, "Abdication" meant:
"To Cork for orders to discharge in the U. K., with option of charterers to order her to Liverpool direct before sailing, at 2s. 6d. less; and 5s. extra, if ordered to the continent to discharge between Havre and Hamburg, both ports inclusive."
Good luck finding a compression algorithm to shorten that!
Telegram codes were also used in romance. Take this article for example, which describes an encoded letter between two men. They used a private cable code because homosexuality was a felony in the New York City of 1914.
The pattern is starting to become clear: stenographers use abbreviations – codes – to reference words, and telegram codebooks use codes to reference entire phrases. Let's review a few more analogue examples before we return to electronic compression. We'll see that this idea is the basis for many digital compression techniques as well.
Making words or codes mean entire phrases is a superior compression technique, but it's domain-bound because of the finite number of messages you can efficiently reference. These references are sometimes called brevity codes in the analogue world. In IT this is called dictionary-based compression, a dictionary coder or a substitution coder. In the computer space we implement this by simply sending an index into a pre-defined dictionary, like a telegram code but with a number. In theory you can compress an entire book into a byte, if you agree beforehand that that byte references the content of that book. As far as I know, it's impossible to compress data into smaller packets than this. (You could of course agree that "if I am not sending at time x, this means y", thus compressing data to _no_ information.)
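To make that concrete, here is a minimal sketch of a dictionary coder in Python. The phrase list is a made-up example; the only real requirement is that sender and receiver agree on it in advance:

```python
# Minimal dictionary coder: the whole message is one index byte,
# which supports up to 256 pre-agreed phrases.
PHRASES = [
    "How are you doing?",
    "SOS - wounded, need help",
    "Bring supplies to the meeting point",
]

def encode(phrase: str) -> bytes:
    return bytes([PHRASES.index(phrase)])  # raises ValueError if unknown

def decode(blob: bytes) -> str:
    return PHRASES[blob[0]]

assert decode(encode("SOS - wounded, need help")) == "SOS - wounded, need help"
```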
MSBC - NATO Brevity Codes
NATO's MSBC (Multi-Service Brevity Codes) is an excellent brevity code dictionary. It uses entire words because it's meant for voice communication. The 2023 MSBC contains an update section with new words, which lets us track the development of weapon systems. For example, they added:
"KRAKEN" -> Release of long-range antiship missile "KOBE" -> Friendly air-launched or surface-launched hypersonic weapon "DRIVE-BY" -> Threat aircraft sensors or systems observed or in use
For some obscure reason they also added "GENEVA", which means "That something is protected by the Geneva Conventions". Another phrase that stood out to me was "TUMBLEWEED". It's used to request information to improve situational awareness, and can be expanded with words that indicate the current awareness level. I don't think we have a good way to express such a request in everyday speech (in either English or Dutch).
They also have instructions on how to use chat (IRC), and recommend the use of mIRC. I love this because I grew up with mIRC. In the section STANDARD TACTICAL CHAT (TC) they list abbreviations like AFK (away from keyboard), thx, pls, etc. Many words from the document are used in civilian life, as are phrases like "Heads up" and "Spoofing".
Seaspeak and other speaks
In the maritime world we have SMCP (Standard Marine Communication Phrases) for voice communication and INTERCO (International Code of Signals) for other communication channels (like flaghoist, signal lamp, flag semaphore, radiotelegraphy, etc.). SMCP has the most types of messages. It uses a template-like structure, as we see in many of these codes, and a big dictionary with many categories and subcategories of messages. An example is A1/6.2.1.1.2, which references "Ice / iceberg(s) in position ... / area around ... .".
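Such template codes translate naturally to a digital wire format: a code byte that selects the phrase, followed by packed values for the "..." slots. The numbering and packing below are my own assumptions for illustration, not the real SMCP encoding:

```python
import struct

# Hypothetical wire format for a template phrase: one code byte that
# selects the template, then two floats that fill its position slot.
TEMPLATES = {0x12: "Ice / iceberg(s) in position {lat:.4f}, {lon:.4f}."}

def encode(code: int, lat: float, lon: float) -> bytes:
    return struct.pack(">Bff", code, lat, lon)  # 9 bytes total

def decode(frame: bytes) -> str:
    code, lat, lon = struct.unpack(">Bff", frame)
    return TEMPLATES[code].format(lat=lat, lon=lon)

print(decode(encode(0x12, 52.0907, 5.1214)))  # a position near Utrecht
```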
The Russian navy has an interesting sea-code signal that means "To act independently or according to instruction", as in, "I'm not following the rules for a good reason".
P2000 and 10-codes
Many professions have number-codes for domain-specific tasks.
In the Netherlands the emergency services use the P2000 radio system. A mix of numbers and codes is used to exchange information. Numbers indicate events, and words or letter combinations indicate entities or things. For example, AM means ambulance and 541 means "a small fire inside".
In the United States there are the 10-codes used by the police (like 10-85, meaning "Arrival delay due to .."). CB-10, described here as "the original social media for truckers", is the civilian variant. It has codes for the basic things you'd expect a trucker to need, like warning each other about speed traps or asking for police backup. It even had cool slang like "Van Gogh", meaning "A vehicle without a CB radio" (because it has no "ears"). In the Netherlands truckers also used to communicate this way via "Kanaal 19".
Back to computer theory
Dictionary-based compression is also very common in the computer world. There, the dictionary commonly isn't one of words or phrases but of patterns like tri-grams (like "eve" or " is"), or whatever is efficiently referenceable. Algorithms like LZ78 and LZW use such a dictionary, and they can update the dictionary while encoding or decoding.
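The encoding loop of LZW fits in a few lines and shows how the dictionary grows as the input repeats itself. A minimal sketch:

```python
def lzw_encode(data: bytes) -> list[int]:
    # Start with all 256 single bytes; learn longer patterns as we go.
    dictionary = {bytes([i]): i for i in range(256)}
    out, current = [], b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                      # extend the match
        else:
            out.append(dictionary[current])          # emit longest match
            dictionary[candidate] = len(dictionary)  # learn new pattern
            current = bytes([byte])
    if current:
        out.append(dictionary[current])
    return out

print(lzw_encode(b"abababab"))  # [97, 98, 256, 258, 98]: repeats collapse
```

The decoder can rebuild the exact same dictionary from the code stream, so the dictionary itself never needs to be transmitted.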
The paper "A dictionary-based multi-corpora text compression system" from 2021 talks about a word-dictionary based compression algorithm. They use transformation characters, like the character '~' at the end of an encoded word to denote that the first letter of the word is capitalized, etc. How they implement delimiters - the great bottleneck of compression - is not specified.
There are many clever techniques for compressing data. The paper "A Block-sorting Lossless Data Compression Algorithm" from 1994 has 2853 freaking citations. It splits text into blocks so the blocks can be compressed separately, and uses cyclic shifts to transform each block into a form that compresses much better.
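That cyclic-shift trick is the Burrows-Wheeler transform. A naive sketch sorts every rotation of the block and keeps the last column, which clusters similar characters together (real implementations use suffix arrays instead of materializing all rotations):

```python
def bwt(block: str) -> str:
    # Naive Burrows-Wheeler transform: sort all cyclic shifts of the
    # block; the last column groups similar characters together.
    s = block + "\0"  # sentinel marks the original end, making it reversible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(repr(bwt("banana")))  # 'annb\x00aa' - the a's end up adjacent
```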
Domain-specific techniques will always - as far as we know - outperform compressors for the general case. There is capnproto for compact data types, msgpack for compact data structures and cbor for compact JSON. To find the algorithm with the best compression rates for text, we can check out the Hutter Prize: if you can compress a gigabyte of Wikipedia text better than the last top runner, you can win thousands of dollars. Top contestants are PAQ and cmix. PAQ can "preprocess text files by looking up words in an external dictionary", and its binary actually runs without errors. Cmix is currently the top algorithm for the Hutter Prize. It uses so much RAM that it crashes my laptop if I try to compress the simplest of files, so I don't trust it to run on little LoRa chips.
A compressor called LLMA uses an LLM and claims to have the best text compression rate, even better than cmix. However, it requires a 1 GB LLM model, compresses at about "16 bytes per second", and "Since LLM inference involves floating point operations, it cannot guarantee that the compressed data generated by a machine can be successfully decompressed on another machine".
PAQ seems the best candidate for a gigabyte of chaotic Wiki text, but unishox2 seems to perform better for small text messages. Unishox2 was created from a battle between multiple little compression methods. Smaz, one of the unishox2 contestants, is brilliantly simple and also worth mentioning.
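The core idea of smaz can be illustrated in a few lines: greedily replace common English fragments with one-byte codes, with an escape for anything not in the table. The fragment table below is a tiny made-up subset; the real smaz ships 253 fragments plus verbatim escape codes:

```python
# Simplified smaz-style codec. Fragments are tried longest-first;
# unknown characters are escaped as literals (here: code 256 + byte).
FRAGMENTS = [" the", "ing", "you", " to", "he", "th", "e", "t", "a", "o", " "]

def compress(text: str) -> list[int]:
    out, i = [], 0
    while i < len(text):
        for code, frag in enumerate(FRAGMENTS):
            if text.startswith(frag, i):
                out.append(code)
                i += len(frag)
                break
        else:
            out.append(256 + ord(text[i]))  # literal escape
            i += 1
    return out

def decompress(codes: list[int]) -> str:
    return "".join(FRAGMENTS[c] if c < 256 else chr(c - 256) for c in codes)

msg = "you have to breathe"
assert decompress(compress(msg)) == msg
```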
Here is a table that compares the output of compression tools. Let's compress the sentence:
"If you decide to turn the trolley cart, there is a party waiting to happen. How will you approach this contrived ethical dilemma?"
In this case smaz works exceptionally well, but for most short messages unishox2 is better.
Original message size: 130 bytes

Program     Bytes   Comment
-----------------------------------------
paq8l       109     (without the header)
zlib        106
cmix-21     101     (no dict)
textblock    83     (my attempt)
unishox2     80
smaz         71
Creating a protocol
For raw text, my first attempt was to create one- and two-byte dictionaries of words and to switch between them, but the delimiters needed to switch dictionaries cost too much overhead. Even with a single two-byte dictionary, my implementation was outperformed by unishox2 for most of my test cases, so I switched to unishox2 for arbitrary text compression.
Here you can find the source for the first version of the protocol. It uses multiple phrase dictionaries, complemented by data types for things like GPS coordinates, numbers and text that isn't in the dictionary. The program then checks whether compressing the resulting binary blob with a general-purpose compressor like zlib saves any additional bytes.
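That final zlib check can be sketched like this; the one-byte flag layout is my own simplification, not necessarily the protocol's actual framing:

```python
import zlib

def finalize(blob: bytes) -> bytes:
    # Keep the zlib-compressed form only if it actually saves bytes;
    # a leading flag byte tells the receiver which form it got.
    packed = zlib.compress(blob, 9)
    return b"\x01" + packed if len(packed) < len(blob) else b"\x00" + blob

def restore(frame: bytes) -> bytes:
    return zlib.decompress(frame[1:]) if frame[0] == 1 else frame[1:]

blob = b"meet at the usual place, meet at the usual place"
assert restore(finalize(blob)) == blob
```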
Closing
We've looked at the ways in which people compress text in the analogue and digital space. It's definitely possible to create a protocol that packs a lot of information into under 200 bytes. The superior strategy for domain-specific text-based communication is to use a dictionary of phrases and to reference it by index.
The search also sparked my interest in the Sapir-Whorf hypothesis, which states that the language in which you communicate determines - or at least influences - your thought patterns. Could texting have affected the way we think? What about the speed at which we can type? If we all started speaking Toki Pona, would that bring us closer or would it set us apart? The world of languages is ever developing, and I don't think we'll ever find the perfect language to express our exact thoughts, be it a natural language or a programming one.