After reading this article about Tesla using a clever algorithm to catch an employee leaking data, I thought it would be interesting to analyze their algorithm and see if I could design a better one.
The Tesla Algorithm
According to this Tweet by Elon Musk, “We sent what appeared to be identical emails to all, but each was actually coded with either one or two spaces between sentences, forming a binary signature that identified the leaker.”
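To make the scheme concrete, here is a minimal sketch of how I imagine it could work. This is my own reconstruction from the Tweet, not Tesla's code: the function names, the sample sentences and the idea of numbering recipients are all assumptions for illustration. Note that an email with n sentences only has n - 1 gaps between them, so that is how many bits it can carry.

```python
import re

def encode_spacing(sentences, bits):
    """Join sentences with one space for a 0 bit and two spaces for a 1 bit."""
    if len(bits) != len(sentences) - 1:
        raise ValueError("need exactly one bit per gap between sentences")
    out = [sentences[0]]
    for bit, sentence in zip(bits, sentences[1:]):
        out.append(" " * (1 + bit))  # 0 -> " ", 1 -> "  "
        out.append(sentence)
    return "".join(out)

def decode_spacing(text):
    """Recover the bits by measuring the run of spaces after each sentence end."""
    gaps = re.findall(r"[.!?]( +)(?=\S)", text)
    return [1 if len(gap) >= 2 else 0 for gap in gaps]

def int_to_bits(value, width):
    """Fixed-width, most-significant-bit-first list for a hypothetical recipient number."""
    return [(value >> i) & 1 for i in reversed(range(width))]

# Hypothetical message and recipient number, for illustration only.
sentences = ["The launch moved.", "Budgets are frozen.",
             "Hiring resumes in May.", "Thanks all."]
bits = int_to_bits(5, len(sentences) - 1)  # recipient 5 -> [1, 0, 1]
email = encode_spacing(sentences, bits)
print(decode_spacing(email))  # [1, 0, 1]
```

The decoder here is deliberately naive: it treats every period, exclamation mark or question mark followed by spaces as a sentence boundary, so abbreviations like "e.g." inside a sentence would throw it off.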
Pros
The algorithm is easy to understand, implement and validate.
The signature can be detected from a screenshot.
The number of unique signatures grows exponentially with the number of gaps between sentences. So, an email with 32 such gaps can identify 2^32, or about 4.29 billion, email addresses, which is roughly every email address in existence according to this estimate.
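As a quick sanity check on the capacity claim, the number of gaps needed is just the base-2 logarithm of the number of recipients, rounded up. The sketch below is my own arithmetic helper, and the 100,000 headcount is a made-up figure, not a number from the article.

```python
def gaps_needed(recipients):
    """Smallest number of one-or-two-space gaps that gives every recipient a distinct signature."""
    if recipients < 1:
        raise ValueError("need at least one recipient")
    # ceil(log2(recipients)) computed with integer arithmetic
    return (recipients - 1).bit_length()

print(gaps_needed(2**32))    # 32
print(gaps_needed(100_000))  # 17 (hypothetical company headcount)
```

So even a very large company would only need a couple of dozen sentences to give every employee a distinct signature.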
Cons
The signature is visible to the human eye, making it easy to detect.
The mapping between email address and signature must be stored in a database, and storing sensitive user information brings additional storage costs, privacy and compliance considerations, and security risks.
The Zero-Width Hash Algorithm
Using HMAC, cryptographically hash the email address with a secret key to generate a unique signature.
Using the zero-width encoding scheme described below, encode the signature into invisible text.
Finally, embed the invisible signature into an email message.
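The first step can be sketched with Python's standard hmac and hashlib modules. Treat this as an illustration under my own assumptions: SHA-256 as the hash, lowercasing and trimming the address before hashing, and the demo key and address are all choices I made here, not part of any reference implementation.

```python
import hashlib
import hmac

def sign_address(email_address, secret_key):
    """Return the HMAC-SHA256 of an email address as a 64-character hex string."""
    # Assumption: normalize the address so "Bob@Example.com " and
    # "bob@example.com" map to the same signature.
    normalized = email_address.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()

# Hypothetical key and address, for illustration only.
demo_key = b"demo-secret-key"
signature = sign_address("alice@example.com", demo_key)
print(len(signature))  # 64 hex characters = 256 bits
```

The same address and key always give the same signature, while anyone without the key cannot compute or forge it.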
What is zero-width encoding?
Zero-width encoding is a bespoke text-to-invisible-text encoding scheme that encodes text in binary using the invisible characters zero-width space, zero-width non-joiner and zero-width joiner to represent 0, 1 and the end of each byte respectively. The reference implementation is freely available on GitHub here.
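Here is a minimal sketch of the encoding as described above. It is not the reference implementation: I am assuming the text is first converted to UTF-8 bytes and that each byte is written as exactly 8 bits before the end-of-byte marker, and the function names and sample message are my own.

```python
ZERO = "\u200b"      # zero-width space      -> bit 0
ONE = "\u200c"       # zero-width non-joiner -> bit 1
END_BYTE = "\u200d"  # zero-width joiner     -> end of byte

def zw_encode(text):
    """Encode text as invisible characters: 8 bit characters plus a terminator per byte."""
    out = []
    for byte in text.encode("utf-8"):
        bits = format(byte, "08b")
        out.append("".join(ONE if b == "1" else ZERO for b in bits))
        out.append(END_BYTE)
    return "".join(out)

def zw_decode(message):
    """Extract hidden text from a message, skipping every visible character."""
    data = bytearray()
    bits = ""
    for ch in message:
        if ch == ZERO:
            bits += "0"
        elif ch == ONE:
            bits += "1"
        elif ch == END_BYTE and bits:
            data.append(int(bits, 2))
            bits = ""
    return data.decode("utf-8")

# Embed a short, made-up signature in the middle of a visible message.
visible = "Quarterly numbers attached."
stamped = visible[:9] + zw_encode("a1f3") + visible[9:]
print(zw_decode(stamped))  # a1f3
```

One caveat with this sketch: the decoder assumes the visible text contains no zero-width characters of its own. Some emoji sequences and scripts use the zero-width joiner and non-joiner legitimately, so a real implementation would need to account for that.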
Pros
The signature is invisible to the human eye.
The number of unique signatures is a function of the underlying cryptographic hash function. For example, with SHA-256 we can identify up to 2^256, or roughly 10^77, email addresses, which approaches common estimates of the total number of atoms in the known universe.
With HMAC, we don’t need to store the mapping between email address and signature thus eliminating the costs and risks associated with storing sensitive user information.
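To show why no mapping has to be stored, here is a sketch of how a leaked signature could be traced back. It assumes we still have the list of recipients (for example, from the company directory) and simply recomputes each signature on demand. The function names, key and addresses are hypothetical.

```python
import hashlib
import hmac

def sign_address(email_address, secret_key):
    """HMAC-SHA256 of a normalized email address, as a hex string (assumed scheme)."""
    normalized = email_address.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()

def identify_leaker(recovered_signature, candidate_addresses, secret_key):
    """Recompute each candidate's signature and return the address that matches, if any."""
    for address in candidate_addresses:
        expected = sign_address(address, secret_key)
        # compare_digest does a constant-time comparison of the two hex strings
        if hmac.compare_digest(expected, recovered_signature):
            return address
    return None

# Hypothetical key and recipients, for illustration only.
demo_key = b"demo-secret-key"
recipients = ["alice@example.com", "bob@example.com", "carol@example.com"]
leaked = sign_address("bob@example.com", demo_key)
print(identify_leaker(leaked, recipients, demo_key))  # bob@example.com
```

The only thing that has to be kept safe is the secret key. The trade-off is that identification costs one HMAC per candidate address rather than a single database lookup.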
Cons
The algorithm is hard to understand, implement and validate.
The signature cannot be detected from a screenshot.
Conclusion
After considering the pros and cons of each algorithm, I would choose the Tesla algorithm for its simple design and because it has more than enough capacity to uniquely identify every email address currently in existence, which is a number far greater than any single company will ever employ.