How Acoustic Data Transmission Works

A technical guide to sending data through sound - the physics, the protocols, and why it still matters in 2026.

Introduction

Sound is one of the oldest data transmission media humans have. We've been encoding messages into acoustic signals since the first signal drum and the first ship's bell - long before electricity, longer before radio. What's interesting is that we never stopped. The same fundamental approach - encode information into a waveform, broadcast it through the air, decode it on the other side - quietly underpins a surprisingly active corner of modern computing.

This guide explains how acoustic data transmission actually works in 2026: the physics, the protocols in active use, the reliability tradeoffs, and the security properties that make it interesting for specific applications no other communication channel can serve.

It is written for engineers, researchers, and technical readers who want a comprehensive reference rather than a marketing overview. It cites primary sources where possible.

A brief history

The story of acoustic data transmission as a deliberate engineering practice is older than most realize, and it starts with accessibility rather than computing.

In 1963, deaf physicist Robert Weitbrecht built a device that translated the sound from a telephone earpiece into electrical signals and converted outgoing electrical pulses back into sound for the mouthpiece. This invention - sometimes called the Weitbrecht modem - enabled long-distance text communication for deaf users via teletypewriter (TTY) technology. It was the first widely deployed acoustic coupler.^[1]

Three years later, in 1966, John Van Geen at the Stanford Research Institute (now SRI International) was working on a version capable of supporting 8-bit ASCII terminals at faster rates. Van Geen's contribution was a circuit that could detect bits of data amid the considerable hiss of long-distance telephone connections - making reliable digital communication over an analog audio channel practical for the first time. The first commercial model based on this work was built by Livermore Data Systems in 1968.^[2]

The mechanism was simple. A user dialed a remote computer manually, placed the phone handset into rubber cups containing a microphone and speaker, and the device emitted and received tones - converting binary data to acoustic signals that travelled over standard telephone lines. Typical transmission rates were 300 baud, sometimes only 150. At 300 baud, downloading 1 MB of data took over seven hours.^[2]

Acoustic couplers were eventually superseded by direct-connect modems in the late 1970s, which were faster and more reliable. But the underlying idea - encoding data into audio waveforms - never went away. It quietly evolved into the dial-up modem tones of the 1990s, then into a new generation of consumer applications in the 2010s, then into something genuinely novel in the 2020s.

What changed was the channel. Acoustic couplers used the telephone network as a transport. Modern data-over-sound uses the air directly - speaker to microphone, with no wire in between.

The physics

To understand why acoustic data transmission works at all, it helps to start with the channel itself.

The audible frequency range for humans is roughly 20 Hz to 20 kHz. Consumer speakers and microphones - the ones embedded in phones, laptops, and tablets - are designed to reproduce and capture this range with reasonable fidelity. Most consumer hardware also has some response in the ultrasonic range above 20 kHz, though performance falls off quickly.^[3]

Within this channel, the engineering problem is the same one Van Geen solved in 1966: how do you reliably encode binary data into a waveform that survives transmission through air, captures cleanly through a microphone, and decodes correctly on the receiving end?

The most common answer is frequency-shift keying (FSK). The principle is straightforward: assign different frequencies to different data values. A simple FSK protocol might use 1875 Hz to represent a binary 0 and 1925 Hz to represent a binary 1. The sender plays one frequency or the other for a fixed duration. The receiver listens for the dominant frequency in each time slot and decodes accordingly.
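The two-tone scheme just described can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the 48 kHz sample rate and the 40 ms symbol length are assumptions chosen so that both tones complete a whole number of cycles per symbol, and the detector is a plain correlation against each candidate frequency.

```python
import math

SAMPLE_RATE = 48_000        # samples per second (assumed)
SAMPLES_PER_SYMBOL = 1920   # 40 ms per bit at 48 kHz (assumed)
FREQ_ZERO = 1875.0          # Hz, represents binary 0
FREQ_ONE = 1925.0           # Hz, represents binary 1


def modulate(bits):
    """Return float samples: one fixed-length tone per bit."""
    samples = []
    for bit in bits:
        freq = FREQ_ONE if bit else FREQ_ZERO
        for n in range(SAMPLES_PER_SYMBOL):
            samples.append(math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return samples


def tone_energy(window, freq):
    """Correlate the window against a cosine/sine pair at `freq`."""
    re = im = 0.0
    for n, s in enumerate(window):
        angle = 2 * math.pi * freq * n / SAMPLE_RATE
        re += s * math.cos(angle)
        im += s * math.sin(angle)
    return re * re + im * im


def demodulate(samples):
    """Pick the stronger of the two candidate tones in each time slot."""
    bits = []
    last_start = len(samples) - SAMPLES_PER_SYMBOL
    for start in range(0, last_start + 1, SAMPLES_PER_SYMBOL):
        window = samples[start:start + SAMPLES_PER_SYMBOL]
        one = tone_energy(window, FREQ_ONE)
        zero = tone_energy(window, FREQ_ZERO)
        bits.append(1 if one > zero else 0)
    return bits


bits = [0, 1, 1, 0, 1, 0, 0, 1]
recovered = demodulate(modulate(bits))
```

A real receiver also has to find where each symbol starts and cope with noise and echo, which this sketch sidesteps by feeding the generated samples straight back into the decoder.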

In practice, modern protocols don't just use two frequencies - they use many. By dividing the available bandwidth into many distinct frequency bins and transmitting multiple frequencies simultaneously (each carrying part of the data), the throughput multiplies. This is the principle behind multi-frequency FSK.

A few constraints govern what's possible:

Bandwidth determines throughput. A wider frequency range allows more bins, which allows more bits per symbol, which means higher data rates. The audible range gives roughly 4-5 kHz of usable bandwidth before high-frequency components start dropping off on consumer hardware.

Symbol duration determines reliability. Shorter symbols mean higher data rates but worse error tolerance. The longer each tone is held, the more averaging the receiver can do to filter out noise.

Noise is real but tractable. Speech and music have most of their energy below 4 kHz. Background noise above 4 kHz is far quieter, which is why higher-frequency protocols tend to be more reliable in noisy environments - they're operating where ambient noise rarely reaches.

The Shannon-Hartley theorem still applies. The channel capacity of any communications medium is bounded by its bandwidth and signal-to-noise ratio. Acoustic channels have low bandwidth (tens of kHz at best) and middling SNR (typical room noise floors), which caps theoretical capacity far below that of radio links. Protocols robust enough to survive cheap hardware, echo, and room noise then run well below that cap. We're talking bytes per second, not megabytes per second.

This is fine. Acoustic data transmission is not competing with WiFi or Bluetooth on raw throughput. It's solving different problems.
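To put rough numbers on the Shannon-Hartley bound, the sketch below evaluates C = B * log2(1 + S/N). The 4.5 kHz bandwidth and the 10 dB signal-to-noise ratio are illustrative assumptions, not measurements of any real room or device.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Shannon-Hartley bound C = B * log2(1 + S/N), in bits per second."""
    snr_linear = 10 ** (snr_db / 10)   # convert dB to a power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)


# Illustrative assumptions: a 4.5 kHz band and a 10 dB signal-to-noise ratio.
capacity = shannon_capacity_bps(4500, 10)
print(f"{capacity:.0f} bit/s, about {capacity / 8:.0f} bytes/s")
```

With these inputs the bound lands in the low kilobytes per second. That is the theoretical ceiling for an ideal code on an idealized channel; deployed protocols give up most of that headroom in exchange for decoding reliably on arbitrary consumer hardware.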

How modern protocols work: ggwave as a case study

The most widely used open-source data-over-sound library today is ggwave, written by Georgi Gerganov and released under the MIT license in 2018.^[4] It is the same author who later wrote llama.cpp and ggml, and ggwave shares the same engineering sensibility - small, dependency-free, and built around a clear protocol specification.

ggwave's modulation scheme is a useful concrete example because it represents what's possible with current consumer hardware.

The core mechanism is multi-frequency FSK with the following properties:^[5]

Data is split into 4-bit chunks
Three bytes are transmitted at each moment in time using six tones - one tone per 4-bit chunk
The 6 tones are emitted in a 4.5 kHz range divided into 96 equally-spaced frequencies
Frequency spacing: dF = 46.875 Hz
For audible (non-ultrasonic) protocols: F0 = 1875 Hz (so the band runs from 1875 Hz to roughly 6375 Hz)
For ultrasonic protocols: F0 = 15000 Hz (so the band runs from 15 kHz to roughly 19.5 kHz, above the comfortable hearing range for most adults)
Original data is encoded with Reed-Solomon error correction codes before transmission
Beginning and end of each transmission are marked with special sound markers
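The mapping from bytes to tones can be sketched as follows. The constants come from the list above; the layout is an assumption made for illustration: each of the six tones is given its own group of 16 adjacent frequencies (6 x 16 = 96), and the low 4-bit chunk of each byte is sent before the high one. The function names are hypothetical and are not ggwave's API.

```python
F0_HZ = 1875.0          # base frequency of the audible band
DF_HZ = 46.875          # spacing between adjacent frequencies
BINS_PER_TONE = 16      # a 4-bit chunk has 16 possible values


def symbol_frequencies(three_bytes):
    """Map 3 bytes to 6 tone frequencies (assumed layout, see text)."""
    if len(three_bytes) != 3:
        raise ValueError("one symbol carries exactly 3 bytes")
    freqs = []
    for i, byte in enumerate(three_bytes):
        low, high = byte & 0x0F, byte >> 4
        for j, chunk in enumerate((low, high)):
            tone_index = 2 * i + j                          # which of the 6 tones
            bin_index = tone_index * BINS_PER_TONE + chunk  # 0..95
            freqs.append(F0_HZ + bin_index * DF_HZ)
    return freqs


def bytes_from_frequencies(freqs):
    """Invert symbol_frequencies for exact (noise-free) frequencies."""
    chunks = []
    for tone_index, freq in enumerate(freqs):
        bin_index = round((freq - F0_HZ) / DF_HZ)
        chunks.append(bin_index - tone_index * BINS_PER_TONE)
    return bytes(chunks[k] | (chunks[k + 1] << 4) for k in range(0, 6, 2))


tones = symbol_frequencies(b"Hi!")
```

Because every tone stays inside its own group of frequencies, the receiver always knows which 4-bit chunk a detected tone belongs to, and the highest possible tone sits at F0 + 95 * dF, just below the top of the 4.5 kHz range.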

The data rate ranges from 8 to 16 bytes per second depending on the protocol parameters chosen. This is slow by network standards but adequate for the use cases data-over-sound actually serves: short payloads, identifiers, encryption keys, command tokens, configuration strings.
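A quick calculation shows what 8 to 16 bytes per second means in practice. The sketch counts raw payload airtime only; the start and end markers and the error-correction overhead are deliberately ignored, so real transmissions take somewhat longer.

```python
def transfer_seconds(payload_bytes, bytes_per_second):
    """Raw airtime for a payload, ignoring markers and error-correction overhead."""
    return payload_bytes / bytes_per_second


for label, size in [("16-byte key", 16), ("1 MiB file", 1024 * 1024)]:
    slow = transfer_seconds(size, 8)    # slowest rate quoted above
    fast = transfer_seconds(size, 16)   # fastest rate quoted above
    print(f"{label}: {fast:.0f} to {slow:.0f} seconds")
```

A 16-byte key fits in one to two seconds of audio, while a megabyte would take on the order of a day, which is why the channel is used for keys and tokens rather than for files.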

A few details are worth understanding because they explain a lot of the practical behavior:

Why 96 frequencies. Subdividing the available bandwidth into many bins increases the information density per symbol. A 4-bit chunk has 16 possible values, so giving each of the six tones its own set of 16 candidate frequencies accounts for 6 × 16 = 96. Six tones sounding at once carry 6 × 4 = 24 bits, or three bytes, per symbol period - far more efficient than sending one frequency at a time.

Why Reed-Solomon. Reed-Solomon codes are a class of error-correcting codes well-suited to burst errors, which is what acoustic channels actually produce. A momentary noise spike or a brief frequency dropout will corrupt several consecutive symbols, not random scattered bits. Reed-Solomon recovers gracefully from this pattern of corruption.

Why audible vs ultrasonic protocols. Audible-range transmission has the highest reliability across diverse hardware because every consumer speaker and microphone reproduces 1.8-6.4 kHz competently. Ultrasonic transmission is preferred when audibility is undesirable but requires hardware that performs well above 15 kHz - which not all consumer devices do, particularly older laptop speakers.

The receiving side performs a Fast Fourier Transform on the captured audio to detect dominant frequencies in each symbol period, reconstructs the original 4-bit chunks, applies Reed-Solomon decoding to correct errors, and outputs the original payload.
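The frequency-detection step can be illustrated with a small self-contained sketch. Several things here are assumptions: a 1024-sample symbol at 48 kHz (chosen because 48000 / 1024 gives exactly the 46.875 Hz spacing), one group of 16 adjacent bins per tone, and a direct DFT over the candidate bins in place of a full FFT. Reed-Solomon decoding and the start and end markers are left out.

```python
import math

SAMPLE_RATE = 48_000
SAMPLES_PER_SYMBOL = 1024                  # assumed frame size
DF_HZ = SAMPLE_RATE / SAMPLES_PER_SYMBOL   # 46.875 Hz per bin
F0_BIN = 40                                # 40 * 46.875 Hz = 1875 Hz


def synthesize(chunks):
    """Sum six sine tones; tone i uses bins F0_BIN + 16*i .. F0_BIN + 16*i + 15."""
    freqs = [(F0_BIN + 16 * i + c) * DF_HZ for i, c in enumerate(chunks)]
    return [sum(math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f in freqs)
            for n in range(SAMPLES_PER_SYMBOL)]


def bin_energy(samples, k):
    """Squared magnitude of DFT bin k, computed directly rather than by FFT."""
    re = im = 0.0
    for n, s in enumerate(samples):
        angle = 2 * math.pi * k * n / len(samples)
        re += s * math.cos(angle)
        im -= s * math.sin(angle)
    return re * re + im * im


def detect_chunks(samples):
    """For each of the six tones, pick the strongest of its 16 candidate bins."""
    chunks = []
    for i in range(6):
        first = F0_BIN + 16 * i
        energies = [bin_energy(samples, first + v) for v in range(16)]
        chunks.append(energies.index(max(energies)))
    return chunks


sent = [3, 0, 15, 7, 9, 12]               # six 4-bit chunks = three bytes
received = detect_chunks(synthesize(sent))
```

On a clean synthetic signal the strongest bin in each group is unambiguous. On real audio the peaks are smeared by noise, echo, and the hardware's frequency response, which is where the error-correction stage earns its keep.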

What's actually being built with this

Despite its narrow throughput, data-over-sound has accumulated a meaningful set of production deployments.

Chirp.io was founded in 2011 in the UK and built an SDK for embedding acoustic data transmission into mobile apps. Their use cases ranged from retail loyalty programs to broadcast content interaction (a TV ad emits a chirp; phones in the room receive a coupon). Chirp was acquired by Sonos in 2020, where the technology was integrated into Sonos's product setup flow.^[6]

ToneTag is an Indian company that has built an acoustic payments network. Users tap their phone to a payment terminal - but instead of NFC, the phone receives a payment token via ultrasonic chirp. ToneTag has raised over $39 million and reportedly processes billions of dollars in transaction volume monthly, with backing from Amazon and partnerships with major Indian banks.^[7]

Lisnr is a US-based company building "ultrasonic data over sound" infrastructure for proximity-based use cases, particularly in automotive (key handoff, parking systems) and contactless payments.

ggwave itself has applications across hobbyist IoT projects, "talking buttons" for embedded devices, audio QR codes for cross-device clipboard sharing, device pairing without Bluetooth, and even the "GibberLink" project - a demonstration where AI voice assistants, after detecting each other, switch from natural language to ggwave-encoded audio for higher-throughput machine-to-machine communication.^[8]

chirpfile is the only production application we're aware of that uses acoustic transmission specifically for cryptographic key delivery in a file transfer context. The architecture is documented in our security model and whitepaper.

A 2019 study published in Personal and Ubiquitous Computing by Pering et al. compared acoustic data transmission against QR codes and Bluetooth for sharing contact information between mobile devices. The acoustic transmission completed transactions in a mean of 2.4 seconds, versus 8.3 seconds for Bluetooth. The advantage was largely in the absence of pairing - both devices simply needed to be within audible range of each other, which removed the discovery and authentication overhead that dominates Bluetooth interactions.^[9]

This is the actual value proposition of data-over-sound in 2026: not raw speed, but the absence of pairing, network setup, and ecosystem dependency.

Reliability and the hardware problem

Acoustic data transmission has one engineering challenge that doesn't exist on the same scale in radio-based protocols: hardware variance.

Every speaker and every microphone has its own frequency response curve. A high-end studio condenser microphone captures 20 Hz to 20 kHz almost flat. A cheap laptop speaker drops off above 8 kHz and has resonant peaks in the midrange. A phone earpiece is heavily filtered for the human voice frequency range. This means the same FSK signal, played through different speakers and captured by different microphones, can produce wildly different results.

In practice, this manifests as protocol selection. A robust acoustic transmission system has to either:

Test the hardware combination first and select an appropriate protocol
Default to a frequency range that works on the lowest-common-denominator hardware (typically the audible 1.8-6.4 kHz range)
Allow the user to escalate to a more reliable mode (lower frequency, slower data rate) if the default fails

ggwave addresses this by offering multiple protocol modes - different combinations of frequency band, data rate, and error correction strength. Implementations can fall back from ultrasonic to audible when the receiving hardware can't decode the higher frequencies.
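The fallback logic can be sketched independently of any audio code. Everything named here is hypothetical: the mode names are placeholders rather than ggwave identifiers, and `try_receive` stands in for whatever function an application uses to attempt a decode in a given mode.

```python
from typing import Callable, Optional, Sequence, Tuple


def receive_with_fallback(
    modes: Sequence[str],
    try_receive: Callable[[str], Optional[bytes]],
) -> Optional[Tuple[str, bytes]]:
    """Try each mode in order; return (mode, payload) for the first that decodes."""
    for mode in modes:
        payload = try_receive(mode)
        if payload is not None:
            return mode, payload
    return None


# Stand-in for real hardware: this hypothetical device cannot decode ultrasonic modes.
def fake_receiver(mode: str) -> Optional[bytes]:
    return None if mode.startswith("ultrasonic") else b"token"


result = receive_with_fallback(
    ["ultrasonic-fast", "audible-fast", "audible-robust"],
    fake_receiver,
)
```

Ordering the list from least intrusive to most robust means users hear an audible chirp only when the quieter option has already failed.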

Distance also matters. Most acoustic data protocols are reliable within 1-3 meters in typical office or home environments. Beyond that, the signal-to-noise ratio degrades quickly. This is generally a feature rather than a bug - it means transmissions are inherently localized - but it imposes a hard constraint on the use cases data-over-sound can serve.
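A first-order way to see how quickly the signal fades is the free-field, point-source model, in which sound level falls by about 6 dB for every doubling of distance. The sketch below assumes that idealized model and nothing else: it ignores reflections, air absorption, and the directionality of the speaker, so real rooms will differ.

```python
import math


def spreading_loss_db(distance_m, reference_m=1.0):
    """Level drop relative to the reference distance for a free-field point source."""
    return 20 * math.log10(distance_m / reference_m)


for distance in (1, 2, 4, 8):
    print(f"{distance} m: {spreading_loss_db(distance):.1f} dB below the 1 m level")
```

Since the room's noise floor does not fall with distance, every decibel lost to spreading comes directly out of the signal-to-noise ratio at the receiver.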

Background noise is the other variable. Speech and music below 4 kHz interfere with audible protocols. HVAC systems and fluorescent lights can produce noise in specific frequency ranges. Ultrasonic protocols are largely immune to human-audible interference but vulnerable to specific industrial noise sources.

The honest summary: acoustic data transmission is reliable enough for short, error-correctable payloads at close range. It is not reliable enough - and likely never will be - for streaming media or large file transfers.

The security properties

The security applications of acoustic data transmission are where things get genuinely interesting, and where the field has moved most in the last few years.

A radio-based communication channel - Bluetooth, WiFi, NFC - is vulnerable to interception by any antenna in range, where "range" is often much larger than the user expects. Bluetooth transmissions can be captured at tens of meters with the right equipment. WiFi is captured by anyone on the same network. Even NFC, which is designed for close range, can be intercepted with specialized hardware at greater distances than its nominal "tap-to-pay" use suggests.

An acoustic transmission has a fundamentally different physical envelope. Sound waves attenuate quickly with distance - roughly 6 dB per doubling of distance in open air, faster in cluttered environments. Walls and doors substantially block acoustic signals. Within a closed room, an acoustic transmission is essentially confined to that room.

This creates a security primitive that radio-based channels cannot replicate: physical presence as a cryptographic property.

If a piece of secret data - say, an encryption key - is delivered acoustically, then the only devices that can have received that key are devices that were physically in the room when it was transmitted. This is not a policy claim. It is not a vendor promise. It is a property of the physics of sound propagation.

The implication for cryptographic protocols is significant. Most "end-to-end encrypted" services rely on key exchange over the same network that carries the data. If the network is compromised, both the data and the key potentially are. If keys travel exclusively through an acoustic channel and data travels through an internet relay, then a network attacker - even one with full access to the relay server - has only encrypted ciphertext and no path to the key. The security model degrades gracefully under server compromise, which is rare in conventional architectures.

This is the architecture chirpfile uses. The detailed threat model is documented in the chirpfile security guide, but the high-level shape is:

Data is encrypted client-side using a fresh AES-128-GCM key.
The encrypted ciphertext is uploaded to a relay server. The server cannot decrypt it.
The decryption key is encoded as an FSK audio signal via ggwave and played through the sender's speaker.
The receiving device, in the same room, captures the audio via its microphone and decodes the key.
The receiving device fetches the ciphertext from the relay and decrypts it locally.
The relay deletes the ciphertext after first download.
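The relay's role in the steps above can be modeled in a few lines. This is an in-memory stand-in under stated assumptions, not chirpfile's actual server: the class name and identifier scheme are hypothetical, and the ciphertext is a placeholder because real AES-GCM encryption and the acoustic key transfer are out of scope for the sketch.

```python
import secrets


class BurnAfterReadRelay:
    """In-memory stand-in for a relay that stores opaque ciphertext only."""

    def __init__(self):
        self._blobs = {}

    def upload(self, ciphertext: bytes) -> str:
        blob_id = secrets.token_urlsafe(16)   # hypothetical identifier scheme
        self._blobs[blob_id] = ciphertext
        return blob_id

    def download(self, blob_id: str) -> bytes:
        # pop() removes the blob, so a second download raises KeyError
        return self._blobs.pop(blob_id)


key = secrets.token_bytes(16)             # 128-bit key; never handed to the relay
ciphertext = b"<opaque encrypted bytes>"  # placeholder for the encrypted file
relay = BurnAfterReadRelay()
blob_id = relay.upload(ciphertext)        # the relay stores ciphertext only
received = relay.download(blob_id)        # first download succeeds and deletes the blob
```

The point of the model is what the relay never holds: `key` does not appear in any call to it, so its stored state contains nothing that can decrypt the blob.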

The result is a file transfer system where decrypting the file requires having captured the chirp, which in practice means physical presence in the same room. No amount of remote access, network privilege, or relay server access enables decryption, because the key was never on the relay or the network in any form.

The honest threat model includes some edge cases worth naming:

Acoustic interception is theoretically possible. A microphone in the room - a smart speaker, a phone left on the table, a hidden device - could capture the chirp. The attack surface is real but narrow: the attacker needs physical presence in the room at the moment of transmission. This is a much smaller surface than network-based attacks.

Browser memory is a transient attack surface. The key exists briefly in browser memory during decryption. A malicious browser extension with appropriate permissions could capture it. This is true of any browser-based cryptographic system and is not specific to acoustic delivery.

Replay is constrained by burn-after-read. chirpfile's relay deletes the ciphertext on first successful download. Even if a chirp is recorded and played back later, the encrypted blob is no longer available to decrypt.

Server compromise leaks nothing decryptable. The strongest property of the architecture: a full breach of the chirpfile relay server reveals only encrypted blobs whose keys never passed through it. This is a stronger guarantee than most "secure" file transfer services can make.

What acoustic data transmission isn't suited for

Being clear about what doesn't work is as important as describing what does.

Acoustic data transmission is not suitable for:

High-throughput data transfer. At 8-16 bytes per second, transmitting any meaningful payload takes prohibitive time. Acoustic channels are for short payloads - keys, identifiers, tokens, short messages.
Long-range communication. Beyond a few meters, signal degradation makes reliable decoding impractical without specialized acoustic hardware.
Quiet environments where audibility is undesirable. Audible protocols are not subtle. Ultrasonic protocols are subtle but require capable hardware.
Environments with specific acoustic interference. Industrial settings with high-frequency mechanical noise, very loud public spaces, or environments with strong echo can defeat reliable transmission.
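Battery-constrained microcontrollers receiving ultrasonic signals. Capturing tones up to roughly 19.5 kHz requires sampling at more than twice that frequency (48 kHz in practice), and running frequency analysis on that stream in real time is more than many low-power microcontrollers can manage.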

The right framing is that acoustic data transmission is a specialized channel for a specific class of use cases: short payloads, in-room delivery, no pairing or network setup required, and - when it matters - physical presence as a security guarantee. It is not a general-purpose communication medium and shouldn't be marketed as one.

A note on the future

A few trajectories are worth tracking.

AI-to-AI communication. The GibberLink demonstration showed that AI voice assistants, once they detect they're talking to another AI, can switch to ggwave-encoded audio for higher throughput than natural language. This is gimmicky in 2026 but plausibly meaningful as agent-to-agent communication becomes more common.

Improved ultrasonic hardware. Newer phones and laptops have better speaker and microphone performance in the 17-22 kHz range than devices from five years ago. As this trend continues, ultrasonic protocols become more reliable across more hardware, expanding the practical use case envelope.

Quantum-resistant cryptography in acoustic channels. Post-quantum key exchange protocols generally require larger key sizes, which means longer acoustic transmissions. The acoustic channel may become the limiting factor for proximity-based key exchange in a post-quantum world. This is a research area actively under exploration.

Regulatory attention. Acoustic tracking - using ultrasonic signals embedded in TV ads or in-store audio to identify devices in proximity - is a real privacy concern that has drawn FTC and EU regulator attention. The general public's awareness of "data over sound" is largely shaped by this negative use case rather than the legitimate ones, which is worth being honest about.

Conclusion

Acoustic data transmission has been around in some form since 1963. It's not a trend, it's not new, and it's not going away - but it has gone through a real renaissance in the last decade, driven by the convergence of three factors: cheap consumer hardware that can reliably reproduce a useful frequency range, robust open-source protocols like ggwave, and security applications where physical presence as a cryptographic property is genuinely valuable.

The throughput will never compete with WiFi, Bluetooth, or NFC. That's not the point. The value of acoustic data transmission is everything the channel doesn't require: no pairing, no network setup, no ecosystem dependency, and, uniquely, no possibility of remote interception when the room is the perimeter.

For a small set of well-defined applications - proximity payments, device pairing, cryptographic key delivery - these properties are not just useful. They are uniquely available through this channel and no other.