From Prototype to Daily Tool: Building SoundsU
Author: Mitravasu Prakash
Discord's built-in soundboard is useful, but it did not fit how my friends and I used sounds in voice chat.
Our server treated sounds like a shared archive. We saved lines from shows, movies, games, and random moments that had become part of our calls.
The problem was the limit. Discord gave us 8 soundboard slots, so every new sound meant deleting an old one.
That limit made sense for a built-in feature. Storage has to be managed somehow. But it raised a useful question: if storage was the constraint, what if I hosted the files myself?
Discord does not let you plug custom storage into its native soundboard, so I built the interaction around a Discord bot.
The bot could join a voice channel, wait for commands, and play audio from the machine running it.
That became SoundsU.
This project started as a small workaround, but it eventually became a useful lesson in right-sized architecture. The interesting part was not building a large system. It was taking a one-file prototype bot and slowly turning it into something reliable, usable, and easier to extend without designing for imaginary scale.
Starting With the Smallest Useful Version
The first version was a proof of concept.
The core flow was simple:
- Users uploaded mp3 files through Discord.
- The bot stored those files locally.
- Users ran commands to play sounds.
- The bot joined the voice channel and played the requested audio.
I used discord.py for Discord integration. PyNaCl and ffmpeg handled voice playback.
Uploads worked through a message with a command and an attached file. Validation stayed narrow: only mp3 files were accepted, and Discord's attachment limit prevented oversized uploads.
The metadata model was also minimal. Each sound was an mp3 file. The filename was the sound name.
For the first version, that was enough.
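A minimal sketch of that rule, assuming the upload arrives as a plain filename string. The function name `sound_name_from_upload` is hypothetical and not taken from the bot's code; it only illustrates "mp3 only, filename is the sound name".

```python
from pathlib import PurePath


def sound_name_from_upload(filename: str) -> str:
    """Return the sound name for an uploaded file, or raise ValueError.

    Hypothetical helper: the only check is the .mp3 extension, and the
    sound name is the filename without that extension.
    """
    path = PurePath(filename)
    if path.suffix.lower() != ".mp3":
        raise ValueError(f"only mp3 files are accepted, got {filename!r}")
    # The stem is the filename minus its final extension.
    return path.stem
```

For example, `sound_name_from_upload("airhorn.mp3")` returns `"airhorn"`, while a `.wav` file raises `ValueError`.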
The server had 7 people, and the bot only served one Discord server. There was no need for multi-server isolation, a database, or a complex permissions system.
Version 1 supported:
- uploading sounds
- playing sounds
- joining and leaving voice channels
- sleep kickouts, which disconnected everyone from voice at a configured time
- GIF-triggered playback
The GIF trigger came from a server suggestion.
Instead of requiring every sound to use a command, the bot could detect GIFs with matching names. If someone sent a GIF whose name matched a sound, the bot stripped the GIF formatting and played the audio.
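The matching step can be sketched roughly as follows. This assumes the GIF shows up in the message as a URL or filename ending in `.gif`, which may not match how every GIF provider formats its links; `sound_for_gif` and its parameters are hypothetical names for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def sound_for_gif(message_text: str, sound_names: set[str]) -> Optional[str]:
    """Return the sound whose name matches the GIF in the message, if any."""
    text = message_text.strip()
    # Take the last path segment, whether the text is a URL or a bare filename.
    last_segment = urlparse(text).path.rsplit("/", 1)[-1]
    if not last_segment.lower().endswith(".gif"):
        return None
    # Strip the ".gif" suffix and compare against known sound names.
    candidate = last_segment[: -len(".gif")].lower()
    return candidate if candidate in sound_names else None
```

A message that is not a GIF, or a GIF with no matching sound, returns `None`, so the bot would simply do nothing.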
It was a small feature, but it matched how people actually used Discord. Not every interaction needed to be a formal command.
That became an early theme of the project: the best features were usually the ones that fit naturally into existing server behavior.
Version 1 Architecture
The first version was built to prove the idea, not to become a long-term system.
Everything lived in one file: connect.py.
The file was only around 200 lines, but it owned too many responsibilities:
- startup and shutdown
- Discord client setup
- command handling
- file reading
- upload validation
- sound playback
- voice channel behavior
- sleep logic
The bot worked, but the structure still felt like a prototype.
Adding features meant touching the same file repeatedly. Unrelated areas of logic were starting to blur together.
At this stage, it was acceptable. The goal was to learn whether the idea worked at all. A cleaner architecture would not have mattered if nobody wanted to use the bot.
The mistake would have been treating that prototype as the final design.
When the Prototype Became Worth Maintaining
The bot became worth maintaining about a week after Version 1 worked.
At first, I ran it manually on my PC before joining voice chat. That was fine for testing, but the usage pattern changed quickly.
People were using it every day. The bot had moved from a quick experiment to something the server expected to be available.
That changed the priorities.
The bot no longer just needed to work when I was testing it. It needed to stay online, be easier to use, and be less painful to change.
I found an old Raspberry Pi, installed Ubuntu Server, and hosted the bot there.
I was not sure whether 1 GB of RAM and an SD card would be enough, but after monitoring usage, the bot barely consumed any resources. The Raspberry Pi ended up running reliably for 6 months.
The weak point was not performance. It was durability.
At some point, the Pi was unplugged while being moved, and the next boot led to a blank screen. The OS install seemed to have corrupted.
The sounds were recoverable, but not automatically.
I had to search our Discord channel history, redownload the uploaded files, and restore them by hand.
That outage clarified the next priorities:
- Rebuild the bot with cleaner boundaries.
- Move it to a more stable machine.
That became Version 2.
Version 2: Making It Feel Native to Discord
For Version 2, I was not trying to turn the bot into a large platform.
The goal was to make it easier to use and easier to maintain.
The biggest UX change was moving from custom text commands to Discord's slash command interface.
In Version 1, discoverability depended on the help command. Users could ask what commands existed, but they still had to remember names, arguments, and sound names.
In Version 2, typing / showed the available commands, descriptions, required fields, and optional fields directly in Discord.
The interaction changed from remembering bot syntax to following Discord's own command UI.
Autocomplete made the biggest difference.
As the sound library grew, users needed a better way to find sounds. In Version 1, that meant calling a list command or remembering names.
In Version 2, sound arguments used autocomplete. As the user typed, the bot returned the top 25 matching sound names.
The search was deliberately simple: case-insensitive prefix matching.
That was enough because the names were short and predictable. We defaulted to lowercase names using snake case or kebab case.
Examples:
vine-boom
emotional_damage
airhorn
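Case-insensitive prefix matching with a cap of 25 results can be sketched in a few lines. The function name and the alphabetical ordering are assumptions; the cap reflects Discord's limit of 25 autocomplete choices.

```python
def autocomplete_sounds(
    typed: str, sound_names: list[str], limit: int = 25
) -> list[str]:
    """Return up to `limit` sound names starting with what the user typed."""
    prefix = typed.casefold()
    # Sort case-insensitively so results are stable (an assumed ordering).
    ordered = sorted(sound_names, key=str.casefold)
    matches = [name for name in ordered if name.casefold().startswith(prefix)]
    return matches[:limit]
```

With the three names above, typing `VI` would return only `vine-boom`, and an empty input would return all three in alphabetical order.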
The same autocomplete-backed list could be reused outside the play command.
Sleep mode, for example, used it when selecting which sound should play before disconnecting everyone.
That made the sound library feel like a shared source of truth across the bot.
Version 2 also added a few Discord UI components where they helped. The bot did not need a complicated interface, but structured output made some commands easier to scan.
I added two small card-style UI components:
- Sleep Info Card, which showed sleep settings in a clearer hierarchy.
- Sounds Card, which displayed available sounds in a more distinct format.
I considered pagination for the sound list, but it was not necessary.
Once autocomplete existed, users rarely needed to browse the full list. The main interaction became searching from the command field.
That was a useful lesson: the best UI improvement was not making a better list. It was making the list less necessary.
Version 2 Architecture
The Version 2 refactor split the bot into clearer components.
At the top level:
- Client handled Discord initialization, intents, command registration, and event handlers.
- API exposed bot behavior to external experiments.
Commands were grouped by domain:
- Sounds handled listing, playing, uploading, and deleting sounds.
- Sleep handled sleep mode, sleep scheduling, and sleep sound configuration.
- Voice Channel handled joining, leaving, and kicking users from voice channels.
- Owner handled admin commands, mainly command tree syncing after deployment.
The most important internal change was moving sound access behind SoundManager.
In Version 1, sound-related behavior was mixed into the same file as Discord setup, command handling, and voice playback. In Version 2, commands could ask SoundManager for available sounds without needing to know how files were stored.
That made new commands easier to add.
A new domain of behavior could live in its own command module and be registered through bot initialization. If it needed sound information, it could use SoundManager instead of reading the filesystem directly.
The codebase became easier to extend, but the storage model stayed intentionally simple.
Keeping Storage Simple on Purpose
Version 2 was not meant to turn the bot into a public multi-tenant platform.
It was still designed for one Discord server with seven users.
That shaped the storage decision. Each server did not need its own sound library, configuration store, or database schema. The sound library could remain a folder of mp3 files, with filenames as the main metadata.
SoundManager gave that simple storage model a cleaner interface.
On startup, it scanned the mp3 folder and loaded the available sound names into memory. Play commands used that list.
When a sound was uploaded or deleted, the bot rescanned the folder.
At this scale, rescanning was fast enough. There were only dozens of sounds. A more complex cache or index would not have made the bot meaningfully faster. It would have added implementation cost without solving a real problem.
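A minimal sketch of that scan-and-rescan idea, assuming a flat folder of `.mp3` files. The method names (`rescan`, `names`, `path_for`) are hypothetical and may differ from the real SoundManager.

```python
from pathlib import Path


class SoundManager:
    """Keeps an in-memory list of sound names backed by a folder of mp3s."""

    def __init__(self, folder):
        self._folder = Path(folder)
        self._names: list[str] = []
        self.rescan()  # load names once at startup

    def rescan(self) -> None:
        """Re-read the folder; called after an upload or delete."""
        self._names = sorted(p.stem for p in self._folder.glob("*.mp3"))

    def names(self) -> list[str]:
        """Return a copy so callers cannot mutate the cached list."""
        return list(self._names)

    def path_for(self, name: str) -> Path:
        """Map a sound name back to its file, or raise KeyError."""
        if name not in self._names:
            raise KeyError(name)
        return self._folder / f"{name}.mp3"
```

Callers only see names and paths, so a command that needs the list of sounds never has to touch the filesystem itself.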
I also kept validation close to the feature that needed it.
Upload-specific checks, like requiring mp3 files, stayed in the upload command instead of moving into SoundManager.
That separation made the design easier to reason about. Sleep commands needed to query available sounds, but they did not need to know how uploads were validated.
The refactor was not about adding infrastructure. It was about separating responsibilities where Version 1 had started to blur them.
Features Shaped by the Server
Some features were specific to how our server behaved.
Sleep mode was global for the server. It was not meant to be a per-user preference system.
The idea was shared: if sleep mode was on, the call should end around the configured time. If people needed to keep working or hanging out, they could rejoin.
The default sleep time was midnight.
Sleep mode was off by default, but midnight was a good starting point when enabled. Users could also configure the sleep sound through autocomplete.
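Scheduling around a configured time mostly comes down to finding the next occurrence of that time. A rough sketch, assuming naive local datetimes and a hypothetical helper name `next_sleep_at`:

```python
from datetime import datetime, time, timedelta


def next_sleep_at(now: datetime, sleep_time: time = time(0, 0)) -> datetime:
    """Return the next datetime at which sleep mode should trigger.

    Defaults to midnight. If today's sleep time has already passed
    (or is exactly now), the next occurrence is tomorrow.
    """
    candidate = datetime.combine(now.date(), sleep_time)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

A scheduler could then wait for `(next_sleep_at(now) - now).total_seconds()` before playing the sleep sound and disconnecting the call.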
Welcome sounds followed the same approach.
Instead of building personalized join sounds, I kept one default welcome sound that played whenever someone joined the voice channel.
There were occasional edge cases, like someone having connection issues and rejoining repeatedly.
The existing stop command was enough for those moments. I did not add cooldowns or extra guardrails because the feature was manageable at the scale of the server.
This was another place where keeping the scope small helped.
A public bot might need per-server settings, per-user preferences, abuse prevention, and cooldown rules. Our bot needed to work well for one server where everyone understood the context.
Experimenting With Whisper
After Version 2, I experimented with a separate project called wehear4u.
The idea was to trigger sounds from live conversation without someone manually running a command.
The awkward part was audio capture.
wehear4u did not listen to Discord voice chat directly because I could not get Discord voice audio out of the bot cleanly. For testing, someone had to run the service on their own machine, where it listened to their microphone or local audio output, transcribed speech, detected sound names, and triggered playback through SoundsU.
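The flow looked like this:
- The service captured microphone or local audio on the tester's machine.
- It transcribed the speech.
- It detected sound names in the transcript.
- It triggered playback through SoundsU.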
The Version 2 architecture made this easier.
The core playback logic was no longer tangled inside one file. I could expose a REST API through FastAPI and let an external system trigger the bot.
I originally considered using an LLM to interpret conversation and decide which sounds were relevant.
For this use case, that was unnecessary. The sound names were already short and designed to be spoken.
Direct sound-name detection was simpler, deterministic, and responsive.
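One way to sketch direct detection is to normalize both the transcript and the sound names into a "spoken form" and look for whole-word matches. This is an illustrative approach under that assumption, not necessarily how wehear4u does it; both function names are hypothetical.

```python
import re


def _spoken_form(text: str) -> str:
    """Lowercase and turn punctuation, hyphens, and underscores into spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def detect_sounds(transcript: str, sound_names: list[str]) -> list[str]:
    """Return sound names whose spoken form appears as whole words."""
    padded = f" {_spoken_form(transcript)} "
    found = []
    for name in sound_names:
        spoken = _spoken_form(name)
        # Padding with spaces prevents matching inside longer words.
        if spoken and f" {spoken} " in padded:
            found.append(name)
    return found
```

Here `vine-boom` matches the spoken phrase "vine boom", while "airhorns" does not trigger `airhorn` because only whole words count.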
Audio was processed in chunks, with a default chunk size of around 5 seconds and pause detection to decide when to process speech.
Whisper was accurate enough for the experiment, and transcription itself was fast.
The main latency problem came from HTTP.
Opening a new HTTP connection to the SoundsU service added around two seconds before the request could be sent.
Using a persistent session with a keep-alive timeout reduced that delay and made playback feel near-instant.
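The idea of reusing one connection can be sketched with the standard library's `http.client`, used here as a stand-in for whatever HTTP client the project actually used. The `/play` path and JSON payload are hypothetical, as is the `SoundsUClient` name.

```python
import http.client
import json


class SoundsUClient:
    """Sends play requests over one reused HTTP/1.1 connection."""

    def __init__(self, host: str, port: int, timeout: float = 5.0):
        # The socket is opened lazily on the first request and then kept.
        self._conn = http.client.HTTPConnection(host, port, timeout=timeout)

    def play(self, sound_name: str) -> int:
        """POST a play request and return the HTTP status code."""
        body = json.dumps({"sound": sound_name}).encode("utf-8")
        headers = {"Content-Type": "application/json"}
        # HTTP/1.1 keeps the connection open by default between requests.
        self._conn.request("POST", "/play", body=body, headers=headers)
        response = self._conn.getresponse()
        response.read()  # drain the body so the connection can be reused
        return response.status

    def close(self) -> None:
        self._conn.close()
```

Only the first call pays the connection setup cost; later calls reuse the same socket as long as the server keeps it open.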
The experiment also exposed a different problem: some sound names were too common.
A word like okay could trigger playback too frequently.
I considered cooldowns, confidence thresholds, and per-sound trigger settings. Each guardrail had tradeoffs.
If someone wanted to trigger a sound at the exact moment of a joke, a cooldown could make the system feel broken.
Since this was an experiment, I kept it simple and responsive. The stop command was enough to halt unwanted playback quickly.
The Whisper experiment never became the main way we used the bot, but it showed the value of the Version 2 refactor.
Because the commands had clear boundaries, the REST API could expose parts of the bot to external experiments, letting new systems trigger existing bot actions instead of duplicating that logic.
Hosting and Reliability
The deployment evolved with usage.
Version 1 started as something I ran manually on my PC.
Once people were using it daily, I moved it to a Raspberry Pi running Ubuntu Server. That worked surprisingly well for half a year, even with only 1 GB of RAM and an SD card.
The weak point was reliability.
When the Raspberry Pi lost power, the OS install became corrupted and the bot went down. Moving from an SD-card Raspberry Pi to an old PC with an SSD removed the weakest part of the setup.
The old PC was not useful as a normal Windows machine anymore, but it was more than enough for hosting a Discord bot.
I had Docker support available, but I ended up running the bot directly on the server in a detached screen session.
During development, it was easier to start, stop, and inspect the bot through the same script I was already using.
Once that workflow was reliable, I kept it.
For this kind of single-server homelab deployment, attaching to a known screen session was simpler than managing container names and Docker commands.
The more important reliability improvement was at the machine level.
I configured a startup script through systemd, so if the machine restarted after a power outage, the bot could start without manual intervention.
For network outages, the bot stayed alive and reconnected once the network came back.
The current setup has been consistently up for months, with downtime mostly limited to intentional maintenance.
Debugging a Discord Voice Issue
One of the more interesting bugs happened when the bot started entering a connect/disconnect loop.
It would attempt to connect to a voice channel, disconnect, then retry with increasing delays: 1 second, 2 seconds, 5 seconds, 10 seconds, and so on.
That pattern made the issue look related to Discord voice connection handling rather than one specific command.
I SSH'd into the server and inspected the logs. The bot was hitting Discord's 4006 error (Session no longer valid).
The bot had been working for months without code changes, so I suspected the issue was not in my application code.
That did not completely rule out a local bug, but there was no recent change, no accumulating local state, and no obvious config drift.
The voice connection flow was handled through discord.py, so the likely explanations were:
- discord.py changed something, and my code needed to change.
- Discord changed something, and discord.py had not accounted for it yet.
I searched the discord.py GitHub issues for the error code and found reports that matched what I was seeing.
Discord had changed how voice server connections were handled, moving away from assumptions the library had made around voice server routing.
The fix had already been merged into the library but had not been released yet.
To unblock the bot, I cloned the discord.py repository with the fix and pointed uv to use that local clone instead of the public package.
Once the official release included the fix, I switched back.
This was a useful reminder that SDKs are not perfect abstractions. They wrap a platform at a point in time, and some of their assumptions can become outdated as that platform evolves.
It also made me appreciate the discord.py community. The issue had already been identified, discussed, and fixed in public, which made it much easier to diagnose and unblock the bot.
Current Impact
SoundsU is still small, but it has real usage.
The server has 7 people, including me, and the bot has been used daily since it was hosted persistently. The original Discord soundboard limit was 8 sounds. SoundsU currently stores around 30.
The biggest improvement from Version 1 to Version 2 was reduced interaction friction.
Before, users had to discover commands through the help command and remember command names, arguments, and sound names. After integrating with Discord's built-in command tree, commands became visible through Discord's own UI. Required fields, optional fields, descriptions, and autocomplete all appeared during usage.
The bot also became easier to extend.
Adding a feature became a matter of adding a module and registering it, rather than threading more logic into one central file.
It did not become a large platform, and it did not need to. It became a reliable tool that fit how our server already behaved.
Future Direction
The next improvement is upload flow.
Playback and discovery are already much smoother in Version 2, but adding a new sound still starts outside the bot. Users need to find or create an mp3 file before uploading it.
Version 3 would bring that flow into Discord.
Instead of uploading a prepared file, users could paste a YouTube link and let the bot handle the clip extraction and mp3 conversion.
That would make adding sounds feel less like managing files and more like saving a clip directly into the server's shared sound library.
Reflection
SoundsU started as a workaround for Discord's soundboard limit, but it became more interesting once people actually started relying on it.
The first version proved the idea was useful. It solved the immediate problem, fit naturally into our server, and showed that the bot was worth maintaining.
Version 2 was about making that maintenance easier. The goal was not to add complex infrastructure, but to set clearer boundaries around the parts of the bot that had started to blur together.
That made the system easier to extend. New commands, UI improvements, and external experiments could build on the existing behavior instead of working around it.
The bot is still a small personal project, but that is what made it useful. It was shaped around one server, real usage, and the problems that appeared over time.