There's a popular daily FM radio show in Greece that posts its episodes on SoundCloud after broadcasting them. It's a good (albeit not great; plain HTML5 audio would be fine) way to listen to the show on demand if you're on desktop. The website is not mobile friendly, though, and the whole embedded SoundCloud experience is sub-optimal. On top of that, you cannot just add the feed to your favorite podcast player and enjoy it.
There's an RSS feed on iTunes but it's manually updated and inevitably lags a day or two behind, depending on the availability of the maintainer.
I decided to fix the problem myself, and since the solution turned out to involve a bunch of interesting technologies, I thought I'd write a blog post about it. If you only care about the podcast you can find it here.
Step 1: Extracting content from SoundCloud
The episodes are embedded in the official website but are hidden in SoundCloud.
There's probably a hidden attribute you can set on SoundCloud media. That explains why my first attempt to download the episodes using SoundScrape failed, with the latter complaining that it can't find any videos.
Then I started examining SoundCloud's JS and the JSON responses sent when you click the play button, with the ultimate goal of writing a SoundCloud downloader. The service follows a typical flow: authenticate, then get a unique, auto-expiring link to S3. This can be automated, but it's not fun to do.
While taking a break from parsing JSON responses, it occurred to me that youtube-dl, despite its very specific name, supports other websites too, actually hundreds of them. Run youtube-dl against a URL with embedded SoundCloud audio and it will find and download the best version of the audio file, including the cover thumbnail!
All I need now is a simple Python script to extract all URLs with embedded SoundCloud audio and feed them to youtube-dl as a list.
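A minimal sketch of such a script follows. It is not the author's actual code: it assumes an episode page embeds the player in an `<iframe>` whose `src` points at soundcloud.com, that the pages have already been fetched into a dict, and that youtube-dl's `--batch-file` and `--write-thumbnail` options are the ones wanted. All function names are hypothetical.

```python
import subprocess
from html.parser import HTMLParser


class SoundCloudEmbedFinder(HTMLParser):
    """Set `found` when the page has an <iframe> pointing at SoundCloud."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. <iframe src>), hence the `or ""`.
        src = dict(attrs).get("src") or ""
        if tag == "iframe" and "soundcloud.com" in src:
            self.found = True


def has_soundcloud_embed(html):
    finder = SoundCloudEmbedFinder()
    finder.feed(html)
    return finder.found


def pages_with_embeds(pages):
    """`pages` maps page URL -> HTML; return the URLs that embed SoundCloud."""
    return [url for url, html in pages.items() if has_soundcloud_embed(html)]


def write_batch_file(urls, path):
    """Write one URL per line, the format youtube-dl expects for a batch file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


def download(batch_path):
    # Assumed invocation: --batch-file reads the URL list,
    # --write-thumbnail saves the cover image next to the audio.
    subprocess.run(
        ["youtube-dl", "--batch-file", batch_path, "--write-thumbnail"],
        check=True,
    )
```

In use, one would fetch the show's listing pages, pass them to `pages_with_embeds`, write the result with `write_batch_file`, and call `download` on that file.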
Step 2: Generate the Podcast RSS
With all the mp3 files for the show downloaded, the next step is to generate the podcast RSS. FeedGen is a simple, Pythonic library that builds RSS feeds, including extensions for podcast and iTunes attributes.
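To show what such a feed contains, here is a rough sketch that builds one with only the Python standard library. It deliberately does not use FeedGen, so it says nothing about that library's API. The function name, the base URL, and the choice to take the title from the filename and the date from the file's modification time are all assumptions made for illustration.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import quote

ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"
ET.register_namespace("itunes", ITUNES_NS)


def build_feed(audio_dir, base_url, title, description):
    """Return RSS 2.0 XML (bytes) with one <item> per .mp3 file in audio_dir."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = description
    ET.SubElement(channel, "{%s}explicit" % ITUNES_NS).text = "no"

    # Reverse filename order; this is newest-first only if names start with a date.
    for name in sorted(os.listdir(audio_dir), reverse=True):
        if not name.lower().endswith(".mp3"):
            continue
        stat = os.stat(os.path.join(audio_dir, name))
        # Percent-encode the filename so spaces and Unicode survive in the URL.
        url = base_url.rstrip("/") + "/" + quote(name)
        published = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)

        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = os.path.splitext(name)[0]
        ET.SubElement(item, "guid").text = url
        ET.SubElement(item, "pubDate").text = format_datetime(published)
        # The enclosure is what podcast players actually download.
        ET.SubElement(
            item,
            "enclosure",
            {"url": url, "length": str(stat.st_size), "type": "audio/mpeg"},
        )

    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)
```

Writing the returned bytes to a file in the directory the web server exposes is enough for a podcast player to subscribe.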
Step 3: Serve the Podcast RSS
I serve all my personal websites using Dokku running on my VPS. I used a Debian-based Docker image and installed Python 2 and the Python libraries needed for the feed generation. I also installed nginx-light to serve the content, both the RSS and the audio files.
I originally used the genRSS project to generate the RSS, which complained about the Unicode characters in the mp3 filenames when run from the Docker image. I fixed this by adding en_US.UTF-8 to the supported locales and running locale-gen on image build.
RUN sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \
    locale-gen
ENV LC_ALL en_US.UTF-8
The Docker image's default command runs nginx with a minimal nginx.conf.
Dokku takes care of everything else, including getting certificates from LetsEncrypt.
Step 4: Update the Feed
Cron runs a command to update the feed Monday to Friday, every 5 minutes from the moment the show ends until an hour later. The show producers are very consistent about uploading the show on time, so that seems to just work. To be on the safe side, I added another run two hours after the show ends.
The cron job runs on the host, using dokku run. The podcast and the audio files are stored in a Docker volume, so both the web serving process and the cron job can access this persistent storage at the same time.
youtube-dl is smart enough not to re-download content that already exists, so running the command multiple times does not hammer the servers.
Step 5: Monitoring
For an automation to be perfect it must be monitored. As with all my websites, I set up a NewRelic Synthetics monitor which checks that the feed is being served and that its content appears valid by looking for the text "pubDate".
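The check itself is simple enough to sketch in Python. This is only an illustration of the idea, not how NewRelic Synthetics is configured; the function names and the timeout are assumptions.

```python
import urllib.request


def feed_looks_valid(status, body):
    """True when the response is a 200 and mentions at least one pubDate."""
    return status == 200 and "pubDate" in body


def check_feed(url, timeout=10):
    # urlopen raises an exception on HTTP errors and timeouts, which a
    # monitoring job can treat as a failed check too.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return feed_looks_valid(response.status, body)
```

Looking for "pubDate" is a cheap proxy for validity: an empty or broken feed is unlikely to contain it, while any feed with at least one episode will.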
To monitor the cron job, I cURL a healthchecks.io-provided URL at the very end of the bash script that coordinates the fetching and building of the feed. Make sure to set -e in your bash scripts so they exit after the first failed command. Not setting it will always call cURL, even if a step fails.
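The same "ping only on success" pattern can be expressed in Python. This is a sketch of the idea rather than the bash script described above; the function names are hypothetical and the ping URL is whatever healthchecks.io provides for the check.

```python
import subprocess
import sys
import urllib.request


def ping(url, timeout=10):
    """Signal success to the monitoring service with a plain GET."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_pipeline(steps, ping_url, ping_func=ping):
    """Run each command in order; ping only if all of them succeed."""
    for command in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # which aborts the loop much like `set -e` aborts a bash script.
        subprocess.run(command, check=True)
    return ping_func(ping_url)
```

If any step fails, the exception propagates before the ping is sent, the monitoring service never hears from the job, and it raises an alert once the expected ping is overdue.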
Fun fact: this is the second time I've built a podcast for this show. The first one was around 2008.
- Podcast URL https://ellinofreneia.sealabs.net
- Show's Website https://www.ellinofreneianet.gr
- GitHub repo