Automating Podcast generation from SoundCloud

There's this popular daily FM radio show in Greece which posts its shows on SoundCloud after broadcasting them. It's a decent - albeit not great; plain HTML5 audio would be fine - way to listen to the show on demand if you're on a desktop. The website is not mobile friendly and the whole embedded SoundCloud experience is sub-optimal, not to mention that you cannot just add the feed to your favorite podcast player to enjoy it.

There's an RSS feed on iTunes but it's manually updated and inevitably lags a day or two behind, depending on the availability of the maintainer.

I decided to fix the problem myself, and since the solution turned out to involve a bunch of interesting technologies I thought I'd write a blog post about it. If you only care about the podcast, you can find it here.

Step 1: Extracting content from SoundCloud

The episodes are embedded in the official website but are hidden on SoundCloud itself; probably there's a hidden attribute you can set on SoundCloud media. That explains why my first attempt to download the episodes using SoundScrape failed, with the latter complaining that it couldn't find any videos.

Then I started examining the JS and JSON responses SoundCloud sends when you click the play button, with the ultimate goal of writing a SoundCloud downloader. The service follows a typical authenticate-then-get-a-unique-auto-expiring-S3-link flow, which can be automated but isn't fun to do.

While taking a break from parsing JSON responses it occurred to me that youtube-dl, despite its very specific name, supports other websites too - hundreds of them, actually. Run youtube-dl against a URL with embedded SoundCloud audio and it will find and download the best version of the audio file, including the cover thumbnail!

All I need now is a simple Python script to extract all the URLs with embedded SoundCloud audio and feed them to youtube-dl as a list using the --batch-file argument.
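The script boils down to something like the following sketch - the archive URL and the CSS selector are placeholders, since the real ones depend on the show's markup:

import subprocess

import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = "https://radio.example.gr/show/archive"  # placeholder, not the real site


def collect_episode_urls():
    html = requests.get(ARCHIVE_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Keep the links that point to pages with an embedded SoundCloud player;
    # the selector is hypothetical and the hrefs are assumed to be absolute.
    return [a["href"] for a in soup.select("a.episode-link")]


def main():
    with open("episodes.txt", "w") as fh:
        fh.write("\n".join(collect_episode_urls()))
    # youtube-dl's generic extractor finds the embedded SoundCloud audio on each page.
    subprocess.check_call([
        "youtube-dl",
        "--batch-file", "episodes.txt",
        "--write-thumbnail",
        "-o", "episodes/%(title)s.%(ext)s",
    ])


if __name__ == "__main__":
    main()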

Step 2: Generate the Podcast RSS

With all the mp3 files for the show downloaded, the next step is to generate the podcast RSS. FeedGen is a simple, pythonic library that builds RSS feeds, including extensions for podcasts and iTunes attributes.
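A stripped-down version of the generation step looks roughly like this - the title, description, URLs and paths are placeholders rather than the real show's metadata:

import os

from feedgen.feed import FeedGenerator

AUDIO_DIR = "episodes"                    # where youtube-dl saved the mp3s
BASE_URL = "https://podcast.example.com"  # placeholder serving domain

fg = FeedGenerator()
fg.load_extension("podcast")              # enables the iTunes/podcast tags
fg.title("My Favorite Radio Show")
fg.link(href=BASE_URL, rel="alternate")
fg.description("Unofficial feed generated from the show's SoundCloud uploads.")
fg.podcast.itunes_category("News & Politics")

for name in sorted(os.listdir(AUDIO_DIR)):
    if not name.endswith(".mp3"):
        continue
    path = os.path.join(AUDIO_DIR, name)
    url = "%s/%s/%s" % (BASE_URL, AUDIO_DIR, name)  # real code should URL-encode the filename
    fe = fg.add_entry()
    fe.id(url)
    fe.title(os.path.splitext(name)[0])
    fe.enclosure(url, str(os.path.getsize(path)), "audio/mpeg")

fg.rss_file("podcast.xml", pretty=True)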

Step 3: Serve the Podcast RSS

I serve all my personal websites using Dokku running on my VPS. I used a Debian-based Docker image and installed Python 2 and the Python libraries needed for the feed generation. I also installed nginx-light to serve the content, both the RSS and the audio files.

I originally used the genRSS project to generate the RSS, which complained about the Unicode characters in the mp3 filenames when run from the Docker image. I fixed this by adding en_US.UTF-8 to the supported locales and running locale-gen on image build:

RUN sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \
    locale-gen
ENV LC_ALL en_US.UTF-8

The Docker image's default command runs nginx with a minimal nginx.conf.

Dokku takes care of everything else, including getting certificates from Let's Encrypt.

Step 4: Update the Feed

Cron updates the feed Monday through Friday, running every 5 minutes from the moment the show ends until an hour later. The show's producers are very consistent about uploading the episodes on time, so this seems to just work. To be on the safe side I added another run two hours after the show ends.

The cron job runs on the host using dokku run. The podcast feed and the audio files are stored in a Docker volume, so both the web-serving process and the cron job can access the same persistent storage.

youtube-dl is smart enough not to re-download content that already exists, so running the command multiple times does not hammer the servers.

Step 5: Monitoring

For an automation to be perfect it must be monitored. As with all my websites, I set up a New Relic Synthetics monitor which checks that the feed is being served and that its content appears valid by looking for the "pubDate" text.
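The Synthetics script itself lives in New Relic; the check it performs is essentially the following, with a placeholder feed URL:

import requests

resp = requests.get("https://podcast.example.com/podcast.xml", timeout=30)
assert resp.status_code == 200
assert "pubDate" in resp.text  # a crude proxy for "the feed has entries"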

To monitor the cron job, I cURL a healthchecks.io-provided URL at the very end of the bash script that coordinates the fetching and building of the feed. Make sure to set -e your bash scripts so they exit after the first failed command; without -e, cURL gets called even if a step fails.
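The real coordination script is bash, but the same fail-fast pattern sketched in Python looks like this - the script names and the check URL are placeholders:

import subprocess

import requests

HEALTHCHECK_URL = "https://hc-ping.com/your-check-uuid"  # placeholder

# check_call raises on the first failing step, just like set -e aborts a bash
# script, so the healthchecks.io ping only happens when everything succeeded.
subprocess.check_call(["python", "collect_urls.py"])
subprocess.check_call(["youtube-dl", "--batch-file", "episodes.txt",
                       "-o", "episodes/%(title)s.%(ext)s"])
subprocess.check_call(["python", "build_feed.py"])

requests.get(HEALTHCHECK_URL, timeout=10)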

I actually use those two tools so much that I maintain two related projects, NeReS and Babis.

Fun fact: this is the second time I've built a podcast for this show. The first one was around 2008.

