Nextcloud and Whisper

Intro

I recently read about Whisper CPP [1] (again) and decided to play around with it a bit. The idea of using such a model to automatically transcribe recordings uploaded to my Nextcloud had been in my head for a while already, and this time I decided to go ahead and try.

My Nextcloud already has the OCR-Workflow enabled, which works like a charm using Tesseract to add text to scanned documents, so this felt like a natural step forward.

Whisper CPP

Whisper CPP is a project that focuses on providing a C/C++ implementation [2] of the famous OpenAI Whisper model(s). The most important points for this project are:

  • Multi-language. Apparently that is "expected", but I actually feared it would be English only.
  • Freely downloadable. I will not go into the discussions regarding OpenAI's openness. As a non-researcher hobbyist this is all I need.
  • Multiple models of different sizes. No way I am running the large model in the background all the time.

Specifically for the CPP one:

  • Runs on my server, which is pretty old and only has 4 CPU threads and no dedicated graphics card.

The documentation is also very nice and it is super easy to set up. My only complaint is that the main program has a huge number of arguments that are easily confused: -f is the input file and -of is the output file, except that -of takes only the base name without the extension, and no file is written at all until you also enable at least one of the output type flags.
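To make the interplay of -of and the output type flags concrete, the following sketch mimics only the naming rule: the base name given to -of plus one extension per enabled output flag. The helper whisper_outputs is made up for illustration and is not part of Whisper CPP; it covers just the -otxt, -ovtt and -osrt flags.

```shell
# Hypothetical helper (not part of Whisper CPP): list the files a run would
# write for a given -of base name and a set of output type flags.
whisper_outputs() {
    local base=$1
    shift
    local flag
    for flag in "$@"; do
        case $flag in
            -otxt) echo "$base.txt" ;;
            -ovtt) echo "$base.vtt" ;;
            -osrt) echo "$base.srt" ;;
        esac
    done
}

# Two output flags: prints /tmp/recording.txt and /tmp/recording.vtt, one per line.
whisper_outputs /tmp/recording -otxt -ovtt

# No output flag: prints nothing, which mirrors "no file is written".
whisper_outputs /tmp/recording
```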

Noteworthy: the input has to be converted to a 16 kHz, mono, 16-bit WAV file with ffmpeg first:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le out.wav

This is explained in the README of the project and might change in the future.

Flows (external script)

For quite some time now, Nextcloud has had a mechanism to automatically perform tasks in the background whenever a file is changed, uploaded, and so on. Any such rule is called a "Flow" in Nextcloud.

There are ready-made ones (such as the OCR one), but you can also add your own scripts using the Workflow external scripts app.

I have to say that how all of this works is not very transparent to me: When exactly are the scripts executed? How does this interact with the Nextcloud cron job? What is the exact syntax for regexes?

The documentation could also use some more examples in my opinion, but with some time and dedication I got it working after trying out multiple options. At least somewhat: right now my regex matches all files with the word "recording" in the filename. I would like to at least require a file extension or, even better, match on a directory or on files with a certain tag, but this way I at least got it running, and so far there have been no false positives.
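A stricter pattern could require both the word and an audio extension. The sketch below only tests such a pattern locally with grep -E. The pattern itself, the list of extensions and the helper name matches_recording are assumptions made for this example, and the exact syntax the Flow rule expects (delimiters, case-insensitivity flag) may well differ.

```shell
# Hypothetical pattern: "recording" anywhere in the name, audio extension at the end.
pattern='recording.*\.(mp3|m4a|ogg|wav)$'

matches_recording() {
    # -q: quiet, -i: ignore case, -E: extended regular expressions
    printf '%s\n' "$1" | grep -qiE "$pattern"
}

# Try the pattern against a few made-up file names.
for name in "recording-2023-01-05.mp3" "Recording notes.txt" "holiday.mp3"; do
    if matches_recording "$name"; then
        echo "match:    $name"
    else
        echo "no match: $name"
    fi
done
```

Only the first name matches: the second has the word but not an audio extension, the third has the extension but not the word.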

Bringing it together

Final script

To be honest: this is not really final, rather the first working version, at which point I stopped caring. There are definitely things to improve:

  1. Create a temporary directory instead. This would avoid having to overwrite the tmp file with ffmpeg. Also set up a trap to remove the files in case something goes wrong.

  2. Check the input. My regex does not contain a file extension, so
#!/bin/bash
# $1: path of the uploaded file, relative to the Nextcloud data directory
converted_tmp_file=$(mktemp --suffix .wav)
model=/srv/whisper.cpp/models/ggml-small.bin

data_dir=/var/www/nextcloud/data
infile="$data_dir/$1"
infile_no_ext="${1%.*}"
out_no_ext="$data_dir/$infile_no_ext"

# mktemp already created the temp file, so ffmpeg has to overwrite it (-y)
ffmpeg -y -i "$infile" -ar 16000 -ac 1 -c:a pcm_s16le "$converted_tmp_file"

# -of takes the output base name; -otxt and -ovtt add the extensions
whisper -f "$converted_tmp_file" -of "$out_no_ext" -otxt -ovtt -l auto -m "$model"

rm "$converted_tmp_file"
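
The first improvement from the list above could look roughly like this. It is only a sketch and not something that has run in production: the function name transcribe and the variables WHISPER_MODEL and NC_DATA_DIR are made up for this example, while the whisper binary name, the flags and the default paths are taken from the script above.

```shell
#!/bin/bash
# Sketch only: transcribe(), WHISPER_MODEL and NC_DATA_DIR are hypothetical names.
WHISPER_MODEL=${WHISPER_MODEL:-/srv/whisper.cpp/models/ggml-small.bin}
NC_DATA_DIR=${NC_DATA_DIR:-/var/www/nextcloud/data}

# The body runs in a subshell ( ... ), so the EXIT trap fires when the
# function finishes, whether the commands succeeded or not.
transcribe() (
    rel_path=$1
    infile="$NC_DATA_DIR/$rel_path"
    out_no_ext="$NC_DATA_DIR/${rel_path%.*}"

    # A private directory: the wav file does not exist yet, so no -y is needed.
    work_dir=$(mktemp -d) || exit 1
    trap 'rm -rf "$work_dir"' EXIT

    ffmpeg -i "$infile" -ar 16000 -ac 1 -c:a pcm_s16le "$work_dir/audio.wav" &&
        whisper -f "$work_dir/audio.wav" -of "$out_no_ext" -otxt -ovtt -l auto -m "$WHISPER_MODEL"
)

# Only do something when called with an argument, e.g. by the flow.
if [ "$#" -gt 0 ]; then
    transcribe "$1"
fi
```

Because the trap is tied to the subshell, the temporary directory also disappears when ffmpeg or whisper fails halfway through, which is the case the plain rm at the end of the original script does not cover.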

Making Nextcloud aware of the changes

Now that the script runs and the files are created where Nextcloud puts the data, they should show up in the web interface and locally, right? Turns out, they do not. Files added directly to the filesystem are not added to the database, meaning Nextcloud is not aware of them.

This was a huge annoyance to me, because the solutions I found by googling did not work for me. The occ command from Nextcloud has an option to scan for files. Something like:

sudo -u www-data /path/to/occ files:scan --path /new/file --unscanned

should work, or at least was suggested somewhere, but it always reported 0 new files in my tests. Scanning all files did work, however:

sudo -u www-data /path/to/nextcloud/occ files:scan --all

Clearly this is not desirable, but I was willing to use it for a while until I figured out how to add single files.
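
One thing that might be worth trying here: judging by the occ help text, --path seems to expect a path as Nextcloud sees it (of the form /user/files/...), not a path on disk. The sketch below derives such a path for the folder containing the uploaded file from the script's first argument, assuming that argument looks like alice/files/Recordings/recording.mp3. The helper name occ_scan_path and the example user are made up, and whether this cures the "0 new files" result is untested.

```shell
# Hypothetical helper: turn "alice/files/Recordings/recording.mp3" into
# "/alice/files/Recordings", the folder to rescan.
occ_scan_path() {
    local rel_path=${1#/}          # drop a leading slash, if any
    printf '/%s\n' "$(dirname "$rel_path")"
}

occ_scan_path "alice/files/Recordings/recording.mp3"
# prints /alice/files/Recordings

# Possible use inside the script (the occ location is left as a placeholder):
# sudo -u www-data /path/to/nextcloud/occ files:scan --path="$(occ_scan_path "$1")"
```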

But then I read that Nextcloud has a config option to watch for filesystem changes: https://help.nextcloud.com/t/how-to-make-nextcloud-aware-of-added-files/10824/12

I am not sure why the proposed solution is to add 'filesystem_check_changes' => 0 to the config in /config/config.php. From reading the Nextcloud documentation, the default is already 0, which translates to not checking for changes. I set it to 1 in my config and it seems to work pretty much as expected.
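
The same setting can probably also be applied from the command line instead of editing config.php by hand. A sketch, assuming occ's config:system:set and config:system:get commands and the same placeholder occ path as above; the wrapper function set_check_changes is made up for this example:

```shell
# Hypothetical wrapper around occ; pass the path to occ as the first argument.
set_check_changes() {
    local occ=$1
    # 1 = check for changes made outside of Nextcloud; stored as an integer.
    # On success, print the stored value back as a quick check.
    sudo -u www-data "$occ" config:system:set filesystem_check_changes --value=1 --type=integer &&
        sudo -u www-data "$occ" config:system:get filesystem_check_changes
}

# Example call (placeholder path):
# set_check_changes /path/to/nextcloud/occ
```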

Remaining Issues

  • My local folder does not get synced automatically after the files are added and available in the web interface. I have to check how I set up the client.
  • The script needs some fine-tuning, as explained in the script section.
  • I would like to change the flow to work on tagged files only. Need to figure out how tags work and how to use that in the flow.

Footnotes:

[1] Probably on Hacker News, but I don't remember exactly.

[2] They write "inference" instead of implementation. I am not that familiar with the deep learning lingo…
