RESEARCHERS in the United States have attempted to ascertain which transcribes a sermon more accurately: a human or a machine.
Transcripts of sermons are often published on church websites. While some preachers stick to a script, which can then be published, sermons delivered extempore or from notes have to be transcribed from an audio recording.
Technological advances, however, have meant that this task can be carried out by a computer program, and the Pew Research Center has compared the results.
As part of a study in 2019, followed up in 2020, researchers downloaded 60,000 audio and video files of sermons. The object was to analyse topics discussed in sermons in different denominations. They used Amazon Transcribe, a speech-recognition service, to transcribe the sermons.
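The analysis does not reproduce Pew’s pipeline; purely as an illustration, a recording can be submitted to Amazon Transcribe through the AWS boto3 SDK roughly as follows (the bucket, file, and job names here are invented):

```python
# Hypothetical sketch, not Pew's pipeline: submitting one sermon
# recording to Amazon Transcribe with the boto3 SDK. The bucket,
# key, and job names are invented for illustration.
import time

import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.start_transcription_job(
    TranscriptionJobName="sermon-0001",
    Media={"MediaFileUri": "s3://example-bucket/sermons/0001.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Poll until the job finishes, then print the transcript's location.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="sermon-0001")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

if status == "COMPLETED":
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```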
The researchers discovered a problem, however. “The Amazon service did not always get specific religious terminology or names right,” they write in an analysis published online. “A few examples included ‘punches pilot’ instead of ‘Pontius Pilate’ and ‘do Toronto me’ in lieu of ‘Deuteronomy.’”
The researchers then asked a “third-party human transcription service to tackle portions of some of the sermons that Amazon Transcribe had already transcribed, and then compared the results between the two”.
They took a “stratified random sample” of 200 sermons from different regions of the US: the Midwest, the South, and “a combined region that merges the Northeast and the West”. Sermons were drawn from four denominations for which they had a sufficient sample size: “mainline Protestant, Evangelical Protestant, historically Black Protestant, and [Roman] Catholic.”
Audio samples of the sermons, lasting between 30 and 210 seconds, were sent to the human transcription service.
They compared the machine and human transcription services using a metric called “Levenshtein distance”, which, they write, “counts the number of discrete edits — insertions, deletions and substitutions — at the character level necessary to transform one text string into another”. For example, if the word “Covid” is transcribed as “cove in”, the Levenshtein distance is three, because three edits are required: one to insert an “e” after the “v”, one to insert a space before the “i”, and one to substitute an “n” for the final “d”.
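In code, the distance can be computed with the standard dynamic-programming algorithm. The sketch below, in Python, is offered for illustration and is not Pew’s own implementation; the function name is ours:

```python
# A minimal sketch of character-level Levenshtein distance, using the
# standard dynamic-programming recurrence. Illustrative only; not
# Pew's code.
def levenshtein(a: str, b: str) -> int:
    """Count the insertions, deletions and substitutions needed to
    turn string a into string b."""
    # prev[j] holds the distance between the current prefix of a
    # and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

# The example from the article (lower-cased, since the comparison
# is case-sensitive at the character level):
print(levenshtein("covid", "cove in"))  # -> 3
```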
The researchers found that, across all the files that they analysed, “the average difference between machine transcriptions and human transcriptions was around 11 characters per 100. That is, for every 100 characters in a transcription text, approximately 11 differed from one transcription method to the other.”
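The “per 100 characters” figure is, in effect, a normalised edit distance. Pew’s analysis does not spell out the exact normalisation; one common convention is to divide the distance by the length of one of the transcripts, as in this sketch, which reuses the levenshtein() function above with invented sample strings:

```python
# Invented sample strings, echoing Pew's "do Toronto me" example;
# reuses the levenshtein() sketch above.
machine = "do Toronto me chapter six verse four"
human = "Deuteronomy chapter six verse four"

# Differences per 100 characters, normalising by the length of the
# human transcript (Pew's exact normalisation is not spelled out).
per_100 = 100 * levenshtein(machine, human) / len(human)
print(f"{per_100:.1f} character differences per 100 characters")
```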
They also detected a “small but statistically significant” difference in Levenshtein distances between denominations. “Text taken from Catholic sermons, for example, had more inconsistency between transcripts than was true of those taken from evangelical Protestant sermons. And sermons from historically Black Protestant churches had significantly more inconsistency in transcriptions when compared with the other religious traditions.”
They continue, however: “While these differences were statistically significant, their magnitude was relatively small. Even for historically Black Protestant sermons — the tradition with the largest mismatch between machines and humans — the differences worked out to around just 15 characters per 100, or four more than the overall average.”
When it came to how accurately humans and machines transcribed regional accents, the researchers were surprised: they had expected machines to struggle most with Southern accents — but, in fact, “transcriptions of sermons from churches in the Midwest had significantly more inconsistency between machine and human transcriptions than those in other regions.”
The difference was not great, however: “Midwestern sermons, despite having the greatest inconsistency across regions, had only two more character differences per 100 characters than the overall average.”
They are not sure why machines found Midwestern sermons more difficult to transcribe, but one factor, they write, might be that the Midwestern files had “worse audio quality than those from other regions”.
The researchers conclude that “issues with transcription quality can be tied to the quality of the audio being transcribed — which presents challenges for humans and computers alike. . .
“By the same token, machine transcription may perform worse or better on certain accents or dialects — but that’s also true for human transcribers. When working with audio that has specialized vocabulary (in our case, religious terms), human transcribers sometimes made errors where machines did not. This is likely because a robust machine transcription service will have a larger dictionary of familiar terms than the average person. Similarly, we found that humans are more likely to make typos, something one will not run into with machine transcription.”
They warn, however, that “the reliability of machine transcription can sometimes backfire. When presented with a segment of tricky audio, for example, humans can determine that the text is ‘unintelligible.’ A machine, on the other hand, will try to match the sounds it hears as closely as possible to a word it knows with little to no regard for grammar or intelligibility. While this might produce a phonetically similar transcription, it may deviate far from what the speaker truly said.”