Have you ever encountered a large amount of text that has not been recorded? You must return the case information before publishing the text. In this post, I will show you how to correct capital letters in documents.
Truecasing is a natural language processing problem in finding the correct capitalization of words in a text where such information is not available.
Use cases include transcripts from multiple audio sources, automatic speech recognition, optical character recognition, medical information, web messaging, and games. Raw text sources are often produced in a hurry, and most words are written in lower case.
There are several practical approaches to the truecasing problem:
- sentence segmentation: Divide the entered text into sentences and capitalize the first word of each sentence.
- speech part (POS) coding: retrieves the definition and context of each word in the sentence and adds words with specific tags, e.g. nouns.
- name entity recognition (NER): classify the words in the sentence into certain categories and decide to update, eg the names of the persons, etc.
- statistical modeling: training a statistical model of words and word groups that usually appear in capital letters.
I sropose a simple practical truecaser combining sentence segmentation and sub-of-speech techniques as follows.
The algorithm correctly identifies the first occurrence of the word stone as the name of a person who should be capitalized.
How does the method work for a large number of non-letter documents? I test below the bullet using well-known reviews Yelp dataset. The text is small before my code processes it. The resulting uppercase letter is compared to the original version BLUE score.
We get 86% of the comparison as a measure of quality by comparing the amount of n-gram between the predicted and actual uppercase letters.