Text analysis

A ready-to-run tutorial for building a structured data set from text.


In this tutorial, I will illustrate how to build a data set from text. As an example, I consider a birth register, which contains the following text:

On August 21 1826 a son was born to John Bon and named him Francis.
On June 11 1813 a daughter was born to James Donne naming her Mary Sarah.
On January 1 1832 a son was born to his father David Borne and named him John.

Each line of the document contains a birth record. All the records have almost the same structure, although they differ in some details. The register is saved in a txt file named register.txt.
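If you want to run the code yourself, here is a minimal sketch that creates register.txt with the three example lines above (the filename is the one used in this tutorial):

lines = [
    "On August 21 1826 a son was born to John Bon and named him Francis.",
    "On June 11 1813 a daughter was born to James Donne naming her Mary Sarah.",
    "On January 1 1832 a son was born to his father David Borne and named him John.",
]

# write one record per line, as expected by the rest of the tutorial
with open('register.txt', 'w') as f:
    f.write('\n'.join(lines))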

In this tutorial, I will describe, step by step, how to extract the fathers' names and surnames, the children's names, the birth dates, and the children's gender.

Follow these steps to extract the information:

  • convert the text into a pandas data frame in which each row corresponds to a line of the register
  • analyze the structure of a typical sentence to decompose it into parts of speech
  • extract all the required information.

For the last two steps, I use the spaCy library, a Python library for natural language processing.

First, I open the text file from the file system, split it into lines, and save each line as a list item:

with open('register.txt', 'r') as f:
    text = [line for line in f.readlines()]

Then I build a pandas data frame from the list just created:

import pandas as pd

df = pd.DataFrame(text, columns=['text'])
df.head()

I will take the first sentence as an example and perform basic NLP processing on it. This can be done with the nlp() function of the spaCy library, which requires the en_core_web_sm model. The model can be installed with the following command: python -m spacy download en_core_web_sm. The spaCy library supports many languages, whose models can be installed with the same command.

Once the model is installed, I import it and load it. Then I can perform the NLP processing.

import en_core_web_sm
import spacy
text = df['text'][0]
nlp = en_core_web_sm.load()
doc = nlp(text)
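As a side note, the same model can also be loaded by name through spacy.load(), a common alternative to importing the model package directly:

import spacy

# load the small English model by name and process the same text
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)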

The doc variable contains all the processed data. In my case, all I need is the part of speech (PoS) of each token, which can be extracted as follows:

features = []
for token in doc:
    features.append({'token': token.text, 'pos': token.pos_})

For each token in the sentence, I store its text and its PoS tag and append the result to a list named features. Now I can build a data frame with the extracted features. This is done only to visualize the results in an organized way:

fdf = pd.DataFrame(features)
fdf.head(len(fdf))

I notice that the father (John Bon) is tagged as PROPN, which marks a proper noun (a person's name), and that he is preceded by the word to and followed by the word and. This does not hold for all the sentences in the data set, so I must also take into account the structure of the other sentences, as explained later.

The child is also recognized as a PROPN, always followed by a PUNCT token and preceded by the word him or her.
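To see how the other sentences differ, a similar check can be run on the third entry, where the father is introduced by his father rather than just to (assuming the model tags the names as PROPN there as well):

# print token and PoS for the third register entry
doc3 = nlp(df['text'][2])
for token in doc3:
    print(token.text, token.pos_)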

Now I can build a pattern that extracts the father's first name and surname. I define two variables, first_tokens and last_tokens, which contain the possible preceding words and following words, respectively:

first_tokens = ['to', 'father']
last_tokens = ['and', 'naming']

I define a pattern that can be used to identify the father:

pattern_father = [[{'LOWER': {'IN': first_tokens}},
                   {'POS': 'PROPN', 'OP': '+'},
                   {'LOWER': {'IN': last_tokens}}]]

The pattern says that the match must start with one of the words in first_tokens (the LOWER attribute makes the comparison case-insensitive), continue with one or more PROPN tokens, and end with one of the words in last_tokens.

Then I define the following function:

from spacy.matcher import Matcher

def get_father(x):
    nlp = en_core_web_sm.load()
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    matcher.add("matching_father", pattern_father)
    matches = matcher(doc)
    if len(matches) > 0:
        # keep only the tokens between the first and the last word of the match
        span = doc[matches[0][1]:matches[0][2]]
        tokens = span.text.split(' ')
        name, surname = tokens[1:-1]
        return name, surname
    return None, None

This function exploits the spaCy Matcher() class, which searches for the previously defined pattern. If a match is found, I drop the first and last words of the match and return the remaining tokens, i.e. the first name and the surname.
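As a quick sanity check, I can call the function on the first register entry; assuming the model tags both John and Bon as PROPN, it should return the two name tokens:

# expected result: ('John', 'Bon')
print(get_father(df['text'][0]))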

Now I exploit the data frame apply() function to extract the father for each text in the data set:

new_columns = ['father name', 'surname']
for n, col in enumerate(new_columns):
    df[col] = df['text'].apply(lambda x: get_father(x)).apply(lambda x: x[n])
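Note that the loop above runs get_father() (and reloads the model) once for every new column; a slightly leaner variant, sketched here, computes the (name, surname) pair once per row and then splits it into the two columns:

# compute the (name, surname) tuple once per row, then expand it
father = df['text'].apply(get_father)
df['father name'] = father.apply(lambda t: t[0])
df['surname'] = father.apply(lambda t: t[1])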

The procedure for extracting each child's name is quite similar to the one followed for the father. First, I define the first and last tokens and the pattern:

first_tokens = ['him', 'her']
last_tokens = ['.']
pattern_son = [[{'LOWER': {'IN': first_tokens}},
                {'POS': 'PROPN', 'OP': '+'},
                {'LOWER': {'IN': last_tokens}}]]

and then I define the function that extracts the child:

def get_child(x):
    nlp = en_core_web_sm.load()
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    matcher.add("matching_son", pattern_son)
    matches = matcher(doc)
    sub_text = ''
    if len(matches) > 0:
        span = doc[matches[0][1]:matches[0][2]]
        sub_text = span.text
    # remove the final punctuation mark
    sub_text = sub_text[:-1]
    tokens = sub_text.split(' ')

    return ' '.join(tokens[1:])

Note that the last line of the function joins compound names made of more than one token (such as Mary Sarah).
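For example, running the function on the second register entry should return the compound name, again assuming the model tags both tokens as PROPN:

# expected result: 'Mary Sarah'
print(get_child(df['text'][1]))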

Finally, I exploit the apply() function to extract each child:

df['child'] = df['text'].apply(lambda x: get_child(x))

Now I define a function that extracts the date. In this case, I do not use the spaCy library. Looking at the sentences, I note that the date always starts at the second token (the month) and ends at the fourth token (the year). I convert the month from a string to a number and return the date in the YYYY-MM-DD format.

def get_date(x):
    months = {"January": "01", "February": "02", "March": "03", "April": "04",
              "May": "05", "June": "06", "July": "07", "August": "08",
              "September": "09", "October": "10", "November": "11", "December": "12"}
    tokens = x.split(" ")
    # month
    month = months[tokens[1]]
    # day
    day = tokens[2]
    if len(day) == 1:
        day = "0" + day
    # year
    year = tokens[3]

    return year + "-" + month + "-" + day
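A quick check on the first register entry (this function does not use spaCy, so the result is deterministic):

# expected result: '1826-08-21'
print(get_date(df['text'][0]))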

I apply the defined function to the data frame:

df['date'] = df['text'].apply(lambda x: get_date(x))

Finally, I extract the gender by exploiting the following idea: if a sentence contains the word son, the child is male; otherwise, the child is female:

def get_gender(x):
    if 'son' in x:
        return 'M'
    return 'F'

df['gender'] = df['text'].apply(lambda x: get_gender(x))

The dataset is ready to be exported as a CSV file:

df.to_csv('structured_register.csv')
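To double-check the exported file, I can read it back with pandas:

# reload the exported data set and inspect the first rows
check = pd.read_csv('structured_register.csv', index_col=0)
check.head()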

In this tutorial, I have illustrated how structured information can be extracted from unstructured text. I have exploited two features of the spaCy library: the nlp() function, which performs the NLP processing, and the Matcher() class, which searches for a pattern in a string.

The spaCy library is very powerful, so stay tuned if you want to learn about other features it provides ;)

The full code for this tutorial can be downloaded from my GitHub repository.

If you want updates on my research and other activities, you can follow me on Twitter, YouTube, and GitHub.
