
Python Code for Tokenising a Text

Published: 23 December 2022


I picked up Python over the last summer (thank goodness for IntelliSense, and for the entire Python community sharing their code; it made putting this together so much easier), and spent an afternoon on this just for fun. It isn't much, but I thought that if it helps someone, why not put it out there? It's probably not the cleanest code around, but it gets the job done.

Python Code

First, import all the relevant packages and functions that you will need.

import nltk
import collections
import re
import string
import pandas as pd

from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

Then, download the list of stopwords from the Natural Language Toolkit (NLTK). You will also need the "punkt" tokeniser models, since the word_tokenize() function used later depends on them.

nltk.download("stopwords")
nltk.download("punkt")
stopwords_ls = list(set(stopwords.words("english")))

You may use the print() function to check the number of stopwords in NLTK's list, and the list itself.

print("Total English stopwords: ", len(stopwords_ls))
print(stopwords_ls)

NLTK's list already covers most high-frequency function words, including "a" and "the", but you will probably still come across words you want to filter out that aren't on it, so you can add your own stopwords. It's not the most elegant approach, since you need to keep adding more words whenever you come across one, but for now it gets the job done. (The words below are purely illustrative; any duplicates of NLTK's list are harmless, since the filtering step only checks membership.)

my_extra = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]
stopwords_ls.extend(my_extra)
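One optional tweak: since the filtering step later checks every token against this list, converting it to a set makes those membership checks much faster on long texts, and it de-duplicates the extras for free. A minimal sketch (the short stand-in lists here are mine, not NLTK's full list):

```python
# Stand-in stopword list (in the post this comes from NLTK).
stopwords_ls = ["i", "me", "my", "we", "our"]
my_extra = ["a", "an", "the"]

# A set gives O(1) average-time membership checks, vs O(n) for a list,
# which matters once the text has many thousands of tokens.
stopwords_set = set(stopwords_ls) | set(my_extra)

print("the" in stopwords_set)  # membership tests work exactly as with a list
print("dog" in stopwords_set)
```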

Next, open the text file that you would like to convert into a tokenised wordlist.

with open("C:/YOURPATH/YOURTEXT.txt", encoding = "utf8") as f:
    text = f.read()

You can clean your text by converting all letters to lower case and removing all punctuation, numbers, and HTML tags. This could have been written more concisely, but this is what I've cobbled together from the Internet.

def clean_text(text):
    #convert to lower case
    cleaned_text = text.lower()
    #remove HTML tags
    html_pattern = re.compile("<.*?>")
    cleaned_text = re.sub(html_pattern, "", cleaned_text)
    #remove punctuations
    cleaned_text = cleaned_text.translate(str.maketrans("","", string.punctuation))
    return cleaned_text.strip()

def no_number_preprocessor(tokens):
    #remove digits (raw string avoids an invalid-escape warning)
    r = re.sub(r"\d+", "", tokens)
    return r

no_num_text = no_number_preprocessor(text)
cleaned_text = clean_text(no_num_text)
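To sanity-check the two functions above, you can run them on a short sample string (the sample sentence here is my own):

```python
import re
import string

def clean_text(text):
    #convert to lower case, strip HTML tags, then strip punctuation
    cleaned_text = text.lower()
    cleaned_text = re.sub(re.compile("<.*?>"), "", cleaned_text)
    cleaned_text = cleaned_text.translate(str.maketrans("", "", string.punctuation))
    return cleaned_text.strip()

def no_number_preprocessor(tokens):
    #remove digit runs
    return re.sub(r"\d+", "", tokens)

sample = "<p>Chapter 12: The CAT sat, happily, on the mat!</p>"
#prints the lower-cased sample with tags, digits, and punctuation removed
print(clean_text(no_number_preprocessor(sample)))
```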

Voila! Now you have a clean text! Using the cleaned text, you can tokenise the words and filter out the list of stop words to create a filtered wordlist.

wordtokens = word_tokenize(cleaned_text)
filtered_list = []

for w in wordtokens:
    if w not in stopwords_ls:
        filtered_list.append(w)

print(filtered_list)
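The same filtering loop can also be written as a one-line list comprehension, which is the more idiomatic Python form (shown here with stand-in lists rather than real tokeniser output):

```python
wordtokens = ["the", "cat", "sat", "on", "the", "mat"]  # stand-in for word_tokenize output
stopwords_ls = ["the", "on", "a", "an"]                 # stand-in stopword list

#keep every token that is not a stopword, in one expression
filtered_list = [w for w in wordtokens if w not in stopwords_ls]
print(filtered_list)  # -> ['cat', 'sat', 'mat']
```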

The following is used to derive the raw frequency counts of each token.

#for counting tokenised words
print(collections.Counter(filtered_list))

Finally, you can use the following to create a table comprising the words and their frequency counts, and export it into a spreadsheet.

#converting into a list of words and their frequencies
cnt = Counter(filtered_list)
wordlist = [list(i) for i in cnt.items()]

#converting list into a dataframe
df = pd.DataFrame(data = wordlist, columns = ["word", "count"])

#exporting wordlist
df.to_csv("C:/YOURPATH/YOURLIST.csv", index = False, encoding = "utf-8")
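If you'd like the spreadsheet sorted with the most frequent words first, Counter's most_common() method already returns (word, count) pairs in descending order of frequency, so you can build the dataframe straight from it. A small sketch with a stand-in token list:

```python
from collections import Counter
import pandas as pd

filtered_list = ["cat", "mat", "cat", "sat", "cat", "mat"]  # stand-in token list

#most_common() yields (word, count) tuples sorted by count, descending
cnt = Counter(filtered_list)
df = pd.DataFrame(cnt.most_common(), columns=["word", "count"])
print(df)
```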

And there you go! Just change the name and directory of your text file and you can convert any reading material you have for your students, and it takes only a second to run the programme. You may also use Tableau to create a word cloud from the wordlist and then use it for classroom activities.
