Tools and Techniques for NLProc
Need Text Cleaning? You Need This👇
Solution: Install the clean-text library (`pip install clean-text` and `pip install Unidecode`). Import it, then simply call the `clean()` function on your text documents.
```python
from cleantext import clean

final = """
Zürich has a famous website https://www.zuerich.com/
WHICH ACCEPTS 40,000 € and adding a random string, :
abc123def456ghi789zero0 for this demo. '
"""

print(clean(final,
    fix_unicode=True,            # fix various unicode errors
    to_ascii=True,               # transliterate to closest ASCII representation
    lower=True,                  # lowercase text
    no_line_breaks=False,        # fully strip line breaks as opposed to only normalizing them
    no_urls=False,               # replace all URLs with a special token
    no_emails=False,             # replace all email addresses with a special token
    no_phone_numbers=False,      # replace all phone numbers with a special token
    no_numbers=False,            # replace all numbers with a special token
    no_digits=False,             # replace all digits with a special token
    no_currency_symbols=False,   # replace all currency symbols with a special token
    no_punct=False,              # remove punctuation
    replace_with_punct="",       # instead of removing punctuation, you may replace it
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                    # set to 'de' for German special handling
))
```
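In practice you usually flip a few of these flags on. Here is a minimal sketch of the placeholder-token behaviour using only the parameters shown above; the input string is made up for illustration, and the exact output spacing may differ slightly:

```python
from cleantext import clean

# Replace URLs and numbers with placeholder tokens rather than keeping them
cleaned = clean(
    "Visit https://www.zuerich.com/ and pay 40,000 €",
    lower=True,
    to_ascii=False,            # keep the € sign as-is for this demo
    no_urls=True,              # URLs become replace_with_url
    no_numbers=True,           # numbers become replace_with_number
    replace_with_url="<URL>",
    replace_with_number="<NUMBER>",
)
print(cleaned)  # roughly: visit <URL> and pay <NUMBER> €
```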
### Sentence Tokenization
Solution: I find NLTK's `sent_tokenize()` more useful than spaCy's `doc.sents` attribute.
```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download; newer NLTK versions may prompt for 'punkt_tab' instead

strings = '''
As we can see we need to do some cleaning. We have some oddly named categories and I also checked for null values. From our data exploration, we have a few handy functions to clean the data we will use here again. For example, remove all digits, HTML strings and stopwords from our text and to lemmatise the words.
'''

list_of_sents = sent_tokenize(strings)
print(list_of_sents)
```
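For comparison, the spaCy equivalent iterates over `doc.sents`, which is a generator of `Span` objects rather than a method. This sketch assumes the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We need to do some cleaning. We also checked for null values.")

# doc.sents lazily yields one Span per detected sentence
sentences = [sent.text for sent in doc.sents]
print(sentences)
```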
### Removing Control Characters (\n, \r, \t)
Solution: `clean()` from the clean-text library can easily take care of this. But if you prefer regex, here's how:
```python
import re

s = "We have some \n oddly \t named categories and I also checked \r for null values."

# match any newline, carriage return or tab and replace it with a space
regex = re.compile(r'[\n\r\t]')
s = regex.sub(" ", s)
print(s)
```
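Note that the substitution above leaves double spaces behind wherever a control character sat next to an ordinary space. If you want to collapse those runs in the same pass, a `\s+` pattern (which matches spaces as well as `\n`, `\r` and `\t`) is a common variant:

```python
import re

s = "We have some \n oddly \t named categories and I also checked \r for null values."

# \s matches spaces, tabs, newlines and carriage returns; + collapses each run into one space
print(re.sub(r'\s+', ' ', s).strip())
```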