Files & Character Encoding

Open a Text File

To read or write a text file with Python, you first need to open it. You can do this with Python’s built-in open() function.

open('sample-file.txt', encoding='utf-8')
<_io.TextIOWrapper name='sample-file.txt' mode='r' encoding='utf-8'>

Inside the parentheses of the open() function, you put the filepath of the file you want to open, in quotation marks. You should also specify a character encoding, which we will discuss in more detail below. The function returns what’s called a file object.

Read a Text File

A file object is not itself readable text. To get the contents of the file as text, you need to use the .read() method.

open('sample-file.txt', encoding='utf-8').read()
'This text file is now open and read!'

Write a Text File

The default mode for the open() function is to read text files: mode = 'r'.

But you can use the open() function to write files, too. Simply set the mode to write: mode = 'w'. Be aware that this mode erases the existing contents of the file if the file already exists.

open('a-new-file.txt', mode='w', encoding='utf-8')
<_io.TextIOWrapper name='a-new-file.txt' mode='w' encoding='utf-8'>

To write something to this newly opened text file, you can use the .write() method.

open('a-new-file.txt', mode='w', encoding='utf-8').write('I just wrote this to a text file. Alright!')

If we read this newly created text file, we can see that the .write() method worked correctly:

open('a-new-file.txt', mode='r', encoding='utf-8').read()
'I just wrote this to a text file. Alright!'
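
The one-line examples above never explicitly close the files they open. A common alternative is a with block, which closes the file automatically when the block ends. The sketch below is only an illustration: it assumes a throwaway file called demo-file.txt in a temporary directory (a made-up name, chosen so that nothing in your own folders gets overwritten), and it also shows the append mode, mode='a'.

```python
import tempfile
from pathlib import Path

# Hypothetical throwaway location, so none of your own files are touched
demo_path = Path(tempfile.mkdtemp()) / 'demo-file.txt'

# mode='w' creates the file, or erases it first if it already exists
with open(demo_path, mode='w', encoding='utf-8') as file_object:
    file_object.write('First line.\n')

# mode='a' (append) adds to the end of the file instead of erasing it
with open(demo_path, mode='a', encoding='utf-8') as file_object:
    file_object.write('Second line.\n')

# The default mode='r' reads the file back
with open(demo_path, encoding='utf-8') as file_object:
    demo_text = file_object.read()

print(demo_text)
```

Once each with block ends, its file is closed, which matters most when you write: closing is what guarantees the text has actually been saved to disk.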

Character Encoding

encoding='utf-8'

Why do we need to include encoding='utf-8' to open our text file? Well, UTF-8 is a character encoding, specifically one of the standard ways of storing Unicode text as bytes. We need to specify a character encoding because (gasp!) computers don’t actually know what text is; they only store numbers. Character encodings are systems that map characters to numbers. Each character is given a specific ID number, so that computers can store and display it.

You can check any character’s “code point,” or place in the Unicode universe, with the function ord().

ord("a")
97
ord("💩")
128169
ord("ত")
2468
ord("!")
33
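
Two related built-ins can make this mapping more concrete. chr() goes the other way, from code point to character, and the string method .encode() shows the actual bytes that UTF-8 uses to store a character. A small sketch (the variable names are just for illustration):

```python
# chr() reverses ord(): code point in, character out
letter = chr(97)
print(letter)

# .encode() turns a string into the bytes that get saved to disk
ascii_bytes = 'a'.encode('utf-8')
bengali_bytes = 'ত'.encode('utf-8')
emoji_bytes = '💩'.encode('utf-8')

# UTF-8 uses more bytes for characters with larger code points
print(len(ascii_bytes), len(bengali_bytes), len(emoji_bytes))

# .decode() turns the bytes back into text
print(emoji_bytes.decode('utf-8'))
```

The code point is the same no matter how the text is stored; the bytes are what change from one encoding to another.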

Unicode is the most widely used character standard on the internet, and UTF-8 is its most common encoding. Unicode even includes emojis. Yet, as Aditya Mukerjee points out in his essay “I Can Text You A Pile of Poo, But I Can’t Write My Name”, Unicode still does not include characters that are essential to the Bengali alphabet as well as to many other non-English languages.

Adding (UTF-8) Encoding

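It’s good practice to always specify UTF-8 encoding explicitly when opening files. If you leave the encoding out, Python falls back on a default that, depending on the Python version, can vary from one computer to another.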

sample_text_default = open('sample-character-encoding.txt', encoding='utf-8').read()
print(sample_text_default)
***
This is an example of curly quotation marks:
“She said, ‘I won’t bungle the encoding!’”
***

***
This is an example of an emoji:
💩
***

***
This is an example of Bengali:
আদিত্য মুখোপাধ্যায় টাইপ করতে পারেন - তবে নিজের নাম বানান করতে পারবেন না
(Aditya Mukerjee can type 💩 but cannot spell his own name)
***

***
This is an example of German:
Was ist, wenn wir über deutsche Sprachen recherchieren wollen?
(What if we want to research German languages?)
***

Look what happens if we read in the exact same text with a different encoding.

sample_text_iso = open('sample-character-encoding.txt', encoding='iso-8859-1').read()
print(sample_text_iso)
***
This is an example of curly quotation marks:
“She said, ‘I won’t bungle the encoding!’”
***

***
This is an example of an emoji:
💩
***

***
This is an example of Bengali:
আদিত্য মুখোপাধ্যায় টাইপ করতে পারেন - তবে নিজের নাম বানান করতে পারবেন না
(Aditya Mukerjee can type 💩 but cannot spell his own name)
***

***
This is an example of German:
Was ist, wenn wir über deutsche Sprachen recherchieren wollen?
(What if we want to research German languages?)
***

The older ASCII encoding covers only 128 characters, so it cannot decode this file at all, and Python raises an error:

sample_text_ascii = open('sample-character-encoding.txt', encoding='ascii').read()
print(sample_text_ascii)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-70-1d14095d0aa9> in <module>
----> 1 sample_text_ascii = open('sample-character-encoding.txt', encoding='ascii').read()
      2 print(sample_text_ascii)

~/anaconda3/lib/python3.7/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     27 
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 49: ordinal not in range(128)
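
If you would rather not have your program stop at the first byte it cannot decode, open() also accepts an errors argument. The sketch below is one way to see the difference. It builds its own small sample file in a temporary directory (the file name and contents are made up for the demonstration) rather than relying on sample-character-encoding.txt.

```python
import tempfile
from pathlib import Path

# Hypothetical sample file containing one non-ASCII character (a curly apostrophe)
sample_path = Path(tempfile.mkdtemp()) / 'curly-quote.txt'
sample_path.write_text('I won’t bungle the encoding!', encoding='utf-8')

# Strict decoding (the default) raises an error that we can catch
try:
    with open(sample_path, encoding='ascii') as file_object:
        file_object.read()
    decode_failed = False
except UnicodeDecodeError as error:
    decode_failed = True
    print('Could not decode:', error)

# errors='replace' swaps each undecodable byte for the replacement character
with open(sample_path, encoding='ascii', errors='replace') as file_object:
    lenient_text = file_object.read()

print(lenient_text)
```

Replacing bytes throws information away, so this is a way to inspect a problem file, not a fix. The real fix is still to find the right encoding.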

Debugging Tip

If you’re trying to read or analyze a text file and the text looks kind of weird, the likely cause is an encoding mismatch:

sample_text_iso = open('sample-character-encoding.txt', encoding='iso-8859-1').read()
print(sample_text_iso)
***
This is an example of curly quotation marks:
“She said, ‘I won’t bungle the encoding!’”
***

***
This is an example of an emoji:
💩
***

***
This is an example of Bengali:
আদিত্য মুখোপাধ্যায় টাইপ করতে পারেন - তবে নিজের নাম বানান করতে পারবেন না
(Aditya Mukerjee can type 💩 but cannot spell his own name)
***

***
This is an example of German:
Was ist, wenn wir über deutsche Sprachen recherchieren wollen?
(What if we want to research German languages?)
***

As David C. Zentgraf writes in his useful blog post about character encoding:

If you open a document and it looks like this [see garbled stuff above], there’s one and only one reason for it: Your text editor, browser, word processor or whatever else that’s trying to read the document is assuming the wrong encoding. That’s all. The document is not broken…there’s no magic you need to perform, you simply need to select the right encoding to display the document.

No magic! Just double-check the encoding.
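
To see where this kind of garbling comes from, you can reproduce it without any file at all. The sketch below encodes a short string as UTF-8 bytes, decodes those bytes with the wrong encoding (ISO-8859-1), and then reverses the mistake. The example string is made up for the demonstration.

```python
original = 'über'

# Store the text as UTF-8 bytes, then read the bytes with the wrong encoding
garbled = original.encode('utf-8').decode('iso-8859-1')
print(garbled)   # Ã¼ber

# No information was lost: undo the wrong decoding, then decode correctly
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired)  # über
```

The round trip works here because ISO-8859-1 assigns a character to every possible byte, so nothing is dropped along the way. Other mismatched encodings may not reverse so cleanly.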

More Advanced: Open and Read All Files in a Directory

We haven’t fully discussed Python modules and for loops yet, but once you’re comfortable with these concepts, it’s helpful to know how to work with all the files in a directory.

Import the Path class from the pathlib module

from pathlib import Path
directory_path = 'sample-directory'

Loop through everything in the directory with the star * character, which matches any name (including folders, such as .ipynb_checkpoints)

for filepath in Path(directory_path).glob('*'):
    print(filepath)
sample-directory/.ipynb_checkpoints
sample-directory/01-sample-file.txt
sample-directory/02-sample-file.txt
sample-directory/03.py

Loop through just the text files in the directory with *.txt, which matches only names that end with “.txt”

for filepath in Path(directory_path).glob('*.txt'):
    print(filepath)
sample-directory/01-sample-file.txt
sample-directory/02-sample-file.txt

To read these text files, add the open() function and .read() method inside the loop

for filepath in Path(directory_path).glob('*.txt'):
    print(open(filepath, encoding='utf-8').read())
Here's the contents of the first file!
Here's the contents of the second file!
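
Two details are worth knowing once you start using this pattern on real folders: glob() does not promise to return files in any particular order, and the '*' pattern matches folders as well as files. The sketch below is one possible way to handle both, using sorted() and .is_file(). It builds a small hypothetical directory in a temporary location (the file names are made up) so that it can run on its own, and it uses the Path method .read_text(), which opens, reads, and closes a file in one step.

```python
import tempfile
from pathlib import Path

# Build a hypothetical directory with two text files and one subfolder
demo_directory = Path(tempfile.mkdtemp())
(demo_directory / '01-demo.txt').write_text('First file.', encoding='utf-8')
(demo_directory / '02-demo.txt').write_text('Second file.', encoding='utf-8')
(demo_directory / 'subfolder').mkdir()

# Map each file name to its text, in alphabetical order
texts = {}
for filepath in sorted(demo_directory.glob('*')):
    if filepath.is_file():  # skip folders, which cannot be read as text
        texts[filepath.name] = filepath.read_text(encoding='utf-8')

for filename, text in texts.items():
    print(filename, '->', text)
```

Storing the texts in a dictionary keyed by file name keeps track of which text came from which file, which is handy if you later want to analyze the files one by one.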