Programming with Python
Reading and writing files
Learning Objectives
- Understand the difference between package-specific file I/O functions and Python’s file I/O system.
- Learn to write to a file.
- Understand the meaning/use of special characters
\t
and\n
. - Learn the basics of parsing a file.
There are a number of ways to read and write files in Python. In fact, we’ve already used one of these: numpy.loadtxt()
. As we’ve seen, Numpy’s loadtxt()
function is very useful for reading a tab- or comma-delimited file and storing the results as a Numpy array. That said, numpy.loadtxt()
only covers a small number of use cases. numpy.loadtxt()
won’t help us if we’re not dealing with numbers, for instance. For cases like these, we will learn make use of Python’s basic file I/O abilities.
Reading files
At it’s simplest, we can open a file with the open()
function. There is an optional argument for what file opening mode we wish to use- the default is ‘r’ for reading. Let’s open up an sample FASTA file (data/ioExample.fa
) for this exercise.
Although FASTA files are specific to genomics, they are a good example of a raw text format that can’t be handled by the likes of numpy.loadtxt()
(there is a SeqIO module from BioPython that handles this easily, but we’re going to pretend that doesn’t exist for the purpose of learning). FASTA files are quite simple in structure- a >
provides a label and other information for a nucleotide sequence (these are the letters of the DNA alphabet), then the actual data follows. You can have several sequences like these in a file.
file = open('data/ioExample.fa', 'r')
Iterating through lines of the file is quite painless… by default, a for-loop will iterate through all the lines of the file.
for line in file:
print(line)
>FBtr0070823 type=mRNA; loc=X:join(5901982..5902096,5902690..5905227); ID=FBtr0070823; name=Act5C-RB; dbxref=FlyBase_Annotation_IDs:CG4027-RB,FlyBase:FBtr0070823,REFSEQ:NM_078497,REFSEQ:NM_078497,REFSEQ:NM_078497; score=15; score_text=Strongly Supported; MD5=50caec406caeac9bd69f481d94bb76b9; length=2653; parent=FBgn0000042; release=r6.11; species=Dmel;
TTTCAGTCGGTTTATTCCAGTCATTCCTTTCAAACCGTGCGGTCGCTTAGCTCAGCCTCGCCACTTGCGTTTACAGTAGT
TTTCACGCCTTGAATTTGTTAAATCGAACAAAAAGCTTACAAAATGTGTGACGAAGAAGTTGCTGCTCTGGTTGTCGACA
[cut rest of file output]
Notice that there is an extra space between every line in this case. This is caused by two things: each line in the file ends with the \n
character (which triggers a new line), and the print()
function ends every line with a \n
by default (giving us 2 \n
s, or 2 line breaks between each line. Let’s fix our program to not add an extra \n
when printing things back. We will also close the file at the end of our program with file.close()
.
Our program should now look something like this:
file = open('data/ioExample.fa')
for line in file:
print(line, end='')
file.close()
>FBtr0070823 type=mRNA; loc=X:join(5901982..5902096,5902690..5905227); ID=FBtr0070823; name=Act5C-RB; dbxref=FlyBase_Annotation_IDs:CG4027-RB,FlyBase:FBtr0070823,REFSEQ:NM_078497,REFSEQ:NM_078497,REFSEQ:NM_078497; score=15; score_text=Strongly Supported; MD5=50caec406caeac9bd69f481d94bb76b9; length=2653; parent=FBgn0000042; release=r6.11; species=Dmel;
TTTCAGTCGGTTTATTCCAGTCATTCCTTTCAAACCGTGCGGTCGCTTAGCTCAGCCTCGCCACTTGCGTTTACAGTAGT
TTTCACGCCTTGAATTTGTTAAATCGAACAAAAAGCTTACAAAATGTGTGACGAAGAAGTTGCTGCTCTGGTTGTCGACA
ACGGCTCTGGCATGTGCAAGGCCGGATTTGCCGGAGACGATGCTCCCCGCGCCGTCTTCCCATCGATTGTGGGACGTCCC
[cut rest of file output]
Using with
Remembering to close our files every time we read from them can be a pain. Fortunately, Python has a solution for us: the with
keyword. with
starts an indented block of text, and once our program leaves this block, Python will automatically close the file. We save the variable name (in this case the filename) using the word as
. This is best demonstrated after trying things out ourselves:
with open('data/ioExample.fa') as file:
for line in file:
print(line, end='')
>FBtr0070823 type=mRNA; loc=X:join(5901982..5902096,5902690..5905227); ID=FBtr0070823; name=Act5C-RB; dbxref=FlyBase_Annotation_IDs:CG4027-RB,FlyBase:FBtr0070823,REFSEQ:NM_078497,REFSEQ:NM_078497,REFSEQ:NM_078497; score=15; score_text=Strongly Supported; MD5=50caec406caeac9bd69f481d94bb76b9; length=2653; parent=FBgn0000042; release=r6.11; species=Dmel;
TTTCAGTCGGTTTATTCCAGTCATTCCTTTCAAACCGTGCGGTCGCTTAGCTCAGCCTCGCCACTTGCGTTTACAGTAGT
TTTCACGCCTTGAATTTGTTAAATCGAACAAAAAGCTTACAAAATGTGTGACGAAGAAGTTGCTGCTCTGGTTGTCGACA
ACGGCTCTGGCATGTGCAAGGCCGGATTTGCCGGAGACGATGCTCCCCGCGCCGTCTTCCCATCGATTGTGGGACGTCCC
[cut rest of file output]
By now, many of you will have noticed that FASTA files include a header line (that starts with >
). Let’s change our program to only print these header lines.
with open('data/ioExample.fa') as file:
for line in file:
if line[0] == '>':
print(line, end='')
>FBtr0070823 type=mRNA; loc=X:join(5901982..5902096,5902690..5905227); ID=FBtr0070823; name=Act5C-RB; dbxref=FlyBase_Annotation_IDs:CG4027-RB,FlyBase:FBtr0070823,REFSEQ:NM_078497,REFSEQ:NM_078497,REFSEQ:NM_078497; score=15; score_text=Strongly Supported; MD5=50caec406caeac9bd69f481d94bb76b9; length=2653; parent=FBgn0000042; release=r6.11; species=Dmel;
>FBtr0070822 type=mRNA; loc=X:join(5900861..5901013,5902690..5904173); ID=FBtr0070822; name=Act5C-RA; dbxref=FlyBase_Annotation_IDs:CG4027-RA,FlyBase:FBtr0070822,REFSEQ:NM_167053,REFSEQ:NM_167053; score=15; score_text=Strongly Supported; MD5=c727ce58ff19fd7683f6319cb502963e; length=1637; parent=FBgn0000042; release=r6.11; species=Dmel;
>FBtr0100662 type=mRNA; loc=X:join(5900861..5901013,5902690..5904788); ID=FBtr0100662; name=Act5C-RC; dbxref=FlyBase_Annotation_IDs:CG4027-RC,FlyBase:FBtr0100662,REFSEQ:NM_001014726,REFSEQ:NM_001014726; score=15; score_text=Strongly Supported; MD5=ee68d31c61dc8c920876794d471f9ad6; length=2252; parent=FBgn0000042; release=r6.11; species=Dmel;
>FBtr0100663 type=mRNA; loc=X:join(5900861..5901013,5902690..5904464); ID=FBtr0100663; name=Act5C-RD; dbxref=FlyBase_Annotation_IDs:CG4027-RD,FlyBase:FBtr0100663,REFSEQ:NM_001014725,REFSEQ:NM_001014725; score=15; score_text=Strongly Supported; MD5=3006846211d7279213e967e20265cb0b; length=1928; parent=FBgn0000042; release=r6.11; species=Dmel;
>FBtr0345894 type=mRNA; loc=X:join(5900861..5901013,5902690..5905399); ID=FBtr0345894; name=Act5C-RE; dbxref=FlyBase:FBtr0345894,FlyBase_Annotation_IDs:CG4027-RE,REFSEQ:NM_001297986; score=15; score_text=Strongly Supported; MD5=538411fdf0fb678c9c68f422d83255db; length=2863; parent=FBgn0000042; release=r6.11; species=Dmel;
Cleaning up input
At this point, our simple program prints out the headers for the different DNA sequences, but we don’t yet have a way of parsing this information to make it useable (each header is just one gigantic string). To break this data up into more useable chunks, we will use the split() function. split()
breaks up a string into a list using a delimiter character we specify.
Note that we will change the print command to use line breaks again (by removing the end
argument). We will break up each line using semicolons as a delimiter.
with open('data/ioExample.fa') as file:
for line in file:
if line[0] == '>':
line = line.split(';')
print(line)
['>FBtr0070823 type=mRNA', ' loc=X:join(5901982..5902096,5902690..5905227)', ' ID=FBtr0070823', ' name=Act5C-RB', ' dbxref=FlyBase_Annotation_IDs:CG4027-RB,FlyBase:FBtr0070823,REFSEQ:NM_078497,REFSEQ:NM_078497,REFSEQ:NM_078497', ' score=15', ' score_text=Strongly Supported', ' MD5=50caec406caeac9bd69f481d94bb76b9', ' length=2653', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' \n']
['>FBtr0070822 type=mRNA', ' loc=X:join(5900861..5901013,5902690..5904173)', ' ID=FBtr0070822', ' name=Act5C-RA', ' dbxref=FlyBase_Annotation_IDs:CG4027-RA,FlyBase:FBtr0070822,REFSEQ:NM_167053,REFSEQ:NM_167053', ' score=15', ' score_text=Strongly Supported', ' MD5=c727ce58ff19fd7683f6319cb502963e', ' length=1637', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' \n']
['>FBtr0100662 type=mRNA', ' loc=X:join(5900861..5901013,5902690..5904788)', ' ID=FBtr0100662', ' name=Act5C-RC', ' dbxref=FlyBase_Annotation_IDs:CG4027-RC,FlyBase:FBtr0100662,REFSEQ:NM_001014726,REFSEQ:NM_001014726', ' score=15', ' score_text=Strongly Supported', ' MD5=ee68d31c61dc8c920876794d471f9ad6', ' length=2252', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' \n']
['>FBtr0100663 type=mRNA', ' loc=X:join(5900861..5901013,5902690..5904464)', ' ID=FBtr0100663', ' name=Act5C-RD', ' dbxref=FlyBase_Annotation_IDs:CG4027-RD,FlyBase:FBtr0100663,REFSEQ:NM_001014725,REFSEQ:NM_001014725', ' score=15', ' score_text=Strongly Supported', ' MD5=3006846211d7279213e967e20265cb0b', ' length=1928', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' \n']
['>FBtr0345894 type=mRNA', ' loc=X:join(5900861..5901013,5902690..5905399)', ' ID=FBtr0345894', ' name=Act5C-RE', ' dbxref=FlyBase:FBtr0345894,FlyBase_Annotation_IDs:CG4027-RE,REFSEQ:NM_001297986', ' score=15', ' score_text=Strongly Supported', ' MD5=538411fdf0fb678c9c68f422d83255db', ' length=2863', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' \n']
One thing that to notice is that the end of our lines still have the \n
character, and the line still begins with >
(we don’t want that part). We can get rid of these with some clever indexing before we use split()
.
with open('data/ioExample.fa') as file:
for line in file:
if line[0] == '>':
line = line[1:-3]
line = line.split(';')
print(line)
['FBtr0070823 type=mRNA', ' loc=X:join(5901982..5902096,5902690..5905227)', ' ID=FBtr0070823', ' name=Act5C-RB', ' dbxref=FlyBase_Annotation_IDs:CG4027-RB,FlyBase:FBtr0070823,REFSEQ:NM_078497,REFSEQ:NM_078497,REFSEQ:NM_078497', ' score=15', ' score_text=Strongly Supported', ' MD5=50caec406caeac9bd69f481d94bb76b9', ' length=2653', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' ']
['FBtr0070822 type=mRNA', ' loc=X:join(5900861..5901013,5902690..5904173)', ' ID=FBtr0070822', ' name=Act5C-RA', ' dbxref=FlyBase_Annotation_IDs:CG4027-RA,FlyBase:FBtr0070822,REFSEQ:NM_167053,REFSEQ:NM_167053', ' score=15', ' score_text=Strongly Supported', ' MD5=c727ce58ff19fd7683f6319cb502963e', ' length=1637', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' ']
['FBtr0100662 type=mRNA', ' loc=X:join(5900861..5901013,5902690..5904788)', ' ID=FBtr0100662', ' name=Act5C-RC', ' dbxref=FlyBase_Annotation_IDs:CG4027-RC,FlyBase:FBtr0100662,REFSEQ:NM_001014726,REFSEQ:NM_001014726', ' score=15', ' score_text=Strongly Supported', ' MD5=ee68d31c61dc8c920876794d471f9ad6', ' length=2252', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' ']
['FBtr0100663 type=mRNA', ' loc=X:join(5900861..5901013,5902690..5904464)', ' ID=FBtr0100663', ' name=Act5C-RD', ' dbxref=FlyBase_Annotation_IDs:CG4027-RD,FlyBase:FBtr0100663,REFSEQ:NM_001014725,REFSEQ:NM_001014725', ' score=15', ' score_text=Strongly Supported', ' MD5=3006846211d7279213e967e20265cb0b', ' length=1928', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' ']
['FBtr0345894 type=mRNA', ' loc=X:join(5900861..5901013,5902690..5905399)', ' ID=FBtr0345894', ' name=Act5C-RE', ' dbxref=FlyBase:FBtr0345894,FlyBase_Annotation_IDs:CG4027-RE,REFSEQ:NM_001297986', ' score=15', ' score_text=Strongly Supported', ' MD5=538411fdf0fb678c9c68f422d83255db', ' length=2863', ' parent=FBgn0000042', ' release=r6.11', ' species=Dmel', ' ']
Excellent, why don’t we try saving each line as an element of a list? The output of this code should be identical to that from before. I’ve added some comments here and there for readability.
# initialize our list
data = []
# open and parse the file
with open('data/ioExample.fa') as file:
for line in file:
if line[0] == '>':
line = line[1:-3]
line = line.split(';')
data.append(line)
# print back the data
for header in data:
print(header)
Writing files
Writing files is done in an almost identical manner to reading them. To demonstrate this, we will write our data to a file, and separate the data elements with tabs (\t
). Note that we absolutely need to manually add in a \n
at the end of each line, or all of the data will be written to one line.
# initialize our list
data = []
# open and parse the file
with open('data/ioExample.fa') as file:
for line in file:
if line[0] == '>':
line = line[1:-3]
line = line.split(';')
data.append(line)
# print back the data
with open('data/ioOutput.tsv') as output:
for header in data:
for element in header:
output.write(element + '\t')
output.write('\n')
Traceback (most recent call last):
File "/home/jeff/Documents/PycharmProjects/swc-sandbox/fastaIO.py", line 16, in <module>
output.write(element + '\t')
io.UnsupportedOperation: not writable
Oh no, it looks like Python isn’t letting us write to the file! The reason for this is that we need to call open()
with the optional mode
argument. Specifically, we must use mode='w'
to indicate we wish to write to the file (the default mode is 'r'
) for reading.
# initialize our list
data = []
# open and parse the file
with open('data/ioExample.fa', mode='r') as file:
for line in file:
if line[0] == '>':
line = line[1:-3]
line = line.split(';')
data.append(line)
# print back the data
with open('data/ioOutput.tsv', mode='w') as output:
for header in data:
for element in header:
output.write(element + '\t')
output.write('\n')
The file produced should look like this:
FBtr0070823 type=mRNA loc=X:join(5901982..5902096,5902690..5905227) ID=FBtr0070823 name=Act5C-RB dbxref=FlyBase_Annotation_IDs:CG4027-RB,FlyBase:FBtr0070823,REFSEQ:NM_078497,REFSEQ:NM_078497,REFSEQ:NM_078497 score=15 score_text=Strongly Supported MD5=50caec406caeac9bd69f481d94bb76b9 length=2653 parent=FBgn0000042 release=r6.11 species=Dmel
FBtr0070822 type=mRNA loc=X:join(5900861..5901013,5902690..5904173) ID=FBtr0070822 name=Act5C-RA dbxref=FlyBase_Annotation_IDs:CG4027-RA,FlyBase:FBtr0070822,REFSEQ:NM_167053,REFSEQ:NM_167053 score=15 score_text=Strongly Supported MD5=c727ce58ff19fd7683f6319cb502963e length=1637 parent=FBgn0000042 release=r6.11 species=Dmel
FBtr0100662 type=mRNA loc=X:join(5900861..5901013,5902690..5904788) ID=FBtr0100662 name=Act5C-RC dbxref=FlyBase_Annotation_IDs:CG4027-RC,FlyBase:FBtr0100662,REFSEQ:NM_001014726,REFSEQ:NM_001014726 score=15 score_text=Strongly Supported MD5=ee68d31c61dc8c920876794d471f9ad6 length=2252 parent=FBgn0000042 release=r6.11 species=Dmel
FBtr0100663 type=mRNA loc=X:join(5900861..5901013,5902690..5904464) ID=FBtr0100663 name=Act5C-RD dbxref=FlyBase_Annotation_IDs:CG4027-RD,FlyBase:FBtr0100663,REFSEQ:NM_001014725,REFSEQ:NM_001014725 score=15 score_text=Strongly Supported MD5=3006846211d7279213e967e20265cb0b length=1928 parent=FBgn0000042 release=r6.11 species=Dmel
FBtr0345894 type=mRNA loc=X:join(5900861..5901013,5902690..5905399) ID=FBtr0345894 name=Act5C-RE dbxref=FlyBase:FBtr0345894,FlyBase_Annotation_IDs:CG4027-RE,REFSEQ:NM_001297986 score=15 score_text=Strongly Supported MD5=538411fdf0fb678c9c68f422d83255db length=2863 parent=FBgn0000042 release=r6.11 species=Dmel
Applying the concepts
It looks like there is a lot of redundant information in our file. Everything to the right of the =
signs is useful, the stuff to the left of the =
is just a label. Find a way to write only the stuff to the right of the equal signs.
Appending to a file
Notice how our file we just wrote completely overwrote the previous contents of the file. Can you find a way to append to the end of a file instead?
Hint: check the mode
argument of the open()
documentation.
The final output after the challenge problems looks like this:
mRNA X:join(5901982..5902096,5902690..5905227) FBtr0070823 Act5C-RB FlyBase_Annotation_IDs:CG4027-RB,FlyBase:FBtr0070823,REFSEQ:NM_078497,REFSEQ:NM_078497,REFSEQ:NM_078497 15 Strongly Supported 50caec406caeac9bd69f481d94bb76b9 2653 FBgn0000042 r6.11 Dmel
mRNA X:join(5900861..5901013,5902690..5904173) FBtr0070822 Act5C-RA FlyBase_Annotation_IDs:CG4027-RA,FlyBase:FBtr0070822,REFSEQ:NM_167053,REFSEQ:NM_167053 15 Strongly Supported c727ce58ff19fd7683f6319cb502963e 1637 FBgn0000042 r6.11 Dmel
mRNA X:join(5900861..5901013,5902690..5904788) FBtr0100662 Act5C-RC FlyBase_Annotation_IDs:CG4027-RC,FlyBase:FBtr0100662,REFSEQ:NM_001014726,REFSEQ:NM_001014726 15 Strongly Supported ee68d31c61dc8c920876794d471f9ad6 2252 FBgn0000042 r6.11 Dmel
mRNA X:join(5900861..5901013,5902690..5904464) FBtr0100663 Act5C-RD FlyBase_Annotation_IDs:CG4027-RD,FlyBase:FBtr0100663,REFSEQ:NM_001014725,REFSEQ:NM_001014725 15 Strongly Supported 3006846211d7279213e967e20265cb0b 1928 FBgn0000042 r6.11 Dmel
mRNA X:join(5900861..5901013,5902690..5905399) FBtr0345894 Act5C-RE FlyBase:FBtr0345894,FlyBase_Annotation_IDs:CG4027-RE,REFSEQ:NM_001297986 15 Strongly Supported 538411fdf0fb678c9c68f422d83255db 2863 FBgn0000042 r6.11 Dmel