How can you use mmap to improve file-reading performance in Python?
Introduction...
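mmap, short for memory mapping, maps a file into the process's address space so that its data can be accessed directly through the operating system's virtual memory instead of through ordinary I/O functions. This can improve I/O performance because it does not require a separate system call for each access and avoids copying data between buffers.
More generally, data held in memory, such as an SQLite database created in memory, can usually be accessed faster than its on-disk counterpart.
A memory-mapped file can be treated as a mutable byte string or as a file-like object, depending on what you need.
mmap objects support many methods, such as close(), flush(), read(), readline(), seek(), tell(), and write(), and they work well with slicing and even regular expressions.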
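As a minimal sketch of the slicing and regular-expression support mentioned above, the snippet below maps a small temporary file and queries it. The sample bytes and the temporary file are assumptions made for illustration only.

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample data written to a temporary file for the demo
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet\nId porro facete cum\n")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        first_word = m[:5]           # slicing returns a bytes object
        offset = m.find(b"dolor")    # byte offset of a substring
        # The map holds bytes, so the pattern must be a bytes pattern
        words = re.findall(rb"\b\w{5}\b", m)

os.remove(path)
print(first_word, offset, words)
```

Note that the pattern is written as rb"..." because a mapped file exposes bytes, not str.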
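How to do it...
1. Assume there is a text file with the following content. You can find sample text like this by searching Google for placeholder text. Copy the content into a file named input.txt.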
Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.
Id porro facete cum. No est veritus detraxit facilisis, sit ea clita decore essent. Ut eam labores fuisset menandri, ex sit brute viderer eleifend, altera argumentum vel ex. Duo at zril sensibus, eu vim ullum assentior, quando possit at his.
Te nam tempor posidonium scripserit, eam mundi reprimique dissentias ne. Vim te soleat offendit democritum. Nam an diam elaboraret, quaeque dissentias an has. Autem legendos dignissim ad vis, sea ex amet petentium reprehendunt, inermis constituam philosophia ne mel. Esse noster lobortis usu ne.
Nec reque postea urbanitas ut, mea in nulla invidunt ocurreret. Ei duo iuvaret numquam. Ferri nemore audire te est, mel et detracto noluisse. Nec eu habeo justo, id pro posse apeirian volutpat. Mea sonet quaestio ne.
Atqui quaeque alienum te vim. Graeco aliquip liberavisse pro ut. Te similique reformidans usu, te mundi aliquando ius. Meis scripta minimum quo no, meis prima fabellas eu eam, laoreet delicata forensibus ut vim. Et quo vocibus mediocritatem, atqui summo an eam.
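2. We will use the mmap() function to create a memory-mapped file. Its first argument is a file descriptor, which you can get from a file object's fileno() method or from os.open().
Note: the caller is responsible for opening the file before calling mmap() and for closing it afterwards.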
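The os.open() route can be sketched as follows. This is an illustrative example rather than part of the recipe; the temporary file and its contents are assumptions.

```python
import mmap
import os
import tempfile

# Hypothetical sample file for the demo
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet")
    path = tmp.name

# os.open() returns a raw integer file descriptor
fd = os.open(path, os.O_RDONLY)
try:
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
        head = m[:11]
finally:
    os.close(fd)      # closing the descriptor is the caller's job
    os.remove(path)

print(head)
```

Using a file object with a with statement, as in the rest of this recipe, handles the closing step automatically.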
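The second argument to mmap() is a size in bytes that says how much of the file to map. If the value is 0, the whole file is mapped. An optional access argument is also available: ACCESS_READ for read-only access, ACCESS_WRITE for write-through access, and ACCESS_COPY for copy-on-write access.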
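To make the length and access arguments concrete, here is a small sketch that maps only the first 11 bytes of a file read-only and then attempts a write. The file and its contents are assumptions for illustration.

```python
import mmap
import os
import tempfile

# Hypothetical sample file for the demo
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet")
    path = tmp.name

with open(path, "rb") as f:
    # Map only the first 11 bytes, read-only
    with mmap.mmap(f.fileno(), 11, access=mmap.ACCESS_READ) as m:
        mapped_len = len(m)
        try:
            m[0:5] = b"LOREM"        # writing to a read-only map
            write_rejected = False
        except TypeError:
            write_rejected = True

os.remove(path)
print(mapped_len, write_rejected)
```

A read-only map refuses the slice assignment, which is a useful safety net when the file must not be altered.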
import mmap

input_text = """Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.
Id porro facete cum. No est veritus detraxit facilisis, sit ea clita decore essent. Ut eam labores fuisset menandri, ex sit brute viderer eleifend, altera argumentum vel ex. Duo at zril sensibus, eu vim ullum assentior, quando possit at his.
Te nam tempor posidonium scripserit, eam mundi reprimique dissentias ne. Vim te soleat offendit democritum. Nam an diam elaboraret, quaeque dissentias an has. Autem legendos dignissim ad vis, sea ex amet petentium reprehendunt, inermis constituam philosophia ne mel. Esse noster lobortis usu ne.
Nec reque postea urbanitas ut, mea in nulla invidunt ocurreret. Ei duo iuvaret numquam. Ferri nemore audire te est, mel et detracto noluisse. Nec eu habeo justo, id pro posse apeirian volutpat. Mea sonet quaestio ne.
Atqui quaeque alienum te vim. Graeco aliquip liberavisse pro ut. Te similique reformidans usu, te mundi aliquando ius. Meis scripta minimum quo no, meis prima fabellas eu eam, laoreet delicata forensibus ut vim. Et quo vocibus mediocritatem, atqui summo an eam.
"""

# Create an input file with some text
input_file = 'input.txt'
with open(input_file, "w") as f:
    f.write(input_text)

# Open the file in read mode and map the whole file read-only
with open(input_file, 'r') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        print(f"Output \n*** Output first 5 bytes of the {input_file} is {m.read(5)} ")
        print(f"*** Output Next 10 bytes of the {input_file} is {m.read(10)} ")
Output
*** Output first 5 bytes of the input.txt is b'Lorem'
*** Output Next 10 bytes of the input.txt is b' ipsum dol'
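3. We opened the file, mapped it into memory, and read the first 5 bytes with .read(). After that first read, the file pointer has moved forward 5 bytes, so a second read such as read(10) returns bytes 6 through 15.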
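The pointer movement can be checked with tell(). The following is a small sketch on an assumed sample file, not part of the recipe itself.

```python
import mmap
import os
import tempfile

# Hypothetical sample file for the demo
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        first = m.read(5)
        pos_after_first = m.tell()     # position after reading 5 bytes
        second = m.read(10)
        pos_after_second = m.tell()    # position after reading 10 more

os.remove(path)
print(first, pos_after_first, second, pos_after_second)
```

Calling m.seek(0) would rewind the pointer to the start, as the later examples do.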
4. To set up a memory-mapped file for updates, open it in 'r+' mode (not 'w') before mapping it.
The following example shows how to modify part of a line in place.
import mmap
import shutil

input_file = 'input.txt'
input_copy = input_file.replace('input', 'input_copy')

# Make a copy of the file so the original stays unmodified
shutil.copyfile(input_file, input_copy)

# Word to find
word = b'ipsum'
# Replacement: the same word reversed, so the length is unchanged
modified_word = word[::-1]

# Open the copy for reading and writing
with open(input_copy, 'r+') as f:
    with mmap.mmap(f.fileno(), 0) as m:
        print(f"output \n *** Line before updates \n {m.readline().rstrip()}")

        # Rewind using seek
        m.seek(0)

        # Find the word and reverse it
        loc = m.find(word)
        m[loc:loc + len(word)] = modified_word
        m.flush()

        # Rewind using seek
        m.seek(0)
        print(f" \n *** Line after updates \n {m.readline().rstrip()}")

        f.seek(0)
        print(f" \n *** Final file \n {f.readline().rstrip()}")
Output
*** Line before updates
b'Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.'
*** Line after updates
b'Lorem muspi dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.'
*** Final file
Lorem muspi dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.
5. The word "ipsum" in the middle of the first line has been replaced with "muspi", both in memory and in the file.
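One detail worth illustrating: an in-place slice assignment has to keep the same length, which is why the example swaps a word for its reverse. The sketch below, on an assumed sample file, shows a longer replacement being rejected and a same-length one reaching the disk.

```python
import mmap
import os
import tempfile

# Hypothetical sample file for the demo
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet")
    path = tmp.name

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        loc = m.find(b"ipsum")
        try:
            # 11 bytes cannot be assigned to a 5-byte slice
            m[loc:loc + 5] = b"replacement"
            longer_accepted = True
        except (IndexError, ValueError):
            longer_accepted = False
        # A same-length replacement works in place
        m[loc:loc + 5] = b"IPSUM"
        m.flush()

with open(path, "rb") as f:
    on_disk = f.read()

os.remove(path)
print(longer_accepted, on_disk)
```

If the replacement must be longer or shorter, the file generally has to be rewritten rather than patched through the map.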
6. If for any reason you want to see the changes in memory without updating the file on disk, use ACCESS_COPY.
import mmap
import shutil

input_file = 'input.txt'
input_copy = input_file.replace('input', 'input_copy')

# Make a copy of the file so the original stays unmodified
shutil.copyfile(input_file, input_copy)

# Word to find
word = b'ipsum'
# Replacement: the same word reversed, so the length is unchanged
modified_word = word[::-1]

# Open the copy and map it copy-on-write
with open(input_copy, 'r+') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
        print(f"output \n *** Line before updates \n {m.readline().rstrip()}")

        # Rewind using seek
        m.seek(0)

        # Find the word and reverse it
        loc = m.find(word)
        m[loc:loc + len(word)] = modified_word
        m.flush()

        # Rewind using seek
        m.seek(0)
        print(f" \n *** Line after updates \n {m.readline().rstrip()}")

        f.seek(0)
        print(f" \n *** Final file \n {f.readline().rstrip()}")
Output
*** Line before updates
b'Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.'
*** Line after updates
b'Lorem muspi dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.'
*** Final file
Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.
7. Observe that the file on disk is unchanged: the final read still shows "ipsum", and the change was applied only in memory.