Lazy method for reading a big file in Python?

lottoking 2020. 4. 1. 08:12

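I have a very large 4GB file, and my computer hangs when I try to read it. So I want to read it piece by piece, process each piece, save the processed piece to another file, and then read the next piece.

Is there any method to yield these pieces?

I would love to have a lazy method.

To write a lazy function, just use yield: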

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
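
The question also asks for each processed piece to be saved to another file. Below is a minimal sketch of that full loop. The name process_file, the file paths, and the body of process_data (upper-casing, purely as a stand-in) are assumptions for illustration and not part of the original answer.

```python
def read_in_chunks(file_object, chunk_size=1024):
    """Yield successive chunks from an open file until it is exhausted."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def process_data(piece):
    # Hypothetical stand-in for the real processing step.
    return piece.upper()


def process_file(src_path, dst_path, chunk_size=64 * 1024):
    """Read src_path chunk by chunk and write each processed chunk to dst_path."""
    # Binary mode: chunks are raw bytes, so no text decoding is involved.
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for piece in read_in_chunks(src, chunk_size):
            dst.write(process_data(piece))
```

Only one chunk is held in memory at a time, so memory use depends on chunk_size and not on the size of the file.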

Another option would be to use iter and a helper function:

f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
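
One detail worth spelling out: the two-argument form of iter keeps calling the function until it returns the sentinel, so the sentinel has to match the file's mode ('' for text, b'' for binary). The sketch below, with the hypothetical name iter_chunks, takes the sentinel from a zero-length read, on the assumption that the object behaves like the standard io classes.

```python
from functools import partial


def iter_chunks(file_object, chunk_size=1024):
    """Iterate over fixed-size chunks using the two-argument form of iter()."""
    # read() returns '' at EOF in text mode and b'' in binary mode, so take
    # the sentinel from a zero-length read to match the object's mode.
    sentinel = file_object.read(0)
    return iter(partial(file_object.read, chunk_size), sentinel)
```

With a mismatched sentinel (for example b'' on a file opened in text mode) the loop would never terminate, because '' never compares equal to b''.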

If the file is line-based, the file object is already a lazy generator of lines:

for line in open('really_big_file.dat'):
    process_data(line)

If your computer, OS and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:

import mmap

# Python 3: open in binary mode; the map holds bytes, not str.
with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

If either your computer, OS or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
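
For a file that only needs to be read, a read-only mapping is a common variation. The sketch below searches a large file for a byte pattern without loading it into memory; the function name find_offsets is hypothetical, and the code assumes a non-empty file, since mapping an empty file raises an error.

```python
import mmap


def find_offsets(path, needle):
    """Return the byte offsets of every occurrence of needle (bytes) in the file."""
    offsets = []
    with open(path, 'rb') as f:
        # ACCESS_READ maps the file read-only; length 0 maps the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                offsets.append(pos)
                # Advance by one byte, so overlapping matches are reported too.
                pos = mm.find(needle, pos + 1)
    return offsets
```

The operating system pages the file in on demand, so only the parts that are actually touched occupy physical memory.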

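file.readlines() takes an optional size hint. It reads whole lines until their total size reaches roughly that many bytes (or characters), so each call returns a bounded batch of complete lines.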

BUF_SIZE = 64 * 1024  # size hint in bytes; an arbitrary example value
with open('bigfilename', 'r') as bigfile:
    tmp_lines = bigfile.readlines(BUF_SIZE)
    while tmp_lines:
        process(tmp_lines)
        tmp_lines = bigfile.readlines(BUF_SIZE)
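
The same loop can be wrapped in a generator so that callers simply iterate over batches of lines. This is a sketch only; the name read_line_batches and the default hint of 64 KiB are assumptions.

```python
def read_line_batches(file_object, size_hint=64 * 1024):
    """Yield lists of whole lines, each list totalling roughly size_hint in size."""
    while True:
        # readlines(hint) stops once the lines read so far exceed the hint,
        # but it never splits a line.
        lines = file_object.readlines(size_hint)
        if not lines:
            break
        yield lines
```

A typical call would be `for batch in read_line_batches(f): process(batch)`, where process is whatever handles a list of lines.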

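There are already many good answers, but I recently ran into a similar issue and the solution I needed is not listed here, so I figured I could complement this thread.

80% of the time, I need to read files line by line. In that case, as suggested in this answer, you want to use the file object itself as a lazy generator: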

with open('big.csv') as f:
    for line in f:
        process(line)

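However, I recently ran into a very big (almost) single-line csv, where the row separator was in fact not '\n' but '|'.

Reading line by line was not an option, but I still needed to process it row by row.
Converting '|' to '\n' before processing was also out of the question, because some of the fields of this csv contained '\n' (free-text user input).
Using the csv library was also ruled out because, at least in early versions of the lib, it is hardcoded to read the input line by line.

I came up with the following snippet: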

def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    ...     for row in rows(f):
    ...         process(row)
    """
    incomplete_row = None
    while True:
        chunk = f.read(chunksize)
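        if not chunk: # End of file
            if incomplete_row is not None:
                yield incomplete_row
            # Always stop at end of file, even if nothing is pending
            # (otherwise an empty file would loop forever).
            break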
        # Split the chunk as long as possible
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            # If there is an incomplete row waiting to be yielded,
            # prepend it and set it back to None
            if incomplete_row is not None:
                yield incomplete_row + chunk[:i]
                incomplete_row = None
            else:
                yield chunk[:i]
            chunk = chunk[i+1:]
        # If the chunk contained no separator, it needs to be appended to
        # the current incomplete row.
        if incomplete_row is not None:
            incomplete_row += chunk
        else:
            incomplete_row = chunk

I have tested it successfully on large files and with different chunk sizes (I even tried a chunk size of 1 byte, just to make sure the algorithm is not size dependent).
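
For comparison, here is a more compact variant of the same idea that keeps a buffer and lets str.split do the work. It is a sketch under assumptions, not the author's code: the name split_rows is hypothetical, and unlike rows it also copes with a multi-character separator that straddles a chunk boundary.

```python
def split_rows(f, chunksize=1024, sep='|'):
    """Lazily yield rows from a text file whose row separator is sep."""
    buffer = ''
    read_anything = False
    while True:
        chunk = f.read(chunksize)
        if not chunk:  # End of file
            break
        read_anything = True
        buffer += chunk
        # Everything before the last separator is complete; the tail may
        # still be continued by the next chunk, so keep it in the buffer.
        *complete, buffer = buffer.split(sep)
        yield from complete
    if read_anything:
        # Whatever follows the final separator is the last row.
        yield buffer
```

The trade-off is that the buffer is re-scanned on every chunk, so a single very long row costs more than it does in the find-based loop.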

f = ... # file-like object, i.e. supporting read(size) function and
        # returning empty string '' when there is nothing to read
        # (for a file opened in binary mode the sentinel must be b'')

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    pass  # process the data

Update: this approach is best explained in https://.com/a/4566523/38592

I think we can write it like this:

def read_file(path, block_size=1024): 
    with open(path, 'rb') as f: 
        while True: 
            piece = f.read(block_size) 
            if piece: 
                yield piece 
            else: 
                return

for piece in read_file(path):
    process_piece(piece)

I can't comment because of my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint]).

Python file methods

Edit: SilentGhost is right, but this should be better than:

s = ""
for i in range(100):
    s += next(file)

I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes. I usually don't, but the number of records (lines) required is known:

def get_line():
     with open('4gb_file') as file:
         for i in file:
             yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]

Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line "between" chunks.

chunk = [next(gen) for i in range(lines_required)]

This does the trick without losing any lines, but it doesn't look very nice.
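
The zip version drops a line because zip pulls the next item from gen before it notices that range is exhausted. One tidier alternative is itertools.islice, which takes exactly the requested number of items and simply returns fewer at the end of the file. The sketch below uses the hypothetical name line_batches.

```python
from itertools import islice


def line_batches(lines, batch_size):
    """Yield lists of up to batch_size items from any iterable of lines."""
    iterator = iter(lines)
    while True:
        # islice consumes at most batch_size items and never one extra.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Since a file object is itself an iterator over its lines, it can be passed in directly, for example `for chunk in line_batches(f, 100): ...` inside a `with open('4gb_file') as f:` block.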

Please refer to Python's official documentation: https://docs.python.org/zh-cn/3/library/functions.html?#iter

Maybe this method is more Pythonic:

from functools import partial

# A file object returned by open() has a read method that takes the size
# of the block to read, so partial() can turn it into a zero-argument
# callable for iter().
with open('mydata.db', 'rb') as f_in:

    part_read = partial(f_in.read, 1024*1024)
    # The sentinel b'' only matches at EOF when the file is opened in binary mode.
    iterator = iter(part_read, b'')

    for index, block in enumerate(iterator, start=1):
        block = process_block(block)    # process block data
        with open(f'{index}.txt', 'wb') as f_out:
            f_out.write(block)

For processing line by line, this is an elegant solution:

def stream_lines(file_name):
    with open(file_name) as file:
        while True:
            line = file.readline()
            if not line:
                break
            yield line

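As long as there are no blank lines. (In practice blank lines are safe too: readline() returns '\n' for a blank line and '' only at end of file.)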

You can use the following code.

file_obj = open('big_file')

open() returns a file object.

Then use os.stat to get the size:

import os

file_size = os.stat('big_file').st_size

# Round up so that the final partial chunk is read as well.
for i in range((file_size + 1023) // 1024):
    print(file_obj.read(1024))

Source URL: https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python
