Python Pandas - Working with the HDF5 Format

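When working with large datasets, we may run into "out of memory" errors. Optimized storage formats such as HDF5 help avoid these problems, because the data stays on disk and can be read back selectively. The pandas library provides tools such as the HDFStore class and read/write APIs that make it easy to store, retrieve, and manipulate data while keeping memory usage low and retrieval fast.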

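HDF5 stands for Hierarchical Data Format version 5, an open-source file format designed to store large, complex, and heterogeneous data efficiently. It organizes data in a hierarchy that resembles a file system: groups act as directories and datasets act as files. Because a single HDF5 file can hold different kinds of data (for example arrays, images, tables, and documents) in that hierarchy, it is well suited to managing heterogeneous data.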
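To make the directory analogy concrete, here is a minimal sketch that uses the pandas HDFStore class (covered in the next section) to store two objects under slash-separated keys. The file name, key names, and sample values are hypothetical and chosen only for illustration; the sketch assumes PyTables is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file and key names, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "hierarchy_demo.h5")

temps = pd.DataFrame({"celsius": [21.5, 22.1, 19.8]})
humidity = pd.DataFrame({"percent": [40, 42, 55]})

with pd.HDFStore(path, mode="w") as store:
    # Slash-separated keys create nested groups, much like directories.
    store.put("sensors/temperature", temps)
    store.put("sensors/humidity", humidity)
    # keys() returns the full path of every stored object.
    keys = sorted(store.keys())

print(keys)
```

Both objects end up under a common "sensors" group, so related data can be kept together inside one file.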

Working with the HDF5 Format Using HDFStore

The HDFStore class in pandas manages an HDF5 file through a dictionary-like interface. It relies on the PyTables library to read and write pandas objects in the HDF5 format.

Example: Creating an HDF5 File with HDFStore in Pandas

The following example shows how to create an HDF5 file with the pandas.HDFStore class.

import pandas as pd

# Create the store using the HDFStore class
store = pd.HDFStore("store.h5")

# Display the store
print(store)

# It is important to close the store after use
store.close()

Following is the output of the above code:

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

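Note: To use the HDF5 format in pandas you need the PyTables library. It is an optional dependency of pandas and must be installed separately with one of the following commands: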

# Using pip
pip install tables

# or using conda installer
conda install pytables

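Example: Writing and Reading HDF5 Data with HDFStore in Pandas

Because HDFStore is a dictionary-like object, we can write data to and read data from the HDF5 store directly with key-value access. The following example demonstrates this: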

import pandas as pd
import numpy as np

# Create the store
store = pd.HDFStore("store.h5")

# Create the data 
index = pd.date_range("1/1/2024", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

# Write Pandas data to the Store, which is equivalent to store.put('s', s)
store["s"] = s  
store["df"] = df

# Read Data from the store, which is equivalent to store.get('df')
from_store = store["df"]
print('Retrieved Data From the HDFStore:\n', from_store)

# Close the store after use
store.close()

Following is the output of the above code (the values are random, so yours will differ):

Retrieved Data From the HDFStore:
                    A         B         C
2024-01-01  0.200467  0.341899  0.105715
2024-01-02 -0.379214  1.527714  0.186246
2024-01-03 -0.418122  1.008820  1.331104
2024-01-04  0.146418  0.587433 -0.750389
2024-01-05 -0.556524 -0.551443 -0.161225
2024-01-06 -0.214145 -0.722693  0.072083
2024-01-07  0.631878 -0.521474 -0.769847
2024-01-08 -0.361999  0.435252  1.177110
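The dictionary-like behaviour goes beyond item assignment and lookup. The sketch below is a small illustration, with a hypothetical file name and sample Series, of membership tests, len(), and del on a store. It also opens the store in a with block so that the file is closed automatically, as an alternative to calling close() by hand.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "dict_demo.h5")
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# The with block closes the store even if an error occurs inside it.
with pd.HDFStore(path, mode="w") as store:
    store["s"] = s                  # write, same as store.put("s", s)
    has_s = "s" in store            # membership test on a key
    n_items = len(store)            # number of stored objects
    del store["s"]                  # remove the object under the key
    has_s_after = "s" in store

still_open = store.is_open

print(has_s, n_items, has_s_after, still_open)
```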

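Reading and Writing the HDF5 Format with the Pandas API

Pandas also provides high-level APIs that simplify working with HDFStore, that is, with HDF5 files. They let you read and write data in HDF5 files directly, without creating an HDFStore object yourself. The main pandas APIs for HDF5 files are:

pandas.read_hdf(): reads data from an HDFStore.
pandas.DataFrame.to_hdf() or pandas.Series.to_hdf(): writes the data of a pandas object to an HDF5 file using HDFStore.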

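Writing Pandas Data to HDF5 with to_hdf()

The to_hdf() method writes pandas objects such as DataFrame and Series directly to an HDF5 file through HDFStore. It accepts optional parameters for compression (complevel, complib), handling of missing values (dropna), the storage format (format), and more, so you can control how the data is stored.

Example

This example writes data to an HDF5 file with the DataFrame.to_hdf() method.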

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# Write data to an HDF5 file using the to_hdf()
df.to_hdf("data_store.h5", key="df", mode="w", format="table")

print("Data successfully written to HDF5 file")

Following is the output of the above code:

Data successfully written to HDF5 file
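The compression options of to_hdf() mentioned earlier can be sketched as follows. The file name and sample data are hypothetical, and zlib at level 9 is only one possible choice. The sketch makes no claim about the resulting file size, which depends on the data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file name and sample data, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "compressed_demo.h5")
df = pd.DataFrame({"A": np.arange(1000), "B": np.zeros(1000)})

# complevel ranges from 0 (no compression) to 9 (highest);
# complib selects the compression library.
df.to_hdf(path, key="df", mode="w", format="table",
          complevel=9, complib="zlib")

# Compression is transparent on read: no extra arguments are needed.
restored = pd.read_hdf(path, key="df")
print(restored.shape)
```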

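Reading Data from HDF5 with read_hdf()

The pandas.read_hdf() function retrieves a pandas object stored in an HDF5 file. Its first argument identifies the file to read from, given either as a file path or as an open HDFStore object.

Example

This example shows how to use pd.read_hdf() to read the data stored under the key "df" in the HDF5 file "data_store.h5".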

import pandas as pd

# Read data from the HDF5 file using the read_hdf()
retrieved_df = pd.read_hdf("data_store.h5", key="df")

# Display the retrieved data
print("Retrieved Data:\n", retrieved_df.head())

Following is the output of the above code:

Retrieved Data:
    A  B
x  1  4
y  2  5
z  3  6
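read_hdf() can also load only part of a stored object, which is what keeps memory usage low for large files. The sketch below is an illustration under two assumptions: the data was written with format="table", and the columns were made queryable with data_columns=True. The file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# data_columns=True makes every column usable in a where expression.
df.to_hdf(path, key="df", mode="w", format="table", data_columns=True)

# Only the rows matching the condition, and only column A, are loaded.
subset = pd.read_hdf(path, key="df", where="A > 1", columns=["A"])
print(subset)
```

Selection with where is not available for the default fixed format, so the format has to be chosen when the data is written.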

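Appending Data to an HDF5 File with to_hdf()

You can append data to an existing HDF5 file by calling to_hdf() with mode="a" and append=True. mode="a" opens the existing file without truncating it, and append=True adds the new rows to the object already stored under the key, which requires format="table". This is useful when you want to add new data to a file without overwriting its existing contents.

Example

This example shows how to append data to an existing HDF5 file with the to_hdf() method.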

import pandas as pd

# Create a DataFrame to append
df_new = pd.DataFrame({'A': [7, 8], 'B': [1, 1]}, index=['i', 'j'])

# Append the new data to the existing HDF5 file
df_new.to_hdf("data_store.h5", key="df", mode="a", format="table", append=True)

print("Data successfully appended")

# Now read data from the HDF5 file using the read_hdf()
retrieved_df = pd.read_hdf("data_store.h5", key='df')

# Display the retrieved data
print("Retrieved Data:\n", retrieved_df.head())

Following is the output of the above code:

Data successfully appended
Retrieved Data:
    A  B
x  1  4
y  2  5
z  3  6
i  7  1
j  8  1
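The roles of mode="a" and append=True are easy to confuse, so the following sketch contrasts them. The file name, the key "other", and the sample frames are hypothetical. Without append=True, writing to an existing key replaces the stored object, even in mode="a".

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name and sample data, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "append_demo.h5")
first = pd.DataFrame({"A": [1, 2]})
second = pd.DataFrame({"A": [3, 4]})

# Case 1: mode="a" without append=True replaces the object under the key.
first.to_hdf(path, key="df", mode="w", format="table")
second.to_hdf(path, key="df", mode="a", format="table")
replaced = pd.read_hdf(path, key="df")

# Case 2: mode="a" with append=True adds rows to the existing table.
first.to_hdf(path, key="df", mode="w", format="table")
second.to_hdf(path, key="df", mode="a", format="table", append=True)
appended = pd.read_hdf(path, key="df")

# Case 3: mode="a" with a new key leaves the existing key untouched.
second.to_hdf(path, key="other", mode="a", format="table")
with pd.HDFStore(path, mode="r") as store:
    keys = sorted(store.keys())

print(replaced["A"].tolist(), appended["A"].tolist(), keys)
```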
