ArcticDB
June 7, 2024
Storing time-series data in Python? Forget CSV: ArcticDB is the smarter choice. It is built for large datasets and offers remarkable efficiency. Consider a DataFrame with 1,000 columns (named `c0` to `c999`), each holding 100,000 rows of time-series data indexed at one-second intervals, with random floats between 0 and 1, as in the example below. Saving it to CSV is cumbersome, taking 1.93 GB of storage, whereas ArcticDB is far leaner at just 0.8 GB.
Write speed? On a Jupyter notebook with 2 cores and 8 GB of RAM, ArcticDB saved the data to local storage in an average of 5 seconds (over 5 runs), while saving the CSV with the pyarrow engine averaged 180 seconds. ArcticDB is fast to read back too, averaging just 1.2 seconds (over 5 runs) versus 53.5 seconds for the CSV.
On top of that, ArcticDB lets you query the dataset in storage: you can, for example, select just the date range you want to read, whereas with a CSV you have to read the whole file back and query it locally.
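Here is a minimal sketch of such an in-storage query, using the `date_range` argument to `read`. The symbol name `demo_slice` and the two-hour window are made up for this illustration:

from datetime import datetime

import numpy as np
import pandas as pd
import arcticdb as adb

arctic = adb.Arctic("lmdb://arcticdb_demo")
lib = arctic.get_library('test_adb_vscsv', create_if_missing=True)

# Write one day of one-second data to query against (illustrative symbol name)
df = pd.DataFrame(
    {"price": np.random.rand(86400)},
    index=pd.date_range("2020-01-01", periods=86400, freq="s"),
)
lib.write("demo_slice", df)

# Ask storage for a two-hour window only; rows outside it are never read back
window = lib.read(
    "demo_slice",
    date_range=(datetime(2020, 1, 1, 9), datetime(2020, 1, 1, 11)),
).data
print(window.index.min(), window.index.max())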
ArcticDB works anywhere, from your local PC to cloud storage such as S3 or Azure Blob Storage, and it simplifies sharing by removing the hassle of file transfers. It is optimized for Pandas DataFrames, which can be stored and retrieved directly, with no added complexity. Thanks to its efficient columnar storage, ArcticDB shrinks file sizes and speeds up access. It also has built-in versioning that provides a transparent history of data modifications, an advantage CSV lacks.
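As a quick illustration of that versioning, the sketch below writes a symbol twice and reads the earlier version back with `as_of`. The symbol name `versioned_demo` is invented for this example, and the S3 connection string in the comment is only a placeholder, so check the ArcticDB docs for the exact URL format your storage needs:

import pandas as pd
import arcticdb as adb

# Local LMDB here; a cloud-backed Arctic differs only in the connection string,
# e.g. something like adb.Arctic("s3://s3.eu-west-2.amazonaws.com:my-bucket?aws_auth=true")
arctic = adb.Arctic("lmdb://arcticdb_demo")
lib = arctic.get_library('test_adb_vscsv', create_if_missing=True)

first = lib.write("versioned_demo", pd.DataFrame({"x": [1, 2, 3]}))
lib.write("versioned_demo", pd.DataFrame({"x": [10, 20, 30]}))  # a new version

print(lib.read("versioned_demo").data)                       # latest write
print(lib.read("versioned_demo", as_of=first.version).data)  # time-travel to the first write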
Below is some template code to get you started. For more information, see the ArcticDB API guide: Introduction — ArcticDB
import numpy as np
import pandas as pd
import arcticdb as adb
# Connect to Arctic database using the specified URL
arctic = adb.Arctic("lmdb://arcticdb_demo")
# Get or create a library for storing data
lib = arctic.get_library('test_adb_vscsv', create_if_missing=True)
# Define the size of the DataFrame to be 100,000 x 1000
rows = 100000
cols = 1000
large = pd.DataFrame(np.random.rand(rows, cols), columns=[f'c{i}' for i in range(cols)], index=pd.date_range('1/2/2020', periods=rows, freq="s"))
# Write the DataFrame to the Arctic library
write_record = lib.write("large", large)
# Display the write result
print(write_record)
# Read the DataFrame back from the Arctic library
read_record = lib.read("large")
# Display the read result
print(read_record)
# Print the data from the read operation
print(read_record.data)
Did you read this mini-blog and think, "I have lots of CSV files; I could benefit from switching to ArcticDB"? If so, below is an example showing how to import data from CSV files into ArcticDB. The `csv_to_ADB` function can be used to automate that import. You can loop through every CSV file in a folder and call `csv_to_ADB` for each one, using the CSV file name as the symbol name or supplying a new symbol name where needed. This is particularly useful when processing large amounts of data in bulk; see the batch-processing sketch at the end of this post. You will find some examples of using the function at the bottom of the code snippet.
Best practices
Make sure all CSV files are formatted consistently, especially the `DateTime` column, to avoid indexing errors.
Use the `pyarrow` engine for faster CSV reads, especially on large datasets.
Handle exceptions gracefully, such as file-read errors or conflicts when writing to ArcticDB, so the process does not stop abruptly.
If there are many files and system resources allow, consider parallel processing to speed up the overall operation.
Keep a log recording the success or failure of each file processed, for auditing and troubleshooting.
# install pandas, arcticdb, pyarrow
import os
import numpy as np
import pandas as pd
import arcticdb as adb
def csv_to_ADB(csv_path, lib, index_name='DateTime', symbol_name=None, engine="pyarrow"):
    """
    The function `csv_to_ADB` reads a CSV file into a pandas DataFrame
    and writes the DataFrame to an ArcticDB library.

    Parameters
    ----------
    csv_path: string
        The name/path of the CSV file to be read.
    lib: Arctic Library
        Where the data will be stored.
    index_name: string
        The column in the CSV to be used as the DataFrame index (defaults to 'DateTime').
    symbol_name: string
        Optional new name for the symbol when stored in the Arctic library.
        The function uses the CSV file basename as the symbol name if symbol_name is not provided.
    engine: string
        The parser engine to use for reading the CSV (defaults to 'pyarrow' for performance).
    """
    if not symbol_name:
        # Default the symbol name to the file basename without its extension
        symbol_name = os.path.splitext(os.path.basename(csv_path))[0]
    df = pd.read_csv(csv_path, engine=engine)
    df.set_index(index_name, inplace=True)
    # Make sure the index is a proper datetime index; update() requires one
    df.index = pd.to_datetime(df.index)
    # upsert=True creates the symbol on first write; later calls merge the
    # new date range into the data already stored under the symbol
    lib.update(symbol_name, df, upsert=True)
# Connect to a local Arctic database and get or create a library
arctic = adb.Arctic("lmdb://arcticdb_demo")
lib = arctic.get_library('test_adb_vscsv', create_if_missing=True)
# Here are some examples of using this function.
# First: generate random one-second-interval data starting 2020-01-01, save it to a CSV,
# then load that CSV into ArcticDB using the CSV file name as the symbol.
rows = 100000
cols = 1000
data1 = pd.DataFrame(np.random.rand(rows, cols), columns=[f'c{i}' for i in range(cols)], index=pd.date_range('1/1/2020', periods=rows, freq="s"))
data1.index.name = 'DateTime'
data1.to_csv('test_data_1.csv', index=True)
csv_to_ADB('test_data_1.csv', lib)
print(lib.read('test_data_1').data.tail())
# Generate random one-second-interval data starting 2020-01-02 03:46:40 (the second after
# data1 ends), save it to a CSV, then load that CSV into ArcticDB using the CSV file name as the symbol.
data2 = pd.DataFrame(np.random.rand(rows, cols), columns=[f'c{i}' for i in range(cols)], index=pd.date_range('2020-01-02 03:46:40', periods=rows, freq="s"))
data2.index.name = 'DateTime'
data2.to_csv('test_data_2.csv', index=True)
csv_to_ADB('test_data_2.csv', lib)
print(lib.read('test_data_2').data)
# Use the CSV file for 2020-01-01 and write that to ArcticDB with a different symbol name called made_up_data_1.
csv_to_ADB('test_data_1.csv', lib, symbol_name='made_up_data_1')
print(lib.read('made_up_data_1').data)
# Use the CSV file for 2020-01-02 and write that to ArcticDB with a different symbol name called made_up_data_2.
csv_to_ADB('test_data_2.csv', lib, symbol_name='made_up_data_2')
print(lib.read('made_up_data_2').data)
# Update made_up_data_1 with the data from 2020-01-02 so that made_up_data_1 holds the data
# for both 2020-01-01 and 2020-01-02.
csv_to_ADB('test_data_2.csv', lib, symbol_name='made_up_data_1')
print(lib.read('made_up_data_1').data)
# Rewrite test_data_1.csv with new data starting 2020-01-02 03:46:40 and load it into ArcticDB;
# because csv_to_ADB uses update, the test_data_1 symbol now spans both the original and the new range.
data2 = pd.DataFrame(np.random.rand(rows, cols), columns=[f'c{i}' for i in range(cols)], index=pd.date_range('2020-01-02 03:46:40', periods=rows, freq="s"))
data2.index.name = 'DateTime'
data2.to_csv('test_data_1.csv', index=True)
csv_to_ADB('test_data_1.csv', lib)
print(lib.read('test_data_1').data)
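Finally, here is a minimal batch-processing sketch in the spirit of the best practices above. It reuses `lib`, `os`, and `csv_to_ADB` from the snippet; the `csv_folder` path is a made-up example, and the error handling and logging are deliberately simple. Parallelizing the loop (for instance with `concurrent.futures`) is an option when file counts are high and resources allow:

import glob
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("csv_to_adb")

csv_folder = "csv_data"  # hypothetical folder holding your CSV files

for csv_path in sorted(glob.glob(os.path.join(csv_folder, "*.csv"))):
    try:
        # Symbol name defaults to the CSV file basename
        csv_to_ADB(csv_path, lib)
        log.info("Imported %s", csv_path)
    except Exception:
        # Record the failure and keep going so one bad file does not stop the run
        log.exception("Failed to import %s", csv_path)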