Why You Should Use ArcticDB Instead of CSV to Save Your Pandas DataFrames

ArcticDB

June 7, 2024

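Storing time series data in Python? Forget CSV; ArcticDB is the smarter choice. It is designed for large datasets and is highly efficient. Consider a DataFrame with 1,000 columns (named `c0` to `c999`), each holding 100,000 rows of time series data at one-second intervals, with values that are random floats between 0 and 1, as in the example below. Saving it to CSV is cumbersome and takes 1.93GB of storage, whereas ArcticDB is much leaner at just 0.8GB.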
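A quick back-of-the-envelope check makes these sizes plausible. The sketch below rests on assumptions that are not stated in the post: that the values are held as 8-byte float64 numbers, and that the CSV writes each value in Python's default shortest round-trip text form followed by one delimiter. It ignores the index column and any compression.

```python
import random

ROWS, COLS = 100_000, 1_000
n_values = ROWS * COLS

# Binary: one float64 value takes 8 bytes.
binary_gb = n_values * 8 / 1e9

# Text: estimate the average characters per value from a random sample,
# adding 1 byte for the comma or newline that follows each value.
random.seed(0)
sample = [random.random() for _ in range(10_000)]
avg_text_bytes = sum(len(repr(x)) + 1 for x in sample) / len(sample)
csv_gb_estimate = n_values * avg_text_bytes / 1e9

print(f"binary float64: {binary_gb:.2f} GB")
print(f"CSV estimate:   {csv_gb_estimate:.2f} GB")
```

Under these assumptions the raw binary size comes out at 0.8 GB, and the text estimate lands at roughly 2 GB, in the same region as the two figures quoted above.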

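Write speed? In a Jupyter notebook with 2 cores and 8GB of memory, ArcticDB saved the data to local storage in an average of 5 seconds (over 5 runs), compared with an average of 180 seconds for CSV using the pyarrow engine. ArcticDB is fast on reads too, averaging just 1.2 seconds (over 5 runs) against 53.5 seconds for CSV.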
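The post does not show how these timings were taken. One way to reproduce an "average over 5 runs" measurement is a small helper like the one below; `average_seconds` is a hypothetical name used only for this sketch, and the results will depend on your hardware.

```python
import time
from statistics import mean


def average_seconds(func, runs=5):
    """Call func() `runs` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Example with a trivial workload; swap in your own write or read call,
# e.g. lambda: lib.write("large", large) or lambda: large.to_csv("large.csv").
print(average_seconds(lambda: sum(range(100_000))))
```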

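In addition, ArcticDB lets you query the dataset in storage. For example, you can select the time range you want to read, whereas with CSV you have to read the whole file back and query it locally.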

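ArcticDB works anywhere, from your local PC to cloud storage such as S3 or Azure Blob Storage, and it simplifies sharing by removing the hassle of file transfers. It is optimized for Pandas DataFrames, which can be stored and retrieved directly with no added complexity. Its efficient columnar storage reduces file size and speeds up access. Its built-in versioning also provides a transparent history of data modifications, an advantage CSV lacks.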

Here is some template code to help you get started. For more information, see the ArcticDB API guide: Introduction - ArcticDB

import numpy as np
import pandas as pd
import arcticdb as adb
   
# Connect to Arctic database using the specified URL
arctic = adb.Arctic("lmdb://arcticdb_demo")
 
# Get or create a library for storing data
lib = arctic.get_library('test_adb_vscsv', create_if_missing=True)
 
# Define the size of the DataFrame to be 100,000 x 1000
rows = 100000
cols = 1000
# Per-second index, matching the one-second data described in the text
large = pd.DataFrame(np.random.rand(rows, cols), columns=[f'c{i}' for i in range(cols)], index=pd.date_range('1/2/2020', periods=rows, freq="s"))
 
# Write the DataFrame to the Arctic library
write_record = lib.write("large", large)
 
# Display the write result
write_record
 
# Read the DataFrame back from the Arctic library
read_record = lib.read("large")
 
# Display the read result
read_record
 
# Print the data from the read operation
print(read_record.data)

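Did you read this mini blog and think, "I have lots of CSV files and would benefit from switching to ArcticDB"? If so, the example below shows how to import data from CSV files into ArcticDB. The `csv_to_ADB` function can be used to automate the import. You can loop over every CSV file in a folder and call `csv_to_ADB` for each one, using the CSV file name as the symbol name or supplying a new symbol name where needed. This is particularly useful for batch processing large amounts of data. Some examples of using the function are at the bottom of the code snippet.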

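Best practices

Make sure all CSV files are formatted consistently, especially the `DateTime` column, to avoid indexing errors.
Use the `pyarrow` engine to read CSV files faster, especially for large datasets.
Handle exceptions such as file read errors or conflicts when writing to ArcticDB gracefully, so the process does not stop abruptly.
If there are many files and system resources allow, consider parallel processing to speed up the whole operation.
Keep a log recording the success or failure of each file processed, for auditing and troubleshooting.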

# install pandas, arcticdb, pyarrow
import os
import numpy as np 
import pandas as pd
import arcticdb as adb
 
   
def csv_to_ADB(csv_path, lib, index_name='DateTime', symbol_name=None, engine="pyarrow"):
    """
    The function `csv_to_ADB` takes a CSV file, converts it into a pandas DataFrame,
    and writes the DataFrame to an ArcticDB library.

    The whole file is read into memory in one go. The data is written with
    `lib.update(..., upsert=True)`: the symbol is created if it does not exist,
    otherwise the rows within the new data's date range are replaced.

    Parameters
    ----------
    csv_path: string
        The name/path of the CSV file to be read.

    lib: Arctic Library
        Where the data will be stored.

    index_name: string
        The column in the CSV to be used as the DataFrame index (defaults to 'DateTime').

    symbol_name: string
        Optional new name for the symbol when stored in the Arctic library.
        The function uses the CSV file basename as the symbol name if symbol_name is not provided.

    engine: string
        The parser engine to use for reading the CSV (defaults to 'pyarrow' for performance).
    """
    if not symbol_name:
        # File name without directory or extension, e.g. 'data/test_data_1.csv' -> 'test_data_1'
        symbol_name = os.path.splitext(os.path.basename(csv_path))[0]

    df = pd.read_csv(csv_path, engine=engine)
    # update() needs a datetime index, so parse the column in case the engine left it as text
    df[index_name] = pd.to_datetime(df[index_name])
    df.set_index(index_name, inplace=True)
    lib.update(symbol_name, df, upsert=True)
  
# Connect to a local Arctic database and get or create a library
arctic = adb.Arctic("lmdb://arcticdb_demo")
lib = arctic.get_library('test_adb_vscsv', create_if_missing=True)
    
# Here are some examples of using this function. First, what we will do is:
# Make random second data from the 1/1/2020, then save that to ArcticDB from CSV using the CSV file name as the symbol.
rows = 100000
cols = 1000
data1 = pd.DataFrame(np.random.rand(rows, cols), columns=[f'c{i}' for i in range(cols)], index=pd.date_range('1/1/2020', periods=rows, freq="s"))
 
data1.index.name = 'DateTime'
data1.to_csv('test_data_1.csv', index=True)
csv_to_ADB('test_data_1.csv',lib)
print(lib.read('test_data_1').data.tail())
 
# Make random second data from 2020-01-02 03:46:40, then save that to ArcticDB from CSV using the CSV file name as the symbol.
data2 = pd.DataFrame(np.random.rand(rows, cols), columns=[f'c{i}' for i in range(cols)], index=pd.date_range('2020-01-02 03:46:40', periods=rows, freq="s"))
data2.index.name = 'DateTime'
data2.to_csv('test_data_2.csv', index=True)
csv_to_ADB('test_data_2.csv',lib)
print(lib.read('test_data_2').data)
   
# Use the CSV file for 1/1/2020 and write that to ArcticDB with a different symbol name called made_up_data_1.
csv_to_ADB('test_data_1.csv',lib, symbol_name='made_up_data_1')
print(lib.read('made_up_data_1').data)
   
# Use the CSV file for 2/1/2020 and write that to ArcticDB with a different symbol name called made_up_data_2.
csv_to_ADB('test_data_2.csv',lib, symbol_name='made_up_data_2')
print(lib.read('made_up_data_2').data)
   
   
# Update made_up_data_1 with the data from 2/1/2020 so that made_up_data_1 will have the data for both 1/1/2020 and 2/1/2020.
csv_to_ADB('test_data_2.csv',lib, symbol_name='made_up_data_1')
print(lib.read('made_up_data_1').data)
   
 
# Rewrite data to test_data_1.csv with the data from 2020-01-02 03:46:40 and write that new data to ArcticDB.
data2 = pd.DataFrame(np.random.rand(rows, cols), columns=[f'c{i}' for i in range(cols)], index=pd.date_range('2020-01-02 03:46:40', periods=rows, freq="s"))
data2.index.name = 'DateTime'
data2.to_csv('test_data_1.csv', index=True)
csv_to_ADB('test_data_1.csv',lib)
print(lib.read('test_data_1').data)
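The best practices above mention looping over a folder, handling exceptions, and keeping a log. A minimal sketch of that loop follows; `import_folder` and its `import_one` argument are hypothetical names introduced here, not part of ArcticDB. The importer is passed in as a callable so the loop itself stays independent of the storage backend.

```python
import logging
from pathlib import Path

logger = logging.getLogger("csv_import")


def import_folder(folder, import_one):
    """Call import_one(csv_path) for every *.csv file in folder.

    Returns a dict mapping each file name to "ok" or to a failure message,
    so one bad file is recorded and skipped instead of stopping the run.
    """
    results = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        try:
            import_one(str(csv_path))
            results[csv_path.name] = "ok"
            logger.info("imported %s", csv_path.name)
        except Exception as exc:  # keep going, but record what went wrong
            results[csv_path.name] = f"failed: {exc}"
            logger.exception("could not import %s", csv_path.name)
    return results
```

With the function defined earlier, a call might look like `import_folder("my_csvs", lambda path: csv_to_ADB(path, lib))`, where `my_csvs` is a placeholder folder name. The returned dict doubles as a simple audit record of which files succeeded.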
