TLDR: Scrooge can now scan the Ethereum blockchain via the JSON-RPC protocol. This blog post covers the details on how to use it and the future plans for it.

Scrooge is an open-source, MIT license, DuckDB extension that provides functions and scanners aimed at easily integrating DuckDB with financial data analysis. I’ve recently added the capability to scan the Ethereum blockchain via the JSON-RPC protocol, and this blog post will cover its current capabilities and limitations. Note that the scanner is in an experimental state; both the API and overall performance are likely to change. It is not intended for production use!

I’ve been a cryptocurrency enthusiast for many years. The idea of being able to execute transactions without centralized control, combined with the algorithmic nature of coin production, has always resonated with me. I’ve held cryptocurrencies through all their ups and downs, and one of my very close friends even gifted me a fake bitcoin during the 2018 bear markets as joke :-). bitcoin

As a crypto enthusiast, I always found the available services for blockchain analysis to be lacking. The key-value storage structure in which data is stored makes running analytical or exploratory workloads cumbersome.

Of course, some services to query the blockchain exist, but there are limitations on local, free, open-source options for efficiently executing queries on blockchain data.

For that reason, I decided to implement an Ethereum scanner in Scrooge, allowing people to query Ethereum data directly in DuckDB. The scanner allows users to gather information about token transfers and uniswap pools, it is parallel, supports a progress bar, and implements a projection pushdown optimization.

All examples provided in this blogpost are available in a easy-to-run colab example.

There are many reasons to use DuckDB for such a use case. Here is a short list of what I believe to be the most beneficial:

  • Free: DuckDB and Scrooge are both completely free and open-source, allowing you to execute queries directly on your laptop with no added costs.
  • Privacy: Since you are performing local queries and DuckDB/Scrooge don’t perform any kind of telemetry, your privacy is guaranteed.
  • Efficiency: DuckDB is one of the fastest database systems for data analysis. It implements many state-of-the-art techniques for data analytics, such as columnar storage, vectorized execution, unnesting of subqueries, and more.
  • Integration: DuckDB supports writing to many common file formats, including Parquet, CSV, and JSON. It also supports efficient integration with dataframe libraries like Pandas and Polars, so you can get your Ethereum data in almost any shape or form desired.

Setup

Since scrooge is a third-party unsigned DuckDB extension, you must execute the following to properly install it in your DuckDB instance.

import duckdb
# Starts a memory database, with the configuratin that allows for unsigned extensions
con = duckdb.connect(':memory:', config={'allow_unsigned_extensions' : 'true'})

# Sets a custom S3 bucket, where DuckDB will look for non-official extensions.
con.execute("SET custom_extension_repository='https://scrooge-duck.s3.us-west-2.amazonaws.com';")
con.execute("SET allow_extensions_metadata_mismatch=true;")


# Install Scrooge
con.execute("INSTALL 'scrooge'")
# Loads Scrooge
con.execute("LOAD 'scrooge'")

How it works

Scrooge currently queries Ethereum data via the JSON-RPC protocol. Therefore, a client node that supports the protocol must be available. In general, I run a local node on an external SSD drive, but you can use any available RPC provider (e.g., Infura.

By default, the scanner connects to http://127.0.0.1:8545, but you can connect to a different provider by setting the eth_node_url option.

By default, the scanner connects to http://127.0.0.1:8545 but you can connect to a provider by setting the eth_node_url option.

e.g.,

set eth_node_url= 'https://mempool.merkle.io/rpc/eth/pk_mbs_0b647b195065b3294a5254838a33d062';

The scanner, called read_eth, takes five parameters:

  1. The address, token symbol, or Uniswap pool you want to gather information from.
  2. An event type. Currently, only transfer and sync are currently supported.
  3. The starting block number for scanning.
  4. The ending block number for scanning.
  5. (Optional) The number of blocks you want to scan per thread. This is important for parallelism and to overcome limitations of free RPC providers.

In essence, the scanner makes several eth_getLogs requests and transforms the results into a DuckDB table.

For example, you can query USDT transfers with the following SQL query:

FROM read_eth(
    'USDT',
    'Transfer',
    20034078,
    20034100,
    blocks_per_thread = 10
)
limit 1
----
"0xdac17f958d2ee523a2206206994597c13d831ec7"  Transfer  "0x0208a65db173bee464fe73c39156a25aa63b8e7f21b658dce7b933b563fe31f6"  20034081  [2867479] 186 false [0x00000000000000000000000022f9dcf4647084d6c31b2765f6910cd85c178c18, 0x0000000000000000000000007afa9d836d2fccf172b66622625e56404e465dbd]  "0x28e66e57b41a89e8854a4fecfc0a0e6f234609e707a379e3670b2fe12490e2b9"  62

In this case, the query returns one USDT transfer from account 0x00000000000000000000000022f9dcf4647084d6c31b2765f6910cd85c178c18 to 0x0000000000000000000000007afa9d836d2fccf172b66622625e56404e465dbd of value 2867479.

Scrooge also supports direct token and uniswap symbols translation to addresses, but there is also the possibility of directly, querying a given address. As an example the following query would provide the same result.

FROM read_eth(
    '0xdac17f958d2ee523a2206206994597c13d831ec7',
    'Transfer',
    20034078,
    20034100,
    blocks_per_thread = 10
)
limit 1

Scrooge also tries to give advice on possible symbols, in case of mispelling.

For example

FROM read_eth(
    'USDM',
    'Transfer',
    20034078,
    20034100
);
----
Invalid Input Error: Failed to infer the address "USDM". Address must be either a hex (e.g., 0x...) or a valid token symbol.
Suggested token symbols: 
SD, USDY, USDV, USDT, USDP

Exporting Ethereum data

One amazing feature of DuckDB is its versatility in handling different data formats.

Exporting data from the Ethereum blockchain to CSVs, Parquets, JSONs is as easy as:

import duckdb

con = duckdb.connect()

query = '''
FROM read_eth(
    '0xdac17f958d2ee523a2206206994597c13d831ec7',
    'Transfer',
    20034078,
    20034100,
    blocks_per_thread = 10
)
'''

# Writes a CSV File
con.sql(query).write_csv('ether.csv')
# Writes a Parquet File
con.sql(query).write_parquet('ether.parquet')
# Writes JSON
write_json_query = f"COPY ({query}) TO \'ether.json\';"
con.execute(write_json_query)

ROI example

As said in the introduction, the scanner supports Transfer and Sync Events, and it supports querying token symbols and uniswap pools. As an example of what you can do with this, the following query can answer the question of: who were the best day-traders of ethereum on June 28th.

We start of by filtering WETH transactions and storing them in a DuckDB Table, we get the block_number, the amount of the transfer and the addresses:

CREATE TABLE WETH_Transfers AS
SELECT
    block_number,
    data[1] / 1000000000000000000 AS amount,
    lower(topics[1]) AS from_address,
    lower(topics[2]) AS to_address
FROM read_eth(
    'WETH',
    'Transfer',
    20190000,
    20195000,
    blocks_per_thread = 100
);

We then query the WETH_USDT uniswap pool, to get the value of WETH in USD, at a given block.

CREATE TABLE WETH_Prices AS
SELECT
    block_number,
    (data[2] / data[1]) * 1000000000000 AS price
FROM read_eth(
    'WETH_USDT',
    'SYNC',
    20189990,
    20195000,
    blocks_per_thread = 100
);

Finally, the following query calculates the transaction values in and out of each address. And returns the top 5 accounts with the highest net values over that day.

WITH WETH_Transfer_Values AS (
    SELECT
        t.block_number,
        t.amount,
        t.from_address,
        t.to_address,
        p.price,
        t.amount * p.price AS value
    FROM WETH_Transfers t
    JOIN LATERAL (
        SELECT price
        FROM WETH_Prices p
        WHERE p.block_number <= t.block_number
        ORDER BY p.block_number DESC
        LIMIT 1
    ) p ON true
),
Address_Values AS (
    SELECT
        address,
        SUM(value) AS net_value,
        SUM(CASE WHEN type = 'in' THEN value ELSE 0 END) AS total_value_in,
        SUM(CASE WHEN type = 'out' THEN value ELSE 0 END) AS total_value_out,
        SUM(amount) AS net_amount
    FROM (
        SELECT from_address AS address, -value AS value, 'out' AS type, -amount as amount FROM WETH_Transfer_Values
        UNION ALL
        SELECT to_address AS address, value AS value, 'in' AS type, amount as amount FROM WETH_Transfer_Values
    ) AS combined
    GROUP BY address
)

SELECT
    address
FROM Address_Values
WHERE net_amount > 0
ORDER BY net_value  DESC NULLS LAST
LIMIT 5;
----
'0x000000000000000000000000a397a8c2086c554b531c02e29f3291c9704b00c7'
'0x00000000000000000000000059cd1c87501baa753d0b5b5ab5d8416a45cd71db'
'0x0000000000000000000000004d5f47fa6a74757f35c14fd3a6ef8e3c9bc514e8'
'0x0000000000000000000000009008d19f58aabd9ed0d60971565aa8510560ab41'
'0x00000000000000000000000088e6a0c2ddd26feeb64f039a2c41296fcb3f5640'

Limitations and Future

Scrooge is an experimental DuckDB extension, and there are several limitations with the Ethereum scanner. I’ll list the ones I consider most important, along with how I plan to address them in the near future.

  • uint256 values: DuckDB currently supports uint128 values, which is what I use in the scanner. I’ve added a strict option to the scanner that defaults to false. When strict is set to false, if a value can’t be cast to uint128, it will be nullified by the scanner. If set to true, the scanner will throw an error. There is a VARINT type currently under development in DuckDB. When that is released, the scanner will default to that type and will properly represent uint256 values.
  • Performance: JSON protocols are inherently slow. The server has to spend time producing a JSON, and we also have to parse it and transform it into DuckDB internal values. However, Ethereum clients run LevelDB and RocksDB systems underneath. One thing DuckDB excels at is reading storage formats of other database systems (e.g., PostgreSQL, MySQL, SQLite. I plan to integrate a direct scanner for LevelDB’s storage, which would result in orders-of-magnitude performance improvements when using a local client. JSON-RPC will still be relevant for use cases where running a local client is not desirable. This integration will also allow for proper filter and projection pushdown optimization to the lowest data level.
  • Filter pushdown: The parameters of the scanner are essentially data filters. I think they look inelegant and will most likely implement filter pushdown, allowing users to write these filters in a WHERE clause.
  • Other Events/Methods: Scrooge currently only supports eth_getLogs and Transfer, Sync events. This will be expanded to include other methods and events in the future.

Please, report any issues you might encounter in scrooge at scrooge’s issue tracker.


<
Previous Post
Scrooge: Analyzing Yahoo Financial Data In DuckDB
>
Blog Archive
Archive of all previous blog posts