In your CREATE TABLE DDL statements, the LOCATION fields point to individual files instead of S3 folders. That is what causes the "zero records returned" issue you encountered.
This is because distributed systems that handle large amounts of data operate on folders of files rather than on individual files. For example, in Presto (the underlying engine of Athena), a table maps to a folder of files with the same schema. Likewise, in Spark (the underlying engine of Glue ETL jobs), a DataFrame represents a folder of files with the same schema. Because of how these engines operate, the underlying SerDe used to read and write the data expects the specified LOCATION to be a folder.
When you create the table in the Glue Catalog, the issue doesn't surface yet, because at that point you've only created metadata in the Catalog. However, when the engine (Athena) tries to use that metadata (including the SerDe) to read the files, it fails to recognize them.
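
If you have many tables to fix, a small helper can derive the folder to use in LOCATION from each file path. This is only a sketch: `folder_location` and the bucket and key names are made up for illustration, and it assumes every input that doesn't end in `/` is a file.

```python
import posixpath
from urllib.parse import urlparse


def folder_location(s3_uri: str) -> str:
    """Return the S3 folder (prefix ending in '/') that contains s3_uri.

    Hypothetical helper: assumes any key without a trailing '/' is a file.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")

    bucket = parsed.netloc
    key = parsed.path.lstrip("/")

    # Already folder-style (bucket root or trailing slash): return unchanged.
    if key == "" or key.endswith("/"):
        return f"s3://{bucket}/{key}"

    # Drop the file name and keep the parent prefix.
    parent = posixpath.dirname(key)
    return f"s3://{bucket}/{parent}/" if parent else f"s3://{bucket}/"


# Example with a made-up bucket and key:
print(folder_location("s3://my-bucket/data/orders/orders.csv"))
```

Since a table maps to every file in the folder, each table's files should sit in their own folder. If the helper returns a folder that also holds files with a different schema, move the files into separate folders first and point LOCATION at those.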