If you have external Apache Hive tables with partitions stored in Amazon S3, the easiest way to list the S3 file paths is to query the MySQL hive metastore directly.
When running a Hive query against our Amazon S3 backed table, I encountered this error:
java.lang.IllegalArgumentException: Can not create a Path from an empty string
It was suggested to me that perhaps one of the files in S3 had been moved or deleted. After saying “Crap!”, I decided that I needed to get a list of all of the hive table partition locations so that I can see which ones, if any were no longer there.
Using the DESCRIBE FORMATTED HQL command, one can get the Location value of a single partition, but this table had over 50,000 partitions, so that meant running over 50,000 queries and then parsing the results. This wasn’t going to do.
Instead, I found where in the MySQL Hive metastore these paths were stored and exported them in order to pipe into an aws s3 ls command.
Here is the query that I used to get the data that I needed out of a MySQL metastore database:
SELECT SDS.LOCATION FROM TBLS INNER JOIN PARTITIONS ON(TBLS.TBL_ID=PARTITIONS.TBL_ID) INNER JOIN SDS ON(PARTITIONS.SD_ID=SDS.SD_ID) WHERE TBLS.TBL_NAME='table_name' INTO OUTFILE '/tmp/s3paths.csv';
Note that in my install of CDH, the default hive metastore user is hiveuser and the default password is password. However, by default, the hiveuser won’t have access to write to a file. You may need to grant it permissions or just use the MySQL root user.
mysql --user=root -p use metastore;