The Data Preview Challenge at DataByte Co.

Sun Aug 24 2025

At DataByte Co., a large tech company with a global presence, the Data Platform team has a crucial mission: to make data accessible. Their latest project is a Data Portal, an interface built by the Data Portal team to help users across the organization explore and discover the company's millions of datasets, which are distributed across development, pre-production, and production environments in various AWS accounts worldwide.

A key feature in high demand is data preview. The Data Portal team wanted to let their users quickly glance at the first few rows of a file or report to decide if a dataset was relevant for their analysis.

Their first idea was to use AWS Lake Formation, a data lake security tool. The approach seemed logical. They would create views showing only a sample of the data and attach a specific Lake Formation tag to them. Users would then get permission to query the tagged views, without direct access to the complete datasets. But both teams quickly ran into a problem. They discovered that for a user to query a view, Lake Formation requires them to also have permissions on the underlying tables and data. This was a huge blow, because many of those tables hold sensitive data. The whole purpose of the view was to act as a security barrier, and this requirement removed that benefit. To offer a preview, the Data Platform team would have had to give users broad access to the data, which went against all governance rules.


The Financial and Operational Cost of the Wrong Approach

This limitation had a direct financial and operational impact. To work around it, the teams considered creating smaller duplicate tables just for previews. But first, let's look at what broad access would cost, with a quick simulation using Amazon Athena. Athena charges based on the amount of data scanned, currently $5 per terabyte.

Consider a 10 TB dataset. To get a preview, a view would typically scan just a few megabytes to return the top 5 rows. But if a user holds permissions on the entire underlying table, any misconfigured or overly broad query could scan the full 10 TB. At $5 per terabyte, that is $50 for a single query. With thousands of users and countless daily preview attempts, the costs would quickly become huge and unpredictable.
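The arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch: it uses the $5 per terabyte price quoted in this article, treats a terabyte as 10^12 bytes, and ignores any per-query minimums or rounding that Athena may apply. The 5 MB preview size and the function name are illustrative assumptions, not measured values.

```python
# Rough cost sketch based on the figures in the article.
# Assumptions: $5 per TB scanned, 1 TB = 10**12 bytes, no per-query minimum.
PRICE_PER_TB_USD = 5.0
TB = 10**12
MB = 10**6


def scan_cost_usd(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the approximate cost in USD of scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / TB * price_per_tb


# A query that accidentally scans the whole 10 TB dataset.
full_scan = scan_cost_usd(10 * TB)   # 50.0

# A well-behaved preview that scans about 5 MB (hypothetical figure).
preview = scan_cost_usd(5 * MB)

print(f"full scan: ${full_scan:.2f}, preview: ${preview:.6f}")
```

The gap between the two numbers is the point: one runaway query costs as much as roughly two million well-behaved previews under these assumptions.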

The alternative of creating physical preview tables was not viable either. These tables would have to be refreshed, manually or on a schedule, to reflect the latest changes in the source data. This creates pipeline overhead and introduces delays. Users would not have a truly live view of the data. The lack of automatic data freshness made this solution operationally unsustainable for a dynamic data environment. Taken together, the Lake Formation route was not only a security risk but also a financial and operational liability. It was clearly a dead end.


The Solution: A Guard Role and an API

Instead of pursuing this dead-end approach, the Data Platform team rethought their architecture. They decided to build a new solution, one more aligned with their security principles and built to operate in a global, multi-account environment. Their central API acts as a single point of entry for the Data Portal team. It securely communicates with the development, pre-production, and production accounts in different regions such as America, Asia, and Europe. This model ensures that data governance rules are followed globally. It also solves a complex cross-account security challenge. The solution has two parts:

  • The Guard Role: The Data Platform team created a special IAM role called the "preview role." This role is not located in the central API account but is deployed in each AWS account around the world. It has minimal permissions and is the only entity that can directly query the data for a preview.
  • The Secure Gateway: The team updated their central API, which runs as a Lambda function, to preview data in all AWS accounts worldwide. Now, when the Data Portal team's application requests a preview, the central API assumes the "preview role" within the specific target account. It then executes a targeted Athena query in a specific workgroup: SELECT * FROM sales_data LIMIT 5. The results are stored in an S3 bucket, retrieved, and sent back to the Data Portal team through a secure stream.
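To make the flow above more concrete, here is a minimal Python sketch of what the gateway logic could look like with boto3. It is not DataByte Co.'s actual implementation. The role name "preview-role", the function names, the polling interval, and the assumption that the Athena workgroup already defines its S3 result location are all hypothetical. The clients are passed in as parameters so the logic can be exercised without touching AWS.

```python
import re
import time

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_preview_query(table: str, limit: int = 5) -> str:
    """Build the preview query, rejecting anything that is not a plain table name."""
    if not IDENTIFIER.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


def preview_role_arn(account_id: str, role_name: str = "preview-role") -> str:
    """ARN of the preview role in the target account (role name is hypothetical)."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def boto3_athena_factory(region: str, creds: dict):
    """Create an Athena client in the target region using the assumed-role credentials."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3
    return boto3.client(
        "athena",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def run_preview(sts, make_athena, account_id, region, database, table,
                workgroup, limit=5, poll_seconds=1.0):
    """Assume the preview role, run the LIMIT query, and return rows as lists of strings."""
    # 1. Assume the minimal-permission role deployed in the target account.
    creds = sts.assume_role(
        RoleArn=preview_role_arn(account_id),
        RoleSessionName="data-preview",
    )["Credentials"]
    athena = make_athena(region, creds)

    # 2. Start the query in the dedicated workgroup (assumed to define the S3 output location).
    query_id = athena.start_query_execution(
        QueryString=build_preview_query(table, limit),
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    # 3. Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"preview query ended in state {state}")

    # 4. Fetch the results; the first row returned holds the column names.
    result = athena.get_query_results(QueryExecutionId=query_id, MaxResults=limit + 1)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

In a Lambda handler, this could be called as run_preview(boto3.client("sts"), boto3_athena_factory, ...) with the account, region, database, table, and workgroup taken from the preview request. Validating the table name before building the query keeps the gateway from turning into a general-purpose query endpoint, which fits the role's minimal-permission intent.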

Despite Lake Formation's limitations, DataByte Co. can now provide its users with a secure, efficient, and cost-effective way to explore data across its global ecosystem. They do this while strictly adhering to their governance rules. Everyone is happy.