Which steps are recommended best practices for prioritizing cluster keys in Snowflake? (Choose two.)
Correct Answer:AD
According to the Snowflake documentation, the best practices for choosing clustering keys are:
✑ Choose columns that are frequently used in join predicates. This can improve join performance by reducing the number of micro-partitions that need to be scanned and joined.
✑ Choose columns that are most actively used in selective filters. This can improve scan efficiency by skipping micro-partitions that do not match the filter predicates.
✑ Avoid using low cardinality columns, such as gender or country, as clustering keys. This can result in poor clustering and high maintenance costs.
✑ Avoid using TIMESTAMP columns with nanoseconds, as they tend to have very high cardinality and low correlation with other columns. This can also result in poor clustering and high maintenance costs.
✑ Avoid using columns with duplicate values or NULLs, as they can cause skew in the clustering and reduce the benefits of pruning.
✑ Cluster on multiple columns if the queries use multiple filters or join predicates. This can increase the chances of pruning more micro-partitions and improve the compression ratio.
✑ Clustering is not always useful, especially for small or medium-sized tables, or for tables that are not frequently queried or updated. Clustering incurs additional costs for initially clustering the data and maintaining the clustering over time.
References:
✑ Clustering Keys & Clustered Tables | Snowflake Documentation
✑ [Considerations for Choosing Clustering for a Table | Snowflake Documentation]
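The selection guidelines above can be sketched as a toy heuristic. This is not a Snowflake feature, just an illustration of the rules in the bullets; the column statistics, field names, and thresholds (distinct-count floor of 10, cardinality-ratio ceiling of 0.9) are made-up assumptions.

```python
# Toy heuristic mirroring the clustering-key guidelines: prefer columns used
# in selective filters or joins, skip very low cardinality columns, and skip
# nanosecond-precision timestamps. All thresholds are illustrative only.

def candidate_clustering_keys(columns):
    """columns: list of dicts with name, distinct_count, row_count,
    used_in_filter_or_join (bool), is_nanosecond_ts (bool)."""
    candidates = []
    for col in columns:
        cardinality_ratio = col["distinct_count"] / col["row_count"]
        if not col["used_in_filter_or_join"]:
            continue  # guideline: cluster on actively filtered/joined columns
        if col["distinct_count"] < 10:
            continue  # guideline: avoid low-cardinality columns (e.g. gender)
        if col["is_nanosecond_ts"] or cardinality_ratio > 0.9:
            continue  # guideline: avoid near-unique columns (e.g. ns timestamps)
        candidates.append(col["name"])
    return candidates

cols = [
    {"name": "gender", "distinct_count": 2, "row_count": 1_000_000,
     "used_in_filter_or_join": True, "is_nanosecond_ts": False},
    {"name": "event_ts_ns", "distinct_count": 999_998, "row_count": 1_000_000,
     "used_in_filter_or_join": True, "is_nanosecond_ts": True},
    {"name": "sale_date", "distinct_count": 730, "row_count": 1_000_000,
     "used_in_filter_or_join": True, "is_nanosecond_ts": False},
]
print(candidate_clustering_keys(cols))  # → ['sale_date']
```

Only sale_date survives: gender fails the cardinality floor, and the nanosecond timestamp is excluded as near-unique, matching the guidance above.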
An Architect is integrating an application that needs to read and write data to Snowflake without installing any additional software on the application server.
How can this requirement be met?
Correct Answer:C
The Snowflake SQL REST API is a REST API that you can use to access and update data in a Snowflake database. You can use this API to execute standard queries and most DDL and DML statements. This API can be used to develop custom applications and integrations that can read and write data to Snowflake without installing any additional software on the application server. Option A is not correct because SnowSQL is a command-line client that requires installation and configuration on the application server. Option B is not correct because the Snowpipe REST API is used to load data from cloud storage into Snowflake tables, not to read or write data to Snowflake. Option D is not correct because the Snowflake ODBC driver is a software component that enables applications to connect to Snowflake using the ODBC protocol, which also requires installation and configuration on the application server. References: The answer can be verified from Snowflake's official documentation on the Snowflake SQL REST API available on their website. Here are some relevant links:
✑ Snowflake SQL REST API | Snowflake Documentation
✑ Introduction to the SQL API | Snowflake Documentation
✑ Submitting a Request to Execute SQL Statements | Snowflake Documentation
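As a sketch of why no extra software is needed, the request below is built with only the Python standard library. The account URL, warehouse name, token, and SQL statement are hypothetical placeholders; the endpoint path /api/v2/statements and the headers follow the SQL API documentation. The request is constructed but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical account URL and warehouse; replace with real values.
ACCOUNT_URL = "https://myorg-myaccount.snowflakecomputing.com"
payload = {
    "statement": "SELECT COUNT(*) FROM orders",  # hypothetical query
    "timeout": 60,
    "warehouse": "APP_WH",
}
req = urllib.request.Request(
    url=f"{ACCOUNT_URL}/api/v2/statements",
    data=json.dumps(payload).encode("utf-8"),
    method="POST",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <jwt-token>",  # key-pair JWT or OAuth token
        "X-Snowflake-Authorization-Token-Type": "KEYPAIR_JWT",
    },
)
# urllib.request.urlopen(req) would submit the statement; nothing beyond the
# standard library is required, which is the point of this option.
print(req.get_method(), req.full_url)
```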
An Architect has designed a data pipeline that is receiving small CSV files from multiple sources. All of the files are landing in one location. Specific files are filtered for loading into Snowflake tables using the COPY command. The loading performance is poor.
What changes can be made to improve the data loading performance?
Correct Answer:B
According to the Snowflake documentation, the data loading performance can be improved by following some best practices and guidelines for preparing and staging the data files. One of the recommendations is to aim for data files that are roughly 100-250 MB (or larger) in size compressed, as this will optimize the number of parallel operations for a load. Smaller files should be aggregated and larger files should be split to achieve this size range. Another recommendation is to use a multi-cluster warehouse for loading, as this will allow for scaling up or out the compute resources depending on the load demand.
A single-cluster warehouse may not be able to handle the load concurrency and throughput efficiently. Therefore, by creating a multi-cluster warehouse and merging smaller files to create bigger files, the data loading performance can be improved. References:
✑ Data Loading Considerations
✑ Preparing Your Data Files
✑ Planning a Data Load
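The "merge smaller files to create bigger files" recommendation can be sketched as a simple batching pass. This is an illustration only: the file names and sizes are made up, and the 150 MB target is just one point inside the recommended 100-250 MB compressed range.

```python
# Illustrative planner that groups small CSV files into merge batches
# aimed at the recommended 100-250 MB compressed size range.

TARGET_MB = 150  # assumed target inside the 100-250 MB sweet spot

def plan_merge_batches(files, target_mb=TARGET_MB):
    """files: list of (name, size_mb) tuples; returns a list of batches,
    each batch being a list of file names to merge into one load file."""
    batches, current, current_size = [], [], 0
    for name, size_mb in files:
        current.append(name)
        current_size += size_mb
        if current_size >= target_mb:    # close the batch once large enough
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)          # leftover smaller batch
    return batches

small_files = [(f"load_{i}.csv", 10) for i in range(40)]  # 40 x 10 MB files
batches = plan_merge_batches(small_files)
print(len(batches))  # → 3 (two batches of 15 files, one of 10)
```

Merging 40 tiny files into 3 larger ones lets a warehouse load with fewer, fuller parallel operations, which is the effect the guideline is after.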
A healthcare company wants to share data with a medical institute. The institute is running a Standard edition of Snowflake; the healthcare company is running a Business Critical edition.
How can this data be shared?
Correct Answer:D
By default, Snowflake does not allow sharing data from a Business Critical edition to a non-Business Critical edition, because the Business Critical edition provides enhanced security and data protection features that are not available in lower editions. However, the data provider can override this restriction by setting the SHARE_RESTRICTIONS parameter to false when adding the consumer account to the share (for example, ALTER SHARE my_share ADD ACCOUNTS = <consumer_account> SHARE_RESTRICTIONS=false). This parameter can only be set by the data provider, not the data consumer. Also, setting this parameter to false may reduce the level of security and data protection for the shared data. References:
✑ Enable Data Share:Business Critical Account to Lower Edition
✑ Sharing Is Not Allowed From An Account on BUSINESS CRITICAL Edition to an Account On A Lower Edition
✑ SQL Execution Error: Sharing is Not Allowed from an Account on BUSINESS CRITICAL Edition to an Account on a Lower Edition
✑ Snowflake Editions | Snowflake Documentation
How can the Snowpipe REST API be used to keep a log of data load history?
Correct Answer:D
The Snowpipe REST API provides two endpoints for retrieving the data load history: insertReport and loadHistoryScan. The insertReport endpoint returns the status of the files that were submitted to the insertFiles endpoint, while the loadHistoryScan endpoint returns the history of the files that were actually loaded into the table by Snowpipe. To keep a log of data load history, it is recommended to use the loadHistoryScan endpoint, which provides more accurate and complete information about the data ingestion process. The loadHistoryScan endpoint accepts a start time and an end time as parameters, and returns the files that were loaded within that time range. The maximum time range that can be specified is 15 minutes, and the maximum number of files that can be returned is 10,000. Therefore, to keep a log of data load history, the best option is to call the loadHistoryScan endpoint every 10 minutes for a 15-minute time range, and store the results in a log file or a table. This way, the log will capture all the files that were loaded by Snowpipe, and avoid any gaps or overlaps in the time range. The other options are incorrect because:
✑ Calling insertReport every 20 minutes, fetching the last 10,000 entries, will not provide a complete log of data load history, as some files may be missed or duplicated due to the asynchronous nature of Snowpipe. Moreover, insertReport only returns the status of the files that were submitted, not the files that were loaded.
✑ Calling loadHistoryScan every minute for the maximum time range will result in too many API calls and unnecessary overhead, as the same files will be returned multiple times. Moreover, the maximum time range is 15 minutes, not 1 minute.
✑ Calling insertReport every 8 minutes for a 10-minute time range will suffer from the same problems as option A, and will also create gaps or overlaps in the time range. References:
✑ Snowpipe REST API
✑ Option 1: Loading Data Using the Snowpipe REST API
✑ PIPE_USAGE_HISTORY
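The recommended schedule, calling loadHistoryScan every 10 minutes over a 15-minute window, can be sketched as date arithmetic to show why it leaves no gaps. This only models the polling windows; it does not call the Snowpipe REST API, and the start time is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone

# Model of the polling schedule described above: a loadHistoryScan call
# every 10 minutes, each covering the preceding 15 minutes, so consecutive
# windows overlap by 5 minutes and no loaded file can fall into a gap.

CALL_INTERVAL = timedelta(minutes=10)
WINDOW = timedelta(minutes=15)  # maximum range per the explanation above

def scan_windows(start, calls):
    """Yield (startTimeInclusive, endTimeExclusive) pairs, one per call."""
    for i in range(calls):
        call_time = start + i * CALL_INTERVAL
        yield (call_time - WINDOW, call_time)

t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
windows = list(scan_windows(t0, 3))
for lo, hi in windows:
    # Each window starts 5 minutes before the previous one ends.
    print(lo.isoformat(), "→", hi.isoformat())
```

Deduplicating files that appear in the 5-minute overlap (e.g. by file path) then yields a complete, gap-free load log.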
A data platform team creates two multi-cluster virtual warehouses with the AUTO_SUSPEND value set to NULL on one, and '0' on the other. What would be the execution behavior of these virtual warehouses?
Correct Answer:D
The AUTO_SUSPEND parameter controls the amount of time, in seconds, of inactivity after which a warehouse is automatically suspended. According to the Snowflake documentation, setting the parameter to either NULL or '0' means that the warehouse never suspends automatically. Therefore, the two virtual warehouses will exhibit the same execution behavior: both will keep running, and consuming credits, until they are suspended manually or a resource monitor limit is reached. References:
✑ ALTER WAREHOUSE
✑ Parameters