Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?
Correct Answer:C
A pandas API on Spark DataFrame is made up of a Spark DataFrame with additional metadata. The pandas API on Spark aims to provide a pandas-like experience with the scalability and distributed nature of Spark, allowing users to apply familiar pandas functions to large datasets by leveraging Spark's underlying capabilities.
References:
✑ Databricks documentation on pandas API on Spark: pandas API on Spark
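The relationship can be sketched as follows. This is a minimal illustration: the commented lines assume a PySpark environment (pyspark.pandas ships with PySpark 3.2+); the executable part uses plain pandas only to show that the API surface is the same.

```python
# Sketch: a pandas-on-Spark DataFrame wraps a native Spark DataFrame while
# exposing the familiar pandas API (Spark-specific lines are commented and
# assume a live SparkSession).
import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# In a Spark session, the same data can live in a pandas-on-Spark DataFrame:
# import pyspark.pandas as ps
# psdf = ps.from_pandas(pdf)     # distributed, but pandas-like API
# sdf = psdf.to_spark()          # the underlying native Spark DataFrame
# back = sdf.pandas_api()        # re-wrap with pandas-on-Spark metadata

# The pandas-style call below uses the same syntax psdf["value"].sum() would,
# except that on Spark it executes as a distributed job:
total = pdf["value"].sum()
```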
A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".
Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?
Correct Answer:B
To register a model that has been identified by a specific run_id in the MLflow Model Registry, the appropriate line of code is: mlflow.register_model(f"runs:/{run_id}/model", "best_model")
This code correctly specifies the path to the model within the run (runs:/{run_id}/model) and registers it under the name "best_model" in the Model Registry. This allows the model to be tracked, managed, and transitioned through different stages (e.g., Staging, Production) within the MLflow ecosystem.
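The pieces fit together as in the sketch below; the run ID is a placeholder, and the commented registry call assumes a configured MLflow tracking server.

```python
# Sketch: registering the best run's model (placeholder run ID for illustration;
# a real tracking server/registry is assumed for the commented calls).
run_id = "abcdef1234567890"

# The model URI points at the artifact logged under the name "model" in that run.
model_uri = f"runs:/{run_id}/model"

# import mlflow
# result = mlflow.register_model(model_uri, "best_model")
# result.version is 1 for the first version registered under "best_model"
```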
References
✑ MLflow documentation on model registry: https://www.mlflow.org/docs/latest/model-registry.html#registering-a-model
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
Correct Answer:B
Vectorized pandas UDFs, also known as pandas UDFs, are a powerful feature in PySpark that allows for more efficient operations than standard UDFs. They process data in batches, using vectorized pandas operations on whole batches at once. This approach is much more efficient than the row-by-row processing typical of standard PySpark UDFs, which can significantly speed up computation.
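The batch-oriented core of a pandas UDF can be sketched as follows. The function itself is plain pandas (Series in, Series out); the commented Spark registration assumes a SparkSession and a DataFrame `df` with a "price" column, both illustrative.

```python
# Sketch: the core of a vectorized (pandas) UDF is a function that maps a whole
# pandas Series batch to a Series in one vectorized operation.
import pandas as pd

def double_price(prices: pd.Series) -> pd.Series:
    # A single vectorized multiply over the entire batch, not a per-row loop.
    return prices * 2.0

# In Spark, the same function would be wrapped and fed Arrow-backed batches:
# from pyspark.sql.functions import pandas_udf
# double_udf = pandas_udf(double_price, returnType="double")
# df.select(double_udf("price")).show()

# The function itself runs on any pandas batch:
result = double_price(pd.Series([100.0, 200.0]))
```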
References
✑ PySpark Documentation on UDFs:https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs
Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?
Correct Answer:D
Spark MLlib is a machine learning library within Apache Spark that provides scalable, distributed machine learning algorithms. It is designed to work with Spark DataFrames and leverages Spark's distributed computing capabilities to perform large-scale feature engineering and model training without the need for user-defined functions (UDFs) or the pandas Function API. Spark MLlib provides built-in transformations and algorithms that can be applied directly to large datasets.
References:
✑ Databricks documentation on Spark MLlib: Spark MLlib
A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.
Which of the following approaches can the data scientist use to accomplish this MLflow run organization?
Correct Answer:B
To organize MLflow runs with one parent run for the tuning process and a child run for each unique combination of hyperparameter values, the data scientist can specify nested=True when starting each child run. This ensures that each child run is properly nested under the parent run, maintaining a clear hierarchical structure for the experiment and making it easy to track and compare hyperparameter combinations within the same tuning process.
References:
✑ MLflow Documentation (Managing Nested Runs).
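The parent/child layout can be sketched as below. The grid and parameter names are illustrative, and the commented MLflow calls assume a configured tracking environment; the combination loop itself is plain Python.

```python
# Sketch: one parent run with one nested child run per hyperparameter combo
# (illustrative grid; MLflow calls are commented and assume a tracking setup).
import itertools

grid = {"max_depth": [2, 5, 10], "n_estimators": [50, 100]}
combos = list(itertools.product(*grid.values()))

# import mlflow
# with mlflow.start_run(run_name="tuning"):          # parent run
#     for max_depth, n_estimators in combos:
#         with mlflow.start_run(nested=True):        # child of the active run
#             mlflow.log_param("max_depth", max_depth)
#             mlflow.log_param("n_estimators", n_estimators)
```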
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
Correct Answer:D
To determine the number of machine learning models that can be trained in parallel, we need to calculate the total number of combinations of hyperparameters. The given hyperparameter grid includes:
✑ Hyperparameter 1: [2, 5, 10] (3 values)
✑ Hyperparameter 2: [50, 100] (2 values)
The total number of combinations is the product of the number of values for each hyperparameter: 3 (values of Hyperparameter 1) × 2 (values of Hyperparameter 2) = 6.
With 3-fold cross-validation, each combination of hyperparameters is evaluated 3 times, so the total number of models trained is: 6 (combinations) × 3 (folds) = 18.
However, the number of models that can be trained in parallel is equal to the number of hyperparameter combinations, not the total number of models considering cross-validation. Therefore, 6 models can be trained in parallel.
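The arithmetic above can be checked with a short worked count:

```python
# Worked count for the grid above: combinations trainable in parallel vs the
# total number of model fits under 3-fold cross-validation.
import itertools

hyperparameter_1 = [2, 5, 10]
hyperparameter_2 = [50, 100]
k_folds = 3

combinations = list(itertools.product(hyperparameter_1, hyperparameter_2))
n_parallel = len(combinations)        # 3 * 2 = 6 models in parallel
n_total = n_parallel * k_folds        # 6 * 3 = 18 total model fits
```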
References:
✑ Databricks documentation on hyperparameter tuning: Hyperparameter Tuning