Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?
Correct Answer:C
A pandas API on Spark DataFrame is made up of a Spark DataFrame with additional metadata. The pandas API on Spark aims to provide a pandas-like experience with the scalability and distributed nature of Spark, allowing users to apply familiar pandas functions to large datasets by leveraging Spark's underlying capabilities.
References:
✑ Databricks documentation on pandas API on Spark: pandas API on Spark
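The relationship can be sketched as follows. This is a minimal illustration: the commented lines assume a PySpark environment (pyspark.pandas ships with PySpark 3.2+); the executable part uses plain pandas only to show that the API surface is the same.

```python
# Sketch: a pandas-on-Spark DataFrame wraps a native Spark DataFrame while
# exposing the familiar pandas API (Spark-specific lines are commented and
# assume a live SparkSession).
import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# In a Spark session, the same data can live in a pandas-on-Spark DataFrame:
# import pyspark.pandas as ps
# psdf = ps.from_pandas(pdf)     # distributed, but pandas-like API
# sdf = psdf.to_spark()          # the underlying native Spark DataFrame
# back = sdf.pandas_api()        # re-wrap with pandas-on-Spark metadata

# The pandas-style call below uses the same syntax psdf["value"].sum() would,
# except that on Spark it executes as a distributed job:
total = pdf["value"].sum()
```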
A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".
Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?
Correct Answer:B
To register a model that has been identified by a specific run_id in the MLflow Model Registry, the appropriate line of code is: mlflow.register_model(f"runs:/{run_id}/model", "best_model")
This code correctly specifies the path to the model within the run (runs:/{run_id}/model) and registers it under the name "best_model" in the Model Registry. This allows the model to be tracked, managed, and transitioned through different stages (e.g., Staging, Production) within the MLflow ecosystem.
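The pieces fit together as in the sketch below; the run ID is a placeholder, and the commented registry call assumes a configured MLflow tracking server.

```python
# Sketch: registering the best run's model (placeholder run ID for illustration;
# a real tracking server/registry is assumed for the commented calls).
run_id = "abcdef1234567890"

# The model URI points at the artifact logged under the name "model" in that run.
model_uri = f"runs:/{run_id}/model"

# import mlflow
# result = mlflow.register_model(model_uri, "best_model")
# result.version is 1 for the first version registered under "best_model"
```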
References
✑ MLflow documentation on model registry: https://www.mlflow.org/docs/latest/model-registry.html#registering-a-model
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
Correct Answer:B
Vectorized pandas UDFs, also known as pandas UDFs, are a powerful feature in PySpark that allows for more efficient operations than standard UDFs. They process data in batches, using vectorized pandas operations on whole batches at once. This approach is much more efficient than the row-by-row processing typical of standard PySpark UDFs, which can significantly speed up computation.
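The batch-oriented core of a pandas UDF can be sketched as follows. The function itself is plain pandas (Series in, Series out); the commented Spark registration assumes a SparkSession and a DataFrame `df` with a "price" column, both illustrative.

```python
# Sketch: the core of a vectorized (pandas) UDF is a function that maps a whole
# pandas Series batch to a Series in one vectorized operation.
import pandas as pd

def double_price(prices: pd.Series) -> pd.Series:
    # A single vectorized multiply over the entire batch, not a per-row loop.
    return prices * 2.0

# In Spark, the same function would be wrapped and fed Arrow-backed batches:
# from pyspark.sql.functions import pandas_udf
# double_udf = pandas_udf(double_price, returnType="double")
# df.select(double_udf("price")).show()

# The function itself runs on any pandas batch:
result = double_price(pd.Series([100.0, 200.0]))
```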
References
✑ PySpark Documentation on UDFs:https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs
Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?
Correct Answer:D
Spark MLlib is a machine learning library within Apache Spark that provides scalable, distributed machine learning algorithms. It is designed to work with Spark DataFrames and leverages Spark's distributed computing capabilities to perform large-scale feature engineering and model training without the need for user-defined functions (UDFs) or the pandas Function API. Spark MLlib provides built-in transformations and algorithms that can be applied directly to large datasets.
References:
✑ Databricks documentation on Spark MLlib: Spark MLlib
A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.
Which of the following approaches can the data scientist use to accomplish this MLflow run organization?
Correct Answer:B
To organize MLflow runs with one parent run for the tuning process and a child run for each unique combination of hyperparameter values, the data scientist can specify nested=True when starting each child run. This ensures that each child run is properly nested under the parent run, maintaining a clear hierarchical structure for the experiment and making it easy to track and compare hyperparameter combinations within the same tuning process.
References:
✑ MLflow Documentation (Managing Nested Runs).
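The parent/child layout can be sketched as below. The grid and parameter names are illustrative, and the commented MLflow calls assume a configured tracking environment; the combination loop itself is plain Python.

```python
# Sketch: one parent run with one nested child run per hyperparameter combo
# (illustrative grid; MLflow calls are commented and assume a tracking setup).
import itertools

grid = {"max_depth": [2, 5, 10], "n_estimators": [50, 100]}
combos = list(itertools.product(*grid.values()))

# import mlflow
# with mlflow.start_run(run_name="tuning"):          # parent run
#     for max_depth, n_estimators in combos:
#         with mlflow.start_run(nested=True):        # child of the active run
#             mlflow.log_param("max_depth", max_depth)
#             mlflow.log_param("n_estimators", n_estimators)
```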
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
Correct Answer:D
To determine the number of machine learning models that can be trained in parallel, we need to calculate the total number of combinations of hyperparameters. The given hyperparameter grid includes:
✑ Hyperparameter 1: [2, 5, 10] (3 values)
✑ Hyperparameter 2: [50, 100] (2 values)
The total number of combinations is the product of the number of values for each hyperparameter: 3 (values of Hyperparameter 1) × 2 (values of Hyperparameter 2) = 6.
With 3-fold cross-validation, each combination of hyperparameters is evaluated 3 times, so the total number of models trained is: 6 (combinations) × 3 (folds) = 18.
However, the number of models that can be trained in parallel is equal to the number of hyperparameter combinations, not the total number of models considering cross-validation. Therefore, 6 models can be trained in parallel.
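The arithmetic above can be checked with a short worked count:

```python
# Worked count for the grid above: combinations trainable in parallel vs the
# total number of model fits under 3-fold cross-validation.
import itertools

hyperparameter_1 = [2, 5, 10]
hyperparameter_2 = [50, 100]
k_folds = 3

combinations = list(itertools.product(hyperparameter_1, hyperparameter_2))
n_parallel = len(combinations)        # 3 * 2 = 6 models in parallel
n_total = n_parallel * k_folds        # 6 * 3 = 18 total model fits
```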
References:
✑ Databricks documentation on hyperparameter tuning: Hyperparameter Tuning