A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.
Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?
Correct Answer: E
The data scientist can refactor their notebook to use the pandas API on Spark (now known as pandas on Spark, formerly Koalas). This requires the fewest changes to the existing pandas-based code while scaling to handle big data using Spark's distributed computing capabilities. pandas on Spark provides an API very similar to pandas, making the transition smoother and faster than completely rewriting the code against the PySpark DataFrame API, the Scala Dataset API, or Spark SQL.
References:
✑ Databricks documentation on pandas API on Spark (formerly Koalas).
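As an illustration of how small the change can be: in many cases only the import differs. The cleaning step below is a hypothetical stand-in for the notebook's actual code; on Databricks the same operations would run distributed simply by importing pyspark.pandas instead of pandas.

```python
import pandas as pd
# To scale this same code with Spark, the import would become:
#   import pyspark.pandas as ps   # pandas API on Spark (formerly Koalas)
# and pd.DataFrame / pd.read_csv calls become ps.* -- the cleaning
# logic itself stays the same.

# Hypothetical cleaning step standing in for the notebook's code:
df = pd.DataFrame(
    {"customer_id": ["a", "b", None], "amount": [10.0, None, 5.0]}
)
cleaned = df.dropna().reset_index(drop=True)  # drop rows with any missing value
```

Because pandas on Spark mirrors the pandas API, refactoring is mostly a matter of changing imports rather than rewriting transformations.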
A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.
Which of the following terms is used to describe this combination of models?
Correct Answer: D
Ensemble learning is a machine learning technique that involves combining several models to solve a particular problem. The scenario described fits the concept of ensemble learning, where two models, each performing well under different conditions, are combined to create a more robust model. This approach often leads to better performance as it combines the strengths of multiple models.
References
✑ Introduction to Ensemble Learning: https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/
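A minimal sketch of this kind of threshold-based ensemble (the two stand-in models and the routing rule below are hypothetical, not the data scientist's actual models):

```python
# Two stand-in models, each strong on a different range of the feature
def model_a(x: float) -> float:
    return 2.0 * x          # performs well when the feature is < 5

def model_b(x: float) -> float:
    return x + 10.0         # performs well when the feature is >= 5

def ensemble_predict(x: float) -> float:
    # Route each input to the model that handles its region best
    return model_a(x) if x < 5 else model_b(x)
```

Routing by feature value is one simple ensembling scheme; averaging, voting, and stacking are other common ways to combine models.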
A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df.
batch_df has the following schema:
customer_id STRING
The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:
In which situation will the machine learning engineer's code block perform the desired inference?
Correct Answer: A
The code block provided by the machine learning engineer will perform the desired inference when the Feature Store feature set was logged with the model at model_uri. This ensures that all necessary feature transformations and metadata are available for the model to make predictions. The Feature Store in Databricks allows for seamless integration of features and models, ensuring that the required features are correctly used during inference.
References:
✑ Databricks documentation on Feature Store: Feature Store in Databricks
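A sketch of what such a batch-scoring call typically looks like with the Databricks Feature Store client (runnable only inside a Databricks environment; model_uri and batch_df are the objects from the question):

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Because the feature set was logged together with the model, score_batch
# can look up and join the required features onto batch_df (keyed by
# customer_id) before invoking the model.
predictions_df = fs.score_batch(model_uri, batch_df)
```

This is why batch_df only needs the lookup key column: the logged feature metadata tells the client which features to retrieve at scoring time.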
A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.
Which change could the data scientist make to improve their model accuracy over the course of their tuning process?
Correct Answer: C
The lack of improvement in model accuracy across evaluations suggests that the optimization algorithm might not be effectively exploring the hyperparameter space. Iterative optimization algorithms like Tree-structured Parzen Estimators (TPE) or Bayesian Optimization can adapt based on previous evaluations, guiding the search towards more promising regions of the hyperparameter space.
Changing the optimization algorithm can lead to better use of the information gathered during each evaluation, potentially improving the overall accuracy.
References:
✑ Hyperparameter Optimization with Hyperopt
A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.
They have developed this code block to accomplish this task:
The code block is returning an error.
Which of the following adjustments does the data scientist need to make to accomplish this task?
Correct Answer: C
The OneHotEncoder in Spark ML requires numerical indices as inputs rather than string labels. Therefore, the string columns first need to be converted to numerical indices using StringIndexer; after that, OneHotEncoder can be applied to those indices. Corrected code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Convert each string column to a numerical index
indexers = [StringIndexer(inputCol=col, outputCol=col + "_index") for col in input_columns]
indexer_model = Pipeline(stages=indexers).fit(features_df)
indexed_features_df = indexer_model.transform(features_df)

# One-hot encode the indexed columns
ohe = OneHotEncoder(
    inputCols=[col + "_index" for col in input_columns],
    outputCols=output_columns,
)
ohe_model = ohe.fit(indexed_features_df)
ohe_features_df = ohe_model.transform(indexed_features_df)
References:
✑ PySpark ML Documentation
A machine learning engineer wants to parallelize the inference of group-specific models using the Pandas Function API. They have developed the apply_model function that will look up and load the correct model for each group, and they want to apply it to each group of the DataFrame df.
They have written the following incomplete code block:
Which piece of code can be used to fill in the above blank to complete the task?
Correct Answer: A
To parallelize the inference of group-specific models using the Pandas Function API in PySpark, you can use the applyInPandas function. This function allows you to apply a Python function on each group of a DataFrame and return a DataFrame, leveraging the power of pandas UDFs (user-defined functions) for better performance.
prediction_df = ( df.groupby("device_id") .applyInPandas(apply_model, schema=apply_return_schema) )
In this code:
✑ groupby("device_id"): Groups the DataFrame by the "device_id" column.
✑ applyInPandas(apply_model, schema=apply_return_schema): Applies the apply_model function to each group and specifies the schema of the returned DataFrame.
References:
✑ PySpark Pandas UDFs Documentation
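Because applyInPandas hands each group to the function as a plain pandas DataFrame, a group function like apply_model can be unit-tested without Spark. The lookup table, column names, and scaling logic below are hypothetical stand-ins for the engineer's real per-group model loading:

```python
import pandas as pd

# Hypothetical per-device "models": here just a scaling factor per group
MODELS = {"d1": 2.0, "d2": 0.5}

def apply_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives all rows for a single device_id as a pandas DataFrame
    factor = MODELS[pdf["device_id"].iloc[0]]
    return pd.DataFrame({
        "device_id": pdf["device_id"],
        "prediction": pdf["reading"] * factor,
    })

# The function can be exercised directly on one group's rows:
out = apply_model(pd.DataFrame({"device_id": ["d1", "d1"], "reading": [1.0, 3.0]}))
```

On the cluster, df.groupby("device_id").applyInPandas(apply_model, schema=apply_return_schema) runs this same function once per group in parallel, with the returned schema declared up front.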