A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.
Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?
Correct Answer: E
The data scientist can refactor their notebook to use the pandas API on Spark (now known as pandas on Spark, formerly Koalas). This requires the fewest changes to the existing pandas-based code while scaling to handle big data using Spark's distributed computing capabilities. pandas on Spark provides an API very similar to pandas, making the transition smoother and faster than completely rewriting the code against the PySpark DataFrame API, the Scala Dataset API, or Spark SQL.
References:
✑ Databricks documentation on pandas API on Spark (formerly Koalas).
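As an illustration of how small the change can be: in many cases only the import differs. The cleaning step below is a hypothetical stand-in for the notebook's actual code; on Databricks the same operations would run distributed simply by importing pyspark.pandas instead of pandas.

```python
import pandas as pd
# To scale this same code with Spark, the import would become:
#   import pyspark.pandas as ps   # pandas API on Spark (formerly Koalas)
# and pd.DataFrame / pd.read_csv calls become ps.* -- the cleaning
# logic itself stays the same.

# Hypothetical cleaning step standing in for the notebook's code:
df = pd.DataFrame(
    {"customer_id": ["a", "b", None], "amount": [10.0, None, 5.0]}
)
cleaned = df.dropna().reset_index(drop=True)  # drop rows with any missing value
```

Because pandas on Spark mirrors the pandas API, refactoring is mostly a matter of changing imports rather than rewriting transformations.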
A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.
Which of the following terms is used to describe this combination of models?
Correct Answer: D
Ensemble learning is a machine learning technique that involves combining several models to solve a particular problem. The scenario described fits the concept of ensemble learning, where two models, each performing well under different conditions, are combined to create a more robust model. This approach often leads to better performance as it combines the strengths of multiple models.
References
✑ Introduction to Ensemble Learning: https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/
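A minimal sketch of this kind of threshold-based ensemble (the two stand-in models and the routing rule below are hypothetical, not the data scientist's actual models):

```python
# Two stand-in models, each strong on a different range of the feature
def model_a(x: float) -> float:
    return 2.0 * x          # performs well when the feature is < 5

def model_b(x: float) -> float:
    return x + 10.0         # performs well when the feature is >= 5

def ensemble_predict(x: float) -> float:
    # Route each input to the model that handles its region best
    return model_a(x) if x < 5 else model_b(x)
```

Routing by feature value is one simple ensembling scheme; averaging, voting, and stacking are other common ways to combine models.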
A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df.
batch_df has the following schema:
customer_id STRING
The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:
In which situation will the machine learning engineer's code block perform the desired inference?
Correct Answer: A
The code block provided by the machine learning engineer will perform the desired inference when the Feature Store feature set was logged with the model at model_uri. This ensures that all necessary feature transformations and metadata are available for the model to make predictions. The Feature Store in Databricks allows for seamless integration of features and models, ensuring that the required features are correctly used during inference.
References:
✑ Databricks documentation on Feature Store: Feature Store in Databricks
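A sketch of what such a batch-scoring call typically looks like with the Databricks Feature Store client (runnable only inside a Databricks environment; model_uri and batch_df are the objects from the question):

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Because the feature set was logged together with the model, score_batch
# can look up and join the required features onto batch_df (keyed by
# customer_id) before invoking the model.
predictions_df = fs.score_batch(model_uri, batch_df)
```

This is why batch_df only needs the lookup key column: the logged feature metadata tells the client which features to retrieve at scoring time.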
A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.
Which change could the data scientist make to improve their model accuracy over the course of their tuning process?
Correct Answer: C
The lack of improvement in model accuracy across evaluations suggests that the optimization algorithm might not be effectively exploring the hyperparameter space. Iterative optimization algorithms like Tree-structured Parzen Estimators (TPE) or Bayesian Optimization can adapt based on previous evaluations, guiding the search towards more promising regions of the hyperparameter space.
Changing the optimization algorithm can lead to better use of the information gathered during each evaluation, potentially improving the overall accuracy.
References:
✑ Hyperparameter Optimization with Hyperopt
A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.
They have developed this code block to accomplish this task:
The code block is returning an error.
Which of the following adjustments does the data scientist need to make to accomplish this task?
Correct Answer: C
The OneHotEncoder in Spark ML requires numerical indices as inputs rather than string labels. Therefore, the string columns first need to be converted to numerical indices using StringIndexer; after that, OneHotEncoder can be applied to those indices. Corrected code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Convert each string column to a numerical index
indexers = [StringIndexer(inputCol=col, outputCol=col + "_index") for col in input_columns]
indexer_model = Pipeline(stages=indexers).fit(features_df)
indexed_features_df = indexer_model.transform(features_df)

# One-hot encode the indexed columns
ohe = OneHotEncoder(
    inputCols=[col + "_index" for col in input_columns],
    outputCols=output_columns,
)
ohe_model = ohe.fit(indexed_features_df)
ohe_features_df = ohe_model.transform(indexed_features_df)
References:
✑ PySpark ML Documentation
A machine learning engineer wants to parallelize the inference of group-specific models using the Pandas Function API. They have developed the apply_model function that will look up and load the correct model for each group, and they want to apply it to each group of the DataFrame df.
They have written the following incomplete code block:
Which piece of code can be used to fill in the above blank to complete the task?
Correct Answer: A
To parallelize the inference of group-specific models using the Pandas Function API in PySpark, you can use the applyInPandas function. This function allows you to apply a Python function on each group of a DataFrame and return a DataFrame, leveraging the power of pandas UDFs (user-defined functions) for better performance.
prediction_df = ( df.groupby("device_id") .applyInPandas(apply_model, schema=apply_return_schema) )
In this code:
✑ groupby("device_id"): Groups the DataFrame by the "device_id" column.
✑ applyInPandas(apply_model, schema=apply_return_schema): Applies the apply_model function to each group and specifies the schema of the returned DataFrame.
References:
✑ PySpark Pandas UDFs Documentation
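Because applyInPandas hands each group to the function as a plain pandas DataFrame, a group function like apply_model can be unit-tested without Spark. The lookup table, column names, and scaling logic below are hypothetical stand-ins for the engineer's real per-group model loading:

```python
import pandas as pd

# Hypothetical per-device "models": here just a scaling factor per group
MODELS = {"d1": 2.0, "d2": 0.5}

def apply_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives all rows for a single device_id as a pandas DataFrame
    factor = MODELS[pdf["device_id"].iloc[0]]
    return pd.DataFrame({
        "device_id": pdf["device_id"],
        "prediction": pdf["reading"] * factor,
    })

# The function can be exercised directly on one group's rows:
out = apply_model(pd.DataFrame({"device_id": ["d1", "d1"], "reading": [1.0, 3.0]}))
```

On the cluster, df.groupby("device_id").applyInPandas(apply_model, schema=apply_return_schema) runs this same function once per group in parallel, with the returned schema declared up front.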