An important task in ML is model selection: using data to find the best model or parameters for a given task. This is also called tuning. Tuning may be done for individual Estimators such as LogisticRegression, or for entire Pipelines that include multiple algorithms, featurization, and other steps. Users can tune an entire Pipeline at once, rather than tuning each element in the Pipeline separately. This example scores customer profiles using a "Recency, Frequency, Monetary Value" (RFM) metric.
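To make the RFM idea concrete, here is a minimal pure-Python sketch that does not depend on Spark. It is not the logic of train_mllib_model.py, which is not shown here. The purchase records, the reference date, and the simple rank-based scoring (1 is worst, n is best, ties broken by input order) are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical purchase records: (customer_id, purchase_date, amount).
PURCHASES = [
    ("alice", date(2024, 1, 5), 20.0),
    ("alice", date(2024, 3, 1), 35.0),
    ("alice", date(2024, 3, 20), 15.0),
    ("bob", date(2023, 11, 2), 200.0),
    ("carol", date(2024, 2, 14), 10.0),
    ("carol", date(2024, 2, 28), 12.0),
]


def rfm_table(purchases, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    table = {}
    for customer, day, amount in purchases:
        recency, frequency, monetary = table.get(customer, (None, 0, 0.0))
        days_ago = (today - day).days
        # Recency is the number of days since the most recent purchase.
        recency = days_ago if recency is None else min(recency, days_ago)
        table[customer] = (recency, frequency + 1, monetary + amount)
    return table


def rank_scores(values, higher_is_better=True):
    """Score customers 1..n by rank, where n is the best score."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {customer: position + 1 for position, customer in enumerate(ordered)}


def rfm_scores(purchases, today):
    """Return {customer: (r_score, f_score, m_score)}."""
    table = rfm_table(purchases, today)
    # A smaller recency (a more recent purchase) is better.
    r = rank_scores({c: v[0] for c, v in table.items()}, higher_is_better=False)
    f = rank_scores({c: v[1] for c, v in table.items()})
    m = rank_scores({c: v[2] for c, v in table.items()})
    return {c: (r[c], f[c], m[c]) for c in table}


scores = rfm_scores(PURCHASES, today=date(2024, 4, 1))
for customer, triple in sorted(scores.items()):
    print(customer, triple)
```

In a tuned Pipeline, values like these would typically serve as input features or labels; how the actual sample uses them is defined by train_mllib_model.py.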
Before you begin:
- Ensure your tenant is configured according to the admin setup instructions.
- Know your object store namespace.
- (Optional, strongly recommended): Install Spark to test your code locally before deploying.
- Upload a sample CSV file to OCI object store.
- Recommended: run the sample locally to test it.
- Upload train_mllib_model.py to an object store bucket.
- Create a Python Data Flow application pointing to train_mllib_model.py.
  - Refer here
Before deploying the application, set the following variables:
BUCKET=<enter an OCI object storage bucket here>
COMPARTMENTID=<enter an OCI compartment OCID here>
NAMESPACE=<enter your OCI namespace here>
oci os object put --bucket-name $BUCKET --file train_mllib_model.py
oci os object put --bucket-name $BUCKET --file moviestream_subset.csv
oci data-flow run submit \
--compartment-id $COMPARTMENTID \
--executor-shape VM.Standard2.1 \
--num-executors 2 \
--execute "oci://$BUCKET@$NAMESPACE/train_mllib_model.py --input oci://$BUCKET@$NAMESPACE/moviestream_subset.csv --output oci://$BUCKET@$NAMESPACE/scores.csv"
Make note of the OCID that is returned in the "id" field.
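If you prefer to script the submission, the following is a minimal Python sketch that assembles the same `oci data-flow run submit` arguments and pulls the "id" field out of the response. It uses only the flags shown above. The bucket, namespace, and OCID values are placeholders, and the assumption that the CLI wraps the run details under a top-level "data" key should be checked against your own CLI output.

```python
import json
import shlex


def object_uri(bucket, namespace, name):
    """Build an oci://bucket@namespace/name URI as used in the commands above."""
    return f"oci://{bucket}@{namespace}/{name}"


def submit_command(bucket, namespace, compartment_id):
    """Return the argument list for the `oci data-flow run submit` call above."""
    script = object_uri(bucket, namespace, "train_mllib_model.py")
    source = object_uri(bucket, namespace, "moviestream_subset.csv")
    target = object_uri(bucket, namespace, "scores.csv")
    return [
        "oci", "data-flow", "run", "submit",
        "--compartment-id", compartment_id,
        "--executor-shape", "VM.Standard2.1",
        "--num-executors", "2",
        "--execute", f"{script} --input {source} --output {target}",
    ]


def run_id_from_response(response_text):
    """Extract the run OCID, assuming the run is nested under "data"."""
    return json.loads(response_text)["data"]["id"]


# Placeholder values, not real resources.
cmd = submit_command("my-bucket", "my-namespace", "ocid1.compartment.oc1..example")
print(shlex.join(cmd))  # printable form; pass `cmd` to subprocess.run to execute it

# Hypothetical response body, trimmed to the one field needed here.
sample = '{"data": {"id": "ocid1.dataflowrun.oc1..example"}}'
run_id = run_id_from_response(sample)
print(run_id)
```

Building the command as a list rather than a single string avoids shell quoting problems with the multi-word `--execute` value.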
oci data-flow run get --run-id <ocid>
oci data-flow run get-log --run-id <ocid> --name spark_application_stdout.log.gz --file -
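The two commands above are usually run in sequence: check the run, then fetch the log once the run has finished. A small Python sketch of that decision follows. It assumes the `run get` response is JSON with the state under `data` and `lifecycle-state`, and that SUCCEEDED, FAILED, and CANCELED are the terminal states; both assumptions should be verified against your CLI output. The action names are hypothetical labels, not CLI features.

```python
import json

# Assumed terminal states for a Data Flow run.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def run_state(response_text):
    """Read the run state from a `run get` response (assumed JSON layout)."""
    return json.loads(response_text)["data"]["lifecycle-state"]


def next_action(state):
    """Map a run state to a hypothetical next step."""
    if state == "SUCCEEDED":
        return "fetch-log"        # run the get-log command above
    if state in TERMINAL_STATES:
        return "inspect-failure"  # the run ended without succeeding
    return "wait"                 # still running; check again later


# Hypothetical response bodies, trimmed to the field used here.
running = '{"data": {"lifecycle-state": "IN_PROGRESS"}}'
finished = '{"data": {"lifecycle-state": "SUCCEEDED"}}'

print(next_action(run_state(running)))
print(next_action(run_state(finished)))
```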