Extend sample model with Python SDK

In the Create sample model with Python SDK tutorial, you learn how the configuration of the monitor_placeholder_api_generic.ipynb notebook creates the sample taxi fare model. That tutorial walks through the out-of-the-box configuration and settings defined for the notebook and explains how to use the VIANOPS Python SDK to create the sample model.

In this tutorial, you learn how to extend the configurations from the sample notebook. Specifically, you learn how to:

  • Set a different target for the feature set
  • Change the feature set for the model
  • Create a custom inference mapping
  • Modify inference tracking to support a custom inference mapping
  • Set different values for feature importance
  • Modify a policy to detect issues on different columns
  • Create a new segment for a policy
  • Create a new policy

As you modify the existing sample model configurations and create new ones, make sure to update the out-of-the-box sample model settings to support the new configurations. For example, modifying the sample model to predict a classification target rather than the default regression target requires more changes than simply setting a classification feature as the target column. Ensure that the modifications you make (e.g., to the model, policies, segments, etc.) are supported by the provided data and the other notebook configurations.

Note: The helper.py Python file included with the sample notebook provides numerous scripted client functions to make the notebook workflow easier to process and follow. In addition, all of the vianops_client SDK modules and classes are imported there.

Set a different target for the feature set

The target for the feature set is one of the input parameters for the placeholder model notebook, and it must be a column in the feature set. To set a different target for your feature set, add the name of that column as the value for targetcolumn. Make sure this column name (a string) matches a column in your feature set and provides the data needed for your model type (i.e., classification, regression, etc.). When you run this cell, the notebook picks up your custom settings.
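For example, a minimal change might look like this (the column name below is only an illustration; use a column from your own feature set):

    # Point the notebook at a different target column.
    # The name must exactly match a column in your feature set and suit
    # your model type (e.g., a continuous column for regression).
    targetcolumn = "est_travel_time"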

Change the feature set for the model

Note: Column names in feature sets must contain only lowercase letters (a-z), numbers (0-9), underscores (_), or periods (.).

When you specify a different feature set for the placeholder model, it likely has different columns. You need to modify the features and columns set for the model to match those provided by the new feature set; for the placeholder model notebook, these are set as variables. Modify the following notebook variables as explained (a sketch follows the list):

  • allcolumns—Names of all columns (including the target column) in the feature set, defined as an array (comma-separated and enclosed in quotes).
  • continuouscolumns—Names of all columns in the feature set that contain continuous data, defined as an array (comma-separated and enclosed in quotes). If there are no columns with continuous data, leave empty square brackets [].
  • categoricalcolumns—Names of all columns in the feature set that contain categorical data, defined as an array (comma-separated and enclosed in quotes). If there are no columns with categorical data, leave empty square brackets [].
  • str_categorical_cols—Names of columns in the feature set containing categorical data of string datatype. Use empty brackets [] if there are no string categorical columns.
  • targetcolumn—Name of the column containing the target for the feature set, i.e., the feature that the model is predicting.
  • offset_col—Name of the feature set column containing the offset value (datetime format), used to generate the time/date stamps that identify when predictions were made. The given datetime is offset to today minus n days.
  • identifier_col—Name of the column containing the identifier feature that uniquely identifies predictions. (When the model sends predictions, the platform uses the identifier column to map the model’s ground truth data.)
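
For example, a minimal sketch of these variable definitions, assuming a hypothetical feature set (replace the names with columns from your own data):

    # Hypothetical feature set columns; adjust to match your data.
    allcolumns = ["trip_start", "trip_id", "pickup_zone", "trip_distance", "fare"]
    continuouscolumns = ["trip_distance", "fare"]
    categoricalcolumns = ["pickup_zone"]
    str_categorical_cols = ["pickup_zone"]  # categorical columns of string datatype
    targetcolumn = "fare"                   # the feature the model predicts
    offset_col = "trip_start"               # datetime column used for offsets
    identifier_col = "trip_id"              # uniquely identifies each prediction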

When specifying a different feature set for the sample model, you’ll also need to create a custom inference mapping that matches the different columns and data. See Create a custom inference mapping for details.

Make sure all policies and segments configured in the placeholder model notebook also reflect the modified features and columns as needed.

Create a custom inference mapping

The sample notebook builds an inference mapping using the variables defined when setting up the notebook (cells two and three). The categorical_columns, str_categorical_cols, continuous_columns, offset_col, and related values are passed to inference_mapping_generator() (in helper.py) to construct columns[] for the inference mapping payload. For more details, see V1InferenceMappingModel (SDK reference).

When extending the placeholder model notebook to use your own data (e.g., from local CSV or Parquet files), you need to create a custom schema to match your model’s data. With the custom inference mapping in place, the platform can understand and process new inference data sent by the placeholder model deployment. In this topic, you learn how to develop a new custom inference mapping.

Note: Local data filenames must contain only lowercase letters (a-z), numbers (0-9), underscores (_), or periods (.).

To create a new custom inference mapping

  1. Navigate to the notebook cell, “Load inference mapping schema”. This cell contains the default inference mapping for the placeholder model notebook.
  2. Press (+) to create a new empty cell.
  3. Create an instance of the class V1InferenceMappingJob (SDK reference) with the values for your model’s schema. To save time and ensure accuracy, Vianai recommends that you use the helper method inference_mapping_generator when creating your own schema (a sketch follows these steps).
    • Create an instance of the class V1InferenceMappingSchema (SDK reference) to specify the df_schema values (datetime_col, target_col, predict_proba_col (for classification models), identifier_cols[], and columns[]).
    • Create an instance of the class V1InferenceMappingColumnSchema (SDK reference) to specify the actual columns for the schema.

    See the Inference Mapping API documentation for more information about the values needed to create the Inference Mapping object. Specifying the identifier and datetime columns is recommended.

  4. Make sure to update the columns and values defined in cell 3 of the notebook to match your new inference mapping. Also, create feature importance, policies, segments, and hotspot analysis that use your data, or modify the current feature importance, policies, segments, and hotspot analysis to match the new inference mapping, as needed.
  5. When done, make sure to save your notebook.
  6. If you’ve already run the sample placeholder model notebook cells to this point, you can run this cell to create the inference mapping in the platform. (You must be logged in to the platform by running cell two of the notebook.) Otherwise, you need to run the notebook from the top to get everything set up correctly.
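
As a reference for step 3, a minimal sketch using the recommended helper might look like the following. The parameter names mirror the notebook variables described earlier and are assumptions; check inference_mapping_generator() in helper.py for the exact signature.

    # Sketch only: parameter names are assumed from the notebook variables;
    # verify against the inference_mapping_generator() signature in helper.py.
    inference_mapping = inference_mapping_generator(
        categorical_columns=categoricalcolumns,
        str_categorical_cols=str_categorical_cols,
        continuous_columns=continuouscolumns,
        offset_col=offset_col,
        target_col=targetcolumn,
        identifier_cols=[identifier_col],
    )
    print(inference_mapping)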

Modify inference tracking to support custom inference mapping

The sample monitor notebook uses the helper.py method populate_data() to send data to the backend. This method is tied to the sample notebook dataset, “Taxi fare dataset”.

When using a custom inference mapping for a different dataset, you need to write your own version of populate_data() that provides the structure for sending your data to the backend.

Additionally, when using a custom inference mapping you need to write code to support inference tracking for your dataset so that inferences for your new dataset can be sent to the backend. To do this, you upload the data to cache using CacheV1Api.upload() (SDK reference) and then send the uploaded data to the VIANOPS backend using InferenceTrackingV1Api.create() (SDK reference).
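
A minimal sketch of that two-step flow follows. The argument names are hypothetical; consult the CacheV1Api.upload() and InferenceTrackingV1Api.create() SDK references for the exact signatures.

    # Sketch only: argument names below are hypothetical placeholders.
    import pandas as pd

    # Load your custom dataset (local CSV or Parquet).
    df = pd.read_parquet("my_inference_data.parquet")

    # 1. Upload the inference data to the platform cache.
    cache_api = CacheV1Api()
    uploaded = cache_api.upload(df)  # hypothetical arguments

    # 2. Send the uploaded data to the VIANOPS backend for inference tracking.
    tracking_api = InferenceTrackingV1Api()
    tracking_api.create(uploaded)  # hypothetical arguments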

Set different values for feature importance

The “Month-to-month drift policy” for the sample placeholder model notebook is configured to consider (or weigh) the importance of features specified in the feature_importance input payload as part of detecting drift:

    "select_features_type": "custom",
    "feature_weightage": "manual",
    "feature_weights": m_m_feature_weights,
    "drift_measure": "PSI",
    "warning_level": 0.5,
    "critical_level": 1,
    "schedule": "0 0 6 ? * *",
    "deployment_name": f"{model_name}",
    "method": "preprocess",

where m_m_feature_weights reads and applies the importance values specified in the input payload.

The feature importance values set for the placeholder model notebook are input variables specified in the feature_importance_payload. If you are using a feature set with different data/columns, you need to modify the payload for your features. Make sure to match the feature importance structure shown in the sample notebook payload, with values defined for the three fields: Feature, absolute_importance, and percent_contribution.
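
For illustration, a payload entry for each feature might look like the following (the importance values here are hypothetical):

    # Hypothetical feature importance payload; one entry per feature.
    feature_importance_payload = [
        {"Feature": "est_trip_distance", "absolute_importance": 0.5, "percent_contribution": 50},
        {"Feature": "est_travel_time", "absolute_importance": 0.3, "percent_contribution": 30},
        {"Feature": "est_extra_cost", "absolute_importance": 0.2, "percent_contribution": 20},
    ]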

As an option, you can configure the model to get feature importance values from an uploaded file rather than including them as part of the notebook. See the Feature importance API documentation for more information.

Modify a policy to detect issues on different columns

The sample placeholder model notebook is configured to detect drift by analyzing and comparing data provided by features in the sample dataset. Based on their configurations, the four feature drift policies (i.e., “Month-to-month drift policy”, the two segment-based week-to-week prior policies, and “Hotspot based day-to-day prior policy”) detect drift equally across three features: est_trip_distance, est_travel_time, and est_extra_cost.

Single policy create

PoliciesV1Api.create() (SDK reference) uses the feature_importance_payload to define the drift features and the weight configuration for each, as follows:

    "select_features_type": "custom",
    "feature_weightage": "manual",
    "feature_weights": m_m_feature_weights,

where: 

    m_m_feature_weights = {
        feat["Feature"]: feat["percent_contribution"]
        for feat in feature_importance_payload
        if feat["percent_contribution"] != 0
    }
If you want the “Month-to-month drift policy” to detect drift equally using different features from the sample dataset, modify the feature_importance_payload as explained in “Set different values for feature importance” above.

Bulk policy create

PoliciesV1Api.bulk_create() (SDK reference) defines the drift features for the three other drift policies as follows:

    features=["est_trip_distance", "est_travel_time", "est_extra_cost"],

where the specified features are given equal weightage.

Note: Currently, PoliciesV1Api.bulk_create() applies equal weightage only.

If you want these feature drift policies to detect drift using different features, specify them in the features=[] parameter of policies_api.bulk_create().
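
For example, a hypothetical swap might look like this:

    features=["trip_distance", "wait_time", "tip_amount"],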

Create a new segment for a policy

With the exception of the “Month-to-month drift policy”, each policy created by the sample model notebook uses segments to help narrow the data population:

  • “Brooklyn-Manhattan segment” (Segment1) filters to look only at data where PULocation is either Downtown Brooklyn/MetroTech, DUMBO/Vinegar Hill, or Brooklyn Heights, and DOLocation is either Little Italy/NoLiTa or Lower East Side.
  • “Williamsburg-Manhattan segment” (Segment2) filters to look only at data where PULocation is either Williamsburg (South Side) or South Williamsburg, and DOLocation is either Little Italy/NoLiTa or Lower East Side.
  • “Distance segment” (Segment3) creates a filter that includes only trips of two miles or longer, based on the feature est_trip_distance. This filter supports the hotspot analysis configuration for “Hotspot based day-to-day prior policy”.

(See the “Create Segment1”, “Create Segment2”, and “Create Segment3” cells for configuration.)

A new variable, segment_params, is an array containing the first two segments, “Brooklyn-Manhattan segment” and “Williamsburg-Manhattan segment”. Using segment_params, both segments are then added to the “MAE performance policy” via the key-value pair, "segments": segment_params.
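
In code, this looks roughly like the following; the segment1_params and segment2_params names are assumptions modeled on the segment4_params pattern used later in this tutorial:

    # Assumed names: the segments_api.create() results for the first two segments.
    # The "MAE performance policy" payload then includes "segments": segment_params.
    segment_params = [segment1_params[0], segment2_params[0]]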

When the policies run, they apply their configurations to look for drift across all data and also across any defined segments.

New segment for this tutorial

In this topic, you learn how to create a new segment and add it to the “Month-to-month drift policy”. This segment filters to analyze only the data where est_travel_time is less than 15.00 (minutes) or where est_trip_distance is under 2.5 (miles) and est_extra_cost is greater than 0 (cents).

We’re going to use the “Create Segment1” cell as a template and then modify it for our needs. Our segment combines a simple filter and grouped_filters[] (SDK reference) to support the complex filter conditions.

When creating the filters, you use a combination of operators and conjunctions as explained in the client SDK documentation.

  1. Under the section “Create and add segments to the policies”, press “+” to create a new cell after the “Create Segment3” cell.

  2. Paste the following into the cell:

         payload = {
             "name": f"{segment4_name}",
             "description": "Segment to filter data for particular pickup and dropoff locations in Brooklyn and Manhattan",
             "filters": [
                 {
                     "feature_name": "PULocation",
                     "operator": "=",
                     "value": [
                         "Downtown Brooklyn/MetroTech",
                         "DUMBO/Vinegar Hill",
                         "Brooklyn Heights",
                     ],
                     "conjunction": "and",
                 },
                 {
                     "feature_name": "DOLocation",
                     "operator": "=",
                     "value": ["Little Italy/NoLiTa", "Lower East Side"],
                     "conjunction": None,
                 },
             ],
             "status": "inactive",
             "model_uuid": f"{model_uuid}",
         }
    
  3. Optionally, change description to “Trips under 2.5 miles or under 15 minutes”.

  4. Replace the first “filters” object to filter on the feature est_travel_time using the operator and value as shown (the conjunction is set in the next step):

         {
             "feature_name": "est_travel_time",
             "operator": "<",
             "value": [
                 15.00
             ],
             "conjunction": None
         },
    
  5. After the first filter, set the conjunction to OR so that the grouped filters you create next are joined to it:

         "conjunction": "or",
    
  6. Next, create grouped_filters[] to configure a complex nested filter that is joined to the first filter with the OR conjunction. Create this complex filter for the est_trip_distance and est_extra_cost features using the operators, values, and conjunctions as shown (a consolidated sketch appears after these steps):

         "grouped_filters": [
             {
                 "feature_name": "est_trip_distance",
                 "operator": "<",
                 "value": ["2.5"],
                 "conjunction": "and",
             },
             {
                 "feature_name": "est_extra_cost",
                 "operator": ">",
                 "value": ["0"],
                 "conjunction": None
             },
         ]
    
  7. Finally, paste the following at the bottom of the cell:

         segment4 = V1SegmentBaseModel(**payload)
         segment4_list = V1SegmentBaseModelList(__root__=[])
         segment4_list.__root__.append(segment4)
         segment4_params = segments_api.create(segment4_list)
         print(segment4_params)
    

    The payload contains the configuration for the new segment.

    We’re passing the defined payload to segment4, which is an instance of class V1SegmentBaseModel (SDK reference).

    Then, segment4_list (an instance of class V1SegmentBaseModelList (SDK reference)) enables us to get a list of individual parameters for use by the policy. Use segments_api.create(segment4_list) (SDK reference) to create the segment using the values of segment4.

  8. Navigate to cell 3 and add an entry for your new segment to the “policy and segment names” section:

    segment4_name = f"Short trips"

    Now, we can add the new segment to the policy.

  9. Navigate to the cell for “Create a base month-to-month policy” and update the segments parameter in the policy payload:

    segment_params = [segment4_params[0]]

  10. When done, make sure to save your notebook.

  11. Starting at cell 2, run the notebook again to apply the new variable, create the new segment, and run the preprocessing job for that segment.
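
Putting steps 4 through 6 together, here is the consolidated sketch of the segment’s filter configuration. It assumes grouped_filters[] sits alongside filters[] in the segment payload; check the V1SegmentBaseModel SDK reference for the exact schema.

    # Consolidated sketch of the new segment's filter configuration.
    "filters": [
        {
            "feature_name": "est_travel_time",
            "operator": "<",
            "value": [15.00],
            "conjunction": "or",  # joins the simple filter to the grouped filters
        },
    ],
    "grouped_filters": [
        {
            "feature_name": "est_trip_distance",
            "operator": "<",
            "value": ["2.5"],
            "conjunction": "and",
        },
        {
            "feature_name": "est_extra_cost",
            "operator": ">",
            "value": ["0"],
            "conjunction": None,
        },
    ],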

Create a new policy

The sample notebook includes five policies: four that detect feature drift and one that detects model performance drift. All but one of the policies include segmentation, enabling them to run feature or performance drift detection on a focused set of data as well as across all data defined for the policies. One of these policies also runs hotspot analysis, which calculates hotspots to identify areas of the dataset where events have higher values or occur more frequently. (You can view the resulting hotspots in the Hotspot Analysis window.) The “Month-to-month drift policy” is created by default without segments or hotspot analysis, although in the section “Create a new segment for a policy” you learn how to add a segment to that policy.

New policy for this tutorial

In this topic, you learn how to add a new policy that detects distance-based drift on prediction data. The policy’s configuration specifies how it operates, the data it analyzes, the schedule it runs on, and the thresholds that indicate alert conditions. A prediction drift policy runs the configured measure against predictions within a defined target window and compares them against a defined baseline to determine when prediction drift exceeds thresholds. Our policy looks for drift in predictions generated during the current week as compared with a baseline of predictions generated two weeks ago.

For more information, see documentation for the prediction drift models and supported parameters.

  1. Navigate to the “Create a model performance policy” cell (under the section “Segments and Policies”). This cell contains the configuration for the model performance drift policy.
  2. Press (+) to create a new empty cell after the model performance policy cell.
  3. Paste the following payload “template”. We walk through this payload and the steps for modifying it below.

        payload = [
            {
                "deployment": f"{deployment}",
                "model_name": f"{deployment}",
                "model_version": f"{model_version}",
                "model_stage": f"{model_stage}",
                "name": f"{policy4_name}",
                "description": "DESCRIPTION",
                "type": "drift",
                "policy": {
                    "type": "prediction-drift",
                    "drift_type": "distance",
                    "select_features_type": "all",
                    "feature_weightage": "equal",
                    "feature_weights": {},
                    "drift_measure": "MEASURE",
                    "baseline_bins": {
                        "total_amount": []
                    },
                    "window_parameters": {
                        "target": {
                            "window_type": "WINDOW_TYPE",
                            "process_date": None,
                            "offset_type": None,
                            "offset": None,
                            "start_of_week": None,
                            "quarter_start": None,
                            "year_start": None
                        },
                        "baseline": {
                            "window_method": "WINDOW_METHOD",
                            "window_type": "WINDOW_TYPE",
                            "last_amount": None,
                            "process_date": None,
                            "offset_type": None,
                            "offset": None,
                            "start_of_week": None,
                            "quarter_start": None,
                            "year_start": None
                        }
                    },
                    "warning_level": 0.1,
                    "critical_level": 0.15,
                    "schedule": "0 0 8 ? * *",
                    "deployment_name": "deploymentxyz",
                    "method": "preprocess",
                    "hotspot_analysis": {
                        "method": "flat",
                        "value": 100,
                        "features": []
                    },
                }
            },
        ]
    
  4. In the pasted payload template, leave the default settings for:

         "deployment": f"{deployment}",
         "model_name": f"{deployment}",
         "model_version": f"{model_version}",
         "model_stage": f"{model_stage}",
    
  5. Optionally, set the policy description to “Detect drift in total_amount between current week and prior week.”

  6. In policy{}, leave type and drift_type set at the template values.

  7. Under window_parameters, set target window_type to week.

  8. For baseline, set window_method to prior, window_type to week, and last_amount to 2.

  9. Set drift_measure to PSI so the policy uses the Population Stability Index to measure drift.

  10. Leave warning_level and critical_level set at the template values. These defaults ensure that warning alerts are signaled when the drift measure exceeds 0.1 and that critical alerts are signaled when it exceeds 0.15.

  11. Leave the schedule set to "schedule": "0 0 8 ? * *" to ensure the policy runs daily at 8:00 AM.

    Note: The platform uses Coordinated Universal Time (UTC) format. If your local time or cloud node offset is different from UTC time, it is highly recommended that you create timestamps in UTC format.
  12. Leave method at preprocess. This specifies the method for saving policy data.

  13. Set custom baseline bins (i.e., start and end bin edges) to better view and understand prediction drift results. For our purposes, we are setting three bins for total_amount: 0 to 25, 26 to 50, and 51 to the end of the data:

        "baseline_bins":  {  
        "total_amount": [0, 25, 26, 50, 51]  
        },
    
  14. Finally, paste the following at the bottom of the cell:

         policy4 = V1PolicyRequestModel(**payload[0])
         policy4_data = V1PolicyRequestModelList(__root__=[])
         policy4_data.__root__.append(policy4)
         policy4_res = policies_api.create(policy4_data)
         print(policy4_res)
    

    payload contains the configuration for the new policy.

    We’re passing the defined payload to policy4, which is an instance of class V1PolicyRequestModel (SDK reference). Then, policy4_data (an instance of class V1PolicyRequestModelList (SDK reference)) enables us to get a list of individual parameters for use by the policy. Use policies_api.create(policy4_data) (SDK reference) to create the policy.

  15. Navigate to cell 3 and add an entry for your new policy to the “policy and segment names” section:

    policy4 = f"Prediction drift week-over-week"

  16. When done, make sure to save your notebook.

  17. Starting at cell 2, run the notebook again to apply the new variables (cells 2 and 3), create the new policy, run the preprocessing for that policy, and then run the new policy.
Note: To run API endpoints like the ones in this cell, you must be logged in to the platform by running the second cell in the notebook.
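
For reference, after applying steps 5 through 13, the modified fields in the policy payload look like this:

    "description": "Detect drift in total_amount between current week and prior week.",
    "drift_measure": "PSI",
    "baseline_bins": {
        "total_amount": [0, 25, 26, 50, 51]
    },
    "window_parameters": {
        "target": {
            "window_type": "week",
            # remaining target fields stay None
        },
        "baseline": {
            "window_method": "prior",
            "window_type": "week",
            "last_amount": 2,
            # remaining baseline fields stay None
        }
    },
    "warning_level": 0.1,
    "critical_level": 0.15,
    "schedule": "0 0 8 ? * *",
    "method": "preprocess",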

When finished, you can find the new policy in the Policy List for your model. Select the policy to see more information, including any generated alerts, detected drift, etc. From the policy information page you can activate and run the policy. See the Explore sample taxi fare model tutorial for more details.
