```makefile
REGION="europe-west1"
ZONE="europe-west1-b"
TEMPLATE_ID="download_production_table"
dev_dataproc_assets_bucket="gs://your-dataproc-assets-bucket/production/"
dev_project=your-gcp-project-id

upload_assets:
	# gsutil cp takes no --region/--project flags; the destination bucket determines both
	gsutil cp main.py ${dev_dataproc_assets_bucket}
```
```shell
REGION=europe-west1
ZONE=europe-west1-b
CLUSTER_NAME=dev-cluster
SERVICE_ACCOUNT=your_service_account_name@your-gcp-project.iam.gserviceaccount.com
BUCKET_NAME=your-dataproc-staging-bucket

gcloud dataproc clusters create ${CLUSTER_NAME} \
  --region ${REGION} \
  --zone ${ZONE} \
  --service-account ${SERVICE_ACCOUNT} \
  --bucket ${BUCKET_NAME}
```
```yaml
jobs:
  - pysparkJob:
      args:
        - dataset
        - entity_name
        - gcs_output_bucket
        - materialization_gcp_project_id
        - materialization_dataset
        - output_parquet
        - is_partitioned
```
```python
import sys

from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.session import SparkSession

YES_TOKEN = "Yes"

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
```
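The workflow template passes its `args` to `main.py` positionally, and `YES_TOKEN` suggests that boolean flags arrive as the string `"Yes"`. A minimal parsing sketch, with the `JobArgs` name, the helper name, and the field order being assumptions that mirror the template's arg list:

```python
from dataclasses import dataclass

YES_TOKEN = "Yes"


@dataclass
class JobArgs:
    # field order mirrors the workflow template's args list (an assumption)
    dataset: str
    entity_name: str
    gcs_output_bucket: str
    materialization_gcp_project_id: str
    materialization_dataset: str
    output_parquet: str
    is_partitioned: bool


def parse_job_args(argv):
    """Parse the seven positional args; the flag string becomes a boolean."""
    (dataset, entity_name, gcs_output_bucket, materialization_gcp_project_id,
     materialization_dataset, output_parquet, is_partitioned) = argv
    return JobArgs(dataset, entity_name, gcs_output_bucket,
                   materialization_gcp_project_id, materialization_dataset,
                   output_parquet, is_partitioned == YES_TOKEN)
```

In the real job this would be invoked as `parse_job_args(sys.argv[1:])` right after the Spark session is created.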
```python
# integer and string parameters, used with hp.choice()
bootstrap_type = [
    {'bootstrap_type': 'Poisson'},
    {'bootstrap_type': 'Bayesian',
     'bagging_temperature': hp.loguniform('bagging_temperature', np.log(1), np.log(50))},
    {'bootstrap_type': 'Bernoulli'},
]
LEB = ['No', 'AnyImprovement']  # add 'Armijo' only when training on GPU
grow_policy = [
    {'grow_policy': 'SymmetricTree'},
    # {'grow_policy': 'Depthwise'},
    {'grow_policy': 'Lossguide'},  # Lossguide can also carry a max_leaves hyperparameter
]
```
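Sampling `hp.choice` over lists of dicts, as with `bootstrap_type` and `grow_policy` above, yields nested dicts, while CatBoost expects flat keyword arguments. A hypothetical helper (name assumed) to hoist the nested keys before constructing the model:

```python
def flatten_nested_choices(params):
    """Hoist keys of dict-valued entries to the top level.

    e.g. {'x': {'bootstrap_type': 'Bayesian', 'bagging_temperature': 3.2}}
    becomes {'bootstrap_type': 'Bayesian', 'bagging_temperature': 3.2}.
    """
    flat = {}
    for key, value in params.items():
        if isinstance(value, dict):
            flat.update(value)  # promote the nested dict's keys
        else:
            flat[key] = value
    return flat
```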
```python
def ensemble_search(params):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
    model = EnsembleModel(params)
    evaluation = [(X_test, y_test)]
    model.fit(X_train, y_train,
              eval_set=evaluation,
              early_stopping_rounds=100, verbose=False)
```
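For hyperopt's `fmin` to use `ensemble_search` as an objective, the function must return a dict with `loss` and `status` keys (`hyperopt.STATUS_OK` is the string `"ok"`). A sketch of that return contract, assuming AUC is the metric being maximised:

```python
STATUS_OK = "ok"  # equals hyperopt.STATUS_OK


def objective_result(auc):
    # hyperopt minimises 'loss', so a maximised metric must be inverted or negated
    return {"loss": 1.0 - auc, "status": STATUS_OK}
```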
```python
class EnsembleModel:
    def __init__(self, params):
        """LGB + XGB + CatBoost model."""
        self.lgb_params = params['lgb']
        self.xgb_params = params['xgb']
        self.cat_params = params['cat']
        self.lgb_model = LGBMClassifier(**self.lgb_params)
```
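The class is truncated here; presumably it also builds the XGBoost and CatBoost models and blends their predictions. A standard-library sketch of the soft-voting step such an ensemble typically ends with (the function name is an assumption):

```python
def soft_vote(prob_rows):
    """Average class-1 probabilities across base models.

    prob_rows: one list of probabilities per model, all the same length.
    """
    n_models = len(prob_rows)
    return [sum(col) / n_models for col in zip(*prob_rows)]
```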
```python
###### log-transform these columns ##########
log_cols = {'cont5': 'log', 'cont8': 'log', 'cont7': 'log'}
train_copy = FW.FE_transform_numeric_columns(train_copy, log_cols)
test_copy = FW.FE_transform_numeric_columns(test_copy, log_cols)
```
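`FW` here is the notebook's feature-engineering helper module; what a `'log'` transform does to a column can be sketched with the standard library (using `log1p`, which keeps zero-valued rows finite; assumes non-negative inputs):

```python
import math


def log_transform(values):
    # log1p(x) = log(1 + x): maps 0 -> 0 and compresses right-skewed tails
    return [math.log1p(v) for v in values]
```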
```python
### create groupby aggregates of the following numeric columns
agg_nums = ['cont1', 'cont3']
groupby_vars = ['cat2', 'cat4']
train_add, test_add = FW.FE_add_groupby_features_aggregated_to_dataframe(
    train[agg_nums + groupby_vars],
    agg_types=['mean', 'std'],
    groupby_columns=groupby_vars,
    ignore_variables=[],
    test=test[agg_nums + groupby_vars])
# join the dataframes with the aggregated features to the main training and test set dataframes
train_copy = train.join(train_add.drop(groupby_vars + agg_nums, axis=1))
```
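Under the hood, the `FW` call computes per-group statistics of each numeric column and merges them back onto the rows. A standard-library sketch of the per-group step for one column (population std here for simplicity; pandas-based code usually uses the sample std):

```python
from collections import defaultdict
from statistics import mean, pstdev


def groupby_aggregate(rows, group_key, value_key):
    """Return {group: {'mean': ..., 'std': ...}} for one numeric column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {g: {'mean': mean(v), 'std': pstdev(v)} for g, v in groups.items()}
```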
```python
### create feature crosses of these categorical variables ###
train = FW.FE_create_categorical_feature_crosses(train, ['cat4', 'cat18', 'cat13', 'cat2'])
test = FW.FE_create_categorical_feature_crosses(test, ['cat4', 'cat18', 'cat13', 'cat2'])
```
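A feature cross concatenates the values of two categorical columns into one new column, letting tree models split on the combination directly. A standard-library sketch for a single row (the function name is an assumption):

```python
from itertools import combinations


def cross_features(row, columns):
    """Build all pairwise crosses, e.g. cat4_cat18 = 'A_B'."""
    return {f"{a}_{b}": f"{row[a]}_{row[b]}"
            for a, b in combinations(columns, 2)}
```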