```makefile
REGION="europe-west1"
ZONE="europe-west1-b"
TEMPLATE_ID="download_production_table"
dev_dataproc_assets_bucket="gs://your-dataproc-assets-bucket/production/"
dev_project=your-gcp-project-id

upload_assets:
	# gsutil cp takes no --region/--project flags; the destination bucket determines both
	gsutil cp main.py ${dev_dataproc_assets_bucket}
```
```shell
REGION=europe-west1
ZONE=europe-west1-b
CLUSTER_NAME=dev-cluster
SERVICE_ACCOUNT=your_service_account_name@your-gcp-project.iam.gserviceaccount.com
BUCKET_NAME=your-dataproc-staging-bucket

gcloud dataproc clusters create ${CLUSTER_NAME} \
  --region ${REGION} \
  --zone ${ZONE} \
  --service-account ${SERVICE_ACCOUNT} \
  --bucket ${BUCKET_NAME}
```
```yaml
jobs:
  - pysparkJob:
      args:
        - dataset
        - entity_name
        - gcs_output_bucket
        - materialization_gcp_project_id
        - materialization_dataset
        - output_parquet
        - is_partitioned
```
```python
import sys

from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.session import SparkSession

YES_TOKEN = "Yes"

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
```
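The workflow template passes its `args` to `main.py` positionally, and `YES_TOKEN` suggests that boolean flags arrive as the string `"Yes"`. A minimal parsing sketch, with the `JobArgs` name, the helper name, and the field order being assumptions that mirror the template's arg list:

```python
from dataclasses import dataclass

YES_TOKEN = "Yes"


@dataclass
class JobArgs:
    # field order mirrors the workflow template's args list (an assumption)
    dataset: str
    entity_name: str
    gcs_output_bucket: str
    materialization_gcp_project_id: str
    materialization_dataset: str
    output_parquet: str
    is_partitioned: bool


def parse_job_args(argv):
    """Parse the seven positional args; the flag string becomes a boolean."""
    (dataset, entity_name, gcs_output_bucket, materialization_gcp_project_id,
     materialization_dataset, output_parquet, is_partitioned) = argv
    return JobArgs(dataset, entity_name, gcs_output_bucket,
                   materialization_gcp_project_id, materialization_dataset,
                   output_parquet, is_partitioned == YES_TOKEN)
```

In the real job this would be invoked as `parse_job_args(sys.argv[1:])` right after the Spark session is created.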
```python
# integer and string parameters, used with hp.choice()
bootstrap_type = [
    {'bootstrap_type': 'Poisson'},
    {'bootstrap_type': 'Bayesian',
     'bagging_temperature': hp.loguniform('bagging_temperature', np.log(1), np.log(50))},
    {'bootstrap_type': 'Bernoulli'},
]
LEB = ['No', 'AnyImprovement']  # add 'Armijo' only when training on GPU
grow_policy = [
    {'grow_policy': 'SymmetricTree'},
    # {'grow_policy': 'Depthwise'},
    {'grow_policy': 'Lossguide'},  # Lossguide can also carry a max_leaves hyperparameter
]
```
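Sampling `hp.choice` over lists of dicts, as with `bootstrap_type` and `grow_policy` above, yields nested dicts, while CatBoost expects flat keyword arguments. A hypothetical helper (name assumed) to hoist the nested keys before constructing the model:

```python
def flatten_nested_choices(params):
    """Hoist keys of dict-valued entries to the top level.

    e.g. {'x': {'bootstrap_type': 'Bayesian', 'bagging_temperature': 3.2}}
    becomes {'bootstrap_type': 'Bayesian', 'bagging_temperature': 3.2}.
    """
    flat = {}
    for key, value in params.items():
        if isinstance(value, dict):
            flat.update(value)  # promote the nested dict's keys
        else:
            flat[key] = value
    return flat
```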
```python
def ensemble_search(params):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
    model = EnsembleModel(params)
    evaluation = [(X_test, y_test)]
    model.fit(X_train, y_train,
              eval_set=evaluation,
              early_stopping_rounds=100, verbose=False)
```
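For hyperopt's `fmin` to use `ensemble_search` as an objective, the function must return a dict with `loss` and `status` keys (`hyperopt.STATUS_OK` is the string `"ok"`). A sketch of that return contract, assuming AUC is the metric being maximised:

```python
STATUS_OK = "ok"  # equals hyperopt.STATUS_OK


def objective_result(auc):
    # hyperopt minimises 'loss', so a maximised metric must be inverted or negated
    return {"loss": 1.0 - auc, "status": STATUS_OK}
```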
```python
class EnsembleModel:
    def __init__(self, params):
        """LGB + XGB + CatBoost model."""
        self.lgb_params = params['lgb']
        self.xgb_params = params['xgb']
        self.cat_params = params['cat']
        self.lgb_model = LGBMClassifier(**self.lgb_params)
```
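The class is truncated here; presumably it also builds the XGBoost and CatBoost models and blends their predictions. A standard-library sketch of the soft-voting step such an ensemble typically ends with (the function name is an assumption):

```python
def soft_vote(prob_rows):
    """Average class-1 probabilities across base models.

    prob_rows: one list of probabilities per model, all the same length.
    """
    n_models = len(prob_rows)
    return [sum(col) / n_models for col in zip(*prob_rows)]
```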
```python
###### log-transform these columns ##########
log_cols = {'cont5': 'log', 'cont8': 'log', 'cont7': 'log'}
train_copy = FW.FE_transform_numeric_columns(train_copy, log_cols)
test_copy = FW.FE_transform_numeric_columns(test_copy, log_cols)
```
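`FW` here is the notebook's feature-engineering helper module; what a `'log'` transform does to a column can be sketched with the standard library (using `log1p`, which keeps zero-valued rows finite; assumes non-negative inputs):

```python
import math


def log_transform(values):
    # log1p(x) = log(1 + x): maps 0 -> 0 and compresses right-skewed tails
    return [math.log1p(v) for v in values]
```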
```python
### create groupby aggregates of the following numeric columns
agg_nums = ['cont1', 'cont3']
groupby_vars = ['cat2', 'cat4']
train_add, test_add = FW.FE_add_groupby_features_aggregated_to_dataframe(
    train[agg_nums + groupby_vars],
    agg_types=['mean', 'std'],
    groupby_columns=groupby_vars,
    ignore_variables=[],
    test=test[agg_nums + groupby_vars])
# join the dataframes with the aggregated features to the main training and test set dataframes
train_copy = train.join(train_add.drop(groupby_vars + agg_nums, axis=1))
```
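Under the hood, the `FW` call computes per-group statistics of each numeric column and merges them back onto the rows. A standard-library sketch of the per-group step for one column (population std here for simplicity; pandas-based code usually uses the sample std):

```python
from collections import defaultdict
from statistics import mean, pstdev


def groupby_aggregate(rows, group_key, value_key):
    """Return {group: {'mean': ..., 'std': ...}} for one numeric column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {g: {'mean': mean(v), 'std': pstdev(v)} for g, v in groups.items()}
```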
```python
### create feature crosses of these categorical variables ###
train = FW.FE_create_categorical_feature_crosses(train, ['cat4', 'cat18', 'cat13', 'cat2'])
test = FW.FE_create_categorical_feature_crosses(test, ['cat4', 'cat18', 'cat13', 'cat2'])
```
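A feature cross concatenates the values of two categorical columns into one new column, letting tree models split on the combination directly. A standard-library sketch for a single row (the function name is an assumption):

```python
from itertools import combinations


def cross_features(row, columns):
    """Build all pairwise crosses, e.g. cat4_cat18 = 'A_B'."""
    return {f"{a}_{b}": f"{row[a]}_{row[b]}"
            for a, b in combinations(columns, 2)}
```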