Dataflow

DataOps Performance Stats
File Type / Data Compare / DataOpsEngine Time — AWS S3 CSV. Dataset size: 1 billion records (1B rows, 23 columns) with 1628 partitions; size: 200 GB ...
Tue, 7 Sep, 2021 at 3:02 AM
DataOpsEMR Pricing
Service / Monthly (all prices) / Configuration summary — EMR Master (Software): $15.33; number of master EMR nodes (1), EC2 instance (r6g.xlarge), Uti...
Thu, 2 Sep, 2021 at 8:25 AM
Code to read a pipe-delimited CSV file, skipping the top and bottom rows and replacing special characters in column names with underscores.
import pandas as pd import databricks.koalas as ks df = pd.read_csv('/$[ReconcileDate]/Files/Sample.csv', engine='python', sep='|'...
Mon, 20 Dec, 2021 at 12:23 AM
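The snippet above is truncated, but it appears to use pandas' Python engine with skiprows/skipfooter and a pipe separator. A minimal self-contained sketch of the same idea (the sample data and column names below are invented for illustration):

```python
import io
import re
import pandas as pd

# Sample pipe-delimited data with a banner line on top and a trailer line at the bottom.
raw = (
    "BANNER LINE TO SKIP\n"
    "Col A|Col-B|Col.C\n"
    "1|2|3\n"
    "4|5|6\n"
    "TRAILER LINE TO SKIP"
)

# engine='python' is required for skipfooter; skiprows=[0] drops the banner line,
# skipfooter=1 drops the trailer line.
df = pd.read_csv(io.StringIO(raw), sep='|', engine='python',
                 skiprows=[0], skipfooter=1)

# Replace any non-alphanumeric character in the column names with an underscore.
df.columns = [re.sub(r'[^0-9A-Za-z]', '_', c) for c in df.columns]
```

In the original snippet the path contains a `$[ReconcileDate]` placeholder, which presumably the DataOps engine substitutes at run time.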
Code to skip the last record of a dataset.
df = spark.sql(f"select * from DependentDataset") ds = df.toPandas().iloc[:-1] spark.createDataFrame(ds.astype(str)).createOrReplaceTempView(...
Wed, 15 Dec, 2021 at 5:06 AM
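The Spark snippet above converts the DataFrame to pandas and drops the final row with `iloc[:-1]`. A pandas-only sketch of that core step (the data is invented; the Spark session and temp-view registration are omitted):

```python
import pandas as pd

# Hypothetical dataset whose final row is a trailer/summary record.
df = pd.DataFrame({'id': [1, 2, 3], 'val': ['a', 'b', 'TRAILER']})

# iloc[:-1] keeps every row except the last one.
trimmed = df.iloc[:-1]
```

Note that in the original snippet `toPandas()` collects the whole dataset to the driver, so this approach only suits datasets that fit in driver memory.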
Read an .xls file that actually contains HTML data.
import pandas as pd df = pd.read_html('/$[ReconcileDate]/Detail_$[ReconcileDate].xls', header=0) dff = pd.concat(df) dff.columns = dff.columns.st...
Wed, 15 Dec, 2021 at 5:09 AM
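Some feeds ship ".xls" files that are really HTML tables, which is why the snippet uses `pd.read_html` rather than `pd.read_excel`. A self-contained sketch with an invented table (requires an HTML parser backend such as lxml):

```python
import io
import pandas as pd

# A fake ".xls" payload that is actually an HTML table.
html = """<table>
<tr><th>Trade Id</th><th>Amount</th></tr>
<tr><td>T1</td><td>100</td></tr>
<tr><td>T2</td><td>200</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table>; header=0 uses the first row.
tables = pd.read_html(io.StringIO(html), header=0)
dff = pd.concat(tables)

# Normalize column names, e.g. strip stray whitespace.
dff.columns = dff.columns.str.strip()
```

The `pd.concat` call matters because `read_html` always returns a list, even when the file holds a single table.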
Export datacompare datasets to Excel
import pandas as pd sheet_names = ["DuplicatesInBase", "DuplicatesInRun", "OnlyInBase", "OnlyInRun", "Differen...
Wed, 15 Dec, 2021 at 5:14 AM
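The snippet above defines one sheet name per data-compare result set; presumably each result set is written to its own sheet of a single workbook via `pd.ExcelWriter`. A sketch with invented frames (the real result sets would come from the compare run; requires openpyxl):

```python
import os
import tempfile
import pandas as pd

# Hypothetical data-compare result sets, one DataFrame per sheet.
sheet_names = ["DuplicatesInBase", "DuplicatesInRun", "OnlyInBase", "OnlyInRun"]
frames = {name: pd.DataFrame({'key': [1, 2], 'source': [name] * 2})
          for name in sheet_names}

# Write each result set to its own sheet in a single workbook.
fd, out_path = tempfile.mkstemp(suffix='.xlsx')
os.close(fd)
with pd.ExcelWriter(out_path) as writer:
    for name, frame in frames.items():
        frame.to_excel(writer, sheet_name=name, index=False)
```

Using one `ExcelWriter` context for all sheets avoids reopening (and truncating) the workbook on every write.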
Read csv with Koalas
import pandas as pd import databricks.koalas as ks df = pd.read_csv('/sample.csv', engine='python', sep='|', skiprows=[0], skipfoo...
Wed, 15 Dec, 2021 at 5:17 AM
Code to read CSV data with named columns, skipping the top and bottom rows.
import pandas as pd df = pd.read_csv('/sample.txt', header=None, skiprows=[0],skipfooter=1, delimiter = '|', names=["COL1","...
Wed, 15 Dec, 2021 at 5:30 AM
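The distinguishing pieces in this snippet are `header=None` plus `names=[...]`, which assign column names to a file that has no header row. A self-contained sketch with invented data (only `COL1` through `COL3` are shown; the original list is truncated):

```python
import io
import pandas as pd

# Pipe-delimited file with no header row, plus a banner line and a trailer line.
raw = (
    "FILE BANNER\n"
    "101|Alice|NY\n"
    "102|Bob|TX\n"
    "EOF TRAILER"
)

# header=None tells pandas the first data line is not a header; names supplies
# the column names. skipfooter again requires engine='python'.
df = pd.read_csv(io.StringIO(raw), header=None, delimiter='|',
                 skiprows=[0], skipfooter=1, engine='python',
                 names=["COL1", "COL2", "COL3"])
```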
Other similar code samples
import pandas as pd df = pd.read_csv('/BMO_REPORTS_REGRESSION/V16_Upgrade/UAT/$[ReconcileDate]/base2/Calypso_EOD_MPE_Deal_Level_North_America_ALL$[Reco...
Wed, 15 Dec, 2021 at 7:01 AM