Ready To Use Data Engineering Tips And Tricks : 2017

What We Wanted to Achieve...

We were looking for a generic code snippet to drop a particular column after reading data from a file with Spark 2.0. All we knew was that, in every file we wanted to process, the column to be dropped sat at index n-m, counting from zero, where n is the number of columns in the dataframe (so m = 1 means the last column and m = 2 the last but one).

What We Did...

Read data:
val df1 = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "delim").load("/Path/to/file") // "delim" is a placeholder for the file's actual delimiter

Get Schema of dataframe:

df1.printSchema()

Drop Column from a dataframe:

df1.drop("column name") // returns a new dataframe; df1 itself is not modified

Drop column based on position:

// To drop the column at index n-m (m = 1 drops the last column)

val m = 1 // distance of the column from the end; set as needed

val col = df1.columns // get the array of column names from the dataframe

val n = col.length // get the number of columns (n)

val ToBeDropped = n - m // zero-based index of the column to be dropped

val oldDf = df1.drop(col(ToBeDropped)) // new dataframe without the column at index n-m

Example to drop last but one column from the data frame:

val col=df1.columns

val n=df1.columns.length

val ToBeDropped = n-2 // index of the last but one column; use n-1 for the last column, n-3 for the third from last, and so on

val oldDf=df1.drop(col(ToBeDropped ))
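
The index arithmetic above can be wrapped in a small helper so that an out-of-range m fails early with a clear message. This is only a sketch: columnFromEnd is a hypothetical name of our own, not part of Spark, and the sample array simply stands in for df1.columns so that the snippet runs without a Spark session.

```scala
// Hypothetical helper (not a Spark API): returns the name of the column
// that sits m places from the end, i.e. at zero-based index n - m.
def columnFromEnd(columns: Array[String], m: Int): String = {
  val n = columns.length
  // m = 1 is the last column, m = n is the first one.
  require(m >= 1 && m <= n, s"m must be between 1 and $n, got $m")
  columns(n - m)
}

// Made-up column names standing in for df1.columns.
val sampleColumns = Array("id", "name", "flag", "ts")

// Last but one column (m = 2).
val toDrop = columnFromEnd(sampleColumns, 2)
println(toDrop) // prints "flag"
```

With a real dataframe the call would look like df1.drop(columnFromEnd(df1.columns, 2)), which keeps the position logic in one place instead of repeating it per file.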


Saturday, 25 February 2017

Spark Drop Column From Dataframe using column index