Saturday, 25 February 2017

Spark Drop Column From Dataframe using column index

What We Wanted to Achieve....

We were looking for a generic code snippet to drop a certain column after reading data from a file using Spark 2.0. All we knew was that, in every file we were aiming to process, the column to be dropped was at index n-m (zero-based), where n is the number of columns in the DataFrame and m is the column's offset from the end.


What We Did ....

Read data:
// Replace "delim" with the file's actual delimiter and the path with the real file location.
val df1 = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "delim").load("/Path/to/file")

Get Schema of dataframe:
df1.printSchema()

Drop Column from a dataframe:
df1.drop("column name")  // returns a new DataFrame; df1 itself is unchanged

Drop column based on position:
// To drop the column at index n-m (m = 1 drops the last column)
val m = 1                               // offset from the end; adjust as needed
val col = df1.columns                   // array of column names in the DataFrame
val n = col.length                      // number of columns (n)
val ToBeDropped = n - m                 // zero-based index of the column to drop
val oldDf = df1.drop(col(ToBeDropped))  // new DataFrame without that column

Example to drop the last-but-one column from the DataFrame:
val col = df1.columns
val n = col.length
val ToBeDropped = n - 2  // last-but-one column; use n - 1 for the last column, and so on
val oldDf = df1.drop(col(ToBeDropped))
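
A bounds-checked variant:
The index arithmetic above fails with an ArrayIndexOutOfBoundsException if m is larger than the number of columns (or is 0). One way to guard against that is a small helper that works on the column names alone. This is only a sketch: columnFromEnd and sampleColumns are names made up for illustration, not part of Spark, and the snippet is meant to be pasted into spark-shell or a plain Scala REPL.

```scala
// Hypothetical helper (not a Spark API): returns the name of the column that
// sits m positions from the end, or None when m is out of range.
def columnFromEnd(columns: Seq[String], m: Int): Option[String] = {
  val idx = columns.length - m          // same n - m arithmetic as above
  if (m >= 1 && idx >= 0) Some(columns(idx)) else None
}

// Illustrative column names standing in for df1.columns.
val sampleColumns = Seq("id", "name", "city", "zip")

val lastColumn = columnFromEnd(sampleColumns, 1)  // m = 1: last column
val lastButOne = columnFromEnd(sampleColumns, 2)  // m = 2: last-but-one column
val outOfRange = columnFromEnd(sampleColumns, 5)  // m exceeds the column count

// With a real DataFrame the result could be applied like this
// (falls back to the original DataFrame when m is out of range):
// val newDf = columnFromEnd(df1.columns, 2).map(c => df1.drop(c)).getOrElse(df1)
```

Returning an Option keeps the decision about an invalid m with the caller, who can skip the drop, log a warning, or fail the job, whichever suits the files being processed.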