The PySpark CAST function seems to limit the precision of the FLOAT type.
When casting from string to float, precision can be lost. Below are some examples.
The cast of '144.9100037' to float ends up as '144.91', losing 5 decimal digits.
However, the cast of '14.9100037' to float ends up as '14.910004', losing only one decimal digit.
Maybe a bug? Probably not; a single-precision float simply cannot hold that many significant digits.
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data=[('14.9100037', '144.9100037')], schema=['NUM1', 'NUM2'])
df.withColumn('float1', F.col('NUM1').cast('float'))\
  .withColumn('float2', F.col('NUM2').cast('float'))\
  .withColumn('doubletype', F.col('NUM2').cast('double'))\
  .withColumn('decimaltype', F.col('NUM2').cast('decimal(18,7)'))\
  .show()
+----------+-----------+---------+------+-----------+-----------+
| NUM1| NUM2| float1|float2| doubletype|decimaltype|
+----------+-----------+---------+------+-----------+-----------+
|14.9100037|144.9100037|14.910004|144.91|144.9100037|144.9100037|
+----------+-----------+---------+------+-----------+-----------+
Float is a 4-byte IEEE 754 single-precision type: 1 sign bit, 8 exponent bits and a 23-bit significand. The 24 effective significand bits cover about 2^24 ≈ 16.7 million values, i.e. roughly 6-7 significant decimal digits, so a 10-significant-digit value like 144.9100037 cannot be stored exactly. (2^32 / 2 ≈ 2 billion is the range of a 32-bit integer, not the precision of a float.) Use double (8 bytes, ~15-16 significant digits) or decimal when full precision matters.
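As a quick check outside Spark (a minimal sketch using only Python's standard struct module, not part of the Spark example above), round-tripping the same strings through 4 bytes shows the same ~7-digit limit:
import struct

def to_float32(x):
    # Pack into 4 bytes ('f' = IEEE 754 single precision), then unpack back to a Python float
    return struct.unpack('f', struct.pack('f', float(x)))[0]

print(to_float32('14.9100037'))    # ~14.910003662..., only about 7 significant digits survive
print(to_float32('144.9100037'))   # ~144.910003662..., which Spark's show() renders as 144.91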
Monotonically increasing ID
Sometimes when ordering rows by time, several rows have exactly the same time value, and you want to preserve the original order of the rows as they appear in the data file. In that case, capture that order with the monotonically_increasing_id() function and use it as a tie-breaker.
withColumn("idx", F.monotonically_increasing_id())
If an intermediate DataFrame that is the join of many tables needs to be used multiple times in downstream queries, we can cache it in memory so the downstream queries reuse the intermediate result instead of recomputing it again and again.
cache() only marks the DataFrame for caching; calling an action such as count() afterwards actually materializes it in memory.
df_intermediate = df_table1.join(df_table2, on="column", how="inner").cache()
df_intermediate.count()
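For example, both downstream queries below reuse the cached join instead of recomputing it (a hypothetical sketch; the column names and aggregations are placeholders):
# Two downstream queries over the same cached intermediate result
summary1 = df_intermediate.groupBy("column").count()
summary2 = df_intermediate.filter(F.col("amount") > 0).agg(F.sum("amount"))

summary1.show()
summary2.show()

# Release the memory once the intermediate result is no longer needed
df_intermediate.unpersist()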