Hive

We use Hive for analysis and processing of our data warehouse. It's a SQL-like syntax. 

(Note: as of this writing, this style guide is evolving as our use of Hive is pretty new.)

Code Formatting
  • 2 space indenting, or indent left-aligned to the opening parenthesis of the previous line. Certain HIVE keywords/statements should require indenting if wrapped, while others are fine to align (TODO(benkomalo): clarify with an example).
  • 80-char line widths. Some exceptions for long URL's in comments may apply.
  • spaces before open paren (: normal programming language convention. No space before function calls (eg. get_json_object(...)) but space before syntax-supported language constructs (eg. FROM (...)).
Hive Keywords
  • UPPERCASE all Hive keywords like SELECT and JOIN
Naming
  • table_names_like_this
  • column_names_like_this
Variables
  • Declare variables like this
    set dt=2013-10-10; 
    not like this 
    dt="2013-10-10";
  • In queries this means you will typically need to reference variables as "${dt}" rather than as {dt}
Dates
  • Our convention is to be inclusive for start and exclusive for end values of date ranges (i.e. 2012-04-01 to 2012-04-08 is a 7 day period including the first and the seventh of the month. Rationale:
    • To get length of interval, subtract end from beginning. (With open interval, you'd have to add 1.)
    • Naturally express empty intervals.
    • If I want to fetch a batch of 2 weeks, I simply need to add 2 weeks to my start date. However, with open interval I have to add 2 weeks then subtract by the granularity of how dates are stored, namely a day. This means if one changes to partition by hour instead of day, I'd need to subtract an hour from my end date.
    • Similar to the above, I don't have to know month lengths if I want to fetch data for an entire month: WHERE dt >= '2012-05-01' AND dt < '2012-06-01'
    • This is the convention that people eventually realize is the right way. Eg. Python range(), Python random.randrange() (which was added to fix random.randint()).
    • Dijkstra: http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html
    • http://stackoverflow.com/questions/4264947/what-is-best-practice-for-representing-time-intervals-in-a-data-warehouse
TODO(benkomalo): lots of examples

Comments