How to Make a Dashboard, or: from Hive to Website

1. Put your metrics into MongoDB

We have a report database set up at 107.21.23.204:27017. You can load your data there through any library that can talk to MongoDB (e.g. pymongo for Python).

From Python to MongoDB

If you extract your numbers from a Python script, or a web-page fetch, or whatever, you can use a simple library, such as pymongo, to insert the numbers into MongoDB.  For one example, see analytics/src/gae_dashboard/load_usage_reports.py.  The work can be as simple as the SaveToMongo routine in analytics/src/webpagetest/run_webpagetest.py.
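
For a feel for what such a loader involves, here is a minimal sketch using pymongo.  Only the host/port come from the report database above; the database name, collection name, and fields are hypothetical, so treat the scripts linked above as the authoritative examples:

    import datetime
    import pymongo

    # Report database from above.  The 'report' database and
    # 'my_daily_metrics' collection names are hypothetical.
    client = pymongo.MongoClient('107.21.23.204', 27017)
    collection = client['report']['my_daily_metrics']

    def save_to_mongo(records):
        # Upsert one document per (dt, metric) so reruns don't duplicate data.
        for record in records:
            collection.update_one(
                {'dt': record['dt'], 'metric': record['metric']},
                {'$set': record},
                upsert=True)

    if __name__ == '__main__':
        save_to_mongo([{'dt': datetime.date.today().strftime('%Y-%m-%d'),
                        'metric': 'signups', 'value': 1234}])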

You'll probably want to update aws-config/analytics/crontab to run your script periodically. Don't forget to install it once you're done by logging into the analytics machine and running cd aws-config && git pull && crontab analytics/crontab.
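
For example, a nightly entry might look like this (the schedule and script path are made up -- mimic the existing entries in the file):

    # Hypothetical: load yesterday's metrics into MongoDB at 2am UTC.
    0 2 * * * python $HOME/analytics/src/my_metrics_loader.py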

From Hive to MongoDB

If your metrics are generated out of Hive -- that is, you want to run a map-reduce over some data in Hive to generate summary data for your dashboard -- we have a process in place for that:

  1. Get your data where Hive can see it.  Usually this involves putting it in S3.  A number of dashboards do this via analytics/map_reduce/load_emr_daily.sh (see the "upload to S3" section).  It's common to put this unprocessed data under s3://ka-mapreduce/rawdata/.
  2. Make a table definition for the raw data.  This references the data in S3.  It should be defined in analytics/map_reduce/hive/ka_hive_init.q, in the top section of the file.  Just follow one of the existing examples (make sure the table is EXTERNAL, get the types of the fields right, include the ALTER line, etc.); there is also a hedged sketch of steps 2-4 after this list.
  3. Make a table definition for the summary data.  This table's fields are what get loaded into MongoDB.  It is also defined in analytics/map_reduce/hive/ka_hive_init.q, in the "Summary tables" section of the file.  The fields here don't need to match the fields of the raw-data table in any way, but it must be possible to derive each field of the summary table from the data in the raw table.  Again, just follow an existing example.  You will often want to partition by dt so users can query MongoDB by day.
  4. Manually run the summary-table definition step (ideally this would be automatic, but it isn't yet).  You do this by logging into ka-hive, running hive, and at the prompt entering the SQL commands you added to ka_hive_init.q to create the summary-data table.  This should be two commands: CREATE EXTERNAL TABLE ... and ALTER TABLE ....
  5. Create a script that takes data from the raw table and writes it to the summary table.  These scripts live in analytics/map_reduce/hive/ and run SQL queries.  At a minimum, they look like:
    INSERT OVERWRITE TABLE <summary_table> PARTITION (dt = '${dt}')
    SELECT
    <fields in summary table>
    FROM <raw_table> WHERE dt = '${dt}';
    A relatively small example is at analytics/map_reduce/hive/daily_request_log_url_stats.q.
  6. Modify the report-generation tools so they import the summary data into mongo (by running the above script).  You do this by adding two pieces to analytics/cfg/daily_report.json:
    1. Add your raw-data table to the "wait for" section at the top of the file.
    2. Add your summary table to the "steps" section at the bottom of the file.
    The fields in the "steps" section are used by analytics/src/report_generator.py -- you can look there to understand what they're for.
  7. Now you need to get this live on the site.  You do this by committing and pushing your change (after getting it reviewed, of course!), then logging into the analytics machine and running:
    cd analytics && git pull && cd map_reduce && make upload
  8. Wait for the daily-report script to run (via load_emr.sh), which happens overnight.  Come back the next day and verify the script worked by logging into the analytics machine and looking at kalogs/load_emr/<last_file>.  Look for your summary-table name.
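
To make steps 2-4 concrete, here is a hedged sketch of what the two table definitions in ka_hive_init.q might look like.  All table, field, and S3 names here are hypothetical, and details like the row format and the exact ALTER line should be copied from an existing example in the file:

    -- Raw-data table (top section of ka_hive_init.q): points at the
    -- unprocessed data uploaded to S3 in step 1.
    CREATE EXTERNAL TABLE IF NOT EXISTS my_rawdata (
        user_id STRING,
        url STRING,
        num_hits INT
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3://ka-mapreduce/rawdata/my_rawdata';
    -- The "ALTER line" from step 2; on EMR Hive this picks up the dt
    -- partitions already sitting in S3.
    ALTER TABLE my_rawdata RECOVER PARTITIONS;

    -- Summary table ("Summary tables" section): its fields are what get
    -- loaded into MongoDB.  Partitioned by dt so users can query by day.
    CREATE EXTERNAL TABLE IF NOT EXISTS my_summary (
        url STRING,
        total_hits INT
    )
    PARTITIONED BY (dt STRING)
    LOCATION 's3://ka-mapreduce/summary_tables/my_summary';
    ALTER TABLE my_summary RECOVER PARTITIONS;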

2. Make Python functions and endpoints for retrieving the data

Once the data is in the database, you need to create a web endpoint to retrieve the data points. This is pretty straightforward. For example, I wrote a Python function, topic_summary, to get data from the report database, then created a web endpoint that returns the JSON-serialized result of the topic_summary function.
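
Here is a hedged sketch of what that might look like, assuming the dashboard app is built on Flask (which the localhost:5000 default in step 4 below suggests); the route, database, collection, and field names are all hypothetical:

    import json

    import flask
    import pymongo

    app = flask.Flask(__name__)
    # Report database from step 1; the 'report' database and the
    # collection name below are hypothetical.
    db = pymongo.MongoClient('107.21.23.204', 27017)['report']

    def topic_summary(dt):
        # Fetch one day's summary rows; drop _id so the result is
        # JSON-serializable.
        return list(db['video_topic_summary'].find({'dt': dt}, {'_id': 0}))

    @app.route('/data/topic_summary')
    def topic_summary_json():
        # e.g. GET /data/topic_summary?dt=2013-01-01
        dt = flask.request.args.get('dt')
        return flask.Response(json.dumps(topic_summary(dt)),
                              mimetype='application/json')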

3. Come up with the frontend HTML + JavaScript

Now, to put this all together, we need an actual HTML page to host the dashboard and some JavaScript to handle the user interactions. Take a look at video-topic-summary.html and video-topic-summary.js; it's all pretty self-explanatory.

4. Test your dashboard locally

BANG! Your dashboard is ready. If you want to test your dashboard locally, just go to analytics/webapps/dashboard/ and run ./main.py -d. You can then debug your dashboard at http://localhost:5000/.

5. Install your dashboard globally

Commit and push your code to the analytics repo if you haven't already. Then log into the analytics machine, run cd analytics && git pull, and run
sudo service dashboards-daemon restart
Then you can connect to http://dashboards.khanacademy.org/ and see your new dashboard in action!
 