Some useful background...

Several years ago we (Venatus Media) created a big-data platform to dynamically segmenting users for the purposes of better targeting our adverts. For this we used Amazon's Redshift. We tried several architectures to ingest our then 20m records per day. The data itself was pulled directly from AppNexus via their log-level data (LLD) service. The data being ingested was fully-normalized and to re-arrange this data to suit a flat denormalized structure for a NoSQL database was incredibly compute intensive. It required millions of atomic update/insert operations.

And more recently...

With the introduction of header-bidding we found ourselves having to ingest data from a large number of partners and provide a holistic view our network. We needed to load our aggregated data (~1.2 billion records and growing fast) into a database that would allow us to run daily reports and enable visualization of our data.

We had thousands of questions to ask. Some were simple, others required complex architecture and plenty of resource allocated to crunch the data quick enough for the end user.

We again looked at Amazon's Redshift, but the cost of running a cluster big enough to compute the data we needed at speed would set us back many thousands of dollars each month. On top of this we needed a business intelligence tool that could plug into our chosen databases and create charts.

Our COO stumbled onto Metabase, a free-to-use BI tool which came to market towards the end of last year (Oct 2015). Having used Chart.io previously when their pricing was sensible, there is a distinct lack of good BI tools on the market. All the platforms out there like Looker, Tableau, Chart.io et al. are overpriced and inflexible. We wanted to run and operate on our own hardware, something nobody else seemed to offer.

Metabase

Metabase has enabled us to see our data like never before. All our graphs were previously done by creating jobs, designing pages manually in our platform and then using Flot, ChartJS or D3JS to visualize the data.

With Metabase we simply just connect the data source in the admin. All major databases are supported including MySQL, Druid, H2, Crate, PostgreSQL, MongoDB, Redshift, SQL Server, SQLite and BigQuery.

Metabase will automatically analyse your database, understand the structure of tables and allow you to annotate, rename and alter this all within the UI.

We opted to shift our aggregated MongoDB data directly to Google's BigQuery. The two main reasons were cost and speed, both incredibly important for this project. Google's BigQuery is fairly inexpensive depending on frequency of use, which for us is relatively small. There are some headaches caused by the architecture which prevents updates and deletes to documents/rows. The database itself is append only, but made easy enough to connect to via Python G-Cloud Library.

We wrote a MongoDB -> BigQuery push script in Python. It automatically detects the structure in MongoDB and for smaller tables does a WRITE_TRUNCATE, and for the much larger raw and aggregated sets, a WRITE_APPEND operation.

def get_key_schema(key, value):  
  key_type = None
  if type(value) is Int64:
    k = (str(key))
    #special internal rule
    if k.startswith('app') and k.endswith('_id'):
      key_type = 'STRING'
    else:
      key_type = 'INTEGER'
  elif type(value) is ObjectId:
    key_type = 'STRING'
  elif type(value) is bool:
    key_type = 'BOOLEAN'
  elif type(value) is unicode:
    key_type = 'STRING'
  elif type(value) is datetime.datetime:
    key_type = 'TIMESTAMP'
  elif type(value) is float:
    key_type = 'FLOAT'
  ...

Report scheduling

Metabase has built-in scheduled reports to deliver a custom report, created by adding questions directly to specified team members. This can be used to send a daily P&L report to Cxx/Finance or actionable insight to operations teams.

Next few quarters

We expect Metabase to resolve a number of the problem issues listed in their Github. They have done an amazing job so far and this platform is a real contender to paid equivalents. BigQuery is already fairly cheap to run, but in a race to the bottom and a need to be competitive, prices will no doubt fall further. There are a number of feature requests for Metabase and we have been active on GitHub in reporting issues and suggesting improvements.