TIQView Blog RSS

TIQ Solutions - Spend Quality Time with your Data!

Contact

Impressum

Archive

Member of LXQ - The League of eXtraordinary Qliketeers

Ralf Becher on QlikCommunity

Ralf Becher on GitHub

DZone Most-Valuable-Blogger

Profil von Ralf Becher auf LinkedIn anzeigen

Google+

Add this blog to my Technorati Favorites!

My Blogroll

Apr
22nd
Mon
permalink

Using Impala for Self Service BI

The Situation

Large amounts of daily delivered raw data lead into a constantly increasing load and transformation time in enterprise data warehouse environments realized with classical databases. In a current client project (telco) we encounter the problem that it already exceeds the daily processing timeframe and can only be handled by dropping older data every day. Also, ad hoc queries will take hours even on optimized R-OLAP models with daily partitioning and bitmap indexing (Oracle in this case) which will prevent users from doing self service analysis. The assumption is to replace the classical database with a Hadoop cluster to provide adequate load and query time for a large growing (maybe unlimited) data set.

The Use Case

  • raw data from multiple delivery systems as csv files
  • time slots (granularity 15 minutes) as partition criteria
  • long term inspection within big data (one source data amount per day: 400 million rows, 120 GB volume, 96 partitions )
  • possibility to create ‘free’ ad hoc queries through a well know query language (SQL)
  • JDBC drivers to use raw data with third party tools like SQuirrel or QlikView
  • extraction of data sets with more than 1 million rows for further in-memory analysis

Our Decision for Hadoop and Hive / Impala

  • Hadoop HDFS can store an unlimited amount of data
  • massive parallel processing through unlimited amount of nodes
  • Hive query language (HQL) as SQL dialect with less learning
  • Hive meta store for schema definition, used by Impala
  • improved query performance through Impala (omit Map/Reduce overhead and job execution sequence)
  • Capex and Opex (this would fill a long thread indeed..)

Pitfalls and Experiences

The transforming of an existing table into a partitioned one may be complicated by the limitation of Hive partitions and memory space. We tried to import 3 days (96 slices by 15 minutes per day) with time slots and device ids as partition and sub-partition criteria which would create about 14,400,000 partitions (3 * 96 * 50,000).

If one partition criteria equals to a column in the csv file you must either transform this file during import to remove this column or be careful to use the right one in your select to get Impala not to scan the whole table.

As Impala 0.6 does not support timestamp or date types you have to store the time as a big integer (eg. Unix epoche). Also there is no chance to create a where clause with human readable (literal) date and time values. You would need to use Hive function unix_timestamp but this would omit the use of partitions.

Impala Partitioning

Also be careful on querying multiple partitions with an ‘or’ conjunction. It does not use the partitions any more, you have to transfer your query into a form like ‘select … where condition1 union all select … where condition2’.

Conclusion

We have set up an 8 node Hadoop cluster with Cloudera distribution CDH 4.2 over Cloudera Manager 4.5 and Impala 0.6. The node hardware configuration is:

  • 4 x 3.3 GHz Intel Xeon E3-1230v2 (Ivy Bridge)
  • 16 GB RAM
  • 2 x 1,000 GB RAID 1

We have loaded three days of raw data from one source:

  • 20,832 files
  • 1.2 billion records
  • 370 GB file size in total
  • 289 partitions

Typical analysis query:

  • extract raw data slice of 3 hrs per day (for the given 3 days) for one dimension (device), maybe additional filter conditions

Query execution time including transport of result set to client application via JDBC:

  • classical database: some hrs
  • Cloudera Impala: < 2 min

Impala QlikView Self Service BI

These results are based on Impala 0.6; Impala 0.7 contains several improvements to consider in the next steps..

One more thing..

As we noticed the Apache Thrift project used by Hive and Impala for communication between client and server has deficits on transporting big data volumes. We are currently working on a solution (eg. JDBC driver) which will bypass the protocol on large result sets.

Mar
7th
Thu
permalink

Going Qonnections 2013?

Hi Folks,

it’s only 4.5 weeks to Qonnections 2013. As in the last two years I release my QlikView Qonnections application “QonnView” with the actual Qonnections Session schedule, the Twitter search, a Bahamas Weather Forecast and NAS Flight Information:

Download QonnView2013.zip

QonnView2013 screenshot

QonnView 2013 now comes with a Gantt chart for better session planning..

Hope to see you over there. Have fun on Paradise Island!

- Ralf

Feb
10th
Sun
permalink

QVD Converter - Create and Process QlikView Data Files everywhere

The QVD Converter is a reusable Java library from TIQ Solutions. The library enables Java applications to create and process QlikView data files (QVD format) from any size independently (e.g.. in Unix environments).

image

The offered Java classes can be used directly in any Java implementation (via import) and are a useful starting point for the development of plug-ins for ETL tools like Pentaho Kettle. With this, QlikView data files are integratable in any data integration process.

image

At this moment, TIQ Solutions develops a QVD JDBC driver based on the QVDReader class. This JDBC driver will make QlikView QVD data files available as queryable SQL data sources for a broad range of applications implementing the JDBC standard.

Read more about it on TIQ Solutions QVD Converter website:

http://www.tiq-solutions.de/display/enghome/QVD+Converter

I’ve pushed my latest QVDReader examples on GitHub:

https://github.com/ralfbecher/QlikView_QVDReader_Examples

Update: I’v added new features for version 1.3:

  • select a subset of fields, skip unwanted data (in preparation for QVD JDBC driver and for better performance)
  • cache of last records symbol values to prevent lookups (performance)
  • more example applications for batch processing (e.g. search over a bunch of QVD files with wildcards for files and search pattern)

Feb
4th
Mon
permalink

Graph Data Visualization in QlikView

Since I wrote the post about “Working with Graph Data from Neo4j in QlikView” here I had the idea in mind to develop a QlikView Extension Object do replace the static Google Chart API graph display I used in the example application.

Now I’ve finished a first version of an extension using the Graph Dracula JavaScript Library (based on Raphaël.js): http://www.graphdracula.net

Here are a few impressions:

TIQ Graph Dracula screenshot 1

TIQ Graph Dracula screenshot 2

You can download the extension and QlikView demo application here.

I will do some further developments in the future related to QlikView graph data integration because I think relational data structures and models will disappear in the future. Reality is different than a table..

TIQ Solutions will provide professional services in this realm, also development and customization of QlikView extension objects.