Hi All,
I’m sorry but this isn’t going to be an exciting blog about some of HANA’s cool features.
It's a bit boring, but the bread and butter of all databases is having tables of data that can be queried and joined.
For those of you just starting out, or the more adventurous who want a large volume of data to test out performance tuning options, the TPC Benchmark™ H (TPC-H) might be for you.
TPC-H provides 8 tables and a data generation tool that enables you to create as much data as you want.
Generate the test data, create the tables, import and you’re away.
The 8 tables are all designed to be joined; together they simulate sales order transactions.
The tables include (using SAP style terminology):
- Customers (Customer Master)
- Supplier (Vendor Master)
- Part (Material Master records)
- Partsupp (Inventory Balances)
- Nation (Country)
- Region
- Orders (Sales Order Header)
- Lineitem (Sales Order line items)
NOTE: The tables are NOT in the same format as the similar SAP ECC tables, but they offer information similar to what you might find in ECC.
The Table Schema is as follows:
NOTE: SF is the Scaling factor used to generate the data.
For example, a scaling factor of 1 will generate 6 million records for the lineitem table, SF 2 will generate 12 million, SF 10 will generate 60 million, SF 100 will generate 600 million, and so on.
As a quick summary you will need to:
- Log into Linux with the appropriate authorization
- Install a C/C++ compiler if you’ve not already
- Download/Extract the TPC-H files
- Modify the ‘makefile’ to be relevant for Linux
- Compile the data generation tool (DBGEN)
- Run DBGEN to create data files based on the scaling factor
- Create the tables and import them in HANA studio
If you’re still reading then I’ve not scared you off, so now for the technical bit.
Log into Linux with the appropriate authorization
Use PuTTY or an equivalent tool to log into the Linux box running HANA, at the OS level.
For those using an AWS HANA developer box, refer to http://scn.sap.com/docs/DOC-28294
Install a C/C++ compiler if you’ve not already
sudo zypper install gcc gcc-c++ gcc-fortran
cd /usr/bin/
sudo ln -s gcc cc
Download/Extract the TPC-H files
Create a new directory to store the TPC-H files, ensure there is enough space on the mounted folder, then download and extract the files.
On AWS I have chosen the following:
cd /sap
mkdir tpc-h
cd tpc-h
wget http://www.tpc.org/tpch/spec/tpch_2_15.0.zip
unzip tpch_2_15.0.zip
Modify the ‘makefile’ to be relevant for Linux
Use 'vi' or equivalent to modify the supplied 'makefile' which is used to compile the data generator tool.
In the 'makefile' ensure the following parameters are set:
DATABASE = SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH
NOTE: We use SQLSERVER above because the makefile doesn't recognise HANA as a database option.
For an intro to 'vi' you could try http://heather.cs.ucdavis.edu/~matloff/UnixAndC/Editors/ViIntro.html
At the command prompt:
cd tpch_2_15.0/dbgen
vi makefile.suite
Edit the file and save.
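If you'd rather not edit the file by hand, 'sed' can make the same three changes non-interactively. This is a minimal sketch, assuming the stock makefile.suite layout where each variable appears once at the start of a line; 'patch_makefile' is just an illustrative helper name, and you may need to adjust the patterns if your copy differs.

```shell
# patch_makefile FILE - set the three TPC-H build variables in FILE, in place.
patch_makefile() {
    sed -i \
        -e 's/^DATABASE[[:space:]]*=.*/DATABASE = SQLSERVER/' \
        -e 's/^MACHINE[[:space:]]*=.*/MACHINE = LINUX/' \
        -e 's/^WORKLOAD[[:space:]]*=.*/WORKLOAD = TPCH/' \
        "$1"
}

# Usage, from inside the dbgen directory:
# patch_makefile makefile.suite
```

Afterwards a quick `grep -E '^(DATABASE|MACHINE|WORKLOAD)' makefile.suite` confirms the values took effect.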
Compile the data generation tool (DBGEN)
Now we need to compile the generation tool to produce the executable file 'dbgen':
make -f makefile.suite
The dbgen executable should now appear in the folder.
Check with:
ls -l
Run DBGEN to create data files based on the scaling factor
Now you are ready to create 'tbl' files for each of the 8 tables.
The important parameter of the 'dbgen' program is the scaling factor.
A scaling factor of 1 will generate 1GB of uncompressed test data (e.g. 6 million rows in the 'lineitem' table).
A scaling factor of 2 will generate 2GB of uncompressed test data (e.g. 12 million rows in the 'lineitem' table).
A scaling factor of 10 will generate 10GB of uncompressed test data (e.g. 60 million rows in the 'lineitem' table).
A scaling factor of 100 will generate 100GB of uncompressed test data (e.g. 600 million rows in the 'lineitem' table).
etc.
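Since everything scales linearly, you can estimate the lineitem row count for any scaling factor up front. A quick sketch using the SF 1 baseline of 6 million rows from the table above ('rows_for_sf' is just an illustrative helper):

```shell
# rows_for_sf SF - expected lineitem row count for a given scaling factor.
# SF 1 yields 6,000,000 lineitem rows and roughly 1GB of raw data in
# total; both grow linearly with SF.
rows_for_sf() {
    echo $(( $1 * 6000000 ))
}

rows_for_sf 1    # 6000000
rows_for_sf 100  # 600000000
```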
Execute the tool with one of the following commands with different scaling factors (or your own):
#Scaling factor 1
./dbgen -vf -s 1
#Scaling factor 2
./dbgen -vf -s 2
#Scaling factor 10
./dbgen -vf -s 10
[Note: I suggest you try scaling factor 1 first, until you are comfortable with the load process]
Create the tables and import them in HANA studio
create schema "TPCH";
set schema "TPCH";
drop table nation;
drop table region;
drop table supplier;
drop table part;
drop table partsupp;
drop table customer;
drop table orders;
drop table lineitem;
create column table nation (
n_nationkey decimal(3,0) not null,
n_name char(25) not null,
n_regionkey decimal(2,0) not null,
n_comment varchar(152)
);
create column table region (
r_regionkey decimal(2,0) not null,
r_name char(25) not null,
r_comment varchar(152)
);
create column table part (
p_partkey decimal(10,0) not null,
p_name varchar(55) not null,
p_mfgr char(25) not null,
p_brand char(10) not null,
p_type varchar(25) not null,
p_size decimal(2,0) not null,
p_container char(10) not null,
p_retailprice decimal(6,2) not null,
p_comment varchar(23) not null
);
create column table supplier (
s_suppkey decimal(8,0) not null,
s_name char(25) not null,
s_address varchar(40) not null,
s_nationkey decimal(3,0) not null,
s_phone char(15) not null,
s_acctbal decimal(7,2) not null,
s_comment varchar(101) not null
);
create column table partsupp (
ps_partkey decimal(10,0) not null,
ps_suppkey decimal(8,0) not null,
ps_availqty decimal(5,0) not null,
ps_supplycost decimal(6,2) not null,
ps_comment varchar(199) not null
);
create column table customer (
c_custkey decimal(9,0) not null,
c_name varchar(25) not null,
c_address varchar(40) not null,
c_nationkey decimal(3,0) not null,
c_phone char(15) not null,
c_acctbal decimal(7,2) not null,
c_mktsegment char(10) not null,
c_comment varchar(117) not null
);
create column table orders (
o_orderkey decimal(12,0) not null,
o_custkey decimal(9,0) not null,
o_orderstatus char(1) not null,
o_totalprice decimal(8,2) not null,
o_orderdate date not null,
o_orderpriority char(15) not null,
o_clerk char(15) not null,
o_shippriority decimal(1,0) not null,
o_comment varchar(79) not null
);
create column table lineitem (
l_orderkey decimal(12,0) not null,
l_partkey decimal(10,0) not null,
l_suppkey decimal(8,0) not null,
l_linenumber decimal(1,0) not null,
l_quantity decimal(2,0) not null,
l_extendedprice decimal(8,2) not null,
l_discount decimal(3,2) not null,
l_tax decimal(3,2) not null,
l_returnflag char(1) not null,
l_linestatus char(1) not null,
l_shipdate date not null,
l_commitdate date not null,
l_receiptdate date not null,
l_shipinstruct char(25) not null,
l_shipmode char(10) not null,
l_comment varchar(44) not null
);
IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/lineitem.tbl' INTO "TPCH"."LINEITEM" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';
IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/customer.tbl' INTO "TPCH"."CUSTOMER" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';
IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/nation.tbl' INTO "TPCH"."NATION" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';
IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/orders.tbl' INTO "TPCH"."ORDERS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';
IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/part.tbl' INTO "TPCH"."PART" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';
IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/partsupp.tbl' INTO "TPCH"."PARTSUPP" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';
IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/region.tbl' INTO "TPCH"."REGION" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';
IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/supplier.tbl' INTO "TPCH"."SUPPLIER" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';
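Before moving on, it's worth sanity-checking the loads. Each '.tbl' file holds one record per line, so its line count should match a 'select count(*)' on the corresponding HANA table. A small sketch ('count_tbl_rows' is an illustrative helper, and the directory argument should point at wherever dbgen wrote its output):

```shell
# count_tbl_rows DIR - print the expected row count for each .tbl file
# in DIR. One line in a .tbl file corresponds to one imported record.
count_tbl_rows() {
    for f in "$1"/*.tbl; do
        printf '%s %s\n' "$(basename "$f")" "$(grep -c '' "$f")"
    done
}

# Usage:
# count_tbl_rows /sap/tpc-h/tpch_2_15.0/dbgen
```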
Congratulations!
You are now ready to start running queries.
Section 2.4 of the TPC-H specification (http://www.tpc.org/tpch/spec/tpch2.15.0.pdf) has plenty of example queries for you to try.
You may need to translate these to HANA SQL, or be more adventurous and create them with Attribute, Analytic & Calculation views to optimise performance.
NOTE: For those that successfully translate the queries to HANA SQL then please consider adding your translated SQL as a comment to this blog, to share with others.
Have fun benchmarking your results against others.
------------------------------------------------------------------------------------------
BTW: I used TPC-H data in my first blog comparing HANA to Hadoop Impala.