Getting Started with Big Data Query using Apache Impala 9781716108396, 171610839X

This book is designed for anyone who learns how to get started with Apache Impala. The book covers SQL queries and data


English Pages [120] Year 2021



Author / Uploaded
Agus Kurniawan


Getting Started with Big Data Query using Apache Impala

© 2021 Agus Kurniawan

PE Press

ISBN: 978-1-716-10839-6

Preface

This book provides an alternative approach to getting started with big data queries using Apache Impala. It describes how to work with Apache Impala and how to perform queries in Apache Impala.

Agus Kurniawan

Depok, February 2021

Table of Contents

Getting Started with Big Data Query using Apache Impala
Preface
1. Introduction to Apache Impala
1.1 Introduction
1.2 Installing Apache Impala
1.3 Setting up Lab Demo
2. Working with Apache Impala Shell
2.1 Introduction
2.2 Connecting to Apache Impala Service
2.3 Performing SQL Query with Apache Impala Service
2.4 Executing SQL Query on Apache Impala Shell in Non-Interactive Mode
2.5 Executing A SQL Query File with Apache Impala Shell
2.6 Quit from Apache Impala Shell
3. SQL Querying with Apache Hue and Apache Impala
3.1 Setting up Apache Hue
3.2 Connecting Apache Hue to Apache Impala
3.3 Performing SQL Query for Apache Impala
3.4 Working Apache Hue with GetHue Demo Website
4. Loading Dataset to Apache Impala
4.1 Introduction
4.2 Creating Table for Delimited Files
4.3 Testing Query
5. Basic SQL Query for Apache Impala
5.1 Introduction
5.2 Creating and Deleting Databases
5.3 Creating and Deleting Tables
5.4 Inserting and Selecting Data
5.5 Updating and Deleting Data
5.6 Truncating Table Data
5.7 Filtering Data
5.8 Calling Built-in Functions
5.9 Distinct
5.10 Ordering Data
5.11 Grouping
5.12 Having
5.13 Limit and Offset
5.14 Creating and Selecting Views
6. Joining Query and Subquery on Apache Impala
6.1 Introduction
6.2 Joining Query
6.2.1 Inner Join
6.2.2 Left Join
6.2.3 Right Join
6.2.4 Outer Join
6.3 Subquery
6.4 Union and Union All
6.5 With
7. Partition Data on Apache Impala
7.1 Introduction
7.2 Creating Partition Table
7.3 Exploring Partition Table Files on HDFS
8. Apache Impala Database Programming with Java
8.1 Introduction
8.2 Creating A Project
8.3 Connecting to Apache Impala
8.4 Getting All Data
8.5 Inserting Data
8.6 Completed Program Source Code
Contact

1. Introduction to Apache Impala

1.1 Introduction

Apache Impala is a modern, open source, distributed SQL query engine for Apache Hadoop. With Impala, we can query data stored in HDFS, Apache Hive, or Apache HBase using SELECT, JOIN, and aggregate functions. You can find the official project at this link. In this book, we learn how to perform queries on Apache Impala.

1.2 Installing Apache Impala

In this section, I use Cloudera Manager to install Apache Impala. You can also install Apache Impala on Linux manually. My Cloudera Manager is shown in the figure below.

To add a Hadoop service using Cloudera Manager, you can click Add Server on the context menu, as shown in the figure below.

After clicking it, you can install Apache Impala. Make sure you also install HDFS, HBase, and Hue.

Once installed, we can start to work with Apache Impala.

1.3 Setting up Lab Demo

You can set up Apache Impala with Cloudera Manager or on your own Linux machine. For the demo, I use Apache Impala in a Cloudera environment, deployed on Ubuntu Linux.

2. Working with Apache Impala Shell

2.1 Introduction

Apache Impala provides a service and a shell. In this chapter, we learn how to work with the Apache Impala shell. To show the Impala shell version, you can type this command.

$ impala-shell --version

You will see the Impala shell version in your terminal. My Impala shell version is shown in the figure below.

Next, we will work with Impala shell.

2.2 Connecting to Apache Impala Service

To start the Impala shell, open a terminal on your Apache Impala server. Then, type this command.

$ impala-shell

This connects to your local Impala service. After that, you will see the Impala shell, as shown in the figure below.

If your Impala service is off, you will get an error message, as shown in the figure below.

How do we connect to the Impala shell from a remote machine? First, the Impala server machine must have the Impala port open. The Impala shell usually uses port 21000. You can then open the Impala shell by passing the server's IP address to the -i option, as below.

$ impala-shell -i

The figure below shows my remote Impala server being accessed from another Apache Impala machine.

Once connected, we can work in the Impala shell. To show all commands in the Impala shell, press TAB on your keyboard. You should then see a list of Impala commands, as shown in the figure below.

Next, we will learn how to perform queries in the Apache Impala shell.

2.3 Performing SQL Query with Apache Impala Service

We can show all databases in Apache Impala using this command.

show databases;

Type this command in the Impala shell. You will then see a list of Impala databases, as shown in the figure below.

We can also list all tables in an Impala database. First, we switch to the database, and then we run the show tables command.

use ak_testdb;

show tables;

Here is my program output. You can see a list of tables in the ak_testdb Impala database.

We can list all data from a table using the SELECT..FROM statement. For instance, we print all data from the employees table. You can type this SQL script.

select * from employees;

You should see the rows of the employees table in the Impala shell, as shown in the figure below.

You can perform standard SQL queries in the Impala shell.

2.4 Executing SQL Query on Apache Impala Shell in Non-Interactive Mode

Previously, we performed SQL queries after entering the Impala shell. We can also run a SQL query without entering the Impala shell. We pass the SQL query with the -q parameter and set the Impala database with the -d parameter. For instance, we want to run the SQL query "show tables" on the ak_testdb database. You can type this command in the Linux shell.

$ impala-shell -i localhost -d ak_testdb -q "show tables"

If it succeeds, you should see a list of Impala tables in the Linux terminal, as shown in the figure below.

As another sample, we can execute the SQL query "select * from employees;". We pass this query with the -q parameter. You can type this command. Please change the -d parameter value to your database name.

$ impala-shell -i localhost -d ak_testdb -q "select * from employees"

You should see the rows of the employees table in the Linux terminal. My program output is shown in the figure below.
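If you want to trigger such a non-interactive query from a program, one option is to assemble the same impala-shell arguments in code. The following is only a minimal sketch: the class name ImpalaShellCommand and the method build are made up for illustration, and only the -i, -d, and -q options shown above are used. It prints the command rather than running it, so it works without an Impala installation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: assemble an impala-shell call for non-interactive use (hypothetical helper). */
final class ImpalaShellCommand {

    /** Builds the argument list for: impala-shell -i host -d database -q query. */
    static List<String> build(String host, String database, String query) {
        return new ArrayList<>(Arrays.asList(
                "impala-shell", "-i", host, "-d", database, "-q", query));
    }

    public static void main(String[] args) {
        List<String> command = build("localhost", "ak_testdb", "show tables");
        // Print the command instead of running it, so the sketch works without Impala.
        System.out.println(String.join(" ", command));
        // To run it for real (assumes impala-shell is on the PATH and the service is up):
        //   Process process = new ProcessBuilder(command).inheritIO().start();
        //   int exitCode = process.waitFor();
    }
}
```

Passing the query as a single list element means no extra shell quoting is needed when the list is handed to ProcessBuilder.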

2.5 Executing A SQL Query File with Apache Impala Shell

If you have SQL queries in a file, you can run it with the Impala shell. For instance, we have the following queries.

use ak_testdb;

select * from employees;

In this case, I use the nano editor to write the SQL queries to a file, as shown in the figure below.

Save these queries to a file called demo.sql. Then, we can run this SQL file by passing it with the -f parameter. Make sure you set the Impala database with the -d parameter.

$ impala-shell -i localhost -d ak_testdb -f demo.sql

You should see the results of the queries in the file. My program output from the demo.sql query file is shown in the figure below.

2.6 Quit from Apache Impala Shell

We can quit the Impala shell by typing exit;. My program output in the Impala shell is shown in the figure below.

exit;

3. SQL Querying with Apache Hue and Apache Impala

3.1 Setting up Apache Hue

Apache Hue is a web tool that can be used to perform queries on Apache Impala. It plays a role similar to MySQL Workbench for MySQL or SQL Server Management Studio for SQL Server. We can use Apache Hue to write queries for Apache Impala easily. Because it is a web application, we only need a browser to access it. If you have the Cloudera platform, you can install Apache Hue using Cloudera Manager. Add a new service in your existing Cloudera Manager. Click Hue and then install it, as shown in the figure below.

After the installation completes, you can open Apache Hue. The first time you open Apache Hue, you will be asked to enter a username and password for the admin account. You can also add additional users who will access Apache Hue.

3.2 Connecting Apache Hue to Apache Impala

After installing it, you can open Apache Hue. You will see the SQL editor, as shown in the figure below.

Apache Hue provides editors for various SQL database engines. You can click the Query button to see the list of supported SQL editors. Select the Impala option to work with the Impala SQL editor, as shown in the figure below.

Now you can see the SQL editor for Impala, as shown in the figure.

You can now write SQL queries in this editor.

3.3 Performing SQL Query for Apache Impala

We can write any SQL query on Apache Hue for Impala. For instance, we write the following query.

show databases;

Then, click the blue arrow button (the Run button) to execute your query. You can also run specific query lines by highlighting the query text in the editor and then clicking the blue arrow button. You will see the output of your query, as shown in the figure below.

You can change the database in your editor. Click the Impala database and then click the database that you want to work with.

After selecting it, you will see your database in the Query editor, as shown in the figure below.

All queries written in the Query editor are executed on your selected database unless you switch databases explicitly with a "use your_database;" statement.

For testing, write the following query to show all tables in the current database.

show tables;

Run this query and you will see the output, as shown in the figure below.

You can write any SQL query in the Query editor. For instance, we perform a query to retrieve all data from the employees table.

select * from employees;

Query output:

3.4 Working Apache Hue with GetHue Demo Website

You may not have an Impala server and Apache Hue on your computer but still want to learn Apache Impala and Hue. In that case, you can use Apache Hue on the GetHue demo website. The website is shown in the figure below.

Click the TRY HUE NOW button. You will then get a login form, as shown in the figure below.

Type demo for both the username and the password. Click the Sign In button. After that, you will get the Hue editor. To work with Impala, change the editor and select the Table (Impala) option.

Now you can write SQL queries in this editor. This website is useful when you don't want to install Apache Impala and Hue.

4. Loading Dataset to Apache Impala

4.1 Introduction

We can create a table by loading dataset files. In this chapter, we create a table that maps to a dataset file. For the demo, we use an expense dataset, expense.csv. This file has the following headers:

Transaction ID, Age, Items, Monthly Income, Transaction Time, Record, Gender, City Tier, Total Spend

In this demo, we remove the header line from the expense.csv file so that the file contains only data.

4.2 Creating Table for Delimited Files

To implement this demo, we first copy the expense.csv file to HDFS. For instance, we copy it to the HDFS folder /user/datasci/datasets/demo/exp/, where /user/datasci/ is the user folder on HDFS. You can change it to match your HDFS account. We create the folder first if it does not exist yet. To copy a file from the local server, we can use the -copyFromLocal option of the hdfs dfs command. My bash commands are as follows.

$ hdfs dfs -mkdir -p /user/datasci/datasets/demo/exp/

$ hdfs dfs -chmod -R 777 /user/datasci/datasets/demo/exp/

$ hdfs dfs -copyFromLocal ./expense.csv /user/datasci/datasets/demo/exp/

Now we can create a table that points to the HDFS folder. For instance, we have CSV files in this HDFS folder. You create a table with the following SQL query. Change /user/datasci/datasets/demo/exp/ to your HDFS folder.

drop table if exists expense;

create table expense(

id string,

age int,

items int,

Monthly_Income int,

Transaction_Time string,

Record int,

Gender string,

City_Tier string,

Total_Spend double

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

LOCATION '/user/datasci/datasets/demo/exp/';

You can type these SQL queries in Apache Hue, as shown in the figure below.

After the queries run, our Impala table is mapped to the folder.
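To make the mapping from a delimited line to typed columns more concrete, here is a small Java sketch that splits one comma-separated line into the nine columns declared above. It only illustrates the idea of "fields terminated by ','" and is not how Impala reads files internally. The class ExpenseLineParser and the sample line are made up for illustration and are not taken from the real dataset.

```java
/** Sketch: turn one comma-delimited line into the typed columns of the expense table. */
final class ExpenseLineParser {

    /** One row, with the same column order as the create table statement. */
    static final class Expense {
        final String id;
        final int age;
        final int items;
        final int monthlyIncome;
        final String transactionTime;
        final int record;
        final String gender;
        final String cityTier;
        final double totalSpend;

        Expense(String[] f) {
            id = f[0];
            age = Integer.parseInt(f[1].trim());
            items = Integer.parseInt(f[2].trim());
            monthlyIncome = Integer.parseInt(f[3].trim());
            transactionTime = f[4];
            record = Integer.parseInt(f[5].trim());
            gender = f[6];
            cityTier = f[7];
            totalSpend = Double.parseDouble(f[8].trim());
        }
    }

    /** Splits one line on ',' and converts each field to its column type. */
    static Expense parse(String line) {
        String[] fields = line.split(",", -1); // limit -1 keeps empty trailing fields
        if (fields.length != 9) {
            throw new IllegalArgumentException("expected 9 fields but got " + fields.length);
        }
        return new Expense(fields);
    }

    public static void main(String[] args) {
        // Made-up sample line for illustration only.
        Expense e = parse("T0001,42,10,7313,2021-01-08 10:15:00,5,Female,Tier 1,4198.5");
        System.out.println(e.id + " " + e.gender + " " + e.cityTier + " " + e.totalSpend);
    }
}
```

The sketch also shows why the header line is removed from the file: a header such as "Age" cannot be converted to an int column.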

4.3 Testing Query

Now we can perform a SQL query to retrieve data from the expense table. For instance, you can type this SQL query in Apache Hue.

select * from expense;

You should see the expense data in the output, as shown in the figure below.

5. Basic SQL Query for Apache Impala

5.1 Introduction

In this chapter, we learn how to build SQL queries on Apache Impala. These are basic SQL queries that let you perform data processing with Apache Impala. Let's start.

5.2 Creating and Deleting Databases

First, we can create a database using the create database statement. For instance, we create a database with the following SQL query.

-- creating a database

create database if not exists ak_aa01;

The if not exists clause skips creating the database when it already exists. To delete a database, you can use the drop database statement. Type the following SQL query.

-- deleting a database

drop database if exists ak_aa01;

5.3 Creating and Deleting Tables

We can create a table using the create table statement. Make sure you set the working database with the use statement. When creating a table, we define columns with certain types. For instance, we create a table with the columns id, first_name, last_name, email, and created.

-- use database

use ak_demo;

-- creating a table

create table if not exists employees(

id int,

first_name string,

last_name string,

email string,

created timestamp

);

After creating the table, we can verify it using the describe statement.

-- describe table

describe employees;

You should see the table information in the query output, as shown in the figure below.

To delete a table, we can use the drop table statement.

-- drop a table

drop table if exists employees;

5.4 Inserting and Selecting Data

We can insert data using the insert into statement. For instance, we insert 5 rows into the Employees table. You can type these queries.

-- insert data

insert into Employees(id,first_name,last_name,email,created) values(1,'employee','1','[email protected]',now());

insert into Employees(id,first_name,last_name,email,created) values(2,'employee','2','[email protected]',now());

insert into Employees(id,first_name,last_name,email,created) values(3,'employee','3','[email protected]',now());

insert into Employees(id,first_name,last_name,email,created) values(4,'employee','4','[email protected]',now());

insert into Employees(id,first_name,last_name,email,created) values(5,'employee','5','[email protected]',now());

Now we can perform a query to retrieve all data from employees table. You can run this query.

-- select data

select * from employees;

Program output:

We can also limit the number of rows retrieved using the limit clause. For instance, we want to get 3 rows from the employees table.

-- select data with limit

select * from employees limit 3;

Query output:

5.5 Updating and Deleting Data

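We can update data using the update statement. For instance, we want to update the row with id=3 in the employees table. Note that Impala accepts update and delete statements only on table types that support row-level changes, such as Kudu tables; on a plain HDFS-backed table these statements fail. You can type this query.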

-- update

update Employees set first_name='updated_emp',last_name='update_3'

where id = 3;

We can also delete data using the delete from statement. For instance, we want to delete the row with id=3 from the employees table. You can type this query.

-- delete

delete from Employees where id=3;

5.6 Truncating Table Data

We can delete all data using the delete from statement. We can also use the truncate statement, which removes all rows from a table while keeping the table definition. For instance, we delete all data in the employees table.

truncate employees;

Then, we can verify the result by retrieving all data from the employees table. You should get an empty result.

select * from employees;

5.7 Filtering Data

When retrieving data from a table, we can filter the result. We use the where clause to filter data. For instance, we retrieve only the rows from the expense table whose gender is 'Female'.

-- filter

select * from expense where gender = 'Female' limit 5;

Query output:

When filtering data, we can build filter criteria based on table columns. We use the AND and OR operators to combine the criteria.

select * from expense where gender = 'Female' and monthly_income > 15000 limit 5;

Here is a sample query that combines filter criteria with parentheses.

select * from expense where (gender = 'Female' or city_tier like 'Tier 2') and monthly_income > 15000 limit 5;

Query output:
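The parentheses in the last where clause matter because AND binds more tightly than OR. The following Java sketch mirrors the two variants on a few in-memory rows. The class ExpenseFilter and the sample rows are made up for illustration; they are not part of the expense dataset.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: how parentheses change a filter that mixes AND and OR. */
final class ExpenseFilter {

    static final class Row {
        final String gender;
        final String cityTier;
        final int monthlyIncome;

        Row(String gender, String cityTier, int monthlyIncome) {
            this.gender = gender;
            this.cityTier = cityTier;
            this.monthlyIncome = monthlyIncome;
        }
    }

    /** (gender = 'Female' or city_tier like 'Tier 2') and monthly_income > 15000 */
    static boolean withParentheses(Row r) {
        return (r.gender.equals("Female") || r.cityTier.equals("Tier 2"))
                && r.monthlyIncome > 15000;
    }

    /** Same conditions without parentheses: the AND part is evaluated as one unit first. */
    static boolean withoutParentheses(Row r) {
        return r.gender.equals("Female")
                || r.cityTier.equals("Tier 2") && r.monthlyIncome > 15000;
    }

    /** Made-up rows for illustration only. */
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row("Female", "Tier 1", 10000),
                new Row("Male", "Tier 2", 20000),
                new Row("Female", "Tier 3", 16000),
                new Row("Male", "Tier 1", 30000));
    }

    public static void main(String[] args) {
        long grouped = sampleRows().stream().filter(ExpenseFilter::withParentheses).count();
        long ungrouped = sampleRows().stream().filter(ExpenseFilter::withoutParentheses).count();
        System.out.println("with parentheses: " + grouped);
        System.out.println("without parentheses: " + ungrouped);
    }
}
```

With these sample rows, the first row (Female, income 10000) passes only the variant without parentheses, because there the income condition applies to the 'Tier 2' branch alone.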

5.8 Calling Built-in Functions

Apache Impala has built-in functions such as min(), max(), count(). You can write this query for counting a number of data.

-- select count

select count(id) as total from employees;

Program output:

You can also find the minimum and maximum values in the expense table. You can type this query.

select max(age) as max, min(age) as min from expense;

Query output:

5.9 Distinct

We can get a list of the unique values in a table column using the distinct keyword. You can type this query for the demo.

-- distinct

select distinct city_tier from expense;

Program output:

5.10 Ordering Data

We can order our data based on table columns, in either ascending or descending order. For instance, we order the expense table data by the monthly_income column, first ascending and then descending.

-- order

select * from expense order by monthly_income asc limit 5;

select * from expense order by monthly_income desc limit 5;

You can see the query output as shown in Figure below.

We can also order data by two or more columns. For instance, we order the expense table data by the monthly_income and record columns.

select * from expense order by monthly_income desc, record asc limit 5;

select * from expense order by monthly_income desc, record desc limit 5;

You can see the query output in Figure below.

You can also order data by other columns, such as the items and record columns.

select * from expense order by items desc, record desc limit 5;

5.11 Grouping

We can group data by a certain column. When grouping data, we perform an aggregation on each group. For instance, we want to sum the monthly_income column for each group. You can see the following query sample.

-- group by

select city_tier, sum(monthly_income) from expense group by city_tier;

Query output:

We can group and order data in the same query. For instance, we group the data by city_tier and order the groups by their total.

select city_tier, sum(monthly_income) as total from expense group by city_tier order by total desc;

Query output:

5.12 Having

We can filter the groups when we group data by using the having clause. For instance, we take the previous query and add a having condition on sum(monthly_income).

select city_tier, sum(monthly_income) as total from expense group by city_tier having sum(monthly_income) > 12400000;

Query output:
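The difference between where and having is that having filters the aggregated groups, not the individual rows. The steps of group by, sum, and having can be sketched in Java as follows. The class GroupByHaving, the sample rows, and the threshold of 10000 are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: group by city_tier, sum monthly_income, then filter the groups. */
final class GroupByHaving {

    static final class Row {
        final String cityTier;
        final long monthlyIncome;

        Row(String cityTier, long monthlyIncome) {
            this.cityTier = cityTier;
            this.monthlyIncome = monthlyIncome;
        }
    }

    /** Like: select city_tier, sum(monthly_income) ... group by city_tier */
    static Map<String, Long> totalByTier(List<Row> rows) {
        Map<String, Long> totals = new TreeMap<>();
        for (Row r : rows) {
            totals.merge(r.cityTier, r.monthlyIncome, Long::sum);
        }
        return totals;
    }

    /** Like: having sum(monthly_income) > threshold (applied to the groups). */
    static Map<String, Long> having(Map<String, Long> totals, long threshold) {
        Map<String, Long> kept = new TreeMap<>();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            if (e.getValue() > threshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    /** Made-up rows for illustration only. */
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row("Tier 1", 10000), new Row("Tier 2", 5000),
                new Row("Tier 1", 20000), new Row("Tier 3", 12000),
                new Row("Tier 3", 3000));
    }

    public static void main(String[] args) {
        Map<String, Long> totals = totalByTier(sampleRows());
        System.out.println("totals: " + totals);
        System.out.println("having > 10000: " + having(totals, 10000L));
    }
}
```

Note that no single 'Tier 3' row has to exceed the threshold; the group passes because its sum does.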

5.13 Limit and Offset

We have already learned about limit. For instance, the following query shows 5 rows.

-- offset

select id,age,monthly_income from expense order by id desc limit 5;

Query output:

When we use the limit clause, the offset is 0 by default. The following query produces the same output as the previous query.

select id,age,monthly_income from expense order by id desc limit 5 offset 0;

Now we use offset 3 together with limit. We get 5 rows starting at index 3, that is, the first three rows are skipped.

select id,age,monthly_income from expense order by id desc limit 5 offset 3;
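The effect of limit and offset on an ordered result can be sketched with a plain list in Java. This is only an illustration of the semantics; the class LimitOffset and the ids 1 to 10 are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch: limit/offset applied to an already ordered list of rows. */
final class LimitOffset {

    /** Skips the first `offset` rows, then returns at most `limit` rows. */
    static <T> List<T> page(List<T> rows, int limit, int offset) {
        int from = Math.min(offset, rows.size());
        int to = Math.min(from + limit, rows.size());
        return new ArrayList<>(rows.subList(from, to));
    }

    /** Made-up ids 10, 9, ..., 1, as if produced by "order by id desc". */
    static List<Integer> idsDescending() {
        List<Integer> ids = new ArrayList<>();
        for (int id = 1; id <= 10; id++) {
            ids.add(id);
        }
        ids.sort(Collections.reverseOrder());
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(page(idsDescending(), 5, 0)); // limit 5 offset 0
        System.out.println(page(idsDescending(), 5, 3)); // limit 5 offset 3
    }
}
```

Because the offset counts rows of the ordered result, the order by clause is what makes a given page of rows predictable.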

5.14 Creating and Selecting Views

Apache Impala supports views. We can create a view from a table. A view does not store data; it only keeps the query definition. You can create a view using the create view statement. For instance, we create a view that selects from the employees table.

-- view

Create View if not exists myview as select * from employees;

Now we can perform a query to retrieve all data from myview view.

select * from myview;

You should see myview data on a query editor as shown in Figure below.

We can drop a view using drop view statement. For instance, we want to drop myview view. You can type this query.

drop view if exists myview;

6. Joining Query and Subquery on Apache Impala

6.1 Introduction

In the previous chapter, we learned to perform basic SQL queries on Apache Impala. Now we learn more about SQL queries that involve two or more tables. We cover the following topics:

Joining
Subquery
Union

Next, we learn about joining query in Apache Impala.

6.2 Joining Query

There are several ways to join two tables. We can perform the following join queries on Apache Impala:

Inner join
Left join
Right join
Union

These join models are illustrated in the figure below.

For the demo, we use the userorder and userorderdetail tables. You can use the describe statement to show each table schema.

---- joining

describe userorder;

describe userorderdetail;

6.2.1 Inner Join

For the first join model, we implement an inner join using the inner join .. on statement. For instance, we join userorder and userorderdetail, matching the id column of userorder to the orderid column of userorderdetail. You can see this sample query.

-- inner join

select t1.code, t1.username, t2.product, t2.quantity*t2.price as total from userorder t1

inner join userorderdetail t2 on t1.id = t2.orderid;

You can see the output of inner join query as shown in Figure below.

6.2.2 Left Join

Now we implement a left join for the same case as the previous section. We change the inner join statement to a left join statement.

-- left join

select t1.code, t1.username, t2.product, t2.quantity*t2.price as total from userorder t1

left join userorderdetail t2 on t1.id = t2.orderid;

Program output:

We can also show only the userorder rows that have no matching row in userorderdetail by filtering for a null orderid.

select t1.code, t1.username, t2.product, t2.quantity*t2.price as total from userorder t1

left join userorderdetail t2 on t1.id = t2.orderid where t2.orderid is null;

Program output:

6.2.3 Right Join

We can also implement a right join. We change the previous query to use the right join statement. You can write this query.

-- right join

select t1.code, t1.username, t2.product, t2.quantity*t2.price as total from userorder t1

right join userorderdetail t2 on t1.id = t2.orderid;

Program output:

We can show only the userorderdetail rows that have no matching row in userorder by filtering for a null id on the userorder side. You can write this query.

select t1.code, t1.username, t2.product, t2.quantity*t2.price as total from userorder t1

right join userorderdetail t2 on t1.id = t2.orderid where t1.id is null;

Program output:

6.2.4 Outer Join

We can combine two tables, similar to a union, using the full outer join statement, which keeps unmatched rows from both sides. You can run this sample query.

select t1.code, t1.username, t2.product, t2.quantity*t2.price as total from userorder t1

full outer join userorderdetail t2 on t1.id = t2.orderid;

Program output:

We can also keep only the rows that have no match on the other side by filtering for a null id or a null orderid. You can write this query.

select t1.code, t1.username, t2.product, t2.quantity*t2.price as total from userorder t1

full outer join userorderdetail t2 on t1.id = t2.orderid where t1.id is null or t2.orderid is null;

Program output:
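To see how the join variants above differ, the following Java sketch joins two small in-memory lists on id = orderid. It is a nested-loop illustration of the semantics only, not a description of how Impala executes joins. The class JoinSketch, the order codes, and the products are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: inner join, left join, and "left join ... where right side is null". */
final class JoinSketch {

    static final class Order {
        final int id;
        final String code;
        Order(int id, String code) { this.id = id; this.code = code; }
    }

    static final class Detail {
        final int orderId;
        final String product;
        Detail(int orderId, String product) { this.orderId = orderId; this.product = product; }
    }

    /**
     * Joins orders to details on order.id = detail.orderId.
     * keepUnmatched = false behaves like an inner join; keepUnmatched = true
     * behaves like a left join and uses null for the missing product.
     */
    static List<String[]> join(List<Order> orders, List<Detail> details, boolean keepUnmatched) {
        List<String[]> rows = new ArrayList<>();
        for (Order o : orders) {
            boolean matched = false;
            for (Detail d : details) {
                if (d.orderId == o.id) {
                    rows.add(new String[] {o.code, d.product});
                    matched = true;
                }
            }
            if (!matched && keepUnmatched) {
                rows.add(new String[] {o.code, null});
            }
        }
        return rows;
    }

    /** Left join plus a null filter: orders that have no detail rows. */
    static List<String> ordersWithoutDetails(List<Order> orders, List<Detail> details) {
        List<String> codes = new ArrayList<>();
        for (String[] row : join(orders, details, true)) {
            if (row[1] == null) {
                codes.add(row[0]);
            }
        }
        return codes;
    }

    /** Made-up data for illustration only. */
    static List<Order> sampleOrders() {
        return Arrays.asList(new Order(1, "ORD-1"), new Order(2, "ORD-2"), new Order(3, "ORD-3"));
    }

    static List<Detail> sampleDetails() {
        return Arrays.asList(new Detail(1, "pen"), new Detail(1, "ink"), new Detail(3, "cap"));
    }

    public static void main(String[] args) {
        System.out.println("inner join rows: " + join(sampleOrders(), sampleDetails(), false).size());
        System.out.println("left join rows: " + join(sampleOrders(), sampleDetails(), true).size());
        System.out.println("orders without details: " + ordersWithoutDetails(sampleOrders(), sampleDetails()));
    }
}
```

A right join is the same idea with the roles of the two lists swapped, and a full outer join keeps the unmatched rows of both sides.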

6.3 Subquery

We can perform a query inside another query. This is called a subquery. The following sample selects the orders that have at least one detail row whose quantity*price is greater than 10.

-- subquery

select * from userorder where id in

(select distinct orderid from userorderdetail where (quantity*price) > 10);

Program output:

6.4 Union and Union All

We can merge the results of two queries using the union statement. If the same row appears in both results, it is returned only once, so a union contains no duplicate rows.

-- union

select * from userorder limit 3

union

select * from userorder limit 2;

You can see my program output. We merge three rows with two rows. Since some rows are duplicates, the query result shows only three rows.

We can merge two results without removing duplicate rows using the union all statement. You can see the following sample query.

-- union all

select * from userorder limit 3

union all

select * from userorder limit 2;

Here we again merge three rows with two rows. This time the query result shows 5 rows.
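The difference between union and union all can be sketched with Java collections: union all is plain concatenation, while union also removes duplicate rows. The class UnionSketch and the row values below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: union removes duplicate rows, union all keeps them. */
final class UnionSketch {

    /** Like "union all": every row of both inputs, duplicates included. */
    static <T> List<T> unionAll(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    /** Like "union": each distinct row once (first-seen order is kept here for readability). */
    static <T> List<T> union(List<T> a, List<T> b) {
        Set<T> distinct = new LinkedHashSet<>(unionAll(a, b));
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        // Made-up rows: the two-row input repeats the first two rows of the three-row input.
        List<String> threeRows = Arrays.asList("order-1", "order-2", "order-3");
        List<String> twoRows = Arrays.asList("order-1", "order-2");
        System.out.println("union: " + union(threeRows, twoRows).size() + " rows");
        System.out.println("union all: " + unionAll(threeRows, twoRows).size() + " rows");
    }
}
```

Keeping insertion order is a choice of this sketch only; a SQL union gives no ordering guarantee unless the query adds an order by clause.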

6.5 With

We can use the with clause to give a name to a query. For instance, we define t1 and t2 as named queries. Then, we can perform a query on t1 and t2. You can see the following sample query.

-- with

with t1 as (select * from userorderdetail where price > 5),

t2 as (select * from userorderdetail where price < 2)

(select * from t1 union select * from t2);

This is my query output.

7. Partition Data on Apache Impala

7.1 Introduction

If the data in a table grows large, we can partition the table. In this chapter, we create a partitioned table on Apache Impala. Let's start!

7.2 Creating Partition Table

We can create a partitioned table in Impala. For the demo, we use three partition columns: year, month, and day. In this case, we create a news table with the following partition definition.

create table if not exists news (title string, content string)

partitioned by (year int, month int , day int)

row format delimited fields terminated by ',';

After creating it, we can verify that the news table exists. You can type this command.

show tables;

You will see a list of tables as shown in Figure below.

For the demo, we insert some data, passing the three partition values (year, month, day). You can type the following SQL scripts.

insert into news partition(year=2021,month=1,day=8)

values('News 1','Lorem ipsum dolor sit amet, consectetur adipiscing elit.');

insert into news partition(year=2021,month=1,day=8)

values('News 2','Aenean erat eros, aliquam eu posuere at, placerat at justo');

insert into news partition(year=2021,month=1,day=7)

values('News 3','Nullam laoreet accumsan leo eu tristique');

insert into news partition(year=2020,month=12,day=5)

values('News 4','Proin auctor augue eget dictum interdum');

insert into news partition(year=2020,month=12,day=6)

values('News 5','In fermentum accumsan laoreet.');

Now we can verify by querying the news table. You should see the rows of the news table.

Next, we explore how the data of the partitioned table is stored.

7.3 Exploring Partition Table Files on HDFS

Now we can explore how our data is stored for the Impala table. Impala database storage is usually located in the /user/hive/warehouse/ HDFS folder. You can list all database files using the HDFS command below.

$ hdfs dfs -ls /user/hive/warehouse

We can also list all tables inside a database. For instance, we list the tables in the ak_demo.db database folder. Type this command.

$ hdfs dfs -ls /user/hive/warehouse/ak_demo.db/

After running it, you can see a list of tables in the ak_demo.db database. My output is shown in the figure below.

Next, we list the contents of the news table folder.

$ hdfs dfs -ls /user/hive/warehouse/ak_demo.db/news/

You will see the contents of the news table folder. It contains one folder per partition value: year=2020 and year=2021. For instance, we list the year=2021 partition folder.

$ hdfs dfs -ls /user/hive/warehouse/ak_demo.db/news/year=2021

You can see the month partition folders. Let us list the partition folder month=1.

$ hdfs dfs -ls /user/hive/warehouse/ak_demo.db/news/year=2021/month=1

You will see the day partition folders. Now we list the partition folder day=8.

$ hdfs dfs -ls /user/hive/warehouse/ak_demo.db/news/year=2021/month=1/day=8

Now you can see the actual data files for the partition year=2021, month=1, day=8. As another sample, we list the partition folder for year=2021, month=1, day=7.

$ hdfs dfs -ls /user/hive/warehouse/ak_demo.db/news/year=2021/month=1/day=7

You should see the data files under the partition year=2021, month=1, day=7. You can practice by creating tables with your own partition columns.
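The folder layout explored above follows a simple pattern: one nested folder per partition column, named column=value. A small Java sketch of that naming pattern is shown below. The class PartitionPaths and the method partitionDir are made up for illustration; the paths themselves are the ones listed with hdfs dfs -ls above.

```java
/** Sketch: build the HDFS folder of a (year, month, day) partition of a table. */
final class PartitionPaths {

    /** Returns tableDir/year=Y/month=M/day=D, tolerating a trailing slash on tableDir. */
    static String partitionDir(String tableDir, int year, int month, int day) {
        String base = tableDir.endsWith("/")
                ? tableDir.substring(0, tableDir.length() - 1)
                : tableDir;
        return base + "/year=" + year + "/month=" + month + "/day=" + day;
    }

    public static void main(String[] args) {
        String newsTable = "/user/hive/warehouse/ak_demo.db/news";
        System.out.println(partitionDir(newsTable, 2021, 1, 8));
        System.out.println(partitionDir(newsTable, 2020, 12, 5));
    }
}
```

This layout is why a query that filters on year, month, and day only needs to read the files in the matching folders.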

8. Apache Impala Database Programming with Java

8.1 Introduction

In this chapter, we learn how to access Apache Impala from a program. We use a Java application as the sample client application. To access Apache Impala, we can use the ODBC and JDBC drivers from Cloudera. You can find them on https://www.cloudera.com/downloads.html as shown in the figure below.

For the demo, we use the JDBC driver for Impala. Download it and then extract the zip file. You should see some JDBC driver files, as shown in the figure below.

We will use the JDBC 4.2 driver for our Java application. Next, we create a Java application project.

8.2 Creating A Project

You can create a Java application using any editor. In this book, I use JetBrains IntelliJ IDEA, which is available in a Community Edition. You can download it from the JetBrains website.

Now we can create a new project using IntelliJ IDEA. You can select a Java application with a project template, as shown in the figure below.

Then, click the Next button. You should see a dialog, as shown in the figure below.

Check the Create project from template option. After that, click the Next button. You should get a form, as shown in the figure below.

Fill in the project name and project folder. You can see my project name and folder in the figure above. Click the Finish button to complete the project creation. You will then get an editor with the following template code.

package id.ilmudata;

public class Main {

    public static void main(String[] args) {
    }
}

Now we open the project structure and add the JDBC driver file for Impala to our project. Select Libraries and add the JDBC driver file for Impala. My project structure is shown in the figure below.

Click the Apply and OK buttons to close the dialog. Next, we write Java code to connect to the Apache Impala server.

8.3 Connecting to Apache Impala

To connect to the Apache Impala server, we create a Connection object from the JDBC API. We pass a JDBC URL, as in the CONNECTION_URL value in the code below. Set the host to your Impala server's IP address or hostname. You can also change the database and the Impala port to match your environment.

We create a getConnection() function to connect to the Impala server. We use the com.cloudera.impala.jdbc.Driver class name for the Cloudera Impala JDBC driver. You can write the following code for the implementation.

package id.ilmudata;

import java.sql.*;

public class Main {

    // Put your Impala host between "//" and ":21050"; adjust the port and database as needed.
    static String CONNECTION_URL = "jdbc:impala://:21050/database";

    public static void main(String[] args) {
        try {
            Connection conn = getConnection();
            if (conn != null) {
                System.out.println("Connected");
            } else {
                System.out.println("Not connected");
                return;
            }
            conn.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static Connection getConnection() throws Exception {
        Connection connection;
        // Load the Cloudera Impala JDBC driver class.
        Class.forName("com.cloudera.impala.jdbc.Driver");
        connection = DriverManager.getConnection(CONNECTION_URL);
        return connection;
    }
}
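As a variation on the listing above, a try-with-resources block can close the connection automatically, even when an exception is thrown. This is only a sketch: ConnectionCheck and canConnect() are hypothetical names, the URL in main() is an example value, and the code assumes the driver JAR registers itself with DriverManager (as JDBC 4 drivers do). If yours does not, keep the Class.forName() call from the listing above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class ConnectionCheck {

    // Returns true if a connection could be opened (it is closed again on exit).
    static boolean canConnect(String url) {
        // try-with-resources closes the connection even if an exception occurs.
        try (Connection conn = DriverManager.getConnection(url)) {
            return conn != null;
        } catch (SQLException ex) {
            // Raised, for example, when no registered driver accepts the URL
            // or the server cannot be reached.
            System.out.println("Not connected: " + ex.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Example URL only; adjust host, port, and database to your server.
        System.out.println(canConnect("jdbc:impala://localhost:21050/default"));
    }
}
```

Catching SQLException rather than Exception keeps unrelated programming errors visible while still reporting connection problems.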

If we obtain a Connection object, our program has connected to the Apache Impala server. Next, we can retrieve data from Apache Impala.

8.4 Getting All Data

In this section, we retrieve all data from Apache Impala. We create a showAllData() method that takes a Connection as its parameter. To retrieve data, we run a SELECT query: we create a Statement object, pass the query text to executeQuery(), and obtain a ResultSet that acts as a cursor. We then loop over the ResultSet and print each row to the terminal using println().

The following is the code for the showAllData() method.

public static Integer showAllData(Connection conn) throws Exception {
    System.out.println("------------show data-----------");
    String sql = "SELECT * FROM employees";
    Statement statement = conn.createStatement();
    ResultSet result = statement.executeQuery(sql);
    Integer total = 0;
    // Walk the result set row by row and print each employee.
    while (result.next()) {
        Integer id = result.getInt("id");
        String first_name = result.getString("first_name");
        String last_name = result.getString("last_name");
        String email = result.getString("email");
        Timestamp created = result.getTimestamp("created");
        String output = "Employee #%d: %s - %s - %s %s";
        System.out.println(String.format(output, id, first_name, last_name, email, created));
        total++;
    }
    // Release the cursor and the statement.
    result.close();
    statement.close();
    System.out.println("--------------------------");
    return total;
}
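The same loop can also be written with try-with-resources, so that the Statement and ResultSet are closed even if reading a row fails. The sketch below is one possible arrangement, not the book's program: EmployeeReader, formatRow(), and printAll() are hypothetical names, and the table and column names are those used in showAllData().

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.sql.Timestamp;

class EmployeeReader {

    // Formats one row in the same layout that showAllData() prints.
    static String formatRow(int id, String firstName, String lastName,
                            String email, Timestamp created) {
        return String.format("Employee #%d: %s - %s - %s %s",
                id, firstName, lastName, email, created);
    }

    // Prints every row of the employees table and returns the row count.
    static int printAll(Connection conn) throws SQLException {
        int total = 0;
        // Both resources are closed automatically, in reverse order.
        try (Statement statement = conn.createStatement();
             ResultSet result = statement.executeQuery("SELECT * FROM employees")) {
            while (result.next()) {
                System.out.println(formatRow(
                        result.getInt("id"),
                        result.getString("first_name"),
                        result.getString("last_name"),
                        result.getString("email"),
                        result.getTimestamp("created")));
                total++;
            }
        }
        return total;
    }
}
```

Keeping the formatting in its own method makes it possible to check the output layout without a running Impala server.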

Save this code. Next, we insert data into Apache Impala.

8.5 Inserting Data

We can insert data into Apache Impala using the SQL INSERT INTO statement. Since the query takes input parameters, we create a PreparedStatement object to bind them. We use Impala's now() function to insert the current date and time. We call the executeUpdate() method to execute the INSERT query.

We implement the insert in the insertData() method. The following is the complete code for insertData().

public static void insertData(Connection conn, Integer id, String first_name,
                              String last_name, String email) throws Exception {
    System.out.println("------------inserting data-----------");
    String sql = "INSERT INTO employees (id, first_name, last_name, email, created) VALUES (?, ?, ?, ?, now())";
    PreparedStatement statement = conn.prepareStatement(sql);
    // Bind the four parameters in the order of the ? placeholders.
    statement.setInt(1, id);
    statement.setString(2, first_name);
    statement.setString(3, last_name);
    statement.setString(4, email);
    statement.executeUpdate();
    // Release the statement.
    statement.close();
    System.out.println("--------------------------");
}

Save this code. Next, we call these methods from our main program.

8.6 Completed Program

In this scenario, we open a connection to Apache Impala. Once connected, we retrieve all data by calling the showAllData() method, which returns the number of rows; we store it in the total variable. Next, we insert a new employee with id = total + 1. After that, we show all data again.

The following is the complete code for our program.

package id.ilmudata;

import java.sql.*;

public class Main {

    // Put your Impala host between "//" and ":21050"; adjust the port and database as needed.
    static String CONNECTION_URL = "jdbc:impala://:21050/database";

    public static void main(String[] args) {
        try {
            Connection conn = getConnection();
            if (conn != null) {
                System.out.println("Connected");
            } else {
                System.out.println("Not connected");
                return;
            }

            // show data
            Integer total = showAllData(conn);

            // insert data
            insertData(conn,total+1,"new-first","newlast","[email protected]");

            // show data
            showAllData(conn);

            conn.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static Connection getConnection() throws Exception {
        Connection connection;
        // Load the Cloudera Impala JDBC driver class.
        Class.forName("com.cloudera.impala.jdbc.Driver");
        connection = DriverManager.getConnection(CONNECTION_URL);
        return connection;
    }

    public static Integer showAllData(Connection conn) throws Exception {
        System.out.println("------------show data-----------");
        String sql = "SELECT * FROM employees";
        Statement statement = conn.createStatement();
        ResultSet result = statement.executeQuery(sql);
        Integer total = 0;
        // Walk the result set row by row and print each employee.
        while (result.next()) {
            Integer id = result.getInt("id");
            String first_name = result.getString("first_name");
            String last_name = result.getString("last_name");
            String email = result.getString("email");
            Timestamp created = result.getTimestamp("created");
            String output = "Employee #%d: %s - %s - %s %s";
            System.out.println(String.format(output, id, first_name, last_name, email, created));
            total++;
        }
        // Release the cursor and the statement.
        result.close();
        statement.close();
        System.out.println("---------------------------");
        return total;
    }

    public static void insertData(Connection conn, Integer id, String first_name,
                                  String last_name, String email) throws Exception {
        System.out.println("------------inserting data-----------");
        String sql = "INSERT INTO employees (id, first_name, last_name, email, created) VALUES (?, ?, ?, ?, now())";
        PreparedStatement statement = conn.prepareStatement(sql);
        // Bind the four parameters in the order of the ? placeholders.
        statement.setInt(1, id);
        statement.setString(2, first_name);
        statement.setString(3, last_name);
        statement.setString(4, email);
        statement.executeUpdate();
        // Release the statement.
        statement.close();
        System.out.println("---------------------------");
    }
}

Now we can run this program. Make sure you have set the IP address of your Apache Impala server in CONNECTION_URL. If Impala runs on your own machine, you can probably use localhost.

After running it, you should see the program show a list of employees, insert a new row, and show all data again. You can see my program output in the figure below.
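One caveat about the program: using total + 1 as the new id assumes the existing ids are exactly 1 through total. If that does not hold, the new id can duplicate an existing one. A possible alternative, sketched here under that caveat, asks Impala for the largest id instead. NextId, MAX_ID_SQL, and nextEmployeeId() are hypothetical names and not part of the program above; the table and column names are the ones used in this chapter.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class NextId {

    // Hypothetical query: largest id currently stored in the employees table.
    static final String MAX_ID_SQL = "SELECT MAX(id) FROM employees";

    // Returns the largest existing id plus one.
    static int nextEmployeeId(Connection conn) throws SQLException {
        try (Statement statement = conn.createStatement();
             ResultSet result = statement.executeQuery(MAX_ID_SQL)) {
            // MAX() over an empty table is NULL; getInt() returns 0 for NULL,
            // so the first id handed out is 1.
            int maxId = result.next() ? result.getInt(1) : 0;
            return maxId + 1;
        }
    }
}
```

In main(), the call insertData(conn, NextId.nextEmployeeId(conn), ...) could then take the place of total + 1. This still does not protect against two programs inserting at the same time.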

Source Code

You can download the source code for this book from http://www.makers.id/ak/impala2361.zip

Contact

If you have questions related to this book, please contact me at [email protected] . My blog: http://blog.aguskurniawan.net .