
Upload data from Azure blob storage to Hive tables

January 5, 2016

This section describes how to ingest data into Azure Blob storage target environments, where the data can be stored and processed during the analytics process.

What is Azure Blob Storage?

Azure Blob storage is a service for storing large amounts of unstructured data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately.

Common uses of Blob storage include:

  • Serving images or documents directly to a browser
  • Storing files for distributed access
  • Streaming video and audio
  • Performing secure backup and disaster recovery
  • Storing data for analysis by an on-premises or Azure-hosted service

[Diagram: Blob storage concepts — a storage account (e.g., sally) contains containers (e.g., movies) that hold blobs (e.g., MOV1.AVI)]

  • Storage Account: All access to Azure Storage is done through a storage account
  • Container: A container provides a grouping of a set of blobs. All blobs must be in a container. An account can contain an unlimited number of containers. A container can store an unlimited number of blobs.
  • Blob: A file of any type and size. Azure Storage offers three types of blobs: block blobs, page blobs, and append blobs.

    Block blobs are ideal for storing text or binary files, such as documents and media files. Append blobs are similar to block blobs in that they are made up of blocks, but they are optimized for append operations, so they are useful for logging scenarios. A single block blob or append blob can contain up to 50,000 blocks of up to 4 MB each, for a total size of slightly more than 195 GB (50,000 blocks × 4 MB = 200,000 MB, about 195.3 GB).

    Page blobs can be up to 1 TB in size, and are more efficient for frequent read/write operations. Azure Virtual Machines use page blobs as OS and data disks.

Naming and referencing containers and blobs

You can address a blob in your storage account using the following URL format:

http://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-name>  

For example, here is a URL that addresses one of the blobs in the diagram above:

http://sally.blob.core.windows.net/movies/MOV1.AVI

Container naming rules

A container name must be a valid DNS name and conform to the following rules:

  • A container name must be all lowercase.
  • Container names must start with a letter or number, and can contain only letters, numbers, and the dash (-) character.
  • Every dash (-) character must be immediately preceded and followed by a letter or number; consecutive dashes are not permitted in container names.
  • Container names must be from 3 through 63 characters long.

Every blob in Azure storage must reside in a container. The container forms part of the blob name. For example, mycontainer is the name of the container in these sample blob URIs:

https://storagesample.blob.core.windows.net/mycontainer/blob1.txt
https://storagesample.blob.core.windows.net/mycontainer/photos/myphoto.jpg

Blob naming rules

A blob name must conform to the following rules:

  • A blob name can contain any combination of characters.
  • A blob name must be at least one character long and cannot be more than 1,024 characters long.
  • Blob names are case-sensitive.
  • Reserved URL characters must be properly escaped.
  • The number of path segments comprising the blob name cannot exceed 254. A path segment is the string between consecutive delimiter characters (e.g., the forward slash ‘/’) that corresponds to the name of a virtual directory.

The Blob service is based on a flat storage scheme, but you can create a virtual hierarchy by specifying a character or string delimiter within the blob name. For example, the following list shows some valid and unique blob names:

/a
/a.txt
/a/b
/a/b.txt

You can use the delimiter character to list blobs hierarchically.

Create Hive database and tables

CREATE EXTERNAL HIVE TABLE

create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
(
field1 string,
field2 int,
field3 float,
field4 double,
...,
fieldN string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' lines terminated by '<line separator>'
STORED AS TEXTFILE LOCATION '<storage location>'
TBLPROPERTIES("skip.header.line.count"="1");

Here are the descriptions of the fields that users need to plug in and other configurations:

  • <database name>: the name of the database users want to create. If users just want to use the default database, the query create database … can be omitted.
  • <table name>: the name of the table users want to create within the specified database. If users want to use the default database, the table can be referred to directly as <table name> without <database name>.
  • <field separator>: the separator that delimits fields in the data file to be uploaded to the Hive table.
  • <line separator>: the separator that delimits lines in the data file.
  • <storage location>: the Azure storage location where the data of the Hive tables is saved. If users do not specify LOCATION <storage location>, the database and the tables are stored in the hive/warehouse/ directory in the default container of the Hive cluster. If a user wants to specify a storage location, it has to be within the default container for the database and tables. The location has to be referred to as a location relative to the default container of the cluster, in the format 'wasb:///<directory 1>/' or 'wasb:///<directory 1>/<directory 2>/', etc. After the query is executed, the relative directories are created within the default container. Otherwise, to create an external Hive table on an additional container of the cluster, the <storage location> can be a configuration value '${hiveconf:<defined path>}', with <defined path> defined in the Activity of an Azure Data Factory pipeline in the format 'wasb://<additional container name>@<storage account name>.blob.core.windows.net/<directory 2>'.
  • TBLPROPERTIES("skip.header.line.count"="1"): If the data file has a header line, users have to add this property at the end of the create table query. Otherwise, the header line will be loaded as a record into the table. If the data file does not have a header line, this configuration can be omitted from the query.
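To make the template concrete, here is a minimal sketch that fills in the placeholders. All names below (the nyctaxidb database, the trip table, its columns, and the wasb:/// directory) are hypothetical, chosen only for illustration; the data is assumed to be comma-separated with a header row:

create database if not exists nyctaxidb;
CREATE EXTERNAL TABLE if not exists nyctaxidb.trip
(
    medallion string,
    trip_time_in_secs int,
    trip_distance float,
    total_amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 'wasb:///nyctaxidbdata/trip/'
TBLPROPERTIES("skip.header.line.count"="1");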

LOAD DATA INTO HIVE TABLES

Here is the Hive query that loads data into a Hive table.

LOAD DATA INPATH '<path to blob data>' INTO TABLE <database name>.<table name>;

<path to blob data>:

If the blob file to be uploaded to the Hive table is in the default container of the HDInsight Hadoop cluster, the <path to blob data> should be in the format 'wasb:///<directory in this container>/<blob file name>'.

The blob file can also be in an additional container of the HDInsight Hadoop cluster. In this case, <path to blob data> should be in the format 'wasb://<container name>@<storage account name>.blob.core.windows.net/<blob file name>'.
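As a hedged illustration of the two path formats (the file, container, and storage account names are hypothetical, continuing the nyctaxidb example assumed above):

-- Blob file in the default container of the cluster:
LOAD DATA INPATH 'wasb:///nyctaxidbdata/trip_data_1.csv' INTO TABLE nyctaxidb.trip;

-- Blob file in an additional container of the cluster:
LOAD DATA INPATH 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/trip_data_1.csv' INTO TABLE nyctaxidb.trip;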

Partitioned tables and storing Hive data in Optimized Row Columnar (ORC) format

If the data is large, partitioning the table is beneficial for queries that only need to scan a few partitions of the table. For instance, it is reasonable to partition the log data of a web site by dates.

In addition to partitioning Hive tables, it is also beneficial to store the Hive data in the Optimized Row Columnar (ORC) format. 

Partitioned table

CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<table name>
(field1 string,
...
fieldN string
)
PARTITIONED BY (<partitionfieldname> vartype) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
     lines terminated by '<line separator>' 
TBLPROPERTIES("skip.header.line.count"="1");

LOAD DATA INPATH '<path to the source file>' INTO TABLE <database name>.<partitioned table name> 
    PARTITION (<partitionfieldname>=<partitionfieldvalue>);
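As a concrete sketch of the two templates above, consider a hypothetical web-log table partitioned by date (the weblogdb database, weblogs table, columns, and file path are all illustrative assumptions, not from the original post):

CREATE EXTERNAL TABLE IF NOT EXISTS weblogdb.weblogs
(
    request_url string,
    status_code int
)
PARTITIONED BY (logdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    lines terminated by '\n'
TBLPROPERTIES("skip.header.line.count"="1");

-- Load one day of logs into its own partition:
LOAD DATA INPATH 'wasb:///logdata/2016-01-05.log' INTO TABLE weblogdb.weblogs
    PARTITION (logdate='2016-01-05');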

When querying partitioned tables, it is recommended to add the partition condition at the beginning of the WHERE clause, as this can significantly improve query efficiency.

select 
    field1, field2, ..., fieldN
from <database name>.<partitioned table name> 
where <partitionfieldname>=<partitionfieldvalue> and ...;
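Continuing the hypothetical web-log example, the partition predicate leads the WHERE clause so Hive only scans the one partition:

select
    request_url, status_code
from weblogdb.weblogs
where logdate='2016-01-05' and status_code >= 500;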

Store Hive data in ORC format

Users cannot directly load data from blob storage into Hive tables that are stored in the ORC format. Here are the steps users need to take to load data from Azure blobs into Hive tables stored in ORC format.

  1. Create an external table STORED AS TEXTFILE and load data from blob storage to the table.

CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<external textfile table name>
(
    field1 string,
    field2 int,
    fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
    lines terminated by '<line separator>'
STORED AS TEXTFILE LOCATION 'wasb:///<default container directory in Azure blob>'
TBLPROPERTIES("skip.header.line.count"="1");

LOAD DATA INPATH '<path to the source file>' INTO TABLE <database name>.<external textfile table name>;

2. Create an internal table with the same schema as the external table in step 1, with the same field delimiter, and store the Hive data in the ORC format.

CREATE TABLE IF NOT EXISTS <database name>.<ORC table name>
(
    field1 string,
    field2 int,
    fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
STORED AS ORC;

3. Select data from the external table in step 1 and insert it into the ORC table.

INSERT OVERWRITE TABLE <database name>.<ORC table name>
    SELECT * FROM <database name>.<external textfile table name>;

NOTE:

If the TEXTFILE table <database name>.<external textfile table name> has partitions, in STEP 3, the SELECT * FROM <database name>.<external textfile table name> command will select the partition variable as a field in the returned data set. Inserting it into <database name>.<ORC table name> will fail, since <database name>.<ORC table name> does not have the partition variable as a field in the table schema. In this case, users need to specifically select the fields to be inserted into <database name>.<ORC table name> as follows:

INSERT OVERWRITE TABLE <database name>.<ORC table name> PARTITION (<partition variable>=<partition value>)
   SELECT field1, field2, ..., fieldN
   FROM <database name>.<external textfile table name> 
   WHERE <partition variable>=<partition value>;

4. After all data has been inserted into <database name>.<ORC table name>, it is safe to drop the <external textfile table name> with the following query:

DROP TABLE IF EXISTS <database name>.<external textfile table name>;
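Putting steps 1 through 4 together, here is a hedged end-to-end sketch. It reuses the hypothetical nyctaxidb names assumed earlier; the staging directory, file path, and column list are illustrative only:

-- Step 1: external TEXTFILE staging table over the blob data
CREATE EXTERNAL TABLE IF NOT EXISTS nyctaxidb.trip_textfile
(
    medallion string,
    trip_time_in_secs int,
    pickup_datetime date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 'wasb:///nyctaxidbdata/staging/'
TBLPROPERTIES("skip.header.line.count"="1");

LOAD DATA INPATH 'wasb:///nyctaxidbdata/trip_data_1.csv' INTO TABLE nyctaxidb.trip_textfile;

-- Step 2: internal table with the same schema, stored as ORC
CREATE TABLE IF NOT EXISTS nyctaxidb.trip_orc
(
    medallion string,
    trip_time_in_secs int,
    pickup_datetime date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;

-- Step 3: copy the data from the staging table into the ORC table
INSERT OVERWRITE TABLE nyctaxidb.trip_orc
    SELECT * FROM nyctaxidb.trip_textfile;

-- Step 4: drop the staging table once the copy has completed
DROP TABLE IF EXISTS nyctaxidb.trip_textfile;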

 

https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/#what-is-blob-storage
