SQOOP file format

How many file formats does Sqoop support?

Sqoop supports the following four file formats for the import operation.

1. Text file format.  

2. Avro file format.  

3. Sequence file format.  

4. Parquet file format.  


The first is a plain-text format; the other three (Avro, SequenceFile and Parquet) are binary formats.
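As a rough sketch, the mapping from format to import option can be written as a small shell helper. The function name format_flag is hypothetical and not part of Sqoop; the four options it prints are the Sqoop import options for these formats. The last line is a dry run that only prints a command and does not execute Sqoop.

```shell
# Hypothetical helper (not part of Sqoop): print the import option for a format name.
format_flag() {
  case "$1" in
    text)     echo "--as-textfile" ;;
    avro)     echo "--as-avrodatafile" ;;
    sequence) echo "--as-sequencefile" ;;
    parquet)  echo "--as-parquetfile" ;;
    *)        echo "unknown format: $1" >&2; return 1 ;;
  esac
}

# Dry run: print the command instead of running it.
echo "sqoop import --table cities $(format_flag avro)"
```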

 

Text file format in Sqoop:

1. Text file format: Apache Sqoop uses the text file format as the default when importing data from a SQL database into HDFS.

Because it is the default, the --as-textfile option is optional; specify it in the sqoop command when you want to state the format explicitly.


1.1. With --as-textfile:

Example:

sqoop import --connect jdbc:mysql://localhost:3306/database-name \
  --username root --password mypassword \
  --table cities --target-dir /user/YT/textfile.txt \
  --as-textfile

1.2. Without --as-textfile:

Example:

sqoop import --connect jdbc:mysql://localhost/database-name --username root --password mypassword --table cities  
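With the default delimiters, a text import writes one record per line with the fields separated by commas. The sketch below imitates such an output file locally so the layout is easy to see. The file name part-m-00000 follows the usual naming of map-task output, and the rows are made up for illustration; they do not come from a real cities table.

```shell
# Create a local file that imitates a text-format import (sample rows are assumed).
printf '1,USA,Palo Alto\n2,Czech Republic,Brno\n' > part-m-00000

# Each line is one record; split on commas to read a single column.
cut -d, -f3 part-m-00000
```

Because every value is stored as text, a consumer has to split and convert the fields itself, which is the parsing cost that the binary formats avoid.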


2. Avro file format:

Apache Avro is a generic data serialization system. Avro uses a schema to describe the data structures stored within the file; the schema is usually encoded as a JSON string. Sqoop generates the schema automatically from the metadata it retrieves from the database. Avro output is enabled by specifying the --as-avrodatafile parameter.


Example:

sqoop import --connect jdbc:mysql://localhost:3306/database-name \
  --username root --password mypassword \
  --table cities \
  --target-dir /user/YT/textfile.txt \
  --as-avrodatafile
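To make the schema idea concrete, here is a hand-written sketch of the kind of JSON schema Avro uses for a record. The column names id, country and city are assumptions for illustration only; the schema Sqoop actually generates depends on the table's metadata and may carry extra attributes.

```shell
# Write an illustrative Avro schema to a local file.
# The field names (id, country, city) are assumed, not taken from a real table.
cat > cities.avsc <<'EOF'
{
  "type": "record",
  "name": "cities",
  "fields": [
    {"name": "id",      "type": ["null", "int"]},
    {"name": "country", "type": ["null", "string"]},
    {"name": "city",    "type": ["null", "string"]}
  ]
}
EOF

# Count the fields whose type is a union (here: nullable columns).
grep -c '"type": \[' cities.avsc
```

Each field pairs a name with a type; the union with "null" is how a nullable column can be expressed in Avro.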


3. Sequence file format in Sqoop:

SequenceFiles are a binary format that stores individual records in custom record-specific data types. This format supports exact storage of all data in binary representation and is appropriate for storing binary data.

This file type is enabled by specifying the --as-sequencefile parameter. For MapReduce programs, reading from SequenceFiles performs better than reading from text files, because records do not need to be parsed.

The SequenceFile is a special Hadoop file format used for storing objects that implement the Writable interface. Because the format was designed for MapReduce, each record consists of a (key, value) pair; Sqoop has no concept of key-value pairs, so it uses an empty object called NullWritable in place of the value.


Example:

sqoop import --connect jdbc:mysql://localhost:3306/database-name \
  --username root --password mypassword \
  --table cities \
  --target-dir /user/YT/textfile.txt \
  --as-sequencefile \
  --where "city = 'usa'"
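A --where condition that contains spaces or quotes has to reach Sqoop as a single argument, so it should be wrapped in double quotes with the SQL string literal in single quotes. The sketch below shows how the shell splits the arguments in each case; count_args is a hypothetical helper used only to make the splitting visible.

```shell
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Quoted: the option and the whole condition arrive as two arguments.
count_args --where "city = 'usa'"

# Unquoted: the shell splits the condition into separate words.
count_args --where city = 'usa'
```

The first call prints 2 and the second prints 4, which is why the unquoted form does not pass the condition through intact.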


4. Parquet file format in Sqoop:

Parquet is a column-oriented binary file format designed to be highly efficient for large-scale queries. It is similar to other columnar storage formats such as RCFile and ORC. It was developed as a joint effort between Twitter and Cloudera. To import data as Parquet files, specify the --as-parquetfile parameter.

Example:

sqoop import --connect jdbc:mysql://localhost:3306/database-name \
  --username root --password mypassword \
  --table cities \
  --target-dir /user/YT/textfile.txt \
  --as-parquetfile \
  --where "city = 'usa'"

 

 

 

 

 

 

 
