What is AWS Glue Crawler


Introduction to AWS Glue Crawler:

Glue Crawler is a feature of AWS Glue that connects to one or more data sources and extracts their metadata. It inspects the sources and generates the column names, types, sizes and other information. Hence it got its name: Crawler.


Why AWS Glue Crawler is important:

For a moment, imagine the crawler did not exist. We would have to go into every data source manually and extract the column names, types, sizes and so on. If a table is small, that is quite achievable, but if a table has hundreds of columns, it is not practical to do manually. Hence, we need the crawler!

The crawler goes into the specified data sources and creates metadata. In other words, it fills what is called the Data Catalog. The Data Catalog is a kind of metadata storehouse. Once the Data Catalog is ready, we can build ETL jobs on top of it.
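To make the Data Catalog idea concrete, here is a minimal boto3 sketch that lists whatever databases and tables are already registered in it (assuming boto3 is installed and your AWS credentials are configured; pagination is skipped for brevity):

import boto3

# The Glue client gives programmatic access to the Data Catalog
glue = boto3.client("glue")

# Walk every database in the Data Catalog and print its tables
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])
    for table in glue.get_tables(DatabaseName=db["Name"])["TableList"]:
        print("  Table:", table["Name"])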

How to create a crawler manually:

For this experiment, I have kept a CSV file in an S3 bucket. The Crawler will use this CSV file to generate metadata. The CSV data has three columns: firstname, lastname and address. Sample data is shown below:

firstname,lastname,address
Aarav,Sharma,"123 Green St, Mumbai"
Aisha,Khan,"456 Blue Rd, Delhi"
Rohan,Verma,"789 Red Ave, Bangalore"
Sneha,Patel,"101 Yellow Ln, Hyderabad"
Vikram,Gupta,"202 Orange Blvd, Chennai"
Isha,Nair,"303 Pink St, Kolkata"
Kabir,Mishra,"404 Purple Rd, Pune"
Tanya,Aggarwal,"505 Brown Ave, Jaipur"
Manish,Jain,"606 Silver Ln, Lucknow"
Priya,Saxena,"707 Gold Blvd, Chandigarh"
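If you want to reproduce this setup, a small boto3 sketch like the one below uploads the CSV into a folder in your bucket (the bucket name and key here are placeholders, not the exact ones I used):

import boto3

s3 = boto3.client("s3")

# The key places the file inside a folder (prefix), because the crawler
# will later be pointed at the folder, not at the file itself.
s3.upload_file(
    Filename="student.csv",            # local sample file
    Bucket="my-glue-demo-bucket",      # placeholder bucket name
    Key="student-data/student.csv",    # folder prefix + file name
)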

Crawler Classifier:

We need to create a classifier for the crawler because a crawler without a classifier sometimes cannot differentiate between the data and the column names. Therefore, we create the classifier before we create the crawler! Now we shall move to the Crawler page and build our configuration. We shall start with Classifiers.

crawler classifier
Click on Add classifier. Fill in the classifier details as shown in the image below and click on Create.
classifier details
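If you prefer scripting the console steps, the same CSV classifier can be created with boto3. This is only a sketch, and the classifier name "student" simply mirrors this walkthrough:

import boto3

glue = boto3.client("glue")

# CSV classifier equivalent to the console form above:
# comma-delimited, double quote as the quote symbol,
# and the first row treated as column headings.
glue.create_classifier(
    CsvClassifier={
        "Name": "student",
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)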
Once your classifier is created, let's move to the Crawler creation page and click on Create crawler.
crawlerName
When we click on Next, we move to the “Choose data sources and classifiers” step, as shown in the picture below.
Select “Not yet”: in this case, we do not have any tables in our Data Catalog. In other words, we have not manually created a database and table in Athena. If we had, we would have selected “Yes”.
data sources and classifier
Add a data source:
The most important setting: the S3 path must always point to a folder, not to a file.
add data source
Then select the student classifier.
student classifier
Select the Glue role:
The IAM role that I have selected has permissions for: S3 full access, AWS Glue Console full access, CloudWatch Logs full access, a custom Glue policy for updating tables, databases and crawlers, and iam:PassRole.
glue role
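For reference, a role with an equivalent trust relationship and the AWS-managed policies named above can be sketched in boto3 like this (the role name is a placeholder, and your account may prefer tighter, least-privilege policies instead of the full-access ones):

import json
import boto3

iam = boto3.client("iam")

# Trust policy so the Glue service can assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="glue-crawler-demo-role",  # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AWS-managed policies matching the permissions listed above.
for arn in [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
    "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
]:
    iam.attach_role_policy(RoleName="glue-crawler-demo-role", PolicyArn=arn)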
Output and Scheduling:
Target database: toy. I had an old database, so I selected it. In your case, if no database is present, create one for test purposes!
Crawler schedule: On demand. This means the crawler will run only when we click the Run button!
crawler scheduling
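If you want to script it instead, the whole configuration above collapses into two boto3 calls. A sketch, where the crawler name, role name and S3 path are the placeholders from this walkthrough:

import boto3

glue = boto3.client("glue")

# Create the target database (raises AlreadyExistsException if it exists).
glue.create_database(DatabaseInput={"Name": "toy"})

# The S3 path points at the folder, never at a single file.
# Omitting the Schedule argument makes the crawler on-demand.
glue.create_crawler(
    Name="student-crawler",            # placeholder name
    Role="glue-crawler-demo-role",     # placeholder role
    DatabaseName="toy",
    Classifiers=["student"],           # our custom CSV classifier
    Targets={"S3Targets": [{"Path": "s3://my-glue-demo-bucket/student-data/"}]},
)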
Go to the review step and click on “Create crawler”. Then go to the Crawlers page and click on Run, as shown in the picture below.
After the crawler run is successful, you will see that one table has been created.
crawler running
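Clicking Run in the console is equivalent to this small boto3 sketch, which starts the crawler and polls until it settles back into the READY state (the crawler name is the placeholder from above):

import time
import boto3

glue = boto3.client("glue")

# Start the on-demand crawler, then poll its state.
glue.start_crawler(Name="student-crawler")

while True:
    state = glue.get_crawler(Name="student-crawler")["Crawler"]["State"]
    print("Crawler state:", state)
    if state == "READY":   # crawler has finished and is idle again
        break
    time.sleep(15)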
After the crawler run, it creates a table. Let's inspect it. You can observe that under the Data Catalog there is a database named toy, and under the toy database there is a new table, student. Here you can match the column names; they are the same as in the CSV file.
crawler table
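You can verify the same thing programmatically. This sketch pulls the student table out of the toy database and prints the schema the crawler inferred:

import boto3

glue = boto3.client("glue")

# Fetch the crawler-created table and print its columns.
table = glue.get_table(DatabaseName="toy", Name="student")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], "->", col["Type"])
# For this CSV it should print firstname, lastname and address,
# each inferred as a string type.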

Conclusion of Glue Crawler:

Therefore, we can understand that a crawler is a program that can automatically create a table schema. From here onwards, we can create multiple crawlers, point them at different sources, and they will gradually fill the Data Catalog.
In my upcoming blog, I shall show you the process of creating an ETL job using the visual tool.
