2-7 Get Data from a Relational Database
The NiFi support for relational databases has some issues with flexibility and scaleability. There is no single processor that can execute an arbitrary SQL SELECT statement and page through the results efficiently.
There are the available processors and their issues:
- ExecuteSQL Executes any query against a database. Does not support paging. Gets the entire resultset as a single Avro result that needs to be split.
- GenerateTableFetch Designed for paging. Executes a SELECT COUNT(*), then generates SQL queries to page over the rows of a table in chunks, but does not execute them. Limitations: 1) only supports one table, 2) does not always get the paging syntax right for every database.
- QueryDatabaseTable As of NiFi 1.7.0, supports arbitrary queries through the "Custom Query" property. Polls for additional rows by storing and querying with an increasing column. (timestamp, incrementing ID, etc.).
- ExecuteSQLToColumnMaps MarkLogic Community alternative to the built-in SQL processors
All of the examples below were tested with MySQL 8 with the sample employee database located here: https://github.com/datacharmer/test_db There are about 300,000 employees in the database, so I will assume that that is too big for ExecuteSQL without paging. This is the SQL query from which my documents will be generated:
SELECT
employees.emp_no, employees.birth_date, employees.first_name, employees.last_name,
employees.gender, employees.hire_date, employees.dept_emp.dept_no, employees.departments.dept_name,
employees.salaries.salary, employees.titles.title
FROM
employees.employees, employees.dept_emp, employees.departments, employees.salaries, employees.titles
WHERE
employees.employees.emp_no = employees.dept_emp.emp_no AND
employees.dept_emp.to_date > CURDATE() AND
employees.dept_emp.dept_no = employees.departments.dept_no AND
employees.employees.emp_no = employees.salaries.emp_no AND
employees.salaries.to_date > CURDATE() AND
employees.employees.emp_no = employees.titles.emp_no AND
employees.titles.to_date > CURDATE()