Enabling Spark SQL MERGE via optimized ACID Data Source v0.6.0 | Qubole

  1. Multi-Engine Transactional Updates and Deletes: Use Apache Spark and Apache Hive to perform write operations including Updates and Deletes with multi-engine Atomicity, Concurrency, Isolation, and Durability guarantees.
  2. Multi-Engine Transactional Reads: Use any of the significant open-source Data Lake engines — Apache Spark, Apache Hive, and Presto — to read from your data lake under multi-engine transactional guarantees. Snapshot Isolation is the isolation level offered.
  1. SQL MERGE: Users can now use SQL MERGE [8] using Data Source in their Apache Spark pipelines to efficiently Upsert data for various use cases like Change Data Capture (aka CDC [7]) or General Data Protection Regulation (aka GDPR [6]). Here are the details of the documentation.
  2. Performance Improvements: Users can now benefit from improved write performance of ACID Data source by up to 50% and split computation times up to 8x.
  3. Feature Enhancements: Several improvements around update/delete (92, 83, 59) have been made. Improvements around support for task retries (43) and handling Partition predicates (21) have also been made. In total, 24 JIRAs are fixed in this release.

MERGE Command:

Users need efficient Upsert operation for multiple use cases in Data Lake like:

  1. Capture and Sync Change Data Capture (aka CDC [7]): CDC captures the changes in ‘source’ data, typically an enterprise OLTP system, and updates the data in ‘destination (s)’. The data can be your cloud object stores like AWS S3 using multiple engines for Ad-hoc analytics, data engineering, streaming analytics, and ML.
  2. General Data Protection Regulation (aka GDPR [6]): With compliance such as GDPR, specifically Right to Forget or CCPA’s Right to Erasure, enterprises have built workflows to delete user records on demand from the Data lake

SQL Syntax

MERGE INTO <target table> [AS T] USING <source table> [AS S] ON <boolean merge expression> WHEN MATCHED [AND <boolean expression1>] THEN <match_clause> WHEN MATCHED [AND <boolean expression2>] THEN <match_clause> WHEN NOT MATCHED [AND <boolean expression3>] THEN INSERT VALUES ( <insert value list> ) Possible match clauses can be <match_clause> :: UPDATE SET <set clause list> DELETE Possible insert value list can be <insert value list> :: value 1 [, value 2, value 3, ...] [value 1, value 2, ...] * [, value n, value n+1, ...] Possible update set list can be <update set list> :: target_col1 = value 1 [, target_col2 = value 2 ...]

Illustration

Consider a simplified CDC (change data capture) use case for a ride-hailing cab business, where a table on object store S3 named ‘driver(id: int, city: string, ratings: string)’ partitioned by ‘city’ are getting updates/deletes/inserts from upstream OLTP system once every hour. For the above use case, every hour, the ‘drivers’ with city ‘closed’ needs to be deleted, and other matching drivers need to be updated and new drivers need to be inserted.

  1. Figure out all the partitions based on ‘city’ with updates/deletes for drivers.
  2. Rewrite those partitions applying those updates/deletes.
  3. Insert the new driver records.
MERGE INTOdriver as t USING source as s ON t.id = s.id WHEN MATCHED AND t.city = 'closed' THEN DELETE WHEN MATCHED THEN UPDATE t.city = s.city, t.ratings = s.ratings WHEN NOT MATCHED THEN INSERT VALUES (*)

Optimized Write Performance:

Write performance of Data Source has improved in 0.6.0 mainly by optimizing the number of objects created in the writer code. Runs show insert performance into an unpartitioned table of size 100 GB enhanced by 46% and performance on size 1TB improved by 50%

Optimized Split Computation:

Split Computation time has significantly improved for partitioned ACID tables, as shown in Figure 4 below. Benchmarking against tables with 10,000 partitions saw the improvement of around 8x in Split Computation

Release Notes and Acknowledgement:

Release notes for Open Sourced version can be found here: https://github.com/qubole/spark-acid/releases/tag/v0.6.0

References:

  1. https://www.qubole.com/blog/qubole-open-sources-multi-engine-support-for-updates-and-deletes-in-data-lakes/
  2. https://www.infoworld.com/article/3534516/is-your-data-lake-open-enough-what-to-watch-out-for.html
  3. https://www.qubole.com/blog/building-a-data-lake-the-right-way/
  4. https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
  5. https://discover.qubole.com/whats_new/167
  6. https://gdpr-info.eu/
  7. https://en.wikipedia.org/wiki/Change_data_capture
  8. Andrew Eisenberg, Jim Melton, Krishna Kulkarni, Jan-Eike Michels, and Fred Zemke. 2004. SQL:2003 has been published. SIGMOD Rec. 33, 1 (March 2004), 119–126. DOI:https://doi.org/10.1145/974121.974142

--

--

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store