MuleSoft Batch Processing: Steps to Decode Batch Block Size


MuleSoft provides a batch job scope for processing messages in batches, a crucial component for many extract, transform, and load (ETL) processes. This guide will help you understand how to define batch processing, how batch block size is determined, and how batches move between batch steps.



What is Batch Processing?

Batch processing refers to executing application programs and processing their data in separate batches, with each batch being completed before the next one starts. This planned processing method is typically used for tasks like preparing payrolls and maintaining inventory records.

In MuleSoft, you can initiate a batch job scope within an application. This scope is a block of code that divides messages into individual records, performs specific actions on each record, reports the results, and potentially pushes the processed output to other systems or queues.
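The shape of that scope in Mule 4 XML looks roughly like the sketch below. This is a minimal, illustrative skeleton, not a complete application: the flow name, job name, step names, and the HTTP listener config reference are all placeholders.

```xml
<!-- Minimal Mule 4 batch job sketch; names and config-ref are placeholders -->
<flow name="process-records-flow">
    <http:listener config-ref="HTTP_Listener_config" path="/run-batch"/>
    <batch:job jobName="demoBatchJob" blockSize="100">
        <batch:process-records>
            <batch:step name="stepOne">
                <!-- processors here run once per record -->
            </batch:step>
            <batch:step name="stepTwo">
                <!-- processors for the next stage of each record -->
            </batch:step>
        </batch:process-records>
        <batch:on-complete>
            <!-- the payload here is a summary of the job's results -->
            <logger level="INFO" message="#[payload]"/>
        </batch:on-complete>
    </batch:job>
</flow>
```

The `<batch:process-records>` block holds the batch steps that each record flows through, and `<batch:on-complete>` runs once at the end with the job's result summary.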

Batch Job Examples in MuleSoft

1. Customer Data Cleansing and Enrichment

Scenario: An organization needs to cleanse and enrich its customer data periodically to ensure data quality and completeness.

  • Batch Input: Customer data from a CRM system (e.g., Salesforce) in CSV format.
  • Batch Step 1: Validate each customer record (e.g., check for missing fields, format errors).
  • Batch Step 2: Enrich customer records by adding additional information from an external service (e.g., address verification, geolocation data).
  • Batch Step 3: Update the enriched and cleansed customer data back into the CRM system.
  • Output: A report summarizing the number of records processed, errors encountered, and updates made.
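The customer data cleansing scenario above could be sketched as three batch steps, one per stage. This is an assumed outline, not a working integration: the job name, step names, and the summary expression in `on-complete` are illustrative, and the real validation, enrichment, and CRM update logic would replace the comments.

```xml
<!-- Sketch of the customer cleansing scenario; names are placeholders -->
<batch:job jobName="customerCleansingJob" blockSize="100">
    <batch:process-records>
        <batch:step name="validateRecord">
            <!-- check required fields and formats for each customer record -->
        </batch:step>
        <batch:step name="enrichRecord">
            <!-- call an external service (e.g., address verification) per record -->
        </batch:step>
        <batch:step name="updateCrm">
            <!-- write the cleansed, enriched record back to the CRM -->
        </batch:step>
    </batch:process-records>
    <batch:on-complete>
        <!-- payload is the batch job result, with counts of successes and failures -->
        <logger level="INFO"
                message="#['Processed: ' ++ (payload.successfulRecords as String) ++ ', Failed: ' ++ (payload.failedRecords as String)]"/>
    </batch:on-complete>
</batch:job>
```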

2. Inventory Reconciliation

Scenario: A retail company needs to reconcile inventory levels across multiple warehouses to ensure accuracy.

  • Batch Input: Inventory data from multiple warehouse management systems.
  • Batch Step 1: Normalize data formats from different systems into a standard format.
  • Batch Step 2: Compare inventory levels between warehouses to identify discrepancies.
  • Batch Step 3: Generate and send discrepancy reports to warehouse managers for review.
  • Output: Updated inventory records and discrepancy reports.

3. Financial Transaction Processing

Scenario: A bank processes large volumes of financial transactions at the end of each day.

  • Batch Input: Daily transaction records from various banking systems.
  • Batch Step 1: Validate each transaction (e.g., check for required fields, validate account numbers).
  • Batch Step 2: Calculate transaction fees and apply them to each transaction.
  • Batch Step 3: Aggregate transactions by account and update the account balances.
  • Output: Daily summary reports of all processed transactions and updated account balances.
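For a step like "aggregate transactions by account," Mule 4 provides a batch aggregator that collects a group of records inside a batch step so they can be sent to a target system in one bulk call instead of one call per record. The sketch below assumes the Mule 4 batch module; the step name, aggregator size, and bulk-update logic are placeholders.

```xml
<!-- Sketch: bulk-updating balances with a batch aggregator; names are placeholders -->
<batch:step name="updateBalances">
    <batch:aggregator size="50">
        <!-- payload here is an array of up to 50 records; perform one bulk update -->
    </batch:aggregator>
</batch:step>
```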

Understanding a Batch Job

Batch processing in MuleSoft is handled by the powerful batch job scope. This scope breaks down large messages into individual records, which Mule processes asynchronously. Just as flows process messages, batch jobs process these records.

A batch job consists of several batch steps that operate on the records as they progress through the job. Each batch step includes various processors that modify the payload and pass the processed payload to the subsequent steps according to the configuration. These batch steps provide different functionalities to handle the payload. After all records have been processed through the batch steps, the batch job concludes, and a report is generated detailing which records were successfully processed and which encountered errors.

Three Core Batch Job Phases

There are three phases in batch job processing:

  • Load and dispatch
  • Process
  • On complete

Process Phase in Batch Jobs

In this section, we will focus on the process phase where the actual processing of the payload or records occurs within the batch job. During the process phase, records are pulled from the queue and grouped into blocks according to the batch block size defined for the batch job. These blocks of records then pass through each batch step asynchronously, as configured.

In each batch step, a block of records (i.e., batches) is processed in parallel. Once a batch is completed in one step, it is pushed back to the queue to be processed by the next step. This parallel processing within each step ensures efficient handling of records and maximizes throughput.

Records within a batch block are processed sequentially inside each batch step.

Consider a payload array of records: [2, 3, 4, 5, 6, 7] and a batch block size of three, resulting in two blocks of records. Here’s how these blocks move through each step:

  1. Batch Block Creation:

    • Block 1: [2, 3, 4]
    • Block 2: [5, 6, 7]
  2. Step One:

    • Even-numbered records pass through a wait (delay) followed by a transformation.
  3. Step Two:

    • Records are passed as they are with some transformation.

Running this flow with an array of six records and a batch block size of three produces exactly those two blocks of records.
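The walkthrough above could be configured as follows. This is an assumed sketch: the job and step names are placeholders, and the even-number filter uses a batch step `acceptExpression` so that only even records enter step one, while all records pass through step two.

```xml
<!-- Sketch of the [2, 3, 4, 5, 6, 7] example with a block size of 3 -->
<batch:job jobName="blockSizeDemoJob" blockSize="3">
    <batch:process-records>
        <!-- only even-numbered records are accepted into this step -->
        <batch:step name="stepOne" acceptExpression="#[(payload mod 2) == 0]">
            <!-- wait/delay plus a transformation would go here -->
        </batch:step>
        <batch:step name="stepTwo">
            <!-- all records pass through with a transformation -->
        </batch:step>
    </batch:process-records>
    <batch:on-complete>
        <logger level="INFO" message="#[payload]"/>
    </batch:on-complete>
</batch:job>
```

With `blockSize="3"`, the six-record input queue is split into the two blocks shown above, and each block's records are processed sequentially within a step while blocks run in parallel.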


From the example above, we can see how batch block size impacts the performance of a batch job, since records within each block are processed sequentially. While our example involved only six small records, actual ETL scenarios often deal with millions of records in large XML or JSON files. Setting the batch block size carefully is therefore crucial to avoid running out of heap memory, because each block of records is loaded into memory during processing.

To ensure optimal performance of your batch jobs, you should conduct comparative tests with different batch block sizes and evaluate the performance of each before deploying the code to production. While a standard batch block size of 100 works for most use cases, there may be scenarios where adjusting the block size can yield better performance, especially when dealing with varying payload sizes.

Here are a few scenarios to consider:

  1. High Number of Records with Small Payloads:

    • If you are processing millions of records where each record's payload size is in KBs, you can use larger block sizes without encountering memory issues. Increasing the block size in this case can significantly improve batch job completion time.
  2. Large Payloads:

    • When processing heavy payloads, such as files that are several MBs in size, it is advisable to use smaller block sizes. This approach helps distribute the load more evenly and prevents memory overloads.
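In both scenarios, the tuning knob is the `blockSize` attribute on the batch job. The snippets below illustrate the two directions; the job names and the specific values (500 and 10) are illustrative assumptions to be validated against your own payload sizes and heap limits.

```xml
<!-- Many small records: a larger block size can reduce queue round-trips -->
<batch:job jobName="smallPayloadJob" blockSize="500">
    <!-- steps omitted -->
</batch:job>

<!-- Few large records: a smaller block size limits how much is held in memory at once -->
<batch:job jobName="largePayloadJob" blockSize="10">
    <!-- steps omitted -->
</batch:job>
```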
