Many modern software systems are created by forking an existing codebase. While most forks are short-lived and serve temporary collaboration needs, a smaller but more impactful subset evolves into long-lived forks, referred to here as software variants, that follow independent development trajectories. These variants allow customization and organizational control, but over time their structural and semantic divergence complicates the reuse of bug fixes and enhancements between them. Prior work with PaReco (paper) shows that reuse gaps often appear as missed opportunities (a patch present in the source but absent in the target) or effort duplication (a similar patch re-implemented manually). Our extended tool, GACPD, builds on PaReco with broader language support, faster token normalization, and developer-facing outputs to surface candidate patches worth reusing. When selecting what to integrate, apply Most Valuable First (OORP, p.29) to prioritize the highest-impact patches, and Keep It Simple (OORP, p.37) to avoid unnecessary complexity in your plan.

Despite automated detection, structural divergence caused by refactorings (renames, moves, interface changes) can block direct reuse. RePatch (paper) performs refactoring‑aware patch integration by aligning the source and target around detected refactorings, applying the patch, then replaying those refactorings so the target keeps its structure. Guard your work with Tests: Your Life Insurance (OORP, p.149), use Write Tests to Understand (OORP, p.179) when behavior is unclear, and Grow Your Test Base Incrementally (OORP, p.159) to expand coverage only where the integration touches. Keep adaptations minimal (again, Keep It Simple, p.37), focus on the immediate collaborators of the change, and document any manual decisions for review.
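
The align, apply, and replay idea can be illustrated with a deliberately tiny Python model. This is only a sketch under strong assumptions, not RePatch's implementation: the only refactoring considered is a rename, the "patch" is a single line replacement, and the helper names (rename_identifier, apply_patch, refactoring_aware_apply) are hypothetical.

```python
import re


def rename_identifier(code: str, old: str, new: str) -> str:
    """Rename whole-word occurrences of an identifier (toy stand-in for a refactoring)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def apply_patch(code: str, old_line: str, new_line: str) -> str:
    """Apply a one-line patch; raise if its context is absent (like a failed cherry-pick)."""
    if old_line not in code:
        raise ValueError("patch context not found")
    return code.replace(old_line, new_line, 1)


def refactoring_aware_apply(target: str, rename: tuple, patch: tuple) -> str:
    """rename = (source_name, target_name); patch = (old_line, new_line) in source terms."""
    source_name, target_name = rename
    aligned = rename_identifier(target, target_name, source_name)  # 1. undo the rename
    patched = apply_patch(aligned, *patch)                         # 2. apply the patch
    return rename_identifier(patched, source_name, target_name)    # 3. replay the rename


# Hypothetical example: the fork renamed get_offsets to fetch_offsets.
target_code = "result = fetch_offsets(group)\n"
patch = ("result = get_offsets(group)", "result = get_offsets(group) or {}")
integrated = refactoring_aware_apply(target_code, ("get_offsets", "fetch_offsets"), patch)
print(integrated)
```

Applying the patch directly to target_code would raise, because the context line still uses the source's name. After alignment the patch applies, and the replay step restores the fork's naming.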

Materials & Tools Used for this Session

Lecture slides: here

IDEs

PyCharm
IntelliJ IDEA – The Ultimate Edition is recommended, as it includes built-in support for generating UML diagrams directly from your code.

Repositories

Source repository: apache/kafka
Target repository: linkedin/kafka

Setup / Preparation

In this lab you will work with two closely related tools: GACPD and RePatch.

Task 1 – Identifying a Missed Opportunity with GACPD

In this task, you will use GACPD to detect and validate a Missed Opportunity (MO) patch between two related repositories:

Source (Mainline): apache/kafka
Target (Divergent fork): linkedin/kafka

We have already identified a candidate patch from Apache Kafka Pull Request #13386 that appears in the mainline repository but not in the divergent fork.
Your goal is to:

Run GACPD to detect the patch.
Read and understand the patch’s intent.
Verify the results by inspecting both repositories.

Following the Most Valuable First pattern (p.29), we begin with a high-impact MO that, if integrated, will improve parity and stability between variants.

Step-by-Step Instructions

1. Install GACPD

Follow the installation instructions in the GACPD README.
Make sure you can run Jupyter notebooks in your environment.

2. Run the analysis

Open the OnePR.ipynb file in the root of the GACPD repository.
Execute the notebook cells as instructed.
Ensure the notebook is configured to analyze PR #13386 from apache/kafka.

example.individual_pr_check(13386)

This step applies the Detecting Duplicated Code pattern (p.223) — GACPD uses similarity analysis to find duplicated or nearly duplicated changes across repositories.
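
To make the idea of similarity-based detection concrete, here is a minimal Python sketch. It is not GACPD's algorithm: the tokenizer, the 0.8 threshold, and the function names (normalize, similarity, classify) are assumptions chosen for illustration, and it only distinguishes the two reuse gaps named in the introduction, missed opportunity (MO) and effort duplication (ED).

```python
import re
from difflib import SequenceMatcher


def normalize(code: str) -> list:
    """Drop // comments and split into identifier, number, and symbol tokens."""
    code = re.sub(r"//[^\n]*", "", code)
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(before: str, after: str, target: str, threshold: float = 0.8) -> str:
    """Label a hunk by whether the target resembles the patched or unpatched source."""
    to_after = similarity(after, target)
    to_before = similarity(before, target)
    if to_after >= threshold and to_after > to_before:
        return "ED"  # target already contains a similar change
    if to_before >= threshold:
        return "MO"  # target still looks like the unpatched code
    return "unknown"


# Hypothetical hunk: the patch adds a guard against negative values.
before = "int n = poll(); // old\nreturn n;"
after = "int n = poll();\nif (n < 0) n = 0;\nreturn n;"
target_without_fix = "int n = poll();\nreturn n;"
target_with_own_fix = "int n = poll();\nif (n < 0) { n = 0; }\nreturn n;"

print(classify(before, after, target_without_fix))
print(classify(before, after, target_with_own_fix))
```

Comparing tokens rather than raw text makes the check tolerant of whitespace and comment differences, which is why normalization matters when variants have drifted apart.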

3. View the GACPD output

When the analysis finishes, navigate to the output folder:
Results/Repos_files/1/<PR-number>/MO/<path-to-source-file>/results.txt

This file contains the file paths in both repositories that you can use for manual inspection. Here, we apply the Compare Code Mechanically pattern (p.227): use automated, tool-driven comparison before diving into manual review.
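
If the PR touches several files, a short script can collect every report for you. The sketch below assumes only the folder layout given above (Results/Repos_files/1/<PR-number>/MO/.../results.txt) and makes no assumption about the format of results.txt itself. The function name find_mo_reports is hypothetical.

```python
from pathlib import Path


def find_mo_reports(results_root: str, pr_number: int) -> list:
    """Return every results.txt under the MO folder for the given PR."""
    mo_dir = Path(results_root) / "Repos_files" / "1" / str(pr_number) / "MO"
    if not mo_dir.is_dir():
        return []  # analysis not run yet, or no MO hunks were reported
    return sorted(mo_dir.rglob("results.txt"))


# Run from the root of the GACPD repository after the notebook has finished.
for report in find_mo_reports("Results", 13386):
    print(report)
    print(report.read_text(encoding="utf-8", errors="replace"))
```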

4. Read and understand the patch

Open Apache Kafka Pull Request #13386

Read the PR description and discussion to understand:

What the patch does (functional change or improvement).
Why it was made (e.g., bug fix, performance, metrics/observability).
Why it matters for both the source (apache/kafka) and the target (linkedin/kafka).

Pattern cue — Most Valuable First (OORP, p.29): articulate the value/impact of this change before investing further effort, so you focus integration work on the highest‑benefit patches first.

5. Verify the Missed Opportunity

Open the source repository (apache/kafka) and navigate to the file path shown in the report or results.txt.
- Confirm that the hunk appears in the PR’s changes.
Open the target repository (linkedin/kafka) and navigate to the same file path.
- Check whether the hunk exists in the code.
- If it is missing, it confirms the MO classification from GACPD.
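
The manual check in this step can be backed by a small script. The following sketch assumes you have saved the PR's unified diff for one file and the corresponding target file as text. It reports which added lines do not occur in the target, ignoring indentation. The function names are hypothetical, and an exact line match is a crude proxy: a renamed variable in the fork will show up as "missing" even if the fix is present.

```python
def added_lines(diff_text: str) -> list:
    """Collect the lines a unified diff adds, skipping '+++' headers and blank lines."""
    lines = []
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            content = line[1:].strip()
            if content:
                lines.append(content)
    return lines


def missing_from_target(diff_text: str, target_source: str) -> list:
    """Return added lines that do not appear (ignoring indentation) in the target file."""
    target = {line.strip() for line in target_source.splitlines()}
    return [line for line in added_lines(diff_text) if line not in target]


# Hypothetical diff and target contents, for illustration only.
diff = (
    "--- a/F.java\n"
    "+++ b/F.java\n"
    "@@ -1,2 +1,3 @@\n"
    " int n = poll();\n"
    "+if (n < 0) n = 0;\n"
    " return n;\n"
)
target = "int n = poll();\nreturn n;\n"
print(missing_from_target(diff, target))
```

A non-empty result supports the MO classification; an empty one suggests the change is already present and the report deserves a closer look.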

Task 2 – Integrating the Missed Opportunity with RePatch

In this task, you will use RePatch to integrate the MO patch identified in Task 1 from the source repository (apache/kafka) into the target repository (linkedin/kafka).

Your goal is to:

Run RePatch to integrate the patch.
Verify the results by inspecting the backend database.

Step-by-Step Instructions

1. Install RePatch

Follow the installation instructions in the RePatch README.

Here we apply Keep It Simple (OORP, p.37) — avoid unnecessary setup complexity, stick to the minimal environment needed to run the integration.

2. Run the Integration Pipeline

Open and build the project in IntelliJ IDEA.
For this lab, the pipeline is already preconfigured to analyze PR #13386 from apache/kafka. Simply run the project and wait for the integration to finish.
For the project assignments, you will need to configure RePatch to analyze different PRs (four in total).
- The PR configuration is stored in the src/main/resources/complete_data directory (see README, section Configuring PRs).
- To switch to a new PR, update the configuration file(s) in the src/main/resources/sample_data with the PR number and repository details provided in your project instructions.
- Rebuild and rerun RePatch with the updated settings.

This step is guided by Most Valuable First (OORP, p.29) — start with a preselected high-impact PR before generalizing to others.

3. View RePatch output

When the integration finishes, inspect the results in MySQL.

How to connect: Use any MySQL client—e.g., MySQL Workbench (GUI), the MySQL CLI, or phpMyAdmin (included in the Docker setup for RePatch).
What to look at: The key table is merge_result, which records how RePatch reduced or resolved merge conflicts when git cherry-pick failed. Other tables include useful metadata and diagnostics (e.g., projects, patches, refactorings, conflicting files, conflict blocks).
For setup details, see the lab project’s README.

Pattern cue — Compare Code Mechanically (OORP, p.227): rely on automated tool-driven alignment before diving into manual inspection.
Pattern cue — Tests: Your Life Insurance (OORP, p.149): even after automated integration, always confirm correctness through testing.

Quick start (SQL):

-- List databases
SHOW DATABASES;

-- Switch to the RePatch database
USE refactoring_aware_integration_repatch;

-- See available tables
SHOW TABLES;

-- Inspect the structure of a table
DESCRIBE merge_result;

-- Preview rows from a specific table
SELECT * FROM merge_result LIMIT 50;

Template query (replace the placeholder):

SELECT * FROM <table_name> LIMIT 50;

Task 3 – Writing Tests for Integrated Changes

Up to now, we have used RePatch to integrate a missed opportunity patch from Apache Kafka PR #13386 into the target repository. However, integration at the syntactic level is not enough — we also need to ensure that the change behaves correctly in its new context.

In this task, you will write a missing test for the integrated patch.

Step-by-Step Instructions

Inspect the code hunk from PR #13386 that was integrated into the target repository.
Identify the behavior or functionality that the hunk is meant to enforce (e.g., offset commit logic, logging, or metrics).
Write a unit test or integration test in the target repository that exercises this functionality.
- Use existing test files as templates.
- Place the new test alongside related tests in the appropriate test suite.
Run the test suite and verify that your new test passes.
- If it fails, investigate whether the integration requires additional adaptations.

Pattern cue — Tests: Your Life Insurance (OORP, p.149): tests protect against regressions and confirm the correctness of integration.
Pattern cue — Write Tests to Understand (OORP, p.179): writing tests clarifies what the patch is supposed to do.
Pattern cue — Grow Your Test Base Incrementally (OORP, p.159): only add tests around the integrated patch instead of rewriting the whole suite.
Pattern cue — Study the Exceptional Entities (OORP, p.107): focus tests on edge cases and unusual conditions where bugs are more likely.

Reflection Questions (for quiz prep)

To deepen your understanding, read the RePatch paper and use it to guide your answers:

Did your test confirm that the patch works correctly in the target repository? If not, what additional adaptations might be needed?
How does patch technical lag (as discussed in PaReco) affect the reliability of tests when integrating long-delayed patches?
RePatch found that many cherry-pick failures stem from refactorings such as Rename Method or Rename Parameter. How might such refactorings influence the kinds of tests you need to write?
How do unit and integration tests together strengthen confidence in patch reuse across software variants? Where might tests still fall short?
Imagine your integrated patch passes syntactic checks and unit tests but fails in production. Based on the research papers, what variant-specific factors could explain this outcome?