Whether you are a data scientist, machine learning engineer, or software developer, you will probably encounter legacy code in your projects. This type of code can be challenging to work with, as it often lacks proper documentation and may not adhere to modern coding standards. However, understanding and maintaining legacy code is crucial for the long-term success of any software project. Major changes are usually hard to make safely, particularly when the unit tests do not provide good coverage. The following practices can help improve the quality of the project in the long run:
Start with small changes: When working with legacy code, it’s important to make small, incremental changes rather than large, sweeping modifications. This approach reduces the risk of introducing new bugs and makes it easier to identify the source of any issues that arise.
Add docstrings: Whenever you modify a function or a class, take the time to check the docstring. If it does not exist, create one. If it does, check whether any information needs to be revised. This practice improves the overall documentation of the codebase and makes it easier for others (and your future self) to understand the purpose and behavior of the code. It does NOT need to be perfect! Even if you do not know the description of a variable or its type, just leave it blank for now. Your docstring can serve as a valuable reference point for anyone working with the code in the future!
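As a minimal sketch of a "good enough" docstring, consider the function below. The function `scale_features` and its parameters are hypothetical and exist only for illustration. Note that the description of `method` is left as a TODO instead of being guessed.

```python
def scale_features(values, factor=1.0, method=None):
    """Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Numbers to scale.
    factor : float, optional
        Multiplier applied to every value. Defaults to 1.0.
    method :
        TODO: purpose and type unknown; kept for existing callers.

    Returns
    -------
    list of float
        The scaled values, in the original order.
    """
    # `method` is intentionally unused here; it is only documented as unknown.
    return [v * factor for v in values]
```

An incomplete entry like the one for `method` is still useful: it tells the next reader that the gap is known, not overlooked.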
Use linting tools: Linters and formatters can help you maintain a consistent code style. Black, for example, is strictly a code formatter that rewrites code into a uniform style, while a linter such as Flake8 flags likely mistakes and style violations. Clean code improves readability and maintainability, making it easier for you and others to work with the code in the future. You can go one step further by integrating these tools into your development workflow, for example through pre-commit hooks or continuous integration pipelines.
Cross-reference code changes and tickets: When making changes to the codebase, it’s important to cross-reference those changes with any relevant tickets or issues in your project management system. This practice helps ensure that all changes are properly tracked and documented, making it easier to understand the context and rationale behind each change. Suppose a change is related to a business request in which a business user raises a concern about the accuracy of a model and its impact on their decision-making process. In this case, we can follow these steps: First, we get clarification from the business user on their specific concerns and requirements. Next, we review the relevant code and tests to identify any potential issues. Finally, we implement the necessary changes and submit a PR. In the ticket, we point to the PR, and in the PR, we point to the ticket. This way, anyone can trace the process later.
Document your assumptions: When working with legacy code, it’s important to document any assumptions you make about the code’s behavior or the data it operates on. This documentation can help you and others understand the code better and can serve as a reference point if you need to revisit the code in the future. You can add comments to the code itself or create separate documentation that outlines your assumptions and the reasoning behind them. For instance, let’s say you are performing batch processing on a dataset coming from an upstream job. As you process each batch, you write the processed (curated) data to a certain location. For the sake of simplicity, let’s say you write it to a CSV file whose first column is ID, and the goal is to have unique records in the output. If the code does not check for duplicates, then the hidden assumption is that the batches do not contain any. In some cases, you can add extra checks so that the code relies on fewer assumptions. The check can raise an error or emit a warning, depending on the specific requirements of your project.
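The duplicate check described above could look roughly like the sketch below. The function name `append_batch`, the `seen_ids` set, and the `strict` flag are assumptions made for illustration; they are not part of any existing codebase.

```python
import csv
import warnings


def append_batch(path, rows, seen_ids, strict=True):
    """Append rows to a CSV file while checking the 'IDs are unique' assumption.

    rows     : list of lists; the first item of each row is the ID.
    seen_ids : set of IDs already written; updated only after a successful write.
    strict   : if True, a duplicate raises ValueError; otherwise it is
               skipped with a warning.
    """
    new_ids = set()
    fresh = []
    for row in rows:
        record_id = row[0]
        if record_id in seen_ids or record_id in new_ids:
            message = f"Duplicate ID {record_id!r} found in batch"
            if strict:
                raise ValueError(message)
            warnings.warn(message)
            continue  # skip the duplicate and keep processing
        new_ids.add(record_id)
        fresh.append(row)

    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerows(fresh)

    seen_ids.update(new_ids)
    return len(fresh)
```

Whether a duplicate should stop the job (`strict=True`) or only be reported is a project decision; either way, the assumption is now visible in the code instead of hidden.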
Avoid making changes that can break backward compatibility: When working with legacy code, it’s important to be mindful of backward compatibility. This means that any changes you make should not break existing functionality or cause issues for users who rely on the current behavior of the code. Suppose you encounter a function and consider modifying its signature. Before making such changes, you should carefully evaluate the impact on existing code that depends on this function. If necessary, you can introduce new functions or parameters while keeping the old ones intact to ensure that existing users are not affected. Even changing the function name can become a problem if a user’s application depends on that function.
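One common way to rename a function or extend its signature without breaking callers is sketched below. The names `calc_total` (old) and `compute_total` (new) and the `discount` parameter are hypothetical. The new parameter gets a default that preserves the old result, and the old name stays available as a thin wrapper that emits a `DeprecationWarning`.

```python
import warnings


def compute_total(prices, discount=0.0):
    """New name. `discount` is a new keyword whose default keeps the old behavior."""
    return sum(prices) * (1.0 - discount)


def calc_total(prices):
    """Old name, kept as a thin wrapper so existing callers keep working."""
    warnings.warn(
        "calc_total is deprecated; use compute_total instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return compute_total(prices)
```

Existing code that calls `calc_total(prices)` gets the same result as before, while new code can opt in to `compute_total(prices, discount=...)`.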
Try to distinguish between “software” and “application”: While the terms “software” and “application” are often used interchangeably, it’s important to recognize the differences between them, especially when working with legacy code. Software is a broad term that encompasses all types of programs and systems, while an application is a specific type of software designed to perform a particular task or set of tasks for users. By understanding this distinction, you can make more informed decisions about how to modify and maintain legacy code. Let’s say you are working on several ETL applications that process data from various sources, and the core process is the same in all of them. You can extract the common logic into a shared library or module that is reused across all applications. If a user needs your support for the application as well, you can write an application that uses the module and share it with the user. This way, you separate the data from the process that is applied to it.
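The split between shared logic and a thin application could be sketched as follows. Both `clean_records` (the reusable core) and `run_sales_etl` (one source-specific application) are hypothetical names, and the record layout is assumed for illustration only.

```python
# Shared layer: reusable core logic with no knowledge of any particular source.
def clean_records(records, required_fields):
    """Drop records with missing required fields and strip string values."""
    cleaned = []
    for record in records:
        if any(record.get(field) in (None, "") for field in required_fields):
            continue  # a required field is absent or empty
        cleaned.append(
            {key: value.strip() if isinstance(value, str) else value
             for key, value in record.items()}
        )
    return cleaned


# Application layer: a thin, source-specific wrapper around the shared logic.
def run_sales_etl(raw_rows):
    """Map one source's raw rows to records, then reuse the shared core."""
    records = [{"id": row[0], "amount": row[1]} for row in raw_rows]
    return clean_records(records, required_fields=["id", "amount"])
```

A second application for another source would only need its own mapping step; the cleaning rules stay in one place.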
Avoid try-except unless you know what you are doing: While try-except blocks can be useful for handling exceptions and preventing crashes, they can also hide underlying issues and make debugging more difficult. When working with legacy code, it’s important to use try-except blocks judiciously and to have a clear understanding of the potential consequences. If you do need to use a try-except block, make sure to log the exception and provide enough context to help future developers understand the issue.
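A minimal sketch of a judicious try-except block is shown below, assuming a hypothetical `parse_amount` helper. It catches only the exception it expects, logs it with context, and re-raises so that the failure is not silently hidden.

```python
import logging

logger = logging.getLogger(__name__)


def parse_amount(raw, record_id):
    """Convert a raw string to float, logging context if the conversion fails."""
    try:
        return float(raw)
    except ValueError:  # catch only the expected failure, never a bare `except:`
        # logger.exception records the message together with the traceback.
        logger.exception("Could not parse amount %r for record %s", raw, record_id)
        raise  # re-raise so callers still see the error
```

Swallowing the exception and returning a default value is sometimes appropriate too, but that choice should be deliberate and documented, since it changes what downstream code sees.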