Home (old version)

Motivation

Major challenges for large coding projects:

Ability to understand and remember code when the project becomes large
Ability to expand functionality with ease
Making code understandable and usable by others

Useful hypothetical test: will you be able to easily understand, use, and expand your code if you don’t touch it for a year?

A useful way to approach scientific programming is to think in terms of programming modes. Most scientific programmers concentrate on the prototyping and optimizing modes, while software engineers concentrate on the maintaining, sharing, and standardizing modes. For large coding projects, working in maintaining mode is much more efficient than working in prototyping mode, so it pays for scientists to learn and apply the corresponding techniques. When scientists share their code with other researchers, the sharing and standardizing modes come into play and require their own techniques.

General tip: don’t spend a lot of time figuring something out on your own if you know of someone who can explain the answer to you. Often there are community-accepted solutions to problems that are difficult to find online but that someone experienced with the programming language or field would find obvious. Know who you can go to for help (and return the favor).

Object-oriented programming

In software engineering, the most common approach to fighting code complexity is object-oriented programming (OOP). It is well suited to most large software projects but can have disadvantages for some scientific programming applications. It is good to know the differences between OOP and good modular programming and to choose the approach that suits your needs. Here is a wiki section going into the details of the issue:

OOP in scientific programming

Languages

Select a good language for scientific programming. Different languages have different strengths and weaknesses, and the specific task determines the best tool, so choose wisely. Here are several options successfully used by many scientists:

Make sure to use a smart editor with code highlighting and auto-completion.

Text formatting / Commenting

Use UpperCamelCase, lowerCamelCase, or lower_underscore_case notation; give variables meaningful, easily understandable names
Keep code lines short and readable, use tabs to offset loops/scopes, and leave plenty of white space
Add at least minimal comments while you write code (modularity and good notation greatly reduce the amount of comments required):
- a short description of each function’s purpose at the beginning
- a few words in front of each block of meaningful computation
- reference sources for your methodology/notation when applicable (articles, Wikipedia, etc)
- too many comments are better than too few
- (bonus) specify function’s inputs and outputs thoroughly
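
A minimal sketch of these commenting habits, written in Python for illustration (the wiki itself is language-agnostic). The function `moving_average` and its arguments are hypothetical names chosen for this example:

```python
def moving_average(values, window):
    """Smooth a sequence with a simple (unweighted) moving average.

    Inputs:
        values: list of numbers, at least `window` items long
        window: positive int, number of consecutive samples to average
    Output:
        list of floats of length len(values) - window + 1

    Reference: https://en.wikipedia.org/wiki/Moving_average
    """
    # Check inputs before doing any work
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")

    # Average each run of `window` consecutive samples
    smoothed = []
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        smoothed.append(sum(chunk) / window)
    return smoothed
```

The docstring covers purpose, inputs, outputs, and the source of the method, so the comments inside the body can stay short.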

Variables / Data structures

Use structure/dictionary data types when appropriate; they have powerful methods
Use cell array/list data types when appropriate; they have powerful methods
Use set data type when appropriate; it has powerful methods
Put all parameters which can be used by more than one function into one separate parameter structure/dictionary and pass it along; initialize it in a separate function setParams()
Don’t use any numbers in code directly; put them into the parameter or state structure
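
One possible way to set up a shared parameter structure is sketched below in Python. The parameter names and values (`sampling_rate_hz`, `threshold`) are made up for illustration; only the idea of a separate initializer function comes from the list above:

```python
def set_params():
    """Return the general parameters shared by several functions."""
    return {
        "sampling_rate_hz": 1000.0,  # hypothetical acquisition rate
        "threshold": 0.5,            # hypothetical detection threshold
    }


def count_above_threshold(samples, params):
    """Count samples that exceed params["threshold"]."""
    return sum(1 for s in samples if s > params["threshold"])


def duration_s(samples, params):
    """Return the recording length in seconds."""
    return len(samples) / params["sampling_rate_hz"]


params = set_params()
samples = [0.1, 0.7, 0.9, 0.3]
n_above = count_above_threshold(samples, params)
length_s = duration_s(samples, params)

# A set removes duplicates and supports fast membership tests
unique_labels = set(["a", "b", "a"])
```

Because every number lives in `params`, changing the threshold means editing one line in `set_params()` rather than hunting for literals across functions.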

Data, files and folders hierarchy

Before starting or expanding a major project, put your best effort into creating a general structure/format to organize your data and code. Design and follow a strict naming schema for files and folders. Ideally, maintain a design document with a flowchart of how the functions and data files interact. This will greatly speed up the coding process and minimize the need for refactoring.

Organize data so that every piece of it has a single representation in the system (no repetitions)
Use meaningful data, file, and folder names and structure
Use sub-folders when a folder has over 20 functions in it
Use "YYMMDD" format for dates: files in a folder will sort in a chronological order and you will avoid possible MMDD/DDMM confusion with American/European notations
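
The sorting benefit of date-prefixed names can be seen in a short Python sketch. The helper `dated_name` and the `recording` label are hypothetical:

```python
from datetime import date


def dated_name(day, label, extension):
    """Build a file name with a YYMMDD prefix, e.g. '240115_recording.csv'."""
    return f"{day.strftime('%y%m%d')}_{label}.{extension}"


names = [
    dated_name(date(2024, 2, 1), "recording", "csv"),
    dated_name(date(2023, 12, 31), "recording", "csv"),
    dated_name(date(2024, 1, 15), "recording", "csv"),
]

# Plain alphabetical sorting now matches chronological order
names_sorted = sorted(names)
```

Here `names_sorted` lists the December 2023 file first, then January 2024, then February 2024, without any date parsing.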

Functions / Modularity

Cap function length at 100 lines of code (less than 1 screen length is even better)
Make each function solve only one meaningful task
Separate code that is repeated at least once into its own function
At the beginning of a function, check that the input data is consistent with what you expect
Make functions carry no side effects
Use higher-level functions whenever they are available instead of implementing them yourself
Scale code vertically using higher and higher levels of abstraction: this is the only way to work with a large code-base
Pass variables between functions only via:
- direct inputs (data itself and function specific parameters)
- parameters structure (general parameters)
- state structure (flags and parameters showing the current state of analysis)
Abstract/generalize code whenever it is possible and fast (write for a more general case if it won’t take much time and effort), stop for a few moments to think about abstracting before you start each new function
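
Several of these points (input checks, no side effects, and passing data through direct inputs, a parameter structure, and a state structure) can be combined in a small Python sketch. The function names `normalize` and `run_step` and the keys `target_peak` and `normalized` are assumptions made for this example:

```python
def normalize(trace, params):
    """Return a new list scaled so its peak equals params["target_peak"].

    The input list is not modified (no side effects).
    """
    # Check that inputs match expectations before computing
    if not trace:
        raise ValueError("trace must not be empty")
    peak = max(abs(x) for x in trace)
    if peak == 0:
        raise ValueError("trace must contain a non-zero value")

    scale = params["target_peak"] / peak
    return [x * scale for x in trace]


def run_step(trace, params, state):
    """Run one analysis step; return the result and an updated state."""
    result = normalize(trace, params)
    # Build a new state dictionary so the caller's copy stays untouched
    new_state = dict(state, normalized=True)
    return result, new_state


params = {"target_peak": 1.0}   # general parameters
state = {"normalized": False}   # flags describing the analysis so far
raw = [1.0, -4.0, 2.0]          # direct input (the data itself)

result, new_state = run_step(raw, params, state)
```

After the call, `raw` and `state` are unchanged, which makes each function safe to rerun and easy to test in isolation.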

Debugging / Testing

Use breakpoints
Keep a fast feedback loop for developing, testing, and debugging code (no more than 10 seconds of running time):
- run small chunks of code at a time
- save and load results of long computations to/from disk
Test each new function on a couple of different "toy" inputs to make sure it runs correctly; don't assume that code executes correctly just because it does not throw an error
Refactor code when needed; don’t get attached to your code
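
A minimal Python sketch of two of the habits above: checking a function on toy inputs whose answers are easy to verify by hand, and saving a long computation to disk so later runs load it instead of recomputing. The function `slow_computation` is a stand-in and the cache file name is arbitrary:

```python
import os
import pickle
import tempfile


def slow_computation(n):
    """Stand-in for a long computation: sum of squares of 0..n-1."""
    return sum(i * i for i in range(n))


def load_or_compute(cache_path, n):
    """Load a saved result if present; otherwise compute and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = slow_computation(n)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


# Toy inputs with answers that can be checked by hand
assert slow_computation(1) == 0    # only 0*0
assert slow_computation(4) == 14   # 0 + 1 + 4 + 9

cache_path = os.path.join(tempfile.mkdtemp(), "sum_squares.pkl")
first = load_or_compute(cache_path, 1000)   # computed, then saved
second = load_or_compute(cache_path, 1000)  # loaded from disk
```

In a real project the cache would live in the project's data folder rather than a temporary directory, and it would need to be deleted whenever the computation changes.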

Version Control / Collaboration

Learn and use a version control system. A popular choice is Git + GitHub. This wiki and its code repository are hosted on GitHub.

Commit after every small but meaningful change
Use code reviews and pair programming when possible

Plotting data

Label axes and indicate measurement units when available
Give your figures meaningful titles
Avoid low-contrast color combinations (such as yellow, pink, cyan, or bright green on white) so that figures display well on a projector and in print
Use a very large font when you want to show figures on a projector or on paper
Maximize data-to-ink ratio on your plots
When comparing several lines in one plot, use different colors and line types (solid, dotted, dashed) and add an explanatory legend
Scale axes so that data takes most of the space
Add error bars
Export figures to .eps format and edit them in Adobe Illustrator/Inkscape for publication quality
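
As one illustration of these plotting tips, here is a sketch that assumes Python with the Matplotlib library is available. The measurements are made up, and the output file name is arbitrary:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Made-up measurements, for illustration only
time_s = [0, 1, 2, 3, 4]
control = [1.0, 1.1, 0.9, 1.0, 1.05]
treated = [1.0, 1.6, 2.1, 2.4, 2.5]
control_err = [0.1] * len(time_s)
treated_err = [0.15] * len(time_s)

fig, ax = plt.subplots(figsize=(6, 4))

# Different colors AND line types, each with error bars and a legend label
ax.errorbar(time_s, control, yerr=control_err, color="black",
            linestyle="-", capsize=3, label="Control")
ax.errorbar(time_s, treated, yerr=treated_err, color="tab:blue",
            linestyle="--", capsize=3, label="Treated")

# Axis labels with units, a meaningful title, and large fonts
ax.set_xlabel("Time (s)", fontsize=14)
ax.set_ylabel("Response (a.u.)", fontsize=14)
ax.set_title("Response over time, control vs. treated", fontsize=14)
ax.legend(frameon=False)

# Export as .eps for later editing in a vector graphics program
eps_path = os.path.join(tempfile.mkdtemp(), "response.eps")
fig.savefig(eps_path, format="eps")
plt.close(fig)
```

Black and dark blue keep contrast high on a white background, and the distinct line types keep the curves distinguishable in grayscale print.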

Speed and memory

Pre-allocate large data (for example, arrays) for speed
Avoid loops by code vectorization
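
Both tips can be sketched in Python, assuming the NumPy library is available (other array languages such as MATLAB have direct equivalents). The formula being evaluated is arbitrary:

```python
import numpy as np

n = 100_000
x = np.linspace(0.0, 1.0, n)

# Loop version: the output array is pre-allocated once, then filled in place
y_loop = np.empty(n)
for i in range(n):
    y_loop[i] = 3.0 * x[i] ** 2 + 1.0

# Vectorized version: the same arithmetic applied to the whole array at once
y_vec = 3.0 * x ** 2 + 1.0
```

Both versions produce the same values; the vectorized line is shorter and typically runs much faster because the loop happens inside compiled library code.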

Packages / Libraries

Do not "reinvent the wheel". Aggressively search for and reuse other people’s code. This is especially useful if you are in prototyping rather than optimizing mode. Remember: almost everything you will come across has already been encountered, addressed, and documented; the biggest challenge is selecting the right keywords to find this information.

Unfortunately, many scientific packages written by scientists (rather than professional software engineers) can break or produce wrong results on new data sets, parameter values, or systems. Hence the need for this wiki! When using other scientists' code, test it thoroughly and, if possible, try to understand its details.

Language specific best practices

Code examples

Code examples are in the repository.

Other resources for best practices

Software Carpentry - online courses and offline bootcamps in programming techniques for scientists organized by a dedicated non-profit
General list of good practices, not specific to scientific programming
Good MATLAB Programming Practices for the Non-Programmer
MATLAB Programming Style Guidelines
Best Practices for Scientific Computing
Code Complete: A Practical Handbook of Software Construction, Second Edition
Clean Code: A Handbook of Agile Software Craftsmanship
Best Coding Practices Wikipedia page
