Mikhail Panko edited this page Aug 31, 2013 · 1 revision

Motivation

Major challenges for large coding projects:

  • Ability to understand and remember code when the project becomes large
  • Ability to expand functionality with ease
  • Making code understandable and usable by others

Useful hypothetical test: will you be able to easily understand, use, and expand your code if you don’t touch it for a year?

A useful way to approach scientific programming is to think in terms of programming modes. Most scientific programmers concentrate on the prototyping and optimizing modes, while software engineers concentrate on the maintaining, sharing, and standardizing modes. When coding projects are large, the maintaining mode is much more efficient than the prototyping one, so it is beneficial for scientists to learn and apply the corresponding techniques. Occasionally, scientists want to share their code with other researchers. Then the sharing and standardizing modes come into play and require their own techniques.

General tip: don’t spend a lot of time figuring something out on your own if you know of someone who can explain the answer to you. Often there are community-accepted solutions to problems that are difficult to find online but that someone experienced with the programming language or field would find obvious. Know who you can go to for help (and return the favor).

Object-oriented programming

In software engineering, the most common approach to fighting code complexity is object-oriented programming (OOP). It is well-suited for most large software projects but can have disadvantages for some scientific programming applications. It is good to know the differences between OOP and good modular programming and to choose an approach suitable for your needs. Here is a wiki section going into the details of the issue:

OOP in scientific programming

Languages

Select a good language for scientific programming. Different languages have different strengths and weaknesses, and your specific task will suggest the best tool. Choose wisely. Here are several options successfully used by many scientists:

Make sure to use a smart editor with code highlighting and auto-completion.

Text formatting / Commenting

  • Use UpperCamelCase, lowerCamelCase, or lower_underscore_case notation; give variables meaningful, easily understandable names
  • Keep code lines short and readable, use tabs to offset loops/scopes, and leave plenty of white space
  • Add at least minimal comments while you write code (modularity and good notation greatly reduce the amount of required comments):
    • a short description of each function’s purpose at the beginning
    • a few words in front of each block of meaningful computation
    • reference sources for your methodology/notation when applicable (articles, Wikipedia, etc)
    • too many comments are better than too few
    • (bonus) specify function’s inputs and outputs thoroughly
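
As an illustration of the commenting points above, here is a minimal sketch in Python. The function `standard_error` and its contents are hypothetical and chosen only to show a purpose line, documented inputs and outputs, a methodology reference, and short comments in front of each block of computation.

```python
import math


def standard_error(values):
    """Return the standard error of the mean of a sequence of numbers.

    Inputs:
        values: sequence of at least two numbers (int or float)
    Output:
        float, sample standard deviation divided by sqrt(len(values))

    Reference: Wikipedia, "Standard error"
    """
    n = len(values)

    # Sample mean
    mean = sum(values) / n

    # Unbiased sample variance (divides by n - 1)
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)

    # Standard error of the mean
    return math.sqrt(variance) / math.sqrt(n)
```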

Variables / Data structures

  • Use structure/dictionary data types when appropriate; they have powerful methods
  • Use cell array/list data types when appropriate; they have powerful methods
  • Use set data type when appropriate; it has powerful methods
  • Put all parameters which can be used by more than one function into one separate parameter structure/dictionary and pass it along; initialize it in a separate function setParams()
  • Don’t use any numbers in code directly; put them into the parameter or state structure
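
One way the shared parameter structure could look in Python is sketched below. The parameter names and values (`sampling_rate_hz`, `window_s`, `threshold`) and the two helper functions are assumptions made up for this example; only the idea of a single `setParams()`-style initializer comes from the text above.

```python
def set_params():
    """Return a dictionary holding every parameter shared across functions."""
    return {
        "sampling_rate_hz": 1000.0,  # hypothetical acquisition rate
        "window_s": 0.05,            # hypothetical smoothing window, in seconds
        "threshold": 3.5,            # hypothetical detection threshold
    }


def window_length_samples(params):
    """Convert the window duration in params to a whole number of samples."""
    return int(round(params["window_s"] * params["sampling_rate_hz"]))


def count_above_threshold(signal, params):
    """Count samples in signal that exceed the shared threshold."""
    return sum(1 for x in signal if x > params["threshold"])


# Initialize once, then pass the same dictionary to every function
params = set_params()
n_window = window_length_samples(params)
n_events = count_above_threshold([0.2, 4.1, 3.5, 7.0], params)
```

Because every number lives in `params`, changing the threshold or sampling rate means editing one place rather than searching through function bodies.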

Data, files and folders hierarchy

Before starting or expanding a major project, put your best effort into creating a general structure/format to organize your data and code. Design and follow a strict naming schema for files and folders. Ideally, maintain a design document with a flowchart of how the functions and data files interact. This will greatly speed up the coding process and minimize the need for refactoring.

  • Organize data so that every piece of it has a single representation in the system (no repetitions)
  • Use meaningful data, file, and folder names and structure
  • Use sub-folders when a folder has over 20 functions in it
  • Use "YYMMDD" format for dates: files in a folder will sort in a chronological order and you will avoid possible MMDD/DDMM confusion with American/European notations
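
The date-naming rule can be sketched in Python as follows. The helper `dated_filename` and the `recording` label are hypothetical; the point is only that a leading "YYMMDD" stamp makes alphabetical order match chronological order.

```python
from datetime import date


def dated_filename(label, day, extension):
    """Build a file name whose leading YYMMDD stamp sorts chronologically."""
    return f"{day.strftime('%y%m%d')}_{label}.{extension}"


names = [
    dated_filename("recording", date(2013, 8, 31), "csv"),
    dated_filename("recording", date(2013, 1, 5), "csv"),
    dated_filename("recording", date(2012, 12, 24), "csv"),
]

# Plain alphabetical sorting now matches chronological order
print(sorted(names))
```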

Functions / Modularity

  • Cap function length at 100 lines of code (less than 1 screen length is even better)
  • Make each function solve only one meaningful task
  • Separate code that is repeated at least once into its own function
  • At the beginning of a function, check that the input data is consistent with what you expect
  • Make functions carry no side effects
  • Use higher level functions whenever they are available instead of implementing them yourself
  • Scale code vertically using higher and higher levels of abstraction: this is the only way to work with a large code-base
  • Pass variables between functions only via:
    • direct inputs (data itself and function specific parameters)
    • parameters structure (general parameters)
    • state structure (flags and parameters showing the current state of analysis)
  • Abstract/generalize code whenever it is possible and fast (write for a more general case if it won’t take much time and effort), stop for a few moments to think about abstracting before you start each new function
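
A minimal Python sketch of two of the points above, checking inputs at the top of a function and avoiding side effects, is shown below. The function `normalize` is a made-up example, not code from the repository.

```python
def normalize(values):
    """Return a new list with values rescaled linearly to the range [0, 1]."""
    # Check that the input is what we expect before doing any work
    if len(values) == 0:
        raise ValueError("values must not be empty")
    if not all(isinstance(v, (int, float)) for v in values):
        raise TypeError("values must contain only numbers")

    lowest, highest = min(values), max(values)
    if lowest == highest:
        raise ValueError("values must not all be equal")

    # Build a new list instead of modifying the caller's list in place
    span = highest - lowest
    return [(v - lowest) / span for v in values]


raw = [2, 4, 6]
scaled = normalize(raw)  # raw itself is left unchanged
```

The function does one task, takes its data as a direct input, and returns a new list, so calling it cannot silently change state elsewhere in the program.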

Debugging / Testing

  • Use breakpoints
  • Have a fast feedback loop to develop, test, and debug code (no more than 10 seconds of running time):
    • run small chunks of code at a time
    • save and load results of long computations to/from disk
  • Test each new function on a couple of different "toy" inputs to make sure it runs correctly; don't assume that code executes correctly just because it does not throw an error
  • Refactor code when needed; don’t get attached to your code
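
The sketch below shows, in Python, one possible way to combine two of these habits: testing a function on toy inputs whose answers are easy to check by hand, and saving the result of a long computation to disk so later runs load it instead of recomputing. The names `slow_sum_of_squares` and `load_or_compute` are hypothetical, and `pickle` is just one of several storage options.

```python
import os
import pickle
import tempfile


def slow_sum_of_squares(n):
    """Stand-in for a long computation: sum of i**2 for i in range(n)."""
    return sum(i * i for i in range(n))


def load_or_compute(cache_path, compute, *args):
    """Return the cached result at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute(*args)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


# Toy inputs with answers that are easy to verify by hand
assert slow_sum_of_squares(0) == 0
assert slow_sum_of_squares(4) == 0 + 1 + 4 + 9

cache_path = os.path.join(tempfile.mkdtemp(), "sum_of_squares.pkl")
first = load_or_compute(cache_path, slow_sum_of_squares, 1000)   # computed, then saved
second = load_or_compute(cache_path, slow_sum_of_squares, 1000)  # loaded from disk
```

Note that this simple cache is keyed by file path only, so a different path (or a deleted file) is needed whenever the arguments change.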

Version Control / Collaboration

Learn and use a version control system. A popular choice is Git + GitHub. This wiki and the code repository are hosted on GitHub.

Plotting data

  • Label axes and indicate measurement units when available
  • Give your figures meaningful titles
  • Avoid low-contrast color combinations (such as yellow, pink, cyan, and bright green on white) so that figures display well on a projector and in print
  • Use a very large font when you want to show figures on a projector or on paper
  • Maximize data-to-ink ratio on your plots
  • When comparing several lines in one plot, use different colors and line types (solid, dotted, dashed) and add an explanatory legend
  • Scale axes so that data takes most of the space
  • Add error bars
  • Export figures to .eps format and edit them in Adobe Illustrator/Inkscape for publication quality

Speed and memory

  • Pre-allocate large data (for example, arrays) for speed
  • Avoid loops by code vectorization

Packages / Libraries

Do not "reinvent the wheel". Aggressively search for and recycle other people’s code. This is especially useful if you are in the prototyping and not optimizing mode. Remember: almost everything you will come across has already been encountered, addressed, and documented; the biggest challenge is selecting the right keywords to find this information.

Unfortunately, many scientific packages written by scientists (rather than professional software engineers) can break or produce wrong results on new data sets, parameter values, or systems. Thus, the need for this wiki! When using other scientists' code, test it thoroughly and, if possible, try to understand its details.

Language specific best practices

Code examples

Code examples are in the repository.

Other resources for best practices