… and how to avoid it.
According to a recent scientific paper “Gene name errors are widespread in the scientific literature” (authors: Mark Ziemann, Yotam Eren1, and Assam El-Osta). The paper says that “approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions”, and that the errors are caused by Excel converting certain gene names into dates, and others into numbers in scientific notation. A search finds this is nothing new. A 2004 paper states “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. The more recent paper has an interesting graph of the frequency of these errors over time:
It seems that the errors were at a low level after the 2004 paper, but have since risen substantially.
Errors of this type are of course not limited to gene names. Problems with dates being interpreted differently in different regions are widespread, and in the engineering context fractions may be converted to dates, or left as a fraction in text format.
Although widespread, these problems are reasonably easy to avoid. This post will look at the built in Excel methods, and the next will present some VBA solutions. As an example, we will import a table from a pdf file. The table is copied to the clipboard:
When pasted in Excel all the text goes in one column:
The text can be split into columns with the Text to Columns Wizard, under the Data tab, which has three steps. First select the “delimited” option:
For delimiters select “space” and “treat consecutive delimiters as one”:
Finally select “Text” as the data format for all columns. To do this in one operation scroll to the right of the data preview, hold down the Ctrl key, and click on the right hand column:
The text is split into columns with text format, so the fractions display as formatted in the original document:
If the same clipboard data is now pasted into another range, Excel remembers the text to columns with space delimiters settings, but not the text format setting, so the integers are pasted as numbers, but the fractions are pasted as either dates or text, depending on whether the fraction can be interpreted as a valid date or not:
To paste the data as text all the cells in the paste range must be formatted as text before pasting. All the fraction cells will then be pasted in their original format:
The settings selected in the Text to Columns Wizard will remain in place so long as the spreadsheet is open, even if another workbook is opened. Options for resetting, so that pasted text will go into a single column again are:
- Save and re-open the spreadsheet
- Or select a single cell containing text, and go through the Text to Columns process, selecting “delimited” but deselecting all delimiters.