Sometimes I have had a need to turn an un-formatted or unstructured data set into a database table or more recently into a set of JSON objects. The source data can be anything from log files, unusual/quirky CSV files, plain text lists or HTML chunks. There may be a few ways of doing this – and in the past I have sometimes created throw-away programs specifically designed to read the data set and reformat to something else – but there is a surprisingly easy and quick way of doing this with a spreadhseet program like Excel or OpenOffice/LibreOffice Calc. I will attempt to illustrate this with my most recent example – turning a table copy & pasted from Wikipedia into a set of SQL insert statements.
Prelude: Data formatting
Although my example for this article will use a somewhat-structured data set (the existing table from Wikipedia), you will want to create some sort of separation between your values in your source data if it doesn’t already exist. In this Wikipedia example the data will already be tabulated when pasted which saves some time separating the values. This may be the only lengthy process in your work. Tabs work best when pasting data into a Spreadsheet – but any separation character/codes are what matters. You can always use a text editor to string-replace your separation characters for something like tabs or commas. It is best to first decide how you will load in the data to Excel to dictate which separation character to use.
Step 1: Grab your source data
Go to Wikipedia’s article on ISO 639-2 Language Codes. Highlight the entire table of codes, copy & paste into your spreadsheet application. You should see something like the following:
Great! Your data is looking good. Don’t worry about the hyperlinks if they are there – these will be removed later. If you need to do anything like remove bits of data (in this example, you could decide you don’t want column C, E and F), you can delete the data easily and continue on.
Step 2: Add empty columns
Next add empty columns between data columns – you can do this easily by inserting before each data column.
You should end up with something akin to:
Step 3: Enter your code/markup
Your data is now spaced – so in the empty cells on Row 1, enter the type of text that encases your intended data set. We will use SQL as the main illustration first…
Enter the following text into the following columns on Row 1:
A1 | INSERT INTO languages (code, name, type, state) VALUES (' |
C1 | ',' |
E1 | ',' |
G1 | ',' |
I1 | '); |
Once this is done – for each column, fill all remaining rows with the values from Row 1 – you can do this by grabbing the bottom-right selector on each column and dragging to the end of your data set, or by auto-filling.
Step 4: Copy & Paste to a text editor
Once you have completed all your data markup, select all your data and copy & paste into a text editor. Your data will look as it did in Excel but it will have tabs between each column.
Use the Find/Replace of your text editor to find all instances of the tab character (you may need to copy & paste a tab character into the find box), and replace with an empty string.
Success!
You now have usable SQL – you should do any additional validation that any special characters are escaped, such as the apostrophe ( ‘ ) character, but you now have a block of viable SQL statements to create a usable, ISO-compliant languages table in your database. This whole process from start to finish takes no time at all when you know what to do.
This also works for potentially any type of data structure. You can use it for SQL, JSON, HTML, native language arrays, enums – whatever you can imagine.
JSON
Try separating each column with:
A1 | ,{"code":" |
C1 | ","name":" |
E1 | ","type":" |
G1 | ","state":" |
I1 | "} |
Copy & paste to a text editor, replace the tabs and you have viable JSON. You may also need to string replace the quotations (sometimes the spreadsheet application might substitue the characters for 66 & 99 style quotes).
HTML (for a Select List)
Use the following (assuming that you keep the last 2 data columns, you could drop these for a select list):
A1 | <option value=" |
C1 | "> |
E1 | (type: |
G1 | , state: |
I1 | )</option> |
Summary
This is by far the easiest and quickest way I know to turn any data set into the code I need. If you think there is a better way or have any questions about this process, please feel free to share or ask. If this is horrifically inefficient or just doesn’t make sense, let’s talk about it!