Level : Advanced
Requirements : Knowledge of Linux shell, OpenRefine, jQuery and some selectors.
Just after publishing my first post converting Candidates lists in the 2018 Tunisian municipal elections from PDF to CSV, ISIE published candidates names lists ! More data ! This time it’s all in XLSX format ! Sounds good ? Not at all, I need to get all these files into a single document to be able to do any verification. Be aware, the process is not as easy as you may think ! Welcome to a new #OpenData challenge 🙂
Here is how the page looks like :
First thing, I was not sure that there are the right number of files as it is organized in a tree. So let’s verify that there are 350 files :
Sounds good, I downloaded the 350 files. As usual you will find in the xlsx files title, sub-title, logo … just useless metadata before finding the data !
So let’s convert everything to csv (sudo apt install gnumeric) if you don’t have ssconvert binary :
$ for i in *.xlsx; do ssconvert "$i" "$i.csv" ; done
Now we have list of csv, I will just move them to a separate folder :
$ mkdir csv $ mv *.csv csv/
As I mentioned before the first 7 lines are not part of the data (in most files), so it should be removed (you will see later that it’s more than 7):
$ cd csv $ for i in *.csv; do sed -i 1,7d "$i" ; done
Now I will try to combine them. Combining them with miller should fail if the files don’t have the exact format so it’s recommended to use instead of cat :
$ mlr --rs lf --csv sort -f date,code *.csv > combined.csv mlr: unacceptable empty CSV key at file "%D8%A8%D9%84%D8%AF%D9%8A%D8%A9%20%D8%A7%D9%84%D8%AC%D8%B1%D9%8A%D8%B5%D8%A9.xlsx.csv" line 1
Thank you, first error ! Let’s have a look on the error :
$head -5 %D8%A8%D9%84%D8%AF%D9%8A%D8%A9%20%D8%A7%D9%84%D8%AC%D8%B1%D9%8A%D8%B5%D8%A9.xlsx.csv ,,,,,,, ,,,,,,, "الإدارة الفرعية","الدائرة البلدية","تسمية القائمة","طبيعة القائمة","لقب المترشح","إسم المترشح","رتبة
And you will notice that the two first lines are empty, I will remove the extra lines manually :
$ sed -i 1,2d %D8%A8%D9%84%D8%AF%D9%8A%D8%A9%20%D8%A7%D9%84%D8%AC%D8%B1%D9%8A%D8%B5%D8%A9.xlsx.csv
Then repeat miller command again, until getting the combined file. After three files miller still fail with “mlr: syntax error : unwrapped double quote at line 0”. Thanks miller we just hit your limitation, let’s move to something else :
$ cat *csv > combined.csv
Combined file show more than 24 thousands records (that’s not the exact number), a quick look in openrefine and I found the files causing error. How to do it ? Easily look for known types like numeric, then facet/filter and look for Non-numeric/blank/errors :
Finally the error is due to files having more than 8 columns. As we are working with files created manually, always expect to find such issues.
But there is still one more problem to fix before copying the 8 columns. Notice in the result of the “head” command, the first line which should contain the header is in four or more lines ! We need to remove the newline character inside the columns. I did it my own way, copy header in a separate file, remove all first 4 lines from each csv file, then combine again using the header that we have already created. I will use any file since they are all the same, or almost.
$ head -4 %D8%A8%D9%84%D8%AF%D9%8A%D8%A9%20%D8%A7%D9%84%D8%AC%D8%B1%D9%8A%D8%B5%D8%A9.xlsx.csv > header
I will edit the header file manually here, then :
$ for i in *.csv; do sed -i 1,4d "$i" ; done
Normally this should be okay, we did not delete any useful data. If any data will be deleted here by mistake we will notice that in the index not starting with 1. Now I will loop again on all files and copy the first 8 columns only :
$ mkdir copy $ for i in *.csv; do cut -d "," -f 1-8 "$i" > "copy/$i" ; done $ cd copy $ cat *csv > combined.csv $ cat ../header combined.csv > newcombined.csv
The last part is self explanatory, I just combined csv files and added the header on the top. Openrefine show again two more files creating issues, manual edit then my csv should be final :
We have 45345 rows/records, which mean 45345 candidates ! Huge, how did the ISIE verified the candidature of each one of them ? Second thing I noticed is that some lists have up to 61 names ! And we can see in the next screenshot that there are 25 lists having between 40 and 61 candidates :
Now that’s a file we can work with, feel free to download !