{"id":5684,"date":"2018-04-25T11:51:11","date_gmt":"2018-04-25T08:51:11","guid":{"rendered":"http:\/\/www.hbyconsultancy.com\/?p=5487"},"modified":"2018-04-25T11:51:11","modified_gmt":"2018-04-25T08:51:11","slug":"opening-up-2018-tunisian-municipality-elections-data-part-2","status":"publish","type":"post","link":"https:\/\/hbyconsultancy.com\/2018\/04\/opening-up-2018-tunisian-municipality-elections-data-part-2.html","title":{"rendered":"Opening Up 2018 Tunisian Municipal Elections Data – Part 2"},"content":{"rendered":"

Level : Advanced
\nRequirements : Knowledge of the Linux shell, OpenRefine, jQuery and some selectors.<\/strong><\/p>\n

Just after I published my first post<\/a> on converting the candidate lists of the 2018 Tunisian municipal elections from PDF to CSV, the ISIE published the lists of candidate names<\/a> ! More data ! This time it’s all in XLSX format ! Sounds good ? Not quite : I need to merge all these files into a single document before I can run any verification. Be warned, the process is not as easy as you might think ! Welcome to a new #OpenData challenge \ud83d\ude42<\/p>\n

Here is what the page looks like :<\/p>\n

\"\"<\/p>\n

First thing, I was not sure the page holds the right number of files, as they are organized in a tree. So let’s verify that there are 350 files :<\/p>\n

\"\"<\/p>\n

Sounds good, so I downloaded the 350 files. As usual, each xlsx file starts with a title, a sub-title, a logo … useless metadata to skip before reaching the actual data !<\/p>\n

\"\"<\/p>\n

So let’s convert everything to csv (run sudo apt install gnumeric if you don’t have the ssconvert binary) :<\/p>\n

$ for i in *.xlsx; do ssconvert \"$i\" \"$i.csv\" ; done<\/pre>\n
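If you would rather avoid the doubled .xlsx.csv extension the loop above produces, plain POSIX parameter expansion can strip the old suffix first (a sketch) :<\/p>\n

```shell
# ${i%.xlsx} removes the trailing .xlsx before .csv is appended,
# so name.xlsx is converted to name.csv instead of name.xlsx.csv.
for i in *.xlsx; do ssconvert "$i" "${i%.xlsx}.csv"; done
```

The rest of the post keeps the doubled extension, so this variant is purely cosmetic.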

Now that we have a list of csv files, I will move them to a separate folder :<\/span><\/p>\n

$ mkdir csv\n$ mv *.csv csv\/<\/pre>\n

As I mentioned before, the first 7 lines are not part of the data (in most files), so they should be removed (you will see later that some files have more than 7) :<\/span><\/p>\n

$ cd csv\n$ for i in *.csv; do sed -i 1,7d \"$i\" ; done<\/pre>\n

Now let’s try to combine them. Miller fails when the files don’t share the exact same format, which is precisely why it is worth trying before a blind cat :<\/span><\/p>\n

$ mlr --rs lf --csv sort -f date,code *.csv > combined.csv\nmlr: unacceptable empty CSV key at file \"%D8%A8%D9%84%D8%AF%D9%8A%D8%A9%20%D8%A7%D9%84%D8%AC%D8%B1%D9%8A%D8%B5%D8%A9.xlsx.csv\" line 1\n<\/pre>\n

Thank you, first error ! Let\u2019s have a look at it :<\/span><\/p>\n

$head -5 %D8%A8%D9%84%D8%AF%D9%8A%D8%A9%20%D8%A7%D9%84%D8%AC%D8%B1%D9%8A%D8%B5%D8%A9.xlsx.csv\n,,,,,,,\n,,,,,,,\n\"\u0627\u0644\u0625\u062f\u0627\u0631\u0629 \n\u0627\u0644\u0641\u0631\u0639\u064a\u0629\",\"\u0627\u0644\u062f\u0627\u0626\u0631\u0629\n \u0627\u0644\u0628\u0644\u062f\u064a\u0629\",\"\u062a\u0633\u0645\u064a\u0629 \u0627\u0644\u0642\u0627\u0626\u0645\u0629\",\"\u0637\u0628\u064a\u0639\u0629 \u0627\u0644\u0642\u0627\u0626\u0645\u0629\",\"\u0644\u0642\u0628 \u0627\u0644\u0645\u062a\u0631\u0634\u062d\",\"\u0625\u0633\u0645 \u0627\u0644\u0645\u062a\u0631\u0634\u062d\",\"\u0631\u062a\u0628\u0629 \n<\/pre>\n

You will notice that the first two lines are empty, so I will remove them manually :<\/span><\/p>\n

$ sed -i 1,2d %D8%A8%D9%84%D8%AF%D9%8A%D8%A9%20%D8%A7%D9%84%D8%AC%D8%B1%D9%8A%D8%B5%D8%A9.xlsx.csv\n<\/pre>\n

Then repeat the miller command, fixing files as they surface, until you get the combined file. After three files miller still fails with \u201cmlr: syntax error : unwrapped double quote at line 0\u201d. Thanks miller, we just hit your limits ; let’s move to something else :<\/span><\/p>\n

$ cat *csv > combined.csv<\/pre>\n

The combined file shows more than 24 thousand records (not the exact number). A quick look in OpenRefine and I found the files causing errors. How ? Easy : pick a column with a known type, like a numeric one, then facet\/filter on it and look for non-numeric\/blank\/error cells :<\/span><\/p>\n

\"\"<\/p>\n

It turns out the error comes from files having more than 8 columns. When working with files filled in by hand, always expect to find such issues.\u00a0<\/span><\/p>\n
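The same hunt can be scripted without OpenRefine : count the comma-separated fields on each line and flag any file that ever exceeds 8 columns. A sketch with naive splitting (quoted fields containing commas will trigger false positives, so treat the output as a shortlist) :<\/p>\n

```shell
# Print the name of every csv file containing at least one line
# with more than 8 naively comma-separated fields.
for i in *.csv; do
  awk -F',' 'NF > 8 { print FILENAME; exit }' "$i"
done
```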

But there is still one more problem to fix before copying the 8 columns. Notice, in the output of the \u201chead\u201d command above, that the first line, which should contain the header, is spread over four or more physical lines ! We need to remove the newline characters embedded inside the columns.\u00a0\u00a0<\/span>I did it my own way : copy the header into a separate file, remove the first 4 lines from every csv file, then combine again using the header we just created. Any file can serve as the source of the header, since they are all the same, or almost.<\/span><\/p>\n

$ head -4  %D8%A8%D9%84%D8%AF%D9%8A%D8%A9%20%D8%A7%D9%84%D8%AC%D8%B1%D9%8A%D8%B5%D8%A9.xlsx.csv > header<\/pre>\n
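A side note : the manual edit can be scripted too. Those four physical lines are really one logical CSV row with newlines embedded in quoted fields, so deleting every newline in that slice rebuilds the header in one go (a sketch ; somefile.csv stands in for whichever file you pick) :<\/p>\n

```shell
# Collapse the 4 physical header lines into one logical CSV row,
# then terminate the header file with a single newline.
head -4 somefile.csv | tr -d '\n' > header
printf '\n' >> header
```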

I will edit the header file manually here, then :<\/span><\/p>\n

$ for i in *.csv; do sed -i 1,4d \"$i\" ; done\n<\/pre>\n

Normally this should be fine ; we did not delete any useful data. If any real data had been removed by mistake, we would notice it later : the candidate index would not start at 1.\u00a0<\/span>Now I will loop over all the files again and keep only the first 8 columns :<\/span><\/p>\n

$ mkdir copy\n$ for i in *.csv; do cut -d \",\" -f 1-8 \"$i\" > \"copy\/$i\" ; done\n$ cd copy\n$ cat *csv > combined.csv\n$ cat ..\/header combined.csv > newcombined.csv\n<\/pre>\n
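As a safety net for all the line deletions above, one can verify that each trimmed file still opens with candidate rank 1. The position of the rank column is an assumption here (column 7 in this sketch ; adjust it to the real layout) :<\/p>\n

```shell
# Print the name of any file whose first data row does not carry rank 1.
# Column 7 for the candidate rank is an assumption, not a given.
for i in *.csv; do
  awk -F',' 'NR == 1 { if ($7 != 1) print FILENAME; exit }' "$i"
done
```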

The last part is self-explanatory : I combined the csv files and put the header on top. OpenRefine again flagged two more problematic files ; after a manual edit, my csv should be final :<\/p>\n

\"\"<\/p>\n

We have 45345 rows\/records, which means 45345 candidates ! Huge. How did the ISIE verify the candidacy of each one of them ? The second thing I noticed is that some lists contain up to 61 names ! And the next screenshot shows that 25 lists have between 40 and 61 candidates :<\/p>\n
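The counts behind that facet can be reproduced from the shell by grouping on the list-name column ; column 3 is an assumption taken from the sample header, the splitting is naive, and the same list name may appear in several municipalities, so this is only an approximation :<\/p>\n

```shell
# Count candidates per list name (assumed to be column 3), then keep
# the lists having between 40 and 61 names.
cut -d',' -f3 newcombined.csv | sort | uniq -c | awk '$1 >= 40 && $1 <= 61'
```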

\"\"<\/p>\n

Now that’s a file we can work with, feel free to download it !<\/p>\n\n

 <\/p>\n","protected":false},"excerpt":{"rendered":"

Level : Advanced Requirements : Knowledge of Linux shell, OpenRefine, jQuery and some selectors. Just after publishing my first post converting Candidates lists in the 2018 Tunisian municipal elections from PDF to CSV, ISIE published candidates names lists ! More data ! This time it’s all in XLSX format ! Sounds good ? Not at […]<\/p>\n","protected":false},"author":1,"featured_media":5488,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,8],"tags":[82,178],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"https:\/\/i0.wp.com\/hbyconsultancy.com\/wp-content\/uploads\/2018\/04\/LOGO-BALADIYA2018-FB-Cover-1.jpg?fit=3405%2C1281&ssl=1","_links":{"self":[{"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/posts\/5684"}],"collection":[{"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/comments?post=5684"}],"version-history":[{"count":0,"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/posts\/5684\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/media\/5488"}],"wp:attachment":[{"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/media?parent=5684"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/categories?post=5684"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hbyconsultancy.com\/wp-json\/wp\/v2\/tags?post=5684"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}