screaming modems: Taxpayers in Pakistan

Pakistan's FBR, the tax collectors, recently published a PDF file with a list of all taxpayers, both corporate and individual. (There's also a third category of "AOPs" -- whatever that is.) The file is linked from the front page of the website.

I downloaded a copy of the file and used pdftotext (from xpdf) to convert it to a more pliable format, and then went to town. The conversion wasn't perfect (one row of a table in the PDF doesn't always become one row of text; see also the disclaimer at the bottom), but it's very easy to figure out the format by visually comparing the text with the source.
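
For anyone repeating the conversion, here is a minimal sketch of driving pdftotext from Python. The post doesn't say which options were used; `-layout` is a pdftotext option that tries to keep the physical layout of the page, and whether it helps with this particular file is an untested assumption. The file names are hypothetical.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext run."""
    cmd = ["pdftotext"]
    if keep_layout:
        # Assumption: -layout may keep table rows on one text line.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def convert(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises CalledProcessError if the tool fails."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)


# Hypothetical file names, for illustration only:
# convert("taxpayers.pdf", "taxpayers.txt")
```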

(Side note: if I'd tried to import it into Google Sheets or Excel, maybe the conversion would've been better.)

Read the file from the line containing the string "COMPANY Taxpayer" until you see "332024095867" at the beginning of a line. That's the CNIC for the first individual listed in the file (listed as "#OHD ,NAWAZ" -- what does that say about their data validation procedures?).
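
A rough sketch of that slicing step, assuming the converted text has already been split into lines. The function name is made up for illustration; the two markers are the ones quoted above.

```python
def company_section(lines, start_marker="COMPANY Taxpayer", end_prefix="332024095867"):
    """Return the lines from the first one containing start_marker up to,
    but not including, the first later line that starts with end_prefix."""
    section = []
    inside = False
    for line in lines:
        if not inside:
            if start_marker in line:
                inside = True
                section.append(line)
            continue
        if line.startswith(end_prefix):
            break
        section.append(line)
    return section
```

If the start marker never shows up the result is empty, and if the end marker never shows up everything after the start marker is returned.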

Extract all strings in the format <digits>-<digit>. Those are company NTNs.

There are 64,960 companies in the file.
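
The NTN pattern can be written as a regular expression. This is only a sketch: it assumes nothing else in the company section looks like digits, a hyphen and a single digit, and the sample values below are invented.

```python
import re

# One or more digits, a hyphen, then exactly one digit.
# The trailing \b rejects things like "12-345".
NTN_RE = re.compile(r"\b\d+-\d\b")


def find_ntns(text):
    """Return every <digits>-<digit> string found in text, in order."""
    return NTN_RE.findall(text)


# Invented sample lines, not taken from the FBR file.
sample = "0000001-1 SOME COMPANY LTD\n0000002-2 ANOTHER COMPANY\n12-345 not an NTN"
print(len(find_ntns(sample)))  # prints 2
```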

Extract all strings that are 13 digits long. Those are CNICs.

There are 779,077 individual taxpayers in the file. That's an average of 11.993 individual taxpayers per company.
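
The same idea works for CNICs, and the average is a one-line division. A sketch, assuming a CNIC is a run of exactly 13 digits with no separators; the sample values are invented.

```python
import re

# Exactly 13 digits, not part of a longer run of digits.
CNIC_RE = re.compile(r"(?<!\d)\d{13}(?!\d)")


def find_cnics(text):
    """Return every standalone 13-digit string found in text, in order."""
    return CNIC_RE.findall(text)


# Counts reported in this post.
companies = 64_960
individuals = 779_077
print(f"{individuals / companies:.3f}")  # prints 11.993
```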

How about taxes paid? (There are so many Rs 0 entries!) This is a little harder to extract. For companies: find a line that has NTNs in it, skip the next blank line, then read the taxes lines until you reach another blank line. Unfortunately, there's a problem with this approach: I should read 64,960 entries, but I read 64,886. Close enough!

Companies paid a total of Rs 376,174,769,305. 25,464 paid nothing.
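
Roughly, the extraction described above looks like the sketch below. The layout is an assumption based on that description: one or more NTN lines, a blank line, then one amount per line until the next blank line, with amounts written as digits and optional commas. Lines that don't look like an amount are skipped, which is one way entries can go missing.

```python
import re

NTN_RE = re.compile(r"\b\d+-\d\b")
AMOUNT_RE = re.compile(r"^\d[\d,]*$")  # assumed amount format, e.g. "1,500"


def read_tax_amounts(lines):
    """Collect the amounts that follow each block of NTN lines."""
    amounts = []
    it = iter(lines)
    for line in it:
        if not NTN_RE.search(line):
            continue
        # Skip the rest of the NTN block, up to and including the blank line.
        for line in it:
            if not line.strip():
                break
        # Read amounts until the next blank line.
        for line in it:
            text = line.strip()
            if not text:
                break
            if AMOUNT_RE.match(text):
                amounts.append(int(text.replace(",", "")))
    return amounts


def summarize(amounts):
    """Return (total paid, number of entries that paid nothing)."""
    return sum(amounts), sum(1 for a in amounts if a == 0)
```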

People next, using a similar approach. I found 766,791 entries, short of the 779,077 CNICs counted earlier.

Individuals paid Rs 113,663,076,659 and 294,475 paid nothing.
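
To put the gaps in perspective, here is a quick check that uses only the counts reported in this post. The company pass came up 74 entries short and the individual pass 12,286 short. Note that the "paid nothing" share is relative to the entries actually read, not to the full list.

```python
# All numbers are the ones reported in this post.
counts = {
    "companies": {"expected": 64_960, "read": 64_886, "paid_nothing": 25_464},
    "individuals": {"expected": 779_077, "read": 766_791, "paid_nothing": 294_475},
}


def coverage_report(expected, read, paid_nothing):
    """Return the number of missing entries and two percentages."""
    missing = expected - read
    return {
        "missing": missing,
        "missing_pct": 100 * missing / expected,
        "zero_pct": 100 * paid_nothing / read,
    }


for label, c in counts.items():
    r = coverage_report(**c)
    print(f"{label}: {r['missing']} missing ({r['missing_pct']:.2f}%), "
          f"{r['zero_pct']:.1f}% of the entries read paid nothing")
```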

DISCLAIMER: Because the conversion process isn't completely accurate, my numbers are off by a bit. Also, there's a lot of garbage in the file -- name fields with numbers before the names, CNIC fields that are empty or only six or seven digits long, etc.
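
A small sketch of how that garbage could be tallied, assuming the name and CNIC fields have already been pulled out as strings. The function names and category labels are made up for illustration.

```python
import re
from collections import Counter


def classify_cnic_field(field):
    """Put a raw CNIC field into a rough quality bucket."""
    value = field.strip()
    if not value:
        return "empty"
    if not value.isdigit():
        return "non-numeric"
    if len(value) == 13:
        return "ok"
    return "wrong length"


def name_has_leading_number(name):
    """True if the name field starts with a digit (ignoring leading spaces)."""
    return re.match(r"\s*\d", name) is not None


# Invented sample fields, not taken from the FBR file.
fields = ["1234567890123", "", "1234567", "123456", "12345x"]
print(Counter(classify_cnic_field(f) for f in fields))
```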

Monday, April 13, 2015
