html-excel bleg

Techo-nerds, can you help? A student of mine downloaded about thousand spreadsheets from a public site using the “Excel file” option that saved themselves as .xls files and will open in Excel but are REALLY HTML files and, as such, cannot be imported or even parsed by Stata. Any ideas for automating the file translation? We estimate that opening each file in Excel and saving it as an Excel file at 30 seconds each will take 35 hours. Hoping for a programming solution.

My university’s class rosters ALSO download with .xls extensions but are really html files. Hmmm.

Edit: I think I can crack this. I’ve learned that I can read each file into Stata as lines of text this way:

import delimited “census_Tract101.xls”, delimiter(“^”) varnames(nonames) clear

from there, I’m pretty sure I can fairly easily extract the information needed with string functions, as all the files have identical formats. This may be more elegantly done in R or a programming langauge, but I think I can do it in Stata faster. We’ll see. Thanks for the fast responses.

Edit #2: That did it. My Stata-fu is strong and once I could get each file into Stata as a long string per row, I was good to go as regards writing the code to parse each file inside loops and combine all them into one big file. If you happen to want clues on how to do this kind of arcane task, let me know.



Author: olderwoman

I'm a sociology professor but not only a sociology professor. I keep my name out of this blog because I don't want my name associated with it in a Google search. Although I never write anything in a public forum like a blog that I'd be ashamed to have associated with my name (and you shouldn't either), it is illegal for me to use my position as a public employee to advance my religious or political views, and the pseudonym helps to preserve the distinction between my public and private identities. The pseudonym also helps to protect the people I may write about in describing public or semi-public events I've been involved with. You can read about my academic work on my academic blog --Pam Oliver

7 thoughts on “html-excel bleg”

    1. +1. I don’t think Stata has any way to read HTML tables. You could write one use the file command but that might take more than 35 hours.

      You might also be able to write a loop in Stata and use a converter tool, such as table2csv


  1. Possibly StatTransfer will do the trick as well. It has a command line syntax which can be initiated from Stata code via a Stata addin. R is probably more efficient if you’re already conversant in that language.


      1. Stat transfer was one of my first ideas. What Gabriel said: it is write-only for html.

        However, I HAVE made progress. I have discovered that I can read the html lines into Stata as one long string variable with import delimited without having to change the file name:
        import delimited “census_Tract101.xls”, delimiter(“^”) varnames(nonames) clear

        I’m pretty sure I can write the Stata code to pluck the info she needs out of it with less trouble than it would take me to learn R. In this case, the files all have the same 16 pieces of information (a crime category and a frequency) in exactly the same format.


  2. You might try writing a macro in excel. Run a loop through each file in the folder, opening and saving each file in turn.


Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s