Every month, StackOverflow, the brilliant and completely open question and answer site for programmers, releases a Creative-Commons licensed dump of their database. You can download this huge dump yourself, but it’s typically a very large file, and then comes the hassle of getting the XML files into a database of your choice. Now, there is a way to get around those inconveniences, by simply bypassing the whole download procedure and querying the data dump in the cloud! That’s what StatOverflow is for. Just go to the sandbox, enter an SQL query, and voila! Also, there’s a quite helpful meta.stackoverflow post which explains the anatomy of the dump, so that querying it is much easier. Enjoy!
Tag Archive: Programming
Querying the StackOverflow Data Dump Online with StatOverflow
The Twitpocalypse Returns
In my last post, I explained what truly went on in the Twitpocalypse, which dealt with tweet IDs passing the limits of the 32-bit signed integer (from -2,147,483,647 to +2,147,483,647), which is the most common datatype in use in computer applications today. Yet, whilst computer science has limits embedded into its structure, Twitter does not – the number of users and tweets is growing, and FAST! In September, the Twitter API team estimates, we will pass the limits of the 32-bit unsigned integer which is up to 4,294,967,295. Thus, applications and libraries that utilise that datatype in conjunction with tweet identification numbers in the Twitter API will fail, once again. In the post where I explained the first Twitpocalypse, I recommended that all applications switch to the 64-bit signed (or unsigned even) datatype, and then maybe even consider storing the IDs as strings, with variable length (no arithmetic between IDs is really needed, so this should be extremely easy). The Twitter API team is doing the same thing – it has now informed developers that it is best to switch to a 64-bit implementation for storing those integers. Hopefully, this time, we will be prepared for the upcoming Twitpocalypse, and even less confusion and problems will arise when we pass that huge number of tweets. I myself am switching to a string implementation for the IDs in the Twitter client that I am developing (more info coming soon!), so that no matter how many tweets have been posted, the application will not fail. Good luck, everyone, in surviving this next round of this problem. [@twitterapi 's tweet; Twitter API Announcements (via TechCrunch)]
An Explanation of the Averted Twitpocalypse
Numerous friends have asked me about this whole Twitpocalypse thing, as they just don’t get it. “What’s this whole deal with signed integers?”, they ask. Well I’m here to clear that up, as I do programming, and in programming, if you don’t know such basics of computer science, you’re doomed.
What the predicters of the Twitpocalypse were referring to when they mentioned the limit of 2,147,483,647, is the limits of the signed integer data type. The signed integer is a 32-bit data type, so it can hold only so large (or so small) a number – in this case, the highest number the data type can preserve in 32-bits is 2,147,483,647, and the lowest number is -2,147,483,647, respectively. This is what causes the Twitpocalypse problem, as well as the foreseen Unix Year2038 problem (where Unix epoch time will cross the limit, and distros using the 32-bit data type will fail.)
Each tweet in the Twitter infrastructure has its own numeric ID. Thus, programs that invoke the Twitter API’s ID interface into a 32-bit variable fail, as the number of tweets has passed the limit. When this happened, some applications failed, others continued working, depending on their implementation of this feature. Of course, updates were quickly rushed to the failing products, so that users wouldn’t be discouraged or affected for too long. Most applications now use the 64-bit integer datatype (commonly referred to as long or int64) for the ID variable, which supports numbers from −9,223,372,036,854,775,808 to +9,223,372,036,854,775,807 – in other words, pretty large. Until then, we’re safe, but if we manage to cross that 9 quintillion mark (well, it’s certainly possible!
), we may be forced to convert to the 128-bit integer time, or even storing numbers as strings (the datatype for any types of characters, mostly letters), but I think computing will have changed a bunch by then, so that we might not even think of such a problem. That is, if Twitter lasts till then, or maybe the world will have been sucked into a black hole. We’ll see.
(On a side note, last year, I created a VeryLargeInt class for storing huge numbers (100 digits long can be stored with this, or even more!) by using the string data-type, and I also built algorithms for computation with strings, including most common mathematical operators. I may publish this project here in the future.)
I’ve always wanted to experiment with creating “sessions” on websites to mimic real users browsing and using a site, and I’ve finally gotten around to a project that involves this: my goal is to “wrap” an online database that is queried through a series of HTML forms, returns a map, and then allows you to open a popup with an HTML table outlining all the points displayed on the map. A formidable challenge, indeed, but in the awesome language of C#, no big deal.
To successfully establish a user session with a website, you must preserve the correct authentication and session cookies. The web database was built using ASP.NET, so I needed to make sure that I had an ASP.NET Session ID. One of the best ways to find out where cookies are sent to you is using an HTTP proxy to analyze requests and responses. The program of choice is called Fiddler, developed by Microsoft. (You can get the latest version at http://fiddler2.com/.) Unfortunately, I was forced to use Firefox to track packets because the old (straight out of 1997) interface recognizes the new and revolutionary browser Google Chrome as being Safari 1.3 (how sad), and doesn’t allow you to use their database. and you must set special proxy settings in Firefox for Fiddler to function correctly. After visiting the site and filling out the series of HTML forms, I could see how the report page was functioning – it was using the ASP.NET Session ID along with referrer HTTP header attributes to find the parameters of my original query and retrieve information from the database.
That might sound like an easy interface to hack, but the cookies are deceptive as you don’t exactly know when they are set. Fortunately, after scouring the internet for an hour or so, I happened to find out about the CookieContainer datatype (HttpWebRequest.CookieContainer). All you have to do to create a cookie “jar” that carries over from one HTTP session to another is writing the following code:
CookieContainer cookieJar = new CookieContainer();
var request=(HttpWebRequest) WebRequest.Create("http://google.com");
request.CookieContainer = cookieJar;
Now you can access the cookies by looping through the CookieContainer in a simple foreach statement. To apply this cookie container to your next HttpWebRequest, all you have to do is:
request.CookieContainer = cookieJar;
Yay!
Using .NET Reflector to Improve Code Efficiency
If you’re a .NET developer, admit it: you’ve at least once felt curious about how the underlying APIs behind the .NET Framework are coded, how certain functions are implemented, or whether one core function is more efficient that another. Well, this is where the trusty program named .NET Reflector comes in. As the title suggests, Reflector uses .NET reflection and ILDASM processing. It is the decompiler of the .NET world, and allows you to view source code of DLLs or executables in multiple languages (including C#, VB, C++, Delphi, IL, and more) with ease. It is very interesting to examine how the brilliant programmers at Microsoft sculpted the inner workings of the programming framework that drives many applications today. With this tool, you can accomplish just that and much much more. The Reflector was originally created by Lutz Roeder, but has been acquired by Red Gate Software. The program is free for all to download and use, and it can even Reflect upon itself! Now that is true irony. You can snag a copy over at the Red Gate site.
