First, I loaded the data and plotted the distributions to look for noticeable trends. I started with demand versus temperature, windspeed, and humidity. All three showed clear and understandable trends: people ride bicycles more when the weather is suitable.
Then, I looked into dates. The first graph shows that bike usage increased significantly in 2012 compared to 2011, which suggests that the company grew. In the month graph, I noticed that demand is high from spring to fall, when the weather is more likely to be nice for riding. A fun fact is that demand in January and December is significantly different even though the two months are consecutive. The year graph explains this: the company grew throughout the period, so January 2011 + January 2012 is much smaller than December 2011 + December 2012. The date graph shows nothing of much significance, but the hour graph says a lot about the trend. From 7am to 9am, demand is high because that is when people go to school or work. The same holds between 5pm and 7pm, when people go back home. Unsurprisingly, demand is low during the night.
These are the graphs based on hour. The first one has two trends: the blue line is non-working days, which include holidays and weekends, and the orange line is working days. As explained above, demand on working days is high during 7-9am and 5-7pm. On non-working days, however, the trend is more gradual, with high demand throughout the daytime. The second graph is drawn by day of the week, from Monday as 0 to Sunday as 6. Interestingly, Friday (4) has relatively lower demand during 5-7pm than the other working days; maybe people tend to go out for a drink on Friday night instead of going straight home.
Preprocessing: I added some features that I expected to improve the results of my prediction model. Based on my analysis of time, I added “rushhour” and “dayhour” features. The “rushhour” feature is True if the hour falls in the 7am – 9am or 5pm – 7pm period on a working day. Similarly, the “dayhour” feature is True if the hour falls in the 8am – 8pm period on a non-working day.
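A minimal sketch of how these two flags could be computed for a single hourly record. The function name `time_flags` and the inclusive window boundaries are assumptions made for illustration; the write-up does not say whether the end hours (9am, 7pm, 8pm) are included.

```python
def time_flags(hour, workingday):
    """Return (rushhour, dayhour) flags for one hourly record.

    hour       -- integer hour of day, 0-23
    workingday -- True on working days, False on weekends and holidays
    """
    # Assumption: both ends of each commute window are inclusive
    # (hours 7, 8, 9 and 17, 18, 19).
    in_commute_window = 7 <= hour <= 9 or 17 <= hour <= 19
    # Assumption: the 8am-8pm window covers hours 8 through 20.
    in_daytime_window = 8 <= hour <= 20

    rushhour = bool(workingday) and in_commute_window
    dayhour = (not workingday) and in_daytime_window
    return rushhour, dayhour


# A few example records as (hour, workingday) pairs.
records = [(8, True), (8, False), (13, True), (13, False), (18, True), (23, False)]
flags = [time_flags(hour, workingday) for hour, workingday in records]
```

By construction the two flags are never both True for the same record, since one requires a working day and the other a non-working day.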
For windspeed, I noticed that there are many 0s in the data, as shown above. It is very unlikely that the wind suddenly stopped completely, so I believe the anemometer malfunctioned during those periods. There were a total of 831 0s out of 10,887 records, which is too many to ignore. Therefore, I designed a model to predict the missing windspeed from all the features except windspeed.
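The idea can be sketched as follows: treat every zero reading as missing, fit a regressor on the rows with trusted readings, and overwrite the zeros with its predictions. The write-up does not name the model, so the random forest here is an assumption, as are the function name `impute_zero_windspeed` and the synthetic demo data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def impute_zero_windspeed(features, windspeed, random_state=0):
    """Replace zero windspeed readings with model predictions.

    features  -- 2-D array of the other columns (e.g. temp, humidity, hour)
    windspeed -- 1-D array of readings, where 0 is treated as missing
    """
    features = np.asarray(features, dtype=float)
    windspeed = np.asarray(windspeed, dtype=float)
    missing = windspeed == 0
    if not missing.any() or missing.all():
        return windspeed.copy()  # nothing to fill, or nothing to learn from

    # Fit only on rows where the reading is trusted.
    # Assumption: a random forest; the original model type is not stated.
    model = RandomForestRegressor(n_estimators=50, random_state=random_state)
    model.fit(features[~missing], windspeed[~missing])

    filled = windspeed.copy()
    filled[missing] = model.predict(features[missing])
    return filled


# Tiny synthetic demo (made-up numbers, not the competition data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
wind = 5 + 10 * X[:, 0] + rng.normal(0, 0.5, size=200)
wind[:20] = 0  # pretend the sensor dropped out for the first 20 rows
wind_filled = impute_zero_windspeed(X, wind)
```

Rows with a nonzero reading are left untouched; only the zero readings are replaced.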
Then, I trained the model on the selected features.
I made two separate models for casual users and registered users, because I believed the trend in casual users’ demand must be different from that of registered users. In the end, I simply added the two predictions to get the total count.
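A rough sketch of the two-model setup: one regressor per user type, with the predictions summed to give the total count. The model type (random forest), the helper names `fit_split_models` and `predict_total`, and the synthetic data are all assumptions for illustration, not the configuration used for the submission.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_split_models(X, casual, registered, random_state=0):
    """Fit one regressor per user type on the same feature matrix."""
    # Assumption: random forests; the write-up does not name the model.
    casual_model = RandomForestRegressor(n_estimators=50, random_state=random_state)
    registered_model = RandomForestRegressor(n_estimators=50, random_state=random_state)
    casual_model.fit(X, casual)
    registered_model.fit(X, registered)
    return casual_model, registered_model


def predict_total(models, X):
    """Predict the total count as casual prediction + registered prediction."""
    casual_model, registered_model = models
    return casual_model.predict(X) + registered_model.predict(X)


# Synthetic demo: the two user types respond to different features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 2))
casual = rng.poisson(5 + 20 * X[:, 0])
registered = rng.poisson(50 + 100 * X[:, 1])

models = fit_split_models(X[:250], casual[:250], registered[:250])
total = predict_total(models, X[250:])
```

Keeping the models separate lets each one learn its own pattern, which matters if the two user types respond differently to the same conditions.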
I used cross-validation for my own purposes, to check whether the model had improved without submitting to kaggle.com every time.
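A minimal, dependency-free sketch of such a local check. It assumes the score is the root mean squared logarithmic error (RMSLE), which is the evaluation metric of the Kaggle Bike Sharing Demand competition. The helper names, the contiguous fold layout, and the mean-value baseline are hypothetical choices for illustration.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error between two equal-length lists."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size


def cross_val_rmsle(targets, fit_predict, k=5):
    """Average RMSLE over k folds.

    fit_predict(train_idx, valid_idx) must return predictions for valid_idx.
    """
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(targets), k):
        predictions = fit_predict(train_idx, valid_idx)
        scores.append(rmsle([targets[i] for i in valid_idx], predictions))
    return sum(scores) / len(scores)


# Demo with made-up hourly counts and a baseline that predicts the training mean.
counts = [16, 40, 32, 13, 1, 1, 2, 3, 8, 14]


def mean_baseline(train_idx, valid_idx):
    mean = sum(counts[i] for i in train_idx) / len(train_idx)
    return [mean] * len(valid_idx)


baseline_score = cross_val_rmsle(counts, mean_baseline, k=5)
```

Any real model can be plugged in through `fit_predict`; a lower average than the baseline suggests an improvement worth submitting.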
My final submission scored 0.38887, which is in the top 4.77% of the competition.