So far, we have been using small, well-structured data in our examples. However, the real world is messy and we need to spend a significant amount of effort to make the data usable. In fact, don’t be surprised if you end up spending the bulk of your time collecting and curating data. In this post, I just want to talk about a few practical aspects of this tedious phase.
Be a sceptic
This may be a bit too basic but I think it’s important to be sceptical of any data at the beginning. Unless you are very confident about the source, just start with the assumption that it’s wrong. Find various ways to dissect and verify it. Just remember .. garbage in, garbage out.
Data is everywhere but …
Yes, data is lying all around us, however, in most cases, it’s not readily usable. Let’s say, one day, while chatting with a friend who is having trouble selling her house, you have one of those light-bulb moments. You recall the examples you have seen here and figure it would be trivial to ascertain the “sellability” of a house based on past data. This seems like a great idea, you get all amped up about it and start planning your next steps.
You start digging and find a nicely structured CSV file with a lot of features .. bingo! But soon you realise that you need something in that dataset that tells you how long it took for each house to actually sell. You can see the sale date in the CSV, but you don’t have the date when the house went on sale.
Hmm .. surely someone has already done the hard work for us, let’s google it. Nope, no luck! You keep digging and see that you can get the data for most of the recent sales from your city’s most popular real-estate website. Great!
So let’s see what that involves.
- Figure out how to search for an address on the real-estate website
- Parse the search results to locate the right match and follow the link
- Parse the HTML on the page and extract the listing date
- Write a script to automate the above for every property in the CSV
- Append the data to the CSV
- Make sure you don’t get blocked by the real-estate website for crawling their data
The data is there but it’s a lot of work getting it in a usable format. And this is when you realise that it’s not so trivial after all!
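To make the parsing step a bit more concrete, here is a minimal sketch in Python. The page structure (a `listed-date` span) is entirely hypothetical – you would inspect the real site’s HTML and adapt the pattern – and for anything serious you would use a proper HTML parser instead of a regex.

```python
import re
import time

# Hypothetical HTML fragment -- a real listing page will look different,
# so the pattern below is an assumption you would adapt after inspecting it.
SAMPLE_PAGE = '<div class="listing"><span class="listed-date">2021-03-14</span></div>'

def extract_listing_date(html):
    """Pull the listing date out of a page, or None if it isn't there."""
    match = re.search(r'class="listed-date">(\d{4}-\d{2}-\d{2})<', html)
    return match.group(1) if match else None

def scrape_all(pages):
    """Extract the listing date from each fetched page."""
    dates = []
    for html in pages:
        dates.append(extract_listing_date(html))
        time.sleep(1)  # be polite: pause between requests so you don't get blocked
    return dates

print(extract_listing_date(SAMPLE_PAGE))  # 2021-03-14
```

The `time.sleep` call is the crude version of the “don’t get blocked” point above – real crawlers respect robots.txt and back off much more carefully.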
You need a lot of data
Small datasets are pretty much useless when it comes to machine learning; the more data you have, the better. Admittedly, after a certain point, the marginal utility of bringing in more data starts going down, but those situations are pretty rare.
To understand this better, I encourage you to train the same models with different volumes of data and see how that impacts the outcomes. You will see there is a significant difference between models trained on small and large datasets.
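If you want to try this without hunting for a real dataset, here is a toy sketch using synthetic data: we generate fake house prices with a known relationship, fit a line with NumPy, and watch how far the fitted slope drifts from the truth at different data volumes. All the numbers (true slope of 3000, the noise level, the sizes) are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_error(n, trials=30):
    """Fit price = slope * area + noise on n synthetic houses and
    return the average distance of the fitted slope from the true 3000."""
    errs = []
    for _ in range(trials):
        area = rng.uniform(50, 300, n)                   # floor area in m^2
        price = 3000 * area + rng.normal(0, 50_000, n)   # true slope 3000, plus noise
        slope = np.polyfit(area, price, 1)[0]            # fit a straight line
        errs.append(abs(slope - 3000))
    return float(np.mean(errs))

for n in (20, 200, 2000):
    print(f"n={n:5d}  average slope error: {slope_error(n):8.1f}")
```

You should see the average error shrink noticeably as n grows – the same model, just fed more data.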
As we need lots of data, the best and fastest way is to just get it from the folks who have it. In some cases, vendors like the real-estate website might be willing to sell you the data at a cost. However, it’s more likely that you will just have to rely on old-fashioned hard work and patience to extract the data, either manually or through bots. If we are talking about millions of houses in the dataset, it might not be practical to do this work manually, so you will need to write scripts (or find someone to write them for you) to scrape the data.
Also, as you don’t want to get blocked for crawling, you will need to do it rather slowly so as not to piss off the real-estate website’s server admins. So, millions of records to scrape, slowly? It’s going to take time .. plenty of time!
I encourage you to start small and simple. No need to go all out and bring in 300 features; start with the few that have the clearest and highest value. Also, no need to start with all 200 million rows you might have available; start with a few thousand. Build your model incrementally – bring in new features, test, validate with a bigger dataset, repeat.
Not all data is useful (for you)
Generally, it’s good to have more data, both in terms of volume and characteristics. However, keep in mind that you don’t need to use all of the data, all the time. For instance, you might find a column with the name of the seller in the house prices dataset. Now you must ask yourself .. is it useful? Not really, it doesn’t practically impact the outcome of the sale (or at least it shouldn’t!). So we can safely remove that column from our data and celebrate the smaller file we need to load into our model.
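Dropping such a column takes one line with pandas. The tiny house-prices table below is made up for illustration:

```python
import pandas as pd

# Toy slice of a hypothetical house-prices dataset
houses = pd.DataFrame({
    "bedrooms": [3, 4, 2],
    "price": [550_000, 720_000, 410_000],
    "seller_name": ["A. Smith", "B. Jones", "C. Wu"],
})

# The seller's name shouldn't influence the sale, so drop it
houses = houses.drop(columns=["seller_name"])
print(list(houses.columns))  # ['bedrooms', 'price']
```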
Before starting the work on modelling, always take a quick look at your data and see if it’s evenly distributed. What do I mean by that? Well, let’s say that you want to predict the outcome of a blood test that applies equally to all genders. Unfortunately, however, your data contains more past results from females than males, leading your ML algorithm to be skewed towards females. You can counter this situation by removing some data OR gathering more data about males to even things out.
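One simple way to even things out is to downsample every group to the size of the smallest one. Here is a sketch with pandas on made-up blood-test data (the column names and values are arbitrary):

```python
import pandas as pd

# Made-up test results: 6 female records, only 2 male
results = pd.DataFrame({
    "sex": ["F"] * 6 + ["M"] * 2,
    "value": [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0],
})

# Downsample each group to the size of the smallest group
smallest = results["sex"].value_counts().min()
balanced = results.groupby("sex").sample(n=smallest, random_state=0)
print(balanced["sex"].value_counts())
```

The obvious trade-off: you are throwing data away, which is why gathering more of the under-represented group is usually the better option when it’s feasible.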
If you have experience working with databases, you would already know that dealing with numbers is much faster than dealing with text. As ML is all maths in the background, you can well imagine how tricky it becomes to work with text. Given that, be prepared to incorporate strategies for converting text data into numerical form.
One simple example could be a male/female gender column in the data. The simplest way would be to turn that binary data into an on/off switch. Let’s just have one column called Female and use 1 when it’s a female and 0 when it’s a male. Nice and easy!
What if instead of just the binary male/female options, you had male/female/other options? In that case, I would recommend having 3 columns: Male, Female and Other. Use 1 for the relevant column and 0 for the other two.
You might be wondering why we didn’t just use female = 1, male = 2 and other = 3 in a single column. Well, doing it that way has undesirable implications. Although the assignment of those numbers is totally arbitrary and meaningful only to us, the ML algorithm will see it differently: “Hmm. It’s a 3, a bigger number, it means something!”. To avoid that situation, we just stick with binary columns where the values are limited to 0 and 1. By the way, the technical term for this type of transformation is one-hot encoding.
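pandas can do this transformation for you with `get_dummies`:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "other", "female"]})

# One-hot encode: one 0/1 column per category
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)
print(encoded)
```

Each category becomes its own 0/1 column (`gender_female`, `gender_male`, `gender_other`), so no category looks “bigger” than another.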
Having said that, modern machine learning frameworks have the ability to deal with text data as well but we will get to that in due course.
Numbers lie too!
Just because a feature is represented in numbers, doesn’t automatically make it good. Case in point, here in Australia, the postal codes are represented by numbers e.g. Sydney CBD is 2000, Melbourne CBD is 3000, Adelaide CBD is 5000. That in no way means that Adelaide is better than Melbourne, in fact, quite the opposite. But then that would mean Sydney is better than Melbourne .. nope, Sydney sucks. In fact, let’s just move the Opera house to Melbourne and be done with Sydney.
Anyway, it’s possible that the number of unique postcodes in your data is in the thousands. It’s still possible to use the one-hot technique here but what if you want a prediction on a postcode that isn’t already in your training dataset? In this case, you should first think really hard about why you need the postcode in there, is it a proxy for some other attribute? Maybe you want to use the postcode to identify affluent areas .. in which case, it might be better to just use the average income of that area.
If you really want to use the postcode, you could potentially use a simple approach of transforming it into a variable that shows the likelihood of that postcode occurring in the dataset. For instance, if, out of 10 records, postcode 3000 occurs 6 times, you would have 0.6 as the value in the new column. There are other strategies too; I am hoping to explore them in the future.
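Here is what that frequency transform looks like in pandas, using the same toy numbers (postcode 3000 appearing 6 times out of 10):

```python
import pandas as pd

df = pd.DataFrame({
    "postcode": [3000, 3000, 3000, 3000, 3000, 3000, 2000, 2000, 5000, 5000],
})

# Replace each postcode with its relative frequency in the dataset
freq = df["postcode"].value_counts(normalize=True)
df["postcode_freq"] = df["postcode"].map(freq)
print(df.drop_duplicates())
```

An unseen postcode simply maps to a missing value here, which you would then fill with 0 or some sensible default.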
Continuous vs Discrete
I personally like to think of continuous variables as measures e.g. distances, time, temperature etc. On the other hand, I think of discrete values as counting e.g. number of students, number of coins etc.
Another way to think about continuous variables is to think about the divisibility of the unit: can I divide it into further parts? For instance, the distance between your home and office could be 1.1km or 1.101km or 1.10101km, the options are infinite.
Discrete data, on the other hand, would have values that can be counted, i.e. a countable set. For instance, the world might have a lot of coins, and it would take a lot of time to count them, but they are countable, and the value can’t be subdivided into smaller parts e.g. you can’t have 1.1 or 1.101 coins.
It is important to understand this distinction because you will deal with them differently in your modelling. As a general rule of thumb, discrete values can be turned into columns (even if the column count becomes very high) with one-hot encoding; it’s not possible to do that with continuous values (as you can’t have infinite columns).
An interesting question that helped me understand the distinction better – is “age” continuous or discrete? Well, kinda both, depending on how you treat it. If you are just counting the years, then it’s discrete; if you are measuring the actual time since birth, then it’s continuous.
Time in itself is meaningless
No, I am not going to go all philosophical on you. I just think timestamps in the data are meaningless on their own unless we assign some meaning to them. For instance, in our house “sellability” example above, we took 2 dates, subtracted the listing date from the sale date and got a directly useful variable .. “days on sale”. Similarly, having a “date of birth” column in there might not be directly useful until we turn that into “age in years” etc.
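With pandas, turning the two dates into a “days on sale” feature is straightforward once the columns are parsed as datetimes. The listing and sale dates below are made up:

```python
import pandas as pd

sales = pd.DataFrame({
    "listed_date": ["2021-01-10", "2021-02-01"],
    "sale_date":   ["2021-03-01", "2021-02-20"],
})

# On their own the timestamps mean little; their difference is the useful bit
sales["listed_date"] = pd.to_datetime(sales["listed_date"])
sales["sale_date"] = pd.to_datetime(sales["sale_date"])
sales["days_on_sale"] = (sales["sale_date"] - sales["listed_date"]).dt.days
print(sales["days_on_sale"].tolist())  # [50, 19]
```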
Digital data – 0s and 1s
When I say digital data, I am referring to things like images, videos, sound files etc. For instance, all those image-classification or dystopian facial recognition systems that we all keep reading about are using digital data.
These are intimidating, complex systems indeed, but you should understand that digital data is just a sequence of 0s and 1s, which means you can think of the data as features/matrix/array. Next, you can take comfort in the fact that the data in there is already one-hot encoded! Of course, this is an oversimplification but hopefully it gives you the core idea.
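As a tiny illustration, here is a hand-made 4×4 “image” in NumPy. Every pixel is just a number, and flattening the grid gives an ordinary feature vector like any other row of data:

```python
import numpy as np

# A tiny 4x4 "image": each pixel is already just a number (here 0 or 1)
image = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

# Flatten it into a feature vector, exactly like any other row of data
features = image.flatten()
print(features.shape)  # (16,)
```

Real images have many more pixels (and usually 3 colour channels with values 0–255 rather than 0/1), which is exactly where the processing-power caveat below comes from.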
However, keep in mind, dealing with digital data will require a bit of processing power. So either prepare yourself to be a bit patient with your poor laptop or have more computing resources at your disposal.
Spend a lot of time verifying and understanding your data. Figure out the features that offer the highest value, discard the ones that don’t. Start small and build on your successes.