Data is everywhere, zettabytes (10^21 bytes) of it. But is the data good data?
Most of the time it is not, and bad data comes with big consequences. Today we will learn about data quality and understand when data is good and when it is bad.
Data Quality and Issues:
Data must have certain qualities to be good data. If one or more of them are missing, it is bad data. But it is not always possible to get good data, so understanding the data thoroughly is very important when working with it. Bad data that goes unnoticed does harm. Let’s see the qualities of data.
It is not always possible to get complete data. Sometimes you have to drop rows with missing values, and sometimes you just leave them as they are. Either way, you must understand whether there is a pattern in what is missing. Here is why:
“The most important data is the data you don’t have” – Abraham Wald
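One way to look for such a pattern is to compare how often a value is missing across groups. The sketch below uses only the Python standard library; the survey rows, the column names `age_group` and `income`, and the helper `missing_rate_by_group` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age_group": "18-25", "income": 32000},
    {"age_group": "18-25", "income": None},
    {"age_group": "18-25", "income": None},
    {"age_group": "26-40", "income": 54000},
    {"age_group": "26-40", "income": 61000},
    {"age_group": "26-40", "income": None},
    {"age_group": "41-60", "income": 72000},
    {"age_group": "41-60", "income": 68000},
]


def missing_rate_by_group(rows, group_col, value_col):
    """Return {group: fraction of rows in that group where value_col is missing}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_col]
        total[group] += 1
        if row[value_col] is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


rates = missing_rate_by_group(rows, "age_group", "income")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} missing")
```

If the missing rate differs a lot between groups, as it does in this toy data, the values are probably not missing at random, and simply dropping those rows would skew the analysis toward the groups that answered.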
Is the information both correct and precise? Without accurate data the whole analysis is wrong, and this leads to wrong decisions. So the source and the process of data collection must be reliable, and after collection the data must be verified and examined.
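Part of that verification can be automated with simple plausibility rules. The following is a minimal sketch, assuming hypothetical `age` and `email` columns and hand-written rules; it is not a complete validation scheme.

```python
# Hypothetical validation rules: each column maps to a check that
# returns True when the value looks plausible.
rules = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "email": lambda v: isinstance(v, str) and v.count("@") == 1,
}

# Made-up records for illustration.
records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},
    {"age": 27, "email": "not-an-email"},
]


def find_violations(records, rules):
    """Return (row_index, column, value) for every value that fails its rule."""
    violations = []
    for i, rec in enumerate(records):
        for col, is_valid in rules.items():
            value = rec.get(col)
            if not is_valid(value):
                violations.append((i, col, value))
    return violations


violations = find_violations(records, rules)
for row_index, col, value in violations:
    print(f"row {row_index}: suspicious {col} = {value!r}")
```

Rules like these only catch values that are impossible or malformed. A value that is plausible but wrong still needs a reliable source or a manual check.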
Something is wrong in the table above. We humans can easily spot it when the data set is small. But when the data set is overwhelmingly big, or when we feed it directly to a machine, the result will be a faulty model. So there must be a process for checking uniformity.
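A uniformity check can be as simple as counting which format each value in a column follows. The sketch below assumes a hypothetical date column and two illustrative format patterns; the sample values are invented and a real data set would need its own patterns.

```python
import re
from collections import Counter

# Illustrative format patterns; extend these for the formats you expect.
PATTERNS = {
    "YYYY-MM-DD": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "DD/MM/YYYY": re.compile(r"\d{2}/\d{2}/\d{4}"),
}


def classify(value):
    """Return the name of the first pattern the whole value matches, else 'unknown'."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return "unknown"


def format_report(values):
    """Count how many values follow each known format."""
    return Counter(classify(v) for v in values)


# Made-up column with mixed date formats.
dates = ["2021-03-04", "2021-03-05", "06/03/2021", "March 7, 2021"]
report = format_report(dates)
print(report)
```

A column is uniform when the report contains a single format. More than one entry, or any `unknown` count, points to rows that should be converted before the data reaches a model.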
Repeated data leads the analysis to wrong conclusions, because the model will be over-fitted to the repeated samples. You need to understand the columns before you can remove duplicate rows.
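Which rows count as duplicates depends on which columns identify a record. The sketch below, with a hypothetical `drop_duplicates` helper and made-up customer rows, keeps the first row for each key and shows how the result changes with the choice of key columns.

```python
def drop_duplicates(rows, key_columns):
    """Keep the first row seen for each combination of key_columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up customer rows for illustration.
customers = [
    {"id": 1, "name": "Ana", "city": "Dhaka"},
    {"id": 2, "name": "Ben", "city": "Pune"},
    {"id": 1, "name": "Ana", "city": "Dhaka"},  # exact repeat of the first row
    {"id": 3, "name": "Ana", "city": "Dhaka"},  # same name and city, different id
]

by_id = drop_duplicates(customers, ["id"])
by_name_city = drop_duplicates(customers, ["name", "city"])
print(len(by_id), len(by_name_city))
```

Deduplicating on `id` keeps three rows, while deduplicating on `name` and `city` keeps only two. Whether the customer with `id` 3 is a different person or a repeat entry is a question only knowledge of the data can answer.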
When was the data collected, and for what period is it valid? Is your information up to date? Imagine you use a database for, say, a student council election campaign, only to find that most of its contacts are a couple of years old.
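If each record carries the date it was last updated, stale entries can be flagged before they are used. This is a minimal sketch, assuming a hypothetical `last_updated` field, invented contacts, and a one-year freshness limit chosen only for the example.

```python
from datetime import date, timedelta


def stale_records(records, today, max_age_days):
    """Return the records whose last_updated date is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if r["last_updated"] < cutoff]


# Made-up contact list for illustration.
contacts = [
    {"name": "Ana", "last_updated": date(2024, 1, 10)},
    {"name": "Ben", "last_updated": date(2021, 9, 1)},
    {"name": "Cy", "last_updated": date(2020, 5, 20)},
]

stale = stale_records(contacts, today=date(2024, 6, 1), max_age_days=365)
for contact in stale:
    print(f"{contact['name']} was last updated on {contact['last_updated']}")
```

The right age limit depends on how fast the data changes: contact details for students may go out of date within a year, while other data stays valid much longer.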
There’s a saying that data scientists spend 80% of their time preparing data.
This stage is very important for getting good-quality data.