What does “good data” mean? Can all data be used to conduct a study or provide information to interested parties? What steps should be taken to determine whether data is “good” or not? Before building a model, whether a linear regression, a neural network, a logistic regression, or a decision tree, you should take steps to verify the correctness and accuracy of your dataset. You are probably thinking, “Well … all this extra data processing just takes time away from building my model.” True! However, spending that extra time verifying the accuracy of your data is a great way to prevent errors in later stages of the job. Never forget that error propagation is a very real thing, and any error in the original dataset can lead to larger errors in your final analysis. Let’s get started.
1. Check the source
No, this step does not require a single line of code in R, Python, or Julia, and it does not require complex algorithms that eat up computing power. It simply requires the ability to search the web and dig into a potential data source. While it may seem rather rudimentary, most of the datasets you can freely obtain from the scientific community, government agencies, or universities are usually well vetted. There are many quality-control and calibration protocols in place to ensure that users can work with the data as correctly and accurately as possible.
How can you tell if a source is valid? First, look at the organization hosting the data. As mentioned above, is it a legitimate institution for disseminating data? The U.S. Geological Survey’s EarthExplorer tool is an online archive that I often use for LANDSAT satellite imagery. The USGS is an accredited source backed by the U.S. Federal Government. But what if I get my data from, say, Joe’s Wild World of Satellite Data? Joe’s site seems to have a lot of freely available data, but he has simply collected it into an archive and hosted it on his website. I don’t really know where he got the data from; I just know it is there, and it may or may not be accurate.
The general rule of thumb is: if the data is not hosted on an accredited or authoritative website, be skeptical of its accuracy from the start.
2. Metadata review
Metadata is one of those things everyone learns about and then quickly forgets. Metadata is the descriptive information about a dataset that usually ships along with it. You can think of metadata as a dataset’s README file. It explains properties such as the variables recorded in the dataset, its temporal coverage, its accuracy (or precision), its release date, its creator, and so on.
When you pull datasets directly into Excel, R, Python, or Julia using wget or a REST API, the metadata does not always come along. Sometimes the data lands directly in a data frame, matrix, or vector, and you never pause for a proper accuracy check. In other cases, you can download data without receiving any metadata at all, in which case you should be cautious about the data in your possession. Try doing some extra digging to learn more about what you just brought into R. As a challenge: can you find the metadata for the commonly used Iris dataset?
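If you take up that challenge in Python, one place to start is scikit-learn, which ships the Iris data together with its metadata (in R, ?iris opens the equivalent documentation). Here is a minimal sketch:

```python
# Minimal sketch: the Iris dataset, as bundled with scikit-learn,
# carries its README-style metadata in the DESCR attribute.
from sklearn.datasets import load_iris

iris = load_iris()

# DESCR describes the variables, creator, source, and summary statistics.
print(iris.DESCR)

# The variable and class names are part of the metadata too.
print(iris.feature_names)
print(iris.target_names)
```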
Always check the metadata, both to understand the data you are using and to assess its authenticity. Are there any missing fields? Missing details? If an important, obvious detail seems to be missing from a metadata field, additional digging may be needed to find it. Always do thorough background research by finding and reading the metadata.
3. (Comparative) descriptive statistics
Descriptive statistics alone are not sufficient to examine data quality. The mean, median, minimum, and maximum of a dataset cannot tell you whether the data comes from a reputable source or whether it has been altered in any way. Missing values are not indicators of fraudulent data; many reputable data sources have missing values. Zeros are not indicators of bad data either: many observations can legitimately be zero, and while a source could use zero as a fill value to flag bad data, a zero fill value would skew the dataset’s overall statistics, so you are far more likely to see N/A, Null, or -9999.
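To see how an uncaught fill value distorts summary statistics, here is a small illustrative sketch in Python with pandas; the values, including the -9999 fill value, are made up for demonstration.

```python
import numpy as np
import pandas as pd

# Toy series standing in for a downloaded variable; -9999 is a
# hypothetical documented fill value, not from any specific dataset.
precip = pd.Series([0.0, 2.3, -9999.0, 1.1, np.nan, 0.0])

print(precip.describe())   # -9999 drags the mean and minimum down

# Treat the documented fill value as missing before computing stats.
cleaned = precip.replace(-9999.0, np.nan)
print(cleaned.describe())  # statistics now reflect real observations
print(f"missing: {cleaned.isna().sum()} of {len(cleaned)}")
```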
However, if your data is fairly common, you can compare two datasets to see whether they fall within each other’s bounds. I refer to this as comparative descriptive statistics. For example, two datasets I use in precipitation studies are the North American Regional Reanalysis (NARR) and the Climate Forecast System (CFS). Both datasets are rasters, have different cell sizes (resolutions), and are measured slightly differently. I can take observations from both over the same time period, run descriptive statistics, and see how similar they are. Typically, despite the difference in resolution, I see similar results.
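Extracting raster values is beyond a short snippet, but here is a minimal sketch of the comparison itself in Python with pandas, assuming you have already flattened each dataset into a CSV of precipitation values for the same region and period; the file names, column name, and 10% threshold are placeholders, not a real NARR/CFS workflow.

```python
import pandas as pd

# Hypothetical pre-extracted samples from the two datasets.
narr = pd.read_csv("narr_precip_sample.csv")["precip_mm"]
cfs = pd.read_csv("cfs_precip_sample.csv")["precip_mm"]

# Put the two summaries side by side for easy comparison.
comparison = pd.concat(
    [narr.describe(), cfs.describe()],
    axis=1,
    keys=["NARR", "CFS"],
)
print(comparison)

# Flag any summary statistic that differs by more than 10 percent;
# pick a threshold that suits your own tolerance.
pct_diff = (comparison["NARR"] - comparison["CFS"]).abs() / comparison["CFS"].abs()
print(pct_diff[pct_diff > 0.10])
```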
If your own study involves more than one dataset, it is a good idea to grab both and compare their values. Major inconsistencies should be investigated and, if necessary, used as grounds to discard the suspect data.
I hope these three tips help you identify good data. Given the sheer amount of data available to data scientists today, it can be challenging to comb through every dataset, but doing so is essential for building useful and reliable models. Stay tuned for the next article to find out how to find good data quickly and easily!