Data preparation is the first and most important step in solving any analytics problem. The entire exercise of predictive modeling or visualization reaps beneficial results only if this step is completed meticulously and carefully.
Given the volume of data generated around us, data preparation is indispensable. Any business entity generates huge amounts of data through its interactions, and this data is often enriched with other data sources before use. It must be ensured that the data complies with the requirements of the modeling algorithm. The data preparation process is an important step that converts raw data from multiple sources into refined information assets which can be used for accurate analysis and valuable business insights.
This step in the modeling journey takes almost 70% of the total analysis time. The major impediments in the process are business rules, unclean and counter-intuitive data, and multiple data sources.
The specific nomenclature of the data preparation stage varies by industry and company, but the process framework remains the same.
The complete process of analysis and insight generation starts with finding the right data. One reliable but time-consuming way of collecting relevant data is primary data collection: surveys, interviews, focus group discussions, and so on. Analysts can also make use of existing data that accumulates within the company through real-time business transactions. Major firms generally employ sophisticated data warehouses to store these data points on a daily basis. This data holds sensitive information about the business and is therefore confidential to that business entity.
This is an integral part of any machine learning project. The purpose of this step is to understand the tendencies of the data and to formulate the assumptions and hypotheses of our analysis. Data discovery can be done using summary statistics and data visualizations. While there are many tools that can help, Python stands out due to its vast set of libraries and freely available UDFs (User Defined Functions). Two functions, describe( ) and profile_report( ), are commonly used for such quick data analysis.
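As a minimal sketch of this kind of quick data discovery, the snippet below runs pandas' built-in describe( ) on a small made-up frame (the values are illustrative only). Note that profile_report( ) is not part of pandas itself; it is attached to DataFrames by the ydata-profiling package (formerly pandas-profiling).

```python
import pandas as pd

# Small sample frame standing in for real business data (values are made up)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [40000, 52000, 61000, 58000, 70000],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = df.describe()
print(summary)

# profile_report() comes from the ydata-profiling package, not pandas itself:
#   from ydata_profiling import ProfileReport
#   ProfileReport(df).to_file("report.html")
```

For a first pass, describe( ) alone is often enough to spot implausible minima, maxima, or spreads before any modeling begins.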
Quite often, the data itself has multiple problems, such as missing data, outliers, and incorrect values. Fixing these problems can be time-consuming but is critical to building good predictive models. Data cleaning may be required if the incumbent data has -
Inconsistencies or anomalies - Every data table or repository is created for a specific purpose and therefore follows a certain format. When joining data from different sources, several inconsistencies can arise. It is important to clean categorical and continuous data of such irregularities before feeding it to any model.
Missing values - These are a common occurrence in any real-time data and can arise for many reasons: unavailability of data, an event that did not take place, erroneous data entry, a value mistakenly deleted, and so on. Whatever the reason for the omission, the strategy for missing value imputation should be well thought out.
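One common imputation strategy is to fill gaps in a numeric column with its median, which is robust to extreme values. The sketch below uses pandas on a made-up column; it is one of several reasonable strategies, not the only one.

```python
import pandas as pd

# Made-up column with two missing entries
df = pd.DataFrame({"income": [40000, None, 61000, None, 70000]})

# Median imputation: the median of the observed values replaces each gap
median_income = df["income"].median()
df["income"] = df["income"].fillna(median_income)
```

Mean imputation, forward-fill for time series, or model-based imputation may suit other situations better; the right choice depends on why the values are missing.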
Outliers - An outlier is a data point that differs significantly from other observations. Outliers can be the result of poor data collection, or they can be genuinely anomalous but valid data. These are two different scenarios and must be approached accordingly. Several techniques exist for dealing with outliers, but the decision whether or not to remove them is task-dependent.
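One widely used detection rule, sketched below on made-up numbers, flags points falling outside 1.5 times the interquartile range (IQR) from the quartiles. This identifies candidates for review; as noted above, whether to remove them remains a task-dependent decision.

```python
import pandas as pd

# Made-up sample where 120 is a suspect point
s = pd.Series([12, 14, 15, 13, 16, 14, 120])

# Tukey's IQR rule: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
```

Alternatives such as z-scores or model-based anomaly detection exist; the IQR rule is simply a common, distribution-light starting point.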
Imbalanced dataset - Data imbalance means an unequal distribution of classes within a dataset. This is relevant in the context of supervised machine learning involving two or more classes. Consider a dataset of two classes in which one class accounts for 99.7% of instances and the other for only 0.3%. Such examples are highly prevalent in credit card fraud detection, rare disease diagnostic models, and similar settings. While such a dataset is difficult to use directly, there are well-defined techniques to handle the imbalance.
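One of the simplest such techniques is random oversampling of the minority class, sketched below on a made-up fraud-style frame. Libraries such as imbalanced-learn offer more sophisticated options (SMOTE, for instance), and many model classes also accept class weights instead.

```python
import pandas as pd

# Toy imbalanced frame: 8 "legit" rows (label 0) vs 2 "fraud" rows (label 1)
df = pd.DataFrame({
    "amount": [10, 12, 11, 9, 13, 10, 12, 11, 500, 480],
    "label":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random oversampling: replicate minority rows (with replacement)
# until the two classes are the same size
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])
```

Oversampling should be applied only to the training split, never before the train/test split, or the evaluation set will contain duplicated minority rows.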
Transforming data is one of the most important aspects of data preparation and requires a good understanding of statistics and modeling. While the need for missing value imputation or outlier treatment is mostly evident, the need for data transformation is less obvious. Transforms are usually applied so that the data meets the assumptions of the statistical inference procedure being applied, to re-code variables from one format to another, or to aid visual interpretation. Depending on the situation, techniques ranging from log transformation to one-hot encoding are applied.
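The two techniques named above can be sketched in a few lines of pandas on a made-up frame: a log transform compresses a right-skewed numeric scale, and one-hot encoding re-codes a categorical column into indicator variables.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one categorical column
df = pd.DataFrame({
    "income": [40000, 52000, 61000, 58000, 70000],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
})

# Log-transform the skewed numeric scale (log1p also handles zeros safely)
df["log_income"] = np.log1p(df["income"])

# One-hot encode the categorical column into indicator variables
df = pd.get_dummies(df, columns=["city"])
```

Which transform is appropriate depends on the downstream procedure: a linear model may need the log scale to satisfy its assumptions, while tree-based models are largely indifferent to monotone transforms.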
After all the above steps have been satisfactorily completed, the data is stored back into the warehouse for later consumption. It can be enriched by adding or connecting other related information to provide deeper insights. In some cases, the prepared data is channeled into a third-party application - R, Tableau, Power BI, or any such business intelligence tool - for further processing or analysis.