Data Analysis Process

 What are the main steps when performing Data Analysis?

    From my perspective, there are five steps when conducting data analysis.


Collection

    Whether you work for a corporation or explore data for personal use, there are several ways to gather your data, and how you gather it matters as much as what you gather. Data can come from databases, websites, computer files, or several different nodes/sources.

Additionally, there are many methods for capturing data, such as surveys, interviews and focus groups, observations, online and transactional tracking, forms, and social media. When gathering your data, it's important to ensure you have collected enough variables if you wish to predict outcomes using statistical methods or machine learning algorithms.
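
As a rough illustration of pulling data from more than one source, here is a minimal sketch that combines a CSV export with a small database table. Everything in it is made up for the example: the survey CSV, the orders table, and the column names are assumptions, and an in-memory SQLite database stands in for a real one.

```python
import csv
import io
import sqlite3

# Source 1 (hypothetical): survey results exported as CSV text.
CSV_TEXT = "customer_id,survey_score\n1,4\n2,5\n"

# Source 2 (hypothetical): transactional data in a database.
# An in-memory SQLite database stands in for a real one here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 15.5), (2, 42.0)],
)

# Total spent per customer, as a {customer_id: total} dictionary.
totals = dict(
    conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
)
conn.close()

# Combine both sources into one record per customer.
combined = []
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    customer_id = int(row["customer_id"])
    combined.append({
        "customer_id": customer_id,
        "survey_score": int(row["survey_score"]),
        "total_spent": totals.get(customer_id, 0.0),
    })

for record in combined:
    print(record)
```

Each record now carries one variable from each source, which is the kind of widening you want before trying to predict anything.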

Preparation

    Preparing our data is a vital step that needs to be done correctly before we even begin to explore or model it. This means removing any unwanted variables or values that just don't make sense in our dataset. Some examples of this are:

  • NA values: These are missing values, entries with no answer recorded. Even a small percentage of missing data can skew an analysis and lead to wrong assumptions/interpretations.

  • Wrong Format: Knowing the size and shape of your dataset is important, since certain statistical algorithms expect data in a particular layout. Restructuring columns and rows into a consistent order helps with analysis and visualization.
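
To make the two cases above concrete, here is a minimal sketch of a cleaning pass. The data, the column names, and the choice to drop incomplete rows outright are all assumptions for the example; in practice you might fill in missing values instead of discarding them.

```python
import csv
import io

# Hypothetical raw data: "NA" and empty strings mark missing answers.
RAW_CSV = """employee,hours,rate
Ana,40,25.0
Ben,NA,30.0
Cal,35,
Dee,38,22.5
"""

MISSING = {"", "NA"}


def clean_rows(text):
    """Drop rows with missing values and convert numeric columns to floats."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        if any(value.strip() in MISSING for value in row.values()):
            continue  # skip incomplete records
        cleaned.append({
            "employee": row["employee"],
            "hours": float(row["hours"]),  # text -> number
            "rate": float(row["rate"]),
        })
    return cleaned


rows = clean_rows(RAW_CSV)
print(len(rows), "complete rows kept out of 4")
```

Dropping rows is the simplest option, but it throws information away, so it is worth checking how many rows you lose before committing to it.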
    








Exploration

    Once we have cleaned our data, it is time to explore the dataset and look for the information we need. A corporation can usually determine profits, losses, employee hours and rates, etc. without much effort. It is more useful to see future projections of these values, find sectors that are hurting growth, or identify hidden information. In this step it is natural to use basic to advanced statistical methods when exploring and "getting to know" our data. Ex: finding the min/max of a column, or performing hypothesis tests to check for evidence of relationships (or the lack of them).
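
Here is a minimal sketch of both kinds of exploration: simple summaries (min/max) and a hypothesis test. The sales figures and region names are invented for the example, and the test shown is an exact permutation test on the difference of means, which is just one of many tests you could pick.

```python
from itertools import combinations
from statistics import mean

# Hypothetical weekly sales (in thousands) for two regions.
north = [12.1, 13.4, 11.8, 14.0, 13.1]
south = [10.2, 11.0, 10.9, 11.5, 10.4]

print("north min/max:", min(north), max(north))
print("south min/max:", min(south), max(south))


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of group means.

    Tries every way of splitting the pooled values into groups of the
    original sizes and counts how often the difference in means is at
    least as large as the one actually observed.
    """
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    total = sum(pooled)
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        mean_a = sum_a / len(a)
        mean_b = (total - sum_a) / len(b)
        # Small tolerance guards against floating-point round-off.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


p = permutation_p_value(north, south)
print("p-value:", round(p, 4))
```

A small p-value suggests the gap between the two regions is unlikely to be chance alone. It is evidence for a difference rather than proof of one.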







Modeling

    Next, we dive deep into building best-fit models that describe our data, using machine learning algorithms to make future predictions and to give a graphical representation of our hypothesis tests (if created).
There are several algorithms that can be used for modeling, with both supervised and unsupervised learning (labeled and unlabeled data). Which algorithm to use depends on what kind of data you have and what outcome you need. (Algorithm examples: KNN, Random Forest, Linear Regression, etc.)
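
As one small example of a supervised model, here is a sketch of simple linear regression fitted by ordinary least squares. The ad spend and revenue numbers are made up, and the `fit_line` helper is a hypothetical name written out by hand so the arithmetic stays visible; in practice you would more likely reach for a library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: ad spend vs. revenue, both in thousands.
ad_spend = [1, 2, 3, 4, 5]
revenue = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept = fit_line(ad_spend, revenue)

# Use the fitted line to predict revenue for a spend level not yet seen.
predicted = slope * 6 + intercept
print(f"revenue = {slope:.2f} * spend + {intercept:.2f}")
print(f"predicted revenue at spend 6: {predicted:.2f}")
```

A straight line is only a good fit when the relationship really is roughly linear, which is something the exploration step should have already hinted at.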









Interpretation 

    Our final step is to interpret our results. Here we seek to derive meaning from our results and confirm whether they are in line with what was expected. This can be a difficult step, especially at places like hedge funds or scientific institutions, where it's necessary to have a technical writer on staff to help explain results to managers who usually aren't technical.




