Having left my previous job in June this year, I began job hunting, and before I knew it, it had turned into a four-month journey. During those four months, every day was the same mundane routine: logging into LinkedIn, scrolling through job vacancy websites (Jobstreet, MyCareersFuture, etc.), and waiting for my phone to ring, hoping it was a recruiter or company telling me I had been shortlisted for a job. …
Decision Trees are one of the most common Machine Learning algorithms in the world of data science because they are easy to implement and understand, even if you have limited knowledge of how Machine Learning works. Random Forests extend the Decision Tree algorithm by growing multiple trees and taking the most common prediction (or the average value) across them as the final result. Both can be used for classification, where the data is categorized into distinct classes. This article introduces both algorithms in detail and implements them in PySpark.
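As a rough illustration of the "most common or average value" idea, here is a minimal sketch in plain Python. The function names are hypothetical, and the per-tree predictions are made-up inputs rather than the output of any real model or of the PySpark implementation the article covers.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_predictions):
    """Return the class label predicted by the most trees (majority vote)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the numeric predictions made by the trees."""
    return mean(tree_predictions)


# Hypothetical predictions from three trees for a single input row.
label = aggregate_classification(["spam", "ham", "spam"])
value = aggregate_regression([2.0, 4.0, 6.0])
```

Majority voting fits categorical targets, while averaging only makes sense when each tree predicts a number.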
A decision tree classifier, as the name suggests, makes decisions using a tree-based model. The algorithm considers all the data features, chooses the one that gives the best split (for example, the largest reduction in impurity), performs a binary split, and repeats recursively until the data in every leaf is fully separated (or the maximum depth is reached). …
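To make the splitting step more concrete, here is a minimal sketch of how a single binary split might be chosen. It assumes Gini impurity as the split criterion, which the excerpt does not specify, and the helper names `gini` and `best_split` and the toy data are purely illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Try every feature and threshold; return (feature, threshold, impurity)
    for the binary split with the lowest weighted Gini impurity."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Toy data (made up): two numeric features and two classes.
rows = [[2.0, 1.0], [3.0, 1.0], [10.0, 0.0], [11.0, 0.0]]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)
```

A full tree would apply `best_split` recursively to the left and right subsets until each leaf is pure or a maximum depth is reached.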
In one of my previous posts, I designed a simple Morse Code Decoder in Python that accepts user input and outputs it in its original alphanumeric form. One limitation of the decoder is that it does not allow the user to input sentences. Remember that we chose to represent each letter or number using a series of ‘0’s and ‘1’s, where ‘0’ represents a dot and ‘1’ represents a dash. Each letter or number is then separated by a ‘*’ in our decoder, as shown in the screenshot below.
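The encoding described above can be sketched roughly as follows. This is not the decoder from the earlier post: the names `MORSE`, `DECODE` and `decode` are hypothetical, and the sketch handles a single word only, since the excerpt does not say how sentences are entered.

```python
# International Morse code for letters and digits, in dot/dash form.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}

# Re-key the table by the 0/1 form: '0' stands for a dot, '1' for a dash.
DECODE = {
    code.replace(".", "0").replace("-", "1"): char
    for char, code in MORSE.items()
}


def decode(message):
    """Decode a '*'-separated string of 0/1 symbols into letters and digits.
    Unknown symbols are shown as '?' (an assumption made for this sketch)."""
    symbols = [s for s in message.split("*") if s]
    return "".join(DECODE.get(s, "?") for s in symbols)


word = decode("0000*0*0100*0100*111")
```

Dropping empty pieces after the split means the sketch accepts input whether or not the final symbol is followed by a trailing ‘*’.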
Open data sources are one of the best gifts for data scientists and analysts, as they allow them to draw valuable insights for free without having to worry about data licenses. Twitter is one of the most popular social media applications in the world, as it’s free and allows users to tweet on any topic that comes to mind. This article will focus on how we can use Twitter through R programming to extract valuable insights and communicate these findings to the relevant stakeholders using Tableau.
“How might we help the communication practitioners to get actionable insights from Twitter so that they can create more effective communication that caters to the needs & concerns of general…
Upon graduating with my master’s in Australia, I managed to land a job at a big tech firm as a data analyst. The remuneration package was awesome and the office perks were nothing short of amazing. I got to wake up later than the typical office starting hour of 9 am, and was able to commute to and from work outside peak hours. Best of all, I was working on something I was passionate about: data analytics. Life’s good, isn’t it? Nope, things started to turn bad a few weeks into the new job.
I began to feel constantly tired and stressed about work. Panic attacks, difficulty breathing, and sleepless nights soon followed. Work was the only thing on my mind after office hours and even on weekends. I started to beat myself up over every little mistake, and I could find neither the patience to talk to my loved ones nor the joy to pursue my hobbies. And perhaps the most significant sign of all: thoughts of suicide. …
In my previous post, we explored graphical approaches to Exploratory Data Analysis (EDA) in Python. We discussed how line charts, regression lines and fancy motion charts could be used to gather insights from data on Population, Income and Gender Equality in Education. That dataset, however, was relatively small, at about 200 rows. In this post, we will explore a bigger dataset of around 12 million records and look into other ways to perform EDA in Python.
The dataset we will be using contains data on health and dental plans offered to individuals and small businesses through the US Health Insurance Marketplace. It was originally prepared and released by the Centers for Medicare & Medicaid Services (CMS) and was subsequently published on Kaggle. …
Exploratory Data Analysis (EDA) is one of the most important aspects of every data science or data analysis problem. It gives us a greater understanding of our data and can uncover hidden insights that aren’t obvious to us. The first article I wrote on Medium was also on performing EDA in R; you can check it out here. This post will focus more on graphical EDA in Python using matplotlib, regression lines and even a motion chart!
The dataset we are using for this article can be obtained from Gapminder; we will drill down into Population, Gender Equality in Education and Income. …
Tableau and R are two common data visualisation tools: the former is known for its simple, beginner-friendly functions, and the latter for its extensive user interaction possibilities. How do we decide which visualisation tool is easier to implement or more effective in conveying the key insights to the relevant stakeholders? This article will look into this and hopefully arrive at a consensus.
The dataset we will be using contains coral bleaching percentages in the Great Barrier Reef from 2010 to 2017. There are a total of 5 different coral types, namely Blue Corals, Hard Corals, Sea Fans, Sea Pens and Soft Corals. …
Having attended numerous data scientist job interviews, I was asked this particular question 75% of the time:
Can you tell me the key differences between Linear Regression and Logistic Regression?
To be honest, you can easily google the answer to this question, as it’s really common in the world of data science. Still, I thought I should try writing a post to discuss the differences and list them in order of importance, so that you can just quote the few most important ones. We all know how stressful it is to prepare for a job interview, let alone remember all the concepts and theories. …
Morse code is a method used in telecommunication where each letter, number and punctuation mark is represented by a series of dots, dashes and spaces. It was invented by Samuel Morse in the 1830s and has been heavily used in the navy. This article will describe the process of building a simple Morse Code decoder in Python.
As seen in the image above, each letter and number is represented by a series of dots and dashes. …