Data Science - An Introduction
Referred Link - https://devopedia.org/data-science
SUMMARY
Data is no longer scarce. In fact, businesses have an abundance of data and it is growing. This has given rise to the term Big Data. Data science enables businesses to discover valuable insights from data and apply them profitably. Data science is therefore complementary to Big Data.
Historically, statisticians had a mathematical focus. They evolved into data analysts who applied their expertise to solving business problems. They did this by visualizing data and searching for patterns. When dealing with vast amounts of data, there was a need to apply Machine Learning algorithms and programming. This is where a data scientist comes in.
A data scientist is really a first-class scientist who's curious, asks questions and makes hypotheses that can be tested with data.
DISCUSSION
What exactly is the definition of the term "Data Science"? - One possible definition is that "data science is a multifaceted discipline, which encompasses machine learning and other analytic processes, statistics and related branches of mathematics, increasingly borrows from high performance scientific computing, all in order to ultimately extract insight from data and use this new-found information to tell stories."
What's a typical Data Science process? - In science, one starts with a hypothesis, conducts experiments and makes observations to either prove or disprove the hypothesis. In Data Science, the scientific process is similar except that the use of data and algorithms becomes central to the process.
The process starts with an interesting question, often aligned to business goals. Available data is then cleaned and filtered. This may also involve collecting new data relevant to the question. Data is analyzed to discover patterns and outliers. A model is built and validated, often using machine learning algorithms. The model is often refined in an iterative manner. The final step is to communicate the results. The results may inspire the data scientist to ask and investigate further questions.
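As a rough illustration of this process, the sketch below walks through the steps on a tiny example. Everything in it is an assumption made for illustration: the question (can sales be predicted from advertising spend?), the numbers in `raw`, the helper `fit_line`, and the choice of a straight-line model validated by mean absolute error. A real project would use far more data and would likely iterate over several models.

```python
from statistics import mean

# Hypothetical records of (advertising spend, sales); None marks a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2),
       (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.2)]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# 1. Question (assumed): can sales be predicted from advertising spend?

# 2. Clean and filter: drop records with missing values.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# 3. Keep the last two records aside so the model can be validated on unseen data.
train, holdout = clean[:-2], clean[-2:]

# 4. Build a model from the training records.
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# 5. Validate: mean absolute error on the held-out records.
mae = mean(abs(slope * x + intercept - y) for x, y in holdout)

# 6. Communicate the result.
print(f"sales = {slope:.2f} * spend + {intercept:.2f}, held-out MAE = {mae:.2f}")
```

If the held-out error were large, the next iteration might revisit the data cleaning or try a different model, which mirrors the iterative refinement described above.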
What's a typical Data Science pipeline? - The data science pipeline may be treated as that part of the data science process that deals specifically with data. It starts with the gathering of raw data, processing it, analyzing it via algorithms and finally visualizing the results. Thus, the pipeline basically transforms data into useful insights.
An important aspect of this pipeline is data engineering. It can be broken down into three steps (not standardized):
- Data Wrangling: Raw data is cast to a form suitable for analysis. This could involve combining multiple datasets, removing inconsistencies, converting datasets to a common format, etc.
- Data Cleansing: Real-world data is messy with missing values, bad delimiters or inconsistent records. Cleansing checks data for syntactic and semantic correctness and repairs it where possible. Data points could be dropped if they cannot be repaired.
- Data Preparation: This makes the data suitable as input to algorithms. This may involve range normalization, conversion of categorical data to numerical values, etc.
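The three data engineering steps above could be sketched roughly as follows. This is a minimal illustration, not a standard implementation: the two temperature sources, their field names, the plausible-temperature range, and the function names `wrangle`, `cleanse` and `prepare` are all hypothetical.

```python
# Hypothetical raw data from two sources that use different formats and units.
source_a = [{"city": "Paris", "temp_c": "21.5"},
            {"city": "Oslo", "temp_c": ""},
            {"city": " rome ", "temp_c": "30.0"}]
source_b = [{"City": "paris", "TempF": 68.0},
            {"City": "Oslo", "TempF": 50.0}]


def wrangle(a_rows, b_rows):
    """Data wrangling: combine both sources into one common record format."""
    combined = []
    for row in a_rows:
        temp = float(row["temp_c"]) if row["temp_c"] else None
        combined.append({"city": row["city"], "temp_c": temp})
    for row in b_rows:
        # Convert Fahrenheit to Celsius so both sources share a unit.
        combined.append({"city": row["City"],
                         "temp_c": (row["TempF"] - 32) * 5 / 9})
    return combined


def cleanse(records):
    """Data cleansing: repair inconsistent names, drop unrepairable records."""
    cleaned = []
    for rec in records:
        temp = rec["temp_c"]
        if temp is None or not -90 <= temp <= 60:
            continue  # missing or implausible value: cannot be repaired
        cleaned.append({"city": rec["city"].strip().title(), "temp_c": temp})
    return cleaned


def prepare(records):
    """Data preparation: min-max normalize temperature, one-hot encode city."""
    temps = [r["temp_c"] for r in records]
    lo, hi = min(temps), max(temps)
    span = (hi - lo) or 1.0  # avoid division by zero if all values are equal
    cities = sorted({r["city"] for r in records})
    rows = []
    for r in records:
        one_hot = [1.0 if r["city"] == c else 0.0 for c in cities]
        rows.append([(r["temp_c"] - lo) / span] + one_hot)
    return cities, rows


cleaned = cleanse(wrangle(source_a, source_b))
cities, rows = prepare(cleaned)
print(cities)
for row in rows:
    print(row)
```

Each output row holds the normalized temperature followed by one column per city, which is the kind of purely numerical input most algorithms expect.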
Could you give some examples of questions that data science answers? - Here are a few examples: Will this tire fail in the next 1000 miles? Is this bank transaction normal? What will be my sales for the next quarter? What sort of customers are not coming back to my store? Which printer models are failing the same way? As a self-driving car, should I now slow down or brake completely?
In the real world, data science has been successfully used by LinkedIn to increase growth in user connections. Google uses it in a number of products. GE uses it to optimize service contracts. Netflix uses it to improve movie recommendation. Kaplan uses it to uncover effective learning strategies. All these are a result of data scientists asking the right questions.
How is a data scientist different from a data analyst/architect/engineer? - A data scientist is multidisciplinary in terms of skills and expertise. She may embody some of the other related roles:
- Data Analyst: Collects relevant data, visualizes data with various tools and tries to find patterns and insights. Knows basic statistics. Has business/domain knowledge. Probably doesn't deal with big data.
- Data Architect: Architects a system to manage big data. Often this role is embodied within a data engineer since tools and technologies overlap. This role becomes redundant with MLaaS (Machine Learning as a Service).
- Data Engineer: Develops and manages infrastructure that deals with big data. Well versed with tools such as Hadoop, NoSQL and MapReduce. Sets up data pipelines.
Should a data scientist start with a problem statement or explore available data? - If you're new to data science, without much domain knowledge, defining a problem statement can be difficult. In such a case, you could start with exploratory analysis. This can then guide you towards asking the right questions.
Given enough data, exploration is likely to yield patterns and correlations. These could even occur due to measurement errors or data processing artifacts. But are these findings relevant? Asking the right questions and defining a problem statement will give better focus. Some even claim that lack of a problem definition could lead to disaster because you don't know what you're looking for.
What skills must a data scientist have? - Data scientists are required to be multidisciplinary with knowledge and expertise in statistics, programming, application domain, analytics and communication. However, unicorns who have all of these are rare. Often specializing in a couple of areas with exposure to others is desirable.
Companies don't rely on a single all-knowing data scientist; they form a data science team. A data science team may include the Chief Data Officer, business analyst, data analyst, data scientist, data architect, data engineer and application engineer. Some of these may be combined in a single person. For example, a single person may fulfil the roles of data architect and data engineer.
Should a data scientist learn cloud computing? - Data science workflows typically happen on a local computer. There are however scenarios where cloud computing makes sense. The dataset could be too large for local memory; or local computational capability is insufficient for the problem; or the workflow's output feeds into a larger production environment.
As a beginner, what should be my learning path to become a data scientist? - One approach is to be practical and hands-on from the outset. Pick a topic that you're passionate and curious about. Research available datasets. Tweet and discuss so that you get clarity. Start coding. Explore. Analyze. Build data pipelines for large datasets. Communicate your results. Repeat this with other datasets and build a public portfolio. Along the way, pick up all the skills you need.
You may instead prefer a more formal approach. You can learn the basics of languages such as R and Python. Follow this with additional packages/libraries particular to data science: (R) dplyr, ggplot2; (Python) NumPy, Pandas, matplotlib. Get introduced to statistics. From this foundation, start your journey into Machine Learning. To relate these to business goals, some recommend the book Data Science for Business by Provost and Fawcett. But you should put all this knowledge into practice by taking up projects with datasets and problems that interest you.
Could you give some tips for a budding data scientist? - The following tips might help:
- Data science is about answering scientific questions with the help of data. Don't focus just on the aspect of handling data, dataset size or the tools.
- Understand the business, its products, customers and strategies. This will help you ask the right questions. Have constant interaction with business counterparts. Communicate with them in a language they can understand.
- Consider alternative approaches before selecting one that suits the problem. Likewise, select a suitable metric. Sometimes derived metrics may yield better prediction compared to available metrics.
- Understand the pros and cons of various ML algorithms before selecting one for your problem.
- Find a compromise between speed and perfection. On-time delivery should be preferred over extreme accuracy.
- Useful data is more important than lots of data. Use multiple data sources to better understand data and its discrepancies.
- Be connected with the data science community, be it via blogs, meetups, conferences or hackathons.
- Practice with open datasets. Learn from the solutions of others.