Undergraduates who decide to take part in CSML’s programs are joining a storied tradition of innovation. Princeton University has played host to important developments in data science over the years, while also serving as fertile ground for exciting interdisciplinary research. On campus, students have had opportunities to apply data science to a wide range of topics, from the fundamentals of machine learning to social science and astronomy, to name just a few.
Many students who have participated in CSML courses and programs have obtained positions as data scientists at prominent organizations after graduation, or pursued graduate degrees that heavily involve statistics and machine learning. Other students have found that including data science as part of their general education has enriched their understanding of the major issues shaping the world and enhanced their future career prospects.
For more information on data science and CSML’s variety of activities for undergraduates, including its signature certificate program, please see below and browse the links on the left.
What is Data Science?
Data science is the study of revealing hidden information in data. Interesting datasets usually contain hidden patterns and regularities. These patterns represent actionable information. That’s what makes data interesting and valuable. To extract this information, data science employs models and computational algorithms to identify patterns among data variables. These patterns can be used to provide explanations for the data, compress the data, estimate the values of missing variables, quantify the confidence in such estimates, and draw conclusions from the data in a justified way.
Examples of data analysis problems include analyzing large quantities of text and images, modeling cellular-biological processes, pricing financial assets, evaluating the efficacy of public policy programs, and forecasting election outcomes. By its nature, the field of data science is interdisciplinary, merging contributions from a variety of disciplines to address numerous applied problems. In addition to its increasing importance in numerous application domains, the field of data science comes with its own challenges, such as the development of innovative methods and algorithms for drawing reliable conclusions from high-dimensional and heterogeneous data.
What are the Basic Components of Data Science?
Programming allows you to execute algorithms that take data as input and produce results that reveal information in the data. There are a number of easy-to-use programming languages that allow you to do this without first having to become an expert programmer. These include the popular languages “R” and “Python.”
Probability allows you to model and reason about uncertainty. Since data is generally noisy, it has inherent uncertainty. Nontrivial computations on this data yield results that also have inherent uncertainty. Probability helps quantify this uncertainty, and this gives a measure of confidence for conclusions drawn from the results.
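As a rough illustration of quantifying uncertainty, the sketch below estimates a quantity from noisy measurements and attaches an approximate 95% interval to the estimate. The measurements are hypothetical values invented for the example, and the interval relies on a normal approximation (the multiplier 1.96), which is only a rough guide for a sample this small.

```python
import math
import statistics


def mean_with_interval(samples, z=1.96):
    """Return (mean, low, high): the sample mean and an approximate interval.

    Assumes a normal approximation; z=1.96 corresponds to roughly 95% coverage.
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    # Standard error of the mean: sample standard deviation divided by sqrt(n).
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem


# Hypothetical noisy measurements of the same underlying quantity.
measurements = [9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2]

mean, low, high = mean_with_interval(measurements)
print(f"estimate = {mean:.2f}, approx. 95% interval = [{low:.2f}, {high:.2f}]")
```

A narrower interval signals more confidence in the estimate. Collecting more measurements shrinks the standard error, and with it the interval.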
Machine Learning provides a framework and numerous algorithms for learning patterns in data. For example, machine learning can use existing data to learn how to make predictions of one or more data variables, given the values of the other variables. An interesting special case is predicting a single categorical variable for each new unit of data (e.g., determining whether or not an email is spam).
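To make the spam example concrete, here is a minimal sketch of one possible approach: a naive Bayes classifier with add-one smoothing, written with only the Python standard library. The four training messages, the labels "spam" and "ham", and the function names `train` and `predict` are all assumptions made up for illustration. A real classifier would need far more data and careful evaluation.

```python
import math
from collections import Counter


def train(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score for the text."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(label_counts.values())

    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior: the fraction of training messages carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled messages, invented for this example.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the project team", "ham"),
]

model = train(training)
print(predict("claim your free prize", *model))        # prints "spam"
print(predict("agenda for the team meeting", *model))  # prints "ham"
```

The classifier learns which words are more common under each label, then scores a new message under both labels and picks the more likely one. This is the "single categorical variable" case described above.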
Statistics provides a framework for designing experiments to collect data, for removing the effects of confounding variables, and for modeling the results of computations on the data using partially unknown probabilistic models. This allows one to quantify uncertainty in the predictions made using the model.
Domain Knowledge provides insights into known constraints or relationships among variables in the data. This can guide the design of data collection and help select the most appropriate methods to subsequently process the data.