Undergraduates

The Center supports a variety of activities for undergraduates. Please see below and browse the links on the left.

What is Data Science?

Data science is the study of revealing information hidden in data. Interesting datasets usually contain hidden patterns and regularities. These patterns represent actionable information. That’s what makes data interesting and valuable. To extract this information, data science employs models and computational algorithms to identify patterns among the data variables. These patterns can be used to provide explanations for the data, compress the data, estimate the values of missing variables, quantify the confidence in such estimates, and draw conclusions from the data in a justified way.

Examples of data analysis problems include analyzing large quantities of text and images, modeling cell-biological processes, pricing financial assets, evaluating the efficacy of public policy programs, and forecasting election outcomes.  By its nature, the field of data science is interdisciplinary, merging contributions from a variety of disciplines to address numerous applied problems. In addition to its increasing importance in numerous application domains, the field of data science comes with its own challenges, such as the development of innovative methods and algorithms for drawing reliable conclusions from high-dimensional and heterogeneous data.  

What are the Basic Components of Data Science?

Programming allows you to execute algorithms that take data as input and produce results that reveal information in the data. There are a number of easy to use packages that allow you to do this without having to first become an expert programmer. These include the popular programing packages “R” and “Python”.

Probability allows you to model and reason about uncertainty.  Since data is generally noisy, it has inherent uncertainty. Nontrivial computations on this data yield results that also have inherent uncertainty. Probability helps quantify this uncertainty, and this gives a measure of confidence for conclusions drawn from the results.

Machine Learning provides a framework and numerous algorithms for learning patterns in data. For example, machine learning can use existing data to learn how to make predictions of one or more data variables, given the values of the other variables. An interesting special case is predicting a single categorical variable for each new unit of data  (e.g. determining what is email spam or not.)

Statistics provides a framework for designing experiments to collect data, for removing the affects of confounding variables, and for modeling the results of computations on the data using partially unknown probabilistic models. This allows one to quantifying uncertainty in the predictions made using the model.

Domain Knowledge provides insights into known constraints or relationships among the variables in the data. This can guide the design of data collection and help select the most appropriate methods to subsequently process the data.