Statistics Collaboratory


Selected Projects

Longitudinal data analysis of microorganism populations using GEE models
Bradley-Terry model analysis to determine which advertising outlets for tutoring services are the most appealing
Logistic regression analysis for determining if a loan request lead progresses to the application stage
Predictive models for undergraduate freshman enrollment
Development of tolerance intervals for heteroscedastic regression models
Construction of confidence intervals for system reliability based on component-level data
Development of materials for a regression workshop
Development of probe hybridization classifiers that allow regions of neutrality when otherwise clear-cut evidence for a classification label fails to exist
Predictive models for plant characteristics as a function of microorganism population counts
Evaluation of the effectiveness of three curricula compared to the Making Every Dollar Count curriculum
Design of sequential sampling plans for pest management applications that incorporate spatial correlations in insect counts
Longitudinal data analysis of recurrent wasp activities using a marginal Cox model
Sample size determination for right censored survival analysis model with application to caloric diet studies with mice
Spatial analysis of water quality with applications to potential use of wastewater for golf course irrigation
Survey analysis of teens’ perceptions regarding money management

Tests for Heteroscedasticity in an Equipment Aging Model

A heteroscedastic linear regression model is developed from plausible assumptions that describe an equipment aging process. A test for the significance of the heteroscedasticity is derived and a simulation study is used to evaluate the power of the test and compare it with several other applicable tests that were designed under different contexts. Tolerance intervals within the context of the heteroscedastic model are derived, thus generalizing well-known tolerance intervals for ordinary least squares regression. Use of the model and its associated analyses is illustrated with an application where hundreds of electronic components are continuously monitored by an automated system that flags components that are suspected of accelerated aging.

Nonparametric Change-Point Algorithm with Applications to Network Surveillance

A change-point detection algorithm is developed for applications to data network monitoring where various and numerous performance and reliability metrics are available to aid with early identification of realized or impending failures. Challenges addressed include: 1) the need for a nonparametric technique so that a wide variety of metrics (including discrete metrics) can be included in the monitoring process, 2) the need to handle time-varying distributions for the metrics that reflect natural cycles in workload and traffic patterns, and 3) the need to be computationally efficient with the massive amounts of data that are available for processing.

Longitudinal Data Analysis with Alternative Covariance Structures

A control versus treatment experiment with repeated measures on mice was analyzed with a mixed model using different covariance structures. AIC was used to select the best-fitting structure and the treatment effect was estimated using estimated generalized least squares. The effect of time was characterized by a low-degree polynomial.

Cost-Benefit study for a government program to help families move out of poverty

The local government has a program to help households leave poverty permanently. The Collaboratory is helping to enhance an existing cost-benefit study and to improve the accuracy of the cost and benefit estimates by incorporating probability models. The probability that a family graduates from the program will be modeled from its demographic data using logistic regression. The costs and benefits of the program can then be estimated by combining the model with the program’s operating expenses and the expected savings per graduating family.

Fairway nitrogen leaching study

A set of experiments was conducted at a fairway lysimeter facility to measure the amount of nitrogen leached via groundwater. The experiments involved soil-type, irrigation-amount and fertility treatments, and were run in both the warm and cool seasons. The Collaboratory helped run ANOVAs to test the effects of soil type, irrigation amount, fertility and seasonality, as well as their interactions, on nitrogen leaching.

Analysis of olfactory responses of wasps to different plant species hosting scale egg masses

Wasps are a natural enemy of the ‘scale’, using the scale’s egg masses as a food source for their own eggs. The experimental apparatus was a Y-shaped tube with a plant of a given species hosting a scale egg mass on one side and a plant without egg masses on the other side. Wasps were inserted into the tube and their choice of the left or right arm of the Y was recorded. Tests were conducted of whether a wasp initially chooses the arm connected to a plant with egg masses (i.e., do wasps use chemical volatiles in the host location process?) and of whether that choice varies among plant species. The duration of time the wasps spent with the plant hosting the scale eggs was also compared across plant species.
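As an illustration of the first question (does a wasp prefer the arm with egg masses?), the sketch below runs an exact two-sided binomial test of the null hypothesis that each arm is chosen with probability 0.5. The choice of an exact binomial test and the counts are assumptions made for illustration; the project description does not state which test was used.

```python
from math import comb

def binom_two_sided_p(successes: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count under p0."""
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Small relative tolerance so that ties in probability are included.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

# Hypothetical data: 28 of 40 wasps first chose the arm with egg masses.
p_value = binom_two_sided_p(28, 40)
```

Comparing choice probabilities across plant species would need a further step, for example a test of homogeneity of proportions.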

Distribution of replaced vines in the Weaver vineyard

The vines were planted equally spaced in the Weaver vineyard. Some vines were replaced because of a disease spread by the glassy-winged sharpshooter (GWSS), and the researchers wanted to know the distribution of the replaced vines. Vine replacement was first modeled as a function of the distance of the vine from the entrance using logistic regression, and the effect of the vineyard entrance on vine replacement was excluded. The vineyard was then divided into several sectors, and the sectors with significantly higher infection rates were located, again by logistic regression.

Comparison of survival rates of mosquitoes injected with immune deficient genes

Survival analysis methods were used to determine whether injecting immune deficient genes has an impact on the lifetimes of mosquitoes. A Cox proportional hazards model was built to test whether the lifetime distributions for the treatment groups were different from that of a control group.

Spatial/Temporal Analysis of the 1999-2004 South Gila Depth to Water Table and 1995-2002 Groundwater Discharge Data

ARS Project No. 5310-13610-013-13R
By S.M. Lesch, D.L. Corwin, D.L. Suarez, and A. Chakravartty

The statistical analysis presented in this report indicates that the average groundwater level has exhibited a stable, seasonal cycle over the last five years. The specific average monthly estimates exhibit twice yearly cycles that closely correspond to the seasonal South Gila lettuce and winter cropping patterns. Additionally, the majority of groundwater variation can be classified as spatial in nature and the current network of wells already appears to be sufficient for defining the general spatial pattern across the district. Unfortunately, the current network also appears to be highly inadequate for predicting short term, spatially referenced deviations in the depth to water table pattern. Furthermore, detailed analysis of the groundwater data from each well suggests that a small number of wells may be malfunctioning and/or excessively responding to nearby surface water sources.

Experimental Design help for growth chamber experiments

A set of similar experiments was run in growth chambers, some with replication and some unreplicated. The Collaboratory reviewed the experiments and helped work out the ANOVA design for each type of experiment. The ANOVAs were then coded into SAS for use by the client on the current experiments as well as additional similar experiments.

Market analysis of factors affecting store sales

The Collaboratory conducted a multivariate analysis of demographic and geographic variables for use in predicting sales for a restaurant chain. After identification of a significant regression for predicting sales from selected variables, a discriminant analysis was run to build a predictive equation for sorting potential new store sites into underperforming and overperforming groups.

Hypothesis testing and cluster analysis of neck bone measurements

Discriminant analysis was run to identify measurements on neck bones that could be used to identify race, gender, and age of bodies. Significant results were found for some of the factors, and predictive equations were estimated. The predictive ability of the equations was significantly better than chance assignment to groups; however, some difficulty was encountered in achieving low error rates.

Selecting means comparison tests in ANOVA

In this project the Collaboratory helped a researcher select the correct means comparison test for use in an ANOVA on several treatments. Pre-specified tests as well as post hoc tests were discussed, including Tukey, Fisher, Dunnett and Bonferroni. Differences in output and interpretation between SAS, Minitab, and StatView were also covered to help the client select the appropriate test for their situation.

Running and interpreting results from mixed model ANOVAs

In this project the Collaboratory assisted in running and interpreting mixed model ANOVAs for a vehicle emissions study. After log transformation and outlier analysis, the ANOVAs were run on 12 clean burning fuels. Results indicated differences in emissions between the fuels. A quadratic effect mixed model ANOVA was then run on the data and significant interactions were found between some of the components of the fuels.

Cluster analysis of 9,000 clones of soil bacteria and virus populations from four extraction methods

Department: Plant Pathology
Client: Liz Bent

Individual soil samples can contain thousands of different colonial varieties of bacteria and fungus species. With the development of advanced methods for identifying the specific clones within soil samples, methods were needed to test for differences in populations between samples. In this project, samples were tested for differences in the frequency distributions of the bacteria and fungi. In addition, discriminant analysis was used to identify sets of frequently occurring clones that could be used to identify the four treatments.

Analysis of effects of targeted data collection in on-road emissions model variability

Department: CE-CERT
Client: Matt Barth

Vehicle emissions comprise a significant portion of air pollution emissions in urban areas in the United States. Accurate measurement of mobile source emissions has been recognized as an important part of the fight to clean the air. With the advent of small, accurate emissions monitors that can be installed in vehicles on the road, the traditional data collection designs were found to be inadequate. This study explored the reductions in variability that could be obtained by adding a small data collection protocol to the current EPA test sequence.

Comparison of survivor functions

Department: Entomology
Client: Bob Luck and Carlos Coviella

Survival analysis methods were used to determine whether colored marking dust impacts the lifetimes of pests. The hope is that marking dust does not impact lifetimes, so that it can be used to track the movement of the pests over time. Different groups of pests were marked with different colors of marking dust, and a Cox proportional hazards model was used to determine whether the lifetime distributions for the treatment groups were different from that of a control group.

Multivariate analysis of language evaluation data

Department: Sociology
Client: Begona Echeverria

In the Basque region of Spain there are three language variants spoken: Spanish and two types of Basque. In this project, listeners were asked to evaluate speakers based on their perceived social and personal traits. The analysis of these data focused on testing for differences in perceived status and other traits between the three languages and on a principal components analysis of the personality traits…

Supplemental analysis of thesis data

Department: Argosy College
Client: Steven Lee

Learning-disabled students at a local community college were surveyed for their opinions on and usage of college programs provided for their assistance. Survey findings indicated overall satisfaction with the services provided and common expectations within the learning-disabled student group.

Testing the Efficacy of Thinking in Chinese When Writing in English

The hypothesis of the investigator was that if Chinese students were to speak their ideas out loud in their native language before writing them down in English, then they would write better English compositions. An experiment was run where two groups of students, differing in their level of English command, were asked to write two pieces. One piece was a description of a particular event; the other piece was a personal letter. The written pieces were evaluated by an examiner using a 6-point scale. The amount of time the students spoke out loud in Chinese, and the number of times they switched back and forth between Chinese and English were observed during the process by the investigator. A polytomous regression model relating the score of the written pieces to the covariates was developed for each group of students. A comparison of the models fit for each group was made.

Efficacy of Web-based Curriculum for Teaching English Writing to Chinese Students

A web-based teaching curriculum for Chinese students learning to write in English was compared to the traditional classroom lecture format. Students were randomly assigned to two groups, one using the web-based method and one using the traditional method. Pre- and post-test writing scores on a standardized test were analyzed using analysis of covariance techniques. Correlations between the students’ improvement in writing ability and their satisfaction with the course were studied.

Comparison of Two Methods for Teaching English as a Second Language

Two alternative approaches for teaching English as a second language to prospective UCR graduate student teaching assistants were compared. One of the approaches was used by the UCR Learning Center, the other by the UCR Extension Center. Groups of approximately 30 students who had not yet achieved a clear pass on the SPEAK test were assigned to each Center during the Fall 2003 quarter. Gender, area of study, and level of previous experience with English were used to balance the assignment of the students to the two Centers. Each student’s most recent SPEAK test score prior to the beginning of the instructional period was used as a PRE score. At the end of the quarter, each student was given the SPEAK test again, and that score was used as a POST score. The Learning Center did all of the testing and scoring for this experiment. When scoring the POST tests, the identity of the student was withheld from the raters to avoid any unconscious bias. The data from this experiment were analyzed using analysis of covariance, treating the PRE score as the covariate and the POST score as the dependent variable.

Fitting a Non-Linear Model Relating Concentration to Reaction Time

Non-linear regression was used to fit an expected mean value function relating reaction time to concentration to sample data from several different experiments. Starting values were obtained from an analysis of the general shape of the non-linear function. Confidence intervals for b, a parameter of the fitted function, were provided as the main quantity of interest to the investigator.

Analysis of Wasp Preferences Pertaining to the Age of Scale Eggs

Wasps are a natural enemy of the ‘scale’, using the scale’s egg masses as a food source for their own eggs. This experiment studied whether or not the age of the scale eggs had any influence on the preference of the wasp to choose them. The experimental apparatus was a Y-shaped tube with scale eggs of a given age on one side and no scale eggs on the other. Wasps were inserted into the tube and their choice of the left or right arm of the Y was recorded. The proportion of wasps that chose to go toward the scale eggs was recorded, and a test of whether the underlying probability that a wasp chooses the scale eggs depends on age was conducted. The duration of time the wasps spent with the scale eggs was also compared across age groups.

Quantifying the Effect of Preventive Maintenance

The reliability of a component can be improved by employing preventive maintenance, provided the hazard function of the component is increasing. A tutorial on how to quantify the improvement for age-preventive maintenance and periodic-preventive maintenance was provided to the client. It was shown how an optimal replacement time can be derived if the ratio of the costs of planned to unplanned replacements is known. Extensions to parallel and series systems were covered.
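The role of the cost ratio can be sketched with the standard age-replacement cost rate, C(T) = [c_p R(T) + c_u (1 - R(T))] / ∫_0^T R(t) dt, where R is the survival function, c_p is the planned replacement cost and c_u the unplanned one. The example below assumes a Weibull lifetime and uses a simple grid search; the distribution, parameter values and function names are illustrative assumptions, not details taken from the tutorial.

```python
from math import exp

def reliability(t, shape, scale):
    """Weibull survival function R(t) = exp(-(t/scale)^shape) (assumed model)."""
    return exp(-((t / scale) ** shape))

def cost_rate(T, shape, scale, cost_ratio, steps=500):
    """Long-run cost per unit time when replacing at age T or at failure.
    cost_ratio = planned cost / unplanned cost (unplanned cost scaled to 1)."""
    h = T / steps
    # Trapezoidal rule for the expected cycle length, the integral of R on [0, T].
    area = 0.5 * (1.0 + reliability(T, shape, scale))
    area += sum(reliability(i * h, shape, scale) for i in range(1, steps))
    expected_cycle = area * h
    r_T = reliability(T, shape, scale)
    expected_cost = cost_ratio * r_T + (1.0 - r_T)
    return expected_cost / expected_cycle

def optimal_age(shape, scale, cost_ratio, grid=400):
    """Grid search over replacement ages up to three scale units."""
    candidates = [3.0 * scale * k / grid for k in range(1, grid + 1)]
    return min(candidates, key=lambda T: cost_rate(T, shape, scale, cost_ratio))

# Hypothetical component: increasing hazard (shape > 1), planned cost 20% of unplanned.
best_T = optimal_age(shape=2.5, scale=100.0, cost_ratio=0.2)
```

With a constant hazard (shape = 1) the search returns the largest age on the grid, which matches the statement that preventive replacement only helps when the hazard is increasing.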

Effectiveness of Plant Virus Diagnostics in the Presence of Multiple Viruses

Well-known host inoculation methods to detect the presence of particular viruses in particular plant types were examined to test their robustness when the plant simultaneously had a second, potentially masking, virus. Trials were run to estimate the probability that the diagnostic continues to successfully detect the target virus in the presence of the second virus. Confidence intervals for the true probability of detection were constructed using normal intervals, Clopper-Pearson intervals and Agresti-Coull intervals.
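The three interval types can be compared with a short standard-library sketch. The counts are hypothetical, and the Clopper-Pearson limits are found here by bisection on the binomial tail probabilities rather than through beta quantiles, purely to keep the example dependency-free.

```python
from math import comb, sqrt
from statistics import NormalDist

def wald_interval(x, n, conf=0.95):
    """Normal-approximation interval, clipped to [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def agresti_coull_interval(x, n, conf=0.95):
    """Wald-type interval centred on the adjusted estimate (x + z^2/2)/(n + z^2)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n_adj = n + z * z
    p_adj = (x + z * z / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

def _binom_cdf(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(f, lo=0.0, hi=1.0, iters=60):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson_interval(x, n, conf=0.95):
    """Exact interval: lower limit solves P(X >= x) = alpha/2,
    upper limit solves P(X <= x) = alpha/2."""
    alpha = 1 - conf
    lower = 0.0 if x == 0 else _bisect(
        lambda p: alpha / 2 - (1 - _binom_cdf(x - 1, n, p)))
    upper = 1.0 if x == n else _bisect(
        lambda p: _binom_cdf(x, n, p) - alpha / 2)
    return lower, upper

# Hypothetical trial: the target virus was detected in 18 of 20 co-infected plants.
for method in (wald_interval, agresti_coull_interval, clopper_pearson_interval):
    print(method.__name__, method(18, 20))
```

For detection probabilities near 1 the normal interval can run past 1 and needs clipping, which is one practical reason to look at the other two.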

Sample Size Calculation for Estimation of Binomial Success Parameter

Alternative methods for determining the sample size needed to deliver the required precision in the estimate of a binomial success parameter were reviewed with the client. Chebyshev, Chernoff, and Normal-theory bounds were examined.
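To show how much the three approaches can differ, the sketch below computes the sample size n needed so that the estimate is within eps of the true success probability with probability at least 1 - delta. The formulas are the textbook ones; Hoeffding's inequality is used as the Chernoff-type bound, which is an assumption about the form reviewed with the client, and the function names are illustrative.

```python
from math import ceil, log
from statistics import NormalDist

def n_chebyshev(eps, delta, p=0.5):
    """Chebyshev: Var(p_hat) / eps^2 <= delta, with p = 0.5 as the worst case."""
    return ceil(p * (1 - p) / (delta * eps**2))

def n_hoeffding(eps, delta):
    """Chernoff-type (Hoeffding) bound: 2 * exp(-2 * n * eps^2) <= delta."""
    return ceil(log(2 / delta) / (2 * eps**2))

def n_normal(eps, delta, p=0.5):
    """Normal theory: z * sqrt(p(1-p)/n) <= eps, with z the 1 - delta/2 quantile."""
    z = NormalDist().inv_cdf(1 - delta / 2)
    return ceil(z**2 * p * (1 - p) / eps**2)

# Example requirement: within 0.05 of the true value with 95% probability.
sizes = {f.__name__: f(0.05, 0.05) for f in (n_chebyshev, n_hoeffding, n_normal)}
```

The distribution-free bounds guarantee the stated precision for any n but ask for considerably more data than the normal approximation.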

Queuing Analysis for Student Recreational Center

The waiting time distribution for users of treadmills and elliptical trainers was estimated. The distribution was derived from consideration of the group of machines as a queuing system. Hourly data was collected on the number of machines that were busy, and also on the duration each user utilized the machine. The data was used to estimate the parameters of the waiting time distributions. Comparisons of the waiting time at different times of the day were made.
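One common way to turn such data into a waiting time distribution is the M/M/c queue, in which the probability of having to wait is given by the Erlang C formula and the conditional wait is exponential. The sketch below assumes Poisson arrivals and exponential usage times; the project description does not say which queuing model was fitted, and the rates shown are hypothetical.

```python
from math import exp, factorial

def erlang_c(servers, arrival_rate, service_rate):
    """Probability that an arriving user must wait in an M/M/c queue."""
    load = arrival_rate / service_rate          # offered load in Erlangs
    rho = load / servers                        # utilization per machine
    if rho >= 1:
        raise ValueError("queue is unstable: arrival rate exceeds capacity")
    tail = load**servers / factorial(servers) / (1 - rho)
    head = sum(load**k / factorial(k) for k in range(servers))
    return tail / (head + tail)

def prob_wait_exceeds(t, servers, arrival_rate, service_rate):
    """P(wait > t) = ErlangC * exp(-(c * mu - lambda) * t)."""
    c_prob = erlang_c(servers, arrival_rate, service_rate)
    return c_prob * exp(-(servers * service_rate - arrival_rate) * t)

def mean_wait(servers, arrival_rate, service_rate):
    """Expected time in queue before a machine becomes free."""
    c_prob = erlang_c(servers, arrival_rate, service_rate)
    return c_prob / (servers * service_rate - arrival_rate)

# Hypothetical peak hour: 10 treadmills, 18 arrivals per hour, 30-minute sessions.
peak_wait_hours = mean_wait(servers=10, arrival_rate=18.0, service_rate=2.0)
```

Re-running the calculation with arrival rates estimated for different hours gives the kind of time-of-day comparison described above.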

Calibrating Two Measuring Devices

Methods for comparing the accuracy and precision of two different measuring devices were examined. A regression approach to the problem was compared to the use of a concordance correlation approach.
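For the concordance side of the comparison, a minimal sketch of Lin's concordance correlation coefficient follows. It is assumed here that this is the coefficient meant; the readings are made up for illustration.

```python
from statistics import fmean

def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for paired readings.
    Unlike Pearson's r, it is reduced by a systematic shift between devices."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)

# Hypothetical readings: device B reads exactly one unit higher than device A.
device_a = [1.0, 2.0, 3.0, 4.0]
device_b = [2.0, 3.0, 4.0, 5.0]
ccc = concordance_correlation(device_a, device_b)
```

In this example the two devices are perfectly correlated, yet the coefficient is about 0.71 because of the constant offset. A regression of one device on the other would describe the same offset as a nonzero intercept.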

Assessing Inter-Rater Reliability

Two different raters, A and B, evaluated n different objects and placed each object into one of k categories. Different measures of agreement were used to assess how different the raters were in their classification approach: the overall proportion of agreement, the proportion of specific agreement, and the Cohen’s kappa variations of these two measures.
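A minimal sketch of these measures for two raters is given below. It computes the overall proportion of agreement, the proportion of specific agreement for each category, and the overall Cohen's kappa; category-specific kappa values are not shown, and the ratings are hypothetical.

```python
from collections import Counter

def agreement_measures(ratings_a, ratings_b):
    """Return (overall agreement, specific agreement per category, Cohen's kappa)."""
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    both = Counter(a for a, b in zip(ratings_a, ratings_b) if a == b)
    p_obs = sum(both.values()) / n
    # Agreement expected by chance if the raters classified independently.
    p_exp = sum(count_a[c] * count_b[c] for c in categories) / n**2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    # Specific agreement: 2 * (joint count) / (sum of the two marginal counts).
    specific = {c: 2 * both[c] / (count_a[c] + count_b[c]) for c in categories}
    return p_obs, specific, kappa

# Hypothetical ratings of 10 objects into k = 2 categories.
rater_a = ["yes"] * 6 + ["no"] * 4
rater_b = ["yes", "yes", "yes", "yes", "no", "no", "yes", "no", "no", "no"]
p_obs, specific, kappa = agreement_measures(rater_a, rater_b)
```

Here the raters agree on 70% of the objects, but half of that agreement is expected by chance, so kappa is only 0.4.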

Examination of Dating Preferences

Analysis of variance techniques were used to ascertain how previous relationship history impacts the attractiveness of a potential dating partner. Experience with physical and emotional connections with members of the opposite sex was examined with respect to whether it enhances or mutes the attractiveness of someone for long-term dating and/or casual dating.

Testing the Equality of LD50 Values in Heterogeneous Bioassay Models

Logistic and probit regressions are the classic models for analyzing data from bioassay experiments. Three different types of models are usually tested for adequacy. The common line model has identical intercepts and slopes for all the compounds being studied. The parallel assay has a common slope, but different intercepts. The heterogeneous assay allows both the intercepts and the slopes to vary across the different compounds. A major inference question is which of the alternative models best describes the data. A frequently used summary measure of the fitted model is the set of LD50 values which represent the concentration levels at which 50% of the population is expected to be killed. Within the context of a parallel assay, a hypothesis test of equal LD50 values is equivalent to comparing the fit of the model to a common line model, and this hypothesis test is implemented in most software packages (including SAS and Minitab) that provide routines for analyzing bioassay experiments. Testing the equality of LD50 values in the context of a heterogeneous assay is not a standard test in the software packages. We develop a likelihood ratio test for this hypothesis and implement an algorithm to evaluate the test in an R program.
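One way to write down the hypothesis, sketched here for the logistic case with k compounds, is to reparameterize each dose-response line by its LD50. The notation is an assumption made for illustration (x is the dose on whatever scale the model uses, typically log concentration) and is not taken from the R program.

```latex
\begin{aligned}
\text{Heterogeneous model:}\quad
  & \operatorname{logit} p_j(x) = \alpha_j + \beta_j x, \qquad j = 1, \dots, k, \\
  & \mathrm{LD50}_j = -\alpha_j / \beta_j , \\
H_0:\quad
  & \mathrm{LD50}_1 = \cdots = \mathrm{LD50}_k = \mu
    \;\Longleftrightarrow\;
    \operatorname{logit} p_j(x) = \beta_j \, (x - \mu), \\
\text{Test statistic:}\quad
  & \Lambda = 2 \left\{ \ell(\hat\alpha, \hat\beta) - \ell_0(\hat\mu, \tilde\beta) \right\}
    \;\overset{\text{approx.}}{\sim}\; \chi^2_{k-1} \quad \text{under } H_0 .
\end{aligned}
```

The unrestricted model has 2k parameters and the restricted model has k + 1, which gives the k - 1 degrees of freedom. The restricted model is nonlinear in its parameters, so it has to be fitted by direct maximization of the likelihood rather than by a standard logistic regression routine.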

Evaluation of interrelationships between economic variables and foreign aid funding sources in developing countries

Several international agencies provide economic development funds around the world with many of the grants tied to environmental projects. The object of this project was to help identify initial steps in exploratory data analysis for a data set comprised of economic aid funding levels for several fund sources as well as economic growth measures. Log transforms were found to linearize the relationship between two major funds and overall economic growth. In addition, the data were broken down by region and the influence of outlier observations was evaluated within each region.

California Air Resources Board Particulate Matter Literature Review

In this project, the College of Engineering, Center for Environmental Research and Technology (CE-CERT) is compiling a literature database on particulate matter measurement with a focus on diesel vehicle emissions. The Collaboratory provided library search assistance with emphasis on evaluating the bias and precision of measurement instrumentation and methodologies. In addition, the Collaboratory provided assistance in evaluating the statistical adequacy of the sample sizes of the studies under review.

US Environmental Protection Agency Development of a Mobile Emissions Model for Heavy-Duty Diesel Vehicles

Modeling efforts of second-by-second emissions of heavy-duty diesel vehicles are underway at several institutions across the US. In this project the Collaboratory is assisting CE-CERT in statistical analysis of on-road HDD emissions data as well as model validation on intermediate variables such as RPM and fuel use along with overall validation of the model on independent emissions data.

US Environmental Protection Agency Development of a Mobile Emissions Model for Ultra Low Emission Vehicles

Ultra low emission vehicles produce unique problems both in terms of emissions measurement and emissions modeling. CE-CERT is collecting on-road emissions data for Ultra Low Emission Vehicles and Super Ultra Low Emission Vehicles as part of its Study of Extremely Low Emitting Vehicles (SELEV) program. In this project the Collaboratory is providing data analysis in support of the EPA modeling efforts for these low emitting vehicles. Sample size estimates for model development, as well as functional forms for relating emissions to driving behavior, were developed as part of this project. The Collaboratory will also assist in validating the model as it is developed.

Study of Extremely Low Emitting Vehicles

The SELEV project is now in its fourth year. In this phase of the project the on-road data collection efforts will be evaluated for bias and precision. In addition, on-road data are being used to estimate degradation rates of emissions control systems on ULEV and SULEV vehicles. Because of the expense of the vehicle testing program, current data are being evaluated in order to estimate on-road variability both within vehicles and between vehicles within the ULEV and SULEV vehicle types. The current sampling plan will be evaluated and modified as necessary.

Estimating Mobile Source Emissions in Three National Parks

Accurate estimation of mobile source emissions within National Parks in the United States requires accurate vehicle fleet data as well as accurate driving behavior data. This is due to the non-standard vehicle fleets and driving that are typically found within National Parks. The vehicles within National Parks tend to be much newer and include a higher proportion of recreational vehicles and vehicles towing trailers than the typical urban vehicle fleet. In addition, the driving within the parks is frequently far different from standard city driving. In this project the Collaboratory is assisting CE-CERT in the analysis and modeling of vehicle fleet data and driving behavior data.

An analysis of student surveys from two community colleges regarding factors affecting student retention

Students at two community colleges were surveyed about various factors and their influence on the students’ decisions to continue in college. Instructor language skills were a primary variable of interest, with student personal issues such as child care, work schedule, cost, etc., of secondary interest. Significant differences between the schools were found on two individual questions. However, a principal components analysis of the student responses showed far greater student-to-student variability than school-to-school differences on the overall responses.

Estimation of Brownian Motion With Drift

A computer science data streaming application had been modeled using Brownian motion with a drift parameter. It had been proposed to estimate the drift parameter and the scale parameter using equally spaced samples. The Collaboratory helped the researchers respond to a journal referee's question about how the parameters could be estimated if the sampling epochs were not equally spaced. We showed how maximum likelihood and weighted least squares methods could be used, and that these in fact reduce to the previously proposed method in the case of equally spaced sampling points.
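The idea can be sketched as follows. For Brownian motion with drift mu and variance parameter sigma^2, the increment over a gap of length dt is normal with mean mu*dt and variance sigma^2*dt, so the likelihood factors over increments whatever the spacing. The code below is an illustrative reconstruction under that model, not the researchers' implementation, and the function name is hypothetical.

```python
def estimate_drift_and_scale(times, values):
    """Maximum likelihood estimates (mu, sigma^2) from observations of
    Brownian motion with drift at arbitrary, increasing sampling times."""
    dts = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    dxs = [x1 - x0 for x0, x1 in zip(values, values[1:])]
    n = len(dts)
    # Drift: total displacement over total elapsed time. This is also the
    # weighted least squares slope of dx on dt with weights 1/dt.
    mu = sum(dxs) / sum(dts)
    # Variance parameter: mean squared residual, each scaled by its own gap.
    sigma2 = sum((dx - mu * dt) ** 2 / dt for dx, dt in zip(dxs, dts)) / n
    return mu, sigma2

# Hypothetical irregularly sampled path.
mu_hat, sigma2_hat = estimate_drift_and_scale(
    times=[0.0, 1.0, 3.0, 6.0], values=[0.0, 1.5, 6.5, 12.0])
```

When every gap equals a common value d, these expressions become the mean increment divided by d and the mean squared deviation of the increments divided by d, which is the equally spaced case.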

Tolerance Intervals in the Context of Linear Regression

An analysis by the California EPA that computed a threshold on the amount of methyl bromide (a pesticide) that farmers could use without causing excess concentrations in ambient air was critically reviewed. The analysis used tolerance interval computations to arrive at the threshold. Our feedback to the client addressed the appropriateness of tolerance intervals for that context and also several statistical issues underlying the application that was described in the report.


More Information

General Campus Information

University of California, Riverside
900 University Ave.
Riverside, CA 92521
Tel: (951) 827-1012

Collaboratory Information

Statistical Consulting Collaboratory
1438 Olmsted Hall

Tel: (951) 827-6062
Fax: (951) 827-3286
E-mail: karen.xu@ucr.edu