Speed dating dataset
I describe the contents of each folder below. This procedure includes various checks, imputations, and type changes…
This is because the study was conducted at Columbia Business School, so the conclusions of the analysis are based on that specific sample.
Sometimes, they did not add up the numbers correctly.
Plot of the "go out" distribution (ggplot). All variables were also run in GGally.
Demographic Data
Next we went into the demographic outlook of the people in our data set.
We ultimately found four models that showed a significant p-value for linear regression.
The speed dates lasted four minutes each. Participants didn't attend multiple events, so rather than using information about previous events to predict what will happen at an event, we predict participants' decisions on a given date based on information about other dates at the same event.
We decided to look into age, gender and race.
R - Creates average ratings for each participant on a date.
According to our DQR, there is one missing id in our dataset. Prior to the date.
The folder contains some visualizations that are helpful in understanding the dataset, but it is currently very incomplete. In addition, a questionnaire provided key personal data. When rating attributes, participants were asked to distribute points across six variables.
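Since participants' point allocations did not always add up correctly, one common cleaning step is to rescale each allocation to a fixed total. The following is a minimal Python sketch of that idea, not the cleaning code used for this analysis. The total of 100 points, the function name `rescale_points`, and the attribute labels are all assumptions made for illustration.

```python
def rescale_points(allocation, total=100.0):
    """Rescale one participant's point allocation so that it sums to `total`.

    `allocation` maps an attribute label to the points the participant gave it.
    The default total of 100 is an assumption, not taken from the write-up.
    Returns None when the allocation sums to zero and cannot be rescaled.
    """
    points_sum = sum(allocation.values())
    if points_sum <= 0:
        return None
    return {name: value * total / points_sum for name, value in allocation.items()}


# Hypothetical participant whose six allocations add up to 110 instead of 100.
raw = {
    "attribute_1": 30,
    "attribute_2": 20,
    "attribute_3": 20,
    "attribute_4": 20,
    "attribute_5": 10,
    "attribute_6": 10,
}
rescaled = rescale_points(raw)
```

Rescaling keeps the relative weights a participant expressed while making allocations comparable across participants. Rows that sum to zero are returned as `None` so they can be handled separately rather than silently imputed.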
Whilst this clearly means that the assumption of homoscedasticity does not fully hold, the violation is not severe enough to undermine the entire linear regression model. It is a great help in understanding and cleaning the data.
I instead used logistic regression.
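To illustrate what fitting a logistic regression to yes/no decisions involves, here is a small self-contained Python sketch using plain gradient descent. It is not the model from this project: the single feature (a centered rating), the toy data, and the names `fit_logistic`, `ratings` and `decisions` are hypothetical.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data (hypothetical): a rating from 2 to 9 and a 0/1 decision for each date.
ratings = [2, 3, 4, 5, 6, 7, 8, 9]
decisions = [0, 0, 0, 1, 0, 1, 1, 1]

# Centering the feature keeps gradient descent well conditioned.
mean_rating = sum(ratings) / len(ratings)
centered = [r - mean_rating for r in ratings]

w, b = fit_logistic(centered, decisions)
p_low = sigmoid(w * (2 - mean_rating) + b)   # predicted probability of "yes" at rating 2
p_high = sigmoid(w * (9 - mean_rating) + b)  # predicted probability of "yes" at rating 9
```

Unlike linear regression, the fitted values here are probabilities, which suits a binary outcome such as whether a participant wants to exchange contact information.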
I still need to clean this up. The other dates that we used usually included dates that in fact happened after the date for which we predicted decisions, so there are a priori reasons to question whether the results would generalize to situations where one was trying to predict the future exclusively based on knowledge of the past.
Follow-up survey taken the day after the date.
One of the problems the data set has is that it is biased. My core intent was to predict when two speed dating participants will express interest in exchanging contact information with one another, assuming that one had data on them from previous similar events. In fact, simply excluding these ratings introduces structure in the dataset that results in the features being contaminated with the participants' ratings of each other in a subtle way, and we introduce a stochastic element to eliminate the contamination.
Background
We also wanted to know more about the background of the people in our dataset.
This contains code that I used to select features.
Descriptive Analysis
How many unique careers do we have?
The next step was to build and experiment with numerous linear regression models, first looking at each independent variable alone and then building up combinations of these variables.
A person also has a unique identifier within the wave. Impossible to impute that!
Whilst the data set itself is quite large, it is worth mentioning that it has a high likelihood of being biased, as all participants were Columbia Business School students, and thus any conclusions from this project may not necessarily be generalizable.
R - Uses collaborative filtering to generate guesses for participants' ratings of each other by excluding one date from an event at a time. The methodology is largely implicit in the code, which is coherently organized, up to some minor changes that I still need to make.
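The "excluding one date at a time" idea can be illustrated with a much simpler stand-in than collaborative filtering: a leave-one-out mean of the ratings a participant received on their other dates at the same event. The Python sketch below shows only that simplified variant. The function name `leave_one_out_means`, the tuple layout, and the toy event are assumptions, and the sketch omits the stochastic element described above for removing contamination.

```python
from collections import defaultdict


def leave_one_out_means(dates):
    """For each date, average the ratings the ratee received on their OTHER dates.

    `dates` is a list of (rater, ratee, rating) tuples for a single event.
    Returns a dict keyed by (rater, ratee); the value is None when the ratee
    has no other dates at the event to average over.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, ratee, rating in dates:
        totals[ratee] += rating
        counts[ratee] += 1

    result = {}
    for rater, ratee, rating in dates:
        n_other = counts[ratee] - 1
        if n_other == 0:
            result[(rater, ratee)] = None
        else:
            # Subtract this date's own rating so it cannot leak into its feature.
            result[(rater, ratee)] = (totals[ratee] - rating) / n_other
    return result


# Hypothetical event: three raters meet w1, and one rater meets w2.
event = [("m1", "w1", 8), ("m2", "w1", 6), ("m3", "w1", 7), ("m1", "w2", 5)]
loo = leave_one_out_means(event)
```

In this toy event, the feature for the date between `m1` and `w1` is built only from the ratings `m2` and `m3` gave `w1`, so the rating being predicted never enters its own feature directly.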
General Data
We first took a quick look into some of the general data we have.
It is interesting to notice that some people wanted to meet their partner again even if there was no match between them. The surveys in which participants reflected on themselves and their partners were conducted at 3 different points in time.
Whilst we must be clear that the current linear regression model violates assumption 3, we will continue with the results regardless until the model can be improved.
…R", but I think that the algorithm is suboptimal to a significant degree, even amongst polynomial-time algorithms. To the extent that it works, its effectiveness may derive from the fact that, for a given event, the probabilities that the model generates reflect information about what happened on other dates at the event.