Fine-Tuning Gold Questions in Crowdsourcing Tasks using Probabilistic and Siamese Neural Network Models



Published Jun 25, 2019
  • José María González Pinto
  • Kinda El Maarry
  • Wolf-Tilo Balke


The economic benefits of crowdsourcing have furthered its widespread use over the past decade. However, increasing numbers of fraudulent workers threaten to undermine the emerging crowdsourcing economy: requestors face the choice of either risking low-quality results or paying extra for quality safeguards such as gold questions or majority voting. The more safeguards are injected into the workload, the lower the risks imposed by fraudulent workers, yet the higher the costs. So, how many of them are actually needed? Is there a generally applicable number or percentage? This paper uses deep learning techniques to identify custom-tailored numbers of gold questions per worker for individually managing the cost/quality balance. Our new method follows real-life experience: the more we know about workers before assigning a task, the clearer our belief or disbelief in a worker's reliability becomes. Employing probabilistic models, namely Bayesian belief networks and certainty factor models, our method creates worker profiles reflecting different a-priori belief values, and we prove that the actual number of gold questions per worker can indeed be assessed. Our evaluation on real-world crowdsourcing datasets demonstrates our method's efficiency in saving money while maintaining high-quality results.

