Log data from DNS resolvers contains rich information that is useful for a variety of research purposes, such as estimating the popularity of websites. Log data from approximately 39k resolvers has been collected and stored on HDFS. The data is so large, and so poorly structured, that searching the logs for records of interest takes a great deal of time and resources. In this project, we investigate techniques for porting the log data to a new format that reduces query time and requires fewer resources both to store and to query the data. We investigated bzip compression, reformatting the records and pruning nonessential ones, and partitioning the records into separate buckets. In our experiments, a combination of reformatting and pruning with partitioning, together with sorting the records on multiple fields, made domain queries about 6 times faster and used approximately 8 times fewer resources per query than the unstructured data. The new data also takes about 9 times less space on disk.
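The combination of pruning, partitioning, and multi-field sorting described above can be illustrated with a small in-memory sketch. This is not the implementation used in the report: the field names (`domain`, `timestamp`, `resolver`, `ttl`), the bucket count, the choice of a CRC32 hash for partitioning, and the function names are all assumptions made for illustration only.

```python
import bisect
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch


def bucket_of(domain):
    # Stable hash, so a given domain always maps to the same bucket.
    return zlib.crc32(domain.encode("utf-8")) % NUM_BUCKETS


def build_index(raw_records):
    """Prune each record to a few fields, partition by domain, sort each bucket."""
    buckets = defaultdict(list)
    for rec in raw_records:
        # Pruning: keep only the (assumed) essential fields and drop the rest.
        row = (rec["domain"], rec["timestamp"], rec["resolver"])
        buckets[bucket_of(rec["domain"])].append(row)
    for rows in buckets.values():
        # Tuple ordering sorts by domain first, then timestamp, then resolver.
        rows.sort()
    return buckets


def query_domain(buckets, domain):
    """Return all rows for a domain by searching only that domain's bucket."""
    rows = buckets.get(bucket_of(domain), [])
    # Binary search for the first row of this domain; (domain,) sorts
    # before every longer tuple that starts with the same domain.
    start = bisect.bisect_left(rows, (domain,))
    end = start
    while end < len(rows) and rows[end][0] == domain:
        end += 1
    return rows[start:end]


sample = [
    {"domain": "example.com", "timestamp": 3, "resolver": "r1", "ttl": 60},
    {"domain": "ucsd.edu", "timestamp": 1, "resolver": "r2", "ttl": 60},
    {"domain": "example.com", "timestamp": 1, "resolver": "r2", "ttl": 30},
]
index = build_index(sample)
matches = query_domain(index, "example.com")
```

A domain query in this sketch touches one bucket rather than the whole data set, and the sort order lets it locate the first matching row by binary search. The same idea would presumably carry over to files on HDFS, with one sorted file or directory per bucket, although the report's actual layout is not specified here.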
The authors of these documents have submitted their reports to this technical report series for the purpose of non-commercial dissemination of scientific work. The reports are copyrighted by the authors, and their existence in electronic format does not imply that the authors have relinquished any rights. You may copy a report for scholarly, non-commercial purposes, such as research or instruction, provided that you agree to respect the author's copyright. For information concerning the use of this document for other than research or instructional purposes, contact the authors. Other information concerning this technical report series can be obtained from the Computer Science and Engineering Department at the University of California at San Diego, email@example.com.