Week 3

WHAT HAVE WE DONE THIS WEEK?

At the beginning of this week we worked on getting our prototype ready for our first meeting with Schurter in Santa Rosa. We tested our software with a file of 950 tuples to verify that it works and to be able to deliver some example data to Schurter. In addition, we worked on our exception handling.

Together with Prof. Marfurt we drove up to Schurter's office in Santa Rosa on Wednesday, September 30th. In the meeting we presented the prototype and discussed our next steps. These were the main findings of our meeting:

  • Our application meets their expectations

  • Switch the search engine from Bing to Google

  • Do not integrate LinkedIn, because its industry codes are outdated

  • Deliver an additional proof of concept for further categorization based on the data our application will deliver

  • Further improve our search logic

  • Improve the performance of the data-gathering process

Back at Santa Clara we started adapting our project to the new requirements.

WHAT ARE WE GOING TO DO NEXT?

As mentioned above, we will switch the search engine from Bing to Google. To integrate Google into our software, we first have to get familiar with the Google Search API.
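
A minimal sketch of how such a request could look, assuming we use the Google Custom Search JSON API from Python with the requests library; the API key and engine ID below are placeholders:

    import requests

    API_KEY = "YOUR_API_KEY"      # placeholder; issued via the Google Developers Console
    ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder; ID of a configured custom search engine

    def google_search(query):
        # Ask the Custom Search JSON API for results matching the query
        response = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": ENGINE_ID, "q": query},
        )
        response.raise_for_status()
        # Each item carries a link, title and snippet we can analyze further
        return response.json().get("items", [])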

Furthermore, we have to find a way to improve our search logic. One approach would be to implement a ranking system for each result we get from the search engines. It is important that this result analyzer is independent of which search engine we use, so that the results improve and Schurter gets better data regardless of the engine. Another approach is to introduce a "blacklist" that ignores unhelpful URLs such as Facebook or Wikipedia pages.
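
To make the idea concrete, here is a rough sketch of such an engine-independent analyzer; the scoring rules and blacklist entries are purely illustrative, not a final design:

    from urllib.parse import urlparse

    # "Blacklist" of domains that rarely identify the client's own website
    BLACKLIST = {"facebook.com", "wikipedia.org", "twitter.com"}

    def is_blacklisted(url):
        host = urlparse(url).netloc.lower()
        return any(host == d or host.endswith("." + d) for d in BLACKLIST)

    def score(result, company):
        # Illustrative ranking: reward results that mention the company name
        points = 0
        if company.lower() in result["title"].lower():
            points += 2
        if company.lower().replace(" ", "") in result["url"].lower():
            points += 3
        return points

    def rank(results, company):
        # Works on plain url/title pairs, so it does not matter whether the
        # results came from Bing or Google
        kept = [r for r in results if not is_blacklisted(r["url"])]
        return sorted(kept, key=lambda r: score(r, company), reverse=True)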

In addition, we will have to come up with a proof of concept for a categorization approach. After our software has gathered all meta tags from the clients' URLs, Schurter would like to categorize the clients into different industries; as an ultimate goal, they would like to know which products are built with the components the retailers sell to them. At the meeting we concluded that extracting such fine-grained information is exceptionally hard, since it is uncertain whether it appears in the meta data at all.
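
A very simple starting point for this proof of concept could be keyword matching on the gathered meta tags; the industries and keywords below are made up for illustration and would have to come from Schurter:

    # Illustrative industry keywords; the real categories are Schurter's to define
    INDUSTRIES = {
        "medical":    ["medical", "hospital", "diagnostic"],
        "automotive": ["automotive", "vehicle", "car"],
        "aerospace":  ["aerospace", "avionics", "aircraft"],
    }

    def categorize(meta_text):
        # Count keyword hits per industry in the client's meta tags
        text = meta_text.lower()
        hits = {name: sum(text.count(kw) for kw in kws)
                for name, kws in INDUSTRIES.items()}
        best = max(hits, key=hits.get)
        return best if hits[best] > 0 else "unknown"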

At the moment the lookup of the URLs is not fast enough: processing the whole file of 53,000 tuples would take more than 22 hours, which corresponds to roughly 1.5 seconds per tuple. So we have to come up with a faster way to process the data. After some research we decided on a multithreaded approach, since the lookups spend most of their time waiting on network I/O. We will find out next week whether this approach works.
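
A minimal sketch of the multithreaded attempt, assuming a lookup() function that performs one network-bound request per tuple; since the work is I/O-bound, a pool of, say, 20 threads could in theory cut the 22 hours down to roughly an hour:

    from concurrent.futures import ThreadPoolExecutor

    def lookup(record):
        # Placeholder for one lookup (search request + meta tag fetch);
        # in our software this is where the real work would happen
        raise NotImplementedError

    def process_all(records, workers=20):
        # Threads overlap the network waits, so throughput scales with the
        # pool size until bandwidth or API rate limits become the bottleneck
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lookup, records))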

SCHURTER Meeting 30.9.15.pptx