Tuesday, July 20, 2010

What is the best way to index thousands of documents for use in an ASP.NET 2.0 web site?

I want to be able to add and delete documents (mainly PDFs) and allow my users to add documents through a members-based service. What is the best way to index these documents as they are uploaded or donated, or where they already exist in certain folders? Basically, I am looking for a program that indexes them into a SQL DB or XML file, so that I can then write code to search the indexed information and pull up the correct documents. The basic need is a tool that indexes the content, giving me a DB to query by keyword so I can show the documents most relevant to the search text. Is there a program or plug-in of some kind that already indexes PDFs etc. and places the results into a DB or XML format, so I can use this info to build my own custom search script and results layout, and users can open the docs if they find what they need?

Google and others do it! Use Google on your own server!
Reply: This is a broad topic.

The documents can be stored in several ways. One is to store them as binary data in MS SQL Server. This has advantages and disadvantages. The advantages are security and the transaction support that SQL Server offers. The disadvantage is that retrieval can be slow: SQL Server stores data in 8 KB pages, so a large document is spread across many pages and reading it back generates a lot of I/O.
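
To make the binary-in-the-database option concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MS SQL Server, and the table and function names (documents, save_document, load_document) are made up for illustration. The same pattern should carry over to a varbinary or image column read and written from ASP.NET.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a database with one table holding the raw file bytes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT NOT NULL, "
        "content BLOB NOT NULL)"
    )
    return conn


def save_document(conn, name, data):
    """Insert the document inside a transaction and return its new id."""
    # "with conn" commits on success and rolls back if an exception is raised,
    # which is the transaction support mentioned above.
    with conn:
        cur = conn.execute(
            "INSERT INTO documents (name, content) VALUES (?, ?)",
            (name, data),
        )
    return cur.lastrowid


def load_document(conn, doc_id):
    """Return the stored bytes, or None if the id is unknown."""
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return bytes(row[0]) if row else None
```

Every read pulls the whole file through the database engine, which is where the extra I/O for large documents comes from.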

Another approach is to store the documents on an FTP server and keep only references to them in the DB. Serving the files this way is usually much faster, since FTP servers are built specifically for file transfer.
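
As a rough sketch of the reference-in-the-DB approach, combined with the keyword search the question asks about: the database below holds only each file's location plus a word-count table, and a search ranks documents by how many query words they contain. It uses Python's built-in sqlite3 module rather than MS SQL Server. All table and function names are hypothetical, and it assumes the text has already been extracted from each PDF by some other tool.

```python
import re
import sqlite3


def open_index(path=":memory:"):
    """Create the two tables: file locations and per-document keyword counts."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY,
            location TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS keywords (
            word TEXT NOT NULL,
            doc_id INTEGER NOT NULL REFERENCES documents(id),
            hits INTEGER NOT NULL,
            PRIMARY KEY (word, doc_id)
        );
    """)
    return conn


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_document(conn, location, text):
    """Record where the file lives and how often each word occurs in it."""
    counts = {}
    for word in tokenize(text):
        counts[word] = counts.get(word, 0) + 1
    with conn:  # one transaction for the reference and its keywords
        cur = conn.execute(
            "INSERT INTO documents (location) VALUES (?)", (location,)
        )
        doc_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO keywords (word, doc_id, hits) VALUES (?, ?, ?)",
            [(word, doc_id, n) for word, n in counts.items()],
        )
    return doc_id


def remove_document(conn, location):
    """Drop a document's reference and keywords (the file itself is untouched)."""
    with conn:
        conn.execute(
            "DELETE FROM keywords WHERE doc_id IN "
            "(SELECT id FROM documents WHERE location = ?)",
            (location,),
        )
        conn.execute("DELETE FROM documents WHERE location = ?", (location,))


def search(conn, query):
    """Return locations ranked by matched query words, then by total hits."""
    words = sorted(set(tokenize(query)))
    if not words:
        return []
    marks = ",".join("?" * len(words))
    rows = conn.execute(
        "SELECT d.location FROM keywords k "
        "JOIN documents d ON d.id = k.doc_id "
        f"WHERE k.word IN ({marks}) "
        "GROUP BY d.id "
        "ORDER BY COUNT(*) DESC, SUM(k.hits) DESC, d.location",
        words,
    ).fetchall()
    return [row[0] for row in rows]
```

The web page then only needs the returned location to build a download link, so the database never has to serve the file contents.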

There are lots of products that offer document management and workflow management. You could also use one of those instead of trying to reinvent the wheel.

Hope this helps.
Reply: Microsoft Indexing Service, a service included with Windows Server 2003, would be the best way to go. Note that it needs a PDF IFilter installed before it can index the contents of PDF files.

